
1 Introduction

Fuzzy modeling is regarded as one of the possible classification architectures of machine learning and data mining. A significant number of studies have been devoted to generating fuzzy decision rules from sample cases or examples, including attempts to extend many classical machine learning methods to learn fuzzy rules. One very popular approach is decision trees [10]. Since the inception of this concept, it has been extended to the construction and interpretation of more advanced decision trees [3, 5, 7, 9, 13, 15, 18]. Although decision-tree-based methods can extract a set of fuzzy rules that works well, a problem is that the lack of backtracking when splitting a node leads to lower learning accuracy compared with other machine learning methods. Another widely used machine learning method is the artificial neural network. In recent years an enormous amount of work has been done in an attempt to combine the advantages of neural networks and fuzzy sets [14]. Hayashi [4] proposed extracting fuzzy rules from a trained neural network. Lin [8], on the other hand, introduced a method of generating fuzzy rules directly from a self-organizing neural network. The common weakness of neural networks, however, is the problem of determining the optimal size of the network configuration, as this has a significant impact on the effectiveness of their performance.

The objective of this paper is to employ the Dempster-Shafer theory (DST) as a vehicle supporting the generation of fuzzy decision rules. More specifically, we concentrate on the role of fuzzy operators, and on the problem of discretization of continuous attributes. We show how they can be effectively used in the quantization of attributes for the generation of fuzzy rules.

The material is arranged in the following way. First, we summarize the underlying concepts of the Dempster-Shafer theory and briefly discuss the nature of the underlying construction. By doing so, the intention is to make the paper self-contained and to help identify some outstanding design problems emerging therein. In Sect. 4 we explain the essentials of our model. Finally, in Sects. 5 and 6, we report the experimental studies.

This paper is a continuation of our earlier work [12]. Here we apply the theoretical vehicle introduced in that research to new input data in order to identify possible areas of application. Our main objective is to show how this approach supports a more comprehensive treatment of continuous attributes.

2 Dempster-Shafer Theory

The Dempster-Shafer theory starts by assuming a Universe of Discourse Θ, also called a Frame of Discernment, which is a finite set of mutually exclusive alternatives. The frame of discernment may consist of the possible values of an attribute. For example, if we are trying to determine the disease of a patient, we may take Θ to be the set of all possible diseases.

With each subset S of Θ the following are associated:

  • a basic probability assignment m(S)

  • a belief Bel(S)

  • a plausibility Pla(S)

m(S), Bel(S) and Pla(S) take values in the interval [0, 1], and Bel(S) is never greater than Pla(S).

In particular, m represents the strength of some evidence. For example, in a rule-based expert system, m may represent the effect of applying a rule. Bel(S) summarizes all our reasons to believe S. Pla(S) expresses how much we should believe in S if all currently unknown facts were to support S. Thus the true belief in S will lie somewhere in the interval [Bel(S), Pla(S)]. More formally, a map

$$ m: 2^{\Theta } \to \left[ {0, 1} \right] $$
(1)

such that (where \( 2^{\Theta } \) denotes the set of all subsets of \( \Theta \)):

  1. \( m(\emptyset ) = 0 \)

  2. \( \sum\limits_{{A \subseteq\Theta }} {m(A) = 1} \)

is called a basic probability assignment for \( \Theta \).

Subset A is called a focal element of m if m(A) > 0.

For a given basic probability assignment m, the Belief of a subset A of \( \Theta \) is the sum of m(B) for all subsets B of A, so

$$ Bel: 2^{\Theta } \to \left[ {0, 1} \right] $$
(2)

such that \( Bel(A) = \sum\limits_{B \subseteq A} {m(B)} \).

The Plausibility of a subset A of \( \Theta \) is defined as Pla(A) = 1 − Bel(A’), where A’ is the complement of A in \( \Theta \).
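To make these notions concrete, the following sketch (our illustration, not part of the original formulation; the frame of discernment and the mass values are hypothetical) computes Bel and Pla from a basic probability assignment represented as a dictionary over frozensets.

```python
# Sketch: Bel and Pla computed from a basic probability assignment m.
# The frame of discernment and the mass values below are hypothetical.
theta = frozenset({'flu', 'cold', 'allergy'})
m = {
    frozenset({'flu'}): 0.4,
    frozenset({'flu', 'cold'}): 0.3,
    theta: 0.3,                      # mass left on the whole frame (ignorance)
}

def bel(a):
    """Bel(A): sum of m(B) over all focal elements B contained in A."""
    return sum(v for b, v in m.items() if b <= a)

def pla(a):
    """Pla(A) = 1 - Bel(A'), i.e. the mass of all focal elements intersecting A."""
    return sum(v for b, v in m.items() if b & a)

a = frozenset({'flu'})
print(bel(a), pla(a))                # approx. 0.4 and 1.0: the true belief lies in [0.4, 1.0]
```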

If we are given two basic probability assignments m1 and m2, we can combine them into a third basic probability assignment \( m: 2^{\Theta } \to \left[ {0, 1} \right] \) in the following way.

Let us consider a frame of discernment \( \Theta \) and two belief functions Bel1 and Bel2. We denote the focal elements of Bel1 by A1, …, AK, the focal elements of Bel2 by B1, …, BL, and the corresponding basic probability assignments by m1 and m2. This combination can then be shown graphically as an orthogonal sum of m1 and m2.

The probability mass assigned to the intersection Ai ∩ Bj, equal to m1(Ai)·m2(Bj), is illustrated in Fig. 1.

Fig. 1. Orthogonal sum of m1 and m2

Of course, several intersections may yield the same focal element A. In general, the probability mass of a set A is defined as:

$$ m(A) = \sum\limits_{{\begin{array}{*{20}c} {i,j} \\ {A_{i} \cap B_{j} = A} \\ \end{array} }} {m_{1} (A_{i} )} \cdot m_{2} (B_{j} ) $$
(3)

A problem arises with the empty set. Pairs of focal elements with an empty intersection may exist, and this situation occurs in many combinations. In such a case the mass of ∅, according to the above definition, would be greater than zero, which contradicts the definition of a basic probability assignment.

We assume that

$$ \sum\limits_{{\begin{array}{*{20}c} {i,j} \\ {A_{i} \cap B_{j} = \emptyset } \\ \end{array} }} {m_{1} (A_{i} )} \cdot m_{2} (B_{j} ) < 1 $$
(4)

in order to define the orthogonal sum of m1 and m2, which we denote by \( m_{1} \oplus m_{2} \).

It is then necessary to modify definition (3) of the basic probability assignment of the combination as follows:

  1. \( m(\emptyset ) = 0 \)

  2. \( m(A) = \frac{{\sum\limits_{{\begin{array}{*{20}c} {i,j} \\ {A_{i} \cap B_{j} = A} \\ \end{array} }} {m_{1} (A_{i} ) \cdot m_{2} (B_{j} )} }}{{1 - \sum\limits_{{\begin{array}{*{20}c} {i,j} \\ {A_{i} \cap B_{j} = \emptyset } \\ \end{array} }} {m_{1} (A_{i} ) \cdot m_{2} (B_{j} )} }} \;{\text{for}}\;{\text{non-empty}}\;{\text{A}} \subset \Theta \)

We call this the orthogonal sum of Bel1 and Bel2 and denote it by Bel1 ⊕ Bel2.

This is the Dempster rule for combining beliefs [11].

Recall that any basic probability assignment on \( \Theta \) satisfies \( \sum\limits_{{A \subseteq\Theta }} {m(A)} = 1 \). For the combination, if

$$ \sum\limits_{{\begin{array}{*{20}c} {i,j} \\ {A_{i} \cap B_{j} = \emptyset } \\ \end{array} }} {m_{1} (A_{i} )} \cdot m_{2} (B_{j} ) = 0 $$

we have

$$ \sum\limits_{{A \subseteq\Theta }} {m(A)} = \sum\limits_{{A \subseteq\Theta }} {\sum\limits_{{\begin{array}{*{20}c} {i,j} \\ {A_{i} \cap B_{j} = A} \\ \end{array} }} {m_{1} (A_{i} )} \cdot m_{2} (B_{j} )} = \sum\limits_{i,j} {m_{1} (A_{i} )} \cdot m_{2} (B_{j} ) = 1 $$

so in this case no normalization is required.
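A minimal sketch of the combination rule defined above, with normalization by the conflict mass, is given below; the two input assignments are hypothetical.

```python
# Sketch: Dempster's rule of combination (orthogonal sum) with normalization
# by the conflict mass, following the definition above. Focal elements are
# frozensets; the two input assignments are hypothetical.
def combine(m1, m2):
    combined, conflict = {}, 0.0
    for a, va in m1.items():
        for b, vb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + va * vb
            else:
                conflict += va * vb              # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: the orthogonal sum is undefined")
    return {a: v / (1.0 - conflict) for a, v in combined.items()}

m1 = {frozenset({'flu'}): 0.6, frozenset({'flu', 'cold'}): 0.4}
m2 = {frozenset({'cold'}): 0.5, frozenset({'flu', 'cold'}): 0.5}
print(combine(m1, m2))   # masses renormalized by 1 - 0.3 of conflict
```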

3 Fuzzy Modelling

Fuzzy set theory is widely known, so we do not introduce the underlying concepts essential to understanding this framework. Interested readers are referred to [16] and [17].

Fuzzy modeling is applied in areas where the model of the system cannot be described precisely, for a variety of reasons. The input data received by the system may not be completely reliable, may contain noise, or may be inconsistent with other data or with expectations about these data. The system is described by a set of linguistic rules. Let D denote the output variable of the system, and let X1, X2, …, Xn denote the input variables. The linguistic rules have the following format:

$$ If\textit{(}X_{1} \;is\;A_{k,1,j1}\textit{)}\;And\; \ldots And\;\textit{(}X_{n} \;is\;A_{k,n,jn}\textit{)}\;Then\;\textit{(}D\;is\;S_{k,p}\textit{)} $$
(5)

where (Xi is Ak,i,ji) are the fuzzy antecedents, Ak,i,ji (1 ≤ ji ≤ |Ai|) are the linguistic values of the i-th input variable, and Sk,p (1 ≤ p ≤ |S|) is the value of the output variable in the k-th rule.

The rules are implemented as fuzzy relation according to the formula:

$$ R_{k} = A_{k,1,j_{1}} \times A_{k,2,j_{2}} \times \cdots \times A_{k,n,j_{n}} \times S_{k,p} $$
(6)

where × denotes the fuzzy Cartesian product.

Then all rules are aggregated to relation R described as:

$$ R = \bigcup\limits_{k = 1}^{M} {R_{k} } $$
(7)

where M is the number of rules.

The conclusion is based on the compositional rule of inference

$$ S' = \left( {A_{1,j1}^{'} \times A_{2,j2}^{'} \times \; \cdots \; \times A_{n,jn}^{'} } \right) \circ R $$
(8)

where \( A_{1,j1}^{'} ,A_{2,j2}^{'} , \cdots ,A_{n,jn}^{'} \) are the input values, \( S' \) is the conclusion (decision class), and \( \circ \) denotes the composition of fuzzy relations.
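For a single input variable and discretized universes, the scheme (6)-(8) can be illustrated as follows. This is a toy sketch under the common assumptions of minimum for the fuzzy Cartesian product, maximum for aggregation, and sup-min composition; all membership values are made up.

```python
# Toy sketch of (6)-(8) for one input variable over discretized universes.
# Minimum is used for the fuzzy Cartesian product, maximum for aggregating
# the rule relations, and sup-min composition for the inference.
import numpy as np

A1, S1 = np.array([1.0, 0.6, 0.1]), np.array([0.2, 1.0])   # rule 1: A1 -> S1
A2, S2 = np.array([0.1, 0.7, 1.0]), np.array([1.0, 0.3])   # rule 2: A2 -> S2

R1 = np.minimum.outer(A1, S1)         # fuzzy Cartesian product (6)
R2 = np.minimum.outer(A2, S2)
R = np.maximum(R1, R2)                # aggregation of the rules (7)

A_prime = np.array([0.2, 1.0, 0.4])   # observed input value
S_prime = np.max(np.minimum(A_prime[:, None], R), axis=0)   # composition (8)
print(S_prime)                        # fuzzy conclusion over the output classes
```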

In fuzzy modeling we can assume that an expert defines the rule set, or we can generate the rules automatically from a set of samples describing the behavior of the system being modeled.

4 Fuzzy Dempster-Shafer Model

In the Fuzzy Dempster-Shafer (FDS) model [2] we consider rules Rr of the form:

$$ If\;\left( {X_{1} \;is\;A_{{r,1,j_{1} }} } \right)\;And\; \ldots \;And\;\left( {X_{n} \;is\;A_{{r,n,j_{n} }} } \right)\quad Then\quad \left( {D\;is\;m_{r} } \right) $$
(9)

where Xi and D stand for the inputs and the output respectively, and mr is a fuzzy belief structure, that is, a standard belief structure with focal elements Sr,p that are fuzzy subsets of the frame of discernment Θ, with basic probability assignment mr(Sr,p); mr(Sr,p) is the belief that the conclusion should be represented by class Sr,p.

4.1 Learning – Rules Construction

For antecedent construction, let us assume that each example has n features (attributes) in its antecedent. We consider a collection of m generic linguistic terms characterized by membership functions defined in a universe of discourse that is the domain of each attribute. The conclusion belongs to a decision class from the set S.

For each element of data t we build a collection:

$$ \begin{array}{*{20}l} {A_{1,1,t} } \hfill & {A_{2,1,t} } \hfill & \ldots \hfill & {A_{n,1,t} } \hfill \\ {A_{1,2,t} } \hfill & {A_{2,2,t} } \hfill & \ldots \hfill & {A_{n,2,t} } \hfill \\ \vdots \hfill & \vdots \hfill & \ddots \hfill & \vdots \hfill \\ {A_{1,m,t} } \hfill & {A_{2,m,t} } \hfill & \ldots \hfill & {A_{n,m,t} } \hfill \\ \end{array} $$
(10)

where Ai,j,t is the value of the j-th membership function for the i-th feature of the t-th data element.

Example 1.

We demonstrate the calculations on the set of synthetic data presented in Table 1.

Table 1. Sample data set

The first six rows (L1–L6) constitute the learning data, while the remaining ones (T1–T4) form the testing data. All the features are numbers from the interval [0, 9]. The last column represents the decision class, equal to 1 or 2. We consider four quadratic membership functions uniformly distributed over the domain of each attribute. Other membership functions are discussed in Sect. 5.1.

According to (10) for row T1 we have:

$$ \begin{array}{*{20}l} {1.0} \hfill & {1.0} \hfill & {0.0625} \hfill & {0.0156} \hfill \\ 0 \hfill & 0 \hfill & {0.9375} \hfill & {0.9844} \hfill \\ 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill \\ 0 \hfill & 0 \hfill & 0 \hfill & 0 \hfill \\ \end{array} $$

■

On the basis of (10), for a given data point t we can calculate two vectors:

$$ \begin{array}{*{20}c} {A_{\mu ,t} :} & {A_{{1,\max_{1} ,t}} } & {A_{{2,\max_{2} ,t}} } & \ldots & {A_{{n,\max_{n} ,t}} } \\ \end{array} $$
(11)

and index of membership functions

$$ \begin{array}{*{20}c} {I_{c,t} :} & {I_{{1,\max_{1} ,t}} } & {I_{{2,\max_{2} ,t}} } & \ldots & {I_{{n,\max_{n} ,t}} } \\ \end{array} $$
(12)

where \( A_{{i,\max_{i} ,t}} \) is the maximum value over all membership functions designed for feature i, and \( I_{{i,\max_{i} ,t}} \) is the index of the best membership function for feature i.
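The following sketch illustrates (10)-(12) for a single data element. It assumes four quadratic membership functions with evenly spaced peaks over [0, 9], as in Example 1; since the exact parametrization of these functions is not specified in the text, the numerical output will not necessarily reproduce the values quoted above.

```python
# Sketch of (10)-(12): membership matrix of one data element, the vector of
# best membership values, and the vector of their indices. Four quadratic
# membership functions with evenly spaced peaks over [0, 9] are assumed.
import numpy as np

def quadratic_mfs(x, n_funcs=4, lo=0.0, hi=9.0):
    """Memberships of x in n_funcs quadratic functions spread over [lo, hi]."""
    centers = np.linspace(lo, hi, n_funcs)
    width = (hi - lo) / (n_funcs - 1)
    return np.maximum(0.0, 1.0 - ((x - centers) / width) ** 2)

def membership_matrix(row):
    """Collection (10): column i holds the memberships of the i-th feature."""
    return np.column_stack([quadratic_mfs(x) for x in row])

row_T1 = [3, 3, 3, 2]                 # feature values of test element T1
M = membership_matrix(row_T1)         # shape: (n_funcs, n_features)
A_mu = M.max(axis=0)                  # (11): best membership value per feature
I_c = M.argmax(axis=0) + 1            # (12): 1-based index of the best function
print(A_mu, I_c)
```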

Then we have the following candidate for a rule

$$ \begin{array}{*{20}c} {R_{t} :} & {I_{{1,\max_{1} ,t}} } & {I_{{2,\max_{2} ,t}} } & \ldots & {I_{{n,\max_{n} ,t}} } \\ \end{array} $$
(13)

The firing level of the rule is calculated according to the following formula

$$ \tau_{t} = \mathop \phi \limits_{i = 1}^{n} \left( {A_{{i,\max_{i} ,t}} } \right) $$
(14)

where \( \upphi \) denotes the fuzzy matching operator (see Sect. 5.2 for details).

A rule candidate is added to the rule set if \( \phi \left[ {\tau_{r} ,m_{r} } \right] \ge Th \), where Th is a threshold value and \( \phi \) is the matching operator. This helps to eliminate poor rules from the final rule set.

More than one rule can have the same antecedent part, while the conclusions of these rules may differ. We therefore maintain counters ct,1, …, ct,|S|, where |S| denotes the cardinality of the set of decision classes. These counters record how many data elements matching the rule pattern vote for each decision class.
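A compact sketch of the rule-construction loop described above is given below. The product is used as the matching operator and the threshold test is applied directly to the firing level; both are simplifying assumptions, and the per-pattern counters record the class votes.

```python
# Sketch of the rule-construction step: each learning element contributes a
# candidate pattern (13) built from the indices of its best membership
# functions; the firing level (14) is compared with the threshold, and the
# per-class counters accumulate the votes.
import numpy as np

def build_rules(membership_matrices, classes, n_classes,
                threshold=0.75, match=np.prod):
    """membership_matrices: one (n_funcs x n_features) matrix (10) per element;
    classes: the 1-based decision class of each learning element."""
    rules = {}                                    # pattern -> class counters
    for M, cls in zip(membership_matrices, classes):
        tau = match(M.max(axis=0))                # firing level (14)
        if tau < threshold:
            continue                              # eliminate weak candidates
        pattern = tuple(M.argmax(axis=0) + 1)     # candidate antecedent (13)
        counters = rules.setdefault(pattern, [0] * n_classes)
        counters[cls - 1] += 1                    # this element's vote
    return rules
```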

Example 2.

In our sample (T1) the vectors are:

  • Aμ,1: 1.0000, 1.0000, 0.9375, 0.9844

  • Ic,1: 1, 1, 2, 2 with counters vector 1, 0

In our sample the matching value equals 0.9229, where multiplication was used as the matching operator (1.0000 · 1.0000 · 0.9375 · 0.9844 = 0.9229). With the threshold set to 0.75, we obtain a new rule.

This produces a new belief structure on X:

$$ \hat{m}_{r} = \tau_{r} \wedge m_{r} $$
(15)

Its focal elements are fuzzy subsets given as

$$ F_{r,p} (x) = \tau_{r} \wedge S_{r,p} (x) $$
(16)

and the corresponding basic probability assignments of the new focal elements are defined as:

$$ \hat{m}_{r} (F_{r,p} ) = m_{r} (S_{r,p} ) $$
(17)

So we can build an aggregate:

$$ m = \bigcup\limits_{r = 1}^{R} {\hat{m}_{r} } $$
(18)

Then for each collection

$$ \Im = \left\{ {F_{{r_{1} ,p_{1} }} ,F_{{r_{2} ,p_{2} }} , \ldots ,\quad F_{{r_{R} ,p_{R} }} } \right\} $$
(19)

where \( F_{{r_{t} ,p_{t} }} \) are focal elements of \( \hat{m}_{{r_{t} }} \), we obtain a focal element E of m described as

$$ E = \bigcup\limits_{t = 1}^{R} {F_{{r_{t} ,p_{t} }} } $$
(20)

with appropriate probability distribution

$$ m\left( E \right) = \prod\limits_{t = 1}^{R} {m\left( {F_{{r_{t} ,p_{t} }} } \right)} $$
(21)

At this point, the rule generation process is complete.

Example 3.

Our sample data produce the following rule set.

$$ \begin{array}{*{20}l} {} \hfill & {{\text{I}}_{{1\max }} } \hfill & {{\text{I}}_{{2\max }} } \hfill & {{\text{I}}_{{3\max }} } \hfill & {{\text{I}}_{{4\max }} } \hfill & {{\text{C}}_{1} } \hfill & {{\text{C}}_{2} } \hfill & m \hfill \\ {{\text{R1}}:} \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 0 \hfill & {2.5000} \hfill \\ {{\text{R2}}:} \hfill & 1 \hfill & 1 \hfill & 4 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & {2.0833} \hfill \\ {{\text{R3}}:} \hfill & 4 \hfill & 4 \hfill & 4 \hfill & 4 \hfill & 0 \hfill & 1 \hfill & {1.2500} \hfill \\ {{\text{R4}}:} \hfill & 2 \hfill & 3 \hfill & 3 \hfill & 4 \hfill & 0 \hfill & 1 \hfill & {1.2500} \hfill \\ \end{array} $$

The first four elements are the indices of the best membership functions for the corresponding features, the next two are the counters for the decision classes, and the last one is the value of the probability distribution m.

Let us observe that rule R2 covers data elements L1, L4 and L5. L1 and L4 vote for decision class C1, while L5 votes for decision class C2.

Now we can move to the testing of new rules.

4.2 Test

In testing we ignore the value in the last column of Table 1, i.e., the decision class number, because our goal is to predict it.

To compute the firing level of a rule k for a given data

$$ \begin{array}{*{20}c} {X_{k} :} & {X_{1,k} } & {X_{2,k} } & \ldots & {X_{n,k} } & {D_{k} } \\ \end{array} $$
(22)

where Xi,k is the value of the i-th feature and Dk is the decision class against which we compare the result of inference, we build the rule matrix

$$ \mu_{k,t} = \mathop\Phi \limits_{i = 1}^{n} \left( {A_{i,l,k} \left( {X_{i,t} } \right)} \right),\quad l = I_{i,\hbox{max} ,k} $$
(23)

We are interested only in the active rules, i.e., the rows with matching value \( \upmu_{k,t} > 0 \).
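The matching step (23) can be sketched as follows; the rule representation (a pattern of membership-function indices plus class counters) follows the learning sketch above, and the product matching operator is again an assumption.

```python
# Sketch of the test step (23): for each rule, the membership function named
# by the rule pattern is evaluated on every feature of the test element and
# the values are combined with the matching operator; only rules with a
# positive matching value are kept as active.
import numpy as np

def active_rules(rules, M, match=np.prod):
    """rules: pattern -> class counters; M: membership matrix (10) of the
    test element; returns the active rules with their matching values."""
    result = {}
    for pattern, counters in rules.items():
        mu = match([M[f - 1, i] for i, f in enumerate(pattern)])
        if mu > 0:
            result[pattern] = (counters, mu)
    return result
```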

Example 4.

In the test we demonstrate the calculations on data elements L5 and T1:

$$ \begin{array}{*{20}l} {{\text{L}}5:} \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill \\ {{\text{T}}1:} \hfill & 3 \hfill & 3 \hfill & 3 \hfill & 2 \hfill \\ \end{array} $$

For sample data L5 we have two active rules:

$$ \begin{array}{*{20}l} {{\text{R1:}}\; 1\; 1\; 2\; 2\quad } \hfill & {\quad 1\;0\quad } \hfill & {0.859375\;\;0.859375\;\;0.609375\;\;0.609375\;\;0.274242} \hfill \\ {{\text{R2:}}\; 1\; 1\; 1\; 1\quad } \hfill & {\quad 2\;1\quad } \hfill & {0.859375\;\;0.859375\;\;0.859375\;\;0.859375\;\;0.545420} \hfill \\ \end{array} $$

The first four elements are the rule pattern and the next two are the counters for the decision classes. The following four numbers are the values of the appropriate membership functions: 0.859375 is the value of the first membership function (indicated by the first number in the rule) on the first feature, and the next three numbers are calculated in the same way.

The last number in each row is the matching value for the rule, calculated by applying the matching operator to the membership function values.

We focus only on the rows with a matching value greater than zero.

For sample data T1 we have:

$$ \begin{array}{*{20}l} {{\text{R1:}}\; 1\; 1\; 2\; 2\quad } \hfill & {\quad 1\;0\quad } \hfill & {0.4375\;\;0.4375\;\;0.9375\;\;0.6094\;\;0.1093} \hfill \\ {{\text{R2:}}\; 1\; 1\; 1\; 1\quad } \hfill & {\quad 2\;0\quad } \hfill & {0.4375\;\;0.4375\;\;0.4375\;\;0.8594\;\;0.0720} \hfill \\ \end{array} $$

For each collection of focal elements \( F_{{r_{t} ,p_{t} }} \) of \( \hat{m}_{{r_{t} }} \) we define an aggregate

$$ E = \mathop \cup \limits_{t = 1}^{R} F_{{r_{t} ,p_{t} }} $$
(24)

with basic probability assignment

$$ m(E) = \prod\limits_{t = 1}^{R} {m\left( {F_{{r_{t} ,p_{t} }} } \right)} $$
(25)

The result of classification is D is m, with focal elements Ek (k = 1, …, R^{|S|}) and distribution m(Ek). These results are calculated using the focal elements and the corresponding counters ct,1, …, ct,|S|.

Example 5.

For sample points L5 and T1 the counters are 3, 1 and 3, 0, respectively.

Then we perform defuzzification according to the COA (center of area) method [1]:

$$ \bar{y} = \sum\limits_{k = 1}^{{R^{|S|} }} {\bar{y}_{k} m(E_{k} )} $$
(26)

where \( \bar{y}_{k} \) are the defuzzified values for the focal elements Ek, defined as

$$ \bar{y}_{k} = \frac{{\sum\limits_{1 \le t \le n} {x_{t} \mu_{k,t} (x_{t} )} }}{{\sum\limits_{1 \le t \le n} {\mu_{k,t} (x_{t} )} }} $$
(27)
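A sketch of the COA defuzzification (26)-(27) is given below, under the assumption that each focal element is represented by a vector of membership values over a discretized output universe; the universe and the masses in the usage example are illustrative.

```python
# Sketch of the COA defuzzification (26)-(27): each focal element E_k is
# reduced to its centre of area, and the final crisp output is the
# mass-weighted average of these centres.
import numpy as np

def coa(xs, mu):
    """Centre of area (27) of a fuzzy set with memberships mu at points xs."""
    return float(np.sum(xs * mu) / np.sum(mu))

def defuzzify(xs, focal_elements, masses):
    """Mass-weighted defuzzified output (26) over all focal elements."""
    return sum(m_k * coa(xs, e_k) for e_k, m_k in zip(focal_elements, masses))

xs = np.array([1.0, 2.0])                         # the two decision classes
E = [np.array([0.9, 0.1]), np.array([0.2, 0.8])]  # focal elements E_1, E_2
m = [0.7, 0.3]                                    # their masses m(E_k)
print(defuzzify(xs, E, m))                        # crisp value between 1 and 2
```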

In the next step, the rule structure is simplified to

$$ If\,Antecedent_{r} \,Then\,\textit{(} {D\,is\,H_{r} } \textit{)}, $$

where \( H_{r} = \left\{ {\tfrac{1}{{\gamma_{r} }}} \right\} \) is a singleton fuzzy set for factor \( \gamma_{r} = \sum\limits_{p = 1}^{|S|} {\bar{y}_{p} m_{r} (S_{r,p} )} \).

Example 6.

For both L5 and T1 we obtain decision class 1. This is correct in the case of T1, but wrong for L5. The values of Hr are 0.4283 and 0.4800, respectively. ■

5 Empirical Learning for FDS Model

In this section we compare and analyze the performance of several membership functions and matching operators. We start from a standard solution used in introductions to fuzzy modeling and then consider more complicated models. We compute results for the following membership functions: linear, quadratic, Gaussian, and FCM-generated. We concentrate on minimum, multiplication, and implication as matching operators. The most valuable outcome is the comparison of the results of all these variants. At the end of this section we present some results of the experimental research.

5.1 Membership Functions

The membership functions make it possible to divide the data into n intervals; this is a way of discretizing the input data. Hence, we get the best results for continuous data or for data with several discrete (nominal) values. If the data are discrete or binary, the results of the proposed model are not good enough.
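For reference, the three analytically defined families compared here can be sketched as follows; the widths and peak placement are assumptions, and FCM-generated functions (which come from cluster prototypes) are not reproduced.

```python
# Sketch of the membership-function families compared in Table 2: linear
# (triangular), quadratic, and Gaussian functions with evenly spaced centres
# over the attribute domain.
import numpy as np

def _centres(n, lo, hi):
    c = np.linspace(lo, hi, n)
    return c, (hi - lo) / (n - 1)

def linear_mf(x, n=6, lo=0.0, hi=9.0):
    c, w = _centres(n, lo, hi)
    return np.maximum(0.0, 1.0 - np.abs(x - c) / w)

def quadratic_mf(x, n=6, lo=0.0, hi=9.0):
    c, w = _centres(n, lo, hi)
    return np.maximum(0.0, 1.0 - ((x - c) / w) ** 2)

def gaussian_mf(x, n=6, lo=0.0, hi=9.0):
    c, w = _centres(n, lo, hi)
    return np.exp(-((x - c) ** 2) / (2.0 * (w / 2.0) ** 2))
```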

The choice of the membership function has a great influence on the quality of the rules. Although the number of generated rules differs, the quality of classification is comparable.

The most interesting membership functions were generated by the Fuzzy c-Means (FCM) algorithm [1]. The results of the experimental research with membership functions are summarized in Table 2.

Table 2. Membership functions

5.2 Matching Operators

Our experiments showed that the matching operator applied to a data sample against the existing rules plays a very important role in the accuracy of diagnoses. Its influence appears as early as rule generation: the matching operator affects the quality of the generated rules. Of course, this quality is of secondary importance, but in general, the more rules, the better the accuracy. A sketch of the operators considered is given below.
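```python
# Sketch of the three matching operators compared in Table 3: minimum,
# multiplication, and an implication-based operator. The Lukasiewicz-style
# form of the implication is our assumption; the paper does not spell it out.
import numpy as np

def match_min(values):
    return float(np.min(values))

def match_prod(values):
    return float(np.prod(values))

def match_implication(values):
    # Fold the Lukasiewicz implication a -> b = min(1, 1 - a + b) over the values.
    acc = float(values[0])
    for v in values[1:]:
        acc = min(1.0, 1.0 - acc + float(v))
    return acc
```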

From the analysis of the experimental results in Table 4, one might infer that the most powerful operator is implication. This is not entirely true, because Table 4 shows the results for only one fixed threshold value, which is not optimal in all instances, especially for the multiplication operator. Changing the threshold value (e.g., to 0.25) gives almost the same results as the implication operator. The choice of the threshold value is therefore of minor importance here, but it can influence the set of rules obtained. Of course, we cannot analyze the threshold value without keeping in mind the shape of the membership function and the number of intervals. The choice of the threshold value will be the subject of future work.

The results of the investigation of various matching operators are collected in Table 3.

Table 3. Matching operators

6 Experimental Studies

Some results of the experimental research are shown in Table 4. Here we fixed the number of membership functions at 6 and the threshold value at 0.75. Each data set has been divided into two parts: learning (training) data (about 2/3 of the entire data set) and testing data (the remaining 1/3). The learning data were used to generate the rule set, and the testing data were used to evaluate it. To obtain reliable results, we repeated the experiment several times.

Table 4. Experimental results

This setup does not show the most favorable case, but it allows us to see a portion of the real results obtained with different methods of generating rules, and these results can be compared directly.

The methods of automatically generating decision rules described in this paper give the best results on the Iris data set, in a few configurations reaching 100 % decision accuracy. This set consists entirely of continuous attributes. In the other cases, the discretization of the data caused slightly worse results, although for the Ulcers data set, in which over half of the features are discrete, the results on the learning data were nearly the same as for the Iris data.

The Diabetes data set is an example of testing the proposed algorithms on discrete data. The resulting rule accuracy is not satisfactory; it illustrates a case where the method of generating and verifying decision rules is ineffective when used on its own.

Another observation is that the smaller the number of binary attributes in the antecedent, the better the accuracy of our rules. If the features contain no binary data, or if their number is much smaller than that of the other attributes, our rules remain applicable; this can be seen in the analysis of the Breast Cancer Wisconsin and Dermatology data sets. In the Echocardiogram data set, where the number of binary attributes equals three, the results are worse.

In all instances, intermediate reports were stored in disk files. These were used to compute the results with the FCM algorithm and also made it possible to combine the described method with others.

In our research, the proper choice of the threshold value told us whether a rule is valuable or not. The importance of this value is visible in the case of the multiplication operator. The results presented in Table 4 could suggest a lower "weight" of this operator, but that is not the case: if we change the threshold value, we notice that it performs almost the same as the computationally more involved implication operator. Implication is an example of a very interesting and powerful fuzzy relation operator.

We compared the results of our research with the standard decision tree algorithm [10]. For all data sets, we obtained better results using the Gaussian membership function or the FCM algorithm [1], and in a few configurations of the quadratic function we also obtained better accuracy.

In the Dermatology and Echocardiogram data sets, we removed records with missing input features, because we concentrate only on complete data.

7 Conclusions

The study has focused on the use of the Fuzzy Dempster-Shafer model for generating fuzzy decision rules. Fuzzy sets are useful in the discretization of continuous attributes. The approach is discussed in concrete applications to two real medical data sets (in particular, to problems of disease identification) and to several well-known data sets available on the Web. The resulting rules are used to classify objects. A series of experiments has shown that this approach outperforms the C4.5 algorithm in terms of classification accuracy.