Abstract
In this study, we discuss the use of Dempster-Shafer theory as a well-rounded algorithmic vehicle in the construction of fuzzy decision rules. The concept of fuzzy granulation realized via fuzzy clustering is aimed at the discretization of continuous attributes. Detailed experimental studies are presented concerning well-known medical data sets available on the Web.
1 Introduction
Fuzzy modeling is regarded as one of the possible classification architectures of machine learning and data mining. A significant number of studies have been devoted to generating fuzzy decision rules from sample cases or examples, including attempts to extend many classical machine learning methods to learn fuzzy rules. One very popular approach is decision trees [10]. Since the inception of this concept, it has been extended for the construction and interpretation of more advanced decision trees [3, 5–7, 9, 13, 15, 18]. Although decision-tree-based methods can extract a set of fuzzy rules that works well, the lack of backtracking when splitting a node leads to lower learning accuracy compared to other machine learning methods. Another widely used machine learning method is the artificial neural network. In recent years enormous work has been done in attempts to combine the advantages of neural networks and fuzzy sets [14]. Hayashi [4] proposed extracting fuzzy rules from a trained neural network. Lin [8], on the other hand, introduced a method of directly generating fuzzy rules from a self-organized neural network. A common weakness of neural networks, however, is the problem of determining the optimal size of the network configuration, as this has a significant impact on the effectiveness of their performance.
The objective of this paper is to employ the Dempster-Shafer theory (DST) as a vehicle supporting the generation of fuzzy decision rules. More specifically, we concentrate on the role of fuzzy operators, and on the problem of discretization of continuous attributes. We show how they can be effectively used in the quantization of attributes for the generation of fuzzy rules.
The material is arranged in the following way. First, we summarize the underlying concepts of the Dempster-Shafer theory and briefly discuss the nature of the underlying construction. In doing so, the intention is to make the paper self-contained and to help identify some outstanding design problems emerging therein. In Sect. 4 we explain the essentials of our model. Finally, in Sect. 5, we report exhaustive experimental studies.
This paper is a continuation of our earlier work [12]. Here we apply the theoretical vehicle introduced in our previous research to new input data in order to identify possible areas of application. An important objective here is to reveal a way in which this approach becomes essential to a more comprehensive treatment of continuous attributes.
2 Dempster-Shafer Theory
The Dempster-Shafer theory starts by assuming a Universe of Discourse Θ also called Frame of Discernment, which is a finite set of mutually exclusive alternatives. The frame of discernment may consist of the possible values of an attribute. For example, if we are trying to determine the disease of a patient, we may consider Θ being the set consisting of all possible diseases.
With each subset S of Θ the following are associated:
-
a basic probability assignment m(S)
-
a belief Bel(S)
-
a plausible belief Pla(S)
m(S), Bel(S) and Pla(S) have values in the interval [0, 1], and Bel(S) is not greater than Pla(S).
In particular, m represents the strength of some evidence. For example, in a rule-based expert system, m may represent the effect of applying a rule. Bel(S) summarizes all our reasons to believe S. Pla(S) expresses how much we should believe in S if all currently unknown facts were to support S. Thus the true belief in S lies somewhere in the interval [Bel(S), Pla(S)]. More formally, a map \( m: 2^{\Theta } \to \left[ {0,1} \right] \)
such that for each \( {\text{A}} \in 2^{\Theta } \) (where \( 2^{\Theta } \) is set of all subsets of \( \Theta \))
-
1.
\( m(\emptyset ) = 0 \)
-
2.
\( \sum\limits_{{A \subseteq\Theta }} {m(A) = 1} \)
is called a basic probability assignment for \( \Theta \).
Subset A is called a focal element of m if m(A) > 0.
For a given basic probability assignment m, the belief in a subset A of \( \Theta \) is the sum of m(B) over all subsets B of A, i.e. a map \( Bel: 2^{\Theta } \to \left[ {0,1} \right] \) such that \( Bel(A) = \sum\limits_{B \subseteq A} {m(B)} \).
The Plausibility of a subset A of \( \Theta \) is defined as Pla(A) = 1 − Bel(A’), where A’ is the complement of A in \( \Theta \).
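As a small illustration of these definitions, the sketch below computes Bel and Pla from a basic probability assignment over a toy frame of discernment; the disease names and mass values are invented for the example.

```python
def belief(m, A):
    """Bel(A): sum of m(B) over all subsets B of A."""
    return sum(mass for B, mass in m.items() if B <= A)

def plausibility(m, A, theta):
    """Pla(A) = 1 - Bel(A'), where A' is the complement of A in theta."""
    return 1.0 - belief(m, theta - A)

# Frame of discernment: possible diseases (illustrative assumption).
theta = frozenset({"flu", "cold", "measles"})

# Basic probability assignment: masses sum to 1, empty set gets no mass.
m = {
    frozenset({"flu"}): 0.4,
    frozenset({"flu", "cold"}): 0.3,
    theta: 0.3,
}

A = frozenset({"flu", "cold"})
bel = belief(m, A)          # ≈ 0.7: masses of {flu} and {flu, cold}
pla = plausibility(m, A, theta)  # ≈ 1.0: nothing supports {measles} alone
```

Note that Bel(A) ≤ Pla(A) holds, and the gap [Bel(A), Pla(A)] captures the uncertainty left by the mass committed to Θ itself.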
If we are given two basic probability assignments m1 and m2, we can combine them into a third basic probability assignment \( m: 2^{\Theta } \to \left[ {0, 1} \right] \) in the following way.
Let us consider a frame of discernment \( \Theta \) and two belief functions Bel1 and Bel2. We denote the focal elements of Bel1 as A1, …, AK, the focal elements of Bel2 as B1, …, BL, and the corresponding basic probability assignments as m1 and m2. This combination can then be shown graphically as an orthogonal sum of m1 and m2.
The mass of probability of the intersection Ai ∩ Bj, expressed as the measure m1(Ai)·m2(Bj), is illustrated in Fig. 1.
Of course, several intersections can yield the same focal element A. In general the mass of probability of set A is defined as:

\( m(A) = \sum\limits_{{\begin{array}{*{20}c} {i,j} \\ {A_{i} \cap B_{j} = A} \\ \end{array} }} {m_{1} (A_{i} ) \cdot m_{2} (B_{j} )} \quad (3) \)
A problem arises with the empty set. Pairs of focal elements with empty intersection may exist, and this is a normal situation in many combinations. The mass of ∅, according to the above definition, would then be greater than zero, which is impossible under the definition of a basic probability assignment.
We assume that \( \sum\limits_{{\begin{array}{*{20}c} {i,j} \\ {A_{i} \cap B_{j} = \emptyset } \\ \end{array} }} {m_{1} (A_{i} ) \cdot m_{2} (B_{j} )} < 1 \) in order to define the orthogonal sum of m1 and m2, denoted \( m_{1} \oplus m_{2} \).
Then it is necessary to change the definition (3) of the formula of the basic probability assignment of the combination as follows:
-
1.
\( m(\emptyset ) = 0 \)
-
2.
\( m(A) = \frac{{\sum\limits_{{\begin{array}{*{20}c} {i,j} \\ {A_{i} \cap B_{j} = A} \\ \end{array} }} {m_{1} (A_{i} ) \cdot m_{2} (B_{j} )} }}{{1 - \sum\limits_{{\begin{array}{*{20}c} {i,j} \\ {A_{i} \cap B_{j} = \emptyset } \\ \end{array} }} {m_{1} (A_{i} ) \cdot m_{2} (B_{j} )} }} \;{\text{for}}\,{\text{non-empty}}\;{\text{A}} \subset \Theta \)
We call this the orthogonal sum of Bel1 and Bel2, and denote it Bel1 ⊕ Bel2.
This is Dempster's rule of combination of beliefs [11].
For \( \Theta \) we have \( Bel\left(\Theta \right) = \sum\limits_{{A \subseteq\Theta }} {m(A)} = 1 \); for the combination, if all intersections \( A_{i} \cap B_{j} \) are empty, the orthogonal sum does not exist.
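The combination rule above can be sketched directly: masses of conflicting (empty-intersection) pairs are discarded and the remainder is renormalized. The frames and masses below are invented for illustration.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule: orthogonal sum m1 ⊕ m2 with conflict renormalization."""
    raw = {}
    conflict = 0.0
    for (A, a), (B, b) in product(m1.items(), m2.items()):
        inter = A & B
        if inter:
            raw[inter] = raw.get(inter, 0.0) + a * b
        else:
            conflict += a * b          # mass assigned to the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: orthogonal sum undefined")
    k = 1.0 - conflict                 # normalization factor from (3)
    return {A: v / k for A, v in raw.items()}

m1 = {frozenset({"flu"}): 0.6, frozenset({"flu", "cold"}): 0.4}
m2 = {frozenset({"cold"}): 0.5, frozenset({"flu", "cold"}): 0.5}
m = combine(m1, m2)
# m assigns 0.3/0.7 to {flu}, 0.2/0.7 to {cold}, 0.2/0.7 to {flu, cold}.
```

After renormalization the combined masses again sum to 1, as required of a basic probability assignment.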
3 Fuzzy Modelling
Fuzzy set theory is widely known, so we do not introduce the underlying concepts essential to understanding this framework. Interested readers are referred to [16] and [17].
Fuzzy modeling is applied in areas where the model of the system cannot be described precisely, for many reasons. The input data received by the system may not be completely reliable, may contain noise, or may be inconsistent with other data or with expectations about these data. The system is described by a set of linguistic rules. Let D denote the output variable of the system, and X1, X2, …, Xn denote the input variables. The linguistic rules have the following format:
where (Xi is Ak,i,ji) are the fuzzy antecedents, Ak,i,ji (1 ≤ ji ≤ |Ai|) are values of the i-th input variable, and Sk,p (1 ≤ p ≤ |S|) is the value of output variable in k-th rule.
The rules are implemented as fuzzy relations according to the formula:
where × denotes the fuzzy Cartesian product.
Then all rules are aggregated to relation R described as:
where M is the number of rules.
The conclusion is based on the compositional rule of inference
where \( A_{1,j1}^{'} ,A_{2,j2}^{'} , \cdots ,A_{n,jn}^{'} \) are input values, \( S' \) is a conclusion (decision class), and \( \circ \) denotes the composition of fuzzy relation.
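A minimal sketch of the max-min compositional rule of inference on discretized universes follows; the membership values, the three input levels, and the two output classes are invented for illustration.

```python
import numpy as np

# Two hypothetical rules over 3 input levels and 2 output classes.
A1 = np.array([0.2, 0.9, 0.4])   # rule 1 antecedent memberships
S1 = np.array([1.0, 0.0])        # rule 1 consequent (class 1)
A2 = np.array([0.7, 0.3, 0.1])   # rule 2 antecedent memberships
S2 = np.array([0.0, 1.0])        # rule 2 consequent (class 2)

# Each rule becomes a relation min(A, S); rules aggregate by max (8)-style.
R = np.maximum(np.minimum.outer(A1, S1), np.minimum.outer(A2, S2))

def infer(a_prime, R):
    """Compositional rule of inference: S' = A' o R (max-min composition)."""
    return np.max(np.minimum(a_prime[:, None], R), axis=0)

s = infer(np.array([0.1, 1.0, 0.2]), R)   # conclusion over the two classes
```

The conclusion vector is then defuzzified or its maximal element taken as the decision class.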
In fuzzy modeling we can assume that an expert defines the rule set, or we can generate it automatically from a set of samples describing the behavior of the system being modeled.
4 Fuzzy Dempster-Shafer Model
In the Fuzzy Dempster-Shafer (FDS) model [2] we consider rules Rr of the form:
where Xi and D stand for input and output respectively, and mr is a fuzzy belief structure, that is, a standard belief structure with focal elements Sr,p being fuzzy subsets of the frame of discernment Θ, with basic probability assignment mr(Sr,p); mr(Sr,p) is the belief that the conclusion should be represented by class Sr,p.
4.1 Learning – Rules Construction
In antecedent construction, let us assume that we have n features (attributes) in the antecedents of an example. We consider a collection of m generic linguistic terms characterized by membership functions defined in a universe of discourse being the domain of each attribute. The conclusion belongs to decision class S.
For each element of data t we build a collection:
where Ai,j,t are the values of j-th membership function for i-th feature and for t-th element of data.
Example 1.
We demonstrate the calculations on the set of synthetic data presented in Table 1.
The first six rows (L1–L6) constitute the learning data, while the remaining ones (T1–T4) form the testing data. All features are numbers from the interval [0, 9]. The last column represents the decision class, equal to 1 or 2. We consider four quadratic membership functions uniformly distributed over the domain of each attribute. Other membership functions are discussed in the next section.
According to (10) for row T1 we have:
■

On the basis of (10), for a given data point t we can calculate two vectors:
and index of membership functions
where \( A_{{i,\max_{i} ,t}} \) is the maximum value over all membership functions designed for feature i, and \( I_{{i,\max_{i} ,t}} \) is the index of the best membership function for feature i.
Then we have the following candidate for a rule
The firing level of the rule is calculated according to the following formula
where \( \upphi \) denotes the fuzzy matching operator; see Sect. 5.2 for details.
The rule candidate is added to the rule set if \( \phi \left[ {\tau_{r} ,m_{r} } \right] \ge Th \) (where Th is a threshold value and \( \phi \) the matching operator). This helps eliminate bad rules from the final rule set.
More than one rule can have the same antecedent part while the conclusions of these rules differ. We therefore use counters ct,1, …, ct,|S|, where |S| denotes the cardinality of the decision class set. These counters show how many data points matching the rule pattern vote for each decision class.
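The rule-construction step can be sketched as follows: each data point contributes its best-membership pattern, weak candidates are filtered by the threshold, and counters record class votes. The quadratic membership family and its parameters below are assumptions for illustration, not the paper's exact form.

```python
from collections import defaultdict

def quad_memberships(x, n=4, lo=0.0, hi=9.0):
    """n quadratic membership functions uniformly spread over [lo, hi]
    (assumed form: 1 - ((x - c)/w)^2, clipped at 0)."""
    w = (hi - lo) / (n - 1)
    return [max(0.0, 1.0 - ((x - (lo + i * w)) / w) ** 2) for i in range(n)]

def build_rules(data, labels, n_classes=2, threshold=0.75):
    """Group training points by best-membership pattern; count class votes."""
    counters = defaultdict(lambda: [0] * n_classes)
    for row, cls in zip(data, labels):
        mems = [quad_memberships(x) for x in row]
        # Antecedent pattern: index of the best membership per feature.
        best = tuple(max(range(len(m)), key=m.__getitem__) for m in mems)
        firing = 1.0
        for m in mems:
            firing *= max(m)           # product as the matching operator
        if firing >= threshold:        # discard weak rule candidates
            counters[best][cls] += 1   # this point votes for its class
    return dict(counters)

rules = build_rules([[0.5, 3.1], [0.4, 2.9], [8.2, 0.3]], [0, 0, 1])
```

Two points fall into the same pattern here, so its counter accumulates two votes for class 0, mirroring how rule R2 in Example 3 covers several learning rows.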
Example 2.
In our sample (T1) the vectors are:
-
Aμ,1: 1.0000, 1.0000, 0.9375, 0.9844
-
Ic,1: 1, 1, 2, 2 with counters vector 1, 0
In our sample the matching value equals 0.9229, where multiplication was used as the matching operator (1.0000 · 1.0000 · 0.9375 · 0.9844 = 0.9229). For a threshold set to 0.75, we obtain a new rule.
■
The product is a new belief structure on X
Focal elements are fuzzy subset given as
and appropriate distributions of new focal elements are defined as:
So we can build an aggregate:
Then for each collection
where \( F_{{r_{t} ,p_{t} }} \) are focal elements of \( \hat{m}_{r} \) we have focal element E of m described as
with appropriate probability distribution
At this point, the rule generation process is complete.
Example 3.
Our sample data produce the following rule set.
The first four elements are the indices of the best membership functions for the corresponding features, the next two are counters for the decision classes, and the last one is a probability distribution.
Let us observe that rule R2 covers the data L1, L4 and L5. L1 and L4 produce decision class C1 but L5 decision class C2.
■
Now we can move to the testing of new rules.
4.2 Test
In testing we ignore the value in the last column of Table 1, i.e. the decision class number, because our goal is to compute it.
To compute the firing level of a rule k for a given data point
where Xi,k is the feature value and Dk is the conclusion decision class that we compare with the result of inference, we build a rule matrix
We are interested only in active rules, i.e. rows with matching value \( \mu_{k,t} > 0 \).
Example 4.
In test we will demonstrate calculations on
For sample data L5 we have two active rules:
The first four elements are the rule pattern, and the next two are the counters for the decision classes. The next four numbers are the values of the appropriate membership functions: 0.859375 is the value of the first membership function (according to the first number in the rule) on the first feature, and the next three numbers are calculated in the same way.
The last number in each of the above rows is the matching value for the rule, calculated by applying the matching operator to the values of the membership functions.
We focus only on the rows with a matching value greater than zero.
For sample data T1 we have:
■
For each collection of focal elements \( F_{{r_{t} ,p_{t} }} \) of \( \hat{m}_{r} \) we define an aggregate
with basic probability assignment
The result of classification is "D is m", with focal elements Ek (k = 1, …, R|S|) and distribution m(Ek). These results are calculated using the focal elements and the appropriate counters ct,1, …, ct,|S|.
Example 5.
For sample points L5 and T1 the counters are 3, 1 and 3, 0 respectively.
■
Then we perform defuzzification according to COA method [1].
where \(\bar{y}\)k are defuzzified values for focal element Ek defined as
In the next step, the rule structure is simplified to
where \( H_{r} = \left\{ {\tfrac{1}{{\gamma_{r} }}} \right\} \) is a singleton fuzzy set for factor \( \gamma_{r} = \sum\limits_{p = 1}^{|S|} {\bar{y}_{p} m_{r} (S_{r,p} )} \).
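A minimal sketch of the COA (centre-of-area) defuzzification used above: the numeric class labels are weighted by their aggregated membership degrees. The values below are invented for illustration.

```python
import numpy as np

def coa(universe, membership):
    """Centre-of-area defuzzification: membership-weighted mean of the
    universe points."""
    membership = np.asarray(membership, dtype=float)
    return float(np.dot(universe, membership) / membership.sum())

y = np.array([1.0, 2.0])     # decision classes as numeric points
mu = np.array([0.75, 0.25])  # aggregated membership degree of each class
crisp = coa(y, mu)           # 1.25: the evidence leans towards class 1
```

Rounding or thresholding the crisp value then yields the final decision class.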
Example 6.
For both L5 and T1 we obtain decision class 1. This is correct for T1 but wrong for L5. The values of Hr are 0.4283 and 0.4800 respectively. ■
5 Empirical Learning for FDS Model
In this section we compare and analyze the performance of several membership functions and matching operators. We start from a standard solution used in introductions to fuzzy modeling and then consider more complicated models. We compute results for the following membership functions: linear, quadratic, Gaussian, and FCM-generated. We concentrate on minimum, multiplication, and implication as matching operators. The most valuable part is the comparison of the results of all these calculations. At the end of this section we show some results of the experimental research.
5.1 Membership Functions
The membership functions make it possible to divide the data into n intervals; this is a way of discretizing the input data. Hence we get the best results for continuous data or for data with several discrete (nominal) values. For discrete or binary data the results of the proposed model are not good enough.
The choice of membership function has a great influence on the quality of the rules. Although the number of rules differs, the quality of classification is comparable.
The most interesting membership functions were generated by the Fuzzy c-Means (FCM) algorithm [1]. The results of the experimental research on membership functions are summarized in Table 2.
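The fixed membership families compared in Table 2 can be sketched as below. The paper does not spell out the exact parameterizations, so the forms used here (n functions placed uniformly, width tied to the grid spacing) are assumptions for illustration.

```python
import math

def linear(x, c, w):
    """Triangular ('linear') function centred at c with half-width w."""
    return max(0.0, 1.0 - abs(x - c) / w)

def quadratic(x, c, w):
    """Quadratic function: 1 - ((x - c)/w)^2, clipped at 0."""
    return max(0.0, 1.0 - ((x - c) / w) ** 2)

def gaussian(x, c, w):
    """Gaussian function centred at c with spread w."""
    return math.exp(-((x - c) / w) ** 2)

def grid(x, f, n=6, lo=0.0, hi=9.0):
    """Evaluate n uniformly placed functions of family f at point x,
    yielding the fuzzy discretization of a continuous attribute."""
    w = (hi - lo) / (n - 1)
    return [f(x, lo + i * w, w) for i in range(n)]
```

FCM-generated memberships differ in that their centres come from clustering the data rather than from a uniform grid.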
5.2 Matching Operators
Our experiments showed that the matching operator applied to a data sample with existing rules plays a very important role in the accuracy of diagnoses. This effect appears as early as rule generation: the matching operator influences the quality of the generated rules. Of course, this quality is of secondary importance, but in general, the more rules, the better the accuracy.
From the analysis of the results in Table 4, we might infer that the most powerful operator is implication. This is not entirely true, because Table 4 shows the results for only one fixed threshold value, which is not optimal in all instances, especially for the multiplication operator. Changing the threshold value (e.g. to 0.25) gives almost the same results as the implication operator. Still, the choice of the threshold value is of minor importance here, though it can influence the resulting rule set. Of course, we cannot analyze the threshold value without keeping in mind the properties of the membership function and the number of intervals. The choice of the threshold value will be the subject of future work.
The results of the investigation of various matching operators are collected in Table 3.
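The matching operators compared in Table 3 can be sketched as follows. The paper does not give the exact implication operator it uses, so the Łukasiewicz t-norm below stands in as an assumed example of an implication-style conjunction.

```python
from functools import reduce

def match_min(values):
    """Minimum operator: the weakest antecedent degree decides."""
    return min(values)

def match_product(values):
    """Multiplication operator: degrees are multiplied together."""
    return reduce(lambda a, b: a * b, values, 1.0)

def match_lukasiewicz(values):
    """Łukasiewicz t-norm (assumed stand-in for 'implication'):
    max(0, sum - (n - 1))."""
    return max(0.0, sum(values) - (len(values) - 1))

v = [1.0, 1.0, 0.9375, 0.9844]  # membership degrees from Example 2
# match_product(v) reproduces the 0.9229 matching value of Example 2;
# match_min(v) gives 0.9375, a less strict score for the same rule.
```

The stricter the operator, the fewer rule candidates pass a fixed threshold, which is why the threshold and the operator cannot be tuned independently.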
6 Experimental Studies
Some results of the experimental research are shown in Table 4. Here we fixed the number of membership functions at 6 and the threshold value at 0.75. All data sets were divided into two parts: learning (training) data (about 2/3 of the entire data set) and testing data (the remaining 1/3). The learning data were used to generate the rule set; the testing data were used to evaluate it. To obtain reliable results, we carried out each experiment several times.
This setup does not show the most favorable case, but it allows us to see part of the real results obtained with different methods for generating rules, and these results are comparable.
The methods of automatically generating decision rules described in this paper achieve the best results on the Iris data set, in several runs reaching 100% decision accuracy. This set consists entirely of continuous values. In the other cases, the necessary discretization of the data caused slightly worse results. Nevertheless, in the case of the Ulcers data set, for which over half of the features were discrete, the results on the learning data were nearly the same as for the Iris data.
The Diabetes data set is a sample for testing the proposed algorithms on discrete data. The rule accuracy results are not satisfactory; this illustrates a case where the proposed method of generating and verifying decision rules is ineffective on its own.
Another observation is that the fewer binary features appear in the antecedents, the better the accuracy of our rules. If the features contain no binary data, or far fewer binary features than others, our rules remain applicable; we can see this when analyzing the Breast Cancer Wisconsin and Dermatology data sets. In the Echocardiogram data set, where the number of binary features equals three, the results are worse.
In all instances, intermediate reports were stored in disk files. These were used to compute results with the FCM algorithm, and they also made it possible to combine the described method with others.
In our research, the proper choice of the threshold value told us whether a rule is valuable or not. The importance of this value is visible in the case of the multiplication operator: the results presented in Table 4 could suggest a lower "weight" of this operator, but that is not true. If we change the threshold value, we notice that it performs almost the same as the computationally more involved implication operator. Implication is an example of a very interesting and powerful operator of fuzzy relations.
We compared the results of our research with the standard decision tree algorithm [10]. For all data sets, we obtained better results using the Gaussian function or the FCM algorithm [1], and at a few points we also obtained better accuracy with the quadratic function.
In the Dermatology and Echocardiogram data sets we removed records with missing input features, because we concentrate only on complete data.
7 Conclusions
The study has focused on the use of the Fuzzy Dempster-Shafer model for generating fuzzy decision rules. Fuzzy sets are useful in the discretization of continuous attributes. The approach is discussed in concrete applications to two real medical data sets (especially problems of disease identification) and several well-known data sets available on the Web. The results are used to classify objects. A series of experiments has shown that this approach outperforms the C4.5 algorithm in terms of classification accuracy.
References
Bezdek, J.C., Sabin, M.J., Tucker, W.T.: Convergence theory for fuzzy c-means: counterexamples and repairs. IEEE Trans. Syst. Man Cybern. SMC-17(5), 873–877 (1987)
Binaghi, E., Gallo, I., Madella, P.: A neural model for fuzzy Dempster-Shafer classifiers. Int. J. Approx. Reason. 25, 89–121 (2000)
Chiang, I.J., Hsu, J.Y.J.: Integration of fuzzy classifiers with decision trees. In: Proceedings of 1996 Asian Fuzzy Systems Symposium, pp. 266–271 (1996)
Hayashi, Y., Imura, A.: Fuzzy neural expert system with automated extraction of fuzzy if-then rules from a trained neural network. In: 1st International Symposium on Uncertainty Modeling and Analysis, pp. 489–494, December 1990
Hayashi, A., Maeda, T., Bastian, A., Jain, L.C.: Generation of fuzzy decision trees by fuzzy ID3 with adjusting mechanism of AND/OR operators. In: Proceedings of 1998 International Conference on Fuzzy Systems (FUZZ-IEEE 1998), pp. 681–685 (1998)
Ichihashi, H., Shirai, T., Nagasaka, K., Miyoshi, T.: Neuro-fuzzy ID3: a method of inducing fuzzy decision trees with linear programming for maximizing entropy and an algebraic method for incremental learning. Fuzzy Sets Syst. 81(1), 157–167 (1996)
Janikow, C.Z.: Fuzzy decision trees: issues and methods. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 28(1), 1–14 (1998)
Lin, C.T., Lee, C.S.G.: Neural-network-based fuzzy logic control and decision system. IEEE Trans. Comput. 40(12), 1320–1336 (1991)
Pedrycz, W., Sosnowski, Z.A.: C-fuzzy decision trees. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 35(4), 498–511 (2005)
Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1, 81–106 (1986)
Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press, Princeton (1976)
Sosnowski, Z.A., Walijewski, J.S.: Generating fuzzy decision rules with the use of Dempster-Shafer theory. In: ESM 1999, pp. 419–426, Warsaw (1999)
Umano, M., Okamoto, H., Hatono, I., Tamura, H., Kawachi, F., Umedzu, S., Kinoshita, J.: Generation of fuzzy decision trees by fuzzy ID3 algorithm and its application to diagnosis by gas in oil. In: Proceedings of 1994 Japan-USA Symposium on Flexible Automation, pp. 1445–1448 (1994)
Viharos, Z.J., Kis, K.B.: Survey on neuro fuzzy systems and their applications in technical diagnostics. Measurement 67, 126–136 (2015)
Weber, R.: Fuzzy ID3: a class of methods for automatic knowledge acquisition. In: Proceedings of 2nd International Conference on Fuzzy Logic and Neural Networks, Iizuka, Japan, pp. 265–268 (1992)
Yager, R., Filev, D.: Essentials of Fuzzy Modeling and Control. Wiley, New York (1994)
Yager, R., Filev, D.: Including probabilistic uncertainty in fuzzy logic controller modeling using Dempster-Shafer theory. IEEE Trans. Syst. Man Cybern. 25(8), 1221–1230 (1994)
Yun, J., won Seo, J., Yoon, T.: The new approach on fuzzy decision trees. Int. J. Fuzzy Logic Syst. 4(3) (2014)
Acknowledgement
This work was supported by grant S/WI/1/2013 from Bialystok University of Technology, funded by the Ministry of Science and Higher Education.
Copyright information
© 2016 IFIP International Federation for Information Processing
Sosnowski, Z.A., Walijewski, J.S. (2016). Fuzzy Dempster-Shafer Modelling and Decision Rules. In: Saeed, K., Homenda, W. (eds.) Computer Information Systems and Industrial Management. CISIM 2016. Lecture Notes in Computer Science, vol. 9842. Springer, Cham. https://doi.org/10.1007/978-3-319-45378-1_46