Keywords

1 Introduction

Medical knowledge extraction is the second most popular area in data mining applications [2]. A diagnosis support tool should deal with the diagnosis uncertainty and medical data imprecision. Simultaneously, the support tool should offer not only diagnosis efficiency but also an inference that is easy to understand for physicians and a reasonable size of extracted knowledge [3]. These are inevitable requirements of a knowledge engineer and a diagnostician cooperation. In medicine even reliable tools can not substitute a diagnostician, they can only support him. The most popular medical knowledge extraction approaches are decision trees [5], support vector machines [4] and neural networks [1]. Their inference parameters (weights or structure) can be translated to the readable IF-THEN rules with additional methods. In this paper the connection of the Dempster-Shafer theory (DST) and the fuzzy set theory (FST) is used to perform diagnostic inference. With the basic probability assignment (critical component of DST) it is possible to represent the knowledge uncertainty. On the other hand, FST is used as well-known method to represent data imprecision in many fields. As a result of a study on these methods, new rule extraction algorithm is described. The main requirement in the proposed algorithm is to provide a consistent and easy to interpret diagnostic rule base, also paying attention to its reliability. Experimental results evaluate the approach from the point of view of a diagnostician and knowledge engineer cooperation.

2 Medical Diagnosis Support Based on the DST and FST

Medical diagnosis support is usually based on with the IF-THEN rules [2], which shape the inference from symptoms (conditions in a premise) and diagnosis (conclusion). Let us denote \(\varvec{x}\) as the medical data case represented with p symptoms:

$$\begin{aligned} \varvec{x} = [x_1, x_2, ..., x_j, ..., x_p]^\top . \end{aligned}$$
(1)

A simultaneous use of the DST and FST [8] makes it possible to define the fuzzy focal element, which is a simple or complex logical condition related to the specific medical data symptoms, which can be test results or natural language statements like “high temperature” [8]. The simplest focal element considers only one symptom and the most complex focal element considers p symptoms. It is clear that the possible number of fuzzy focal elements for one diagnosis equals \(2^p-1\). General form of the fuzzy focal element can be represented in the following way:

$$\begin{aligned} s^{(r_l)}:\,X_{j}\text { is }A_{j}^{(r_l)} \wedge \cdots \wedge X_{k}\text { is }A_{k}^{(r_l)}, \end{aligned}$$
(2)

where \(A_j^{(r_l)}\) is the fuzzy set described by the fuzzy membership function \(\mu ^{(r_l)}_j\), (j = 1, ..., p, \(r_l\) = 1, ..., \(n^{(l)}\), l = 1, ..., C). This membership function describes j-th symptom values characteristic for the l-th diagnosis. The \(m_l(s^{(r_l)})\) value is assigned to this focal element. The \(m_l(s^{r_l})\), \(r_l\) = 1, ..., \(n^{(l)}\), make the basic probability assignment. In the decision support area the latter can be treated as the uncertainty of diagnostic rules.

2.1 Basic Probability Assignment

The BPA can be given by a human expert, i.e. diagnostician, but it can be also calculated with the training data. It must fulfil the following conditions:

$$\begin{aligned} m_l(f) = 0,\quad \sum \limits _{s^{(r_l)}\in \varvec{S}_l} m_l(s^{(r_l)}) = 1, \end{aligned}$$
(3)

where f is a false premise and \(\varvec{S}_l\) is the fuzzy focal element set for the l-th diagnosis. In the [6] the basic probability assignment is calculated as a measure of correspondence between \(s^{(r_l)}\) focal element and suitable l-th diagnosis cases. Its calculation can be briefly described by the following formula:

$$\begin{aligned} m_l(s^{(r_l)}) = \frac{\sum \limits _{i = 1}^{n_l}{\gamma ^{(r_l)}_i}}{\sum \limits _{r_l = 1}^{n^{(l)}}\left( {\sum \limits _{i = 1}^{n_l}{\gamma ^{(r_l)}_i}}\right) }, \end{aligned}$$
(4)

where \(n_l\) is l-th diagnosis data set size, \(n^{(l)}\) is the \(\varvec{S}_l\) size and

$$\begin{aligned} \gamma _{i}^{(r_l)} = {\left\{ \begin{array}{ll} 1&{}\text { for }\eta ^{(r_l)}>\eta _{BPA},\\ 0&{}\text { for }\eta ^{(r_l)}\le \eta _{BPA}, \end{array}\right. } \end{aligned}$$
(5)

where \(\eta ^{(r_l)}\) is the matching level of the \(s^{(r_l)}\) with the data case. According to (4), the basic probability value is the number of cases conformed with the l-th diagnosis cases. This number is normalized with the sum of all \(s^{(r_l)}\) conformation numbers. This operation provides truth of (3). The conformation condition is based on the matching level, which can be calculated in the following way:

$$\begin{aligned} \eta ^{(r_l)} = \min \limits _{j\in \mathbb {J}^{(r_l)}}\mu _j^{(r_l)}(x_{j}), \end{aligned}$$
(6)

where \(\mathbb {J}^{(r_l)}\) is the set of considered symptoms in the \(s^{(r_l)}\) and \(x_j\) is the j-th symptom value of data case \(\varvec{x}\). According to (5), the \(\eta _{BPA}\) influences the basic probability assignment. Lower values of the threshold allow more uncertainty of the focal elements, which are treated as conformed. On the other hand, higher threshold values cause that only the best adjusted focal elements can obtain the highest BPA value.

2.2 Diagnosis Belief Measure

The DST-based diagnosis support is related to the diagnosis belief calculation, which adds the basic probability values of the \(s^{(r_l)}\) elements, which are sufficiently conformed with the medical data case. The belief measure is calculated as:

$$\begin{aligned} Bel_l(\varvec{x}) = \sum \limits _{r_l = 1}^{n^{(l)}} \gamma _{Bel}^{(r_l)}m_l(s^{(r_l)}), \end{aligned}$$
(7)

where

$$\begin{aligned} \gamma _{Bel}^{(r_l)} = {\left\{ \begin{array}{ll} 1&{}\text { for }\eta ^{(r_l)}>\eta _{T},\\ 0&{}\text { for }\eta ^{(r_l)}\le \eta _{T}. \end{array}\right. } \end{aligned}$$
(8)

According to (7) and (8), \(m_l(s^{(r_l)})\) is added to \(Bel_l\), if the \(s^{(r_l)}\) matching level with data case is higher than determined threshold \(\eta _T\). Belief measure should be calculated for all considered diagnoses (\(D_l\)). Final diagnosis (denoted as D) is chosen according to the following formula:

$$\begin{aligned} D(\varvec{x})= D_l\text { for } l = \mathop {\mathrm{argmax}}\limits _{1\le l \le C}\,{(Bel_l(\varvec{x}))}. \end{aligned}$$
(9)

It is possible that the highest Bel value can be obtained for more than one diagnosis. In this situation medical diagnosis should not be stated arbitrarily. Hence the data case is treated as undiagnosed. Similarly to the \(\eta _{BPA}\), the \(\eta _T\) threshold is the parameter, which tunes the inference and allow to use in the belief calculations more or less suited focal elements.

3 The Design of Membership Functions

Membership functions for fuzzy focal elements can be designed through statistical analysis of training data. In [6] different membership function shapes are considered and discussed. In this paper, trapezoidal membership functions are used. Their shape can be described as:

$$\begin{aligned} \mu (x,[a,b,c,d]) = {\left\{ \begin{array}{ll} 0,&{}x\le a,\\ \frac{x-a}{b-a},&{}a< x \le b,\\ 1,&{}b< x \le c,\\ \frac{d-x}{d-c},&{}c < x \le d,\\ 0,&{}x > d. \end{array}\right. } \end{aligned}$$
(10)

Let us denote \(\varvec{X}_l\) as the subset of the medical data set \(\varvec{X}\), for which every data case is assigned to the l-th diagnosis. Hence, \(\varvec{X}_l\) can be denoted as:

$$\begin{aligned} \varvec{X}_l = \{\varvec{x}_{l1},\,\cdots ,\,\varvec{x}_{li},\,\cdots ,\,\varvec{x}_{ln_l}\}, \end{aligned}$$
(11)

where \(n_l\) is the \(\varvec{X}_l\) set size. Moreover, let us denote \(\varvec{x}_{jl}\) as j-th symptom values from \(\varvec{X}_l\) as:

$$\begin{aligned} \varvec{x}_{jl} = \{x_{j1}\,\cdots ,\,x_{ji}\,\cdots ,x_{jn_l}\}. \end{aligned}$$
(12)

In the research one membership function is designed for each j-th symptom and l-th diagnosis data subset. Hence \(C\cdot p\) membership functions are designed. To this end, statistical measures of the \(\varvec{x}_{jl}\) should be found: \(\bar{x}_{jl}\) and \(\sigma _{jl}\) as the mean and standard deviation values. This values can be used to treat the \(\varvec{x}_{jl}\) as normally distributed data. Remaining values, which are used for the trapezoidal function design are: lower and upper quartiles (\(Q_{1_{jl}}\), \(Q_{3_{jl}}\)) of normal distribution and the intersection point \(x_c^{(l)}\) between two neighbouring (l-th and \((l+1)\)-th) distributions. It is clear that the trapezoidal function shapes should be designed individually for data subsets with the lowest, intermediate and highest mean values. The equations used to trapezoidal membership function parameter calculation are presented in the Table 1. The z and Z in Table 1 are sufficiently small and big numbers according to minimal and maximal values of \(\varvec{x}_{jl}\), respectively. This operation provides, that trapezoidal function are left- and right-opened for the lowest and the highest value of training data, respectively. However, for few data couples it is possible that lower or upper quartile outreach suitable \(x_c^{(l)}\) value. Hence the design must be protected from incorrectly parameter calculation. It can be done with two conditions:

  • if \(b \le x_{c}^{(l-1)}\):

    • \(a = 0.99\cdot x_{c}^{(l-1)}\),

    • \(b = 1.01\cdot x_{c}^{(l-1)}\),

  • if \(c \ge x_{c}^{(l)}\):

    • \(c = 0.99\cdot x_{c}^{(l)}\),

    • \(d = 1.01\cdot x_{c}^{(l)}\),

which recalculate function parameters to obtain sufficiently steep slopes [7].

Table 1. Trapezoidal membership function parameters calculated according to the data statistical analysis

4 Rule Selection Algorithm

The main aim of the rule extraction is to find the best subset of \(2^p-1\) possible focal elements of the l-th diagnosis. The solution should provide satisfactory results in the view of the diagnosis efficiency and rule number. The rule elimination algorithm was used to obtain simple rules in the [6, 8]. According to these results a different way to the rule extraction can be proposed when an acceptable effectiveness must go along with the simplest rules. Here, the rule selection algorithm is described. Its performance is based on iterative focal element adding to the created fuzzy focal element set to obtain the best diagnosis efficiency. The rule selection algorithm can be described in the following steps:

  1. 1.

    Choose training data \(\varvec{X}=\{\varvec{x}_1,...,\varvec{x}_i,...,\varvec{x}_N\}\) as well as the \(\eta _{BPA}\) and \(\eta _T\) thresholds.

  2. 2.

    Design fuzzy membership functions for each symptom and each diagnosis \(\mu _j^{(l)}(x_j)\), l = 1, \(\cdots \), C, j = 1, \(\cdots \), p through statistical analysis of \(\varvec{X}\).

  3. 3.

    Create initial fuzzy focal element sets \(\hat{\varvec{S}}_l\) and calculate the basic probability assignment for each set of diagnostic rules \(s^{(r_l)}\), \(r_l\) = 1, \(\cdots \), \(n^{(l)}\)l = 1, \(\cdots \)C.

  4. 4.

    For each diagnosis, find the first focal element with the maximal \(m_l(s^{(r_l)})\) value in \(\hat{\varvec{S}_l}\) and move it to the new created focal element set \(\varvec{S}_l\).

  5. 5.

    Recalculate the basic probability assignment for every fuzzy focal element \(s^{(r_l)}\) in \(\varvec{S}_l\)l = 1, \(\cdots \)C.

  6. 6.

    Calculate the belief measure \(Bel_l(\varvec{x}_i)\) for every data case, i = 1, ..., N. Select the final diagnosis, l = 1, \(\cdots \)C.

  7. 7.

    Calculate the diagnosis error as the number of cases misdiagnosed and cases, for which the diagnosis can not be selected.

  8. 8.

    If it is the first iteration, go to step 11. Otherwise go to step 9.

  9. 9.

    If the diagnosis error increased, go to the step 13. Otherwise go to step 10.

  10. 10.

    If yet achieved error is the lowest, go to step 11. Otherwise go to step 12.

  11. 11.

    Save actual focal element sets \(\varvec{S}_l\) as the best (\(\varvec{S}_l'\)), l = 1, \(\cdots \)C.

  12. 12.

    For each diagnosis where error occurs, find the focal element with the maximal \(m_l(s^{(r_l)})\) value in \(\hat{\varvec{S}_l}\) and move them to the created focal element set \(\varvec{S}_l\) and go to the step 5.

  13. 13.

    Terminate the algorithm. Final focal element sets are \(\varvec{S}_l'\).

In this algorithm the “most valuable” focal elements are gradually moved into the final rule set. This allows for prevalence of focal elements of the best differentiation force among diagnoses. The initial fuzzy focal element set \(\varvec{\hat{S}_l}\) (l = 1, ..., C) can contain all possible combinations of p symptoms, hence its size grows exponentially with the symptom number. On the other hand, the objective of the algorithm is to extract simple rules. Hence, the \(\varvec{\hat{S}_l}\) size can be limited. It is possible to specify the maximum symptom number in the focal element. Thus, focal element with too many symptom conditions (which have bad influence on their readability) are not created. This detail also affects the reduction of calculations when the high-dimensional data are proceed.

5 Experiments

Thyroid Disease [9] is commonly known medical data set and often used in the diagnostic rule extraction methods. This source provides 10 different data sets related to the thyroid gland functionality. In this paper the “new-thyroid” set is considered. The set contains 215 medical data cases divided into three thyroid diagnoses: normal activity (150 cases), hypothyroidism (30 cases) and hyperthyroidism (35 cases). Each data case is also the set of 5 medical test results. All of them are numerical values related to following hormones:

  1. 1.

    triiodothyronine-resin uptake test (percentage value),

  2. 2.

    total serum thyroxin (T4),

  3. 3.

    total serum thiiodothyronine (T3),

  4. 4.

    total serum thyrotropine (TSH),

  5. 5.

    maximal TSH serum difference after 200 mg of TSH-releasing hormone injection.

Trapezoidal membership functions designed for the Thyroid Disease data are presented in Fig. 1. The \(\eta _{BPA}\) and \(\eta _T\) thresholds influence the rule extraction performance. The couple that provides the highest efficiency is: \(\eta _{BPA}\) = 0.4, \(\eta _T\) = 0.25. The performance of the algorithm of rule selection is presented in Fig. 2. Diagnosis error and rule number in every iteration are depicted with points and bars, respectively. The former are connected by line for readability. Since data symptom number is low, size of the initial fuzzy focal element sets does not have be limited. Initial focal element sets (\(\varvec{\hat{S}}_l\), l = 1, ..., C) include 31 focal elements for each diagnosis. According to the Fig. 2, fuzzy focal element set obtained in the eighth iteration is chosen as the best set due to only two misdiagnosed cases. There were no further iteration, for which the diagnosis error has decreased. The efficiency of the fuzzy focal element set can be calculated in the following way:

$$\begin{aligned} E = \frac{N-\epsilon }{N}\cdot 100\% \end{aligned}$$
(13)
Fig. 1.
figure 1

Trapezoidal membership functions designed for the Thyroid Disease data set. Leftmost, middle and rightmost functions describe “low”, “normal” and “high” symptom value, respectively.

Fig. 2.
figure 2

Rule selection algorithm performance with the Thyroid Disease data set

Table 2. Fuzzy focal element sets obtained as a result of the rule selection algorithm. The \(\eta _{BPA}\) and \(\eta _T\) thresholds are set to 0.5 and 0.25, respectively.
Fig. 3.
figure 3

Diagnosis efficiencies obtained for all combination of \(\eta _{BPA}\) and \(\eta _T\) thresholds. The highest efficiency is indicated by \(\otimes \).

where \(\epsilon \) is the diagnosis error and N is the data set size. According to (13), the set of 23 fuzzy focal elements created in the eighth iteration provides 99.07% efficiency, which is better results than published in [6]. The rule set for the thyroid disease diagnosis extracted from the training data are presented in the Table 2. It can be noticed that this efficiency comes at a price of numerous focal elements. This issue is related to the well-known trade-off between high reliability and good interpretability of the extracted knowledge [3]. With the \(\eta _{BPA}\) and \(\eta _T\) threshold there is possibility to tune the extracted knowledge. Diagnosis efficiency dependence on the thresholds determination in the range 0.05 to 0.95 with 0.05 step is presented in Fig. 3. The highest obtained efficiency is indicated by mark (\(\otimes \)). With the \(\eta _{BPA}\) and \(\eta _T\) thresholds set to 0.05 and 0.1 it is possible to obtain the lowest possible number of focal elements equal six providing slightly worse efficiency equal 95.35%. A lower \(\eta _{BPA}\) threshold allows the simplest rules to obtain higher BPA values and the final efficiency is acquired for very few rules. This rule set is presented in Table 3. It is possible to translate the fuzzy focal elements to the heuristic rules, which represent extracted knowledge from the medical data. According to the Fig. 1 fuzzy membership functions \(\mu _j^{(l)}(x_j)\), which are used in fuzzy focal elements sets are also describing fuzzy sets related to the natural language statements like “low”, “normal” and “high test result”. Natural language statements occurring in Table 4 are presented by suitable fuzzy sets in Fig. 1.

Table 3. Fuzzy focal element sets obtained as a result of the rule selection algorithm. The \(\eta _{BPA}\) and \(\eta _T\) thresholds are set to 0.2 and 0.3, respectively.
Table 4. Heuristic rules for the thyroid gland functionality diagnosis according to the knowledge extracted from the Thyroid Disease data (new-thyroid set)

6 Conclusions

Experimental results show that the Dempster-Shafer theory along with the fuzzy set theory can be used to create a diagnostic rule base and to interpret heuristically data-driven medical knowledge. This framework successfully deals with medical data issues like imprecision and uncertainty. As a result of the rule selection algorithm the fuzzy focal element set can be chosen to provide sufficiently effective diagnosis. Extracted knowledge is comprehensive for a human expert thanks to the intuitive understanding of trapezoidal membership functions of fuzzy focal elements. Performance of the rule extraction algorithm can be tuned with thresholds, which determine the BPA calculation and final diagnosis determination. The search of the best thresholds may involve computational complexity but the rule extraction algorithm is always fast. Generally, the solution with best diagnosis efficiency may not be related to the lowest number of focal elements. Hence, the threshold determination is also vague. The aim is to obtain a good compromise between the rule base reliability and its readability. This paper presents results of the knowledge extraction with the training data. Further research will deal with the mentioned compromise as well as with the rule base generalization quality.