1 Introduction

Many researchers have identified important and challenging issues for clinical decision support [1–3]. In “Grand challenges for decision support”, Sittig et al. [1] set out ten critical problems for “designing, developing, presenting, implementing, evaluating, and maintaining all types of clinical decision support capabilities for clinicians, patients and consumers”. However, Sittig et al.’s list says little about data preprocessing. Sometimes, improved data quality is itself the goal of the analysis, usually to improve processes in a production database [4] and the design of decision support.

Two types of databases are available in the medical domain [5]. The first is data acquired by medical experts, collected for a specific research topic where data collection is triggered by the hypothesis of a clinical trial. The other is the very large dataset retrieved from hospital information systems. These data are stored in a database automatically, without any specific research purpose, and are often used for further analysis and for building clinical decision support systems. Such datasets are complex: the number of records is very large, each record has a large number of attributes, many values are missing, and the datasets are typically imbalanced with respect to their class label. In this paper we address the issue of missing values in clinical (cardiovascular) datasets.

Many real-life datasets are incomplete, and missing attribute values are an important issue in data mining; in medical data mining they are a particularly challenging one. In many clinical trials, the medical report pro-forma allows some attributes to be left blank, because they are inappropriate for some class of illness or because the person providing the information feels it is not appropriate to record the values for some attributes [6].

Typically, two types of missing data are distinguished [7]. Data is missing completely at random (MCAR) when the response indicator variables R are independent of the data variables X and the latent variables Z. The MCAR condition can be succinctly expressed by the relation \(\mathrm{{P}}(\mathrm{{R}} \mid \mathrm{{X}},\mathrm{{Z}},\upmu ) = \mathrm{{P}}(\mathrm{{R}} \mid \upmu )\). The second category is data missing at random (MAR). The MAR condition is frequently written as \(\mathrm{{P}}(\mathrm{{R}} = \mathrm{{r}} \mid \mathrm{{X}} = \mathrm{{x}}, \mathrm{{Z}} = \mathrm{{z}}, \upmu ) = \mathrm{{P}}(\mathrm{{R}} = \mathrm{{r}} \mid \mathrm{{X}}^{o} = \mathrm{{x}}^{o}, \upmu )\) for all \(\mathrm{{x}}\), \(\mathrm{{z}}\) and \(\upmu \), where \(\mathrm{{X}}^{o}\) denotes the observed components of X [8, 9].
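As a concrete illustration (ours, not from the paper), the following Python sketch generates MCAR and MAR missingness on synthetic data: under MCAR the response indicator is drawn independently of everything, while under MAR it depends only on a fully observed covariate. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
age = rng.normal(60, 10, n)   # fully observed covariate
bp = rng.normal(120, 15, n)   # variable that will receive missing values

# MCAR: the response indicator R is independent of the data.
r_mcar = rng.random(n) < 0.2              # each value missing with prob. 0.2
bp_mcar = np.where(r_mcar, np.nan, bp)

# MAR: missingness depends only on the observed variable (age),
# not on the unobserved value of bp itself.
p_miss = 1 / (1 + np.exp(-(age - 60) / 5))   # older patients missing more often
r_mar = rng.random(n) < p_miss
bp_mar = np.where(r_mar, np.nan, bp)

print(f"MCAR missing rate: {np.isnan(bp_mcar).mean():.2f}")
print(f"MAR  missing rate: {np.isnan(bp_mar).mean():.2f}")
```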

In general, methods to handle missing values are either sequential (e.g. listwise deletion, assigning the most common value, or the arithmetic mean for numeric attributes) or parallel, where rule induction algorithms are used to predict the missing attribute values [10]. There are reasons for which listwise deletion is considered to be a good method [7], but several works [6, 7, 11] have shown that applying it to the original data can corrupt the interpretation of the data and mislead the subsequent analysis through the introduction of bias.

While several techniques for missing value imputation have been employed by researchers, most are single imputation approaches [12]. The most traditional techniques are deleting case records, mean value imputation, maximum likelihood and other statistical methods [12]. In recent years, research has explored machine learning techniques as a method for imputing missing values in clinical and other incomplete datasets [13]. Machine learning algorithms such as the multilayer perceptron (MLP), self-organising maps (SOM), decision trees (DT) and k-nearest neighbours (KNN) have been used as missing value imputation methods in different domains [11, 14–21], and have been found to perform better than the traditional statistical methods [11, 22].

In this paper we examine the use of machine learning techniques as a missing value imputation method for real-life incomplete cardiovascular datasets: a classifier is used to predict the value of a missing field, and the predicted value is imputed to make the dataset complete. To compare performance, we have used four classifiers to predict the missing values: Decision Tree [10], KNN [32], SVM [35] and FURIA [23]. The imputed datasets are then classified using Decision Tree, KNN, FURIA and K-Means clustering, and the results are compared with the commonly used mean-mode imputation method.

2 Overview of FURIA

Fuzzy Unordered Rule Induction Algorithm (FURIA) is a fuzzy rule-based classification method that modifies and extends the state-of-the-art rule learner RIPPER. Fuzzy rules are obtained by replacing intervals with fuzzy intervals that have trapezoidal membership functions [23]:

$$\begin{aligned} I^{F}\left( \nu \right) \mathop {=}\limits ^{{\text{ df }}} \left\{ \begin{array}{cl} 1&{} {\phi ^{c,L}\le \nu \le \phi ^{c,U}} \\ {\frac{\nu -\phi ^{s,L}}{\phi ^{c,L}-\phi ^{s,L}}}&{} {\phi ^{s,L}\le \nu \le \phi ^{c,L}} \\ {\frac{\phi ^{s,U}-\nu }{\phi ^{s,U}-\phi ^{c,U}}}&{} {\phi ^{c,U}\le \nu \le \phi ^{s,U}} \\ 0&{} {\text{ else }} \\ \end{array} \right. \end{aligned}$$
(1)

where \(\phi ^{c,L}\) and \(\phi ^{c,U}\) are the lower and upper bounds of the core of the fuzzy set (where membership is 1), and \(\phi ^{s,L}\) and \(\phi ^{s,U}\) the bounds of its support. For an instance \(x = (x_{1}, \ldots , x_{n})\) the degree of fuzzy membership can be found using the formula [23]:

$$\begin{aligned} \mu _{r^{F}}\left( x \right) = \prod _{i=1,\ldots ,k} I_{i}^{F}\left( x_{i} \right) \end{aligned}$$
(2)
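To make Eqs. (1) and (2) concrete, here is a small Python sketch (our illustration, not FURIA's actual implementation) of a trapezoidal fuzzy interval and the product-form rule membership; the four parameters follow the \(\phi \) notation above.

```python
import numpy as np

def trapezoidal_membership(v, phi_sL, phi_cL, phi_cU, phi_sU):
    """Eq. (1): trapezoidal fuzzy interval I^F(v).

    phi_cL..phi_cU is the core (membership 1),
    phi_sL..phi_sU is the support (membership > 0).
    """
    if phi_cL <= v <= phi_cU:
        return 1.0
    if phi_sL <= v < phi_cL:
        return (v - phi_sL) / (phi_cL - phi_sL)
    if phi_cU < v <= phi_sU:
        return (phi_sU - v) / (phi_sU - phi_cU)
    return 0.0

def rule_membership(x, intervals):
    """Eq. (2): degree to which instance x satisfies a fuzzy rule,
    the product of the per-attribute memberships."""
    return float(np.prod([trapezoidal_membership(v, *iv)
                          for v, iv in zip(x, intervals)]))

# Example rule with fuzzy intervals on two (hypothetical) attributes.
intervals = [(100, 120, 140, 160),   # e.g. systolic blood pressure
             (50, 55, 65, 70)]       # e.g. age
print(rule_membership([130, 68], intervals))   # 1.0 * 0.4 = 0.4
```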

For the fuzzification of a single antecedent, only the relevant training data \(D_{T}^{i}\) is considered; the data is partitioned into two subsets, and rule purity is used to measure the quality of the fuzzification [23]:

$$\begin{aligned} D_{T}^{i} = \left\{ x = \left( x_{1}, \ldots , x_{k} \right) \in D_{T} \mid I_{j}^{F}\left( x_{j} \right) > 0 \ \text {for all} \ j \ne i \right\} \subseteq D_{T} \end{aligned}$$
(3)
$$\begin{aligned} \text {Pur} = \frac{p_{i}}{p_{i} + n_{i}} \end{aligned}$$
(4)

where

$$\begin{aligned} p_{i} \mathop {=}\limits ^{\text {def}} \sum _{x \in D_{T+}^{i}} \mu _{A_{i}}\left( x \right) \qquad n_{i} \mathop {=}\limits ^{\text {def}} \sum _{x \in D_{T-}^{i}} \mu _{A_{i}}\left( x \right) \end{aligned}$$

Once the fuzzy rules \(r_{1}^{(j)}, \ldots , r_{k}^{(j)}\) have been learned for the class \(\lambda _{j}\), the support of this class is defined by [23]:

$$\begin{aligned} s_j \left( x \right) \mathop =\limits ^{\text{ df }} \sum _{i=1\ldots k} {\mu _{r_i^{\left( j \right) } } \left( x \right) } \cdot CF\left( {r_i^{\left( j \right) } } \right) \end{aligned}$$
(5)

where the certainty factor of the rule is defined as

$$\begin{aligned} CF\left( {r_i^{\left( j \right) } } \right) =\frac{2\frac{\left| {D_T^{\left( j \right) } } \right| }{\left| {D_T } \right| }+\sum _{x\in D_T^{\left( j \right) } } {\mu _{r_i^{\left( j \right) } } \left( x \right) } }{2+\sum _{x\in D_T } {\mu _{r_i^{\left( j \right) } } \left( x \right) } } \end{aligned}$$
(6)
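The sketch below (ours, with hypothetical helper names) evaluates Eq. (6) from precomputed rule memberships and uses it for the class support of Eq. (5); a `rule` is assumed to carry a `membership` function and a stored `cf` value.

```python
def certainty_factor(memb_all, memb_class, n_total, n_class):
    """Eq. (6): certainty factor of a rule.

    memb_all   -- mu_r(x) for every training instance x in D_T
    memb_class -- mu_r(x) for the instances of the rule's class, D_T^(j)
    """
    return ((2 * n_class / n_total + sum(memb_class))
            / (2 + sum(memb_all)))

def class_support(x, rules):
    """Eq. (5): support of a class for instance x, summed over the
    class's rules; each rule weights its membership by its CF."""
    return sum(rule.membership(x) * rule.cf for rule in rules)
```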

Uses of the algorithm in data mining can be found in [23–25].

3 Decision Tree

The decision tree classifier is one of the most widely used supervised learning methods. A decision tree is expressed as a recursive partition of the instance space. It consists of a directed tree with a “root” node that has no incoming edges; all other nodes have exactly one incoming edge [10]. Decision tree models are commonly used in data mining to examine data and to induce a tree and its rules for making predictions [26].

Ross Quinlan introduced a decision tree algorithm, known as Iterative Dichotomiser 3 (ID3), in 1979. C4.5, the successor of ID3, is the most widely used decision tree algorithm [27]. A major advantage of decision trees is the class-focused visualization of data, which allows users to readily understand the overall structure of the data in terms of which attributes most affect the class (the root node is always the most significant attribute for the class). Typically the goal is to find the optimal decision tree by minimizing the generalization error [28]. The algorithms introduced by Quinlan [29, 30] have proved to be effective and popular methods for finding a decision tree that expresses the information contained implicitly in a dataset. WEKA [31] provides an implementation of the C4.5 algorithm called J48, which we used for all of our experiments.
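The experiments use WEKA's J48, so the following is only a rough stand-in: a minimal scikit-learn sketch (CART rather than C4.5, on a stock dataset) showing the same train-and-classify pattern.

```python
# Rough stand-in only: the paper uses WEKA's J48 (C4.5); scikit-learn's
# DecisionTreeClassifier implements CART, a closely related algorithm.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy",  # information gain, as in C4.5
                              min_samples_leaf=5)   # guard against overfitting
tree.fit(X_tr, y_tr)
print(f"test accuracy: {tree.score(X_te, y_te):.2f}")
```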

4 K-Nearest Neighbour Algorithm

The k-Nearest Neighbour algorithm (k-NN) is a method for classifying objects based on the closest training examples in the feature space, with closeness defined, for example, by the similarity measure below. k-NN is a type of instance-based learning [32], or lazy learning, where the function is only approximated locally and all computation is deferred until classification:

$$\begin{aligned} \mathrm {Similarity}\left( \mathbf {x}, \mathbf {y} \right) = -\sqrt{\sum _{i=1}^{n} f\left( x_{i}, y_{i} \right) } \end{aligned}$$
(7)

The k-nearest neighbour algorithm is amongst the simplest of all machine learning algorithms: an object is classified by a majority vote of its neighbours, and is assigned to the class most common amongst its k nearest neighbours (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its nearest neighbour.
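A minimal sketch (ours) of this vote using the similarity of Eq. (7), assuming \(f(x_{i}, y_{i}) = (x_{i} - y_{i})^{2}\) for numeric attributes; this is an assumption on our part, and it makes the measure a negated Euclidean distance.

```python
import numpy as np
from collections import Counter

def similarity(x, y):
    # Eq. (7), assuming f(x_i, y_i) = (x_i - y_i)^2 for numeric attributes,
    # i.e. negated Euclidean distance (larger = more similar).
    return -np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote of its k most similar training points."""
    sims = [similarity(xt, x) for xt in X_train]
    nearest = np.argsort(sims)[-k:]            # indices of k largest similarities
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_train = np.array(["low", "low", "high", "high"])
print(knn_predict(X_train, y_train, [5.5, 5.0], k=3))   # -> "high"
```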

5 K-Means Clustering

K-means, proposed by MacQueen in 1967, is one of the simplest unsupervised learning algorithms and has been used by many researchers to solve well-known clustering problems [10]. The procedure follows a simple and easy way to partition a given dataset into a certain number of clusters (assume \(k\) clusters). The algorithm first initializes the cluster centres randomly. It then calculates the distance (discussed in the previous section) between each object and each cluster centroid, associates each point with its nearest centre, and re-calculates the cluster centres. The process is repeated with the aim of minimizing an objective function known as the squared error function, given by:

$$\begin{aligned} J\left( v \right) = \sum _{i=1}^{c} \sum _{j=1}^{c_{i}} \left( \left\| x_{j} - v_{i} \right\| \right) ^{2} \end{aligned}$$
(8)

where \(\left\| x_{j} - v_{i} \right\| \) is the Euclidean distance between data point \(x_{j}\) and cluster centre \(v_{i}\), \(c_{i}\) is the number of data points in the \(i\)th cluster and \(c\) is the number of cluster centres.
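A minimal numpy sketch (ours) of Lloyd's iteration for this objective; initialization and stopping are simplified, and there is no guard against empty clusters.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd iteration: alternate nearest-centre assignment and
    centre re-computation, which monotonically decreases J in Eq. (8)."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)]  # random initial centres
    for _ in range(n_iter):
        # Euclidean distance of every point to every centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                  # nearest-centre assignment
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):          # converged
            break
        centres = new_centres
    return labels, centres

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centres = kmeans(X, k=2)
print(centres.round(1))
```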

6 Cardiovascular Data

We have used two datasets, from the Hull and Dundee clinical sites. The Hull data includes 98 attributes and 498 cases of cardiovascular patients; the Dundee data includes 57 attributes and 341 cases. After combining the data from the two sites, 26 matched attributes are left.

Missing values: After combining the data and removing redundant attributes, we found that 18 of the 26 attributes have missing values at a frequency of 1 to 30 %, and 613 of the 832 records have 4 to 56 % of their attribute values missing.

From these two datasets we prepared a combined dataset of 26 attributes and 823 records. Of these, 605 records have missing values and 218 have none. Among all the records, 120 patients are alive and 703 are dead. For this experiment, following the clinical risk prediction model (CM1) [33], patients with status “Alive” are considered “Low Risk” and patients with status “Dead” are considered “High Risk”.

7 Mean and Mode Imputation

This is one of the most frequently used methods. It consists of replacing the unknown value for a given attribute by the mean (\(\bar{x}\)) of all known values of that attribute when it is quantitative, or by the mode when it is qualitative [21].

$$\begin{aligned} \bar{x} = \frac{1}{n} \sum _{i=1}^{n} x_{i} \end{aligned}$$
(9)

It replaces every missing entry for an attribute with the single value \(\bar{x}\), the mean of that attribute's known values.
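A short pandas sketch (ours; the toy column names are hypothetical) of this mean/mode rule applied per attribute:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({                      # toy data; column names are ours
    "age":    [63, 70, np.nan, 55, 61],             # quantitative attribute
    "smoker": ["yes", np.nan, "no", "no", np.nan],  # qualitative attribute
})

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())     # Eq. (9): mean
    else:
        df[col] = df[col].fillna(df[col].mode()[0])  # mode for qualitative
print(df)
```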

8 Proposed Missing Value Imputation Process

The original dataset is first partitioned into two groups: records with missing values in their attributes form one group (the incomplete dataset), and records without any missing values form the other (the complete dataset). A classifier is trained on the complete dataset, and the incomplete data is then given to the trained model to predict the missing attribute values. The process is repeated for every attribute that has missing values. Finally, the training dataset and the datasets with imputed values are combined to form the finalised data, which is then fed to the selected classifier for classification (as shown in Fig. 1).
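A minimal sketch of this process under simplifying assumptions of ours: attributes are numerically encoded, the imputation model is scikit-learn's DecisionTreeClassifier (the paper also uses KNN, SVM and FURIA), and remaining gaps in the predictor columns of incomplete records are temporarily filled with column means, a detail the paper does not specify.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def impute_with_classifier(df, target_col, make_model=DecisionTreeClassifier):
    """Train on the complete records, then predict the missing values of
    target_col for the incomplete records. Assumes all attributes are
    numerically encoded and target_col is categorical (use a regressor
    for continuous attributes)."""
    df = df.copy()
    predictors = [c for c in df.columns if c != target_col]
    complete = df.dropna()                    # records with no missing values
    missing = df[target_col].isna()

    model = make_model()
    model.fit(complete[predictors], complete[target_col])

    # Our simplification: other gaps in the predictor columns of the
    # incomplete records are temporarily filled with column means.
    X_missing = df.loc[missing, predictors].fillna(df[predictors].mean())
    df.loc[missing, target_col] = model.predict(X_missing)
    return df

# Repeat over every attribute that has missing values, then classify
# the finalised dataset as in Fig. 1:
# for col in df.columns[df.isna().any()]:
#     df = impute_with_classifier(df, col)
```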

Fig. 1 Missing value imputation process

9 Results

We have experimented with a number of machine learning algorithms as missing value imputation mechanisms, such as FURIA, decision tree [34] and SVM [35]. The performance is compared with the most commonly used statistical imputation method, mean-mode. The results are also compared with previously published results on the same experimental dataset using mean-mode imputation with K-Mix clustering [36].

Table 1 Different missing imputation methods with K-means clustering
Table 2 Comparison results with K-Mix clustering

Table 1 shows that, for K-means clustering, the decision tree imputation method gives an accuracy of 64 % (slightly better than the other methods) but a sensitivity of 30 %, almost as poor as mean/mode imputation. SVM and mean/mode imputation show very similar performance, with accuracies of 62–63 % and sensitivities of 29–32 %. The fuzzy unordered rule induction algorithm as a missing value imputation method, on the other hand, shows a sensitivity of 43 % with an accuracy of 58 %. Table 2 compares the previously published results of the K-Mix clustering algorithm [37] with mean/mode imputation against simple K-means clustering with FURIA missing value imputation. K-means with FURIA imputation has higher sensitivity (43 %) than K-Mix with the conventional mean/mode imputation method (25 %).

The datasets prepared by the different imputation methods are also classified using the well-known classifiers decision tree (J48), KNN and FURIA. The classification outcomes are presented in Tables 3, 4 and 5. Table 6 presents the highest sensitivity found for each of the datasets prepared by the different imputation methods; missing value imputation using FURIA shows a sensitivity of 43.3 %, the highest among all the machine learning and statistical methods explored in this paper.

Table 3 Different missing imputation methods with J48 classification
Table 4 Different missing imputation methods with K-NN classification

For clinical data analysis it is important to evaluate how well a classifier predicts the “High Risk” patients. As indicated earlier, the dataset is imbalanced with respect to patient status: only 120 of the 832 records are “High Risk” (14.3 % of the total). A classifier may achieve very high accuracy by correctly classifying the “Low Risk” patients, but is of limited use if it does not correctly classify the “High Risk” patients. For our analysis we therefore gave more importance to sensitivity and specificity than to accuracy when comparing classification outcomes.
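For reference, a short sketch (ours) of how sensitivity and specificity are computed when “High Risk” is treated as the positive class:

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred, positive="High Risk"):
    """Sensitivity = TP / (TP + FN); Specificity = TN / (TN + FP),
    with the 'High Risk' label treated as the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pos, neg = (y_true == positive), (y_true != positive)
    tp = np.sum(pos & (y_pred == positive))
    tn = np.sum(neg & (y_pred != positive))
    return tp / pos.sum(), tn / neg.sum()

y_true = ["High Risk", "Low Risk", "High Risk", "Low Risk", "Low Risk"]
y_pred = ["High Risk", "Low Risk", "Low Risk",  "Low Risk", "High Risk"]
print(sensitivity_specificity(y_true, y_pred))   # (0.5, 0.666...)
```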

Table 5 Different missing imputation methods with fuzzy unordered rule induction algorithm (FURIA) classification

If we analyse the ROC [38] space for all the imputation methods, classified with the three classifiers mentioned earlier and one clustering algorithm, as plotted in Fig. 2, we find that most of the machine learning methods lie above the random line and in most cases perform better than statistical mean/mode imputation.

Table 6 Highest sensitivity value found with each imputation method
Fig. 2 The ROC space and plots of the different imputation methods classified with J48, FURIA, KNN and K-Means

If we evaluate missing value imputation based on sensitivity, FURIA imputation outperformed all the other machine learning and traditional mean/mode approaches to missing value imputation examined in this work.

10 The Complexity of the Proposed Method

The complexity of the proposed method is tied to the complexity of the classifier used for the missing value imputation. If FURIA is used, the fuzzy unordered rule induction algorithm can be analysed by considering the complexity of the rule fuzzification procedure, rule stretching and rule re-evaluation. For \(\left| {D_T } \right| \) training examples and \(n\) attributes, the complexity of the fuzzification procedure is \(O(\left| {D_T } \right| n^{2})\) [23]; with \({\vert }RS{\vert }\) rules and \(\left| {D_T } \right| \) training examples, the complexity of rule stretching is \(O(\left| {D_T } \right| n^{2})\) [23]; and for a rule \(r\) with antecedent set \(A(r)\), the complexity of rule re-evaluation is \(O({\vert }A(r){\vert })\). For the experimental data of 823 records with 23 attributes, it took on average 0.69 s to build the model for each attribute with missing values.

11 Conclusion

Missing attribute values are common in real-life datasets and cause many problems in pattern recognition and classification. Researchers are working towards suitable missing value imputation solutions that show adequate improvement in classification performance. Medical data are usually incomplete: on many medical report forms some attributes can be left blank, because they are inappropriate for some class of illness or because the person providing the information feels it is not appropriate to record the values. In this work we examined the performance of machine learning techniques for missing value imputation and compared the results with traditional mean/mode imputation. Experimental results show that all the machine learning methods we explored outperformed the statistical method (mean/mode) on sensitivity and, in some cases, on accuracy.

The process of missing value imputation with our proposed method can be computationally expensive when large numbers of attributes have missing values. However, data cleaning is part of the data pre-processing task of data mining, which is neither a real-time task nor a continuous process; missing value imputation is a one-time task. With this extra effort we can obtain good-quality data for better classification and decision support.

We can conclude that machine learning techniques may be the best approach to imputing missing values for better classification outcome.