1 Introduction and Literature Survey

Diseases that are detected early and accurately with supervised classification algorithms are easier to cure than diseases found at a critical stage. Classification techniques are widely used in tools for automated medical diagnosis. There has also been a significant increase in the use of mobile devices to track the body conditions of humans. With the aid of automated diagnosis methods for liver diseases, for example methods based on the SVM classifier, it is possible to detect the disease at an early stage, when it is easier to cure.

Lung-Cheng reported that, for the CDC Chronic Fatigue Syndrome dataset, the Naive Bayes classifier performs better than SVM and C4.5 [3]. Harper [2] stated that there is not necessarily a single best classification method; the best performing algorithm depends on the features to be evaluated. Sorich [1] stated that, for chemical datasets, the SVM classifier produces the best predictive results. In this paper we demonstrate the efficiency of four data mining algorithms: the Naive Bayes, KNN, Support Vector Machine and ANN classifiers [5].

2 Classification Algorithms

Supervised learning techniques are popular for predicting clinical outcomes. A classifier is a supervised learning approach that builds a classification model from a given data set and is used to solve classification problems. First the training data set is used to build the classification model; the model is then applied to the test data set, which contains records with unknown labels. Below we introduce the algorithms that we evaluate in this paper.

  1. Naive Bayes Classifier

    The Naive Bayes classifier is a well-known statistical learning algorithm. The Naive Bayes model is a greatly simplified Bayesian probability model [13] that applies probability theory to classification. Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods [14]. Naive Bayes classifiers are a group of classification algorithms based on Bayes' theorem [9]. It is not a single algorithm but a family of algorithms that share a common assumption: every pair of features is independent of each other given the class. The Naive Bayes classifier therefore rests on a strong assumption of independence [13]. It is very simple and shows high precision and speed when used on large databases.

    This assumption is called class-conditional independence. Using statistical tests such as the Chi-squared test and the mutual information test, we can find the conditional independence relationships among the features and use these relationships as constraints to construct a Bayesian network.
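The class-conditional independence assumption can be illustrated with a minimal sketch. It assumes Gaussian likelihoods for continuous features, which the text does not specify; the function names and the toy data are hypothetical and are not taken from the experiments in this paper.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps each variance strictly positive (assumption of this sketch).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the highest log posterior.

    Independence given the class lets us simply add the per-feature log likelihoods.
    """
    def log_posterior(prior, means, variances):
        total = math.log(prior)
        for value, m, var in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (value - m) ** 2 / (2 * var)
        return total
    return max(model, key=lambda label: log_posterior(*model[label]))

# Toy data, made up for illustration: two features, label 1 = patient, 0 = non-patient.
X_train = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y_train = [0, 0, 0, 1, 1, 1]
nb_model = fit_gaussian_nb(X_train, y_train)
```

Calling `predict_gaussian_nb(nb_model, [1.1, 2.0])` assigns the point to the class whose per-feature Gaussians explain it best.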

  2. K-Nearest Neighbors (KNN) Algorithm

    K-Nearest Neighbors (KNN) is one of the simplest supervised machine learning algorithms. It is based on feature similarity and is mainly applied to classification problems: it classifies a data point according to the classes of its neighbours. KNN stores all available cases and classifies new cases by their similarity to the existing ones. The KNN algorithm is commonly used because it is easy to interpret, effective in prediction and fast to compute. In the KNN algorithm the value of “k” is the number of nearest neighbors included in the majority voting process. Choosing the correct value of “k” is called parameter tuning and is necessary for greater accuracy. KNN works well when the data are labeled and noise free and the data set is small.

    The algorithm uses the Euclidean distance to find the nearest neighbors of the unknown data point among all the points in the data set [12]. The most common class among these neighbors is then assigned to the new sample.
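The two steps just described, Euclidean distance followed by a majority vote among the k nearest points, can be sketched as follows. The function names and the toy data are illustrative assumptions, not the implementation used for the experiments.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    ranked = sorted(zip(X_train, y_train), key=lambda pair: euclidean(pair[0], x))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data, made up for illustration.
X_knn = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]]
y_knn = [0, 0, 0, 1, 1, 1]
label = knn_predict(X_knn, y_knn, [1.1, 1.0], k=3)
```

With two classes, an odd value of k avoids tied votes; in practice k would be tuned, for example by cross validation.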

  3. Support Vector Machine (SVM)

    The Support Vector Machine is a supervised machine learning approach that is used mainly for classification. SVM sorts the records of a data set into one of two groups. In this process, every data item is represented as a point in n-dimensional space, where n is the number of features in the data set and the value of each feature is the value of one coordinate. Classification is then performed by finding the hyperplane that best separates the two classes [9]. The support vectors are the data points that lie closest to the separating hyperplane. The primary objective is to identify the hyperplane that best distinguishes the two groups. Generally speaking, the right hyperplane is the one that maximizes the distance between the hyperplane and the nearest points of each class. This distance is called the margin. The SVM is therefore the boundary that best separates the two groups. The benefits of using SVM are that it supports high-dimensional input spaces, sparse document vectors and parameter regularization.
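For reference, the maximum-margin idea is usually written as the following optimization problem. This is the standard hard-margin formulation for linearly separable data; the symbols (weight vector w, offset b, training points x_i with labels y_i in {−1, +1}, and m training points) are notation introduced here and do not appear in the original text.

$$ \min_{w,\,b}\; \frac{1}{2}\lVert w \rVert^{2} \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1,\quad i = 1,\dots,m $$

Under these constraints the nearest points of each class lie at distance 1/‖w‖ from the hyperplane w · x + b = 0, so minimizing ‖w‖ maximizes the margin. The points for which the constraint holds with equality are the support vectors.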

  4. Artificial Neural Networks (ANN)

    Artificial neural networks are biologically inspired, highly sophisticated analytical techniques that are capable of modeling extremely complex nonlinear functions [6]. They are modeled on the learning processes of the cognitive system and the neurological functions of the brain, and are able to predict new observations after a phase of so-called learning from existing data. In this paper we use a common ANN architecture called the multilayer perceptron (MLP), trained with back propagation. Multilayer perceptrons are sometimes colloquially referred to as “neural networks”, especially when they have a single hidden layer. The back propagation algorithm is a learning rule for multilayer neural networks, credited to Rumelhart and McClelland.
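A minimal sketch of back propagation for a multilayer perceptron with one hidden layer is given below. The sigmoid activations, the squared-error loss, the layer sizes, the learning rate and all parameter values are assumptions made for illustration; the text does not state the configuration used in the experiments.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, w2, b2):
    """One hidden layer of sigmoid units followed by a single sigmoid output unit."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)
    return hidden, output

def loss(x, target, W1, b1, w2, b2):
    """Squared error 0.5 * (output - target)**2 for one example."""
    _, output = forward(x, W1, b1, w2, b2)
    return 0.5 * (output - target) ** 2

def backprop(x, target, W1, b1, w2, b2):
    """Gradients of the squared error with respect to every parameter (chain rule)."""
    hidden, output = forward(x, W1, b1, w2, b2)
    # Error signal at the output unit: dLoss/dz_out.
    delta_out = (output - target) * output * (1.0 - output)
    grad_w2 = [delta_out * h for h in hidden]
    grad_b2 = delta_out
    # Propagate the error signal back through the output weights to the hidden units.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w2, hidden)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    grad_b1 = list(delta_hidden)
    return grad_W1, grad_b1, grad_w2, grad_b2

def sgd_step(x, target, W1, b1, w2, b2, lr=0.5):
    """One gradient-descent update on a single example; returns new parameters."""
    gW1, gb1, gw2, gb2 = backprop(x, target, W1, b1, w2, b2)
    new_W1 = [[w - lr * g for w, g in zip(row, grow)] for row, grow in zip(W1, gW1)]
    new_b1 = [b - lr * g for b, g in zip(b1, gb1)]
    new_w2 = [w - lr * g for w, g in zip(w2, gw2)]
    new_b2 = b2 - lr * gb2
    return new_W1, new_b1, new_w2, new_b2

# Hypothetical starting parameters: 2 inputs, 2 hidden units, 1 output.
W1 = [[0.1, -0.2], [0.4, 0.3]]
b1 = [0.0, 0.1]
w2 = [0.5, -0.6]
b2 = 0.05
x_example, target = [1.0, 2.0], 1.0
updated = sgd_step(x_example, target, W1, b1, w2, b2)
```

Training repeats `sgd_step` over all training examples for many passes; each step moves the parameters against the gradient, so the error on that example decreases for a sufficiently small learning rate.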

3 Methodology

3.1 Identify Data Set

We used the Indian Liver Patient Dataset (ILPD) from the UCI Machine Learning Repository to carry out the experiments described in this paper. The ILPD data set consists of 416 records of patients with a liver disorder and 167 records of patients with no illness. The data set comprises 11 attributes and is considered one of the most comprehensive data sets from Andhra Pradesh.

3.2 Understand and Clean the Data

Data acquisition and pre-processing are important steps in the data mining process. In this paper, we examine the distribution of each variable and, if the distribution of a variable is heavily skewed, apply a log transformation to that variable. The ILPD data contain 583 instances with 11 features. The design matrix of the data set is given in Table 1:

Table 1 Dataset description

Deleting the records with missing values did not significantly change the distribution of the other variables. For example, the histogram in Fig. 1 shows the distribution before deletion, and the histogram in Fig. 2 shows the distribution after the records have been deleted. Comparing the two plots, we find no significant change in the distributions of the age attribute and the Total Bilirubin attribute before and after the records with missing values are deleted.

Fig. 1 Histogram of the distribution of the age variable before deleting the missing data from the data set

Fig. 2 Histogram of the distribution of the age variable after deleting the missing data from the data set
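The two cleaning steps of this section, deleting records with missing values and log-transforming a skewed variable, can be sketched as below. The column names and rows are made up for illustration and are not the actual ILPD headers or values; using log(1 + x) rather than log(x) is also an assumption of this sketch, chosen so that zero values remain valid.

```python
import math

def drop_missing(records):
    """Keep only records in which no attribute is None (our stand-in for a missing value)."""
    return [r for r in records if all(v is not None for v in r.values())]

def log_transform(records, column):
    """Return copies of the records with log(1 + x) applied to one column."""
    return [{**r, column: math.log1p(r[column])} for r in records]

# Hypothetical rows; the second one has a missing value and is removed.
rows = [
    {"age": 65, "total_bilirubin": 0.7, "ag_ratio": 0.9},
    {"age": 45, "total_bilirubin": 10.9, "ag_ratio": None},
    {"age": 30, "total_bilirubin": 3.5, "ag_ratio": 1.1},
]
clean = drop_missing(rows)
transformed = log_transform(clean, "total_bilirubin")
```

Both functions return new lists and leave the input untouched, which makes it easy to compare histograms before and after each step.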

3.3 Select Features and Model Building

The accuracy, or any other performance measure, of a classifier depends on the data and on how well the important features are identified. To improve the performance of a classifier we need better features: a subset of features that retains most of the relevant information, keeping the features that are significant for the target class label and removing highly correlated ones. The procedure is as follows:

  1. Select a significant feature subset.
  2. Create the training data set and the test data set based on the feature subset.
  3. Train the classifier, built with a data mining classification algorithm, on the training data set.
  4. Find the accuracy or other performance metrics on the validation data set.
  5. Repeat for all feature subsets and select the feature subset that gives the best predictive accuracy of the classifier.
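The search over feature subsets described above can be sketched as an exhaustive loop. The function name, the `evaluate` callback and the toy scoring function are hypothetical; in a real run `evaluate` would train the chosen classifier on the given features and return its validation accuracy.

```python
from itertools import combinations

def best_feature_subset(feature_names, evaluate, max_size=None):
    """Score every non-empty feature subset up to max_size and keep the best one.

    `evaluate` is a caller-supplied function that takes a tuple of feature names
    and returns a validation score (higher is better).
    """
    if max_size is None:
        max_size = len(feature_names)
    best_subset, best_score = None, float("-inf")
    for size in range(1, max_size + 1):
        for subset in combinations(feature_names, size):
            score = evaluate(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score

# Toy scoring function for illustration only: features "a" and "b" are useful,
# and every extra feature costs a small penalty.
def toy_score(subset):
    return len(set(subset) & {"a", "b"}) - 0.1 * len(subset)

subset, score = best_feature_subset(["a", "b", "c"], toy_score)
```

An exhaustive search evaluates 2^n − 1 subsets for n features, which is feasible for a data set with about ten attributes but not for much wider ones; `max_size` caps the subset size in that case.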

3.4 Performance Metrics

We evaluate the performance of the different data mining algorithms on the medical data set using the following metrics:

  1. True Positive Rate (TPR): the fraction of positive examples predicted correctly by the classifier

$$ TPR = \frac{True\;Positive}{True\;Positive + False\;Negative} $$
(1)

where

True Positive (TP) = the number of positive examples predicted correctly by the classification model

False Negative (FN) = the number of positive examples wrongly predicted as negative by the classification model

(True Positive + False Negative) = the total number of positive examples in the test set

  2. True Negative Rate (TNR): the fraction of negative examples predicted correctly by the classifier

$$ TNR = \frac{True\;Negative}{True\;Negative + False\;Positive} $$
(2)

where

True Negative (TN) = the number of negative examples predicted correctly by the classification model

False Positive (FP) = the number of negative examples wrongly predicted as positive by the classification model

(True Negative + False Positive) = the total number of negative examples in the test set

  3. Accuracy: the ratio of the number of correct predictions to the total number of predictions

$$ {\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}} $$
(3)
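The three metrics can be computed directly from the confusion-matrix counts, as in the following sketch. The function names and the example labels are made up for illustration; the sketch assumes that both classes occur in the true labels, otherwise a denominator would be zero.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false negatives, true negatives and false positives."""
    tp = fn = tn = fp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp

def classification_metrics(y_true, y_pred, positive=1):
    """Return TPR (Eq. 1), TNR (Eq. 2) and accuracy (Eq. 3) as a dictionary."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred, positive)
    return {
        "tpr": tp / (tp + fn),
        "tnr": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Made-up labels: 4 positive and 6 negative examples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
```

In this example TP = 3, FN = 1, TN = 4 and FP = 2, so TPR = 3/4, TNR = 4/6 and accuracy = 7/10.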

K-fold Cross Validation:

K-fold cross validation deals with the bias and variance problems associated with a single split into training and testing data sets. As a rule of thumb, to avoid the variance problem the number of training examples should be at least 10 times the number of variables. In k-fold cross validation the complete data set is divided into k subsets of approximately equal size. The classification model is trained and tested k times, each time using one subset for testing and the remaining subsets for training. In this analysis the 10-fold cross validation approach is used to estimate the efficiency of the classifiers: the whole data set is split into 10 mutually exclusive subsets, and we then average the results obtained over the 10 folds (Fig. 3).

Fig. 3 Performance Metrics on Different Algorithms
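The 10-fold cross validation procedure can be sketched as follows. The function names and the `train_and_score` callback are hypothetical; the callback stands for training a classifier on the training indices and returning its score on the test indices. The sketch splits the indices in order, whereas in practice the records would normally be shuffled first.

```python
def k_fold_indices(n_samples, k=10):
    """Split indices 0..n_samples-1 into k mutually exclusive folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(n_samples, train_and_score, k=10):
    """Average the score over k rounds; each round tests on one fold, trains on the rest."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k

# 583 is the number of instances reported for the ILPD data above.
folds = k_fold_indices(583, k=10)
```

Because 583 is not divisible by 10, three folds contain 59 records and seven contain 58; every record is used for testing exactly once.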

Classification Results and Future Scope:

In this paper, we evaluated the performance of different data mining algorithms on clinical data using the metrics Accuracy, True Positive Rate and True Negative Rate. The Support Vector Machine and Artificial Neural Network classifiers gave the better results (Table 2). In future work, we plan to apply ensemble and deep learning algorithms to medical data.

Table 2 Classification of various algorithms on the data sets