Abstract
Advances in computing have led to the creation of enormous volumes of data, particularly in health care systems. The availability of such data has increased the need for data mining techniques that can turn it into useful knowledge. As data in the biomedical and health care communities grows, accurate analysis of medical data supports early detection of illness and better patient care. Data mining is one of the major approaches for developing sophisticated classification algorithms, although some have criticized it for not meeting all the requirements of statistics [5]. Disease classification is one of the main applications of data mining, and many important attempts have been made in recent years to improve the accuracy of disease diagnosis through data mining. We used four prominent data mining algorithms, namely the Naive Bayes classifier, the K-Nearest Neighbors (KNN) classifier, Artificial Neural Networks (ANN) and the Support Vector Machine (SVM), to develop predictive models on the ILPD (Indian Liver Patient Dataset) from the UCI Machine Learning Repository. For performance comparison, we used 10-fold cross validation to estimate the performance of the predictive models. We find that the support vector machine delivers the best results, with a classification accuracy of 74.82 percent, and that Naive Bayes performs the worst, with an accuracy of 56.55 percent. The performance metrics of the classifiers on this medical dataset are analyzed in the sections below.
1 Introduction and Literature Survey
Diseases that are detected early and accurately with supervised classification algorithms are easier to cure than those found at a critical stage. Classification techniques are widely used in tools for automated medical diagnosis, and there has been a significant increase in mobile devices that track human body conditions. With automated diagnosis methods for liver diseases, such as an SVM classifier, the disease can be detected at an early stage, when it is easier to cure.
Huang et al. [3] reported that, for the CDC chronic fatigue syndrome dataset, the Naive Bayes classifier performs better than SVM and C4.5. Harper [2] stated that there is no single best classification method and that the best performing algorithm depends on the features of the data being evaluated. Sorich et al. [1] stated that, for chemical datasets, the SVM classifier produces the best predictive results. In this paper we demonstrate the efficiency of four data mining algorithms: Naive Bayes, KNN, support vector machines and ANN [5].
2 Classification Algorithms
Supervised learning techniques are popular for predicting clinical outcomes. A classifier is a supervised learning approach that builds a classification model from a given data set. The training data set is first used to build the model, which is then applied to the test data set containing records with unknown labels. Below we introduce the algorithms evaluated in this paper.
A. Naive Bayes Classifier
The Naive Bayes classifier is a well-known statistical learning algorithm. The Naive Bayes model is a heavily simplified Bayesian probability model [13] that applies probability theory to classification. Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods [14]. Naive Bayes classifiers are a group of classification algorithms based on Bayes' theorem [9]. They are not a single algorithm but a family of algorithms that share a common principle: every pair of features is assumed to be independent given the class. The Naive Bayes classifier therefore rests on a strong independence assumption [13]. It is very simple and shows high accuracy and speed when applied to large databases.
This assumption is called class-conditional independence. Using statistical tests such as the chi-squared test and mutual information, we can identify conditional independence relationships among the features and use these relationships as constraints to build a Bayesian network.
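The class-conditional independence assumption can be illustrated with a minimal sketch. The Gaussian likelihood for each feature, the function names `fit_gaussian_nb` and `predict_gaussian_nb`, and the toy data are assumptions made for illustration; they are not taken from the experiments in this paper.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the prior and the per-feature mean/variance of each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        # Independence given the class: per-feature log likelihoods simply add up.
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy, made-up data with two numeric features and two classes.
X = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1], [3.0, 4.0], [3.2, 3.9], [2.9, 4.2]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
```

Because the features are treated as independent given the class, the joint likelihood factorizes into a product over features, which is why the sketch only needs one mean and one variance per feature and class.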
B. K-Nearest Neighbors (KNN) Algorithm
K-Nearest Neighbors (KNN) is one of the simplest supervised machine learning algorithms. It is based on feature similarity and is mainly applied to classification problems: it classifies a data point according to the classes of its neighbours. KNN stores all available cases and classifies new cases by a similarity measure to the stored ones. It is commonly used because it is easy to interpret, predicts effectively and needs little calculation time. The parameter “k” is the number of nearest neighbors included in the majority vote. Choosing a suitable value of “k” is called parameter tuning and is necessary for good accuracy. KNN works well when the data is labeled and noise free and the data set is small.
The algorithm uses the Euclidean distance to find the nearest neighbors of the unknown data point among all points in the data set [12]. The most common class among these neighbors is then assigned to the new sample.
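The neighbor search and majority vote described above can be sketched as follows. The function names and the toy points are hypothetical and serve only to illustrate the procedure.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy, made-up training set with two well separated classes.
train_X = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]]
train_y = ["a", "a", "a", "b", "b", "b"]
```

With an odd `k` and two classes the vote cannot tie, which is one practical reason to tune `k` over odd values in a binary problem such as this one.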
C. Support Vector Machine (SVM)
Support Vector Machine is a supervised machine learning approach used mainly for classification. SVM assigns each item of a data set to one of the groups. Every item is represented as a point in n-dimensional space, where n is the number of features in the data set and the value of each feature is the value of the corresponding coordinate. Classification is then performed by finding the hyperplane that separates the two classes [9]. Support vectors are the observations that lie closest to the separating hyperplane. The primary objective is to identify the hyperplane that best separates the two groups. The right hyperplane is the one that maximizes the distance to the nearest point of either class. This distance is called the margin. SVM is therefore the boundary that best separates the two classes. The benefits of using SVM are that it handles high-dimensional input spaces and sparse document vectors and supports parameter regularization.
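The maximum-margin criterion can be made concrete with a small sketch. This is not an SVM solver: it only computes the margin of given candidate hyperplanes and picks the widest one. The labels in {-1, +1}, the function names and the candidate list are assumptions made for illustration.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin(w, b, X, y):
    """Smallest label-signed distance; negative if any point is misclassified."""
    return min(label * signed_distance(w, b, x) for x, label in zip(X, y))

def widest_margin(candidates, X, y):
    """Pick the candidate (w, b) whose margin is largest."""
    return max(candidates, key=lambda wb: margin(wb[0], wb[1], X, y))

# Toy, made-up data: class -1 on the left, class +1 on the right.
X = [[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]]
y = [-1, -1, 1, 1]
candidates = [([1.0, 0.0], -0.5), ([1.0, 0.0], -1.5), ([1.0, 0.0], -2.5)]
best = widest_margin(candidates, X, y)
```

All three candidates separate the toy classes, but only the middle one sits halfway between them, so it has the largest margin. A real SVM searches over all hyperplanes by solving an optimization problem instead of comparing a fixed list.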
D. Artificial Neural Networks (ANN)
Artificial neural networks are biologically inspired, highly refined analytical techniques capable of modeling extremely complex nonlinear functions [6]. They are modeled on the learning processes of the cognitive system and the neurological functions of the brain, and can predict new observations after a phase of learning from existing data. In this paper we used a common ANN architecture, the multilayer perceptron (MLP), trained with back propagation. Multilayer perceptrons are sometimes colloquially referred to as “neural networks”, especially when they have a single hidden layer. Back propagation is a learning rule for multilayer neural networks credited to Rumelhart and McClelland.
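A minimal sketch of back propagation for an MLP with one hidden layer is given below. The sigmoid activations, the squared-error loss, the layer sizes, the learning rate and all names are assumptions chosen for illustration; the paper does not state the configuration that was used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Forward pass: returns the hidden activations and the scalar output."""
    W1, b1, W2, b2 = params
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    out = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, out

def loss(params, x, target):
    """Squared error of the network output for one example."""
    _, out = forward(params, x)
    return 0.5 * (out - target) ** 2

def backprop(params, x, target):
    """Gradients of the loss with respect to every weight and bias."""
    W1, b1, W2, b2 = params
    h, out = forward(params, x)
    # Error signal at the output, propagated back through the sigmoid.
    delta_out = (out - target) * out * (1.0 - out)
    grad_W2 = [delta_out * hi for hi in h]
    # Error signal at each hidden unit via the chain rule.
    delta_h = [delta_out * w * hi * (1.0 - hi) for w, hi in zip(W2, h)]
    grad_W1 = [[d * xi for xi in x] for d in delta_h]
    return grad_W1, delta_h, grad_W2, delta_out

def sgd_step(params, grads, lr=0.5):
    """One gradient-descent update; returns new parameters."""
    W1, b1, W2, b2 = params
    gW1, gb1, gW2, gb2 = grads
    return ([[w - lr * g for w, g in zip(row, grow)] for row, grow in zip(W1, gW1)],
            [b - lr * g for b, g in zip(b1, gb1)],
            [w - lr * g for w, g in zip(W2, gW2)],
            b2 - lr * gb2)

# Made-up initial weights for a network with 2 inputs and 2 hidden units.
params = ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1], [0.5, -0.6], 0.05)
x, target = [1.0, 2.0], 1.0
updated = sgd_step(params, backprop(params, x, target))
```

Repeating `sgd_step` over many examples is the learning phase; here a single update already moves the output toward the target.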
3 Methodology
3.1 Identify Data Set
We used the Indian Liver Patient Dataset (ILPD) from the UCI Machine Learning Repository for the experiments described in this paper. The ILPD data set consists of 416 records of patients with a liver disorder and 167 records of patients without the disease. It comprises 11 attributes and is considered one of the most comprehensive data sets from Andhra Pradesh.
3.2 Understand and Clean the Data
Data acquisition and pre-processing are important steps in the data mining process. In this paper, we examine the distribution of each variable and apply a log transformation to variables whose distribution is not well behaved (for example, strongly skewed). The ILPD data contains 583 instances with 11 features. The design matrix of the data set is given in Table 1.
Deleting the records with missing values did not significantly change the distribution of the other variables. For example, Fig. 1 shows the histograms before deletion, and Fig. 2 shows them after the records have been deleted. Comparing the two plots, we find no significant change in the distribution of the age attribute or the Total Bilirubin attribute before and after the records with missing values are deleted.
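The two cleaning steps, deleting records with missing values and log-transforming a variable, can be sketched as follows. The column names, the toy records and the use of `log(1 + x)` rather than a plain logarithm (so that zero values are handled) are assumptions made for illustration.

```python
import math

def drop_missing(records):
    """Keep only the records in which every attribute has a value."""
    return [r for r in records if all(v is not None for v in r.values())]

def log_transform(records, column):
    """Return copies of the records with log(1 + x) applied to one column."""
    return [{**r, column: math.log1p(r[column])} for r in records]

# Toy, made-up records; None marks a missing value.
records = [
    {"age": 65, "total_bilirubin": 0.7},
    {"age": 62, "total_bilirubin": None},
    {"age": 58, "total_bilirubin": 10.9},
]
clean = drop_missing(records)
transformed = log_transform(clean, "total_bilirubin")
```

Both functions return new lists and leave the input untouched, so the distributions before and after each step can be compared side by side.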
3.3 Select Features and Model Building
The learning accuracy and other performance measures of a classifier depend on the data and on identifying the important features among them. Improving the performance of a classifier therefore requires better features. We select the features that are most relevant to the target class label and, where the correlation is significant, remove highly correlated features, so that the chosen subset retains most of the relevant information. The procedure is as follows: select a candidate feature subset; create training and test data sets based on that subset; train the classifier on the training data set; measure accuracy or other performance metrics on the validation data set; repeat for all feature subsets; and select the subset that gives the best predictive accuracy.
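The subset search described above can be sketched as a loop over candidate feature subsets. The nearest-centroid classifier is only a stand-in for whichever classifier is being tuned, and the function names, the fixed subset size and the toy data are hypothetical.

```python
from itertools import combinations

def project(X, subset):
    """Keep only the columns listed in `subset`."""
    return [[row[i] for i in subset] for row in X]

def nearest_centroid_accuracy(train_X, train_y, val_X, val_y):
    """Stand-in classifier: assign each validation row to the closest class mean."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, l in zip(train_X, train_y) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    correct = 0
    for row, label in zip(val_X, val_y):
        pred = min(centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(row, centroids[c])))
        correct += pred == label
    return correct / len(val_y)

def best_feature_subset(train_X, train_y, val_X, val_y, size):
    """Score every subset of the given size and return (accuracy, subset)."""
    scored = []
    for subset in combinations(range(len(train_X[0])), size):
        acc = nearest_centroid_accuracy(project(train_X, subset), train_y,
                                        project(val_X, subset), val_y)
        scored.append((acc, subset))
    return max(scored, key=lambda s: s[0])

# Toy, made-up data: feature 0 separates the classes, feature 1 is noise.
train_X = [[0.0, 5.0], [0.2, 1.0], [1.0, 4.0], [1.1, 1.2]]
train_y = [0, 0, 1, 1]
val_X = [[0.1, 1.1], [0.9, 5.1]]
val_y = [0, 1]
best = best_feature_subset(train_X, train_y, val_X, val_y, size=1)
```

An exhaustive search like this grows quickly with the number of features; with the 11 ILPD attributes it is still feasible, but larger data sets usually need a greedy or filter-based selection instead.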
3.4 Performance Metrics
We evaluate the performance of the data mining algorithms on the medical dataset using the following metrics:
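(1) True Positive Rate: the fraction of positive examples predicted correctly by the classifier,
$$\text{True Positive Rate} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$$
where
True Positive = the number of positive examples correctly predicted by the classification model
False Negative = the number of positive examples wrongly predicted as negative by the classification model
(True Positive + False Negative) = the total number of positive examples in the test results for the model being considered
(2) True Negative Rate: the fraction of negative examples predicted correctly by the model,
$$\text{True Negative Rate} = \frac{\text{True Negative}}{\text{True Negative} + \text{False Positive}}$$
where
True Negative = the number of negative examples correctly predicted by the classifier
False Positive = the number of negative examples wrongly predicted as positive
(True Negative + False Positive) = the total number of negative examples in the test results for the model being considered
(3) Accuracy: the ratio of the number of correct predictions to the total number of predictions,
$$\text{Accuracy} = \frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{True Negative} + \text{False Positive} + \text{False Negative}}$$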
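The three metrics can be computed from the true and predicted labels as in the following sketch. The function names, the choice of 1 as the positive label and the toy predictions are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false negatives, true negatives, false positives."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, tn, fp

def classification_metrics(y_true, y_pred, positive=1):
    """True positive rate, true negative rate and accuracy as a dict."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred, positive)
    return {
        "tpr": tp / (tp + fn),
        "tnr": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

# Toy, made-up labels: 4 positive and 4 negative examples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
```

Reporting the true positive and true negative rates next to accuracy matters on an imbalanced data set, where a model that always predicts the majority class can still reach a high accuracy.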
K-fold Cross Validation:
K-fold cross validation addresses the bias and variance problems associated with the split into training and test data sets. To limit variance, the number of training examples should be at least 10 times the number of variables. In this scheme the complete data set is divided into k subsets of approximately equal size, and the classification model is trained and tested k times, each time holding out one subset for testing and training on the remaining k - 1 subsets. In this analysis 10-fold cross validation is used to estimate the performance of the classifiers: the whole data set is split into 10 mutually exclusive subsets, and we then average the results obtained over the 10 folds (Fig. 3).
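The splitting and averaging can be sketched as follows. The shuffling seed, the function names and the `train_and_score` callback, which stands for any routine that trains a classifier on the training part and returns a score on the held-out part, are hypothetical.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Split the indices 0..n_samples-1 into k mutually exclusive folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]

def cross_validate(X, y, k, train_and_score, seed=0):
    """Average the score of `train_and_score` over k train/test splits."""
    folds = k_fold_indices(len(X), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(train_and_score(
            [X[j] for j in train_idx], [y[j] for j in train_idx],
            [X[j] for j in test_idx], [y[j] for j in test_idx]))
    return sum(scores) / k

# With 583 records and k = 10, three folds hold 59 records and seven hold 58.
folds = k_fold_indices(583, k=10)
```

Every record is used for testing exactly once and for training k - 1 times, so the averaged score depends less on one particular split than a single hold-out estimate does.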
Classification Results and Future Scope:
In this paper, we evaluated the performance of different data mining algorithms on clinical data using accuracy, true positive rate and true negative rate. The Support Vector Machine and Artificial Neural Networks gave the better results (Table 2). In future work, we will apply ensemble and deep learning algorithms to medical data.
References
1. Sorich MJ, Miners JO, McKinnon RA, Winkler DA, Burden FR, Smith PA (2003) Comparison of linear and nonlinear classification algorithms for the prediction of drug and chemical metabolism by human UDP-glucuronosyltransferase isoforms. J Chem Inf Comput Sci 43(6):2019–2024
2. Harper PR (2005) A review and comparison of classification algorithms for medical decision making. Health Policy 71(3):315–331
3. Huang L-C, Hsu S-Y, Lin E (2009) A comparison of classification methods for predicting Chronic Fatigue Syndrome based on genetic data. J Transl Med 7:81
4. Cios KJ, Moore GW (2002) Uniqueness of medical data mining. Artif Intell Med 26:1–24
5. Haykin S (1998) Neural networks: a comprehensive foundation. Prentice Hall, New Jersey
6. Yu H, Kim S (2012) SVM tutorial: classification, regression, and ranking. In: Handbook of natural computing
7. Rish I (2001) An empirical study of the Naive Bayes classifier. In: IJCAI workshop on empirical methods in AI
8. Altman NS (1992) An introduction to kernel and nearest-neighbor nonparametric regression. Am Stat 46(3):175–185
9. Berman JJ (2002) Confidentiality issues for medical data miners. Artif Intell Med 26:25–36
10. Burke HB, Goodman PH, Rosen DB, Henson DE, Weinstein JN, Harrell FE Jr et al (1997) Artificial neural networks improve the accuracy of cancer survival prediction. Cancer 79:857–862
11. Al-Sharafat WS, Naoum R (2009) Development of genetic-based machine learning for network intrusion detection. World Acad Sci Eng Technol 55:20–24
12. Eyheramendy JJ, Lewis D, Madigan D (2003) On the Naive Bayes model for text categorization. In: Proceedings of artificial intelligence and statistics
© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Suragala, A., Venkateswarlu, P., China Raju, M. (2021). A Comparative Study of Performance Metrics of Data Mining Algorithms on Medical Data. In: Kumar, A., Mozar, S. (eds) ICCCE 2020. Lecture Notes in Electrical Engineering, vol 698. Springer, Singapore. https://doi.org/10.1007/978-981-15-7961-5_139
DOI: https://doi.org/10.1007/978-981-15-7961-5_139
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-7960-8
Online ISBN: 978-981-15-7961-5