1 Introduction

The liver is the largest organ in the human body after the skin. It performs many functions, including cleansing toxins from the blood, converting food into nutrients and controlling hormone levels. Diagnosing liver disease at an early stage can improve the patient's chance of survival. Techniques used to find patterns in large datasets are called data mining techniques. They serve several functions, such as classification, association rule mining and clustering. Classification is a supervised learning technique that assigns the instances of a dataset to distinct classes or levels. It proceeds in two steps: first, a dataset is used to train and build a model; second, the model is used to classify instances [1].

2 Literature Survey

In [2], the Indian Liver Patient Dataset (ILPD) and the UCLA dataset were used. ANOVA and MANOVA were applied to identify differences among the groups. The authors selected attributes common to both datasets, e.g., ALKPHOS, SGPT and SGOT. Analysis of variance (ANOVA) was carried out using multivariate tables. The authors investigated the 99% and 90% significance levels and reported good results.

The study [3] deals with two distinct feature combinations, viz. SGOT, SGPT and alkaline phosphatase, from two datasets (ILPD and BUPA liver disorder). Error rate, sensitivity, prevalence and specificity were observed experimentally. Attributes such as total bilirubin, direct bilirubin, albumin, gender, age and total proteins facilitate liver cancer diagnosis.

The paper [4] used a neural network with a trained adaptive activation function to extract rules. OptaiNET, an artificial immune system (AIS) algorithm, was used to set rules for liver disorders. Based on the input attributes, the adaptive activation function was trained so that the neural network could extract rules efficiently in the hidden layer. The ANN encodes the data, classifies the encoded data and finally extracts rules. It correctly diagnosed 192 samples (out of 200) belonging to class 0, covering 96%, and 135 samples (out of 145) belonging to class 1, covering 93%. Overall, 94.8% of the samples were diagnosed correctly.
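The percentages quoted from [4] can be checked directly from the sample counts. The short Python sketch below is only an arithmetic check; the variable and function names are illustrative and are not taken from [4].

```python
# Per-class counts reported in [4]: (correctly diagnosed, total samples).
counts = {"class 0": (192, 200), "class 1": (135, 145)}


def percent(correct, total):
    """Return the share of correct diagnoses as a percentage."""
    return 100.0 * correct / total


per_class = {label: percent(c, t) for label, (c, t) in counts.items()}
total_correct = sum(c for c, _ in counts.values())
total_samples = sum(t for _, t in counts.values())
overall = percent(total_correct, total_samples)

for label, value in per_class.items():
    print(f"{label}: {value:.1f}%")
print(f"overall: {overall:.1f}%")
```

The overall figure is the pooled ratio (327 of 345 samples), not the mean of the two per-class percentages.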

The study [5] applied univariate analysis and feature selection to the predictor attributes. Predictive data mining is a significant tool for researchers in the medical sciences. The ILPD dataset was analysed separately for men and women. The classification algorithms were trained and tested, and the results were evaluated for accuracy and error. For men and women, SVM gave high accuracy of 99.76% and 97.7%, respectively.

In the survey [6], decision tree induction (the J48 algorithm) was applied to a dataset from the Pt. B.D. Sharma Postgraduate Institute of Medical Science, Rohtak. The dataset contained 150 instances (100 for training and 50 for testing), 8 attributes and 2 classes. The model was evaluated using 10-fold cross-validation in the WEKA tool, and J48 classified 100% of the instances correctly. The result was expressed in four categories: cost/benefit of J48 for class YES = 44, cost/benefit of J48 for class NO = 56, classification accuracy for YES = 56% and classification accuracy for NO = 44%. Several other algorithms were applied to this dataset, and J48 showed the best results.

The publication [7] described classification of the ILPD using data mining approaches: Naïve Bayes, Random Forest and SVM. The algorithms were implemented in R. To improve accuracy, a hybrid neuro-SVM, which combines SVM with a feedforward artificial neural network (ANN), was used. Root mean square error (RMSE) and mean absolute percentage error were reported. This model gave 98.83% accuracy.

In [1], various decision-tree-based data mining algorithms, such as AD Tree, Decision Tree, J48, Random Forest and Random Tree, were applied to a liver cancer dataset. The algorithms were trained on the dataset after preprocessing had been applied to handle missing or noisy data. Classification was performed both with and without feature selection. Performance was measured in terms of accuracy, precision and recall. The accuracy of the decision stump (71.35%) compared favourably with the other algorithms; J48 and Random Forest gave 70.66% and 70.15% accuracy, respectively.

The publication [8] used a Java implementation of PSO to process the dataset and categorize the training attributes in order to retrieve pbest and gbest. The pbest was then compared with lbest to determine the best solution for attribute selection. PSO gave the following values: gammagt 4.60, alkphos 4.49, SGPT 3.91, SGOT 3.07 and drinks 1.36. The selected attributes were then passed to the WEKA tool, where the Kstar algorithm was applied for classification. The PSO-Kstar combination was reported to be the best data mining technique, giving accuracy of up to 100%.

The paper [9] described different clustering algorithms for prediction on the BUPA liver disorder and ILPD datasets for performance analysis. The simple BIZ model was selected. Attribute subsets of different sizes (5, 6, 7, 8 and 9) were evaluated for accuracy. Logistic regression and SVM (PSO) gave the best results for the BUPA liver disorder and ILPD datasets, with accuracy of 89.14% and 89.66%, respectively.

3 Methodology

In this work, the Indian Liver Patient Dataset (ILPD) is used. First, preprocessing is performed to resolve missing values. Next, the supervised Resample filter is applied. Lazy classifiers, namely the IBKLG, LocalKnn and RseslibKnn algorithms, are then used in the WEKA tool for classification. 10-fold cross-validation is used, followed by performance and error evaluation (Fig. 1).

Fig. 1. Classification process
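The overall flow of this methodology (handle missing values, classify with a nearest-neighbour method, evaluate with 10-fold cross-validation) can be illustrated with a minimal Python sketch. It is not the WEKA workflow used in this study: the data are synthetic, the resampling filter is omitted, mean imputation is an assumed way of resolving missing values, and all function names are hypothetical.

```python
import math
import random


def impute_mean(rows):
    """Replace None entries in each column with that column's mean."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        present = [r[j] for r in rows if r[j] is not None]
        means.append(sum(present) / len(present))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


def nearest_label(train_x, train_y, query):
    """1-nearest-neighbour prediction with Euclidean distance."""
    best = min(range(len(train_x)),
               key=lambda i: math.dist(train_x[i], query))
    return train_y[best]


def cross_validate(xs, ys, folds=10, seed=0):
    """Return the accuracy of 1-NN under k-fold cross-validation."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    correct = 0
    for f in range(folds):
        test_idx = order[f::folds]          # every folds-th shuffled index
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        correct += sum(nearest_label(train_x, train_y, xs[i]) == ys[i]
                       for i in test_idx)
    return correct / len(xs)


# Synthetic stand-in data: two well-separated groups, a few missing values.
rng = random.Random(1)
features, labels = [], []
for n in range(60):
    label = n % 2
    centre = 0.0 if label == 0 else 5.0
    row = [rng.gauss(centre, 0.5), rng.gauss(centre, 0.5)]
    if n % 15 == 0:
        row[1] = None  # simulate a missing measurement
    features.append(row)
    labels.append(label)

clean = impute_mean(features)
accuracy = cross_validate(clean, labels, folds=10)
print(f"10-fold accuracy on synthetic data: {accuracy:.2%}")
```

Each instance is held out exactly once, so the reported accuracy is the fraction of all instances classified correctly while unseen by the model.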

4 Result and Discussion

Lazy classifiers are used for the analysis of liver cancer. An algorithm that gives better accuracy and precision and classifies more instances correctly is considered the better algorithm for early diagnosis of liver cancer.

4.1 IBKLG Algorithm

IBKLG belongs to the family of lazy classifiers. It is a K-nearest neighbors classifier that can select an appropriate value of K based on cross-validation, and it can also perform distance weighting. Here the number of neighbors is set to one, the standard deviation is set to 1.0, and both the "do not check capabilities" option and the meanSquared option are set to false. Neighbor search uses the linearNNSearch algorithm. 10-fold cross-validation is used for testing. It correctly classifies 573 instances (98.28%) and incorrectly classifies 10 instances (1.72%) out of 583 instances (Fig. 2, Tables 1 and 2).

Fig. 2. Area under ROC for IBKLG algorithm with a value 0.9986

Table 1. Error evaluation for IBKLG algorithm.
Table 2. Confusion matrix for IBKLG algorithm.
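The idea of selecting K by cross-validation, mentioned above for the K-nearest neighbors classifier, can be sketched as follows. This is a generic hold-one-out illustration in Python, not the WEKA implementation; the toy data and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to query."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(data, k):
    """Classify each point using all other points; return the hit rate."""
    hits = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, point, k) == label
    return hits / len(data)


def select_k(data, candidates):
    """Return the candidate k with the highest leave-one-out accuracy."""
    return max(candidates, key=lambda k: leave_one_out_accuracy(data, k))


# Hypothetical toy data: two clusters plus one outlier labelled "b"
# that sits inside the "a" cluster.
toy = [((0.0, 0.0), "a"), ((0.2, 0.1), "a"), ((0.1, 0.3), "a"),
       ((0.3, 0.2), "a"), ((3.0, 3.0), "b"), ((3.2, 3.1), "b"),
       ((3.1, 2.9), "b"), ((2.9, 3.2), "b"), ((0.15, 0.15), "b")]

best_k = select_k(toy, candidates=[1, 3, 5])
print("selected k:", best_k)
```

On this toy set a single neighbor is misled by the outlier, so a larger K scores better under hold-one-out evaluation; with real data the preferred K depends on the dataset.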

4.2 LocalKnn Algorithm

LocalKnn is a K-nearest neighbor classifier with local metric induction. It improves accuracy relative to standard k-NN, particularly for data with nominal attributes. It works reasonably with 2000+ training instances. A batch size of 100 is selected. The "do not check capabilities" option is set to false. Learning of the optimal K value is set to true, the number of neighbors used to vote for the decision is set to one, and the size of the local set used to induce the local metric is set to 100. The vicinity size for the density-based metric is 200. Voting by the nearest neighbors is set to inverse square distance, which is a distance-based weighting method. 10-fold cross-validation is applied. It correctly classifies 576 instances (98.80%) and incorrectly classifies 7 instances (1.20%) out of 583 instances. The time taken to build the model is 68.19 s (Fig. 3, Tables 3 and 4).

Fig. 3. Area under ROC for LocalKnn algorithm with a value 0.9844

Table 3. Error evaluation for LocalKnn algorithm
Table 4. Confusion matrix for LocalKnn algorithm
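Inverse-square-distance voting, the weighting scheme named above, gives each neighbor a vote of 1/d², so close neighbors count far more than distant ones. The Python sketch below illustrates only this voting rule under plain Euclidean distance; it does not reproduce the local metric induction of LocalKnn, and the data, labels and function name are hypothetical.

```python
import math
from collections import defaultdict


def weighted_vote(train, query, k, eps=1e-12):
    """Classify query by inverse-square-distance voting among k neighbours."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    scores = defaultdict(float)
    for point, label in neighbours:
        d = math.dist(point, query)
        scores[label] += 1.0 / (d * d + eps)  # eps avoids division by zero
    return max(scores, key=scores.get)


# Hypothetical toy data: one close "sick" point, two farther "healthy" points.
train = [((1.0, 1.0), "sick"),
         ((3.0, 1.0), "healthy"),
         ((1.0, 3.0), "healthy")]
query = (1.5, 1.5)

print(weighted_vote(train, query, k=3))
```

Here the single close neighbor (squared distance 0.5, weight 2.0) outweighs the two farther ones (squared distance 2.5, weight 0.4 each), whereas an unweighted majority vote over the same three neighbors would pick the other class.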

4.3 RseslibKnn Algorithm

RseslibKnn also belongs to the family of lazy classifiers. Its configurable properties include batch size, learning of the optimal k value, the "do not check capabilities" option, cross-validation, kernel setting, density-based metric and so on. The time taken to build the model is 1.3 s. 10-fold cross-validation is used. It correctly classifies 571 instances (97.94%) and incorrectly classifies 12 instances (2.06%) out of 583 instances (Fig. 4, Tables 5 and 6).

Fig. 4. Area under ROC for RseslibKnn algorithm with a value 0.9766

Table 5. Error evaluation for RseslibKnn algorithm
Table 6. Confusion matrix for RseslibKnn algorithm

4.4 Comparison of Error Evaluation and Performance Analysis of Three Lazy Classifiers (RseslibKnn, IBKLG, LocalKnn) for ILPD Dataset

See Figs. 5 and 6.

Fig. 5. Error evaluation of Lazy classifier

Fig. 6. Performance analysis of Lazy classifier
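The accuracy and error percentages reported in Sects. 4.1 to 4.3 follow directly from the counts of correctly and incorrectly classified instances. The Python sketch below recomputes them as a consistency check; only the counts come from this paper, and the variable and function names are illustrative.

```python
# Correct / incorrect counts reported in Sects. 4.1-4.3 (583 instances each).
reported = {
    "IBKLG": (573, 10),
    "LocalKnn": (576, 7),
    "RseslibKnn": (571, 12),
}


def summarize(correct, incorrect):
    """Return (accuracy %, error %) for one classifier."""
    total = correct + incorrect
    return 100.0 * correct / total, 100.0 * incorrect / total


summary = {name: summarize(*counts) for name, counts in reported.items()}
best = max(summary, key=lambda name: summary[name][0])

for name, (acc, err) in summary.items():
    print(f"{name}: accuracy {acc:.2f}%, error {err:.2f}%")
print("highest accuracy:", best)
```

The recomputed values agree with the percentages quoted for each classifier, and LocalKnn has the highest accuracy of the three.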

5 Conclusion and Future Perspective

A close assessment of the error estimates of the three lazy classifiers (RseslibKnn, IBKLG, LocalKnn) shows that LocalKnn achieves the minimum error. LocalKnn is best in terms of accuracy and recall, while IBKLG gives the best precision. A classification algorithm that classifies instances accurately can support easier and more accurate diagnosis of liver cancer at an early stage.

In future work, these classifiers can be applied to other types of cancer, such as breast, prostate and lung cancer. Applying these algorithms may generate better results. As an extension, biopsy and mammography images can be analysed using machine learning methods. The approach can also be applied to the analysis of patient survival rates.