Abstract
This paper proposes a novel hybrid classification model that significantly improves on the performance of the standard kNN (k = 1) classification model. Using the ensemble stacking approach, a kNN classification model and a rotation forest classification model are hybridized as base classifiers, with a simple logistic classifier as the meta classification model. The performance of the proposed hybrid model was assessed using accuracy and F-measure. The model was compared with standard kNN and nine other classification models. The results showed that the proposed hybrid model performs notably better than the other models.
1 Introduction
The process of finding interesting patterns in enormous amounts of data is called data mining. Such patterns can be valuable for large businesses and for making smart decisions. They help in improving customer relationships, developing marketing policies, improving sales and reducing costs.
Data mining is a multidisciplinary field that combines statistics, machine learning, artificial intelligence and database technologies to make predictions from large data repositories. Data mining methods such as association, classification and clustering can be applied to various kinds of data, such as database data, transactional data and data warehouses.
The focus of this paper is on one data mining technique, classification, which is the process of sorting objects into similar groups. The classification process has two steps. The first is the training or learning step, in which a classification model, also known as a classifier, finds correlations between the class labels and the features in a given dataset. In the second step, the classification model is supplied with test data to assess its performance. Classification is useful in many application areas, such as spam filtering, fraud detection, target marketing, customer attraction, customer retention, performance prediction, manufacturing and medical diagnosis, so there is a strong research need to improve classifier accuracy. Many classification models exist, and much work has been carried out to improve the efficiency of these traditional models. Nearest neighbour, one of the most popular classical classification models, has been chosen for this paper.
Using ensemble methods to improve classification is an active research field in machine learning. An ensemble combines a series of k trained models M1, M2, M3, …, Mk with the aim of creating an improved composite classification model, M* [1]. In this paper a novel technique has been developed to improve the efficiency of the nearest neighbour model through an ensemble method.
2 Related work
Yu et al. [2] address special dataset problems such as imbalanced data and sparse data using HBKNN. To tackle the noisy data problem in high-dimensional datasets, they then propose a random subspace ensemble method with HBKNN (RS-HBKNN), which outperforms most of the classification approaches.
Vinoth et al. [3] propose a text document classifier that combines the KNN classifier with a support vector machine (SVM) classifier. The proposed SVM-KNN approach aims to reduce limitations in classification accuracy. The SVM is used to reduce the training data from the different classes, and the resulting support vectors (SVs) from each class serve as training data for the KNN classification algorithm. Instead of the Euclidean function, the nearest centroid distance function is used, which reduces the computation time for distance calculations.
Mittal and Gill [4] propose a hybrid model for efficient diabetes prediction. Hidden patterns are obtained by applying feature selection to the dataset, and a two-layered classification is then applied to the refined data. In their study an SVM classifier and a neural network classifier are hybridized. The proposed system is compared with a few other classifiers and achieves good accuracy.
Aci et al. [5] aim to eliminate data that are difficult to learn. Three classification algorithms, k-nearest neighbour, Bayesian methods and a genetic algorithm, are combined to produce a new dataset from the original data. The method is tested on various datasets, and the new data gives better classification accuracy than the original data.
Miloud-Aouidate and Baba-Ali [6] hybridize kNN with ant colony optimisation, using a condensing approach for kNN. Condensing significantly reduces the number of instances in a dataset while keeping the accuracy obtained with the reduced training set very close to that of the complete training set. Together, condensing kNN and the optimisation approach form a novel algorithm that outperforms the standard kNN and other condensed kNN algorithms.
3 Classification approaches
In this study the nearest neighbour, rotation forest and simple logistic classification models are hybridised using the ensemble method called stacking.
3.1 Nearest neighbour
The nearest neighbour classifier, or k-nearest neighbour (k-NN) classifier, is an instance-based learner that compares an unknown instance with the training instances similar to it. When an unknown instance is given, the k-NN classifier searches for the k training instances closest to it. In this paper k = 1, i.e. only 1NN is considered. Closeness is measured by a distance metric such as the Euclidean, Manhattan or Minkowski distance; the Euclidean distance is used here. For two points p = (p1, p2) and q = (q1, q2), the Euclidean distance is [7, 8]:
$$d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}$$
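As an illustration of the 1NN rule with Euclidean distance, here is a minimal Python sketch. The function names and the toy data are hypothetical and are not taken from the paper's experiments.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric points."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


def predict_1nn(train_points, train_labels, query):
    """Return the label of the single training point closest to query (k = 1)."""
    nearest = min(range(len(train_points)),
                  key=lambda i: euclidean(train_points[i], query))
    return train_labels[nearest]


# Toy data (hypothetical, not one of the paper's datasets).
train_points = [(1.0, 1.0), (2.0, 1.5), (6.0, 5.0), (7.0, 6.5)]
train_labels = ["a", "a", "b", "b"]
label = predict_1nn(train_points, train_labels, (6.5, 5.5))
```

The sketch scans every training instance for each query, which is the simplest form of the rule; it makes no attempt at the indexing structures a production implementation would likely use.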
3.2 Rotation forest
The rotation forest classifier produces an ensemble of classifiers. The dataset features are split into k subsets, principal component analysis (PCA) is applied to each subset, and a new set of features is formed from the resulting components. The data are transformed into this new feature space and a decision tree is learned on them. Different splits of the feature set lead to different rotations, so diverse classifiers are obtained. The information about the spread of the data is fully preserved in the new space of extracted features [9], which is how accurate individual classifiers are built. The aim of the rotation forest classifier is thus to obtain diverse and accurate classifiers at the same time. The method is called rotation forest because principal component analysis is a rotation of the coordinate axes and the base classifier model is a decision tree [9].
3.3 Simple logistic regression
The simple logistic model is a supervised classification model. It is a linear classifier that uses computed logit scores to predict the class label. Each score is the numerical value corresponding to a particular attribute, and each weight expresses the importance of that attribute for a particular target class. The scores and weights are multiplied and summed to form the logits. The logits are then passed to the normalized exponential (softmax) function, which returns the probability of each target class. The class with the highest probability is the predicted class for the given unknown instance. The normalized exponential function is [10]:
$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{i=1}^{k} e^{z_i}}, \quad j = 1, \ldots, k$$
where z is the k-dimensional vector of logits and σ(z) is a k-dimensional vector of real values in the range 0–1 that sum to 1. The simple logistic classifier is selected as the meta classifier, which forms the second phase of the stacking process. The meta classifier plays an important role in getting good results in the proposed hybrid model.
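The step from logits to a predicted class can be sketched in a few lines of Python. This is only an illustration of the normalized exponential function; the logit values and class names are made up, and subtracting the maximum logit is a common numerical-stability choice rather than something the paper specifies.

```python
import math


def softmax(logits):
    """Normalized exponential: map real-valued logits to probabilities summing to 1."""
    shift = max(logits)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(logits, class_names):
    """Return the class with the highest softmax probability, plus all probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return class_names[best], probs


# Hypothetical logits for a three-class problem.
predicted, probs = predict_class([2.0, 1.0, 0.1], ["yes", "no", "maybe"])
```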
3.4 Ensemble stacking
Ensemble stacking is a two-phase classification process. In the first phase the base classifiers c1, c2, …, ci are trained on the dataset using j-fold cross-validation and produce the outputs y1, y2, …, yi. These outputs are the input to the meta classifier, which forms the second phase of stacking. The meta classifier learns to combine the outputs of the base classifiers so as to minimise the classification error. This process is repeated over the cross-validation folds to obtain the final stacked classification model [11,12,13,14,15,16,17,18] (Fig. 1).
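The first phase of stacking can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the helper names are hypothetical, folds are taken as contiguous blocks without shuffling, and `fit_majority` is a deliberately trivial stand-in for a second base learner (a real rotation forest is far more involved).

```python
def k_fold_indices(n, folds):
    """Split indices 0..n-1 into `folds` contiguous, nearly equal test folds."""
    sizes = [n // folds + (1 if i < n % folds else 0) for i in range(folds)]
    start, result = 0, []
    for size in sizes:
        result.append(list(range(start, start + size)))
        start += size
    return result


def level_one_dataset(X, y, base_fitters, folds):
    """Phase 1: collect out-of-fold predictions from every base learner.

    Each fitter takes (X_train, y_train) and returns a predict(x) function.
    Row i of the result holds the base predictions for X[i], produced by
    models that did not see X[i] during training.
    """
    meta_rows = [[None] * len(base_fitters) for _ in X]
    for test_idx in k_fold_indices(len(X), folds):
        held_out = set(test_idx)
        X_tr = [x for i, x in enumerate(X) if i not in held_out]
        y_tr = [t for i, t in enumerate(y) if i not in held_out]
        for j, fit in enumerate(base_fitters):
            predict = fit(X_tr, y_tr)
            for i in test_idx:
                meta_rows[i][j] = predict(X[i])
    return meta_rows, list(y)


def fit_1nn(X_tr, y_tr):
    """1-NN base learner using squared Euclidean distance."""
    def predict(x):
        def sq_dist(p):
            return sum((a - b) ** 2 for a, b in zip(p, x))
        return y_tr[min(range(len(X_tr)), key=lambda i: sq_dist(X_tr[i]))]
    return predict


def fit_majority(X_tr, y_tr):
    """Stand-in second base learner: always predicts the most frequent label."""
    most_common = max(sorted(set(y_tr)), key=y_tr.count)
    return lambda x: most_common


# Toy one-dimensional data (hypothetical).
X = [(0.0,), (0.2,), (0.4,), (5.0,), (5.2,), (5.4,)]
y = [0, 0, 0, 1, 1, 1]
meta_X, meta_y = level_one_dataset(X, y, [fit_1nn, fit_majority], folds=3)
```

In the second phase, a meta classifier would be trained on `meta_X` with targets `meta_y`; that step is omitted here to keep the sketch short.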
4 Working of proposed hybrid model
In this paper, a unique hybridization is proposed in which the 1-NN model (k = 1) and the rotation forest model are the base classification models and the simple logistic model is the meta classifier in stacking. The proposed hybrid model works as follows:
- A training dataset D is sent separately to each of the base classification models (1-NN and rotation forest) in stacking phase 1.
- The outputs of the two base classification models of stacking phase 1 are combined to form a single new dataset Dh.
- This dataset acts as the input training data for the meta classification model (simple logistic model) of stacking phase 2, which produces the final output H. H is the final prediction of the proposed hybrid model.
The proposed hybrid model is compared with other ensemble methods, namely bagging, boosting and LogitBoost, and with six other classification models, namely Naive Bayes, Bayesian network, simple CART (classification and regression tree), J48, decision tree and decision table, on 13 datasets taken from the UCI repository [19] (Fig. 2).
5 Datasets used
Thirteen datasets from the UCI data repository are used in this paper. The datasets cover a mix of data types: some are categorical, some are numeric, and they include dichotomous (two-class) as well as multiclass problems (Table 1).
6 Results and discussions
6.1 Computation of performance of classification models
Several measures are available for evaluating the performance of a classification model. In this paper both binary and multiclass classification problems are considered, and all the classification algorithms selected for this study can handle multiclass problems.
The quality of a model is considered good if most items are correctly classified. True positives and true negatives indicate when the classifier is getting things right, while false positives and false negatives indicate when it is getting things wrong. Other measures for evaluating a classifier include TP rate, FP rate, F-measure, precision and recall. True positives (Tp): the classification model correctly predicts the label as positive. True negatives (Tn): the classification model correctly predicts the label as negative. False positives (Fp): the classification model incorrectly predicts the label as positive. False negatives (Fn): the classification model incorrectly predicts the label as negative [20].
The performance of the classification models is obtained by calculating accuracy, precision and recall. Accuracy is the ratio of correct predictions (Tp and Tn) to all predictions. Precision is the proportion of instances predicted as positive that are actually positive, and recall is the proportion of actual positives that are correctly predicted as positive. Precision and recall tend to trade off against each other, which makes it difficult to compare models on both at once. A single measure combining precision and recall, called the F-measure, can therefore be computed [20].
F-measure is the weighted harmonic mean of Precision and Recall.
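The measures described above can be computed directly from the four counts Tp, Tn, Fp and Fn. The following Python sketch uses the standard definitions; the function name, the `beta` parameter (with `beta = 1` giving the balanced F-measure, assumed here because the paper does not state a weighting) and the example counts are illustrative assumptions.

```python
def classification_metrics(tp, tn, fp, fn, beta=1.0):
    """Accuracy, precision, recall and F-measure from confusion-matrix counts.

    beta weights recall relative to precision; beta = 1 is the balanced
    harmonic mean. Undefined ratios (zero denominators) are reported as 0.0.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta ** 2 * precision + recall
    f_measure = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Hypothetical counts, not results from the paper.
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

For these made-up counts the accuracy is 85/100 = 0.85, the precision is 40/45, the recall is 40/50 = 0.8 and the balanced F-measure is 80/95.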
6.2 Experimental results
The results of this experiment are shown in two tables. Table 2 shows that the proposed hybrid model significantly outperforms the traditional nearest neighbour model in terms of accuracy in most cases.
The following is a graphical comparison of the proposed hybrid model and the standard kNN. For both models, k = 1 (Fig. 3).
For each dataset in the graph, the proposed hybrid method is observed to be significantly better than the standard 1NN model.
Table 3 compares the proposed hybrid model with 9 other standard classification models. Three of these 9 models are ensemble-based: bagging, AdaBoost and LogitBoost. The other six are Naive Bayes, Bayesian networks, decision table, decision tree, J48 and simple CART. The same 13 datasets are used for this comparison. On 9 of the 13 datasets the proposed hybrid model has the highest accuracy, on 2 datasets it ties for the highest accuracy with two other models, and on only 2 datasets it has lower accuracy.
The experiment aimed to compare the performance of the proposed stacking-based hybrid model with the standard kNN model, with six other standard models and with three ensemble models. A paired t-test was performed to compare the proposed hybrid method with the other methods in terms of accuracy, F-measure and ranking. The table also shows the accuracy of the models and the standard deviations. The significance level is 0.05.
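To illustrate how a paired t-test compares two models evaluated on the same folds, here is a minimal Python sketch of the standard paired t statistic. The paper does not state which variant of the test was used, so this is an assumption; the function name and the per-fold accuracies are hypothetical.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for matched score lists."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Hypothetical per-fold accuracies for two models (not results from the paper).
model_a = [0.90, 0.92, 0.88, 0.91, 0.89]
model_b = [0.85, 0.88, 0.86, 0.87, 0.84]
t_value, dof = paired_t_statistic(model_a, model_b)
```

With 4 degrees of freedom, the two-sided critical value at the 0.05 level is about 2.776, so a larger absolute t value would indicate a significant difference between the two models on these made-up scores.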
The following is a graphical presentation of the comparison in the above table between the proposed hybrid model and the other classification models in terms of accuracy. The proposed hybrid model has scored the highest in all cases (Fig. 4).
In Fig. 5, the F-measure of the proposed hybrid model is the highest in most cases. The last row shows the win/tie/loss counts of all the models in terms of F-measure. None of the models obtained any wins against the proposed hybrid method. A model is ranked by the number of times it beats the other models in j-fold cross-validation. Table 4 shows the ranking of the classification models according to the difference between the number of times each model was significantly better and significantly worse than the other models. The proposed hybrid model has the most wins in the ranking (Fig. 6).
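The ranking rule, the number of significant wins minus the number of significant losses, can be sketched as follows. The function name, the model labels A, B and C, and all counts are made up for illustration and do not reproduce Table 4.

```python
def rank_models(significant_wins):
    """Rank models by (significant wins - significant losses), best first.

    significant_wins maps (winner, loser) to the number of datasets on which
    winner was significantly better than loser.
    """
    models = sorted({m for pair in significant_wins for m in pair})
    table = []
    for m in models:
        wins = sum(c for (a, _), c in significant_wins.items() if a == m)
        losses = sum(c for (_, b), c in significant_wins.items() if b == m)
        table.append((m, wins, losses, wins - losses))
    return sorted(table, key=lambda row: row[3], reverse=True)


# Hypothetical pairwise counts for three models A, B and C.
counts = {
    ("A", "B"): 8, ("B", "A"): 0,
    ("A", "C"): 6, ("C", "A"): 1,
    ("B", "C"): 3, ("C", "B"): 4,
}
ranking = rank_models(counts)
```

Each row of `ranking` is a tuple of model name, wins, losses and their difference, so the first row is the top-ranked model under this rule.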
7 Conclusion and future scope
In this paper, a novel technique is implemented using an ensemble method. The proposed hybrid model gives very good results compared with other ensemble models and classic models when tested on 13 datasets from the UCI repository. The method proved to be effective not only for binary class problems but also for multiclass problems. This hybridization of three classifiers, with 1NN and rotation forest as base classifiers and simple logistic regression as the meta classifier in the stacking ensemble method, has significantly improved the accuracy of the standard 1NN. In future work this technique can be applied to other classification algorithms as well.
References
Han J, Kamber M, Pei J (2012) Data mining: concepts and techniques. Morgan Kaufmann, Waltham
Yu Z, Chen H, Liu J, You J, Leung H, Han G (2015) Hybrid k-nearest neighbor classifier. IEEE Trans Cybern 46(6):1263–1275
Vinoth R, Jayachandran A, Balaji M, Srinivasan R (2014) A hybrid text classification approach using KNN and SVM. Int J Adv Found Res Comput (IJAFRC) 1(3):20–26
Gill NS, Mittal P (2016) Computational hybrid model, with two level classification using SVM and neural network for predicting diabetes disease. J Theor Appl Info Technol 87(1):1–10
Aci M, İnan C, Avci M (2010) A hybrid classification method of k nearest neighbor, Bayesian methods and genetic algorithm. Expert Syst Appl 37(7):5061–5067
Miloud-Aouidate A, Baba-Ali AR (2012) A hybrid KNN-ant colony optimization algorithm for prototype selection. Laboratory of Robotics, Parallelism and Embedded University of Sciences and Technology Houari Boumediene, USTHB, T. Huang et al. (Eds.): ICONIP 2012, © Springer-Verlag Berlin Heidelberg
Sahu SK, Kumar P, Singh AP (2018) Modified K-NN algorithm for classification problems with improved accuracy. Int J Inf Technol 10(1):65–70
Zhang S, Li X, Zong M, Zhu X, Cheng D (2017) Learning k for kNN classification. ACM Trans Intell Syst Technol 8(3):43
Rodriguez JJ, Kuncheva LI, Alonso CJ (2006) Rotation forest: a new classifier ensemble method. IEEE Trans Pattern Anal Mach Intell 28(10):1619–1630
Sumner M, Frank E, Hall M (2005) Speeding up logistic model tree induction. In: 9th European conference on principles and practice of knowledge discovery in databases
Wolpert DH (1992) Stacked generalization. Neural Netw 5(2):241–259
Hsu KW (2017) A theoretical analysis of why hybrid ensembles work. Hindawi Comput Intell Neurosci. https://doi.org/10.1155/2017/1930702
Salunkhe UR, Mali SN (2016) Classifier ensemble design for imbalanced data classification: a hybrid approach. In: International conference on computational modelling and security (CMS)
Huang MW, Chen CW, Lin WC, Ke SW, Tsai CF (2017) SVM and SVM ensembles in breast cancer. PLOS ONE 12(1):e0161501
Huda S, Yearwood J, Jelinek HF, Hassan MM, Fortin G, Buckland M (2016) A hybrid feature selection with ensemble classification for imbalanced access. IEEE Access 4:9145–9154
Chen Y, Zhao X, Lin Z (2014) Optimizing subspace SVM ensemble for hyperspectral imagery classification. IEEE J Sel Topics Appl Earth Observations Remote Sens 7(4):1295–1305
Campos R, Canuto S, Salles T, de Sá CC, Gonçalves MA (2017) Stacking bagged and boosted forests for effective automated classification. In: Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval
El Bialy R, Salama MA, Karam O (2016) An ensemble model for heart disease data sets: a generalized model. In: Proceedings of the 10th International Conference on Informatics and Systems, INFOS, May 09–11. ACM
Tang J, Alelyani S, Liu H (2015) Data classification: algorithms and applications, data mining and knowledge discovery series. CRC Press, Boca Raton
Agaoglu M (2016) Predicting instructor performance using data mining techniques in higher education. IEEE Access 4:2379–2387
Cite this article
Nair, P., Khatri, N. & Kashyap, I. A novel technique: ensemble hybrid 1NN model using stacking approach. Int. j. inf. tecnol. 12, 683–689 (2020). https://doi.org/10.1007/s41870-018-0109-0