1 Introduction

The process of finding interesting patterns in enormous amounts of data is called data mining. Such rich and fascinating patterns can be valuable to large businesses for making smart decisions: they help in improving customer relationships, developing marketing policies, increasing sales, and reducing costs.

Data mining is a multidisciplinary field that combines statistics, machine learning, artificial intelligence, and database technologies to make predictions from large data repositories. Data mining methods such as association, classification, and clustering can be applied to various kinds of data, such as database data, transactional data, and data warehouse data.

The focus of this paper is on one of the techniques of data mining, called classification. Classification is the process of assigning objects to predefined categories. The classification process has two steps: the first is the training (or learning) step, in which a classification model, also known as a classifier, learns the correlations between the class labels and the features in a given dataset. In the second step, the classification model is supplied with test data to evaluate its performance. Classification is useful in various application areas such as spam filtering, fraud detection, target marketing, customer attraction, customer retention, performance prediction, manufacturing, and medical diagnosis, so there is a great need in the research field to improve classifier accuracy. There are many classification models, and much work has been carried out to improve the efficiency of these traditional models. Nearest neighbour, one of the most popular classical classification models, has been chosen for this paper.

Using ensemble methods to improve classification is one of the active research fields in machine learning. An ensemble blends a series of k trained models M1, M2, M3, …, Mk with the aim of producing an improved composite classification model M* [1]. In this paper, a novel technique is developed to improve the efficiency of the nearest neighbour model through an ensemble method.

2 Related work

Yu et al. [2] address special dataset problems such as imbalanced data and sparse data using HBKNN. To further tackle the noisy-data problem in high-dimensional datasets, they propose an ensemble method combining random subspaces with HBKNN (RS-HBKNN), which outperforms most classification approaches.

Vinoth et al. [3] propose a text-document classifier that incorporates the kNN classifier with the support vector machine (SVM) classifier. The proposed SVM-KNN approach minimises the limitations in classification accuracy: the training data from the diverse classes is reduced and utilised by the SVM, and the resulting support vectors from the different classes are provided as learning data for the kNN classification algorithm, where the nearest-centroid distance function is used instead of the Euclidean function, minimising the computation time for distance calculations.

Mittal and Gill [4] propose a hybrid model for efficient diabetes prediction. Hidden patterns are obtained by applying feature selection to the dataset, and a two-layered classification is then applied to the refined data, hybridising an SVM classifier with a neural network classifier. The proposed system is compared with a few other classifiers and achieves good accuracy.

Aci et al. [5] aim to eliminate data that are difficult to train on. Three classification techniques (k-nearest neighbour, Bayesian methods, and a genetic algorithm) are combined to produce a new dataset from the original data. The method is tested on various datasets, and the new data gives better classification accuracy than the old data.

Miloud-Aouidate and Baba-Ali [6] hybridise kNN with ant colony optimisation using a condensing approach for kNN. Condensing significantly reduces the number of instances in the dataset while keeping the accuracy obtained on the condensed training set very close to that of the complete training set. Together, the condensed kNN and the optimisation approach form a novel algorithm that outperforms standard kNN and other condensed kNN algorithms.

3 Classification approaches

In this study, classification models (nearest neighbour, rotation forest, and simple logistic) are hybridised using the ensemble method called stacking.

3.1 Nearest neighbour

The k-nearest neighbour (k-NN) classifier is an instance-based learner that compares an unknown instance with the training instances most similar to it. When an unknown instance is given, the k-NN classifier searches for the k training instances closest to it. In this paper the value of k is 1, i.e. only 1-NN is considered. Closeness is measured by a distance metric such as the Euclidean, Manhattan, or Minkowski distance; the Euclidean distance is used here. For two points p = (p1, p2, …, pn) and q = (q1, q2, …, qn), the Euclidean distance is given by [7, 8]:

$$ d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} $$
(1)
$$ d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + (p_3 - q_3)^2} $$
(2)
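To make this concrete, the following is a minimal Python sketch of 1-NN prediction using Eq. (1); the toy data and names are illustrative, not drawn from the paper's datasets:

```python
import numpy as np

def euclidean(p, q):
    """Euclidean distance between feature vectors p and q, as in Eq. (1)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

def predict_1nn(X_train, y_train, x):
    """Return the label of the single closest training instance (k = 1)."""
    distances = [euclidean(row, x) for row in X_train]
    return y_train[int(np.argmin(distances))]

# Illustrative toy data
X_train = [[1.0, 2.0], [3.0, 4.0], [5.0, 1.0]]
y_train = ["a", "b", "a"]
print(predict_1nn(X_train, y_train, [2.9, 3.8]))  # -> "b"
```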

3.2 Rotation forest

The rotation forest classifier produces an ensemble of classifiers. The dataset's features are split into k subsets, principal component analysis (PCA) is applied to each subset, and the data is transformed into the new space of extracted features; a decision tree is then learned on the transformed data. Different feature splits lead to different rotations, and in this way a diverse set of classifiers is obtained. Because all principal components are retained, the information about the spread of the data is completely preserved in the new feature space [9], which is how individually accurate classifiers are built. The aim of the rotation forest classifier is thus to obtain diverse and accurate classifiers concurrently. The method is called rotation forest because principal component analysis is a rotation of the coordinate axes and the base classifier is a decision tree [9].
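The following is a simplified sketch of this rotation idea in Python, assuming scikit-learn; it keeps only the PCA-per-feature-subset rotation and the majority vote, and omits the bootstrap sampling and class-subset steps of the full algorithm in [9]:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

class SimpleRotationForest:
    """Simplified rotation-forest sketch, not the full algorithm of [9]."""

    def __init__(self, n_trees=10, n_subsets=2, seed=0):
        self.n_trees = n_trees
        self.n_subsets = n_subsets
        self.rng = np.random.default_rng(seed)
        self.models = []  # (feature order, fitted PCAs, tree)

    def _rotate(self, X, order, pcas=None):
        # Split the (permuted) feature indices into subsets, apply PCA
        # to each subset, and concatenate the rotated parts.
        parts, fitted = [], []
        for i, idx in enumerate(np.array_split(order, self.n_subsets)):
            pca = pcas[i] if pcas is not None else PCA().fit(X[:, idx])
            parts.append(pca.transform(X[:, idx]))
            fitted.append(pca)
        return np.hstack(parts), fitted

    def fit(self, X, y):
        for _ in range(self.n_trees):
            order = self.rng.permutation(X.shape[1])  # random feature split
            X_rot, pcas = self._rotate(X, order)
            tree = DecisionTreeClassifier(random_state=0).fit(X_rot, y)
            self.models.append((order, pcas, tree))
        return self

    def predict(self, X):
        # Majority vote across the trees of the ensemble.
        votes = np.array([tree.predict(self._rotate(X, order, pcas)[0])
                          for order, pcas, tree in self.models])
        majority = []
        for column in votes.T:
            values, counts = np.unique(column, return_counts=True)
            majority.append(values[np.argmax(counts)])
        return np.array(majority)

X, y = load_iris(return_X_y=True)
print((SimpleRotationForest(n_trees=5).fit(X, y).predict(X) == y).mean())
```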

3.3 Simple logistic regression

The simple logistic model is a supervised classification model. It is essentially a linear classifier that uses computed scores and weights to forecast the class label. A score is the numerical correspondent of a particular categorical attribute, and the weights are the weightings corresponding to a particular target. The scores and weights are multiplied to form logits, which are then passed to the normalised exponential (softmax) function, which returns the probabilities of the target classes. The class with the highest probability is the predicted class for the given unknown instance. The normalised exponential function is defined as [10]:

$$ \sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \quad \text{for } j = 1, \ldots, K $$
(3)

where z is a K-dimensional vector of real-valued logits and σ(z) is a K-dimensional vector of values in the range 0–1 that sum to 1. The simple logistic classifier is selected as the meta-classifier for stacking, i.e. for the second phase of the stacking process. The meta-classifier plays an important role in obtaining good results with the proposed hybrid model.
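A direct Python transcription of Eq. (3); subtracting max(z) before exponentiating is a standard numerical-stability trick that leaves the result unchanged:

```python
import numpy as np

def softmax(z):
    """Normalised exponential function of Eq. (3)."""
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities summing to 1
```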

3.4 Ensemble stacking

Stacking is a two-phase classification process. In the first phase, the base classifiers c1, c2, …, ci are trained on the dataset using j-fold cross-validation and produce the outputs y1, y2, …, yi; these outputs then become the input to the meta-classifier, which forms the second phase of stacking. The meta-classifier minimises the error in order to optimally combine the base classifiers. This process is repeated over the k cross-validation folds to obtain the final stacked classification model [11–18] (Fig. 1).

Fig. 1 The stacking process
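A minimal sketch of the two phases in Python, assuming scikit-learn; the base learners and the iris data here are illustrative stand-ins, not the paper's configuration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Phase 1: out-of-fold predictions of each base classifier, obtained
# with j-fold cross-validation, become the meta-level features.
base = [KNeighborsClassifier(n_neighbors=1),
        DecisionTreeClassifier(random_state=0)]
meta_X = np.column_stack([
    cross_val_predict(clf, X, y, cv=10, method="predict_proba")
    for clf in base
])

# Phase 2: the meta-classifier is trained on the base-level outputs.
meta = LogisticRegression(max_iter=1000).fit(meta_X, y)
```

Using out-of-fold predictions in phase 1 prevents the meta-classifier from being trained on outputs the base classifiers produced for instances they had already seen.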

4 Working of proposed hybrid model

In this paper, a unique hybridisation is proposed in which the 1-NN model (k-NN with parameter value k = 1) and the rotation forest model are taken as the base classification models and the simple logistic model serves as the meta-classifier in stacking. The basic working of the proposed hybrid model is as follows (a code sketch is given after the list):

  • A training dataset D is sent separately to each of the base classification models (1-NN and rotation forest) in stacking phase 1.

  • The outputs of the two classification models of stacking phase 1 are combined to form a single new dataset Dh.

  • This dataset acts as the input training data for the meta-classification model (simple logistic model) of stacking phase 2, which produces the final output H, the final prediction of the proposed hybrid model.
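A compact way to express this pipeline is scikit-learn's StackingClassifier; this sketch is an approximation, not the authors' implementation, since scikit-learn has no built-in rotation forest (a plain decision tree stands in for that base learner here):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Base classifiers of stacking phase 1; a decision tree stands in for
# rotation forest, which scikit-learn does not provide.
base = [("1nn", KNeighborsClassifier(n_neighbors=1)),
        ("rot_stand_in", DecisionTreeClassifier(random_state=0))]

# Simple logistic model as the phase-2 meta-classifier.
hybrid = StackingClassifier(estimators=base,
                            final_estimator=LogisticRegression(max_iter=1000),
                            cv=10)  # j-fold CV for the base-level outputs

X, y = load_breast_cancer(return_X_y=True)
print(cross_val_score(hybrid, X, y, cv=10).mean())
```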

The proposed hybrid model is compared with other ensemble methods, namely bagging, boosting, and LogitBoost, and it is also compared with six other classification models, namely Naive Bayes, Bayesian network, simple CART (classification and regression tree), J48, decision tree, and decision table, on 13 datasets taken from the UCI repository [19] (Fig. 2).

Fig. 2 Proposed hybrid model

5 Datasets used

Thirteen datasets taken from the UCI data repository are used in this paper. The datasets are a mix of all kinds of data: some are categorical in nature and some are numeric, and there are dichotomous (two-class) problems as well as multi-class problems (Table 1).

Table 1 Datasets taken from UCI repository

6 Results and discussions

6.1 Computation of performance of classification models

There are certain measures for evaluating the performance of a classification model. Both binary and multi-class classification problems are considered in this paper, and all the classification algorithms selected in this study can handle multi-class problems.

The quality of a model is treated as good if the maximum number of items is correctly classified. The true-positive and true-negative counts tell us when the classifier gets things right, while the false-positive and false-negative counts tell us when it gets things wrong. Several other measures are derived from these counts, such as the TP rate, FP rate, F-measure, precision, and recall. True positives (Tp): cases in which the classification model correctly predicts the label as positive. True negatives (Tn): cases in which the model correctly predicts the label as negative. False positives (Fp): cases in which the model incorrectly predicts the label as positive. False negatives (Fn): cases in which the model incorrectly predicts the label as negative [20].

The performance of the classification is obtained by calculating accuracy, precision, and recall. Accuracy measures the ratio of correct predictions (Tp and Tn) to all predictions. Precision quantifies how many of the instances the model predicts as positive are actually positive, and recall measures how many of the actual positives are correctly predicted as positive. The two measures pull in opposite directions, so comparing models on both at once is difficult; a single measure combining precision and recall, called the F-measure, can therefore be computed [20].

$$ \text{Accuracy} = \frac{T_p + T_n}{T_p + T_n + F_p + F_n} $$
(4)
$$ \text{Precision} = \frac{T_p}{T_p + F_p} $$
(5)
$$ \text{Recall} = \frac{T_p}{T_p + F_n} $$
(6)
$$ \text{F-measure} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} $$
(7)

F-measure is the harmonic mean of precision and recall.
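The following is a direct Python transcription of Eqs. (4)–(7); the confusion-matrix counts in the example are purely illustrative:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute Eqs. (4)-(7) from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# Illustrative counts, not results from the paper:
print(classification_metrics(tp=90, tn=80, fp=10, fn=20))
# -> (0.85, 0.9, 0.818..., 0.857...)
```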

6.2 Experimental results

The results of this experiment are shown in two tables. Table 2 shows that the proposed hybrid model significantly outperforms the traditional nearest neighbour model in terms of accuracy in most cases.

Table 2 Comparison between the proposed hybrid model and the standard k-nearest neighbour model in terms of accuracy

The following is a graphical representation of the comparison between the proposed hybrid model and the standard kNN; for both models, k = 1 (Fig. 3).

Fig. 3 Comparison between the proposed hybrid model and the standard kNN

For each dataset in the graph, it is observed that the proposed hybrid method is significantly better than the standard 1-NN model.

Table 3 shows the comparison between the proposed hybrid model and nine other standard classification models; of these nine, three are ensemble-based models: bagging, AdaBoost, and LogitBoost. The same 13 datasets mentioned above are used for this comparison. On 9 of the 13 datasets the proposed hybrid model shows the highest accuracy; in 2 cases it ties for the highest value with two other models in this study, and in only 2 cases does the proposed system have a lower accuracy. The remaining models used here are Naive Bayes, Bayesian network, decision table, decision tree, J48, and simple CART.

Table 3 Comparison of the proposed hybrid model and other classification models

The experiment was aimed at comparing the performance of the proposed stacking-based hybrid model against the standard kNN model, and it was also compared with six other standard models as well as three ensemble models. A paired t-test was performed to compare the proposed hybrid method with the other standard methods in terms of accuracy, F-measure, and ranking. The table also shows the accuracies of the models and their standard deviation values. The significance level is taken as 0.05.
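A paired t-test of this kind can be run with SciPy; the per-dataset accuracy values below are placeholders for illustration, not the paper's results:

```python
from scipy.stats import ttest_rel

# Illustrative per-dataset accuracies for 13 datasets (placeholders)
hybrid   = [0.95, 0.89, 0.92, 0.97, 0.88, 0.93, 0.90,
            0.96, 0.91, 0.94, 0.87, 0.92, 0.95]
standard = [0.91, 0.85, 0.90, 0.93, 0.86, 0.90, 0.88,
            0.92, 0.89, 0.91, 0.85, 0.89, 0.93]

t_stat, p_value = ttest_rel(hybrid, standard)  # paired t-test
print("significant at 0.05" if p_value < 0.05 else "not significant")
```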

Below is a graphical presentation of the above-mentioned comparison between the proposed hybrid model and the other classification models in terms of accuracy; the proposed hybrid model scores the highest in all cases (Fig. 4).

Fig. 4 Performance of the proposed hybrid model compared to other classification models

Figure 5 shows the classification accuracy and standard deviation of the proposed hybrid model and the other standard models. The F-measure of the proposed hybrid model scores highest in most cases (Fig. 6); the last row shows the win/tie/loss counts of all the models in terms of F-measure, and it is observed that none of the models obtains any wins against the proposed hybrid method. A model is ranked by the number of times it beats the other models under j-fold cross-validation: Table 4 shows the ranking of the classification models according to the difference between the number of times each model is significantly better and significantly worse than the other models. The proposed hybrid model has the maximum wins in the ranking.

Fig. 5 Classification accuracy and standard deviation of the proposed hybrid model and other standard models. *Proposed model is significantly better than the other models

Fig. 6 F-measure of all the models against the proposed hybrid model

Table 4 Ranking of the classification models

7 Conclusion and future scope

In this paper, a novel technique is implemented using an ensemble method, and the proposed hybrid model is observed to give very good results compared with other ensemble models and classic models when tested on 13 datasets from the UCI repository. The method proved to be good not only for binary-class problems but also for multi-class problems. This hybridisation of three classifiers, with 1-NN and rotation forest as base classifiers and simple logistic regression as the meta-classifier, built with the stacking ensemble method, has improved the accuracy of the standard 1-NN significantly. As future work, this technique can be applied with other classification algorithms as well.