
1 Introduction

Adaptive learning is more effective than traditional non-adaptive learning algorithms and is better suited to large-scale data [3]. In an expert system, the opinions of all the experts are taken into consideration before the final decision is made; this is the principle behind ensemble learning [7]. In applications where the data are too large for a single classifier to analyse, ensemble systems partition the data into subsets, each classifier works on one subset, and the results are combined using existing approaches such as majority voting and weighted majority voting [5]. There are two ways of combining classifiers: classifier fusion and classifier selection [6]. In the classifier fusion approach, all individual classifiers are trained on the whole dataset; examples include bagging predictors and boosting [8]. In the classifier selection approach, each individual classifier is expected to perform best on some part of the total dataset. Any ensemble system has two major components: the first builds a diverse ensemble, and the second combines the decisions of the individual classifiers.

In the real world, data consist of classes with overlapping boundaries. Excessive training can separate such classes, but it results in overfitting and hence misclassification of the test data, whereas learning overly generalized boundaries avoids overfitting but misclassifies the overlapping patterns. We therefore opt to use clustering, which makes it easier to learn the decision boundaries. Organizing data into groups is one of the fundamental methods of learning.

For each problem, let \(x = \{x_1, x_2, \ldots, x_n\}\) be a set of input vectors in \(R^p\) and \(y = \{y_1, y_2, \ldots, y_n\}\) the corresponding outputs of a system S, where S maps x to y:

$$\begin{aligned} y=S(x) \end{aligned}$$
(1)

Here, \(x = (x_1, x_2, \ldots, x_p) \in R^p\) denotes an input vector and \(y = (y_1, y_2, \ldots, y_r) \in R^r\) the corresponding output vector of the system. The purpose of our experiment is to identify a classification system that builds S to explain the given input–output data (x, y).

We present the adaptive cluster-based ensemble learner (ACEL) in the following sections. Section 2 reviews the related literature. Section 3 describes the proposed approach. Section 4 summarizes the experimental setup. Section 5 discusses the results. Section 6 presents the conclusions and future work in this field.

2 Literature Review

Clusters are dense regions in feature space separated by regions of low density. Several Bayesian approaches, such as undirected graphical models, are used for data clustering. An ensemble classifier combines the results of various diverse base classifiers [13]. Diversity is a property used to characterize ensemble classifiers: greater diversity is observed when the incorrect decisions made by one classifier are compensated by the other classifiers, which results in a more uniform distribution of errors. To combine the results of base classifiers, various methods have been proposed, including majority voting, weighted majority voting and decision templates. Among the existing approaches to ensemble learning, boosting and bagging have been used most extensively [9, 10]. In bagging, the base classifiers learn on data subsets drawn randomly from the entire training set, and the results are combined by majority voting [2]. In boosting, the data instances are re-sampled so that new learners concentrate on the instances that were difficult to classify for the previous members of the ensemble; this mechanism encourages the construction of complementary learners.

Recently, a novel cluster ensemble technique, CE-GMDH, was proposed; it comprises three components: an initial model, a transfer function and an external criterion [11]. Experiments with CE-GMDH were performed on both artificial and real datasets. Yu et al. [14] suggested a feature-selection-based semi-supervised cluster ensemble technique for clustering tumour instances from biomolecular datasets. An incremental semi-supervised clustering ensemble technique with a random subspace method, constraint propagation and incremental ensemble member selection was proposed in [15]. Alves et al. [1] developed an ensembling methodology using multiple particle swarm optimization and demonstrated its ability to solve problems in computational biology.

3 Proposed Approach

In our proposed approach, we perform homogeneous clustering, i.e. each cluster contains patterns belonging to a single class only. A fixed number of rules per class (here, one rule per class) is used, and every class is represented by a combination of rules. To generate the initial rule base, the training data are clustered using three different categories of clustering method: k-means, fuzzy c-means (FCM) and particle swarm optimization (PSO) based clustering [4]. Three clustering algorithms of different natures, which cluster the data using different approaches, are used so that different aspects of the data are captured during the learning process.
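As an illustration, the per-class rule-base construction could be sketched as follows. This is a minimal Python/scikit-learn sketch, not the authors' implementation (which is in MATLAB); the function names and the Iris example are assumptions, and k-means stands in for any of the three clustering methods.

```python
# Minimal sketch of homogeneous, per-class clustering for the initial rule base:
# one cluster (and hence one rule) per class. Illustrative only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

def build_rule_base(X, y, rules_per_class=1):
    """Cluster the patterns of each class separately; each cluster centroid
    becomes one rule labelled with its class."""
    centroids, labels = [], []
    for cls in np.unique(y):
        Xc = X[y == cls]                                   # patterns of a single class only
        km = KMeans(n_clusters=rules_per_class, n_init=10).fit(Xc)
        centroids.extend(km.cluster_centers_)
        labels.extend([cls] * rules_per_class)
    return np.array(centroids), np.array(labels)

def predict_nearest_rule(X, centroids, labels):
    """Assign each pattern the class label of its nearest rule centroid."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return labels[np.argmin(d, axis=1)]

X, y = load_iris(return_X_y=True)
C, L = build_rule_base(X, y)
print(predict_nearest_rule(X[:5], C, L))
```

Replacing KMeans with an FCM or PSO-based clustering routine in the same loop would yield the other two diverse base classifiers.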

Every single cluster represents a dense region of the input data, which is represented by the corresponding cluster centroid. Each cluster is then transformed into a rule, after which an adaptive rule-tuning process is performed to minimize the error function. The proposed cluster-based classification thus produces three diverse base classifiers, whose results are combined by majority voting to predict the final classification decision. The majority voting rule can be written as

$$\begin{aligned} \sum _{t=1}^{T} d_{t,J}(x) = \max _{j = 1, 2, \ldots , c} \sum _{t=1}^{T} d_{t,j}(x) \end{aligned}$$
(2)

Here, \(d_{t,j}(x) = 1\) if the tth base classifier assigns pattern x to class j and 0 otherwise, T is the number of base classifiers, c is the number of classes and J is the class that receives the most votes. When the decisions of the individual classifiers are independent of one another, the majority voting combination can be expected to improve performance and accuracy. The classification results are compared with the standard algorithms J48, Adaboost, SMO, Naive Bayes and Random Forest. Figure 1 depicts the proposed approach.
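The following small sketch illustrates the hard majority voting of Eq. (2): each of the T base classifiers casts one vote per pattern and the class with the most votes wins (ties broken here by the smallest class label). It is purely illustrative and not the authors' code.

```python
# Hard majority voting over the decisions of T base classifiers.
import numpy as np

def majority_vote(predictions):
    """predictions: (T, n_samples) array of class labels from T base classifiers."""
    predictions = np.asarray(predictions)
    voted = []
    for column in predictions.T:                      # votes for one test pattern
        classes, counts = np.unique(column, return_counts=True)
        voted.append(classes[np.argmax(counts)])      # class receiving the most votes
    return np.array(voted)

# e.g. decisions of the k-means-, FCM- and PSO-based classifiers on four patterns
print(majority_vote([[0, 1, 2, 1],
                     [0, 1, 1, 1],
                     [2, 1, 2, 0]]))                  # -> [0 1 2 1]
```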

Fig. 1 Proposed approach

Adaptive learning is stopped when the training error (misclassification) reaches an acceptable level. The main goal of rule tuning is to remove irrelevant data associated with a cluster and to include the data that actually belong to it. Rule tuning potentially increases the predictive power of the rules and helps avoid overfitting to the training data. The rule-tuning procedure, and hence the construction of the final rules, terminates as soon as the misclassification reaches zero or the maximum number of iterations is completed.

The strategy for the rule-tuning procedure is based on the fact that the centroid minimizing the sum of squared errors (SSE) of a cluster is the mean of the points in that cluster.

$$\begin{aligned} SSE = \sum _{i=1}^K\sum _{x\in {C_i} }{(c_i - x)}^2 \end{aligned}$$
(3)

Let \(C_i\) denote the ith cluster, x a point in \(C_i\) and \(c_i\) the mean of the ith cluster. The centroid that minimizes the SSE can be found by differentiating the SSE with respect to the kth centroid \(c_k\) and setting the derivative to zero; here, \(m_k\) denotes the number of objects in the kth cluster.

$$\begin{aligned} \begin{array}{l} \frac{\partial }{\partial c_k} SSE =\frac{\partial }{\partial c_k} \sum \limits _{i=1}^K\sum \limits _{x\in {C_i} } {(c_i - x)}^2 \\ \frac{\partial }{\partial c_k} SSE = \sum \limits _{i=1}^K\sum \limits _{x\in {C_i} }\frac{\partial }{\partial c_k} {(c_i - x)}^2 \\ \frac{\partial }{\partial c_k} SSE = \sum \limits _{x\in {C_k} }{2(c_k - x)} \\ \end{array} \end{aligned}$$
(4)

Setting the derivative of the sum of squared errors (SSE) with respect to \(c_k\) to zero,

$$\begin{aligned} \frac{\partial }{\partial c_k} SSE =0 \end{aligned}$$
(5)

Now combining (4) and (5)

$$\begin{aligned} \sum _{x\in {C_k} }{2(c_k - x)} = 0 \end{aligned}$$
(6)
$$\begin{aligned} m_kc_k = \sum _{x\in {C_k} } {x} \end{aligned}$$
(7)
$$\begin{aligned} c_k= \frac{1}{m_k}\sum _{x\in {C_k} } {x} \end{aligned}$$
(8)

Hence, the best centroid for minimizing the SSE of a cluster is the mean of all the points in that cluster.
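A quick numerical check of Eqs. (3)–(8) can be made as follows; the toy cluster below is an assumption chosen purely for illustration.

```python
# For any cluster, the SSE is minimized when the centroid is the mean of its points.
import numpy as np

def sse(points, centroid):
    return np.sum((points - centroid) ** 2)

rng = np.random.default_rng(0)
cluster = rng.normal(size=(50, 3))            # 50 points in R^3
mean = cluster.mean(axis=0)

print(sse(cluster, mean))                     # minimum SSE
print(sse(cluster, mean + 0.1))               # any shifted centroid yields a larger SSE
```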

We follow the same principle and tune the cluster centres accordingly in order to minimize the misclassification. The training and prediction methodology of rule tuning for adaptive learning is presented in Algorithm 1.

Algorithm 1 Rule tuning for adaptive learning
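Since Algorithm 1 is presented as a figure, the loop below is only a rough sketch consistent with the textual description above, not the authors' algorithm: each rule centre is repeatedly re-estimated as the mean of the training points of its own class that it currently captures, and tuning stops once the training misclassification reaches zero or the maximum number of iterations is reached.

```python
# Rough, assumed sketch of adaptive rule tuning for nearest-centroid rules.
import numpy as np

def tune_rules(X, y, centroids, labels, max_iter=50):
    centroids = centroids.copy()
    for _ in range(max_iter):
        # classify every training pattern by its nearest rule centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        nearest = np.argmin(d, axis=1)
        pred = labels[nearest]
        if np.all(pred == y):                            # misclassification reached zero
            break
        for k, cls in enumerate(labels):
            # keep only the captured points that truly belong to the rule's class
            captured = X[(nearest == k) & (y == cls)]
            if len(captured) > 0:
                centroids[k] = captured.mean(axis=0)     # SSE-minimizing centre, Eq. (8)
    return centroids
```

Combined with the earlier rule-base sketch, this yields one tuned cluster-based classifier; three such classifiers (one per clustering algorithm) are then combined by majority voting.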

4 Experimental Setup

In order to validate the proposed approach, experiments were conducted on benchmark datasets obtained from the UCI machine learning repository [12]. The ACEL algorithm has been applied to seven benchmark datasets: Iris, Thyroid, Balance Scale, Vertebral Column, Haberman's Survival, Liver Disorder and Diabetes.

The results of the proposed approach have been compared with the benchmark algorithms J48, Adaboost, SMO, Naive Bayes and Random Forest. Table 1 gives a summary of the datasets. The experiments were performed in MATLAB.
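For illustration, a comparable baseline evaluation can be set up with scikit-learn analogues of the comparison algorithms (the paper's own experiments were run in MATLAB); the dataset, split ratio and default hyperparameters below are assumptions.

```python
# Illustrative accuracy comparison of the benchmark algorithms on one UCI dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier                 # analogue of J48
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.svm import SVC                                     # analogue of SMO
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

baselines = {
    "J48 (decision tree)": DecisionTreeClassifier(),
    "Adaboost": AdaBoostClassifier(),
    "SMO (SVM)": SVC(),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(),
}
for name, clf in baselines.items():
    print(name, clf.fit(X_tr, y_tr).score(X_te, y_te))          # classification accuracy
```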

Table 1 Datasets used
Table 2 Results in terms of classification accuracy

5 Results and Discussion

We have trained our learner ACEL on seven biological datasets taken from the UCI machine learning repository [12]. To test the cluster-based ensemble learner, we compare the classification results of ACEL with those of the standard algorithms J48, Adaboost, SMO, Naive Bayes and Random Forest. Table 2 gives the accuracy for each dataset.

For the Thyroid dataset, ACEL gives 93.46% accuracy, performing similarly to the state-of-the-art algorithms J48 and Adaboost and outperforming SMO by almost 5%. For the Iris dataset, ACEL outperformed the other algorithms and matched SMO with an accuracy of 96.67%. For the Liver Disorder dataset, ACEL performs better than Naive Bayes and SMO by at least 15%. On Haberman's Survival, ACEL achieved 72.84% accuracy, better than Random Forest by at least 8%. The algorithm gives an accuracy of 84.83% on the Vertebral Column dataset, which is better than Adaboost by almost 9% and SMO by more than 12%. On the Balance Scale dataset, our proposed learner performs better than the J48 and Adaboost algorithms.

The results obtained with the ACEL algorithm suggest that clustering the datasets, transforming the clusters into rules and adaptively tuning these rules optimizes the classification decision. The experiments performed on the benchmark datasets show that ACEL classifies them effectively.

Fig. 2 Confusion matrices

Figure 2 shows the confusion matrices obtained when the proposed ACEL algorithm is applied to each dataset. The actual and predicted labels give the number of instances that are correctly classified or misclassified. A confusion matrix describes the performance of a classification model: high counts of true positives (TP) and true negatives (TN) indicate that the model works well. In the figure, the counts of TP and TN are high for Iris, Thyroid, Balance Scale, Haberman's Survival and Vertebral Column, suggesting high performance of ACEL.
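The confusion matrices of Fig. 2 can be computed from the actual and predicted labels as sketched below; the labels in this snippet are made up for illustration.

```python
# Building a confusion matrix from actual and predicted labels.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2, 1]
print(confusion_matrix(y_true, y_pred))
# diagonal entries count correctly classified instances per class,
# off-diagonal entries count misclassifications
```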

6 Conclusions and Future Work

We have presented a novel adaptive cluster-based ensemble learner (ACEL) based on the principle of cluster ensemble learning combined with rule tuning, which gives better adaptability and improved accuracy. The proposed algorithm has been evaluated on biological benchmark datasets. The experiments show that our ensemble approach gives results comparable to individual learners and that the adaptive cluster ensemble learning process with tuning improves accuracy considerably. In future research, we would like to focus on finding the optimal number of clusters and on other critical issues in ensemble classification, such as the integration mechanism and robustness.