1 Introduction

The quantity and variety of data generated worldwide by computers, mobile phones, and sensors have grown at an unprecedented rate. As computing technology has evolved, so has the drive to store every type of data, giving rise to the so-called Big Data. As the volume of data used to build a predictive model grows, so does the complexity of training that model. Building actionable predictive models over large-scale unstructured data sets is therefore a definitive Big Data problem. Predictive learning models try to discover patterns in training data and assign new data instances to the correct output value. To handle unstructured large-scale data sets efficiently, it is critical to develop new machine learning methods that combine boosting and classification algorithms.

Extreme learning machine (ELM) was proposed by Huang et al. (2006b) based on generalized single-hidden layer feedforward networks (SLFNs). The main characteristics of ELM are a short training time compared to traditional gradient-based learning methods, good generalization to unseen examples with multi-class labels, and a nearly parameter-free formulation with randomly generated hidden nodes. The ELM algorithm is used in many different areas, including document classification (Zhao et al. 2011), bioinformatics (Wang and Wang 2006), and multimedia recognition (Zong and Huang 2011; Lan et al. 2013).

In recent years, much computational intelligence research has been devoted to building predictive models on distributed and parallel frameworks. In this research, the proposed learning model creates data chunks of varying size and builds a bag of classifier functions by training the ELM algorithm with the AdaBoost method on these arbitrarily chosen sub-data sets for large-scale prediction. By creating data chunks from the training data set using the MapReduce paradigm, each subset of the training data set is used to derive a set of ELM ensembles that is combined into a single global classifier function.

The main objective of this work is to train large-scale data sets using ELM and AdaBoost. Another objective is to keep the model's classification performance the same as, or close to, that of the conventional ELM method. Conventional ELM training cannot be applied to large-scale data sets on a single computer because of its complexity. The experiments section is therefore split into two subsections: "commonly used data sets" in Sect. 5.1.1 and "large-scale data sets" in Sect. 5.1.2. Commonly used data sets are small enough to train on a single computer with the conventional ELM algorithm. We trained these data sets with both the conventional and the proposed methods to show how the classification performance of the proposed method changes. Classification performance results are shown in Sect. 5.3.

The contributions of this paper are as follows:

  • A generative MapReduce-based AdaBoosted ELM classification model is proposed, through which faster training of the classification model is achieved.

  • This research proposes a new learning method for AdaBoosted ELM that parallelizes training over large-scale data sets and reduces the computational time of the learning algorithm.

  • Training computations of the working nodes are independent of each other, thus minimizing data communication. Other approaches, including distributed Support Vector Machine training, need data communication to exchange support vectors (Lu et al. 2008; Sun and Fox 2012; Catak et al. 2013).

The rest of the paper is organized as follows: Sect. 2 briefly introduces earlier works related to our problem. Section 3 describes the ELM algorithm, AdaBoost and the MapReduce technique. Sections 4 and 5 present and evaluate the proposed learning model. Section 6 concludes the paper.

2 Related work

In this section, we give a general overview of the related literature. Section 2.1 reviews MapReduce-based learning methods in general, and Sect. 2.2 describes MapReduce-based ELM training methods.

2.1 Literature review overview

MapReduce-based learning algorithms over distributed data chunks have been studied by many researchers, and many different MapReduce-based learning solutions over arbitrarily partitioned data have been proposed recently. Some popular MapReduce-based solutions for training machine learning algorithms in the literature include the following. Panda et al. proposed a tree learning model based on a series of distributed computations, each implemented using the MapReduce model of distributed computation (Panda et al. 2009). Zhang et al. (2012) developed algorithms using MapReduce to perform parallel data joins on large-scale data sets. Sun et al. (2009) used batch-updating-based hierarchical clustering to reduce computational time and data communication; their approach uses co-occurrence-based feature selection to remove noisy features and decrease the dimension of the feature vectors. He et al. proposed a parallel density-based clustering algorithm (DBSCAN) and developed a partitioning strategy for large-scale non-indexed data with a four-stage MapReduce paradigm (He et al. 2011). Zhao et al. (2009) proposed a parallel k-means clustering algorithm based on MapReduce; their approach focuses on implementing k-means with the read-only convergence heuristic in the MapReduce pattern.

Table 1 The differences between proposed model and literature review

2.2 MapReduce-based ELM training methods

Sections 2.2.1, 2.2.2, 2.2.3, 2.2.4 and 2.2.5 describe five different MapReduce-based training methods for the ELM algorithm.

2.2.1 ELM\(^*\)

Xin et al. proposed a MapReduce-based ELM training method called ELM\(^*\) (Xin et al. 2014). The main idea behind this method is to decompose the matrix multiplications of ELM used to find the output weight vector. They show that the Moore–Penrose generalized inverse operator is the most expensive computational part of the algorithm. Since matrix multiplication can be divided into smaller parts, they propose an efficient implementation of the training phase that can manage massive data sets. The final output of this method is a single classifier function. They propose two versions of ELM\(^*\), naive and improved. In naive ELM\(^*\), the algorithm has two classes, Class Mapper and Class Reducer, each containing only one method. In improved ELM\(^*\), the calculation of the matrix multiplication is decomposed using the MapReduce framework, which decreases the computation and communication cost. In their experiments, they used synthetic data sets to evaluate the performance of the proposed algorithms within the MapReduce framework.

2.2.2 OS-ELM-based classification in hierarchical P2P network

Sun et al. proposed OS-ELM (Liang et al. 2006)-based distributed ensemble classification in P2P networks (Sun et al. 2011). They apply the incremental learning principle of OS-ELM to a hierarchical P2P network and propose two versions of the ensemble classifier: one-by-one ensemble classification and parallel ensemble classification. In the one-by-one learning method, each peer in turn updates the classifier with all the data, so this approach has a large network delay. In parallel ensemble learning, all the classifiers are learnt from all the data in a parallel manner. In contrast to ELM\(^*\), their experimental results are based on three different real data sets downloaded from the UCI repository.

2.2.3 Parallel online sequential ELM: POS-ELM

Wang et al. (2013) proposed the parallel online sequential extreme learning machine (POS-ELM) method. The main idea behind this approach is to analyze the dependency relationships and the matrix calculations of OS-ELM (Liang et al. 2006). Their experimental results are based on nine different real data sets downloaded from the UCI repository.

2.2.4 Distributed and kernelized ELM: DK-ELM

Bi et al. (2013) proposed a distributed and kernelized ELM (DK-ELM) based on MapReduce. The difference between ELM and kernelized ELM is that K-ELM applies kernels instead of random feature mappings. They provide a distributed implementation of the RBF kernel matrix calculation for massive data learning applications. Their experimental results are based on four different real data sets downloaded from the UCI repository and four synthetic data sets.

2.2.5 ELM-MapReduce

Chen et al. (2013) proposed a MapReduce-based ELM ensemble classifier called ELM-MapReduce for large-scale land cover classification of remote sensing data. Their approach contains two sequential phases: parallel training of multiple ELM classifiers and a voting mechanism. In the parallel training phase, each Map function trains an ELM classifier on a given training data set. In the second phase, the voting mechanism, a new MapReduce job is executed in which the test set is partitioned across the Map functions, each partition denoted \(data_j\). In the Reduce function of this phase, each \(data_j\) is predicted with every ELM classifier trained in the parallel training phase, and the final classification predictions are the output of the final Reduce function. Therefore, this approach has a high communication cost. Their experimental results are based on a synthetic remote sensing image as the training data.

2.3 The differences between proposed model and literature review

The main differences are:

  • In ELM\(^*\), matrix multiplication decomposition is used and each Map function is responsible for calculating part of the Moore–Penrose generalized inverse operation, so their method produces one single classifier. In the model proposed in our paper, each Reduce function produces an ensemble classifier based on the AdaBoost method, and the final output is a voting-based combination of the ensemble classifiers trained in each Reduce phase.

  • In OS-ELM-based classification in hierarchical P2P networks, POS-ELM and DK-ELM, the ensemble classifier combines multiple classifiers trained with data chunks. Each peer classifier is learned from the local data, so each peer produces a single ELM classifier. In our method, each node (or peer) produces an ensemble classifier to increase the classification accuracy.

  • In ELM-MapReduce, the ensemble classifier is built with two different MapReduce jobs. In the first MapReduce job, the approach produces a single ELM classifier in each Map function. In the second MapReduce job, the test set is partitioned across the Map functions, and the final predicted labels are produced by the voting mechanism over the ELM classifiers trained in the first MapReduce job. Our method does not include prediction; our aim is to create a final ensemble classifier in only one MapReduce job.

Table 1 shows the main differences among all the methods discussed. There are five columns: ensemble method, single-pass MapReduce, matrix multiplication, entire data set and network communication. The ensemble column shows whether the method builds a set of classifier functions (i.e., an ensemble model) to improve the accuracy of the final classification model; if an ensemble method is applied, the final model tends to achieve better accuracy (Kuncheva and Whitaker 2003). The single-pass MapReduce column shows whether the model avoids an iterative approach, so that the entire learning phase is performed in a single pass of the data through the job. The matrix multiplication column shows whether the hidden layer matrix is calculated in each Map function; this computation is compute intensive. The entire data set column shows whether each Map operation needs the entire data set to build the final classifier model. The network communication column shows whether each MapReduce job needs to communicate with another job; network communication negatively affects the training time of the algorithm.

3 Preliminaries

In this section, we introduce preliminary knowledge of ELM, AdaBoost and MapReduce briefly.

3.1 Extreme learning machine

ELM was originally proposed for single-hidden layer feedforward neural networks (Huang et al. 2006a, b) and was then extended to generalized single-hidden layer feedforward networks where the hidden layer need not be neuron-like (Huang and Chen 2005, 2006). The main advantage of the ELM classification algorithm is that it can be trained hundreds of times faster than a traditional neural network or support vector machine, since its input weights and hidden node biases are randomly created and its output layer weights can be calculated analytically using a least-squares method (Tang et al. 2015; Huang et al. 2008). The most noticeable feature of ELM is that its hidden layer parameters are selected randomly.

Given a set of training data \(\mathcal {D}=\{(\mathbf {x}_i, y_i)\mid i=1,\ldots ,n\}\), \(\mathbf {x}_i \in \mathbb {R}^p\), \(y_i \in \{1, 2,\ldots ,K\}\), sampled independently and identically distributed (i.i.d.) from some unknown distribution, the goal of a neural network is to learn a function \(f:\mathcal {X} \rightarrow \mathcal {Y}\), where \(\mathcal {X}\) is the instance space and \(\mathcal {Y}\) is the set of all possible labels. The output of a single-hidden layer feedforward neural network (SLFN) with N hidden nodes can be described as:

$$\begin{aligned} f_N(\mathbf {x}) = \sum _{i=1}^{N}\beta _iG(\mathbf {a}_i,b_i,\mathbf {x}) , \, \mathbf {x} \in \mathbb {R}^n, \, \mathbf {a}_i \in \mathbb {R}^n \end{aligned}$$
(1)

where \(\mathbf {a}_i\) and \(b_i\) are the learning parameters of hidden nodes and \(\beta _i\) is the weight connecting the ith hidden node to the output node.

The output function of ELM for generalized SLFNs can be identified by

$$\begin{aligned} f_N(\mathbf {x}) = \sum _{i=1}^{N}\beta _iG(\mathbf {a}_i,b_i,\mathbf {x}) = \mathbf {\beta } \times h(\mathbf {x}) \end{aligned}$$
(2)

For the binary classification applications, the decision function of ELM becomes

$$\begin{aligned} f_N(\mathbf {x}) = \mathrm{sign}\left( \sum _{i=1}^{N}\beta _iG(\mathbf {a}_i,b_i,\mathbf {x}) \right) = \mathrm{sign}\left( \mathbf {\beta } \times h(\mathbf {x}) \right) \end{aligned}$$
(3)

Equation 2 can be written in another form as:

$$\begin{aligned} H\beta =T \end{aligned}$$
(4)

where H and T are the hidden layer matrix and the output matrix, respectively. The hidden layer matrix can be described as:

$$\begin{aligned} H(\tilde{a},\tilde{b},\tilde{x})= \begin{bmatrix} G(a_1,b_1,x_1)&\cdots&G(a_L,b_L,x_1) \\ \vdots&\ddots&\vdots \\ G(a_1,b_1,x_N)&\cdots&G(a_L,b_L,x_N) \end{bmatrix}_{N \times L} \end{aligned}$$
(5)

where \(\tilde{a}=a_1,\ldots ,a_L\), \(\tilde{b}=b_1,\ldots ,b_L\), \(\tilde{x}=x_1,\ldots ,x_N\). The output matrix can be described as:

$$\begin{aligned} T= \begin{bmatrix} t_1 \ldots t_N \end{bmatrix}^T \end{aligned}$$
(6)

The hidden nodes of SLFNs can be randomly generated and are independent of the training data.
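To make Eqs. 2, 4 and 5 concrete, the minimal sketch below (our own illustration, not the authors' implementation) generates random hidden-node parameters, builds the hidden layer matrix H with a sigmoid activation, and solves \(H\beta = T\) with the Moore–Penrose pseudoinverse; the one-hot target matrix and all function names are assumptions.

```python
import numpy as np

def elm_train(X, T, n_hidden, rng=None):
    """Train a basic ELM: random hidden layer, least-squares output weights."""
    rng = np.random.default_rng(rng)
    n_features = X.shape[1]
    # Randomly generated hidden-node parameters a_i and b_i (Eq. 5)
    A = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))   # hidden layer matrix H (sigmoid activation)
    beta = np.linalg.pinv(H) @ T             # Moore-Penrose solution of H beta = T (Eq. 4)
    return A, b, beta

def elm_predict(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta                          # f_N(x) = beta * h(x)  (Eq. 2)
```

For a K-class problem, T would be the one-hot encoded label matrix and the predicted class the argmax of the returned scores.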

3.2 AdaBoost

AdaBoost (Freund and Schapire 1995) is a supervised learning algorithm designed to solve classification problems (Freund et al. 1999). The algorithm takes as input a training set \((\mathbf {x}_1, y_1),\ldots ,(\mathbf {x}_n, y_n)\), where each input sample \(\mathbf {x}_i \in \mathbb {R}^p\) and each output value \(y_i\) lies in a finite label space \(y \in \{1,\ldots , K\}\). Like ELM, AdaBoost assumes the training data are sampled independently and identically distributed (i.i.d.) from some unknown distribution \(\mathcal {X}\).

Given a space of feature vectors X and two possible class labels, \(y \in \{-1,+1\}\), the goal of AdaBoost is to learn a strong classifier \(H(\mathbf {x})\) as a weighted ensemble of weak classifiers \(h_t(\mathbf {x})\) predicting the label of any instance \(\mathbf {x} \in X\) (Landesa-Vázquez and Alba-Castro 2013).

$$\begin{aligned} H(\mathbf {x}) = \mathrm{sign}(f(\mathbf {x}))=\mathrm{sign}\left( \sum _{t=1}^{T}\alpha _t h_t(\mathbf {x}) \right) \end{aligned}$$
(7)

Pseudocode for AdaBoost is given in Alg. 1

figure a
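Since Algorithm 1 appears here only as a figure, the following is a hedged sketch of the standard binary AdaBoost loop ending in the weighted vote of Eq. 7; the `weak_learner(X, y, w)` interface returning a callable classifier is our assumption made for illustration, not part of the original algorithm listing.

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """Binary AdaBoost with labels y in {-1, +1}; returns weighted weak classifiers."""
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                  # initial sample weights D_1(i) = 1/n
    ensemble = []
    for t in range(T):
        h = weak_learner(X, y, w)            # train a weak classifier on weighted data
        pred = h(X)
        err = np.sum(w[pred != y])           # weighted training error
        if err >= 0.5:                       # weak learner no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)       # emphasize misclassified samples
        w /= w.sum()                         # renormalize to a distribution
        ensemble.append((alpha, h))
    return ensemble

def adaboost_predict(ensemble, X):
    # H(x) = sign(sum_t alpha_t h_t(x))  (Eq. 7)
    return np.sign(sum(alpha * h(X) for alpha, h in ensemble))
```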

3.3 MapReduce

MapReduce is a programming model for running parallel applications that process large-scale data sets in data-intensive applications. It is derived from the combination of the map and reduce functions in functional programming. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. MapReduce was originally developed by Google and built on these principles in a parallel manner (Dean and Ghemawat 2008). The framework first takes the input, divides it into smaller data chunks, and distributes them to worker nodes. MapReduce is divided into three major phases called map, reduce and a separate internal shuffle phase. The framework automatically executes all those functions in a parallel manner over any number of processors/servers (Schatz 2009).

The signatures of the user-supplied map and reduce functions are shown in Eq. 8.

$$\begin{aligned} \mathrm{map}&: (k_1, v_1) \rightarrow \mathrm{list}(k_2, v_2) \\ \mathrm{reduce}&: (k_2, \mathrm{list}(v_2)) \rightarrow \mathrm{list}(v_2) \end{aligned}$$
(8)
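As a rough, framework-free illustration of these signatures (not the Hadoop API), the sketch below simulates a single job in memory: it applies a map function to every record, groups the intermediate pairs by key, and reduces each group; all function names are ours.

```python
from collections import defaultdict
from itertools import chain

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate one MapReduce job in memory: map, shuffle by key, reduce."""
    # Map phase: each record yields zero or more (key, value) pairs
    intermediate = chain.from_iterable(map_fn(r) for r in records)
    # Shuffle phase: group all values that share the same intermediate key
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: merge each key's value list into a final output value
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Example: word count, the canonical MapReduce illustration
lines = ["map reduce map", "reduce shuffle"]
counts = run_mapreduce(
    lines,
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=lambda key, values: sum(values),
)
print(counts)  # {'map': 2, 'reduce': 2, 'shuffle': 1}
```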

The MapReduce programming technique is widely used in different scientific fields, e.g., cyber-security (Choi et al. 2014; Ogiela et al. 2014), high-energy physics (Bhimji et al. 2014), and biology (Xu et al. 2014).

4 Proposed approach

In this section, we provide the details of the MapReduce-based distributed AdaBoosted ELM algorithm. The basic idea of MapReduce-based AdaBoost-ELM is introduced in Sect. 4.1, the proposed algorithm is analyzed in Sect. 4.2, and its MapReduce implementation is described in Sect. 4.3.

4.1 Basic idea

Our main task is to execute the computation of the AdaBoosted ELM classification method in a parallel and distributed manner. The basic idea of AdaBoosted ELM is to calculate an ensemble of classifier functions over partitioned data \((X_m,Y_m)\) in parallel. In Table 2, a summary of commonly used variables and notations used to assess the classifier performance of the AdaBoosted ELM method is given for convenience.

Table 2 Commonly used variables and notations

4.2 Analysis of the proposed algorithm

Bartlett showed that the size of the weights is more important than the size of the neural network (Bartlett 1998). Krogh and Vedelsby showed that ensembles of neural networks achieve better accuracy on unseen examples (Krogh and Vedelsby 1995). The main motivation of this work is the idea that small ELM ensembles can yield classification models at least as accurate as individual classifiers.

In the proposed model, every data chunk has a set of classifier functions that acts as a single classification model. The single model for data chunk m is defined as follows:

$$\begin{aligned} f^{(m)}(\mathbf {x}) = \mathrm{{arg\,max}}_k \sum _{t=1}^{T}{\alpha _t h_t(\mathbf {x})} \end{aligned}$$
(9)

The ensemble ELM classifier models produced in the reduce phase of the MapReduce algorithm are then combined into one single classification model:

$$\begin{aligned} \hat{h}(\mathbf {x}) = \mathrm{{arg\,max}}_k \sum _{i=1}^{m}{f^{(i)}(\mathbf {x})} \end{aligned}$$
(10)
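A minimal sketch of Eqs. 9 and 10 is given below: each data chunk's AdaBoosted ELM ensemble casts class votes weighted by \(\alpha _t\), and the chunk-level predictions are merged by a second, unweighted vote. The assumption that each trained ELM `elm(X)` returns integer class labels is ours, made only for illustration.

```python
import numpy as np

def chunk_predict(ensemble, X, n_classes):
    """Eq. 9: per-chunk prediction as an alpha-weighted vote over AdaBoosted ELMs."""
    scores = np.zeros((X.shape[0], n_classes))
    for alpha, elm in ensemble:                     # (alpha_t, h_t) pairs from AdaBoost
        scores[np.arange(X.shape[0]), elm(X)] += alpha
    return scores.argmax(axis=1)

def global_predict(chunk_ensembles, X, n_classes):
    """Eq. 10: combine the chunk-level models by majority vote."""
    votes = np.zeros((X.shape[0], n_classes))
    for ensemble in chunk_ensembles:
        votes[np.arange(X.shape[0]), chunk_predict(ensemble, X, n_classes)] += 1
    return votes.argmax(axis=1)
```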

4.3 Implementation of the model

The pseudo-codes of MapReduce-based AdaBoost ELM are shown in Algorithm 2 and Algorithm 3. The Map procedure of our training model randomly assigns each row of the training data set to one of the M data splits (line 2 of Algorithm 2). The input, \(\mathbf {x}\), is a row of the training data set \(\mathcal {D}\). The Map procedure partitions the input matrix by row, producing \(<randomSplitId,\mathbf {x}>\) key-value pairs; randomSplitId identifies the data chunk and is transferred as the input key to the Reduce phase. The pseudo-code of the Reduce phase is shown in Algorithm 3. The Reduce procedure is implemented as the for-loop of lines 3–8 of Algorithm 3. The ELM classifiers for the sub-data set \((\mathbf {X}_k,\mathbf {y}_k)\) are calculated with AdaBoost block by block, so every reduce task completes its training phase and outputs an AdaBoosted set of classifier functions. The reducer's input key k is the randomSplitId identifying the data chunk, created in the Map phase of our training model.

figure b
figure c
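Because Algorithms 2 and 3 appear above only as figures, the sketch below mirrors the textual description, reusing the `elm_train`, `elm_predict` and `adaboost` helpers sketched earlier: the Map step tags each row with a random split identifier, and the Reduce step trains an AdaBoosted ELM ensemble on its chunk. Binary labels in \(\{-1,+1\}\) stored in the last column are an assumption made for brevity; this is our illustration, not the authors' code.

```python
import random
import numpy as np

def train_map(row, M):
    """Algorithm 2 (sketch): emit one <randomSplitId, x> pair for a training row."""
    return [(random.randrange(M), row)]              # random chunk id in [0, M)

def elm_weak_learner(n_hidden):
    """Hypothetical weak learner: one ELM fitted on a weight-based resample of the chunk."""
    def fit(X, y, w):
        idx = np.random.choice(len(y), size=len(y), p=w)   # importance resampling by AdaBoost weights
        A, b, beta = elm_train(X[idx], y[idx, None].astype(float), n_hidden)
        return lambda Z: np.sign(elm_predict(Z, A, b, beta)).ravel()
    return fit

def train_reduce(split_id, rows, T, n_hidden):
    """Algorithm 3 (sketch): AdaBoost-train ELMs on the chunk identified by split_id."""
    data = np.asarray(rows, dtype=float)
    X, y = data[:, :-1], data[:, -1]                 # assume the class label is the last column
    return split_id, adaboost(X, y, elm_weak_learner(n_hidden), T)
```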

5 Experiments

In this section, we perform experiments on real-world data sets from publicly available data set repositories. Public data sets are used to evaluate the proposed learning method. The classification models built on each data set are then compared, in terms of accuracy, with the performance of a single instance of the learning algorithm.

In Sect. 5.1, we explain the data sets and parameters used in the experiments. Section 5.2 describes the evaluation metrics. The conventional ELM is applied to all data sets, and its accuracy as a function of the number of hidden nodes is reported in Sect. 5.3. The empirical results of the proposed distributed AdaBoost ELM training algorithm are shown in Sect. 5.4.

5.1 Experimental setup

In this section, we apply our approach to the data sets described below to verify its effectiveness and efficiency. To demonstrate the performance of the proposed model, we apply it to various classification data sets from public data set repositories. To obtain an optimal value of the Mapper size, m, we vary it from 20 to 100.

5.1.1 Commonly used classification data sets

We experiment on five public data sets, summarized in Table 3: Pendigit, Skin, Statlog, Page-blocks and Waveform. All experiments are repeated 5 times and the results are averaged. All data sets are publicly available in svmlight format on the LIBSVM website (LIBSVM 2015).

The Pendigit data set is a collection for pen-based recognition of handwritten digits (Alimoglu and Alpaydin 1996). The data set contains 250 samples from each of 44 people. The first 7494 instances, written by 30 people, are used as the training data set, and the digits written by the other 14 people are used for independent testing.

The Skin data set is a collection for skin segmentation constructed over the R, G, B color space (Bhatt et al. 2009). The data set is derived from face images of different age groups (young, middle, old), genders and racial groups (White, Black, Asian). It contains 245,057 instances, of which 50,859 are skin-labeled instances and 194,198 are non-skin instances.

The Statlog/Shuttle data set is a collection of space shuttle data created by NASA (Hsu and Lin 2002). The data set contains 43,500 training instances and 14,500 testing instances. About 80% of the data belong to class 1.

The Page-blocks data set describes blocks of the page layout of documents detected by a segmentation process (Malerba et al. 1996). The data set contains 4500 training instances and 973 testing instances.

The Waveform data set is a collection of Breiman's waveform domains from the CART book (Breiman et al. 1984). The data set contains 4400 training instances and 600 testing instances.

Table 3 Description of the testing data sets used in the experiments

5.1.2 Large-scale classification data sets

We experiment on three public large-scale data sets, summarized in Table 4: "Record Linkage Comparison Patterns (Donation)", "SUSY" and "HIGGS". All experiments are repeated 5 times and the results are averaged.

The Donation data set contains comparison patterns over individual records, including first and family name, sex, date of birth and postal code, collected through iterative insertions over the course of several years. The comparison patterns in this data set are based on a sample of 100,000 records dating from 2005 to 2008 (Schmidtmann et al. 2009). The data set contains 5,749,132 training instances and 1,000,000 testing instances. The data set is available on the UCI website (UCI 2011).

SUSY is a classification data set for distinguishing between a signal process which produces supersymmetric particles and a background process which does not (Baldi et al. 2014). The first 8 features are kinematic properties measured by the particle detectors in the accelerator, and the last 10 features are functions of the first 8. The data set contains 5,000,000 training instances and 50,000 testing instances. The data set is available on the UCI website (UCI 0000).

HIGGS is a classification problem to distinguish between a signal process which produces Higgs bosons and a background process which does not (Baldi et al. 2014). The first 21 features (columns 2–22) are kinematic properties measured by the particle detectors in the accelerator, and the last seven features are functions of the first 21. The data set contains 11,000,000 training instances and 500,000 testing instances. The data set is available on the UCI website (UCI 2014).

Table 4 Description of the testing large-scale data sets used in the experiments
Fig. 1
figure 1

Number of hidden nodes in ELM versus classifier precision. a Statlog data set, b Skin data set, c Pen digit data set, d Waveform data set, e Page blocks data set

5.2 Evaluation

Since the data sets used in our experiments are highly imbalanced, traditional accuracy-based performance evaluation is not sufficient to identify an optimal classifier. We use four different metrics, namely the overall prediction accuracy, average recall, average precision (Turpin and Scholer 2006) and F-score, to evaluate classification performance; these are common measurement metrics in information retrieval (Manning et al. 2008; Makhoul et al. 1999).

Precision is defined as the fraction of retrieved samples that are relevant. Precision is shown in Eq. 11.

$$\begin{aligned} \mathrm{Precision} = \frac{\mathrm{Correct}}{\mathrm{Correct} + \mathrm{False}} \end{aligned}$$
(11)

Recall is defined as the fraction of relevant samples that are retrieved. Recall is shown in Eq. 12.

$$\begin{aligned} \mathrm{Recall} = \frac{\mathrm{Correct}}{\mathrm{Correct} + \mathrm{Missed}} \end{aligned}$$
(12)

The proposed evaluation model calculates the precision and recall for each class from prediction scores and then finds their mean. Average precision and recall are shown in Eqs. 13 and 14.

$$\begin{aligned} \mathrm{Precision}_{\mathrm{avg}}= & {} \frac{1}{n_{\mathrm{classes}}}\sum _{i=0}^{n_{\mathrm{classes}}-1}{\mathrm{Prec}_i}\end{aligned}$$
(13)
$$\begin{aligned} \mathrm{Recall}_{\mathrm{avg}}= & {} \frac{1}{n_{\mathrm{classes}}}\sum _{i=0}^{n_{\mathrm{classes}}-1}{\mathrm{Recall}_i} \end{aligned}$$
(14)

The F-measure is defined as the harmonic mean of precision and recall, as shown in Eq. 15.

$$\begin{aligned} F_1 = 2 \times \frac{\mathrm{Prec}_{\mathrm{avg}} \times \mathrm{Recall}_{\mathrm{avg}}}{\mathrm{Prec}_{\mathrm{avg}} + \mathrm{Recall}_{\mathrm{avg}}} \end{aligned}$$
(15)
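For concreteness, a short sketch of the macro-averaged metrics in Eqs. 11 through 15 follows: per-class counts of correctly retrieved, falsely retrieved and missed samples give the class-wise precision and recall, which are then averaged and combined into \(F_1\). Function and variable names are ours.

```python
import numpy as np

def macro_metrics(y_true, y_pred, n_classes):
    """Macro-averaged precision, recall and F1 (Eqs. 11-15)."""
    precisions, recalls = [], []
    for c in range(n_classes):
        correct = np.sum((y_pred == c) & (y_true == c))   # true positives for class c
        false = np.sum((y_pred == c) & (y_true != c))     # false positives for class c
        missed = np.sum((y_pred != c) & (y_true == c))    # false negatives for class c
        precisions.append(correct / (correct + false) if correct + false else 0.0)
        recalls.append(correct / (correct + missed) if correct + missed else 0.0)
    p_avg, r_avg = np.mean(precisions), np.mean(recalls)
    f1 = 2 * p_avg * r_avg / (p_avg + r_avg) if p_avg + r_avg else 0.0
    return p_avg, r_avg, f1
```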

5.3 Data set results with conventional ELM

Figure 1 shows that the accuracy of ELM on the experimental data sets reaches a steady state after a threshold value of N. The testing classification performance is measured in terms of accuracy, precision, recall and the \(F_1\) measure. N varies from 150 to 500.

Table 5 shows the best performance of the conventional ELM method of each data set.

The conventional ELM training algorithm can be applied only to the data sets in Sect. 5.1.1; the large-scale data sets in Sect. 5.1.2 are not feasible to train on a single computer.

5.4 Testing accuracy analysis

Because two different data set types (“commonly used”, “large scale”) are used, the results are divided into two different sections. In Sect. 5.4.1, the figures and the plots show the implementation results of commonly used classification data sets. Section 5.4.2 shows the large-scale data sets results.

5.4.1 Commonly used classification data sets

The results of accuracy and performance tests with real data are shown in Table 6 and Figs. 2, 3, 4, 5 and 6. According to these results, the AdaBoost size T and the Mapper size have more impact on the accuracy of the ensemble ELM classifier than the number of hidden nodes in the ELM network.

Table 5 Data set results with conventional ELM
Table 6 Best performance results of data sets
Fig. 2
figure 2

Statlog data set heatmap. a Split size and adaboost T size, b Split size and number of nh, c Adaboost T size and number of nh

Fig. 3
figure 3

Pendigit data set heatmap. a Split size and adaboost T size, b Split size and number of nh, c Adaboost T size and number of nh

The accuracy of classification models is visualized by heatmap color coding according to

  • Map size (M)–AdaBoost size (T)

  • Map size (M)–Number of hidden nodes (nh)

  • AdaBoost size (T)–Number of hidden nodes (nh)

Figures 2, 3, 4, 5 and 6 plot the quantitative differences in the accuracy score for each data set. Heatmaps are two-dimensional graphical representations of data with a pre-defined colormap used to display the values of a matrix (Khomtchouk et al. 2014). Heatmaps can be used to understand which parameters affect the accuracy of the classification model. The figures comparatively illustrate accuracy levels obtained with the proposed learning method across a number of different parameters, including the Map size, the AdaBoost size and the number of hidden nodes in the ELM algorithm.
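As an illustration of how such a parameter heatmap can be produced (with placeholder accuracy values, not the paper's results), a minimal matplotlib sketch follows.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical accuracy grid: rows = Map size (M), columns = AdaBoost size (T)
map_sizes = [20, 40, 60, 80, 100]
t_sizes = [5, 10, 15, 20]
acc = np.random.uniform(0.90, 0.96, size=(len(map_sizes), len(t_sizes)))  # placeholder values

fig, ax = plt.subplots()
im = ax.imshow(acc, cmap="viridis", aspect="auto")
ax.set_xticks(range(len(t_sizes)))
ax.set_xticklabels(t_sizes)
ax.set_yticks(range(len(map_sizes)))
ax.set_yticklabels(map_sizes)
ax.set_xlabel("AdaBoost size (T)")
ax.set_ylabel("Map size (M)")
fig.colorbar(im, ax=ax, label="Test accuracy")
plt.show()
```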

Fig. 4
figure 4

Skin data set heatmap. a Split size and adaboost T size, b Split size and number of nh, c Adaboost T size and number of nh

Fig. 5
figure 5

Page blocks data set heatmap. a Split size and adaboost T size, b Split size and number of nh, c Adaboost T size and number of nh

Fig. 6
figure 6

Waveform data set heatmap. a Split size and adaboost T size, b Split size and number of nh, c Adaboost T size and number of nh

According to Table 7, the classification performance of the proposed method is almost the same as that of the conventional ELM method.

5.4.2 Large-scale classification data sets

Figure 7 shows the speed-up over the mapper size for the proposed method on the large-scale data sets. To assess the effectiveness of the learning algorithm, the training time is measured with varying mapper size. Because of their size, these data sets cannot be trained on a single computer, so the standard speed-up measure is modified as follows:

$$\begin{aligned} S_p = \frac{t_{m_{\min }}}{t_p} \end{aligned}$$
(16)

where \(m_{\min } = \min \{m \in M\}\) is the smallest mapper size with which a classifier model can be built and \(t_{m_{\min }}\) is the total training time at that mapper size.
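A tiny worked example of Eq. 16 with hypothetical timing values (not the paper's measurements) is shown below; the baseline is the time measured at the smallest mapper count that could complete training.

```python
# Hypothetical training times (seconds), keyed by mapper count, for one data set
times = {20: 1800.0, 40: 980.0, 60: 700.0, 80: 560.0, 100: 470.0}

baseline = times[min(times)]                     # t at the smallest feasible mapper count
speedups = {m: baseline / t for m, t in times.items()}
print(speedups)   # {20: 1.0, 40: ~1.84, 60: ~2.57, 80: ~3.21, 100: ~3.83}
```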

As can be seen from the figure, all data sets achieve an improvement in the learning time of the algorithm. Examining the trends as the number of mappers increases, one can see that a non-linear speed-up is achieved.

Table 7 Performance comparison of ELM and proposed model
Fig. 7
figure 7

Speed-up of the proposed method over Mapper size on the large-scale data sets

Fig. 8
figure 8

Stability analysis. a Stability analysis of ensemble ELM classifiers with Mapper size, b Stability analysis of ensemble ELM classifiers with AdaBoost T size

5.5 Stability analysis

The standard deviation of the testing accuracy of the method is shown in Fig. 8a, b. We analyze the stability of the ensemble ELM classifier with respect to two parameters, the Mapper size and the AdaBoost size T. According to Fig. 8a, the Mapper size is the most important variable for model stability. From Fig. 8a, b, we find that the standard deviation of the testing accuracy decreases considerably as the number of Mapper functions increases. From this analysis, one can argue that a model with a high Mapper function size is more stable than one with a low Mapper function size.

6 Conclusion and future works

In this paper, a parallel AdaBoost extreme learning machine algorithm has been proposed for massive data learning. By splitting the overall data set into data chunks, the MapReduce-based learning algorithm reduces the training time of ELM classification. To avoid a decrease in accuracy, the distributed ELM is enhanced with the AdaBoost method. The experimental results show that AdaBoosted ELM reduces the training time on large-scale data sets while keeping the accuracy metrics close to those of the conventional ELM.

The proposed AdaBoost-based ELM has three different trade-off parameters: (1) the data chunk split size, M, (2) the maximum number of iterations, T, in the AdaBoost algorithm and (3) the number of hidden layer nodes, nh, in the ELM algorithm. The empirical results in the heatmap figures show that parameters M and T are more dominant than parameter nh for the classification accuracy of the hypothesis.

The algorithm is designed to deal with large-scale ELM training problems. Another objective is to keep the model's classification performance the same as, or close to, that of the conventional ELM method. Classification performance results are shown in Sect. 5.3. The empirical results show that the classification performance of the proposed method is almost the same as that of the conventional ELM method.