1 Introduction

Data volume is growing exponentially with the wide adoption of IoT, and decision-making entities are adapting their strategies to accommodate this growth for their business needs. Data arriving at each instant must be processed on arrival for structured storage and future use. Handling massive data that arrives at high speed is the present challenge faced by the research and industry communities dealing with such online applications. Future computational algorithms will focus more on fast data processing, and near real-time solutions for very fast data streams are becoming the research target of many groups. Applications of such fast algorithms range from security systems used in army operations to business planning in a consumer shop.

Processing online (live) data has great potential in the smart applications that are going to lead us in the coming decade. Autonomous cars [27] are one interesting area, where hardware, sensors and algorithms work together efficiently and faster than a human. In the earlier computational era, online processing was carried out through batch processing, where the large data volume was divided into smaller batches. Those batches were fed into the computational system sequentially and processed one by one in a pre-determined order. The batch size and the execution time of the process decided the viability of such an application, and this kind of architecture was not preferred in later stages. Advancements in computing infrastructure and algorithms have helped researchers use online data analysis in many recently published application areas, such as remote sensing [22, 26], hostile activity analysis [23] and cyber forensics [24, 25]. Advancements in machine learning have improved the capability of artificial intelligence applications to a great extent. Biologically inspired strategies, especially the Artificial Neural Network, are used in computer vision and many allied areas. The arrival of deep architectures in learning models brought a drastic shift in computation, especially in computer vision. Industry and researchers were deeply impressed by the capability of such models to extract high-level, high-quality features from the input data using multiple layers. Most of the data transformations between these layers are complex and incur a high computational cost in terms of time and infrastructure. Hence deep architectures are rarely used in applications where time and infrastructure are strictly constrained.

Real-time processing was expensive and complex in its earlier stages, but researchers in this area later made it lighter and simpler. The Extreme Learning Machine (ELM) is one of the simplest machine learning models: it consists of only one hidden layer, which greatly reduces its computational complexity. The generalization capability of ELM is theoretically proven, and the dimension of the output weight in ELM is proportional to the number of neurons in the hidden layer. ELM works with batches of input data, so the model computes in a single pass. The input parameters of ELM are assigned randomly, and the trained output is derived from the connections between the hidden layer and the output layer. The performance of ELM is competitive and its execution time is remarkably short, hence it is used in many machine learning applications, particularly in near real-time environments. A recent work on real-time COVID-19 diagnosis [34] using ELM is one example.

The resource utilization of most new-generation algorithms, in terms of memory and training time, is high, and real-time or online data processing strategies should pay attention to those aspects. Generally, ELM takes a bundle of data in a single pass for training, and it can be modified into a sequential model in which batches of a fixed size are fed into the model for processing. In this case, the model must wait for the next batch to be filled, creating a lag in the execution. This work intends to tap the theoretical foundation of ELM and design a model that can effectively manage input batches of any size.

The proposed model is a distributed version of batch processing in which the data is divided into mini-batches and each mini-batch is processed in a separate ELM process. Any number of batches can be processed through this architecture at a time within a defined execution time. Instead of waiting for a slot, a batch can be submitted for processing directly on arrival. The output size of each parallel process is fixed by the number of hidden nodes, so a batch of any size can be given to a process and the integration of the results is a comparatively simple task. The knowledge updating mechanism used in our method helps to fine-tune the knowledge of ELM, so the results can improve with every arriving batch. Datasets from different public sources were used in the experiments of this work. They include sparse datasets and other formats with categorical and numerical data. The datasets also differ in the size of their feature sets and in their number of instances, to check the consistency of the results across various combinations. The performance of the proposed parallel ELM is compared with benchmark methods implemented in scikit-learn [12] and with stream classifiers in the scikit-multiflow [18] framework. Evaluation metrics such as accuracy, F1 score and precision are used to validate the reliability of the proposal. The training time of the proposed method is also compared with all the methods used in the study. A comparison with some recent works in stream data classification [39,40,41,42,43] is also compiled at the end of the experiment section. The experiments show that our method outperformed the other methods in terms of the performance metrics. The training time of the proposal is remarkably short and shows the relevance of our architecture in live data classification.

The rest of the paper is organized as follows: Section 2 gives an overview of the related works and the other methods used for the study. Section 3 explains the proposed method and the theoretical model behind it. The experimental setup, results and analysis are presented in Section 4. The conclusion of the work is given in Section 5.

2 Related works

Machine learning methods are extensively used in many real-time applications [14, 20, 35, 36, 38], and their improvements have helped those applications become smarter. The general acceptance of Artificial Neural Networks (ANN) gives researchers the confidence to experiment further with ANN and to derive new algorithms. There are many works on deep ANN architectures, which are found to be more efficient but are computationally complex and time-consuming. The ELM is a single-hidden-layer feedforward network proposed by Huang et al. [1]. The Moore-Penrose Generalized Inverse and the Minimum Norm Least Squares Solution [1] of linear systems are the two basic principles behind the working of ELM. The ELM became well regarded in the community because of its reliable performance, which rests on the universal approximation [8, 29] principle. Instead of processing instance by instance, the ELM processes a batch in a single step, and hence the model has to be rerun whenever data is appended. Huang et al. [30] derived an extension of ELM called incremental ELM, in which the optimum number of hidden nodes is derived analytically [31]. Online Sequential ELM (OS-ELM) [32] emerged to address the drawback of ELM when processing batch data. Its authors introduced a dual-objective method to deal with data that arrives as batches at periodic intervals. Their results note that the method performed better than the other algorithms in the domain. This work was recently updated by Hualong Yu et al. [33], who modified it to accommodate an increasing number of classes in upcoming batches. The work of Mirza and Lin [28], named weighted online sequential extreme learning machine, is found suitable for imbalanced class problems.

Among the state-of-the-art methods, there are several proven classification strategies that are generally used in online applications. Stochastic Gradient Descent (SGD) [4] is one of them; its internal learning parameters are updated in every iteration from random mini-batches of the actual dataset. In the Radial Basis Function kernel SVM [5], the classical SVM becomes more powerful with the help of the radial basis kernel, which uses an exponential function of the squared distance between samples to discriminate the classes. The Gaussian Process Classifier [6] is a probabilistic classifier based on Gaussian processes, which generalize the Gaussian probability distribution. The Random Forest Classifier [7] is an ensemble of uncorrelated decision trees, and the ensemble generally performs better than its individual trees. The Multi-layer Perceptron is the classical model among learning algorithms and the basis of the new-generation classifiers. The AdaBoost classifier [9] is an ensemble of several classifiers that boosts performance through an iterative ensemble method. The Gaussian Naive Bayes Classifier [10] is based on the Naive Bayes methods, which assume conditional independence of the features. Quadratic Discriminant Analysis [11], an extension of Linear Discriminant Analysis, is used to classify nonlinear distributions.

The Dynamic Weighted Majority Classifier [13] can be used as a stream classifier; it is an ensemble of several learning methods, and classification follows a weighted-majority vote mechanism. The Online Boosting classifier [15] is a variant of the ensemble model for online machine learning, in which the capability of weak learning algorithms is increased during batch processing. The Oza Bagging classifier [16, 17] is a stream classifier that extends ensemble-based bagging algorithms. In this algorithm, an additional parameter denotes the importance of each batch in the online process, and it is updated over time.

3 Proposed method

ELM is a single-pass SLFN that maps the input data into the ELM feature space. As explained in the following part of this section, the theory and working of ELM show that it is computationally less expensive than the other learning models. Figure 1 represents the general architecture of ELM, where the function f represents relation (7) and \( f^{\prime } \) is carried out as per relation (5).

Fig. 1 General Architecture of ELM

There are four main components in the proposed method, as shown in Fig. 2: Parallel ELM, Weight Synthesizer, Knowledge Base and Evaluator. The training data, either from a live source or from an available data store, is partitioned into batches and passed to the Parallel ELM block. Each ELM training module is small in terms of execution complexity and code; it takes a data chunk and produces the output weight \(\beta_{i}\). The outputs of the ELMs executing in parallel are processed in the Weight Synthesizer to produce the actual output of the run. The corresponding weights are stored and updated in the Knowledge Base for the incoming datasets. The Evaluator module evaluates the testing dataset with the help of the output weight. The strategy is given in Algorithm 1.

Fig. 2 The Architectural Diagram of the proposed method

The ELM is formulated on top of two concepts [1]: the Moore-Penrose Generalized Inverse and the Minimum Norm Least Squares Solution of linear systems. Any system that can be modelled with a linear relation Ax = y can be effectively solved with the help of the Moore-Penrose Generalized Inverse. The matrix \(A^{\dagger }_{(q \times p)}\) is the Moore-Penrose Generalized Inverse of the matrix \(A_{(p \times q)}\) if it satisfies the following conditions.

$$ AA^{\dagger}A=A, \quad A^{\dagger}AA^{\dagger}=A^{\dagger}, \quad (AA^{\dagger})^{T}=AA^{\dagger}, \quad (A^{\dagger}A)^{T}=A^{\dagger}A $$
(1)
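The four conditions of relation (1) can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `satisfies_penrose` and the random test matrix are illustrative choices and not part of the original work.

```python
import numpy as np

def satisfies_penrose(A, A_pinv, tol=1e-10):
    """Return True if A_pinv meets the four conditions of relation (1) for A."""
    return bool(
        np.allclose(A @ A_pinv @ A, A, atol=tol)
        and np.allclose(A_pinv @ A @ A_pinv, A_pinv, atol=tol)
        and np.allclose((A @ A_pinv).T, A @ A_pinv, atol=tol)
        and np.allclose((A_pinv @ A).T, A_pinv @ A, atol=tol)
    )

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # a p x q matrix with p = 5, q = 3
A_pinv = np.linalg.pinv(A)        # its q x p generalized inverse
print(A_pinv.shape, satisfies_penrose(A, A_pinv))
```

Note that the generalized inverse exists for rectangular and rank-deficient matrices, which is why it suits hidden layer matrices whose row count depends on the batch size.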

\(\hat {x}\) is a least squares solution of Ax = y when

$$ \|A\hat{x}-y\|=\min_{x}\|Ax-y\| $$
(2)

in the Euclidean norm ∥.∥. Among all such solutions, \(\hat{x}=A^{\dagger}y\) is the minimum norm least squares solution of Ax = y. A simple learning system with N samples \((x_{i}, t_{i})\), where \(x_{i} = [x_{i1}, x_{i2}, \ldots, x_{in}]^{T}\) and \(t_{i} = [t_{i1}, t_{i2}, \ldots, t_{im}]^{T}\), can be modelled as

$$ \sum\limits_{i=1}^{M} \beta_{i}g(w_{i}.x_{j}+b_{i})=y_{j}, j=1,2,...N $$
(3)

where M is the number of hidden neurons, \(w_{i}\) is the input weight of the i-th hidden neuron, \(\beta_{i}\) is the weight connecting that neuron to the output layer, \(b_{i}\) is the threshold applied to that neuron, and g(x) is the activation function that produces the prediction \(y_{j}\). When the network approximates the N samples with zero error, \( {\sum }_{j=1}^{N}\|y_{j}-t_{j}\| = 0 \), and relation (3) becomes

$$ \sum\limits_{i=1}^{M} \beta_{i}g(w_{i}.x_{j}+b_{i})=t_{j}, j=1,2,...N $$
(4)

which can be written compactly in the form of a linear system as

$$ H\beta=T $$
(5)

As per the minimum norm least squares solution, the output weight β of a single-hidden-layer feedforward network can be derived as

$$ \beta=H^{\dagger}T $$
(6)

where H is the hidden layer output matrix of the input data, represented as

$$ H=\left[ \begin{array}{ccc} g(w_{1}.x_{1}+b_{1}) & {\cdots} & g(w_{M}.x_{1}+b_{M}) \\ {\vdots} & {\ddots} & {\vdots} \\ g(w_{1}.x_{N}+b_{1}) & {\cdots} & g(w_{M}.x_{N}+b_{M}) \end{array} \right]_{N \times M} $$
(7)

and

$$ \beta= \left[ \begin{array}{c} {\beta_{1}^{T}}\\ {\vdots} \\ {\beta_{M}^{T}} \end{array} \right]_{M \times m} \quad \text{and} \quad T= \left[ \begin{array}{c} {t_{1}^{T}}\\ {\vdots} \\ {t_{N}^{T}} \end{array} \right]_{N \times m} $$
(8)
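Relations (5) to (8) translate almost line by line into code. The sketch below is a minimal illustration, assuming NumPy, a sigmoid activation for g, randomly drawn input weights and a toy two-class problem; none of these choices are taken from the original experiments, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_layer(X, W, b):
    """Hidden layer matrix H of relation (7), shape (N, M)."""
    return sigmoid(X @ W + b)

def train_elm(X, T, W, b):
    """Output weight beta = pinv(H) @ T of relation (6), shape (M, m)."""
    return np.linalg.pinv(hidden_layer(X, W, b)) @ T

def predict_elm(X, W, b, beta):
    """Predicted class index for every row of X."""
    return np.argmax(hidden_layer(X, W, b) @ beta, axis=1)

rng = np.random.default_rng(0)
N, n, M, m = 200, 4, 20, 2                     # samples, features, hidden nodes, classes
X = rng.standard_normal((N, n))
labels = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy labels, assumption for the demo
T = np.eye(m)[labels]                          # one-hot target matrix, shape (N, m)
W = rng.standard_normal((n, M))                # random input weights
b = rng.standard_normal(M)                     # random thresholds
beta = train_elm(X, T, W, b)
train_acc = float(np.mean(predict_elm(X, W, b, beta) == labels))
```

The shape of `beta` depends only on M and m, not on N, which is the property that later allows batches of different sizes to be combined.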

Relation (6) is used to find the output weight of ELM in the Parallel ELM block of the proposed architecture. The initial values of the weights and biases strongly affect the performance of ELM. In the general implementation of ELM, \( \left [ \begin {array}{c} w\\ b \end {array} \right ] \) is assigned randomly. In our method, the initial values of these parameters are derived from the Singular Value Decomposition of the data sample \(d_{i}\).

$$ SVD(d_{i})=USV^{T} $$
(9)

where V is an orthogonal matrix, S is a diagonal matrix whose first r diagonal values are non-zero, and U is a semi-unitary matrix. By the low-rank approximation, the first r components are enough to represent the initial matrix as

$$ d_{i} \Leftarrow U_{r}S_{r}{V_{r}^{T}} $$
(10)

where \(U_{r}\) and \(V_{r}\) consist of the first r columns of the corresponding matrices in (9) and \(S_{r}\) is an r × r square matrix. Hence the initial values of the weight and bias can be deduced as

$$ \left[ \begin{array}{c} w\\ b \end{array} \right]=V_{r} $$
(11)

Because of the inherent representation capability of the matrix Vr, the ELM could perform better than the traditional method.
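Relation (11) can be sketched in code as follows. The text does not state how the rows of \(V_{r}\) are split between w and b, so this sketch makes an explicit assumption: a column of ones is appended to the batch, so that \(V_{r}\) has one row per input feature plus one extra row that is used as the bias. The function name `svd_init` is hypothetical and NumPy is assumed.

```python
import numpy as np

def svd_init(batch, n_hidden):
    """Derive (W, b) from the leading right-singular vectors of one batch.

    Assumption (not from the paper): a bias column of ones is appended, so
    V_r has n + 1 rows; the first n rows form W and the last row forms b.
    """
    N, n = batch.shape
    if n_hidden > min(N, n + 1):
        raise ValueError("n_hidden cannot exceed min(N, n + 1)")
    augmented = np.hstack([batch, np.ones((N, 1))])
    # With full_matrices=False, Vt has shape (min(N, n + 1), n + 1).
    _, _, Vt = np.linalg.svd(augmented, full_matrices=False)
    V_r = Vt[:n_hidden].T            # shape (n + 1, n_hidden)
    return V_r[:n], V_r[n]           # W: (n, n_hidden), b: (n_hidden,)

rng = np.random.default_rng(0)
batch = rng.standard_normal((50, 6))
W, b = svd_init(batch, n_hidden=4)
```

Under this assumption the number of hidden nodes is bounded by the batch size and the feature dimension, which is in line with the hidden node setting reported in the experiments.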

In Algorithm 1, the steps under line number 2 are executed by the parallel processes. They comprise four major steps, 2.a to 2.d, of which the SVD initialization of the internal parameters is the first. The hidden layer matrix and the output weights are then calculated for each batch. The output weight of each batch is also stored by the corresponding parallel process.

Algorithm 1

Retaining past knowledge for future improvement is one of the key ideas of major learning models. In ELM, such feedback is technically not possible within its learning activity. We have introduced an additional structure, the Knowledge Base (KB), to address this drawback of ELM. The KB is a fixed-length structure that consists of L weight vectors. The weight vectors of the individual ELMs in the parallel processes are stored in the KB according to eligibility criteria. The weight derived at each evaluation is fed back to the KB to strengthen the performance of the methodology. An output weight that does not meet the criteria is discarded through mechanisms such as a distance measure from the mean of the entire collection and the degree of reliability that the testing dataset gives. Hence only the vectors that can contribute significantly to processing the incoming data stream are stored in the KB. A mean model-based centrality measure is used to find the output weight of the model at the time of evaluation. Being a simple measure, the weight synthesis adds little computational burden to the model and does not sacrifice performance.
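One possible reading of the Knowledge Base and the mean-based weight synthesis is sketched below. The eviction rule used here (drop the stored weight farthest from the mean once the capacity L is exceeded) is an assumption chosen for illustration; the reliability check against the testing dataset is omitted, and the class name and methods are hypothetical. NumPy is assumed.

```python
import numpy as np

class KnowledgeBase:
    """Fixed-length store of output weights with capacity L (illustrative)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.weights = []

    def synthesize(self):
        """Mean-model output weight over all stored weights."""
        return np.mean(self.weights, axis=0)

    def add(self, beta):
        """Store beta; if over capacity, drop the entry farthest from the mean.

        Assumption: distance from the mean is used as the only eligibility
        criterion in this sketch.
        """
        self.weights.append(np.asarray(beta, dtype=float))
        if len(self.weights) > self.capacity:
            mean = self.synthesize()
            dists = [np.linalg.norm(w - mean) for w in self.weights]
            self.weights.pop(int(np.argmax(dists)))

kb = KnowledgeBase(capacity=2)
for value in (1.0, 1.2, 9.0):        # toy 1 x 1 "weights"
    kb.add([[value]])
beta_out = kb.synthesize()
```

In this toy run the outlying weight 9.0 is evicted, so the synthesized weight is the mean of the two remaining entries.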

The theoretical model shows that ELM is computationally efficient for processing data on the fly. There is no need to fine-tune the internal parameters, and the major computation happens in a few steps. By introducing the parallel architecture, the proposed method can handle any number of batches at a time, and can therefore further reduce the computational cost compared with sequential processing. The SVD initialization of the parameters and the calculation of the hidden layer matrix and the output weight are the steps that consume computational power, but this cost is negligible in comparison with the new-generation learning models.
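The parallel block can be sketched as follows. This is a simplified illustration under stated assumptions: a thread pool stands in for the parallel processes, all workers share one randomly drawn pair (W, b) so that the averaged output weights live in the same feature space, and the data is synthetic. The function names are hypothetical; only NumPy and the standard library are used.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def train_batch(batch, W, b):
    """One ELM worker: beta_i = pinv(H_i) @ T_i for a single mini-batch."""
    X, T = batch
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))    # sigmoid hidden layer (assumed)
    return np.linalg.pinv(H) @ T

def parallel_elm(batches, W, b, n_workers=3):
    """Train one ELM per batch concurrently, then average the output weights."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        betas = list(pool.map(lambda bt: train_batch(bt, W, b), batches))
    return np.mean(betas, axis=0), betas

rng = np.random.default_rng(1)
n, M, m = 5, 5, 2                              # features, hidden nodes, classes
W, b = rng.standard_normal((n, M)), rng.standard_normal(M)
batches = []
for size in (200, 350, 500):                   # batches of different sizes
    X = rng.standard_normal((size, n))
    y = (X[:, 0] > 0).astype(int)              # toy labels
    batches.append((X, np.eye(m)[y]))
beta, betas = parallel_elm(batches, W, b)
```

Every worker returns a weight of shape (M, m) regardless of its batch size, so the synthesis step reduces to an element-wise mean.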

4 Experiment and results

We have conducted different sets of experiments to evaluate the performance of the proposed strategy. The experiments fall into three categories: i) measuring the influence of batch size on the overall performance, ii) a performance comparison with a set of benchmark classification strategies, and iii) a comparative study against the performance of some methods in the domain. The F1 score, accuracy and precision are used to quantify the efficiency of the method, while the execution time of the learner algorithm is measured in milliseconds. A desktop PC with a 6-core Intel i5 processor at 2.90 GHz and 32 GB of internal memory is used for both coding and execution of the work. We have conducted the experiments with \(\min \limits \{Batch Size, Feature Dimension\}\) hidden nodes. The model is designed to incorporate any number of parallel processes at a time. In the experiments, we have used two or three parallel processes at a time.
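For reference, the three evaluation metrics and the hidden node rule can be written out as below. This is a plain-Python sketch for the binary case; the function names and the toy label vectors are illustrative assumptions, and multi-class averaging is not covered.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision and F1 score for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "f1": f1}

def n_hidden_nodes(batch_size, n_features):
    """Hidden node count used in the experiments: min{batch size, feature dimension}."""
    return min(batch_size, n_features)

scores = binary_metrics([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
```

With these toy vectors there are three true positives, one false positive and one false negative, so precision and F1 are both 0.75 and accuracy is 4/6.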

We have used a collection of publicly available real-world datasets in these experiments; the details are summarized in Table 1.

Table 1 Datasets used in the experiments

The datasets are selected from different public repositories: Covtype, Electricity, Airlines and Poker are taken from Massive Online Analysis (MOA) [2]; Adult, Nomao, RCV1 and Shuttle are from the UCI machine learning repository [3]; Weather and Real-sim are downloaded from OpenML [19]; SEAa and AGRa [21] are synthetic datasets. Two sparse datasets, RCV1 and Real-sim, are also used in the preliminary experiments. We have selected comparatively large datasets, ranging from 34,465 to ten lakh (one million) instances, to simulate the online arrival of data by dividing them into batches. The number of features was also one of the criteria for selecting the datasets, and we cover a wide range, with a minimum of 3 and a maximum of 47,236 features. Most of the datasets are binary, and some have more than 10 classes. The attributes of two datasets include both categorical and numerical data, and we converted the categorical data to numerical values in a preprocessing phase. As seen in Table 1, we have chosen the datasets so as to cover as many as possible of the parameters that can directly affect the performance of the classification task.

4.1 Batch size and performance of proposed method

The number of samples used in training generally affects the quality of a classifier's performance. The objective of this set of experiments is to verify this for our proposed method. All the datasets listed in Table 1 are used for the experiments. We have created streams of batches of different sizes, ranging from 200 to 2000, for the experiment. The datasets are divided in an 80-20 ratio for training and testing purposes.
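The data preparation described above can be sketched as follows. The sketch assumes that the split preserves arrival order (no shuffling) and that a final, shorter batch is kept; neither detail is stated in the text, and the function names are hypothetical.

```python
def split_80_20(samples, train_fraction=0.8):
    """Split a sequence into training and testing parts, preserving order."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]

def mini_batches(samples, batch_size):
    """Yield consecutive mini-batches; the last one may be shorter."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(1000))              # stand-in for 1000 data instances
train, test = split_80_20(samples)
sizes = [len(batch) for batch in mini_batches(train, 300)]
```

Because the proposed architecture accepts batches of any size, the shorter final batch does not need padding or special handling.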

The classification accuracies of the experiments with 12 datasets are shown in Fig. 3. The training part of each dataset is partitioned into batches of 200, 300, 400, 500, 1000 and 2000 samples, and the experiments are conducted separately for each size. The testing dataset is fed into the model in a single stretch, and for some datasets it contained more than a lakh (100,000) instances. The accuracies obtained with batches of 200 samples were the lowest in all the experiments. The results clearly show that the accuracy increases with the batch size. The Airlines dataset had the lowest accuracy among all the datasets, 0.65. Three datasets, Electricity, KDD99 and Shuttle, were at the top, and Electricity marked the best accuracy of 0.998. The other datasets performed in the range of 75% to 95% accuracy.

Fig. 3 Performance of the proposed method on different datasets and batch sizes

It is also observed in Fig. 4 that the performance of the model becomes consistent beyond a batch size of 500. Only two datasets, SEAa and Weather, show a slight negative trend beyond a batch size of 1000. While studying the spread and deviation of the accuracy within every batch size, we noted that the deviation of the results was very small, ranging from 0.00021 to 0.0251. Interestingly, the standard deviation of the results for the batch sizes 500, 1000 and 2000 reduces significantly again, to the range \(0.18 \times 10^{-3}\) to \(6.4 \times 10^{-3}\). For these batch sizes, the already small variation seen with the smaller batches is reduced by more than 50% in seven datasets of the collection. All these statistics show that the method performs consistently with almost all the datasets and batch sizes taken for the study, especially from 500 to 2000.

Fig. 4 Standard Deviation of the results for the batch sizes 500 to 2000

The proposed method is also compared with several ELM variants: UFROS-ELM [44], EOS-ELM [45] and IDS-ELM [46]. The results are shown in Fig. 5. Figure 5(a) shows the accuracy obtained with these methods on the datasets Electricity, KDD99 and Shuttle. Figure 5(b) gives the training time of these methods on another set of data.

Fig. 5 Performance comparisons of the proposed method with ELM variants. (a) Accuracy of the methods (b) Training time of the methods

Figure 5(a) shows that our method performed on par with the other ELM variants; the others hold only a slight advantage in some experiments. These competitive results show the value of our method, in which training happens with mini-batches. The knowledge base, which is updated at every stage with new mini-batches, plays a significant role in the results produced. The other ELM variants take the entire training data in a single stretch, yet they could not gain a substantial advantage in accuracy.

It is evident in Fig. 5(b) that the time taken by the proposed method is significantly lower than that of the other ELM variants used for the study. Since the proposed method uses mini-batches for training, its training time is the lowest in the group, an essential criterion for stream data analysis.

4.2 Performance comparison with benchmarked methods

The experiments in this section compare the performance of our method with a set of benchmark classification algorithms. We have chosen Stochastic Gradient Descent (SGD) [4], Radial Basis Function kernel SVM (RBF SVM) [5], Gaussian Process Classifier (GPC) [6], Random Forest Classifier (RFC) [7], Multi-layer Perceptron (MLP), AdaBoost classifier (ABC) [9], Gaussian Naive Bayes Classifier (GNB) [10], Quadratic Discriminant Analysis (QDA) [11] and ELM for the study. The scikit-learn [12] implementation of these classifiers is used throughout these experiments. Some of the methods do not directly support sparse datasets, and hence RCV1 and Real-sim are not analysed anywhere in this section. The results in this section are presented in three stages: the effect of mini-batch size on the performance of the classifiers, a comparative analysis of the accuracy and F1 score of the methods, and a comparative analysis of the training time of the classifiers against the proposed method.
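For orientation, the eight scikit-learn baselines named above can be instantiated as in the sketch below. The hyperparameters used in the original experiments are not reported here, so library defaults and a fixed seed are assumed; the helper name `make_baselines` is hypothetical. ELM is not part of scikit-learn and is therefore not included.

```python
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

def make_baselines(seed=0):
    """Return the benchmark classifiers keyed by the abbreviations in the text.

    Assumption: default hyperparameters; the paper's settings are not given.
    """
    return {
        "SGD": SGDClassifier(random_state=seed),
        "RBF SVM": SVC(kernel="rbf", random_state=seed),
        "GPC": GaussianProcessClassifier(random_state=seed),
        "RFC": RandomForestClassifier(random_state=seed),
        "MLP": MLPClassifier(random_state=seed),
        "ABC": AdaBoostClassifier(random_state=seed),
        "GNB": GaussianNB(),
        "QDA": QuadraticDiscriminantAnalysis(),
    }

baselines = make_baselines()
```

Each estimator exposes the same `fit` and `predict` interface, so one evaluation loop can cover all of them.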

As in the case of previous experiments, we have conducted a series of experiments with different mini-batch sizes ranging from 200 to 2000. The first set of experiments in this section is to analyse the influence of the batch size in different classifiers. It is noted that different methods recorded their maximum in various batch sizes.

All the batch sizes are represented in Fig. 6. There were ten datasets and ten classifiers; hence a total of 100 combinations are plotted in the scatter plot of Fig. 6. Some dataset-method combinations reached their maximum results at the smaller batch sizes. The frequency of top-performing combinations in the 200 to 500 batch-size range is lower than at 1000 and 2000. Even though the mini-batch size of 2000 received a higher density of winning hits, the best results are not strongly affected by the batch size, which shows the importance of a parallelized strategy using mini-batches.

Fig. 6 The size of the top winning batches of the classifiers with various dataset-method combinations

The average ranking of the methods is also given in Table 2. Covtype and Electricity had the lowest and highest accuracies, respectively, across all the experiments. Among the other methods, GPC performed well and was at the top for two datasets, Electricity and Nomao. GNB was the poorest performer in the list and ranked last on four datasets: Covtype, Electricity, Shuttle and Weather. Our method received the best ranking compared with the other nine methods. It achieved the top position on six datasets and the second position on another three. Our method's highest accuracy was on Electricity, although it ranked second there. Its lowest accuracy, 65.2, was on Airlines, but it still took the top position in that experiment. Our proposed method achieved an accuracy above 90% on four datasets, and only three fell below 70%. Where our method did not win, its gap to the winning method in that experiment is below 1%. It is encouraging that our method performed better than the initial version of ELM. All these facts show that our method performed better than the other benchmark methods on most of the datasets. The knowledge base, updated in every validation step, helped our model fine-tune its knowledge for better classification.

Table 2 Comparison of Accuracy with benchmarked methods
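An average ranking of this kind can be computed as sketched below. The accuracy values, method names and dataset names in the example are made-up placeholders for illustration only and are not results from Table 2; ties are broken by insertion order, which is an assumption of this sketch.

```python
def average_ranks(scores):
    """Map {dataset: {method: accuracy}} to {method: mean rank}; rank 1 is best."""
    ranks = {}
    for per_method in scores.values():
        ordered = sorted(per_method, key=per_method.get, reverse=True)
        for rank, method in enumerate(ordered, start=1):
            ranks.setdefault(method, []).append(rank)
    return {method: sum(r) / len(r) for method, r in ranks.items()}

# Placeholder values, not taken from the paper.
toy_scores = {
    "dataset_1": {"method_A": 0.90, "method_B": 0.80, "method_C": 0.70},
    "dataset_2": {"method_A": 0.60, "method_B": 0.65, "method_C": 0.50},
}
mean_rank = average_ranks(toy_scores)
```

A lower mean rank indicates a method that is more consistently near the top across datasets, independent of the absolute accuracy level of each dataset.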

Figure 7 shows that most of the datasets, except Covtype, AGRa and AGRa, scored more than 80% precision in at least one of the experiments. Three datasets, Electricity, KDD99 and Shuttle, achieved more than 99% precision, and Electricity had the highest score of 99.88. Our parallel method scored the best precision on four datasets and reached the second position on another two. RFC received the highest precision for the Electricity and Nomao datasets, and hence reached the second position overall. Most of the methods performed well on the datasets Electricity, KDD99 and Shuttle. As in the previous set of experiments, the proposed method outperformed the others in precision as well. The consistently good results of some datasets with most of the methods point to rich features or patterns underlying those datasets.

Fig. 7 Precision obtained by the methods for different datasets

Figure 8 shows that the training time increases with the mini-batch size. There are two distinct groups of curves in the figure: the training times of the sparse datasets, Real-sim and RCV1, are entirely different from the others. The minimum training times for these datasets were 15.01 and 26.12 milliseconds respectively, well above the other group. The other nine datasets in the list took only minimal training time, in the range from 0.284 to 14.6 milliseconds. Considering the average training time over all batch sizes, Weather performed the best and RCV1 the worst. Compared with the sparse group, the variation of time across batch sizes is very small in this group. The experiment makes clear that both the batch size and the feature size influence the training time of our proposed method. The consistent performance of our method in a range of 1.09 to 6.12 milliseconds once again shows the significance of our strategy.

Fig. 8 The time taken (milliseconds) for training for different datasets and batch sizes

Table 3 makes clear that seven of the ten methods need considerable training time on all the datasets; SGD is the only exception other than ELM and the proposed method. The theoretical background of ELM and of the proposed method is the same, but our method performs some additional computations, as explained in Algorithm 1. Hence the proposed strategy has a slightly higher learning time than ELM, with a difference below 0.1 milliseconds in most cases. The other methods, except SGD, need more than 100 times the learning time of the ELM versions. The proposed method required less than 1 millisecond to complete training on seven datasets, and the others took 2.3 seconds or less.

Table 3 Training Time (milliseconds) of benchmarked and proposed method
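Training times in milliseconds can be measured with a small helper such as the one below. This is a generic sketch using the standard library only; the measurement protocol of the original experiments (number of repetitions, use of the best or the mean run) is not stated, so taking the best of several runs is an assumption, and `time_ms` is a hypothetical name.

```python
import time

def time_ms(fn, *args, repeats=5):
    """Return (result, best wall-clock time in milliseconds) over several runs."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return result, best

# Stand-in workload; in practice fn would be a classifier's training call.
total, elapsed_ms = time_ms(sum, range(1000))
```

Taking the best of several runs reduces the influence of unrelated system activity, which matters when the measured times are below one millisecond.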

4.3 Performance comparison with stream classifiers

We have extensively used the scikit-multiflow [18], a dedicated and open source framework for stream data analysis, in the experiments discussed in this section. The performance of the proposed method is compared with four ensemble methods for stream classifiers [37] - Dynamic Weighted Majority Classifier (DWMC)[13], Accuracy Weighted Ensemble Classifier (AWEC), Online Boosting classifier (OBC) [15], Oza Bagging classifier (OzBC)[17]. All the datasets except the sparse datasets are used in the experiments. The accuracy and F1 score of the methods are compared to evaluate the reliability of the proposed method. The training time of the method is also studied in the experiments.

The maximum accuracy in Table 4, 99.82%, is obtained for Electricity with the parallel ELM. The lowest accuracy, 50.3, is recorded by the Accuracy Weighted Ensemble Classifier on the Adult dataset. The proposed method achieved the highest accuracy on six datasets, while it was placed last on another two. Our method achieved more than 90% accuracy on four datasets. The results also show that all the methods performed consistently on four datasets in the list. Among the other stream classifiers, DWMC and the Online Boosting classifier followed almost the same pattern and shared the second position in the average ranking. As in the other experiments, the overall performance of our parallel ELM is better than that of the stream-based classifiers as well.

Table 4 Comparison of Accuracy with stream methods

The F1 scores achieved by the different streaming classifiers and the proposed method for the ten datasets are summarized in Fig. 9. A score of 90% or more was recorded in six experiments in the pool, and four of them were scored by the proposed method. Extending the indication of Table 4, the parallel ELM outperformed the stream classifiers on the F1 metric.

Fig. 9 The F1 Score obtained for the datasets by the stream classifiers and the proposed method

Table 5 shows that the parallel ELM completed training earlier than the other stream-based classifiers on all the datasets. DWMC performed comparatively better than the other three methods, and on most of the datasets it recorded a training time close to that of the proposed strategy. A proportional variation of training time with respect to the number of features and the batch size is visible throughout our method. Such a pattern is not visible in most of the classifiers in this category. The comparative study in the table strengthens the inferences derived in the previous sections and experiments.

Table 5 Training Time (milliseconds) of stream classifiers and proposed method with a mini-batch size of 2000

Our proposed method obtained its maximum accuracies, 99.82% and 99.68%, on the Electricity and KDD99 datasets. The Pokerhand, Stagger and LED datasets obtained the third and fourth positions, and Hyperplane obtained the fifth position. The Covtype and Airline datasets had the lowest accuracy of all. It is clear from the table that the training time of our proposed method is comparatively lower than that of the other existing methods.

Table 6 compares the proposed method with some ensemble-based learning algorithms that deal with stream data classification. Datasets that have reported outputs for the selected methods are taken for the compilation of this table. The time and accuracy for each dataset are compared, and our method came first on six of the nine datasets. In terms of training time, our method outperformed all these stream-based classifiers. These results show the significance of single-pass, knowledge-based parallel methods in online data classification.

Table 6 A comparative analysis with existing methods in stream data classification. The training Time(TT) is given in milliseconds

The results of all the experiments show that the strategy followed in the proposed method is efficient in terms of training time and gives reliable performance on most of the datasets. The underlying principle of ELM and the mean model-based knowledge base helped the model produce consistent results. Since the learning strategy is almost stable across different batch sizes, the performance of the parallel implementation was steady, and it could improve the execution time effectively.

5 Conclusion

This paper exploited the theory of ELM and the working principle of parallel algorithms to model an efficient architecture for online data classification. The introduction of a knowledge base and an output weight processing module improved on the initial version of ELM. We divided the datasets into mini-batches of different sizes for the implementation of the parallel ELM. We conducted an extensive set of experiments to validate our architecture. Datasets with various properties were selected from different public sources for the experiments. The performance of a group of benchmark classifiers and stream-based methods was compared and analysed in a collection of experiments. The results show that our method outperformed the other strategies on all validation metrics throughout the experiments. The training time of the parallel ELM is considerably reduced in all the experiments without sacrificing much of the reliability of the results. In the future, the results can be further improved by fine-tuning the knowledge updating process of the knowledge base module.