
1 Introduction

In recent years, deep learning methods have become attractive because of their success in multiple areas. Convolutional neural networks (CNN) have won many competitions against conventional methods in visual tasks, and recurrent neural network (RNN) based models excel at processing sequential inputs [10]. These methods use the gradient-based back-propagation (BP) algorithm to train the large number of parameters in their networks. As proposed in [13], the main idea of the BP algorithm is to tune the weights and biases following the gradient of the loss function. However, due to the large computational cost, some modern neural networks may need several weeks to finish training [11]. Moreover, overfitting [6] is another serious problem, causing a model to perform well during training while achieving poor performance at test time.

Meanwhile, some randomization based neural networks have been proposed to overcome these flaws of BP-based models [14, 15, 17]. The weights and biases in these models are randomly generated and kept fixed during the training process; only the parameters of the output layer are obtained, via a closed-form solution [16]. The random vector functional link network (RVFL) is a typical single-hidden-layer randomized neural network [12]. It has a direct link that conveys information from the input layer straight to the output layer, which is useful because the output layer then receives both the linear original features and the non-linear transformed features. The newest variant of this model, the ensemble deep random vector functional link network (edRVFL), was proposed in [8]: the authors convert the single-hidden-layer RVFL into a deep version and employ the idea of ensemble learning to reduce the computational complexity. Similar to conventional neural networks, the edRVFL network consists of an input layer, an output layer, and several hidden layers. The hidden weights and biases in this network are randomly generated and do not need to be trained. The uniqueness of this framework is that each hidden layer is treated as an independent classifier, just like a single RVFL network. Eventually, the final output is obtained by fusing the outputs of all these classifiers.

Ensemble learning methods are widely used in classification problems; they combine multiple models for prediction to overcome the weaknesses of each single learning algorithm [19]. Among them, bagging [1], boosting [4], and stacking [18] are the three most popular and successful methods. Adaboost was originally proposed to improve the performance of decision trees [3]. It combines several weak classifiers to obtain a strong classifier: samples misclassified by previous classifiers are given greater importance in the following classifiers, and a classifier with higher accuracy is assigned a higher weight in the final prediction. In this paper, we introduce two novel methods that use Adaboost to build ensembles of RVFL and of its deep version. We call them adaptive ensemble random vector functional link networks (ada-eRVFL) and adaptive ensemble deep random vector functional link networks (ada-edRVFL). In ada-eRVFL, each single RVFL network is treated as a weak classifier in Adaboost, whereas in ada-edRVFL, every single hidden layer is treated as a weak classifier.

2 Related Works

2.1 Random Vector Functional Link Networks

The random vector functional link network (RVFL) is a randomization based single-hidden-layer neural network proposed by Pao [12]. The basic structure of RVFL is shown in Fig. 1.

Fig. 1. The structure of the RVFL network. The red lines denote the direct link, which transfers the linear original features to the output layer. (Color figure online)

Both the linear original features and the non-linearly transformed features are conveyed to the output layer through the direct link and the hidden layer, respectively. Therefore, the output weights \(\beta \) can be learned from the following optimization problem:

$$\begin{aligned} O_{RVFL}=\mathop {min}\limits _{\beta }||D\beta -Y||^2+\lambda ||\beta ||^2 \end{aligned}$$
(1)

where D is the concatenation of all linear and non-linear features, and Y contains the true labels of all samples. \(\lambda \) denotes the regularization parameter that controls how much the algorithm penalizes model complexity. This optimization problem can be solved by ridge regression [7], and the solution can be written as follows:

$$\begin{aligned} \text {Primal space: } \beta =(D^TD+{\lambda }I)^{-1}D^TY \end{aligned}$$
(2)
$$\begin{aligned} \text {Dual space: } \beta =D^T(DD^T+{\lambda }I)^{-1}Y \end{aligned}$$
(3)

The computational cost of training the RVFL network is reduced by suitably choosing between the primal and dual solutions [15].
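To make Eqs. (1)–(3) concrete, below is a minimal NumPy sketch of RVFL training that picks the cheaper closed-form solution. The function names, the ReLU activation, and the uniform initialization range are our own illustrative choices, not details taken from [12] or [15]; the random bias term follows the general statement that weights and biases are randomly generated.

```python
import numpy as np

def train_rvfl(X, Y, n_hidden=128, lam=1e-2, rng=None):
    """Minimal RVFL sketch: fixed random hidden layer + ridge output weights."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = X.shape
    W = rng.uniform(-1.0, 1.0, size=(d, n_hidden))  # random hidden weights, never trained
    b = rng.uniform(-1.0, 1.0, size=n_hidden)       # random biases, never trained
    H = np.maximum(0.0, X @ W + b)                  # non-linear features (ReLU as g)
    D = np.hstack([X, H])                           # direct link: linear + non-linear features
    k = D.shape[1]
    if n >= k:  # primal solution, Eq. (2): invert a k x k matrix
        beta = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ Y)
    else:       # dual solution, Eq. (3): invert an n x n matrix
        beta = D.T @ np.linalg.solve(D @ D.T + lam * np.eye(n), Y)
    return W, b, beta

def predict_rvfl(X, W, b, beta):
    """Class scores; take the argmax over columns for the predicted label."""
    D = np.hstack([X, np.maximum(0.0, X @ W + b)])
    return D @ beta
```

Choosing between primal and dual by comparing the sample count n with the feature count k mirrors the cost argument of [15]: the smaller of the two matrices is the one to invert.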

2.2 Ensemble Deep Random Vector Functional Link Networks

Inspired by other deep learning models, the authors of [8] proposed a deep version of the basic RVFL network and used ensemble learning to improve its performance. The structure of edRVFL is shown in Fig. 2. Let n be the number of hidden neurons and l the index of a hidden layer. The output of the first hidden layer is obtained by:

$$\begin{aligned} H^{(1)}=g(XW^{(1)}),\quad W^{(1)} \in \mathbb {R}^{d\times n} \end{aligned}$$
(4)

where X denotes the input features, d represents the number of features of the input samples, and \(g(\cdot )\) is the non-linear activation function used in each hidden neuron. For layers \(l>1\), similar to the RVFL network, the hidden features of the previous hidden layer and the original features of the input layer are concatenated to generate the next hidden layer, so Eq. 4 becomes:

$$\begin{aligned} H^{(l)}=g([H^{(l-1)},X]W^{(l)}),\quad W^{(l)} \in \mathbb {R}^{(n+d)\times n} \end{aligned}$$
(5)

The edRVFL network likewise treats every hidden layer as a single RVFL classifier. After obtaining predictions from all layers via ridge regression, these outputs are fused by an ensemble method to produce the final output.
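A short sketch of this per-layer scheme, under the same illustrative assumptions as the RVFL sketch above (ReLU for \(g(\cdot )\), uniform random weights, and simple score averaging as the fusion rule):

```python
import numpy as np

def edrvfl_fit_predict(X, Y, n_hidden=128, n_layers=4, lam=1e-2):
    """edRVFL sketch: stack hidden layers per Eqs. (4)-(5); each layer
    trains its own ridge-regression classifier, and scores are fused."""
    rng = np.random.default_rng(0)
    n, d = X.shape
    H, scores = X, []
    for l in range(n_layers):
        in_dim = d if l == 0 else n_hidden + d
        W = rng.uniform(-1.0, 1.0, size=(in_dim, n_hidden))
        inp = X if l == 0 else np.hstack([H, X])   # Eq. (5): previous layer + raw input
        H = np.maximum(0.0, inp @ W)               # Eq. (4) for the first layer
        D = np.hstack([X, H])                      # direct link for this layer's classifier
        beta = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ Y)
        scores.append(D @ beta)                    # this layer's prediction scores
    return np.mean(scores, axis=0)                 # fusion by averaging (one simple choice)
```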

2.3 AdaBoost

Boosting has proven successful in solving classification problems. It was first introduced in [3] through the Adaboost algorithm, which was originally proposed for two-class classification problems and for improving the performance of decision trees. The main idea of Adaboost is to approximate the Bayes classifier by combining several weak classifiers. The algorithm is an iterative procedure: it starts by using unweighted samples to build the first classifier, and in each following step, the weights of the samples misclassified by the previous classifier are boosted, giving those samples higher importance in the error calculation. After several repetitions, weighted majority voting combines the outputs of all classifiers into the final output.

Fig. 2. The structure of the edRVFL network. Each single layer is treated as an independent RVFL network. The final output is obtained by combining the predictions from all the classifiers.

In [5], the authors developed a new algorithm called Stagewise Additive Modeling using a Multi-class Exponential loss function (SAMME), which directly extends Adaboost to the multi-class case without decomposing it into multiple two-class problems.
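For later reference, SAMME assigns the m-th weak classifier the weight

$$\begin{aligned} \alpha ^{(m)} = \log \frac{1-err^{(m)}}{err^{(m)}} + \log (K-1) \end{aligned}$$

where K is the number of classes; according to [5], for \(K=2\) this reduces to the original Adaboost weight.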

3 Method

In this section, we propose the adaptive ensemble random vector functional link network (ada-eRVFL). ada-eRVFL is inspired by the adaptive boosting method. For a data set \(\textit{\textbf{x}}\), the weights \(\alpha \) are assigned to the sub-classifiers according to the error function:

$$\begin{aligned} err^{(m)} = \sum _{i=1}^n \omega _i \mathbbm {1}(c_i \not = R^{(m)}(x_i)) \Big / \sum _{i=1}^n \omega _i \end{aligned}$$
(6)

where \(R^{(m)}\) represents the m-th single RVFL classifier, \(c_i\) is the true class of sample \(x_i\), and \(\omega \) are the sample weights, updated as follows:

$$\begin{aligned} \omega _i \leftarrow \omega _i\cdot \exp \left( \alpha ^{(m)}\cdot \mathbbm {1}\left( c_i \not = R^{(m)}(x_i)\right) \right) \end{aligned}$$
(7)

It is worth noting that the sample weights only contribute to computing the error function of each sub-classifier; the training phase of the weak RVFL classifiers uses only the original, unweighted samples. The weak RVFL classifiers in the ensemble are therefore individual and independent, while misclassified samples are assigned larger weights.

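To make the procedure concrete, below is a minimal sketch of the ada-eRVFL training loop as we read Eqs. (6)–(7) together with the SAMME classifier weight. `train_rvfl` and `predict_rvfl` are the hypothetical helpers from the RVFL sketch in Sect. 2.1, and all names and defaults are ours.

```python
import numpy as np

def ada_ervfl_fit(X, y, K, M=32, **rvfl_kwargs):
    """ada-eRVFL sketch: M independent RVFL weak classifiers combined with
    SAMME-style weights. As stated above, the sample weights enter only the
    error computation; every RVFL is trained on the unweighted data."""
    n = X.shape[0]
    Y = np.eye(K)[y]                       # one-hot targets for ridge regression
    w = np.ones(n) / n                     # sample weights
    models, alphas = [], []
    for m in range(M):
        rng = np.random.default_rng(m)     # fresh random hidden weights per classifier
        W, b, beta = train_rvfl(X, Y, rng=rng, **rvfl_kwargs)
        miss = (predict_rvfl(X, W, b, beta).argmax(axis=1) != y).astype(float)
        err = np.clip((w * miss).sum() / w.sum(), 1e-12, 1 - 1e-12)  # Eq. (6)
        if 1.0 - err <= 1.0 / K:           # classifier weight would not be positive
            continue
        alpha = np.log((1 - err) / err) + np.log(K - 1)  # SAMME weight
        w = w * np.exp(alpha * miss)       # Eq. (7): boost misclassified samples
        models.append((W, b, beta))
        alphas.append(alpha)
    return models, alphas

def ada_ervfl_predict(X, models, alphas, K):
    """Weighted majority vote over the retained weak classifiers."""
    votes = np.zeros((X.shape[0], K))
    for (W, b, beta), a in zip(models, alphas):
        pred = predict_rvfl(X, W, b, beta).argmax(axis=1)
        votes[np.arange(X.shape[0]), pred] += a
    return votes.argmax(axis=1)
```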

Another proposed variant is the adaptive ensemble deep random vector functional link network (ada-edRVFL). The deep RVFL network consists of several hidden layers instead of the single hidden layer in RVFL, and each hidden layer constitutes a sub-classifier together with the input layer, the output layer, and the direct link between them.

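The same boosting loop carries over; the only change is that the weak classifiers are the successive per-layer classifiers of one deep network rather than independent RVFLs. A sketch under the same assumptions as above (prediction mirrors `ada_ervfl_predict`):

```python
import numpy as np

def ada_edrvfl_fit(X, y, K, n_hidden=128, n_layers=32, lam=1e-2):
    """ada-edRVFL sketch: boost the per-layer classifiers of one deep RVFL
    stack in depth order, reusing Eqs. (5)-(7) and the SAMME weight."""
    rng = np.random.default_rng(0)
    n, d = X.shape
    Y = np.eye(K)[y]
    w = np.ones(n) / n
    H, layers, alphas = X, [], []
    for l in range(n_layers):
        in_dim = d if l == 0 else n_hidden + d
        W = rng.uniform(-1.0, 1.0, size=(in_dim, n_hidden))
        inp = X if l == 0 else np.hstack([H, X])
        H = np.maximum(0.0, inp @ W)                  # Eq. (5)
        D = np.hstack([X, H])                         # direct link
        beta = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ Y)
        miss = ((D @ beta).argmax(axis=1) != y).astype(float)
        err = np.clip((w * miss).sum() / w.sum(), 1e-12, 1 - 1e-12)  # Eq. (6)
        if 1.0 - err <= 1.0 / K:
            continue                                  # the stack still grows, but this layer gets no vote
        alpha = np.log((1 - err) / err) + np.log(K - 1)
        w = w * np.exp(alpha * miss)                  # Eq. (7)
        layers.append((W, beta))
        alphas.append(alpha)
    return layers, alphas
```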

For both proposed methods, we need to ensure that the classifier weights \(\alpha \) are positive; thus it is required that \(1-err^{(m)}> 1/K\), where K is the number of classes.
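This condition follows directly from the SAMME weight given in Sect. 2.3:

$$\begin{aligned} \alpha ^{(m)}> 0 \;\Leftrightarrow \; \frac{(K-1)(1-err^{(m)})}{err^{(m)}}> 1 \;\Leftrightarrow \; 1-err^{(m)} > \frac{1}{K} \end{aligned}$$

that is, each weak classifier only needs to perform better than random guessing over the K classes.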

4 Experiments

The experiments are performed on 9 classification datasets selected from the UCI Machine Learning Repository [2], covering both binary and multi-class problems, with sample sizes and feature dimensions ranging from hundreds to thousands. The details of the datasets are stated below. In the interest of a fair comparison, we use exactly the same validation and test subsets and the same data pre-processing method as in [9].

For the RVFL based methods, the regularization parameter \(\lambda \) is set to \(\frac{1}{C}\), where C is chosen from \(\{2^x : x = -6,-4,\dots ,10,12\}\). Depending on the size of the dataset, the number of hidden neurons is tuned from \(\{2^2,2^3,\dots ,2^{10},2^{11}\}\).

The edRVFL based methods use the same search ranges for \(\lambda \) and the number of hidden neurons. In addition, the maximum number of hidden layers is set to 32, which is also the number of sub-classifiers in ada-edRVFL.
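For clarity, the two search grids can be written out as follows (our own snippet, restating the ranges above):

```python
# lambda = 1/C with C in {2^-6, 2^-4, ..., 2^10, 2^12} (step of 2 in the exponent)
C_grid = [2.0 ** x for x in range(-6, 13, 2)]
# hidden neurons in {2^2, 2^3, ..., 2^10, 2^11}
neuron_grid = [2 ** x for x in range(2, 12)]
```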

Table 1. Accuracy (%) and average rank of the compared approaches

From Table 1, ada-edRVFL achieves the best performance on all datasets. On average, ada-edRVFL improves accuracy by over two percentage points. This suggests that the proposed adaptive ensemble method can effectively boost the performance of weak classifiers, and that it makes the single RVFL network competitive with the deep RVFL network.

5 Conclusion

We proposed an adaptive ensemble method for random vector functional link networks. Our framework outperforms the edRVFL baseline, and the proposed method can be applied to different RVFL based networks. After applying the proposed ensemble method, the performance of the ensemble RVFL can compete with that of the deep RVFL network. Specifically, we tested the proposed method on 9 UCI classification datasets, and the experimental results show that the proposed ensemble variant is both effective and general.