Introduction

Inspired by cognitive science, the extreme learning machine (ELM) was proposed to overcome challenging issues (such as slow convergence and the local minima problem) faced by conventional gradient-based approaches for training a single hidden-layer feedforward neural network (SLFN) [1]. Since its inception, the ELM has attracted increasing attention from researchers all over the world. One of the salient features of the ELM is that the input weights and hidden biases are generated randomly instead of being tuned iteratively. The output weights are then computed from the prefixed input weights and hidden biases using the Moore-Penrose (MP) generalized inverse only once [2, 3]. Another characteristic of the ELM is that almost any nonlinear piecewise continuous random hidden nodes can be used in it [4, 5]. Compared with the conventional gradient-based approaches, the ELM has a significantly lower computational time and presents better generalization performance. Due to its efficiency and universal approximation capability, the ELM has been applied to various problems in different fields, such as classification [6-8], recognition [9-12], imbalanced learning [13, 14], steganalysis [15], and so on. However, due to the randomness, the input weights and hidden biases generated by the ELM may be non-optimal, which may hinder the minimization of the cost function. Since the output weights highly depend on the input weights and hidden biases, such non-optimal input weights and hidden biases may have a negative impact on the final results. Moreover, the ELM may require more hidden neurons than the conventional gradient-based algorithms in many cases [16-18].

To address the above drawbacks of the ELM and find a proper set of input weights and hidden biases, Huang et al. proposed a hybrid of the differential evolution algorithm and the ELM called the evolutionary ELM (E-ELM) [19]. Besides the E-ELM, several other bio-inspired methods and global search techniques have been combined with the ELM during the past several years. Cao et al. proposed an improved learning algorithm named the self-adaptive evolutionary ELM (SaE-ELM) and evaluated its performance on regression and classification problems [20]. Xu and Shu introduced a hybrid particle swarm optimization (PSO) and ELM algorithm for a prediction problem [21]. Saraswathi et al. presented a PSO-driven ELM, combined with the integer-coded genetic algorithm (ICGA), to solve gene selection and cancer classification [22]. Silva et al. combined the ELM with group search optimization (GSO) and evaluated the influence of four different ways of handling members that fly out of the search space bounds [23]. More contemporary soft computing techniques, such as ant colony optimization [24-26], the artificial bee colony algorithm [27-29], the firefly algorithm [30, 31], binary-coded PSO [32], and multi-subswarm PSO [33], are also promising candidates for combination with the ELM.

Recently, inspired by the biological characteristics and living habits shown in the dolphins’ predatory process, Wu et al. presented a novel evolutionary algorithm named the dolphin swarm algorithm (DSA) to solve optimization problems [34]. By simulating the dolphins’ predatory process, the DSA exploits the behavior rules of dolphins and the interactions between dolphins to produce changes and achieve certain goals at the group level. With the help of effective strategies, the DSA has shown many promising features, such as fast periodic convergence, freedom from local minima, and no specific demand on the cost function [34]. In comparison with classical evolutionary algorithms, the DSA has better global search ability, better stability, and higher convergence speed, as demonstrated on benchmark functions with different properties [35]. Therefore, compared with conventional evolutionary algorithms, the DSA is a preferable method to address the ELM’s drawbacks caused by non-optimal input weights and hidden biases.

In this paper, we introduce a new hybrid method named the dolphin swarm extreme learning machine (DS-ELM) for the SLFN. In the proposed algorithm, several sets of input weights and hidden biases are used. Each set is encoded into one vector, namely a dolphin. These dolphins are evaluated by the root mean squared error (RMSE) [19-23] and updated by the four pivotal phases of the DSA. After the four pivotal phases, the set of input weights and hidden biases represented by the best dolphin is selected. Then, the output weights are derived from the chosen set. To evaluate the effectiveness of our method, we compare the proposed algorithm with the standard ELM and three state-of-the-art methods (PSO-ELM, E-ELM, and SaE-ELM) on 13 benchmark datasets (the newest or the most popular datasets) obtained from the University of California Irvine (UCI) Machine Learning Repository [36]. The experimental results show that the DS-ELM achieves better generalization performance than all the compared algorithms.

The remainder of this paper is organized as follows. “Preliminaries” presents the standard ELM and the DSA. The DS-ELM is then proposed in “Dolphin Swarm Extreme Learning Machine (DS-ELM).” “Experiments” reports the experiments, in which the DS-ELM is compared with the standard ELM, PSO-ELM, E-ELM, and SaE-ELM on six regression benchmark datasets and seven classification benchmark datasets obtained from the UCI Machine Learning Repository [36]. A summary of this paper is given in the last section.

Preliminaries

Extreme Learning Machine (ELM)

The ELM was originally proposed for the SLFN by Huang et al. [1]. It was then extended to the generalized SLFN, where the hidden layer need not be neuron-like [5]. The main concept of the ELM is that the input weights and hidden biases are generated randomly instead of being tuned iteratively. Moreover, the output weights are computed by the MP generalized inverse only once [2, 3]. Therefore, the ELM has a significantly lower computational time compared with the conventional gradient-based approaches. Besides, ELM theories have shown that almost any nonlinear piecewise continuous random hidden nodes can be used in the ELM [4, 5], and the resultant networks have universal approximation capabilities [37-39].

The output function of the ELM for generalized SLFNs with L hidden nodes is

$$ {f}_L(x)=\sum_{i=1}^L{\beta}_i{g}_i(x)=\sum_{i=1}^L{\beta}_i G\left({a}_i,{b}_i, x\right), x\in {R}^d,{\beta}_i\in {R}^m $$
(1)

where a i and b i are the input weight and hidden bias of the ith hidden node, respectively; β i denotes the weight linking the ith hidden node to the output neurons; and g i (x) denotes the output function G(a i , b i , x) of the ith hidden node. The output function G(a, b, x) is a nonlinear piecewise continuous function satisfying the ELM universal approximation capability theorems [4, 5]. For additive nodes with activation function g(x), g i (x) is defined as

$$ {g}_i(x)= G\left({a}_i,{b}_i, x\right)= g\left({a}_i\cdot x+{b}_i\right),{a}_i\in {R}^d,{b}_i\in R $$
(2)

For radial basis function (RBF) nodes with activation function g(x), g i (x) is defined as

$$ {g}_i(x)= G\left({a}_i,{b}_i, x\right)= g\left({b}_i\left\Vert x-{a}_i\right\Vert \right),{a}_i\in {R}^d,{b}_i\in {R}^{+} $$
(3)
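To make Eqs. (2) and (3) concrete, the following minimal Python sketch (not part of the paper's Matlab implementation) evaluates both node types with a sigmoid activation g; the function names and the choice of sigmoid are illustrative assumptions.

```python
import numpy as np

def g(z):
    """Sigmoid activation, one common choice for g(.)."""
    return 1.0 / (1.0 + np.exp(-z))

def additive_node(a_i, b_i, x):
    """Eq. (2): g_i(x) = g(a_i . x + b_i), with a_i in R^d and b_i in R."""
    return g(np.dot(a_i, x) + b_i)

def rbf_node(a_i, b_i, x):
    """Eq. (3): g_i(x) = g(b_i * ||x - a_i||), with a_i in R^d and b_i > 0."""
    return g(b_i * np.linalg.norm(x - a_i))
```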

Suppose we are training the generalized SLFN with L hidden neurons to learn N arbitrary samples (x i , t i ). Since the input weights and hidden biases are generated randomly, the nonlinear system can be converted to a linear model:

$$ H\beta = T $$
(4)

where

$$ H={\left[\begin{array}{ccc} G\left({a}_1,{b}_1,{x}_1\right) & \cdots & G\left({a}_L,{b}_L,{x}_1\right)\\ \vdots & \ddots & \vdots \\ G\left({a}_1,{b}_1,{x}_N\right) & \cdots & G\left({a}_L,{b}_L,{x}_N\right)\end{array}\right]}_{N\times L} $$
(5)
$$ \beta ={\left[\begin{array}{c}{\beta}_1^T\\ \vdots \\ {\beta}_L^T\end{array}\right]}_{L\times m}\quad \mathrm{and}\quad T={\left[\begin{array}{c}{t}_1^T\\ \vdots \\ {t}_N^T\end{array}\right]}_{N\times m} $$
(6)

H is the hidden-layer output matrix; β = [β 1, β 2, … , β L ]T is the output weight matrix; and T = [t 1, t 2, … , t N ]T is the matrix of desired outputs. Since the input weights and hidden biases are generated randomly, the matrix H can be computed directly from the input samples. Training the generalized SLFN is thus converted into finding the least-squares (LS) solution to the given linear model. The minimum norm LS solution to the linear model Eq. (4) is

$$ \widehat{\beta}={H}^{\dagger } T $$
(7)

where H † is the MP generalized inverse of the matrix H. By utilizing the MP generalized inverse, the ELM is able to obtain good generalization performance [2, 3].

In summary, the ELM algorithm is presented in algorithm 1.

[Algorithm 1: The extreme learning machine (ELM)]
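As a rough illustration of Algorithm 1 (not the authors' implementation), the following Python sketch builds H from Eq. (5) with sigmoid additive nodes and solves Eq. (7) with the MP generalized inverse; the names, shapes, and activation choice are assumptions.

```python
import numpy as np

def elm_train(X, T, L, seed=0):
    """X: N x d inputs, T: N x m targets, L: number of hidden nodes."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    A = rng.uniform(-1.0, 1.0, size=(L, d))    # random input weights a_i
    b = rng.uniform(-1.0, 1.0, size=L)         # random hidden biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ A.T + b)))   # hidden-layer output matrix, Eq. (5)
    beta = np.linalg.pinv(H) @ T               # minimum-norm LS solution, Eq. (7)
    return A, b, beta

def elm_predict(X, A, b, beta):
    """Eq. (1): f_L(x) = sum_i beta_i g(a_i . x + b_i)."""
    H = 1.0 / (1.0 + np.exp(-(X @ A.T + b)))
    return H @ beta
```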

Dolphin Swarm Algorithm (DSA)

The DSA was introduced by Wu et al. as an efficient global search method for varied optimization problems [34]. It is mainly implemented by simulating the biological characteristics and living habits shown in the dolphins’ actual predatory process, such as echolocation, information exchange, cooperation, and division of labor. By simulating this predatory process, the DSA can make use of the behavior rules of dolphins and the interactions between dolphins to produce changes at the group level and achieve certain goals. With the help of effective strategies, the DSA has shown many promising features, such as fast periodic convergence, freedom from local minima, and no specific demand on the cost function [34]. Compared with conventional evolutionary algorithms, the DSA has better global search ability, better stability, and higher convergence speed, as demonstrated on benchmark functions with different properties [35].

The dolphins’ actual predatory process consists of three stages. In the first stage, each dolphin independently uses sounds to search for nearby prey and evaluates the surrounding environment by the echoes. In the second stage, the dolphins exchange their information: dolphins that find large prey call other dolphins for help, and dolphins that receive this information move toward the prey and surround it along with the others. In the last stage, the prey is surrounded by the dolphins, which then take turns feeding on it, and the predatory process is accomplished [34].

In the DSA, a swarm of N dolphins is kept. In the D-dimensional search space of the optimization problem, the dolphins are defined as Dol i  = [x 1, x 2, … , x D ]T (i = 1, 2, … , N), where x j (j = 1, 2, ⋯ , D) is the component in the jth dimension to be optimized. For each Dol i , there are two corresponding variables L i (i = 1, 2, ⋯ , N) and K i (i = 1, 2, ⋯ , N), where L i represents the best solution Dol i finds in a single search and K i represents the best solution Dol i has found by itself or received from others. The fitness function, denoted fitness(X), is the basis for judging whether one position is better than another. Three kinds of distances are used: the distance between Dol i and Dol j , named DD i , j ; the distance between Dol i and K i , named DK i ; and the distance between L i and K i , named DKL i . Moreover, the information exchange process is maintained by an N × N transmission time matrix TS, where the term TS i , j represents the remaining time of the sound spreading from Dol j to Dol i .
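The following sketch shows one way these data structures might be initialized in Python; the variable names, the uniform initialization, and the initial value of TS are assumptions made for illustration.

```python
import numpy as np

def init_dsa(N, D, lower, upper, fitness, search_range, speed, seed=0):
    """Initialize N dolphins in a D-dimensional search space."""
    rng = np.random.default_rng(seed)
    Dol = rng.uniform(lower, upper, size=(N, D))   # positions Dol_i
    K = Dol.copy()                                 # K_i: best solution known to Dol_i
    L_best = Dol.copy()                            # L_i: best solution found in one search
    fit_K = np.array([fitness(k) for k in K])      # cached fitness of each K_i
    # transmission time matrix TS; one plausible initial value is ceil(range / speed)
    TS = np.full((N, N), np.ceil(search_range / speed))
    return Dol, K, L_best, fit_K, TS
```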

In general, the DSA can be divided into six phases, and in each phase the dolphins perform distinct tasks. Apart from the initialization phase and the termination phase, the four pivotal phases of the DSA, namely the search phase, call phase, reception phase, and predation phase, are expounded below for a better understanding.

Search Phase

In the search phase, each dolphin searches its nearby area by making sounds V i  = [v 1, v 2, ⋯ , v D ]T (i = 1, 2, ⋯ , M) toward M random directions, where v j (j = 1, 2, ⋯ , D) is the component in each dimension. The sounds satisfy ‖V i ‖ = speed (i = 1, 2, ⋯ , M), where speed is a constant. Within the maximum search time T, the sound V j that Dol i makes at time t will search a new solution X ijt :

$$ {X}_{i jt}={\mathrm{Dol}}_i+{V}_j\times t $$
(8)

For the new solution X ijt that Dol i gets, calculate its fitness E ijt :

$$ {E}_{ijt}=\mathrm{fitness}\left({X}_{ijt}\right) $$
(9)

If

$$ {E}_{iab}=\underset{j=1,\dots, M; t=1,\dots, T}{ \min }{E}_{ijt}=\underset{j=1,\dots, M; t=1,\dots, T}{ \min}\mathrm{fitness}\left({X}_{ijt}\right) $$
(10)

then L i is determined as

$$ {L}_i={X}_{i ab} $$
(11)

If

$$ \mathrm{fitness}\left({L}_i\right)<\mathrm{fitness}\left({K}_i\right) $$
(12)

then replace K i by L i . Otherwise K i does not change.
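A minimal sketch of the search phase for a single dolphin, following Eqs. (8)-(12); using normalized Gaussian vectors for the M random sound directions is an assumption.

```python
import numpy as np

def search_phase(Dol_i, K_i, fitness, M, T, speed, rng):
    """Return the best solution L_i found by Dol_i and its (possibly updated) K_i."""
    L_i, best_fit = None, np.inf
    for _ in range(M):
        V = rng.standard_normal(Dol_i.shape)
        V *= speed / np.linalg.norm(V)            # ||V_j|| = speed
        for t in range(1, T + 1):
            X_ijt = Dol_i + V * t                 # Eq. (8)
            E_ijt = fitness(X_ijt)                # Eq. (9)
            if E_ijt < best_fit:                  # Eqs. (10)-(11)
                L_i, best_fit = X_ijt, E_ijt
    if best_fit < fitness(K_i):                   # Eq. (12)
        K_i = L_i
    return L_i, K_i
```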

Call Phase

In the call phase, each dolphin will make sounds to inform other dolphins of its result in the search phase. And the transmission time matrix TS needs to be updated as follows:

For K i , K j , and TS i , j , if

$$ \mathrm{fitness}\left({K}_i\right)>\mathrm{fitness}\left({K}_j\right) $$
(13)

and

$$ {\mathrm{TS}}_{i, j}>\frac{DD_{i, j}}{\mathrm{speed}} $$
(14)

then update TS i , j as

$$ {\mathrm{TS}}_{i, j}=\frac{DD_{i, j}}{\mathrm{speed}} $$
(15)
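A sketch of the call phase update of TS (Eqs. (13)-(15)); DD is assumed to be a precomputed N × N matrix of pairwise distances and fit_K the cached fitness of each K i.

```python
import numpy as np

def call_phase(fit_K, DD, TS, speed):
    """Shorten TS[i, j] whenever dolphin j holds a better K than dolphin i."""
    N = len(fit_K)
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            if fit_K[i] > fit_K[j]:              # Eq. (13): K_j is better (smaller fitness)
                t_need = DD[i, j] / speed
                if TS[i, j] > t_need:            # Eq. (14)
                    TS[i, j] = t_need            # Eq. (15)
    return TS
```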

Reception Phase

When the DSA enters the reception phase, one is first subtracted from every term TS i , j to indicate that the sounds have spread for one unit of time. Then, all the terms TS i , j are checked: if

$$ {\mathrm{TS}}_{i, j}=0 $$
(16)

then the sound spreading from Dol j to Dol i has been received. Compare K i with K j : if Eq. (13) is satisfied, replace K i by K j ; otherwise, K i does not change. Then reset TS i , j as

$$ {\mathrm{TS}}_{i, j}=\left\lceil \frac{\mathrm{range}}{\mathrm{speed}}\right\rceil $$
(17)

where range is the length of the search space.
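A sketch of the reception phase (Eqs. (16)-(17)); TS is assumed to be a float array, and the comparison TS[i, j] <= 0 is used in place of exact equality with zero.

```python
import numpy as np

def reception_phase(K, fit_K, TS, search_range, speed):
    """Deliver sounds whose remaining time has elapsed and reset their TS entries."""
    N = TS.shape[0]
    TS -= 1.0                                     # sounds spread one unit of time
    for i in range(N):
        for j in range(N):
            if i != j and TS[i, j] <= 0:
                if fit_K[i] > fit_K[j]:           # Eq. (13): received K_j is better
                    K[i], fit_K[i] = K[j].copy(), fit_K[j]
                TS[i, j] = np.ceil(search_range / speed)   # Eq. (17)
    return K, fit_K, TS
```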

Predation Phase

In the predation phase, each dolphin needs to obtain a new position according to its known information, and two cases need to be distinguished.

  • For DK i , if

$$ {DK}_i\le T\times \mathrm{speed} $$
(18)

we can get Dol i ’s new position newDol i :

$$ {\mathrm{newDol}}_i={K}_i+ w\times \left({\mathrm{Dol}}_i-{K}_i\right)\times \left(1-\frac{2}{e}\right) $$
(19)

where e is a constant which is greater than 2 and w is an arbitrary unit vector.

  • For DK i , if

$$ {DK}_i > T\times \mathrm{speed} $$
(20)

we can get Dol i ’s new position newDol i :

$$ {\mathrm{newDol}}_i={K}_i+ w\times \left(1-\frac{{DK}_i\times \frac{1}{\mathrm{fitness}\left({K}_i\right)}+\left({DK}_i-{DKL}_i\right)\times \frac{1}{\mathrm{fitness}\left({K}_i\right)}}{e\times {DK}_i\times \frac{1}{\mathrm{fitness}\left({K}_i\right)}}\right)\times {DK}_i $$
(21)

After all the Dol i get their new position newDol i , compare newDol i with K i in fitness. If

$$ \mathrm{fitness}\left({\mathrm{newDol}}_i\right)<\mathrm{fitness}\left({K}_i\right) $$
(22)

then replace K i by newDol i . Otherwise K i does not change.
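A sketch of the predation phase for one dolphin (Eqs. (18)-(22)). Since w is a unit vector, the term w × (Dol i − K i) in Eq. (19) is interpreted here as w scaled by the distance DK i; this interpretation, like the variable names, is an assumption.

```python
import numpy as np

def predation_phase(Dol_i, K_i, L_i, fitness, T, speed, e, rng):
    """Compute newDol_i and possibly update K_i."""
    w = rng.standard_normal(Dol_i.shape)
    w /= np.linalg.norm(w)                        # arbitrary unit vector
    DK = np.linalg.norm(Dol_i - K_i)              # DK_i
    DKL = np.linalg.norm(L_i - K_i)               # DKL_i
    fK = fitness(K_i)
    if DK <= T * speed:                           # Eq. (18)
        new_Dol = K_i + w * DK * (1.0 - 2.0 / e)              # Eq. (19)
    else:                                         # Eq. (20)
        num = DK / fK + (DK - DKL) / fK
        den = e * DK / fK
        new_Dol = K_i + w * (1.0 - num / den) * DK            # Eq. (21)
    if fitness(new_Dol) < fK:                     # Eq. (22)
        K_i = new_Dol
    return new_Dol, K_i
```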

If the end condition is satisfied, DSA enters the termination phase. Otherwise, DSA enters the search phase again.

Although there are many parameters used in DSA, only a few of them are user-specified, including the number of dolphins N, the number of sounds M, the speed of sounds speed, maximum search time T, and the constant e used in the predation phase.

In summary, the DSA is presented in algorithm 2.

[Algorithm 2: The dolphin swarm algorithm (DSA)]

Dolphin Swarm Extreme Learning Machine (DS-ELM)

The standard ELM generates the input weights and hidden biases randomly. However, some of the generated sets of input weights and hidden biases are non-optimal, which may hinder the minimization of the cost function and have a negative impact on the final results. There are two feasible ways to avoid such non-optimal input weights and hidden biases. One is to simply increase the number of hidden nodes, which, however, may weaken the performance of the ELM. The other is to find a proper set of input weights and hidden biases that achieves favorable performance with fewer hidden nodes and a more compact network. In this section, we present our new hybrid approach, the DS-ELM, which improves the performance of the ELM in the latter way by taking advantage of the DSA’s strong global search ability.

In the DS-ELM, a swarm of N dolphins is kept. Each dolphin represents one set of input weights and hidden biases, namely Dol i  = [w 11, w 12,  … , w 1K , w 21, w 22,  … , w M1, w M2,  … , w MK , b 1, b 2,  … , b K ]T , (i = 1, 2,  … , N). All the input weights and hidden biases are randomly initialized within the range of [−1, 1]. For each dolphin, the corresponding output weights β are computed by Eq. (7).

Then, we utilize the RMSE as the fitness function [19-23]:

$$ \mathrm{fitness}\left({\mathrm{Dol}}_i\right)=\sqrt{\frac{\sum_{j=1}^M{\left\Vert {\sum}_{k=1}^K{\beta}_k\, g\left({w}_k\cdot {x}_j+{b}_k\right)-{t}_j\right\Vert}_2^2}{M}} $$
(23)

where T = [t 1, t 2, … , t M ]T is the desired output, M is the number of training samples, K is the number of hidden nodes, and w k and b k are the input weights and hidden bias of the kth hidden node.
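A sketch of the fitness evaluation in Eq. (23) for one dolphin: the flat vector is decoded into input weights and hidden biases, the output weights are solved with Eq. (7), and the RMSE on the training samples is returned. The flat layout, the sigmoid activation, and the variable names are assumptions.

```python
import numpy as np

def dolphin_fitness(dol, X, T, K):
    """dol: flat vector of K*d input weights followed by K hidden biases."""
    M, d = X.shape                                 # M training samples of dimension d
    W = dol[:K * d].reshape(K, d)                  # input weights, one row per hidden node
    b = dol[K * d:K * d + K]                       # hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))       # hidden-layer output matrix, M x K
    beta = np.linalg.pinv(H) @ T                   # output weights, Eq. (7)
    err = H @ beta - T                             # network output minus desired output
    return np.sqrt(np.sum(err ** 2) / M)           # RMSE, Eq. (23)
```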

With N dolphins and the fitness function, we are able to adopt the DSA to our DS-ELM. After the search phase, the call phase, the reception phase, and the predation phase described in “Dolphin Swarm Algorithm (DSA),” an optimal set of input weights and hidden biases will be obtained eventually.

In summary, the DS-ELM is presented in algorithm 3.

[Algorithm 3: The dolphin swarm extreme learning machine (DS-ELM)]

Experiments

In this section, the DS-ELM is compared with the standard ELM and three state-of-the-art methods, namely the PSO-ELM, E-ELM, and SaE-ELM. The experiments are composed of two parts. In the first part, the five algorithms are tested on six regression benchmark datasets, and seven classification benchmark datasets are employed in the second part. These 13 benchmark datasets are obtained from the UCI Machine Learning Repository [36]. All the simulations are conducted in the Matlab 9.1 environment (the latest version in 2016) on a machine with an Intel Core i7 2 GHz CPU and 8 GB of RAM.

In our experiments, each dataset is divided into two parts, namely the training set and the testing set. We adopt a tenfold cross-validation procedure on the training set and use the validation folds to determine the suitable parameters of the five algorithms. The number of hidden nodes is gradually increased over a preset grid {5, 10, 15, … , 200}, and the value with the lowest validation error is selected. Moreover, the number of individuals (dolphins in the DS-ELM, particles in the PSO-ELM, and population members in the E-ELM and SaE-ELM) is 5, and the number of times the ELM is invoked is equally set to 1000 for the DS-ELM, PSO-ELM, E-ELM, and SaE-ELM. The other parameters, suggested by [20, 21, 34], are shown in Table 1.
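For illustration, the hidden-node selection described above might be implemented as the following grid search with tenfold cross-validation; train_and_score is a hypothetical helper that trains one model and returns its validation error, and all other names are assumptions.

```python
import numpy as np

def select_hidden_nodes(X, T, train_and_score, folds=10, seed=0):
    """Pick the number of hidden nodes from {5, 10, ..., 200} by cross-validation."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(X.shape[0]), folds)
    best_L, best_err = None, np.inf
    for L in range(5, 201, 5):
        errs = []
        for k in range(folds):
            val = parts[k]
            tr = np.concatenate([parts[j] for j in range(folds) if j != k])
            errs.append(train_and_score(X[tr], T[tr], X[val], T[val], L))
        if np.mean(errs) < best_err:
            best_L, best_err = L, float(np.mean(errs))
    return best_L
```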

Table 1 Parameters used in the DS-ELM, PSO-ELM, and E-ELM

Regression

In this subsection, the five algorithms are compared on six regression benchmark datasets from the UCI Machine Learning Repository [36]: Servo, Breast Cancer Wisconsin (Cancer), Housing, Airfoil Self-Noise (Airfoil), Wine Quality (Wine), and Air Quality (Air). The Air dataset is one of the newest regression datasets in the UCI Machine Learning Repository [36], and the other five are among the most popular ones. The specifications of the six datasets are shown in Table 2.

Table 2 Specifications of the six regression benchmark datasets

All the attributes have been normalized to the range of [−1, 1], and the desired outputs have been normalized to the range of [0, 1]. To obtain a fair comparison and reduce the influence of accidental factors, all the experimental results are obtained by averaging 20 independent experiments. For each experiment, the training set and the testing set are randomly selected from the whole dataset, and the two sets have no overlap with each other to avoid overfitting. Moreover, the two sets are kept the same for each trial of the five algorithms. The results are shown in Table 3, and the best results are shown in italic font.

Table 3 Results of the five algorithms on the six regression benchmark datasets

In terms of the training time, it is easy to see that the ELM is the fastest, since the other four algorithms invoke it multiple times. Besides, there is not much difference in training time among the DS-ELM, PSO-ELM, E-ELM, and SaE-ELM, because invoking the ELM is the time-consuming step and the number of ELM invocations is the same for these four algorithms. Nevertheless, the DS-ELM is slightly faster than the PSO-ELM, E-ELM, and SaE-ELM.

As for the testing accuracy, the DS-ELM, PSO-ELM, E-ELM, and SaE-ELM obtain better results with fewer hidden nodes than the ELM, which means that they can achieve better generalization performance with more compact networks. Compared with the PSO-ELM, E-ELM, and SaE-ELM, the DS-ELM has smaller means and standard deviations. Wilcoxon’s signed-rank tests at the 95% confidence level show that the DS-ELM is significantly different from the ELM, PSO-ELM, E-ELM, and SaE-ELM, which indicates that the DS-ELM is superior to the other four methods on these six regression benchmark datasets.

Classification

In this subsection, the performances of the five algorithms are evaluated on seven classification benchmark datasets from the UCI Machine Learning Repository [36]: Iris, Heart, Pima Indians Diabetes (Diabetes), Image Segmentation (Image), Landsat Satellite (Satellite), Occupancy Detection (Occupancy), and Shuttle. The Occupancy dataset is one of the newest classification datasets in the UCI Machine Learning Repository [36], and the other six are among the most popular ones. The specifications of the seven datasets are listed in Table 4.

Table 4 Specifications of the seven classification benchmark datasets

All the attributes have been normalized to the range of [−1, 1]. As in the experiments in “Regression,” all the experimental results are obtained by averaging 20 independent experiments. For each experiment, the whole dataset is divided into two parts with no overlap, namely the training set and the testing set, and the two sets are kept the same for each trial of the five algorithms. The results are shown in Table 5, and the best results are emphasized with italic font.

Table 5 Results of the five algorithms on the seven classification benchmark datasets

Regarding the training time, the same conclusion can be drawn: the ELM runs fastest, and the other four methods spend nearly the same amount of time training the networks, which is theoretically reasonable as analyzed in “Regression.”

As for the testing accuracy, the DS-ELM achieves the highest mean testing accuracy and the lowest standard deviation on all the classification datasets. In addition, Wilcoxon’s signed-rank tests at the 95% confidence level show that the DS-ELM is significantly different from the ELM, PSO-ELM, E-ELM, and SaE-ELM, which indicates that the DS-ELM outperforms the other four approaches on these seven classification benchmark datasets.

Conclusion

In this paper, we first introduced the standard ELM and the DSA. We then proposed a new hybrid approach named the dolphin swarm extreme learning machine for the SLFN. In our algorithm, the DSA is used to optimize the input weights and hidden biases of the ELM, and the ELM is employed to calculate the output weights. In the experiments, our method was compared with the standard ELM and three state-of-the-art methods, namely the PSO-ELM, E-ELM, and SaE-ELM, on six regression benchmark datasets and seven classification benchmark datasets obtained from the UCI Machine Learning Repository. The experimental results show that our method achieves better generalization performance with more compact networks than the standard ELM. Moreover, our algorithm achieves better testing results (smaller RMSE on regression and higher accuracy on classification) than the compared methods. According to Wilcoxon’s signed-rank tests, the dolphin swarm extreme learning machine is superior to the other four methods on both the six regression datasets and the seven classification datasets. However, the DS-ELM is time consuming compared with the standard ELM, and it does not reduce the number of hidden nodes compared with the state-of-the-art methods. Future work will concentrate on reducing the training time and making the network more compact.