1 Introduction

The most popular method for training a feedforward neural network (FNN) is the backpropagation (BP) algorithm [1]. However, the traditional BP algorithm has the following major drawbacks. First, it is apt to be trapped in local minima. Second, it takes no account of the network structure or the properties of the problem involved, so its generalization capability is limited. Finally, since BP algorithms are gradient-based learning algorithms, they converge very slowly [27].

In [8, 9], two algorithms, referred to as the Hybrid-I and Hybrid-II methods, respectively, were proposed. The Hybrid-I algorithm incorporates the first-order derivatives of the neural activations at the hidden layers into the sum-of-squares error cost function to reduce the input-to-output mapping sensitivity. The Hybrid-II algorithm incorporates the second-order derivatives of the neural activations at the hidden layers and the output layer into the sum-of-squares error cost function to penalize the high-frequency components in the training data. In [10], a modified hybrid learning algorithm (MHLA) was derived from the Hybrid-I and Hybrid-II algorithms to improve generalization performance. All of the above learning algorithms are purely local search algorithms and are therefore apt to converge to local minima.

Gradient-based learning algorithms clearly have good local search capability. On the other hand, the particle swarm optimization (PSO) algorithm has good global search capability [11–15]. Combining global search with local search in a learning algorithm can therefore improve its convergence performance.

In recent years, PSO has been used increasingly as an effective technique for finding global minima [13–16]. Compared with the genetic algorithm, PSO has no complicated evolutionary operators and fewer parameters to adjust during training [17–19].

Hence, in [20], a double-search approach for function approximation, referred to as APSOAEFDI–MHLA, was proposed to obtain better approximation performance. First, APSOAEFDI, which combines adaptive PSO (APSO) with the first-order derivative information of the approximated function, is used to search globally. Then MHLA is used to search locally around the result of the global search. In this paper, an improved approach in the spirit of APSOAEFDI–MHLA for function approximation, based on APSO and a priori information, is proposed. To overcome the drawbacks of gradient-based algorithms for FNNs, the FNN is first trained by APSOAEFDI to approach the global minimum, and then the network is trained again by a modified gradient-based algorithm with a magnified gradient function [21]. By combining APSO with a local search algorithm and exploiting the a priori information, the improved approach achieves better approximation accuracy and a faster convergence rate. Finally, simulation results are given to verify the efficiency and effectiveness of the proposed learning approach.

2 Particle swarm optimization

PSO is an evolutionary computation technique developed by Eberhart and Kennedy in 1995 [11, 12], inspired by the social behavior of bird flocking. PSO searches for the best solution by simulating the movement of a flock of birds. The algorithm initializes a flock of birds randomly over the search space, where each bird is called a "particle". These particles fly with a certain velocity and find the global best position after some iterations. At each iteration, each particle adjusts its velocity vector according to its momentum and the influence of its own best position (P b) as well as the best position of its neighbors (P g), and then moves to a new position. Suppose the dimension of the search space is D and the total number of particles is n. The position of the ith particle is expressed as the vector \( X_{i} = (x_{i1}, x_{i2}, \ldots, x_{iD}) \); the best position found so far by the ith particle is denoted \( P_{ib} = (p_{i1}, p_{i2}, \ldots, p_{iD}) \); the best position found so far by the whole swarm is denoted \( P_{g} = (p_{g1}, p_{g2}, \ldots, p_{gD}) \); and the velocity of the ith particle is represented as the vector \( V_{i} = (v_{i1}, v_{i2}, \ldots, v_{iD}) \). The original PSO algorithm (PSOA) [11, 12] is then described as:

$$ v_{id} (t + 1) = v_{id} (t) + c_{1} \times {\text{rand}}() \times [p_{id} (t) - x_{id} (t)] + c_{2} \times {\text{rand}}() \times [p_{gd} (t) - x_{id} (t)] $$
(1)
$$ x_{id} (t + 1) = x_{id} (t) + v_{id} (t + 1)\quad 1 \le i \le n \quad 1 \le d \le D $$
(2)

where c 1 and c 2 are acceleration constants with positive values, and rand() is a random number between 0 and 1. In addition to c 1 and c 2, the implementation of the original algorithm also requires placing a limit on the velocity (v max). After tuning c 1, c 2 and v max, the PSO can achieve its best search ability.

The adaptive particle swarm optimization algorithm (APSOA), proposed by Shi and Eberhart in 1998 [22, 23], is based on the original PSO algorithm. APSO can be described as follows:

$$ v_{id} (t + 1) = w \times v_{id} (t) + c_{1} \times {\text{rand}}() \times [p_{id} (t) - x_{id} (t)] + c_{2} \times {\text{rand}}() \times [p_{gd} (t) - x_{id} (t)] $$
(3)
$$ x_{id} (t + 1) = x_{id} (t) + v_{id} (t + 1)\quad 1 \le i \le n\quad 1 \le d \le D $$
(4)

where w is a new inertia weight. The parameter w decreases gradually as the generation number increases. The APSO algorithm is more effective because the search scope is reduced step by step, not linearly.
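As an illustration, the APSO update of Eqs. 3–4 can be sketched in a few lines of code. The velocity clamp v max, the acceleration constants c 1 = c 2 = 2, the sphere test function, and the linearly decreasing inertia schedule are illustrative assumptions, not values taken from this paper:

```python
import random

def apso_step(X, V, P, Pg, w, c1=2.0, c2=2.0, vmax=1.0):
    """One APSO update (Eqs. 3-4): inertia-weighted velocity plus attraction
    toward each particle's best P[i] and the swarm best Pg, with velocities
    clamped to [-vmax, vmax]."""
    n, D = len(X), len(X[0])
    for i in range(n):
        for d in range(D):
            V[i][d] = (w * V[i][d]
                       + c1 * random.random() * (P[i][d] - X[i][d])
                       + c2 * random.random() * (Pg[d] - X[i][d]))
            V[i][d] = max(-vmax, min(vmax, V[i][d]))  # velocity limit
            X[i][d] += V[i][d]

def sphere(x):  # toy fitness function with global minimum at the origin
    return sum(v * v for v in x)

random.seed(0)
n, D, T = 20, 3, 200
X = [[random.uniform(-5, 5) for _ in range(D)] for _ in range(n)]
V = [[0.0] * D for _ in range(n)]
P = [row[:] for row in X]           # personal best positions
Pg = min(P, key=sphere)[:]          # global best position
for t in range(T):
    w = 0.9 - 0.5 * t / T           # inertia weight decreases with generation
    apso_step(X, V, P, Pg, w)
    for i in range(n):
        if sphere(X[i]) < sphere(P[i]):
            P[i] = X[i][:]
            if sphere(P[i]) < sphere(Pg):
                Pg = P[i][:]
print(sphere(Pg))
```

Setting w = 0 in `apso_step` recovers the original PSOA of Eqs. 1–2.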

3 Modified Hybrid-I algorithm with magnified gradient function

First, the following mathematical notation is introduced. \( x_k \) and \( y_i \) denote the kth element of the input vector and the ith element of the output vector, respectively; \( w_{j_l j_{l-1}} \) denotes the synaptic weight from the \( j_{l-1} \)th hidden neuron at the (l − 1)th hidden layer to the \( j_l \)th hidden neuron at the lth hidden layer; \( w_{i j_{L-1}} \) denotes the synaptic weight from the \( j_{L-1} \)th hidden neuron at the (L − 1)th hidden layer to the ith neuron at the output layer; \( w_{j_1 k} \) denotes the synaptic weight from the kth element of the input vector to the \( j_1 \)th hidden neuron at the first hidden layer; \( f'_l(\cdot) \) is the derivative of the activation function \( f_l(\cdot) \) at the lth hidden layer; \( h_{j_l} = f_l(\hat{h}_{j_l}) \) is the activation of the \( j_l \)th neuron at the lth hidden layer, with \( \hat{h}_{j_l} = \sum\nolimits_{j_{l-1}} w_{j_l j_{l-1}} h_{j_{l-1}} \). Further, \( t_i \) and \( y_i \) denote the target and actual output values of the ith neuron at the output layer, respectively; \( N_l \) denotes the number of neurons at the lth hidden layer; and \( N_L \) denotes the number of neurons at the output layer.

To reduce the input-to-output mapping sensitivity, the following cost function was proposed in Hybrid-I [8, 9]:

$$ E = \frac{1}{N}\sum\limits_{S = 1}^{N} E^{S} = \frac{1}{N}\sum\limits_{S = 1}^{N} \left( \frac{1}{2N_{L}}\sum\limits_{i = 1}^{N_{L}} \left( t_{i}^{S} - y_{i}^{S} \right)^{2} + \sum\limits_{l = 1}^{L - 1} \gamma_{l} E_{h}^{lS} \right) $$
(5)

where \( E_{h}^{lS} = \frac{1}{N_{l}}\sum\nolimits_{j_{l} = 1}^{N_{l}} f'\left( \hat{h}_{j_{l}}^{S} \right) \) denotes the additional hidden-layer penalty term at the lth layer, and the gain \( \gamma_l \) represents the relative significance of the hidden-layer cost over the output error.

The network is trained by a steepest-descent error minimization algorithm as usual, and the synaptic weight update for the Sth stored pattern becomes [8, 9]

$$ \Delta w_{j_{l} j_{l - 1}}^{S} = - \eta_{l} \frac{\partial E^{S}}{\partial w_{j_{l} j_{l - 1}}^{S}} = \eta_{l} \delta_{j_{l}}^{S} h_{j_{l - 1}}^{S} \quad l = 1,2, \ldots ,L $$
(6)

where \( \delta_{j_{l}}^{S} \) denotes the negative derivative of \( E^{S} \) with respect to \( \hat{h}_{j_{l}}^{S} \) at the lth layer. Hence \( \delta_{j_{l}}^{S} \) (\( l = 1, \ldots, L \)) can be computed for the Sth stored pattern in back-propagation style as follows [8, 9]:

$$ \delta_{j_{l}}^{S} = - \frac{\partial E^{S}}{\partial \hat{h}_{j_{l}}^{S}} = \sum\limits_{j_{l + 1} = 1}^{N_{l + 1}} \delta_{j_{l + 1}}^{S} w_{j_{l + 1} j_{l}}^{S} f'\left( \hat{h}_{j_{l}}^{S} \right) - \frac{\gamma_{l}}{N_{l}} f''\left( \hat{h}_{j_{l}}^{S} \right) \quad l = 1, \ldots ,L - 1 $$
(7)
$$ \delta_{j_{L}}^{S} = - \frac{\partial E^{S}}{\partial \hat{h}_{j_{L}}^{S}} = \frac{1}{N_{L}} f'\left( \hat{h}_{j_{L}}^{S} \right)\left( t_{j_{L}}^{S} - y_{j_{L}}^{S} \right) $$
(8)

In this paper, we adopt the tangent sigmoid transfer function as the activation function for all hidden neurons at all layers:

$$ f(x) = (1 - \exp ( - 2x))/(1 + \exp ( - 2x)) $$
(9)

This function has the following properties:

$$ f^{\prime } (x) = (1 - f(x))(1 + f(x)) $$
(10)
$$ f^{\prime \prime } (x) = - 2f(x)f^{\prime } (x) $$
(11)

To decrease the chance of being trapped in local minima, a Hybrid-I algorithm combined with the magnified gradient function [21] is proposed in this paper. According to Eq. 10, when \( f\left( \hat{h}_{j_{l}}^{S} \right) \) approaches an extreme value (i.e., −1 or 1), the factor \( f'\left( \hat{h}_{j_{l}}^{S} \right) \) (\( l = 1, \ldots, L - 1 \)) appearing in Eqs. 7 and 8 becomes so small (close to zero) that \( \Delta w_{j_{l} j_{l-1}} \) also approaches zero. The network is then trapped in a flat region, so it converges slowly to the global optimal solution, or cannot reach it at all. To overcome this problem, the factors \( f'\left( \hat{h}_{j_{l}}^{S} \right) \) (\( l = 1, \ldots, L - 1 \)) are magnified in the improved algorithm by applying a power 1/K, where the magnified gradient coefficient K is a real number with K ≥ 1; that is, \( f'\left( \hat{h}_{j_{l}}^{S} \right) \) is replaced by \( \left( f'\left( \hat{h}_{j_{l}}^{S} \right) \right)^{1/K} \). Compared with the standard BP algorithm, the gradient term thus receives a larger boost exactly when \( f'\left( \hat{h}_{j_{l}}^{S} \right) \) approaches zero, so the network is less likely to be trapped in a flat spot and converges faster to the global optimal solution.
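The magnification step can be sketched as follows. This is an illustrative fragment only; it shows that raising f′ to the power 1/K (here with K = 1.8, the empirically best value found in Sect. 5) enlarges small gradient factors while leaving f′ = 1 unchanged:

```python
def magnify(fp, K=1.8):
    """Replace the gradient factor f'(h) by (f'(h))**(1/K), K >= 1.
    For 0 < f' < 1 this pushes the factor toward 1, so weight updates
    do not vanish when a neuron saturates; f' = 1 is left unchanged."""
    assert K >= 1 and 0 <= fp <= 1
    return fp ** (1.0 / K)

# Near saturation f' is tiny; the magnified factor stays usable.
for fp in (1e-4, 1e-2, 0.5):
    assert magnify(fp) > fp   # strictly enlarged for 0 < f' < 1
assert magnify(1.0) == 1.0    # unchanged at the maximum
```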

Moreover, since the above modified algorithm still incorporates the first-order derivatives of the neural activations at the hidden layers into the sum-of-squares error cost function, it retains the reduced input-to-output mapping sensitivity [8–10].

The modified local search algorithm thus combines the Hybrid-I algorithm with the magnified gradient function; we call it MGFHIA.

4 APSO encoding a priori information from the approximated function

Since a neural network with a single nonlinear hidden layer is capable of forming an arbitrarily close approximation of any continuous nonlinear mapping [24–26], our discussion is limited to single-hidden-layer feedforward neural networks (SLFNs).

In the course of approximating a function, an FNN can approximate it more accurately when a priori information about the function's properties is encoded into the network. Since the first-order derivatives of a function play an important role in its shape, a priori information consisting of the first-order derivatives of the approximated function is considered in this paper.

Assume that the sample points of the function are selected at identically spaced intervals. In addition, these sample points, i.e., \( (x_i, t_i) \), \( i = 1, 2, \ldots, N \), where \( x_i = [x_{i1}, x_{i2}, \ldots, x_{in}]^T \in R^n \) and \( t_i = [t_{i1}, t_{i2}, \ldots, t_{im}]^T \in R^m \), are assumed to be very close together in space.

First, a method is presented to obtain approximate values of the first-order partial derivatives of the approximated function. By the mean value theorem, the corresponding approximate estimates of the first-order partial derivatives can be obtained as follows:

$$ g'(x_{il}) \approx (t_{i + 1} - t_{i - 1})/(x_{(i + 1)l} - x_{(i - 1)l}),\quad i = 2, \ldots ,N - 1,\quad l = 1,2, \ldots ,n. $$
(12)
$$ g'(x_{1l}) \approx (t_{2} - t_{1})/(x_{2l} - x_{1l}),\quad g'(x_{Nl}) \approx (t_{N} - t_{N - 1})/(x_{Nl} - x_{(N - 1)l})\quad l = 1,2, \ldots ,n. $$
(13)
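Equations 12 and 13 are central differences in the interior and one-sided differences at the endpoints. The following sketch (with the sample count and the test function sin(x) on [0, π) chosen purely for illustration) implements the scheme and confirms that, for closely spaced samples, the estimates track the true derivative cos(x):

```python
import math

def derivative_estimates(xs, ts):
    """Approximate g'(x_i) from equally spaced samples (Eqs. 12-13):
    central differences in the interior, one-sided at the endpoints."""
    N = len(xs)
    g = [0.0] * N
    for i in range(1, N - 1):
        g[i] = (ts[i + 1] - ts[i - 1]) / (xs[i + 1] - xs[i - 1])
    g[0] = (ts[1] - ts[0]) / (xs[1] - xs[0])
    g[N - 1] = (ts[N - 1] - ts[N - 2]) / (xs[N - 1] - xs[N - 2])
    return g

# The closer the samples, the better the estimates: check against cos(x).
N = 126
xs = [i * math.pi / N for i in range(N)]
ts = [math.sin(x) for x in xs]
est = derivative_estimates(xs, ts)
err = max(abs(e - math.cos(x)) for e, x in zip(est, xs))
assert err < 0.02
```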

Obviously, the closer together the sample points are, the more accurate the corresponding estimates of the first-order derivatives are. Assume that an SLFN, ϕ(·), is used to approximate the function g(·). Then the first-order derivative of the network output with respect to \( x_{j_1} \) can be obtained as:

$$ \phi'\left( x_{j_{1}} \right) = \left[ \sum\limits_{j_{2} = 1}^{H} w_{1,j_{2}} f'\left( \hat{h}_{j_{2}} \right) w_{j_{2} j_{1}}, \ldots, \sum\limits_{j_{2} = 1}^{H} w_{m,j_{2}} f'\left( \hat{h}_{j_{2}} \right) w_{j_{2} j_{1}} \right]^{T} $$
(14)

where \( w_{k, j_2} \) denotes the weight from the \( j_2 \)th hidden neuron to the kth output neuron, \( w_{j_2 j_1} \) denotes the weight from the \( j_1 \)th input neuron to the \( j_2 \)th hidden neuron, and H is the number of hidden neurons. When APSO is used to train the above SLFN, each particle represents the weights from the input layer to the hidden layer and from the hidden layer to the output layer, together with the corresponding thresholds. In order to encode the first-order derivative information into APSO, a new fitness function is defined as follows:

$$ {\text{fit}} = \frac{1}{N}\sum\limits_{i = 1}^{N} \left\| y_{i} - t_{i} \right\|_{2} + \frac{\xi}{N}\sum\limits_{i = 1}^{N} \sum\limits_{l = 1}^{n} \left\| g'(x_{il}) - \phi'(x_{il}) \right\|_{2} $$
(15)

where N is the number of samples and ξ is a coefficient between 0 and 1. The first term on the right-hand side of Eq. 15 is the mean error between the network outputs and the target values, and the second term is the mean error between the approximate estimates of the first-order partial derivatives of the approximated function and the corresponding derivatives of the network. Since the new fitness function contains first-order partial derivative information, the new APSO is referred to as APSOAEFDI [20].
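For the scalar case n = m = 1, the fitness of Eq. 15 together with the network derivative of Eq. 14 can be sketched as follows. The hidden-neuron thresholds `b` and the value ξ = 0.5 are illustrative assumptions; with scalar outputs, the 2-norms reduce to absolute values:

```python
import math

def slfn(x, w_in, b, w_out):
    """Single-hidden-layer tanh network:
    phi(x) = sum_j w_out[j] * tanh(w_in[j]*x + b[j])."""
    return sum(wo * math.tanh(wi * x + bj)
               for wi, bj, wo in zip(w_in, b, w_out))

def slfn_deriv(x, w_in, b, w_out):
    """Analytic first derivative of the network output (Eq. 14, n = m = 1)."""
    return sum(wo * (1 - math.tanh(wi * x + bj) ** 2) * wi
               for wi, bj, wo in zip(w_in, b, w_out))

def fitness(xs, ts, gprime, w_in, b, w_out, xi=0.5):
    """Eq. 15: mean output error plus xi-weighted mean derivative mismatch."""
    N = len(xs)
    e_out = sum(abs(slfn(x, w_in, b, w_out) - t)
                for x, t in zip(xs, ts)) / N
    e_der = sum(abs(slfn_deriv(x, w_in, b, w_out) - g)
                for x, g in zip(xs, gprime)) / N
    return e_out + xi * e_der

# Sanity check: if the targets equal the network's own outputs and
# derivatives, the fitness is zero; perturbing the targets makes it positive.
w_in, b, w_out = [1.0, -0.5], [0.1, 0.2], [0.3, 0.7]
xs = [0.1 * i for i in range(10)]
ts = [slfn(x, w_in, b, w_out) for x in xs]
gp = [slfn_deriv(x, w_in, b, w_out) for x in xs]
assert fitness(xs, ts, gp, w_in, b, w_out) < 1e-12
assert fitness(xs, [t + 0.1 for t in ts], gp, w_in, b, w_out) > 0
```

In the full algorithm the APSO particle would hold `w_in`, `b`, and `w_out`, and this fitness would drive the velocity updates of Eqs. 3–4.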

To overcome the drawbacks of gradient-based learning algorithms for FNNs, local search is combined with global search in the improved approach. First, the SLFN is trained by APSOAEFDI to approach the global minimum. Second, the network is trained again by the modified algorithm, MGFHIA, which penalizes the input-to-output mapping sensitivity of the network in the course of learning. Moreover, because it encodes the a priori information into APSOA, the improved approach performs better than gradient-based learning algorithms. Since the improved approach combines APSOAEFDI and MGFHIA, it is referred to as APSOAEFDI-MGFHIA.
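The two-phase idea, global search first, then gradient-based refinement, can be illustrated on a toy one-dimensional error surface. This sketch is not the algorithm itself: plain random sampling stands in for APSOAEFDI and plain gradient descent stands in for MGFHIA, but it shows why the global phase keeps the local phase out of the wrong basin:

```python
import random

def cost(w):
    # toy double-well "error surface": a local minimum near w = +1 and
    # a deeper, global minimum near w = -1 (the 0.3*w term tilts the wells)
    return (w * w - 1) ** 2 + 0.3 * w

def grad(w):
    return 4 * w * (w * w - 1) + 0.3

random.seed(1)
# Phase 1 (global, stand-in for APSOAEFDI): sample widely, keep the best.
w = min((random.uniform(-3, 3) for _ in range(200)), key=cost)
# Phase 2 (local, stand-in for MGFHIA): gradient descent refines the result.
for _ in range(500):
    w -= 0.01 * grad(w)
assert w < 0                # ended in the global (left) well
assert abs(grad(w)) < 1e-3  # converged to a stationary point
```

Gradient descent started from a random point could settle in the shallow right-hand well; seeding it with the best globally sampled point avoids that, which is precisely the motivation for running APSOAEFDI before MGFHIA.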

5 Experimental results

In this section, experiments are conducted to verify the efficiency and effectiveness of the proposed learning approach. All simulations, for the BP algorithm, the Hybrid-I algorithm, the Hybrid-II algorithm, MHLA, APSOA-BP (which combines APSOA with the BP algorithm), APSOA-HILA (which combines APSOA with the Hybrid-I algorithm), APSOAEFDI-MHLA, and APSOAEFDI-MGFHIA, are carried out in the MATLAB 6.5 environment running on a Pentium 4, 2.60 GHz CPU.

In the following, we conduct experiments with two differentiable functions, namely \( y = (1 - (40x/\pi ) + 2(40x/\pi )^{2} - 0.4(40x/\pi )^{3} )e^{ - x/2} \) and the sinc function y = sin(5x)/(5x). For all eight algorithms, the activation functions of the hidden-layer neurons are tangent sigmoid functions, the output layers are linear, and the number of hidden neurons is 12.

For the function \( y = (1 - (40x/\pi ) + 2(40x/\pi )^{2} - 0.4(40x/\pi )^{3} )e^{ - x/2} \), 126 training samples are selected from [0, π) at identically spaced intervals, and 125 testing samples are selected from [0.0125, π − 0.0125) at identically spaced intervals. The resulting approximation errors of the test samples for the improved approach are shown in Fig. 1a.

Fig. 1

The curves of the approximation errors of the test samples for the two functions with the improved approach. a \( y = (1 - (40x/\pi ) + 2(40x/\pi )^{2} - 0.4(40x/\pi )^{3} )e^{ - x/2} \); b y = sin(5x)/(5x)

Similarly, for the sinc function, 121 training samples are selected from [0, 3] and 120 testing samples are selected from [0.0125, 2.9875]. The resulting approximation errors of the test samples are shown in Fig. 1b.

To statistically compare the approximation accuracies, the standard deviations of the mean squared error (MSE) on the testing data (SDMSETD), and the numbers of iterations needed to approximate the two functions with the above learning algorithms, we ran each algorithm 50 times; the corresponding results are summarized in Tables 1 and 2.

Table 1 The average values of the MSE, the standard deviation of the MSE on the testing data, and the number of iterations for approximating the function \( y = (1 - (40x/\pi ) + 2(40x/\pi )^{2} - 0.4(40x/\pi )^{3} )e^{ - x/2} \) with the eight learning algorithms
Table 2 The average values of the MSE, the standard deviation of the MSE on the testing data, and the number of iterations for approximating the sinc function y = sin(5x)/(5x) with the eight learning algorithms

From the above results, the following conclusions can be drawn:

First, for each function, the testing errors of the improved learning approach are always lower than those of the other learning algorithms except APSOAEFDI–MHLA. This stems from the fact that the new approach combines APSOA with the a priori information of the approximated function to search for the global optimum before performing local search with MGFHIA.

Second, among all the learning algorithms, those that use APSOA for the global search converge within 15,000 epochs, whereas those that do not use APSOA need 30,000 epochs to converge. Moreover, the improved approach converges within only 12,000 epochs because it incorporates the a priori information and the magnified gradient function into the SLFN.

Third, the improved approach has slightly worse approximation accuracy than APSOAEFDI–MHLA, but converges faster.

Moreover, to verify the efficiency and effectiveness of the proposed learning approach more thoroughly, tenfold cross-validation experiments were performed for approximating the above two functions. The corresponding results are shown in Tables 3 and 4.

Table 3 The MSE of approximating the function \( y = (1 - (40x/\pi ) + 2(40x/\pi )^{2} - 0.4(40x/\pi )^{3} )e^{ - x/2} \) 20 times by tenfold cross-validation with the eight algorithms
Table 4 The MSE of approximating the function y = sin(5x)/(5x) 20 times by tenfold cross-validation with the eight algorithms

It can be seen from Tables 3 and 4 that the MSE values of the improved learning approach are always lower than those of the other learning algorithms except APSOAEFDI–MHLA. This result also supports the above conclusion that the approximation accuracy of the proposed approach is better than that of all the other learning algorithms except APSOAEFDI–MHLA.

In the following, the parameters of the improved learning approach are discussed for approximating the function \( y = (1 - (40x/\pi ) + 2(40x/\pi )^{2} - 0.4(40x/\pi )^{3} )e^{ - x/2} \).

Figure 2 shows the relation between the testing errors and the particle number. It is evident that the testing error is on a downward trend as the particle number increases.

Fig. 2

The relation between the testing errors and the particle number with the improved learning approach for approximating the function \( y = (1 - (40x/\pi ) + 2(40x/\pi )^{2} - 0.4(40x/\pi )^{3} )e^{ - x/2} \)

Figure 3 shows the relation between the ultimate testing errors and the temporary testing errors obtained by APSOAEFDI. It can be concluded that the ultimate testing errors show an upward trend as the corresponding temporary testing errors obtained by APSOAEFDI increase.

Fig. 3

The relation between the ultimate testing errors and the temporary testing errors obtained by APSOAEFDI with the improved learning approach for approximating the function \( y = (1 - (40x/\pi ) + 2(40x/\pi )^{2} - 0.4(40x/\pi )^{3} )e^{ - x/2} \)

Figure 4 shows the relation between the ultimate testing errors and the magnified gradient coefficient K in MGFHIA. On the one hand, when the magnified gradient coefficient increases from 1 to 1.8, the ultimate testing errors decrease sharply. On the other hand, when it increases from 1.8 to 3, the ultimate testing errors are on an upward trend. This shows that the ultimate testing errors do not necessarily decrease as the magnified gradient coefficient grows.

Fig. 4

The relation between the ultimate testing errors and the magnified gradient coefficient in MGFHIA for approximating the function \( y = (1 - (40x/\pi ) + 2(40x/\pi )^{2} - 0.4(40x/\pi )^{3} )e^{ - x/2} \)

6 Conclusions

In this paper, an improved approach to the function approximation problem is proposed to obtain better approximation accuracy and a faster convergence rate. In the improved approach, a global search algorithm is combined with a local search algorithm. First, the network is trained toward the global minimum by encoding the first-order derivative information of the approximated function into APSO. Second, starting from the weights produced by APSO, the SLFN is trained by the modified gradient-based local search algorithm with the magnified gradient function. The modified local search algorithm penalizes the input-to-output mapping sensitivity of the network and avoids entrapment in local minima in the course of learning. By combining APSO and the a priori information with the local search algorithm, the improved approach achieves better approximation accuracy and a faster convergence rate. Finally, simulation results are given to verify the efficiency and effectiveness of the proposed learning approach. Future research will include applying the proposed learning algorithm to more numerical computation problems.