Introduction

Although statistical methods can generally obtain reasonable prediction accuracy for future demand conditions, they have two common limitations: (1) it is difficult to specify the most suitable model without human expertise; (2) the models generated by these methods may not be able to capture some strongly nonlinear characteristics of short-term demand data (Chan et al. 2012). Also, widely used time series forecasting models, especially the auto-regressive (AR) integrated moving average (MA) (ARIMA) model (Box and Jenkins 1976), are generally applicable to linear modeling and hardly capture the non-linearity inherent in time series data (Jaipuria and Mahapatra 2014). Further, it is very important for decision-makers to consider alternative models when non-stationarity and non-linearity play a significant role in forecasting (Sattari et al. 2012). With reliable sales forecasting methods, decision-makers can respond quickly to market changes, keep inventory at a relatively low level, and control the cost of production (Du et al. 2015).

Various meta-heuristic optimization algorithms have been applied to solve different optimization problems by researchers in many different scientific areas. The main objective of all these optimization algorithms is to search for the global optimum of the optimization problem at hand (Duman et al. 2015). A neural network (NN) is an artificial intelligence (AI) system that converts information from different spaces into the same space by simulating human intelligence behaviors, and it has been applied effectively in many fields (Lu et al. 2015). For forecasting purposes, an NN requires neither statistical information about the data series nor stationarity of the series (Jaipuria and Mahapatra 2014). For example, the self-organizing mapping (SOM) algorithm (Kohonen 1987) is one of the most popular NN models based on the unsupervised competitive learning paradigm (Yadav and Srinivasan 2011). In addition, radial basis function neural networks (RBFnns) have a number of advantages over other types of NNs, including better approximation capabilities, simpler network structures and faster learning algorithms (Qasem et al. 2012).

Additionally, due to the omnipresence of constraints in real-world optimization problems, constrained optimization problems have received considerable attention in recent years. An increasing number of nature-inspired meta-heuristic algorithms (Mezura-Montes and Coello Coello 2011) have been proposed, such as the genetic algorithm (GA) (Holland 1975), particle swarm optimization (PSO) (Kennedy and Eberhart 1995), and artificial bee colony (ABC) (Karaboga and Basturk 2008) algorithms.

Soft-computing models, such as NNs, intelligent algorithms and hybrid intelligent approaches, have been used for prediction (Anbazhagan and Kumarappan 2014). Previous researchers have combined the RBFnn structure with single approaches such as PSO (Feng 2006) and GA (Sarimveis et al. 2004) to implement the learning of the network. However, as every single technique has some drawbacks, hybridizing is a reasonable way to combine strengths and avoid weaknesses. Therefore, hybrid methods have become very popular for combinatorial optimization problems (Qiu and Lau 2014). As such, this study proposes a mix of SOM neural network (SOMnn) with PSO and GA based (MSPG) algorithm for training RBFnn, and verifies and compares its performance. The evolutionary learning mechanism of the MSPG algorithm can be used to train the RBFnn and to find the optimal network parameters within the solution space spanned by the individuals of the generated population. Further, once the MSPG algorithm has been verified in terms of forecasting accuracy, it can be used to make predictions in the demand estimation problem.

In summary, this study starts from the idea of “clustering first, then classification (2-stage)” to integrate SOMnn and evolutionary computation algorithms (ECAs), and then proposes the MSPG algorithm. The MSPG algorithm is applied to RBFnn training to obtain better learning performance with much higher forecasting accuracy. This is the contribution of this paper to the theoretical methodology. In addition, the proposed MSPG algorithm can be embedded into a business information system to provide forecasts with high accuracy. It can therefore be applied to the enterprise resource planning (ERP) system in different industries to provide suppliers, resellers or retailers in the supply chain with more accurate demand information for evaluation, and so to lower the inventory cost. This is the contribution of this paper to the practical domain.

The rest of this paper is organized as follows. “Literature review” section summarizes RBFnn, SOMnn, and several ECAs. The proposed MSPG algorithm is presented in “Methodology” section. The results of experimental simulation, model evaluation, and comparison with relevant algorithms are illustrated in “Experimental simulation results” and “Model evaluation results” sections. Finally, concluding remarks are made in “Conclusions” section.

Literature review

This section presents the general background related to this research, including RBFnn, SOMnn, and evolutionary computation algorithms. The theories and applications pertaining to this study are discussed in detail.

Radial basis function (RBF) neural network (RBFnn)

RBFnn was proposed by Duda and Hart (1973); it provides an alternative way to accomplish the same work as an NN does (Huang and Wang 2007). The transfer function of the hidden layer is generally a non-linear Gaussian function (Wu and Liu 2012). The mathematical equation that defines an RBFnn is given as (Ayala and Coelho 2016)

$$\begin{aligned} \hat{{y}}(t)=\sum _{m=1}^M {w_m \phi (r(t),c_m ,\sigma _m )}, \end{aligned}$$
(1)

where \(M\in \mathrm{N}^{+}\) is the number of neurons in the hidden layer, \(\hat{{y}}(t)\in \hbox {R}\) and \(r(t)\in \hbox {R}^{{\mathrm{n}}_{\mathrm{r}} }\) are, respectively, the network predicted output and the input vector at a given instant t; \(c_m \in \hbox {R}^{{\mathrm{n}}_{\mathrm{r}} }\) and \(\sigma _m \in \hbox {R}^{{+}}\) are, respectively, the center and the width of the mth hidden node of RBFnn. Each of the output weights is given by \(w_m \in \hbox {R}\). The Gaussian RBF is defined as (Ayala and Coelho 2016)

$$\begin{aligned} \phi (r,c,\sigma )= & {} \exp \left( {-\frac{\left\| {r-c} \right\| ^{2}}{2\sigma ^{2}}} \right) \nonumber \\= & {} \exp \left( {-\frac{1}{2\sigma ^{2}}\sum _{i=1}^{n_r } {(r_i -c_i )^{2}} } \right) \end{aligned}$$
(2)
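
Eqs. (1) and (2) can be illustrated with a minimal sketch. The function names and the two-neuron example values below are assumptions made for illustration only; they are not taken from the study.

```python
import math

def gaussian_rbf(r, c, sigma):
    """Eq. (2): exp(-||r - c||^2 / (2 * sigma^2))."""
    sq_dist = sum((ri - ci) ** 2 for ri, ci in zip(r, c))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def rbfnn_output(r, centers, widths, weights):
    """Eq. (1): weighted sum of the M hidden-node responses."""
    return sum(w * gaussian_rbf(r, c, s)
               for w, c, s in zip(weights, centers, widths))

# Hypothetical two-neuron network on a 2-D input (illustrative values only).
centers = [[0.0, 0.0], [1.0, 1.0]]
widths = [1.0, 0.5]
weights = [2.0, -1.0]
y_hat = rbfnn_output([0.0, 0.0], centers, widths, weights)
```

In this example the first neuron sits exactly on the input, so its response is 1, while the second neuron contributes only a small negative term because the input lies far from its center relative to its width.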

Because of its simple topological structure and its ability to reveal how learning proceeds in an explicit manner, the RBFnn has been widely used as a universal function approximator to solve nonlinear problems (Lin and Wu 2011). In the field of prediction, Yu et al. (2010) proposed an RBFnn-ensemble forecasting model to obtain accurate prediction results and further improve prediction quality. In addition, Shafie-khah et al. (2011) proposed a novel hybrid model to forecast day-ahead electricity price, based on the wavelet transform, ARIMA models and RBFnn.

The difficulty of applying RBFnns lies in network training, which must properly select and estimate the parameters, including the centers and widths of the basis functions and the neuron connection weights (Tsekouras and Tsimikas 2013). In addition, the feature pre-processing technique used in a forecasting model significantly influences the forecasting accuracy. In particular, an NN combined with pre-processed input feature data will achieve better prediction accuracy (Anbazhagan and Kumarappan 2014).

Self-organizing map (SOM) neural network (SOMnn)

A self-organizing map (SOM) neural network (SOMnn) is a nonlinear NN paradigm (Kohonen 1987). Learning in the SOM is unsupervised, which makes it useful in a variety of situations and easy to modify to suit a variety of purposes (Rumbell et al. 2014). In contrast to supervised clustering algorithms, unsupervised clustering algorithms do not need prior information, which makes them more widely accepted in the literature (Ozturk et al. 2015).

The key to a successful implementation of SOMnn is to find suitable centers for the Gaussian functions (Kurt et al. 2008). The SOMnn consists of M neurons arranged in a 2-D rectangular or hexagonal grid (Hadavandi et al. 2012).

Each neuron i is assigned a weight vector \(w_i \in R^{n}\) (index \(i=(p,q)\) for a 2-D map). At each training step t, a training sample \(x^{t}\in R^{n}\) is randomly drawn from the data set, and the Euclidean distances between \(x^{t}\) and all neurons are calculated. The winning neuron, with weight \(w_j \), is the one at minimum distance from \(x^{t}\):

$$\begin{aligned} j= \mathop {\arg \min }\limits _i \left\| {x^{t}-w_i^t } \right\| , \quad i\in \left\{ {1,2,\ldots ,M} \right\} \end{aligned}$$
(3)

Then, the SOM adjusts the weights of the winning neuron and its neighboring neurons, moving them closer to the input sample according to:

$$\begin{aligned} w_i^{t+1} =w_i^t +\alpha ^{t}\times h_{ji}^t \times \left[ {x^{t}-w_i^t } \right] , \end{aligned}$$
(4)

where \(\alpha ^{t}\) and \(h_{ji}^t \) are the learning rate and neighborhood kernel at time t, respectively. Both \(\alpha ^{t}\) and \(h_{ji}^t \) lie within [0, 1] and decrease monotonically with time. The neighborhood kernel \(h_{ji}^t \) is a function of time and of the distance between neighbor neuron i and the winning neuron j. A widely applied neighborhood kernel can be written in terms of a Gaussian function:

$$\begin{aligned} h_{ji}^t =\exp \left( {-\frac{\left\| {r_j -r_i } \right\| ^{2}}{2\sigma _t^2 }} \right) , \end{aligned}$$
(5)

where \(r_j \) and \(r_i \) are the positions of the winning neuron and the neighboring neuron on the map, and \(\sigma _t \) is the kernel width, which decreases with time. This weight-updating process is performed for a specified number of iterations (Hadavandi et al. 2012).
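
One SOM training step, following Eqs. (3) to (5), might look like the sketch below. The function name, the in-place list representation of the weights, and the example values are assumptions for illustration; the decay schedules for the learning rate and kernel width are left to the caller.

```python
import math

def som_step(weights, positions, x, alpha, sigma):
    """One SOM update: find the winner (Eq. 3), then pull every neuron
    toward x, scaled by the Gaussian neighborhood kernel (Eqs. 4-5)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Eq. (3): the winner j minimizes the Euclidean distance to x.
    j = min(range(len(weights)), key=lambda i: sq_dist(x, weights[i]))
    for i, w in enumerate(weights):
        # Eq. (5): kernel from the grid distance between neuron i and winner j.
        h = math.exp(-sq_dist(positions[j], positions[i]) / (2.0 * sigma ** 2))
        # Eq. (4): w_i <- w_i + alpha * h * (x - w_i).
        weights[i] = [wi + alpha * h * (xi - wi) for wi, xi in zip(w, x)]
    return j

# Hypothetical 1 x 2 map with 2-D weights (illustrative values only).
weights = [[0.0, 0.0], [1.0, 1.0]]
positions = [(0, 0), (0, 1)]
winner = som_step(weights, positions, x=[0.2, 0.0], alpha=0.5, sigma=1.0)
```

The winner moves halfway toward the sample because its kernel value is 1, while its neighbor moves by a smaller fraction determined by the grid distance.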

The ability of SOM networks to associate new data with similar previously learnt data can be applied to forecasting applications (Lopez et al. 2012). For example, Hsu et al. (2009) showed that SOM outperforms the hierarchical methods in clustering messy data and has better accuracy and robustness. Also, Lin and Wu (2009) proposed a hybrid NN model to forecast typhoon rainfall using SOMs and feed-forward NNs. Further, the feature pre-processing technique used in a forecasting model significantly influences the forecasting accuracy; in particular, an NN combined with pre-processed input feature data will achieve better prediction accuracy (Anbazhagan and Kumarappan 2014).

Evolutionary computation algorithms (ECAs)

Evolutionary computations (ECs) inherit the principles of biological evolution. They are stochastic in nature and stronger than traditional optimization methods (Dey et al. 2014). Further, considering the drawbacks of traditional optimization techniques, attempts are being made to solve optimization problems by using meta-heuristics, which are mostly nature inspired, such as PSO, the artificial immune algorithm (AIA), and GA (Savsani et al. 2014). PSO is a multi-agent optimization system inspired by a social behavior metaphor (Kennedy and Eberhart 1995), while GAs are a family of computational models developed by Holland in 1975. PSO and GA are two intelligent optimization algorithms that are widely and successfully applied in various types of model parameter estimation because of their outstanding optimization capability (Yu et al. 2015b).

PSO is a swarm intelligence-based optimization technique inspired by the social behavior and dynamic movement of flocks of insects, birds, and fish, and was developed by Kennedy and Eberhart (1995) (Ketabchi and Ataie-Ashtiani 2015). In a PSO system, each particle is ‘flown’ through the multidimensional search space, adjusting its position in the search space according to its own experience and that of neighboring particles (Wang et al. 2014). The velocities and positions of the particles are updated in each time step according to the following equations (Kennedy and Eberhart 1995):

$$\begin{aligned} V_{id} (t+1)= & {} V_{id} (t)+C_1 r_{1d} (P_{id} -X_{id} )\nonumber \\&+\,C_2 r_{2d} (P_{gd} -X_{id} ); \end{aligned}$$
(6)
$$\begin{aligned} X_{id} (t+1)= & {} X_{id} (t)+V_{id} (t+1), \end{aligned}$$
(7)

where \(V_{id} \) and \(X_{id} \) are the velocity and position of particle i in dimension d, \(C_1 \) and \(C_2 \) are called the cognitive and social acceleration coefficients, respectively, and \(r_{1d} \) and \(r_{2d} \) are two random numbers in the interval [0, 1] (Kennedy and Eberhart 1995). PSO is based on the social adaptation of knowledge, and all individuals are considered to be of the same generation. The particles with a higher degree of constraint violation fly through the search space according to the information exchanged through their \(P_{id} \) (i.e., local best, Lbest) and \(P_{gd} \) (i.e., global best, Gbest) to search for better positions (Deng et al. 2012).
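
The update in Eqs. (6) and (7) can be sketched for a single particle as follows. The function name and the default values of the acceleration coefficients are assumptions (2.0 is a commonly used setting, not a value reported in this section), and Python's `random.random()` draws from [0, 1) rather than the closed interval.

```python
import random

def pso_update(x, v, p_best, g_best, c1=2.0, c2=2.0, rng=random):
    """Apply Eqs. (6)-(7) to one particle, dimension by dimension."""
    new_x, new_v = [], []
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()    # r_1d, r_2d
        vd = (v[d]
              + c1 * r1 * (p_best[d] - x[d])   # cognitive term (Lbest)
              + c2 * r2 * (g_best[d] - x[d]))  # social term (Gbest)
        new_v.append(vd)                       # Eq. (6)
        new_x.append(x[d] + vd)                # Eq. (7)
    return new_x, new_v
```

A particle that already sits on both its Lbest and the Gbest keeps its current velocity, since both attraction terms vanish.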

PSO shares many similarities with other EC techniques. The system is initialized with a population of random solutions and searches for optima by updating generations (Katherasan et al. 2014). Unlike GA, PSO has no evolution operators such as crossover and mutation. Compared to GA, the advantages of PSO are that it is easy to implement and has few parameters to adjust (Katherasan et al. 2014). However, although the original PSO converges quickly, it lacks the capability to escape from local minima. This occurs because the original PSO is not able to maintain diversity within the swarm when it is needed during the search process. These issues affect PSO performance, mainly in dynamic problems or high-dimensional multimodal search spaces (Vitorino et al. 2015). In addition, an adaptive self-generating RBFnn model with mixed-encoding PSO has been used to optimize the RBF structure and parameters (Yu et al. 2009) and applied to predict the primary energy consumption of China (Yu et al. 2012b). Recently, researchers working in this area have started taking an interest in some promising approaches to numerical optimization.

On the other hand, the GA is an EC technique proposed by Holland (1975) as an algorithm that searches for an optimal solution based on survival of the fittest. The GA searches for an optimal solution through generations. Typically, the search starts with a population believed to contain the best solution to the problem to be solved. Survival of the fittest is responsible for fostering evolution in the population to create the fittest chromosomes (Chiroma et al. 2015). The chromosomes with the best fitness values are selected for crossover and mutation, whereas those with lower fitness are excluded from reproduction. The fitness values are determined by the objective cost function of the problem. The fittest chromosomes are then recombined through mutation and crossover, each typically applied with a given probability (Chiroma et al. 2015). GA is an effective optimization method for large and complex problems, able to escape local optima and acquire a global optimal solution (Bagheri et al. 2015). GAs also present many advantages that have led to their increasing use, particularly with the rise in the processing power of computers. As they work with a set of potential solutions, GAs do not easily get trapped in local minima (Bagheri et al. 2015).

PSO and GA do not require initial guesses; only the upper and lower bounds must be defined (Garcia-Gonzalo and Fernandez-Martinez 2012). The superiority of such methods over other statistical/engineering ones is their ability to handle local minima/maxima points efficiently (Rezaee-Jordehi and Jasni 2013). To prevent particles from becoming stuck in a local minimum, Kuo and Han (2011) integrated the mutation mechanism of GA with PSO. In addition, although great research effort has been devoted to obtaining an ideal, accurately constructed and visually meaningful image clustering performance by Kuo et al. (2012), it still remains a challenge (Ozturk et al. 2015). Further, the hybrid PSO–GA algorithm has been applied for optimization in some fields, such as primary energy demand prediction (Yu et al. 2012a) and curve fitting in manufacturing (Galvez and Iglesias 2013; Yu et al. 2015a). In the PSO–GA algorithm (Yu et al. 2012a), PSO first evolves the population for a certain number of generations. The best particles are retained, whereas the other particles are removed. Second, new individuals are generated from the remaining best particles by applying the selection, crossover, and mutation operators of GA. Third, the newly generated individuals are added to the remaining best particles to form a new population for the next generation. During the evolution process, the algorithm exchanges information several times to fully exploit the combination (Yu et al. 2012a). Recently, to strike the right balance between performance and time complexity, a handful of meta-heuristics have emerged as useful and powerful approaches to solve a wide range of optimization problems (Zhang et al. 2015a).

Methodology

The main idea underlying the use of SOMnn is that RBFnns are local approximators, and the centers of the local units (RBF neurons) are adjusted to move toward the real centers in the sense of feature representation (Er et al. 2005). The SOM is selected for this study since it is a fast, easy and reliable unsupervised clustering technique. SOM is used to divide the data into sub-populations and, ideally, to reduce the complexity of the data space to more homogeneous sub-classes (Chang and Liao 2006). Further, the traditional SOM formulation includes a neighborhood width that decays over time to produce a more finely tuned output mapping (Rumbell et al. 2014).

As such, by combining the automatic clustering ability of SOMnn (Kohonen 1987) with the PG (i.e., PSO + GA) algorithm, we propose the mix of SOMnn with PG (MSPG) algorithm to improve the accuracy of function approximation by RBFnn. The algorithm provides the settings of several RBFnn parameters, such as the neurons, widths, and weights. During the MSPG algorithm, SOMnn first determines the number of centers and their positions through its automatic clustering ability. The resulting number of centers is used as the number of neurons in RBFnn. The PG algorithm then provides the settings of the remaining parameters, such as the widths and weights in RBFnn. The framework of the MSPG algorithm is illustrated in Fig. 1.

Fig. 1 The framework of the proposed MSPG algorithm

The analysis of the MSPG algorithm

The proposed MSPG algorithm, which combines SOMnn with the evolutionary learning approaches of PSO and GA, is designed to solve the problem of training the network parameters of RBFnn. Within the MSPG algorithm, the PG algorithm applies the PSO and GA approaches as its learning mechanisms. The pseudo code for the SOMnn method of the MSPG algorithm is illustrated in Fig. 2.

Fig. 2 The pseudo code for the SOMnn method of the MSPG algorithm. Note: Initially, trainingData and testingData contain X[] and Y[], where elements of X[] and Y[] are sampled from the input and output space, respectively

Through the PSO and GA approaches within the PG algorithm of the MSPG algorithm, we intend to find proper parameter values within the domain set in the experiment. The pseudo code for the PG algorithm of the MSPG algorithm is illustrated in Fig. 3.

Fig. 3 The pseudo code for the PG algorithm of the MSPG algorithm

The MSPG algorithm integrates SOMnn with the virtues of the PSO and GA approaches to enhance the learning efficiency of RBFnn. The optimal parameter values can be obtained and used in the MSPG algorithm with RBFnn to solve the function approximation problem. This solution enables RBFnn to approximate the test functions in the experiment as accurately as possible.

The inverse of the root mean squared error (RMSE) is used as the fitness function (i.e., \(\hbox {Fitness} = \hbox {RMSE}^{-1}\)) (DelaOssa et al. 2006). The nonlinear function adopted by the RBFnn hidden layer is the Gaussian function shown in Eq. (2), and the fitness value of each individual in the population is calculated by Eq. (8). The fitness values for the relevant algorithms in the experiment are computed by maximizing \(\hbox {RMSE}^{-1}\) (Lee 2008), defined as:

$$\begin{aligned} { Fitness}=\sqrt{\frac{N}{\sum _{j=1}^N {(y_j -\hat{{y}}_j )^{2}} }}, \end{aligned}$$
(8)

where N is the number of samples in the testing set, \(\hat{{y}}_j \) is the predicted output of the learned RBFnn model for the jth pattern, and \(y_j \) is the actual output.
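
Eq. (8) translates directly into a few lines of code. The sketch below uses a hypothetical function name, and the handling of a perfect fit (zero error) is an assumption, since the source does not say how that case is treated.

```python
import math

def fitness(y_true, y_pred):
    """Eq. (8): Fitness = 1 / RMSE = sqrt(N / sum of squared errors)."""
    sse = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))
    if sse == 0.0:
        return math.inf  # assumption: a perfect fit gets infinite fitness
    return math.sqrt(len(y_true) / sse)
```

Because the fitness is the reciprocal of the RMSE, maximizing it is equivalent to minimizing the prediction error.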

The detailed description of the MSPG algorithm

A population of individuals undergoes a sequence of transformations by means of genetic operators to form a new population (Qasem et al. 2012). Each solution is called a ‘particle’ in PSO and a ‘chromosome’ in GA; in PSO, unlike in GA, new solutions are not created from parents during the evolution process (Yousefi et al. 2012). The data are divided into three subsets \((X_1 ,Y_1 ),(X_2 ,Y_2 ),(X_3 ,Y_3 )\) of size \(M_1 \), \(M_2 \) and \(M_3 \), which are the training (65%), testing (25%) and validation (10%) sets, respectively (Looney 1996). The evolutionary procedure for the PG algorithm of the MSPG algorithm is summarized as follows.

  1. (1)

    Initialization: The initialization, which corresponds to natural random selection, ensures diversity among individuals (i.e., particles in the PSO approach or chromosomes in the GA approach) and benefits the subsequent evolutionary procedure. An initial population with a number of individuals is generated, and the initialization procedure is as follows.

    1. (a)

      Each individual in the initial population is the set of neuron positions (i.e., \(c_{i,j}^s\)) and widths (i.e., \(wd_i^s\)) of an RBFnn, defined in matrix form. Figure 4 presents the design of the decoding routine for the matrix form.

      The embedded values of \(C_s \) correspond to the RBFnn hidden nodes, which comprise the \(c_{ij}^s (i=\{ 1, \ldots , H \} ,j=\left\{ {1,} \ldots , N \right\} )\) and \(wd_i^s (i=\left\{ {1,} \ldots , H \right\} )\) of a parameter solution (i.e., an individual), namely the neuron positions and widths. S matrices \(C_1 ,C_2 ,\ldots ,C_S \) (S being the population size) of size \(H\times (N+1)\) are created by setting all their elements equal to zero. For each \(C_s (s=1,2,\ldots ,S)\), a random integer \(h_s \in \left\{ {1,} \ldots , H \right\} \) is selected from the number of centers generated by SOMnn.

      The result is used as the number of neurons in RBFnn. Rows \(\left\{ {1,} \ldots , {h_s } \right\} \) of \(C_s \) are replaced by an equal number of row vectors of size \(1\times (N+1)\), which are the neurons of the RBFnn associated with this individual. Rows \(\left\{ {h_s +1,} \ldots , H \right\} \) remain equal to zero and do not correspond to a neuron.

    2. (b)

      When the width parameter is fixed and a set of RBF neurons is provided, the structure of the RBFnn is determined, and an orthogonal least squares (OLS) algorithm (Chen et al. 1991) has been developed for constructing a parsimonious RBFnn (Chen et al. 1999) (i.e., the RBFnn algorithm). The Gram–Schmidt scheme (Golub and Loan 1996) and the Moore-Penrose pseudo-inverse (Denker 1986) of the basis matrix are then used to calculate the weights.

      For each \(C_s \), the output weights of the respective RBFnn are calculated by Eq. (9) (Denker 1986):

      $$\begin{aligned} w_s =(B_s^T \cdot B_s )^{-1}\cdot (B_s^T \cdot Y_1 )=B_s^{+} Y_1 , \end{aligned}$$
      (9)

      where \(w_s \) is the vector of output weights and \(B_s^{+} =(B_s^T \cdot B_s )^{-1}\cdot B_s^T \) is the pseudo-inverse of the design matrix \(B_s \); \(B_s\) is the \(M_1 \times h_s \) matrix containing the responses of the hidden layer to the \(X_1 \) subset of examples; \(Y_1 \) is the desired response vector in the training set. The number of columns of \(B_s \) equals the number of neurons in the hidden layer, and the number of rows equals the number of training samples. Each column of \(B_s \) corresponds to the response of the respective hidden neuron to all input data (Barra et al. 2006). The calculation of the output weights completes the formulation of the S RBFnns, which can be represented by the pairs \((C_1 ,w_1 ),(C_2 ,w_2 ),\ldots ,(C_S ,w_S )\).

    3. (c)

      The fitness value of individual matrix in population is calculated by Eq. (8) (i.e., \(\hbox {RMSE}^{-1}\)).

  2. (2)

    PSO approach: The maximum value obtained by the Minimum selection type PSO learning method (Feng 2006) (i.e., the PSO approach) is taken as the active number of RBFs for all particles, which ensures that all particles have the same vector length. The RBFnn parameter values encoded in each individual of the particle population are equivalent to one RBFnn solution. The PSO approach is one step executed in each epoch, after which the PG algorithm of the MSPG algorithm continues with the following steps. This step updates the velocities and the embedded values of all particle matrices through Eqs. (6) and (7) and records the Lbest values. The procedure of the PSO approach is as follows.

    1. (a)

      The number of RBFnn hidden-node neurons encoded in each \(C_s \) of the initial population is taken as the number of neurons for the corresponding particle in the PSO learning population; each \(C_s \) is therefore called a particle matrix in the subsequent evolutionary process.

    2. (b)

      A particle in the population is not moved by Eq. (7) until its Lbest and Gbest have been determined and its velocity has been updated by Eq. (6).

  3. (3)

    Duplication: The population enhanced by the Minimum selection type PSO learning method (Feng 2006) is duplicated and called [PSO-only] population.

  4. (4)

    GA approach: The PSO-enhanced population then undergoes GA evolution with the one-cut point mutation, addition/deletion (Sarimveis et al. 2004), and uniform crossover (Syswerda 1989) operators; the result is called the [PSO + GA] population. The operators used in the GA approach are as follows.

    1. (a)

      Figure 5 illustrates the idea of uniform crossover schematically. Each row of the selected pair of \(C_s \) matrices has an equal probability of undergoing the uniform crossover (Syswerda 1989) operator, in keeping with the spirit of GA.

    2. (b)

      The mutation operator is one of the strategies used to ensure variability within the population and design space exploration (Rocha et al. 2014). Through the mutation operator, the values are replaced by randomly selected values from the range of the search domain in each dimension, which maintains the diversity and generates new solutions.

  5. (5)

    Reproduction: In order to force the GA to propagate the genetic material from the best parents more intensely, roulette wheel selection (Goldberg 1989) is used to form the mating pairs (Kuzmanovski et al. 2007). The [PSO-only] and [PSO + GA] populations are combined after the evolutionary process. The same number of individuals as in the initial population is then randomly selected by roulette wheel (proportional) selection for the subsequent evolution.

    1. (a)

      The \((X_2 ,Y_2 )\) subset is used in this step as a testing set, in the following manner. First, the predictions \(\hat{{Y}}_{2,1} ,\hat{{Y}}_{2,2} ,\ldots ,\hat{{Y}}_{2,S} \) of the S RBFnns formulated in the previous step are computed, and the corresponding \( RMSE _s \) is obtained for each network over the \(M_2 \) testing samples, indexed by j:

      $$\begin{aligned} RMSE _s =\sqrt{\frac{\sum _{j=1}^{M_2 } {(Y_2 (j)-\hat{{Y}}_{2,s} (j))^{2}} }{M_2 }} \end{aligned}$$
      (10)
    2. (b)

      The pair \((C_s ,w_s)\) associated with the maximum error is replaced by the best RBFnn of the previous epoch so that the optimal solution survives in all epochs (this replacement will not take place in the first epoch). The network associated with the minimum error is stored for later use. The objective is to give more chances of survival to the network associated with smaller error values. Therefore, the probability of selection \(p_s \) of every \(C_s \) is calculated by Eq. (11), and the cumulative probability \(q_s \) is computed by Eq. (12):

      $$\begin{aligned} p_s= & {} \frac{ RMSE _s^{-1} }{\sum _{s=1}^S {( RMSE _s^{-1} )} }, \end{aligned}$$
      (11)
      $$\begin{aligned} q_s= & {} \sum _{i=1}^s {p_i } \end{aligned}$$
      (12)
  6. (6)

    Termination: The PG algorithm of the MSPG algorithm returns to step (2) until a specified number of epochs has been reached.
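
The selection probabilities of Eqs. (11) and (12) in step (5) can be sketched as a roulette wheel, as below. The function names are hypothetical, and drawing the wheel position with a binary search over the cumulative probabilities is one common implementation choice rather than the method confirmed by the source.

```python
import random
from bisect import bisect_left

def selection_probabilities(rmse):
    """Eq. (11): p_s is proportional to 1 / RMSE_s."""
    inv = [1.0 / e for e in rmse]
    total = sum(inv)
    return [v / total for v in inv]

def cumulative(p):
    """Eq. (12): q_s = p_1 + ... + p_s."""
    q, running = [], 0.0
    for ps in p:
        running += ps
        q.append(running)
    return q

def roulette_select(q, n, rng=random):
    """Draw n indices; index s is chosen when q_(s-1) < u <= q_s."""
    picks = []
    for _ in range(n):
        u = rng.random() * q[-1]  # scale by q[-1] to absorb rounding error
        picks.append(min(bisect_left(q, u), len(q) - 1))
    return picks

# Hypothetical testing errors of three networks (illustrative values only).
p = selection_probabilities([0.5, 1.0, 2.0])
q = cumulative(p)
```

With these example errors, the network with the smallest RMSE receives four sevenths of the wheel, so it is the most likely to survive, while networks with larger errors still keep a nonzero chance.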

We focus on how to combine SOMnn and two evolutionary approaches to obtain a complementary learning effect. In its evolutionary process, the PG (i.e., PSO + GA) algorithm differs from other evolutionary algorithms in two respects: (1) the PSO and GA approaches in the PG algorithm each carry their own best results into the next generation for cross learning, and so gradually obtain the optimal solution in the whole population; (2) the PG algorithm can dynamically adjust relevant parameters (i.e., inertia weight, mutation probability, and crossover probability) by decreasing them linearly within a certain range. Having multiple learning factors cooperate in this way helps the PG algorithm to converge and to find the optimal solution.
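
A linearly decreasing parameter of the kind described in point (2) could be scheduled as in the sketch below. The function name and the example range of 0.9 to 0.4 are assumptions for illustration; this section does not state the actual ranges used for the inertia weight or the mutation and crossover probabilities.

```python
def linear_decay(start, end, epoch, max_epochs):
    """Value that falls linearly from start (epoch 0) to end (last epoch)."""
    if max_epochs <= 1:
        return end
    # Clamp the epoch so out-of-range inputs stay within [start, end].
    frac = min(max(epoch, 0), max_epochs - 1) / (max_epochs - 1)
    return start + (end - start) * frac

# Hypothetical range; the same helper could serve all three parameters.
inertia_at_start = linear_decay(0.9, 0.4, epoch=0, max_epochs=1000)
```

Larger values early on favor exploration, and the gradual decrease shifts the search toward refinement as the epochs advance.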

Fig. 4 The design of the decoding routine for the matrix form

Fig. 5 Schematic illustration of uniform crossover between \(C_s \) and \(C_{s+1} \) with each pair of rows independently exchanging their values with probability 0.5
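
The row-wise exchange described in the caption of Fig. 5 might be implemented as in the following sketch. The function name and the nested-list representation of the individual matrices are assumptions for illustration.

```python
import random

def uniform_crossover(parent_a, parent_b, swap_prob=0.5, rng=random):
    """Row-wise uniform crossover between two H x (N+1) individual matrices."""
    # Copy the rows so that the parents are left unchanged.
    child_a = [row[:] for row in parent_a]
    child_b = [row[:] for row in parent_b]
    for i in range(len(child_a)):
        # Each pair of rows is exchanged independently with swap_prob.
        if rng.random() < swap_prob:
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b
```

Since each row holds one neuron (its center and width), swapping whole rows keeps every neuron intact while mixing neurons between the two individuals.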

For the population in the PSO approach, each particle determines the best solution after consulting itself and other particles, and decides its direction and velocity accordingly. Also, the memory mechanism (Xu et al. 2007) implemented in the PSO approach retains the information of previous best solutions that might otherwise be lost during the population evolution. Through the memory mechanism, the parameter solutions obtained in the population will be more advanced than the initial ones, which facilitates the subsequent evolutionary process. Thus, executing an evolutionary computation through the PSO approach yields an enhanced population that is better than the initial population.

As the algorithm proceeds, the members of the population improve gradually. Owing to the global search property of the GA approach, all individuals in the population, whatever their fitness values, have a chance to undergo genetic operators and enter the next generation. In this way, the PG algorithm of the MSPG algorithm meets the spirit of the GA approach, ensures genetic diversity in the future evolution process, and proceeds to obtain a new enhanced population. In addition, as the GA approach within the PG algorithm evaluates the fitness values of the individual parameter solutions in the population, better solutions are obtained gradually. Thus, the solution space in the population changes gradually and converges toward the optimal solution. The algorithm stops after a specific number of epochs have been completed.

Table 1 Parameter setup for four benchmark problems

In the subsequent experiment, when the MSPG algorithm stops, the RBFnn corresponding to the maximum fitness value is selected. Finally, it is validated using the \((X_3 ,Y_3 )\) subset, which has not been used at any point in the learning procedure. After these critical parameter values are set, RBFnn can begin approximation training and learning on four benchmark problems. The above constitutes our contribution to theoretical development.

Experimental simulation results

In this study, 1000 randomly generated data sets are divided into three parts to train RBFnn: 65% training set, 25% testing set, and 10% validation set (Looney 1996), in which we can examine the learning status and adjust the parameter setting.

Four benchmark problems

Several test functions have many local minima and can therefore be used for comparison (Tsai et al. 2006). Continuous test functions lend themselves to excellent approximation and complement RBFnn in the outcome of the nonlinear mapping relation. In the experiment, the MSPG algorithm performs better than the other algorithms on four benchmark problems: the Rosenbrock, Griewank, and B2 (Shelokar et al. 2007) continuous test functions and the Mackey-Glass time series (Whitehead and Choate 1996), which are defined in “Appendix”.
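For orientation, the three continuous test functions can be written compactly in their commonly cited two-dimensional forms. The sketch below is an illustration only: the function names are hypothetical, and the forms are the standard textbook ones, which may differ in detail (for example in scaling or search domain) from the definitions given in the “Appendix”. The Mackey-Glass time series is omitted because it requires a delay differential equation.

```python
import math


def rosenbrock(x1, x2):
    """Standard 2D Rosenbrock function; global minimum 0 at (1, 1)."""
    return 100.0 * (x2 - x1 ** 2) ** 2 + (1.0 - x1) ** 2


def griewank(x1, x2):
    """Standard 2D Griewank function; global minimum 0 at (0, 0)."""
    quadratic = (x1 ** 2 + x2 ** 2) / 4000.0
    oscillation = math.cos(x1) * math.cos(x2 / math.sqrt(2.0))
    return quadratic - oscillation + 1.0


def b2(x1, x2):
    """Standard B2 function; global minimum 0 at (0, 0)."""
    return (x1 ** 2 + 2.0 * x2 ** 2
            - 0.3 * math.cos(3.0 * math.pi * x1)
            - 0.4 * math.cos(4.0 * math.pi * x2)
            + 0.7)
```

The cosine terms in the Griewank and B2 functions are what produce the many local minima mentioned above.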

The experiment was performed on a PC with an Intel Xeon\(^\mathrm{TM}\) CPU running at 3.40 GHz with symmetric multi-processing (SMP) and 2 GB of RAM. The simulations were programmed on the Java 2 Platform, Standard Edition (J2SE) 1.5. In addition, the open-source software Gnuplot version 4.2 was used to plot the results. In this study, the search domain is two-dimensional (2D) and the unit of output is amplitude. The maximum number of epochs is set at 1000 as the termination condition of the experiment.

Parameters setup

Several parameter values within RBFnn must be set in advance to perform training for function approximation. The proposed MSPG algorithm improves on the trial-and-error approach in the literature in that it determines appropriate parameter values from the verified domain to train RBFnn. The relevant algorithms start with the parameter settings for four benchmark problems shown in Table 1.

Furthermore, the Taguchi (robust design) method (Taguchi and Yokoyama 1993), which is used in this experiment for parameter setup, is a powerful experimental design tool (Olabi 2008) for optimizing the performance, quality, and cost of a product or process in a simpler, more efficient, and more systematic manner than traditional trial-and-error processes (Lin et al. 2009). As such, the parameter settings for the MSPG algorithm in this study are obtained according to the Taguchi experimental design (Taguchi et al. 2005) and the literature.

In the SOMnn method of the MSPG algorithm, the maximum number of centers C is 100, the learning rate \(\varepsilon \) is 0.8, the radius \(\sigma \) is 10, and the maximum number of generations G is 100,000 (Kohonen 1990). According to 4, a suitable population size is about 20–30 chromosomes. Thus, S is set to 25 in this study.

In the PSO approach, Shi and Eberhart (1998) introduced the inertia weight parameter k into the PSO equation to improve its performance. A suitable choice of k balances global and local exploration and thus requires fewer generations on average to identify a sufficiently optimal solution. Following the inertia weight approach (Kennedy and Eberhart 2001), k is expressed by Eq. (13):

$$\begin{aligned} k=k_{\max } -\frac{k_{\max } -k_{\min } }{E}\times n, \end{aligned}$$
(13)

where \(k_{\max } \) and \(k_{\min }\) are the maximum and minimum values of k, respectively; n is the current epoch number; and E is the maximum number of epochs. The parameter selection problem is formulated as a search problem, and a PSO-based evolutionary learning method is applied to select a parameter set R in the search space, in which the scaling factor is 0.75 (Feng 2006). As originally developed, k often decreases linearly from approximately 0.9 to 0.4 during a run (Amraee et al. 2007). In this study, k decreases linearly from 0.75 to 0.4 as the number of epochs increases. Further, \(c_1 \) and \(c_2 \) in the PSO approach are both set to 2, which gives equal weight to the stochastic terms pulling the particle toward Pbest and Gbest (Wang and Lu 2006).
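As a minimal sketch of how Eq. (13) behaves with the values adopted here, the snippet below computes the inertia weight per epoch and plugs it into a one-dimensional velocity update. The function names are hypothetical, and the velocity update is the textbook inertia-weight form with random factors r1 and r2 drawn from [0, 1]; it is an assumption and not necessarily the exact PSO equation used in this study.

```python
def inertia_weight(n, max_epochs, k_max=0.75, k_min=0.4):
    """Eq. (13): k = k_max - (k_max - k_min) / E * n."""
    return k_max - (k_max - k_min) / max_epochs * n


def update_velocity(v, x, pbest, gbest, k, r1, r2, c1=2.0, c2=2.0):
    """Textbook inertia-weight velocity update for one dimension (assumed form).

    v, x  : current velocity and position of the particle
    pbest : best position found so far by this particle
    gbest : best position found so far by the whole population
    r1, r2: random factors in [0, 1], supplied by the caller
    """
    return k * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)


if __name__ == "__main__":
    E = 1000  # maximum number of epochs used in the experiment
    for n in (0, 500, 1000):
        print(n, inertia_weight(n, E))
```

Because k shrinks as n grows, early epochs keep more of the previous velocity (wider exploration), while later epochs are dominated by the pull toward Pbest and Gbest.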

In the GA approach, according to Azadeh and Tarverdian (2007), \(P_m \) performs best when it varies between 1 and 5%. In addition, the value of \(P_c \) follows the recommendation of Holland (1975). In this study, \(P_m \) and \(P_c \) decrease linearly with epochs from 0.04 to 0.02 and from 0.9 to 0.5, respectively. Moreover, \(P_a \) and \(P_d \) are set to 0.005 following Sarimveis et al. (2004).
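By analogy with Eq. (13), the linear schedules for the two GA rates can be written out explicitly. This is a sketch under the assumption that both rates take their upper values at \(n=0\) and reach their lower values at \(n=E\); the text states only the endpoints and that the decrease is linear in the number of epochs.

$$\begin{aligned} P_m (n)=0.04-\frac{0.04-0.02}{E}\times n,\qquad P_c (n)=0.9-\frac{0.9-0.5}{E}\times n. \end{aligned}$$

For example, with \(E=1000\), the midpoint \(n=500\) gives \(P_m =0.03\) and \(P_c =0.7\).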

Table 2 Parameters setup for the proposed MSPG algorithm in the experiment
Fig. 6 The best approximation result obtained over 1000 trials of the MSPG algorithm with RBFnn trained in Mackey-Glass time series function prediction

Fig. 7 The best approximation result obtained over 1000 trials of the proposed MSPG algorithm with RBFnn trained in continuous test functions: a original and predicted Rosenbrock function with RBFnn trained by the MSPG algorithm, b original and predicted Griewank function with RBFnn trained by the MSPG algorithm, c original and predicted B2 function with RBFnn trained by the MSPG algorithm

Because parameter setup is a drawback of soft computing techniques, this study applies the Taguchi method (Taguchi and Yokoyama 1993) for experimental design. The statistical software MINITAB 14 was used to analyze the parameter design for the algorithm, with the signal-to-noise (S/N) ratio (Lin et al. 2009) used to evaluate the stability of system quality in the experiment. The Taguchi trials (Taguchi et al. 2005) were then configured in an \(\hbox {L}_{9}\; (3^{4})\) orthogonal array for the MSPG algorithm after the experiment had been run thirty times. Finally, the MSPG algorithm starts with the parameter settings shown in Table 2 to ensure a consistent basis in the experiment.
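To make the experimental design concrete, the sketch below lists the standard \(\hbox {L}_{9}\; (3^{4})\) orthogonal array (nine trials, four factors, three levels each) and computes an S/N ratio over repeated runs of one trial. The smaller-the-better form of the S/N ratio is an assumption, since the text does not state which variant was used; the function name and the sample responses are hypothetical.

```python
import math

# Standard L9(3^4) orthogonal array: each row is one trial,
# each column gives the level (1, 2, or 3) of one factor.
L9_ARRAY = [
    (1, 1, 1, 1),
    (1, 2, 2, 2),
    (1, 3, 3, 3),
    (2, 1, 2, 3),
    (2, 2, 3, 1),
    (2, 3, 1, 2),
    (3, 1, 3, 2),
    (3, 2, 1, 3),
    (3, 3, 2, 1),
]


def sn_smaller_is_better(responses):
    """Smaller-the-better S/N ratio: -10 * log10(mean of squared responses).

    A larger S/N value indicates a smaller and more stable response
    (for example, an error measure) across the repeated runs.
    """
    mean_square = sum(y * y for y in responses) / len(responses)
    return -10.0 * math.log10(mean_square)


if __name__ == "__main__":
    # Hypothetical error values from repeated runs of a single trial.
    print(sn_smaller_is_better([0.10, 0.12, 0.09]))
```

In an orthogonal array, every pair of columns contains each combination of levels equally often (here exactly once), which is why nine trials suffice to screen four three-level factors instead of the 81 trials of a full factorial design.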

Performance analysis of experimental results

The best approximation results of the MSPG algorithm for RBFnn trained in the experiment are shown in Figs. 6 and 7. As shown in Fig. 6, for the Mackey-Glass time series function, the output of RBFnn after effective learning with the MSPG algorithm closely follows the function curve.

During the evolutionary procedure in the experiment, the MSPG algorithm learns over the RBFnn parameter solutions generated by the population and is used to find the optimal RBFnn parameter solution. It draws a non-repeating random 65% training set from the 1000 generated data and feeds it to the network for learning. In the same way, it then draws another non-repeating random 25% testing set to verify each parameter solution in the population and calculate its fitness value. Up to this stage, 90% of the dataset has been used by RBFnn. Once the evolutionary process has run for 1000 epochs, the optimal RBFnn parameter solution is obtained. Lastly, a non-repeating random 10% validation set is drawn to assess how well that parameter solution approximates the four benchmark problems, and the RMSE values are recorded to confirm the learning status of RBFnn.
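The three-way split described above can be sketched as follows. This is an illustration under stated assumptions: the function name and the fixed seed are hypothetical, and a single shuffle followed by slicing is only one simple way to guarantee that the three subsets do not share samples.

```python
import random


def split_dataset(samples, seed=0):
    """Shuffle once, then cut into 65% training, 25% testing, 10% validation.

    Slicing a single shuffled copy guarantees that no sample appears
    in more than one subset.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(0.65 * len(shuffled)))
    n_test = int(round(0.25 * len(shuffled)))
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation


if __name__ == "__main__":
    train, test, validation = split_dataset(range(1000))
    print(len(train), len(test), len(validation))
```

With 1000 samples this yields subsets of 650, 250, and 100 samples; the validation subset is held out until the evolutionary process has finished.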

The learning and validation stages described above were carried out for 50 runs. The average RMSE values and their standard deviations (SD) are shown in Table 3. The results indicate that the MSPG algorithm attains the smallest values with stable performance throughout the training process, and that RBFnn obtains from the evolutionary learning process in the population a parameter solution that achieves the best function approximation. Once the training of RBFnn by the MSPG algorithm is finished, the optimal values of the individual parameters (i.e., neuron, width, and weight) become the final settings of the network.

As shown in Table 3, the training and validation errors are consistently small, which means that RBFnn trained through the MSPG algorithm is stable. This result holds not only for the training and validation sets; it also suggests that the network can generalize to other unseen data. It can thus be inferred that over-fitting and over-training problems do not arise in the experiment with the MSPG algorithm.

Table 3 Comparison results for relevant algorithms in the experiment
Table 4 The statistical results for t test among relevant algorithms

Additionally, to verify statistically significant differences, we conducted matched paired-sample t tests on the absolute errors between the predicted values and the source data for each algorithm. The estimation verification and t test results for the relevant algorithms are shown in Table 4. The result for the MSPG algorithm is not statistically significant (p value larger than 0.05), i.e., there is no significant deviation between the predicted values and actual values. Thus, the statistical results indicate that the MSPG algorithm has the best prediction accuracy among the relevant algorithms.

In the next section, a real-world demand estimation case is used to verify the accuracy and practicality of the proposed MSPG algorithm.

Model evaluation results

RBFnn has already been verified to generate accurate approximations on four benchmark problems through the proposed MSPG algorithm, and the comparison with other algorithms illustrates the accuracy of the MSPG algorithm. For the case study, daily sales observations of 500 \(\hbox {cm}^{3}\) containers of papaya milk were provided by a chain convenience store company in Taiwan’s retail industry. The analysis assumes that external experimental factors had no influence and that the trend of papaya milk sales was not disturbed by any special events. Moreover, several parameter values within RBFnn must be set in advance to perform training for the estimation analysis of the case. The relevant algorithms start with the parameter settings shown in Table 5 to ensure a consistent basis in this case.

Table 5 Parameters setting for relevant algorithms in the demand estimation case

Input data and RBFnn learning

Most studies in the literature use a convenient ratio for splitting in-sample and out-of-sample data, such as 70:30, 80:20, or 90:10% (Zou et al. 2007). We use the 90:10% ratio here as the basis of division. The detailed data distribution of the case is shown in Table 6.

Table 6 The observations data distribution in the demand estimation case

The application example uses historical papaya milk sales, which form time series data, for estimation analysis. To obtain convergence within a reasonable number of cycles, the input and output data are normalized and scaled to the range 0–1 by Eq. (14) (Jin et al. 2011).

$$\begin{aligned} x_{ni} =\frac{x_i -x_{\min } }{x_{\max } -x_{\min } }, \end{aligned}$$
(14)

where \(x_i \) is the actual value of the observed data, \(x_{\max } \) and \(x_{\min } \) are the maximum and minimum observation values of the dataset, and \(x_{ni} \) is the normalized value of the observed data. The first 90% of the observations were used for model estimation, while the remaining 10% were used for validation and one-step-ahead forecasting. This study describes how the data are input to RBFnn for estimation through the relevant algorithms and compares the results with Box–Jenkins models.
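A minimal sketch of Eq. (14) and the chronological 90:10 division follows. The function names and the sample sales figures are hypothetical and serve only to illustrate the arithmetic; they are not the case data.

```python
def min_max_normalize(values):
    """Eq. (14): x_ni = (x_i - x_min) / (x_max - x_min), scaled to [0, 1]."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        raise ValueError("all observations are equal; Eq. (14) is undefined")
    return [(x - x_min) / (x_max - x_min) for x in values]


def chronological_split(series, estimation_share=0.9):
    """Keep time order: the first share for estimation, the rest for validation."""
    cut = int(round(len(series) * estimation_share))
    return series[:cut], series[cut:]


if __name__ == "__main__":
    sales = [120, 80, 100, 160, 140, 90, 110, 130, 150, 95]  # hypothetical values
    normalized = min_max_normalize(sales)
    estimation, validation = chronological_split(normalized)
    print(len(estimation), len(validation))
```

Unlike the random split used for the benchmark problems, the series is cut without shuffling so that the validation observations always follow the estimation observations in time, as one-step-ahead forecasting requires.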

Fig. 8 The estimation results comparison of the proposed MSPG algorithm and ARMA (1, 2) model for the demand estimation case

Building the Box–Jenkins models

EViews\(^\mathrm{TM}\) 6.0 software was used for the analysis of Box–Jenkins models to calculate the numerical results. If the data are stationary, model estimation can be implemented directly. This research first carries out data identification for the Box–Jenkins models through augmented Dickey–Fuller (Dickey and Fuller 1981) testing. Next, the study carries out demand estimation based on ARIMA (p, d, q) models. The procedure can be divided into three steps (Babu and Reddy 2014): (1) identifying the model order (i.e., identifying p and q): the Akaike (1974) information criterion (i.e., \(\hbox {AIC value} = -2.3062\)) was employed to select the optimal model (Engle et al. 1987) (i.e., the ARMA (1, 2) model); (2) estimating the model coefficients: the model diagnosis reveals that the p values of the Ljung–Box statistic (i.e., Q-statistic) (Kmenta 1986) are greater than 0.05 for the Box–Jenkins models, which indicates that the residuals are white noise (i.e., serially uncorrelated) and that the model is suitably fitted; (3) forecasting the data.

Table 7 The estimation errors comparison for relevant algorithms using the demand estimation case

Error measure of the estimation performance in the case

The mean absolute error (MAE), RMSE, and mean absolute percentage error (MAPE) are applied to evaluate the forecasting accuracy (Zhang et al. 2015b). The results for the forecasting set of the case are shown in Fig. 8. The estimation performance of the above-mentioned algorithms on the case data is presented in Table 7.
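The three error measures can be sketched as below, using their usual textbook definitions; the exact formulas in the cited reference may differ slightly (for example, in whether MAPE is reported as a percentage). The function names and sample values are hypothetical.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent; assumes no actual value is zero."""
    ratios = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * ratios / len(actual)


if __name__ == "__main__":
    actual = [100.0, 200.0]      # hypothetical observations
    predicted = [110.0, 180.0]   # hypothetical forecasts
    print(mae(actual, predicted), rmse(actual, predicted), mape(actual, predicted))
```

RMSE penalizes large individual errors more heavily than MAE, while MAPE expresses the error relative to the size of each observation, so reporting all three gives a fuller picture of forecasting accuracy.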

Among these algorithms, the proposed MSPG algorithm yields the smallest RMSE, MAE, and MAPE values. The numerical results show that, compared with traditional Box–Jenkins models, the MSPG algorithm can substantially improve the accuracy of practical demand estimation.

Moreover, the proposed MSPG algorithm can be applied to the internal information system of the company in the case study to forecast product demand. It can be further applied to the intelligent manufacturing system of a production line to generate product demand forecasts for different clients so that they can manage their supply. It can therefore cope with real situations in the industry and meet the need for product demand customization (e.g., few items with large demand, many items with less demand, and many items with large demand) by adjusting supply dynamically.

Conclusions

This study proposes the MSPG algorithm, which combines the automatic clustering ability of SOMnn with the PG algorithm to provide the settings of RBFnn parameters. Complementary evolutionary operations improve the diversity of the population and also increase the precision of the results. In addition, a case study and the tuned parameter values of RBFnn trained with the algorithm have been given. The MSPG algorithm yields better network parameter settings and consequently enables RBFnn to learn and approximate better on four benchmark problems and in the demand estimation case.

In the future, it may be promising to employ different evolutionary computation algorithms, such as ant colony optimization (ACO), artificial immune system (AIS), and ABC algorithms, and to train different types of NNs. Moreover, sales data over a shorter term may be more stable and thus more beneficial for improving demand estimation, so the prediction accuracy for product sales data within shorter periods can be further compared in the future. Additionally, significant fluctuations and changes are common in general sales prediction and may result from exogenous variables or unexpected variances such as sales force, promotional campaigns, and exposure in international exhibitions. These exogenous variables were not considered in this study and could be considered in future work.