Abstract
Over the past few decades, time series forecasting (TSF) has been predominantly performed using different artificial neural network (ANN) models. However, the performance of ANN models in TSF has not yet been fully explored due to several issues like the determination of near-optimal ANN architecture for a time series and the efficiency of training algorithm used to determine the near-optimal weights of ANN. Motivated by this, we have proposed an adaptive differential evolution (DE)-based modelling scheme to automatically determine the near-optimal architecture of ANN for a time series under study. Additionally, we have proposed an adaptive differential evolution-based ANN training algorithm (ADE-ANNT) to determine the near-optimal weights of ANN. To make the adaptive modelling scheme consistently effective, several comparisons are made between different alternatives in the treatment of trend component and normalization techniques. Twenty-one benchmark time series datasets are being considered to assess the comparative performance of the proposed method with the established forecasting models, namely autoregressive integrated moving average, exponential smoothening with error, trend and seasonality, deep belief network and multilayer perceptron + Levenberg–Marquardt (LM) method. To assess the efficiency of the proposed ADE-ANNT training algorithm, comparisons are made with the ANN training algorithms based on recently developed evolutionary algorithms, such as TLBO-ANNT, DE-CRO-HONNT and DE-ANNT+; and the most popular LM training algorithm. Extensive statistical analysis on simulation results reveal the statistical superiority of the proposed training algorithm and proposed method when compared with their counterparts for the datasets used.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
1 Introduction
Future of most of the phenomena is a consequence of its past. Therefore, future values of a phenomenon can be extrapolated by systematically analysing its past values. Such a process of predicting the future values of a phenomenon by analysing the past observations is known as time series forecasting (TSF). Accurate TSF plays a vital role in almost every area of study including finance, economics, engineering and management science. Conventionally, statistical models like exponential smoothing, autoregressive integrated moving average (ARIMA) have been widely used in TSF [1]. These models work under the assumption of the linear correlation structure of time series data and often fail to provide a satisfactory result when the time series possess nonlinear patterns. To add to this owe, most of the real-world time series is nonlinear and often suffered from the issues of temporal as well as spatial variability. Hence, traditional statistical models cannot handle the nonlinear patterns equally well as linear patterns. This is because linear models can’t handle nonlinear patterns and vice versa [2]. Therefore, nonlinear statistical models like autoregressive heteroskedastic models (ARCH and GARCH family) have been applied in TSF to capture the nonlinear patterns. However, the existence of different variations of these models [3, 4] imposes a burden in choosing the suitable model, and thus limits its application to a generalized forecasting problem.
In the past two decades, ANN models have gained popularity in forecasting theory due to several advantages features. First, unlike traditional models, ANNs are data-driven models, i.e. instead of modelling data by a pre-defined model; ANN employs the data to determine the parameters of the model. Also, ANNs neither require any a priori assumption on model nor on the characteristics of data to be modelled. Second, ANNs are universal approximators, i.e. ANNs can model any nonlinear or linear function to any desired level of accuracy [5]. Finally, ANNs possess nonlinear modelling ability, i.e. can learn and generalize any function [6]. Considering all these advantages of ANN, thousands of research papers using different ANN models have been published to solve various problems in TSF. Despite more than two decades of research on ANN for TSF and some studies [7, 8] indicating the superior performance of ANN models than statistical models, some studies [9, 10] have also indicated inferior performance under certain datasets. This inconsistent performance of ANN models across various studies was due to several factors such as heuristic and ad hoc modelling process [3], pre-processing (e.g. detrending and normalization) of time series data [11,12,13] and the effectiveness of the training algorithms used [3]. If these factors were suitably optimized, ANNs would be established as a reliable tool for TSF.
All the ANN-based forecasting methods proposed to forecast time series have used two approaches of ANN model design. In the first approach [14,15,16,17,18], the architecture of ANN (number of inputs, number of hidden layers, number of neurons in each hidden layer) is given, and the remaining parameters (weights, learning rate, etc.) of ANN are computed in the later part of the modelling process. In this approach, neither optimal nor acceptable results can be guaranteed. Contrasting to the first approach, the second one [19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36] determines the architecture and other parameters automatically using different techniques such as pruning algorithm [19, 20], growing algorithm [29], cellular automata [30] and evolutionary algorithms [21,22,23,24,25,26,27,28, 31,32,33,34,35,36]. The application of evolutionary algorithms to construct the architecture of ANN dominated the literature. It is mostly based upon genetic algorithm (GA) [21, 25, 26, 32], estimation distribution algorithm (EDA) [24, 33], particle swarm optimization (PSO) [27], grasshopper optimization algorithm (GOA) [28] and differential evolution (DE) [24, 33]. The evolutionary algorithms used either direct encoding schemes (DES) [24,25,26,27,28, 31,32,33,34,35,36] or indirect encoding scheme (IES) [21,22,23]. In DES, the chromosomes contain information about the learning parameters, architecture and parameters of the topology. In contrast, in IES, the chromosomes contain necessary information so that a constructive method can be used to delineate the architecture of ANN. Although no study has been made relating to the comparative performance of DES and IES, the DES has been used predominantly due to its simplicity in implementation.
In this paper, an adaptive modelling scheme based on DES for time series forecasting is developed. To make a robust evaluation of the proposed modelling scheme, the choice of the ANN model plays a vital role. To address this issue, Makridakis et al. [37] conducted a comparative study on different machine learning (ML) models for TSF and suggested that multilayer perceptron (MLP) provides better result than other ML models including long short-term memory (LSTM). Therefore, in this paper, a single hidden layer feedforward MLP is considered as the forecasting model. The number of neurons in the output layer is fixed to one, whereas the number of inputs and hidden neurons is automatically determined using an adaptive evolutionary algorithm. The DES is used to encode the architecture of ANN into chromosomes; meanwhile, the fitness of each chromosome is obtained after training with the proposed training algorithm (ADE-ANNT). The major contributions of the paper are as follows:
-
(i)
An adaptive DE-based modelling scheme (DEMS) has been developed to determine the near-optimal architecture of ANN for a time series under study.
-
(ii)
An adaptive DE-based ANN training algorithm (ADE-ANNT) has been proposed to determine the near-optimal weights of ANN.
-
(iii)
To make the DEMS consistently effective, several comparisons are made between different alternatives in pre-processing techniques such as treatment of trend component and normalization techniques.
The rest of this paper is organized as follows. Section 2 provides an overview of ANN and DE algorithm. The proposed ADE-ANNT training algorithm is explained in Sect. 3. Section 4 presents a detailed description of the proposed methodology. In Sect. 5, the experimental set up is described, and simulation results are analysed. Finally, the conclusions are drawn in Sect. 6.
2 Preliminaries
2.1 Artificial Neural Network (ANN)
ANNs are data-driven, self-adaptive, flexible models that can approximate an enormous class of nonlinear functions to any preferred degree of precision. Thus, different ANN models (MLP [14], radial basis function (RBF) neural network [15], LSTM [16], convolutional neural network (CNN) [17], functional link ANN (FLANN) [18], deep belief network [27], etc.) have been applied in a variety of TSF applications. Out of different ANN models, feedforward MLP with a single hidden layer is most popular in TSF [3, 37]. This is because of its complex nonlinear mapping capability and universal approximation capability [3]. The architecture (as in Fig. 1) of such MLP consists of an input, hidden and output layer. The neurons of adjacent layers are symmetrically connected by directed acyclic links. In TSF, the number of inputs and hidden neurons is flexible, whereas the number of output neurons is usually kept to one. Solving the TSF problem using MLP can be considered as identifying the underlying relationship (as in Eq. 1) existing between the output yt and the past k values of the series \( y = [y_{1} ,y_{2} , \ldots ,y_{n} ] \). Therefore, the time series data are first transformed to n-k patterns, with each pattern has k inputs \( y_{t - 1} ,y_{t - 2} , \ldots ,y_{t - k} \) and one target \( y_{t} \). Generally, the patterns are partitioned into train, validation and test sets. The training algorithms like evolutionary training algorithms [18, 38,39,40,41,42,43,44], Levenberg–Marquardt (LM) algorithm, etc. use the train and validation sets to determine the weights and other parameters of ANN. Then, the future values on the test set are computed and used to evaluate the performance of the model.
where \( \alpha_{j} \) and \( \beta_{ij} \) are the set of weights between input–hidden and hidden–output layers. \( \beta_{0j} \) and \( \alpha_{0} \) are the bias unit of hidden and output neurons. Usually, binary sigmoid activation function (as in Eq. 2) has been widely used in the neurons of the hidden layer. In contrast, the linear activation function has been used in the neurons of the output layer.
2.2 Differential Evolution
In 1997, Storn et al. [45] proposed a simple yet efficient evolutionary algorithm called differential evolution (DE). Compared to other evolutionary algorithms, DE is straightforward and easier to implement. Even though PSO and its variants are easy to code, DE variants outperformed the PSO variants and other evolutionary algorithms [46,47,48]. The recent review paper [49] summarizes the superior performance of DE than different evolutionary algorithms in CEC competitions organized from 2005 to 2018. DE is a stochastic search method, which starts with a set of chromosomes (decision vectors), with each chromosome consisting of a set of genes (decision variables). The chromosomes of a generation use mutation, crossover and greedy selection operations repeatedly to move the population towards the next generations until a near-optimal solution is achieved.
All DE algorithm variants operate in the following steps:
-
1.
Initialization of problem parameters, algorithm parameters and initial population (consisting of a set of chromosomes).
-
2.
Calculate the fitness of each chromosome/individual.
-
3.
For each chromosome (target vector) of the current generation, apply step: 3.1 to step: 3.3 to generate the chromosomes of next generation.
-
4.
If the termination criteria are satisfied, go to step-5 or else go to step-3
-
5.
Use the fittest individual of the final generation as the solution to the problem.
Let the population of gth generation is \( p_{g} = \left( {C_{g}^{1} ,C_{g}^{2} , \ldots C_{g}^{i} \ldots C_{g}^{PopSize} } \right) \), with \( C_{g}^{i} = \left( {W_{g}^{i1} ,W_{g}^{i2} , \ldots W_{g}^{ij} \ldots W_{g}^{iD} } \right) \) for i = 1,2,3…..PopSize, D = length of each chromosome,\( W_{g}^{ij} \) = jth gene of the ith individual in gth generation. The mutation for ith chromosome produces a mutant vector \( M_{g}^{i} = \left( {M_{g}^{i1} ,M_{g}^{i2} , \ldots M_{g}^{ij} \ldots M_{g}^{iD} } \right) \), which is used in the crossover with the current chromosome \( C_{g}^{i} = \left( {W_{g}^{i1} ,W_{g}^{i2} , \ldots W_{g}^{ij} \ldots W_{g}^{iD} } \right) \) to produce a trial vector \( T_{g}^{i} = \left( {T_{g}^{i1} ,T_{g}^{i2} , \ldots T_{g}^{ij} \ldots T_{g}^{iD} } \right) \)
Since the inception of DE, it has been modified and upgraded rigorously in recent years [49, 50]. However, the variants of the DE algorithm mostly vary in the nature of mutation and crossover scheme being used. Storn and Price [51, 52] suggested different mutation schemes (as in Eqs. 3 to 7) for differential evolution to generate the mutant vector. Two crossover methods such as binomial (as in Eq. 8) and exponential (as in Eq. 9) can be applied on any of the mentioned mutation strategies to generate the trial vector. Thus, a total of \( 2 \times 5 = 10 \) different DE mutation strategies can be used to generate a trial vector. Then, selection (as in Eq. 10) operation is applied to select the better chromosome between trial and target vector for the next generation.
The notations used above are in the form DE/a/b, where DE means differential evolution, a indicates the choice of base vector used in mutation (it may be a randomly chosen vector, target vector or the best vector), b indicates the number of difference vectors applied in mutation; \( M_{g}^{i} \) stands for ith mutant vector in gth generation; \( C_{g}^{best} \) for the best vector of the population in gth generation; \( C_{g}^{r} \) for a randomly chosen chromosome from the population of generation g; \( C_{g}^{i} \) is the target (donor) vector; Cr is the crossover probability; F is the scale factor; \( T_{g}^{ik} \) kth gene of ith chromosome in gth generation; rand (0,1) represents a random number between 0 and 1; and the value l and k are chosen based on the crossover probability in such manner that: \( 1 \le l \le \left( {l + k} \right) \le D \) with \( D \) is the dimension of the chromosome. Note that, all the chromosomes which are selected for any mutation scheme must be distinct to each other.
3 Proposed ANN Training Algorithm (ADE-ANNT)
In the proposed method, a large number of different ANN architectures need to be trained to obtain the best architecture for TSF. This demands a robust training algorithm that reaches a near-optimal solution in quick time. Conventionally, gradient-based algorithms are widely used for training ANN. However, the gradient-based methods have several shortcomings such as unreliable [53], slow convergence speed [54], chances to trapping to local optima is high [55], performance is subjected to initial values of algorithm parameters [54], etc. Therefore, recently, several evolutionary ANN training algorithms [18, 38,39,40,41,42,43,44] have been developed. Out of several evolutionary algorithms, the applications of DE to ANN training [38,39,40,41, 44] have dominated the literature. Some authors [38,39,40, 44] have used traditional DE variants for ANN training, whereas Slowik [41] proposed DE-ANNT + training algorithm with multiple trial vectors concept. In DE-ANNT + algorithm, for each target vector, multiple mutant vectors are generated. The mutant vector having the best fitness is used in crossover to generate a trial vector. Then, selection operation is carried out between the trial vector and target vector to generate the chromosomes for the next generation. The control parameters F and Cr are obtained based on the rate of change of the best solution (Eq. 11)\( , \) where TheBest represents the fitness of the best chromosome. In Slowik [41], the population is said to be stagnated when the value of R (Eq. 11) of the population does not change by a certain percentage. DE-ANNT + has shown better performance than extended backpropagation (EBP), EA-ANNT and DE-ANNT. Also, the results are as good as the Levenberg–Marquardt algorithm. However, the major drawback of DE-ANNT + is that it suffers from exploitation due to the generation of multiple mutant vectors for each individual in every generation. Therefore, DE-ANNT + is not so suitable for the proposed self-adaptive modelling scheme.
To develop a robust DE-based training algorithm, the choice of DE strategy plays a vital role. To address this issue, Mezura-Montes et al. [56] made a comparative study on eight different DE schemes using 13 benchmark problems and suggested that the DE/best/1/bin scheme provides robust result irrespective of the problem in hand. However, the DE/best/1 scheme has the tendency of premature convergence, which can be effectively avoided by generating multiple trial vectors upon stagnation. Therefore, by integrating DE/best/1/bin strategy with the concept of multiple trial vectors upon stagnation, an adaptive differential evolution with conditional multiple trial vectors for ANN training (ADE-ANNT) is proposed. In addition, the values of control parameters F and Cr are determined self-adaptively based on the problem in hand.
Algorithm 1 presents the pseudocode of ADE-ANNT. In this algorithm, the trainable weight set of an ANN is encoded into the chromosomes as a list of real numbers (as in Fig. 2) with each gene representing a weight of ANN. Hence, the length of each chromosome for an ANN with k inputs, q hidden neurons and one output neuron with bias component is \( q \times \left( {k + 2} \right) + 1 \). The first generation of the population consists of PopSize number of randomly selected chromosomes, with each chromosome denoting the weights of ANN. The fitness of each chromosome (weight set) is obtained by using Eq. 12. Note that, when a chromosome has a lower root mean square error (RMSE) (as in Eq. 13) than another chromosome, the chromosome having lower RMSE is considered as better. Then, the next generations of the population are repeatedly generated until the termination criteria are satisfied. The next generation of the population is generated by applying mutation, crossover and greedy selection operators on each chromosome of the current generation. DE/best/1 mutation scheme is applied to generate the mutant vector(s) using the proposed parameter adaptation scheme. The number of mutant vectors depends on the stagnation criteria, i.e. upon stagnation of the population, multiple mutant vectors are generated by using multiple scale factors. Unlike DE-ANNT + [41], all the mutant vectors are crossed over with the target vector to generate multiple trial vectors (v > 1) and then the best trial vector is used for selection. If stagnation does not occur, one scale factor is used to generate one mutant vector, which consequently generates one trial vector. Moreover, motivated by [57, 58], the scale factor and crossover probability are self-adaptively, and dynamically determined considering the problem in hand.
where n represents the number of observations, Ti and Yi are computed and actual values of ith observation.
3.1 Parameter Adaption in ADE-ANNT
Algorithm 2 presents the pseudocode of the parameter adaptation used in the proposed ADE-ANNT training algorithm. In every generation, the scale factor Fi and crossover probability Cri for each target vector is generated. When stagnation occurs, more than one trial vectors (v) are generated. To achieve this, v (> 1) number of scale factors are used to generate v mutant vectors which after crossover generates v trial vectors. If stagnation does not occur, one scale factor is used to generate one mutant vector, which consequently generates one trial vector.
Where rand(1,v) stands for v randomly generated numbers from a uniform distribution within the range (0, 1), Cauchyrnd(Fm, 0.1) is a randomly generated number from a Cauchy distribution having scale parameter 0.1 and location parameter Fm, Gaussianrnd(Crm, 0.1) is a randomly generated number from a Gaussian distribution with standard deviation 0.1 and mean Crm.
Initially, the location parameter Fm is set to 0.5 and is then updated after each generation (using Eq. 14).
where WF is the weight factor which varies between 0.8 and 1. Fsuccess is the set of successful scale factors using which better trial vectors for the next generation are generated.
The mean of the Gaussian distribution Crm is initially set to 0.5 and is then updated after each generation (using Eq. 15).
where WCr is the weight factor varying within 0.8 and 1. Crsuccess is the set of the successful crossover probabilities generating better trial vectors in the current generation.
3.2 Explanation of Parameter Adaption
Upon stagnation, multiple scale factors are used to generate multiple mutant vectors. The multiple mutant vectors are crossed over with the target vector to generate multiple trial vectors. The generation of multiple trial vectors around the existing solution assists in getting out of stagnation. This is because (1) If the difference vector \( C_{g}^{r1} - C_{g}^{r2} \) becomes small, i.e. when the vectors converge to a small domain, the larger values of scale factor assists in getting out of the suboptimal valleys/peaks causing stagnation; (2) If the difference vector \( C_{g}^{r1} - C_{g}^{r2} \) is large, the population may jump the global minima due to larger difference vectors, but the use of scale factors having smaller values assists in overcoming this problem. Thus, the generation of multiple scale factors from a uniform distribution between (0–2) helps in performing both local and global search.
When stagnation does not occur, only one scale factor is randomly generated from a Cauchy distribution with scale factor 0.1 and location parameter Fm. This is because the location parameter Fm of Cauchy distribution varies the value of F more than the conventional normal distribution. Moreover, during the adjustment of Fm, the use of the arithmetic mean guides Fm towards a higher value which aids larger perturbation to the target vectors; hence, premature convergence to local minima is avoided. Irrespective of stagnation criteria, Fsuccess records the successful scale factors that generate better trial vectors than target vector in the current generation. Thus, it increases the chances of generating better trial vectors as the generations proceed further.
The crossover probability generated is independent of stagnation the criterion, i.e. for both the cases, the crossover probability is randomly generated from a Gaussian distribution with standard deviation 0.1 and mean Crm. The adaptation of Crm is also based on the memorization of thriving crossover probabilities generating better trial vectors. The use of arithmetic mean while adjusting the value of Crm avoids the bias of Cr towards smaller values [41]. Here, we have chosen Gaussian distribution because it glorifies the chances to generate the majority of Cr values within unity [41].
4 Proposed Methodology
This section presents the proposed DE-based modelling scheme (DEMS) to automatically determine the most parsimonious ANN architecture for a time series under study. The DEMS method (as in Fig. 3) follows the wrapper approach of model selection. It uses the proposed ADE-ANNT training algorithm to determine the near-optimal weight set and fitness of ANN architectures. Algorithm 3 presents the steps of the proposed DEMS method which operates in four steps, namely (a) Initialization, (b) Pre-processing, (c) ANN modelling and (d) Forecast.
In the initialization step, the problem parameters like the length of the train (\( l_{tr} \)), validation (\( l_{v} \)) and test (\( l_{te} \)) are initialized. The number of inputs of ANN is restricted to 1 to \( k_{max} \) (computed using Eq. 16), and the number hidden layer neurons of ANN is restricted to 1 to \( q_{max} \) (calculated using Eq. 17).
In the pre-processing step, the time series is normalized to a range between (0–1). To assess the sensitivity to different normalization techniques, min–max (as in Eq. 18), decimal scaling (as in Eq. 19) and vector (as in Eq. 20) normalization techniques are considered. Also, the effect of treatment of trend by taking the first difference of the series is tested. But differencing is avoided due to deterioration in performance. Let \( y = \left[ {y_{1, } y_{2, \ldots \ldots \ldots \ldots .,} y_{n } } \right]^{T} \) be the n-dimensional time series vector and \( y^{\prime} = \left[ {y_{1}^{'} , y_{2}^{'} , \ldots ,y_{n}^{'} } \right]^{T} \) be the normalized series vector with \( y_{i}^{'} \) is the ith normalized data of \( y_{i} \).
In the ANN modelling step, the proposed adaptive DE algorithm is used to search the most parsimonious ANN architecture in the space of all possible architecture combinations. The DES is used to encode the ANN architectures into the chromosomes. Initially, \( P_{size} \) number of chromosomes are randomly chosen between 0 and 2.5, with each chromosome \( C_{j} = \left\{ {g_{1} ,g_{2} ,g_{3} ,g_{4} ,g_{5} ,g_{6} ,g_{7} ,g_{8} } \right\} \) consisting of 8 genes (4 are used for the number of inputs and four are used for the number of hidden neurons). Then, the fitness (i.e. the RMSE measure on train set) of each chromosome is calculated. To calculate the fitness, the genotypes are first converted to phenotypes (i.e. ANN architectures). The number of inputs (k) of ANN is calculated using Eq. 21 and the number of neurons in the hidden layer (q) are computed using Eq. 22.
Once the number of inputs of ANN is determined, the normalized time series is converted to \( n - k \) patterns with each pattern consisting of k inputs, one target. After the determination of \( k - q - 1 \) ANN architecture and patterns, the fitness of ANN architecture is obtained using the proposed ADE-ANNT training algorithm. Then, the DE mutation, crossover and selection operators are repeatedly applied till the termination criterion is satisfied, i.e. the error on the validation set increases. When the termination criterion is satisfied, the architecture having the best fitness is chosen as the near-optimal ANN architecture for the time series under study.
In the Forecast step, the values for the test set are predicted using the obtained k-q-1 architecture and near-optimal weight set. The predicted values are de-normalized to obtain the actual forecasts, and then forecast accuracy is measured.
5 Experimental Setup and Results
In this section, the simulation results, along with the analysis and discussions, are presented. ARIMA and ETS models are a linear form of most popular neural network MLP and hence considered as an excellent benchmark to compare with ML models [37]. Therefore, in this study, ARIMA and ETS are considered as comparative models. The appropriate ARIMA and ETS model for a time series is determined by using the Forecast package of R [59]. In addition, to evaluate the effectiveness of the proposed architecture selection DEMS method, another architecture selection method using DBN [27] is considered. Additionally, the MLP model trained using the Levenberg–Marquardt (LM) algorithm (MLP + LM) is used for comparison. The MLP + LM method is implemented using the neural network toolbox of MATLAB. In the MLP + LM method, the number of output is fixed to one, whereas the number of input is determined by analysing the autocorrelation and partial autocorrelation function of the time series. To avoid the inconsistencies arising due to improper ANN architecture and to make a fair comparison with the proposed DEMS method, the number of hidden neurons in MLP + LM method is determined by altering the neurons between 1 and 20 (as in [2, 60]) and the architecture having the smallest mean SMAPE over 50 independent simulations is considered. In all the MLP models, the binary sigmoid activation function is used in the neurons of the hidden layer, and linear activation is used in the neuron of the output layer.
Twenty-one time series datasets (described in Table 1) from time series data library [61] are considered to evaluate the methods using root mean square error (RMSE as in Eq. 12) and symmetric mean absolute percentage error (SMAPE as in Eq. 23). SMAPE is a scale-free measure which can be used to evaluate the forecasting methods across one or more time series datasets and hence used in the NN3 forecasting competition to assess the forecasting methods [13]. Therefore, in this paper, SMAPE measure is considered to evaluate the forecasting methods. Additionally, since RMSE is used to assess the fitness of the proposed ADE-ANNT training algorithm, the RMSE measure is also considered to evaluate the forecasting methods. In the proposed DEMS method, the population size \( P_{size} \) of DE is fixed to 10, the number of genes in each chromosome is fixed to 8. The genes are initialized and bound in a range (0–2.5). To make a fair comparison of the proposed DEMS method with DBN [27], the number of particles in PSO is set to 10, early stopping is used in backpropagation algorithm, and other experimental parameters are kept same to that of DBN [27]. The parameters used in the proposed ADE-ANNT training algorithm are presented in Table 2. Table 3 presents the most parsimonious ARIMA model, ETS model and ANN architecture obtained and used in the simulations. With this experimental setup, for each method, 50 independent simulations are carried out on each time series dataset (as in Table 1). The analysis on obtained results are presented in two sections: (a) Analysing results using RMSE measure, (b) Analysing results using SMAPE measure.
where n represents the number of observations, Ti and Yi are computed and actual values of ith observation.
5.1 Analysing Results Using RMSE Measure
Table 4 presents the mean RMSE over 50 independent simulations for each time series. However, to make a comparative statistical analysis of the proposed DEMS method with other methods, the Wilcoxon signed-rank test [62] is applied. Table 5 presents the individual hypothesis test results. It can be observed from Tables 4 and 5 that the DEMS method provides the best RMSE on 13 time series datasets (Acres Burned, Accidental Death, Internet Traffic, Gas Usage, Lake, Lynx, Milk, Mumps, Tasty Cola, Rainfall, Passenger, Temperature, Unemployment) with 11 statistical significant (as in Table 5). Additionally, it provides better RMSE than ETS in 16 cases, ARIMA in 15 cases (14 statistically significant), MLP + LM in 16 cases and DBN [27] in 21 cases (18 statistically significant). It provides inferior RMSE than ETS, ARIMA and MLP + LM in only 4, 5 and 4 time series datasets, respectively, and for other cases, it provides statistically equivalent RMSE.
5.2 Analysing Results Using SMAPE Measure
In this subsection, the results are analysed using the scale-free SMAPE performance measure. Therefore, the results are analysed considering each time series individually and all time series at a time. For evaluating the methods dataset wise, the mean SMAPE over 50 independent simulations is calculated and presented in Table 6. To statistically compare the proposed DEMS method with other methods, Wilcoxon signed-rank test [62] is applied, and the test results are presented in Table 7. It can be observed from Table 6 that the DEMS method provides the best SMAPE in 11 time series datasets (Accidental Death, Chickenpox, Internet Traffic, ITDUK, Lake, Milk, Passenger, Sun Spot, Tasty Cola, Temperature, Unemployment) with statistical significance in 8 cases (as in Table 7). In addition to this, it provides better SMAPE than ETS in 15 time series datasets, ARIMA in 15 time series datasets, MLP + LM in 16 time series datasets and DBN [27] in 17 (16 statistically significant) time series datasets. It provides inferior SMAPE than ETS in 4, ARIMA in 5, MLP + LM in 2 and DBN [27] in 1 time series datasets and for other cases, it provides statistically equivalent SMAPE.
To determine the best method considering all the time series datasets, Friedman and Nemenyi hypothesis test [62, 63] is applied on the SMAPEs of all time series datasets. The test results presented in Fig. 4 indicates the superiority of the DEMS method as it has the lowest mean rank (69.69) among all the methods employed in this study. In addition, since the mean rank of DEMS method is lower than other methods with at least an absolute difference of critical distance (1.3), the DEMS method is statistically superior to other methods.
To make the proposed DEMS method consistently effective, several comparisons between different designs are performed. The design strategies are intended to evaluate the effect of treatment of trend component, sensitivity to different normalization techniques and effectiveness of the proposed ADE-ANNT training algorithm.
For this, the design configuration (i.e. the full design parameter) used in the proposed method using min–max normalization technique and ADE-ANNT training algorithm is treated as the baseline. To evaluate the effect of treatment of trend component, a detrending step by taking the first difference of time series data is added before the normalization step. Here, the first difference is considered because detrending data by making the first difference is the most suitable method for building the ANN model for nonlinear and stochastic time series [12]. To evaluate the sensitivity to different normalization techniques, min–max (as in Eq. 18), decimal scaling (as in Eq. 19) and vector (as in Eq. 20) normalization techniques are considered. To evaluate the sensitivity to different training algorithms, the proposed ADE-ANNT training algorithm, TLBO [64] for ANN training (TLBO-ANNT), DE-ANNT + [41] (with five trial vectors), DE-CRO-HONNT [43] and Levenberg–Marquardt (LM) algorithm are considered. Additionally, to evaluate the effectiveness of architecture selection using DEMS, results from MLP + LM (architecture selection using ACF & PACF and trained with LM) method is compared with DEMS + LM (architecture obtained using DEMS method and trained with LM algorithm). In the comparison of architecture selection methods, the DBN [27] method is not considered because of its inferior statistical performance than the MLP + LM method (as in Fig. 4). In addition, to avoid the bias of DEMS method towards ADE-ANNT training algorithm (since the architectures are selected using ADE-ANNT training algorithm), the LM algorithm is used to train the models in DEMS + LM method. By making these modifications, 50 independent simulations are conducted for each time series dataset. To evaluate the sensitivity of the DEMS method considering different strategies, a one-way analysis of variance (ANOVA) test is conducted on the obtained SMAPEs. The test results are presented in Table 8. It indicates that the architectures selected using the DEMS method are statistically insensitive (p-value > 0.05) to different normalization techniques. However, the chosen architectures using the DEMS method are sensitive to detrending using the first difference of the series (p-value < 0.05) and training algorithms (p-value < 0.05). Additionally, the performance of the ANN model is sensitive to the architecture selection method (p-value < 0.05).
To identify the effect of treating the trend component by using the first difference of the series, Friedman and Nemenyi hypothesis test [62, 63] is applied on the obtained SMAPEs, and the test results are presented in Fig. 5. It can be observed that the proposed method performs statistically better when detrending using the first difference is avoided. Contrasting to the claim of the superior performance of ANN models with detrending using the first difference by Qi and Zhang [12], in this study, the performance of ANN model deteriorates significantly when detrending using the first difference is used. However, in Qi and Zhang [12] method, the architecture of the ANN model is chosen in an ad hoc manner. In addition, they suggested that the conclusions may change with different parameters. From simulation results, it is claimed that ANN with an appropriate architecture can effectively handle the trend component of the time series. And in such a case, detrending using the first difference deteriorates the performance of the model significantly.
To identify the best training algorithm among all the algorithms considered in this study, Friedman and Nemenyi hypothesis test [62, 63] is applied to the obtained SMAPEs. The test results shown in Fig. 6 indicate the clear superiority of the proposed ADE-ANNT training algorithm. The most competing algorithm with the proposed ADE-ANNT algorithm is LM. However, the absolute difference in mean rank (1.51) is more than the critical distance (CD) (1.3). Hence, their difference is statistically significant. Moreover, the memory consumption by the proposed ADE-ANNT algorithm is always less than or equal to that of DE-ANNT + [41], since, in ADE-ANNT algorithm, multiple trial vectors are generated only upon stagnation. In the worst-case (i.e. stagnation occurs at every generation, which is a rare case) both the algorithms have the same memory consumption. The DE-ANNT + [41] algorithm has lower memory consumption than the LM algorithm. Thus, the ADE-ANNT algorithm has lower memory consumption than LM algorithm as well. In the proposed wrapper approach-based DEMS method, a variety of ANN architectures need to be trained. Thus, considering the memory consumption and forecasting accuracy, ADE-ANNT algorithm is more suited in the proposed modelling scheme.
Table 8 indicates that the DEMS + LM and MLP + LM methods provide statistically different overall forecasting accuracy considering SMAPE measure. In both methods, the LM training algorithm is used but on different architectures. In DEMS + LM method, the architectures are obtained using the proposed DEMS (as in Table 3) method whereas, in MLP + LM method, the architectures are obtained using ACF and PACF function of the series. To identify the better method, Friedman and Nemenyi hypothesis test [62, 63] is applied to the obtained SMAPEs, and the test results are shown in Fig. 7. It indicates the superiority of the proposed architecture selection DEMS method.
6 Conclusion
In this paper, a systematic investigation is made to explore the full potential of multilayer feedforward ANN, especially single hidden layer MLP in TSF. For this, a DE-based modelling scheme (DEMS) is developed to determine the most parsimonious architecture of MLP for a time series under study. In addition, an adaptive DE-based ANN training algorithm (ADE-ANNT) is proposed to determine the near-optimal weight set of ANN. Extensive statistical analyses on obtained results indicate the statistical superiority of the proposed DEMS method than the established statistical models ARIMA and ETS. The architectures selected using the proposed DEMS method provide statistically superior result than the MLP + LM and DBN [27] methods. The results also indicate the statistical superiority of ADE-ANNT algorithm when compared with LM, DE-ANNT + [41], DE-CRO-HONNT [43] and TLBO-ANNT [64]. In addition to statically superior SMAPE, the ADE-ANNT algorithm has lower memory consumption than DE-ANNT + [41] and the most popular LM algorithm.
The experiments reveal that the ANN models are quite capable of handling the trend component if the near-optimal architecture for a time series is obtained and in such a case detrending by taking the first difference of the series deteriorates the forecasting accuracy. Therefore, in the proposed DEMS method, treatment of trend component is avoided. It is also observed that in TSF, the ANN models are insensitive to different normalization techniques.
The proposed DEMS method is fully automatic and can be applied in various real-world time series with least human intervention. The DEMS method is computationally expensive because of the use of hierarchical DE algorithms for simultaneous optimization of architecture and weight set of ANN. However, it supports parallel implementation to reduce computational time. In addition, the proposed DEMS method can also be applied to optimize the architecture of other ANN models like FLANN, Pi-Sigma neural network, RBF neural network, etc.
References
Makridakis, S.G.; Steven, C.; Wheelwright, R.J.H.: Forecasting Methods and Applications. Wiley, New York (2008)
de Oliveira, J.F.L.; Ludermir, T.B.: A hybrid evolutionary decomposition system for time series forecasting. Neurocomputing 180, 27–34 (2016). https://doi.org/10.1016/j.neucom.2015.07.113
Zhang, G.; Eddy Patuwo, B.; Hu, Y.M.: Forecasting with artificial neural networks: the state of the art. Int. J. Forecast. 14, 35–62 (1998). https://doi.org/10.1016/S0169-2070(97)00044-7
Enders, W.: Applied Econometric Time Series. Wiley, New York (2014)
Wei, W.W.S.: Time Series Analysis. Oxford University Press, Oxford (2013)
Zhang, G.P.: Neural networks for time-series forecasting. In: Rozenberg, G., Bäck, T., Kok, J.N. (eds.) Handbook of Natural Computing, pp. 461–477. Springer, Berlin, Heidelberg (2012). https://doi.org/10.1007/978-3-540-92910-9_14
Swanson, N.R.; White, H.: Forecasting economic time series using flexible versus fixed specification and linear versus nonlinear econometric models. Int. J. Forecast. 13, 439–461 (1997). https://doi.org/10.1016/S0169-2070(97)00030-7
Hill, T.; O’Connor, M.; Remus, W.: Neural network models for time series forecasts. Manag. Sci. 42, 1082–1092 (1996). https://doi.org/10.1287/mnsc.42.7.1082
Heravi, S.; Osborn, D.R.; Birchenhall, C.R.: Linear versus neural network forecasts for European industrial production series. Int. J. Forecast. 20, 435–446 (2004). https://doi.org/10.1016/S0169-2070(03)00062-1
Callen, J.L.; Kwan, C.C.; Yip, P.C.; Yuan, Y.: Neural network forecasting of quarterly accounting earnings. Int. J. Forecast. 12, 475–482 (1996). https://doi.org/10.1016/S0169-2070(96)00706-6
Panigrahi, S.; Behera, H.S.: Effect of normalization techniques on univariate time series forecasting using evolutionary higher order neural network. Int. J. Eng. Adv. Technol. 3, 280–285 (2013)
Min, Q.; Zhang, G.P.: Trend time-series modeling and forecasting with neural networks. IEEE Trans. Neural Netw. 19, 808–816 (2008). https://doi.org/10.1109/TNN.2007.912308
Crone, S.F.; Hibon, M.; Nikolopoulos, K.: Advances in forecasting with neural networks? Empirical evidence from the NN3 competition on time series prediction. Int. J. Forecast. 27, 635–660 (2011). https://doi.org/10.1016/j.ijforecast.2011.04.001
Panigrahi, S.; Karali, Y.; Behera, S.H.: Time series forecasting using evolutionary neural network. Int. J. Comput. Appl. 75, 13–17 (2013). https://doi.org/10.5120/13146-0553
Yang, Z.; Mourshed, M.; Liu, K.; Xu, X.; Feng, S.: A novel competitive swarm optimized RBF neural network model for short-term solar power generation forecasting. Neurocomputing. 397, 415–421 (2020). https://doi.org/10.1016/j.neucom.2019.09.110
Xiangxue, W.; Lunhui, X.; Kaixun, C.: Data-driven short-term forecasting for urban road network traffic based on data processing and LSTM-RNN. Arab. J. Sci. Eng. 44, 3043–3060 (2019). https://doi.org/10.1007/s13369-018-3390-0
Poorzaker Arabani, S.; Ebrahimpour Komleh, H.: The improvement of forecasting ATMS cash demand of Iran banking network using convolutional neural network. Arab. J. Sci. Eng. 44, 3733–3743 (2019). https://doi.org/10.1007/s13369-018-3647-7
Panigrahi, S.; Behera, H.S.: Nonlinear time series forecasting using a novel self-adaptive TLBO-MFLANN model. Int. J. Comput. Intell. Stud. 8, 4 (2019). https://doi.org/10.1504/IJCISTUDIES.2019.10019170
Minku, F.L.; Ludermir, T.B.: Clustering and co-evolution to construct neural network ensembles: an experimental study. Neural Netw. 21, 1363–1379 (2008). https://doi.org/10.1016/j.neunet.2008.02.001
Stepniewski, S.W.; Keane, A.J.: Pruning backpropagation neural networks using modern stochastic optimization techniques. Neural Comput. Appl. 5, 76–98 (1997). https://doi.org/10.1007/BF01501173
Miller, G.F.; Todd, P.M.; Hegde, S.U.: Designing Neural Networks using Genetic Algorithms. In: Proceedings of the 3rd International Conference on Genetic Algorithms, George Mason University, Fairfax, Virginia, USA, pp. 379–384 (1989)
Kitano, H.: Designing neural networks using genetic algorithms with graph generation system. Complex Syst. 4, 461–476 (1990)
Gruau, F.: Genetic synthesis of Boolean neural networks with a cell rewriting developmental process. In: [Proceedings] COGANN-92: International Workshop on Combinations of Genetic Algorithms and Neural Networks, pp. 55–74. IEEE Computer Society Press (1992)
Donate, J.P.; Li, X.; Sánchez, G.G.; de Miguel, A.S.: Time series forecasting by evolving artificial neural networks with genetic algorithms, differential evolution and estimation of distribution algorithm. Neural Comput. Appl. 22, 11–20 (2013). https://doi.org/10.1007/s00521-011-0741-0
Saboo, A., Sharma, A., Dash, T.: GASOM: Genetic Algorithm Assisted Architecture Learning in Self Organizing Maps. Presented at the (2017)
Sun, Y.; Xue, B.; Zhang, M.; Yen, G.G.; Lv, J.: Automatically designing CNN architectures using the genetic algorithm for image classification. IEEE Trans. Cybern. 10, 1–15 (2020). https://doi.org/10.1109/TCYB.2020.2983860
Kuremoto, T.; Kimura, S.; Kobayashi, K.; Obayashi, M.: Time series forecasting using a deep belief network with restricted Boltzmann machines. Neurocomputing 137, 47–56 (2014). https://doi.org/10.1016/j.neucom.2013.03.047
Ewees, A.A.; Elaziz, M.A.; Alameer, Z.; Ye, H.; Jianhua, Z.: Improving multilayer perceptron neural network using chaotic grasshopper optimization algorithm to forecast iron ore price volatility. Resour. Policy 65, 101555 (2020). https://doi.org/10.1016/j.resourpol.2019.101555
Baffes, P.T., Zelle, J.M.: Growing layers of perceptrons: introducing the Extentron algorithm. In: [Proceedings 1992] IJCNN International Joint Conference on Neural Networks. pp. 392–397. IEEE
Gutierrez, G.; Sanchis, A.; Isasi, P.; Molina, J.M.; Galvan, I.M.: Non-direct encoding method based on cellular automata to design neural network architectures. Comput. Inf. 24, 225–247 (2005)
Yao, Xin; Liu, Yong: Making use of population information in evolutionary artificial neural networks. IEEE Trans. Syst. Man Cybern. Part B 28, 417–425 (1998). https://doi.org/10.1109/3477.678637
Peralta, J., Gutierrez, G., Sanchis, A.: ADANN: Automatic design of artificial neural networks. In: Proceedings of the 2008 GECCO Conference Companion on Genetic and Evolutionary Computation—GECCO’08, pp. 1863–1870. ACM Press, New York (2008)
Peralta, J., Gutierrez, G., Sanchis, A.: Time series forecasting by evolving artificial neural networks using genetic algorithms and estimation of distribution algorithms. In: The 2010 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE (2010)
Xu, X., Li, Y.: Comparison between particle swarm optimization, differential evolution and multi-parents crossover. In: 2007 International Conference on Computational Intelligence and Security (CIS 2007), pp. 124–127. IEEE (2007)
Ash, T.: Dynamic node creation in backpropagation networks. In: International Joint Conference on Neural Networks, vol. 2, p 623. IEEE (1989)
Balkin, S.D.; Ord, J.K.: Automatic neural network modeling for univariate time series. Int. J. Forecast. 16, 509–515 (2000). https://doi.org/10.1016/S0169-2070(00)00072-8
Makridakis, S.; Spiliotis, E.; Assimakopoulos, V.: Statistical and machine learning forecasting methods: concerns and ways forward. PLoS ONE 13, e0194889 (2018). https://doi.org/10.1371/journal.pone.0194889
Choi, T.J., Cheong, Y.-G., Ahn, CW: A Performance Comparison of Crossover Variations in Differential Evolution for Training Multi-layer Perceptron Neural Networks. Presented at the (2018)
Pattanayak, R.M., Behera, H.S., Panigrahi, S.: A Novel Hybrid Differential Evolution-PSNN for Fuzzy Time Series Forecasting. Presented at the (2020)
Slowik, A., Bialko, M.: Training of artificial neural networks using differential evolution algorithm. In: 2008 Conference on Human System Interactions, pp. 60–65. IEEE (2008)
Slowik, A.: Application of an adaptive differential evolution algorithm with multiple trial vectors to artificial neural network training. IEEE Trans. Ind. Electron. 58, 3160–3167 (2011). https://doi.org/10.1109/TIE.2010.2062474
Sahu, K.K.; Panigrahi, S.; Behera, H.S.: A novel chemical reaction optimization algorithm for higher order neural network training. J. Theor. Appl. Inf. Technol. 53, 402–409 (2013)
Panigrahi, S.: A novel hybrid chemical reaction optimization algorithm with adaptive differential evolution mutation strategies for higher order neural network training. Int. Arab J. Inf. Technol. 14, 18–25 (2017)
Karali, Y.; Panigrahi, S.: Behera, HS: A novel differential evolution based algorithm for higher order neural network training. J. Theor. Appl. Inf. Technol. 56, 355–361 (2013)
Storn, R.; Price, K.: Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces. J. Glob. Optim. 11, 341–359 (1997). https://doi.org/10.1023/A:1008202821328
Awad, N.H., Ali, M.Z., Suganthan, P.N., Reynolds, R.G.: An ensemble sinusoidal parameter adaptation incorporated with L-SHADE for solving CEC2014 benchmark problems. In: 2016 IEEE Congress on Evolutionary Computation (CEC), pp. 2958–2965. IEEE (2016)
Awad, N.H., Ali, M.Z., Suganthan, P.N.: Ensemble sinusoidal differential covariance matrix adaptation with Euclidean neighborhood for solving CEC2017 benchmark problems. In: 2017 IEEE Congress on Evolutionary Computation (CEC). pp. 372–379. IEEE (2017)
Akhmedova, S., Stanovov, V., Semenkin, E.: LSHADE algorithm with a rank-based selective pressure strategy for the circular antenna array design problem. In: Proceedings of the 15th International Conference on Informatics in Control, Automation and Robotics. pp. 159–165. SCITEPRESS—Science and Technology Publications (2018)
Pant, M.B.; Zaheer, H.; Garcia-Hernandez, L.; Abraham, A.: Differential evolution: a review of more than two decades of research. Eng. Appl. Artif. Intell. 90, 103479 (2020). https://doi.org/10.1016/j.engappai.2020.103479
Das, S.; Mullick, S.S.; Suganthan, P.N.: Recent advances in differential evolution—an updated survey. Swarm Evol. Comput. 27, 1–30 (2016). https://doi.org/10.1016/j.swevo.2016.01.004
Price, K.; Storn, R.M.; Lampinen, J.A.: Differential Evolution: A practical approach to global optimization. Springer, Berlin (2005)
Price, K.V.: An introduction to differential evolution. In: Corne, D., Dorigo, M., Glover, F., Dasgupta, D., Moscato, P., Poli, R., Price, K.V. (eds.) New Ideas in Optimization, pp. 79–108 (1999)
Grippo, L.: A class of unconstrained minimization methods for neural network training. Optim. Methods Softw. 4, 135–150 (1994). https://doi.org/10.1080/10556789408805583
Jacobs, R.A.: Increased rates of convergence through learning rate adaptation. Neural Networks. 1, 295–307 (1988). https://doi.org/10.1016/0893-6080(88)90003-2
Gori, M.; Tesi, A.: On the problem of local minima in backpropagation. IEEE Trans. Pattern Anal. Mach. Intell. 14, 76–86 (1992). https://doi.org/10.1109/34.107014
Mezura-Montes, E., Velázquez-Reyes, J., Coello Coello, C.A.: A comparative study of differential evolution variants for global optimization. In: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation—GECCO’06, p. 485. ACM Press, New York (2006)
Weber, M.; Tirronen, V.; Neri, F.: Scale factor inheritance mechanism in distributed differential evolution. Soft. Comput. 14, 1187–1207 (2010). https://doi.org/10.1007/s00500-009-0510-5
Islam, S.M.; Das, S.; Ghosh, S.; Roy, S.; Suganthan, P.N.: An adaptive differential evolution algorithm with novel mutation and crossover strategies for global numerical optimization. IEEE Trans. Syst. Man Cybern. Part B 42, 482–500 (2012). https://doi.org/10.1109/TSMCB.2011.2167966
Hyndman, R.J.; Khandakar, Y.: Automatic time series forecasting: the forecast package for R. J. Stat. Softw. (2008). https://doi.org/10.18637/jss.v027.i03
Panigrahi, S.; Behera, H.S.: A hybrid ETS–ANN model for time series forecasting. Eng. Appl. Artif. Intell. 66, 49–59 (2017). https://doi.org/10.1016/j.engappai.2017.07.007
Hyndman, R.: YY: tsdl: Time Series Data Library. v0.1.0., https://pkg.yangzhuoranyang.com/tsdl/articles/tsdl.html. Accessed 2 June 2019
Hollander, M.; Wolfe, A.D.; Chicken, E.: Nonparametric Statistical Methods. Wiley, New York (2015)
Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)
Rao, R.V.; Savsani, V.J.; Vakharia, D.P.: Teaching–learning-based optimization: a novel method for constrained mechanical design optimization problems. Comput. Des. 43, 303–315 (2011). https://doi.org/10.1016/j.cad.2010.12.015
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Panigrahi, S., Behera, H.S. Time Series Forecasting Using Differential Evolution-Based ANN Modelling Scheme. Arab J Sci Eng 45, 11129–11146 (2020). https://doi.org/10.1007/s13369-020-05004-5
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s13369-020-05004-5