1 Introduction

The future of most phenomena is a consequence of their past. Therefore, future values of a phenomenon can be extrapolated by systematically analysing its past values. Such a process of predicting the future values of a phenomenon by analysing past observations is known as time series forecasting (TSF). Accurate TSF plays a vital role in almost every area of study, including finance, economics, engineering and management science. Conventionally, statistical models such as exponential smoothing and the autoregressive integrated moving average (ARIMA) have been widely used in TSF [1]. These models assume a linear correlation structure in the time series data and often fail to provide satisfactory results when the series possesses nonlinear patterns. To add to this woe, most real-world time series are nonlinear and often suffer from temporal as well as spatial variability. Hence, traditional statistical models cannot handle nonlinear patterns as well as they handle linear ones, since linear models cannot capture nonlinear patterns and vice versa [2]. Therefore, nonlinear statistical models such as the autoregressive conditional heteroskedastic models (ARCH and GARCH family) have been applied in TSF to capture nonlinear patterns. However, the existence of many variations of these models [3, 4] imposes a burden in choosing a suitable model and thus limits their application to generalized forecasting problems.

In the past two decades, ANN models have gained popularity in forecasting theory due to several advantageous features. First, unlike traditional models, ANNs are data-driven: instead of fitting data to a pre-defined model, an ANN uses the data themselves to determine the parameters of the model. Also, ANNs require no a priori assumption about the model or about the characteristics of the data to be modelled. Second, ANNs are universal approximators, i.e. they can model any nonlinear or linear function to any desired level of accuracy [5]. Finally, ANNs possess nonlinear modelling ability and can learn and generalize any function [6]. Considering these advantages, thousands of research papers using different ANN models have been published to solve various problems in TSF. Despite more than two decades of research on ANNs for TSF, and although some studies [7, 8] indicate that ANN models outperform statistical models, other studies [9, 10] report inferior performance on certain datasets. This inconsistent performance of ANN models across studies is attributable to several factors, such as heuristic and ad hoc modelling processes [3], the pre-processing (e.g. detrending and normalization) of time series data [11,12,13] and the effectiveness of the training algorithms used [3]. If these factors were suitably optimized, ANNs could be established as a reliable tool for TSF.

ANN-based methods proposed to forecast time series have followed one of two approaches to ANN model design. In the first approach [14,15,16,17,18], the architecture of the ANN (number of inputs, number of hidden layers, number of neurons in each hidden layer) is pre-specified, and the remaining parameters (weights, learning rate, etc.) are computed in the later part of the modelling process. In this approach, neither optimal nor acceptable results can be guaranteed. In contrast, the second approach [19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36] determines the architecture and other parameters automatically using techniques such as pruning algorithms [19, 20], growing algorithms [29], cellular automata [30] and evolutionary algorithms [21,22,23,24,25,26,27,28, 31,32,33,34,35,36]. The application of evolutionary algorithms to construct the ANN architecture dominates the literature and is mostly based upon the genetic algorithm (GA) [21, 25, 26, 32], estimation of distribution algorithm (EDA) [24, 33], particle swarm optimization (PSO) [27], grasshopper optimization algorithm (GOA) [28] and differential evolution (DE) [24, 33]. These evolutionary algorithms use either a direct encoding scheme (DES) [24,25,26,27,28, 31,32,33,34,35,36] or an indirect encoding scheme (IES) [21,22,23]. In DES, the chromosomes directly contain information about the learning parameters, the architecture and the parameters of the topology. In contrast, in IES, the chromosomes contain the information needed by a constructive method to delineate the ANN architecture. Although no comparative study of DES and IES has been reported, DES has been used predominantly due to its simplicity of implementation.

In this paper, an adaptive modelling scheme based on DES for time series forecasting is developed. For a robust evaluation of the proposed modelling scheme, the choice of the ANN model plays a vital role. To address this issue, Makridakis et al. [37] conducted a comparative study of different machine learning (ML) models for TSF and suggested that the multilayer perceptron (MLP) provides better results than other ML models, including long short-term memory (LSTM). Therefore, in this paper, a single hidden layer feedforward MLP is considered as the forecasting model. The number of neurons in the output layer is fixed to one, whereas the number of inputs and hidden neurons is automatically determined using an adaptive evolutionary algorithm. The DES is used to encode the architecture of the ANN into chromosomes, and the fitness of each chromosome is obtained after training with the proposed training algorithm (ADE-ANNT). The major contributions of the paper are as follows:

(i) An adaptive DE-based modelling scheme (DEMS) has been developed to determine the near-optimal architecture of ANN for a time series under study.

(ii) An adaptive DE-based ANN training algorithm (ADE-ANNT) has been proposed to determine the near-optimal weights of ANN.

(iii) To make the DEMS consistently effective, several comparisons are made between different alternatives in pre-processing techniques, such as the treatment of the trend component and normalization techniques.

The rest of this paper is organized as follows. Section 2 provides an overview of the ANN and DE algorithms. The proposed ADE-ANNT training algorithm is explained in Sect. 3. Section 4 presents a detailed description of the proposed methodology. In Sect. 5, the experimental setup is described, and the simulation results are analysed. Finally, the conclusions are drawn in Sect. 6.

2 Preliminaries

2.1 Artificial Neural Network (ANN)

ANNs are data-driven, self-adaptive, flexible models that can approximate an enormous class of nonlinear functions to any preferred degree of precision. Thus, different ANN models (MLP [14], radial basis function (RBF) neural network [15], LSTM [16], convolutional neural network (CNN) [17], functional link ANN (FLANN) [18], deep belief network [27], etc.) have been applied in a variety of TSF applications. Among them, the feedforward MLP with a single hidden layer is the most popular in TSF [3, 37] because of its complex nonlinear mapping capability and universal approximation capability [3]. The architecture (as in Fig. 1) of such an MLP consists of an input, a hidden and an output layer. The neurons of adjacent layers are fully connected by directed acyclic links. In TSF, the number of inputs and hidden neurons is flexible, whereas the number of output neurons is usually kept to one. Solving the TSF problem using an MLP can be considered as identifying the underlying relationship (as in Eq. 1) between the output \( y_{t} \) and the past k values of the series \( y = [y_{1} ,y_{2} , \ldots ,y_{n} ] \). Therefore, the time series data are first transformed into n − k patterns, with each pattern having k inputs \( y_{t - 1} ,y_{t - 2} , \ldots ,y_{t - k} \) and one target \( y_{t} \). Generally, the patterns are partitioned into train, validation and test sets. Training algorithms such as evolutionary training algorithms [18, 38,39,40,41,42,43,44] and the Levenberg–Marquardt (LM) algorithm use the train and validation sets to determine the weights and other parameters of the ANN. Then, the values on the test set are forecast and used to evaluate the performance of the model.

$$ y_{t} = f\left( {\alpha_{0} + \mathop \sum \limits_{j = 1}^{q} \alpha_{j} g\left( {\beta_{0j} + \mathop \sum \limits_{i = 1}^{k} \beta_{ij} y_{t - i} } \right)} \right) $$
(1)

where \( \beta_{ij} \) are the weights between the input and hidden layers and \( \alpha_{j} \) are the weights between the hidden and output layers; \( \beta_{0j} \) and \( \alpha_{0} \) are the biases of the hidden and output neurons, respectively. The binary sigmoid activation function (as in Eq. 2) is widely used in the neurons of the hidden layer, whereas the linear activation function is used in the neuron of the output layer.

$$ sigmoid\left( x \right) = \frac{1}{{1 + e^{ - x} }} $$
(2)
Fig. 1 Architecture of three-layer ANN
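To make the lagged-pattern construction and the mapping of Eq. 1 concrete, a minimal Python/NumPy sketch follows; the function names (`make_patterns`, `mlp_forward`) and the toy sine series are illustrative, not part of the original method.

```python
import numpy as np

def make_patterns(y, k):
    """Transform a series of length n into n - k (input, target) patterns:
    each input holds the k lagged values y[t-k..t-1] and the target is y[t]."""
    X = np.array([y[t - k:t] for t in range(k, len(y))])
    T = np.array([y[t] for t in range(k, len(y))])
    return X, T

def mlp_forward(x, alpha0, alpha, beta0, beta):
    """Single-hidden-layer MLP of Eq. 1: sigmoid hidden layer (Eq. 2), linear output."""
    h = 1.0 / (1.0 + np.exp(-(beta0 + beta @ x)))  # q hidden activations
    return alpha0 + alpha @ h                      # linear output neuron

# toy usage: k = 4 lagged inputs, q = 3 hidden neurons
y = np.sin(np.linspace(0, 10, 50))
X, T = make_patterns(y, k=4)
rng = np.random.default_rng(0)
alpha0 = rng.normal()
alpha = rng.normal(size=3)
beta0 = rng.normal(size=3)
beta = rng.normal(size=(3, 4))
print(mlp_forward(X[0], alpha0, alpha, beta0, beta), "target:", T[0])
```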

2.2 Differential Evolution

In 1997, Storn and Price [45] proposed a simple yet efficient evolutionary algorithm called differential evolution (DE). Compared to other evolutionary algorithms, DE is straightforward and easy to implement. Even though PSO and its variants are also easy to code, DE variants have outperformed PSO variants and other evolutionary algorithms [46,47,48]. A recent review paper [49] summarizes the superior performance of DE over other evolutionary algorithms in the CEC competitions organized from 2005 to 2018. DE is a stochastic search method that starts with a set of chromosomes (decision vectors), each consisting of a set of genes (decision variables). The chromosomes of a generation undergo mutation, crossover and greedy selection operations repeatedly to move the population towards the next generations until a near-optimal solution is achieved.

All DE algorithm variants operate in the following steps:

1. Initialize the problem parameters, the algorithm parameters and the initial population (consisting of a set of chromosomes).

2. Calculate the fitness of each chromosome/individual.

3. For each chromosome (target vector) of the current generation, apply steps 3.1 to 3.3 to generate the chromosomes of the next generation.

   3.1 Apply a mutation operator (any one of Eqs. 3 to 7) to generate the mutant (donor) vector.

   3.2 Generate the trial vector by applying crossover (either Eq. 8 or Eq. 9) between the target vector and the mutant vector.

   3.3 Perform selection (Eq. 10) between the target vector and the trial vector.

4. If the termination criteria are satisfied, go to step 5; otherwise, go to step 3.

5. Use the fittest individual of the final generation as the solution to the problem.

Let the population of the gth generation be \( p_{g} = \left( {C_{g}^{1} ,C_{g}^{2} , \ldots C_{g}^{i} \ldots C_{g}^{PopSize} } \right) \), with \( C_{g}^{i} = \left( {W_{g}^{i1} ,W_{g}^{i2} , \ldots W_{g}^{ij} \ldots W_{g}^{iD} } \right) \) for i = 1, 2, …, PopSize, where D is the length of each chromosome and \( W_{g}^{ij} \) is the jth gene of the ith individual in the gth generation. The mutation for the ith chromosome produces a mutant vector \( M_{g}^{i} = \left( {M_{g}^{i1} ,M_{g}^{i2} , \ldots M_{g}^{ij} \ldots M_{g}^{iD} } \right) \), which is used in the crossover with the current chromosome \( C_{g}^{i} \) to produce a trial vector \( T_{g}^{i} = \left( {T_{g}^{i1} ,T_{g}^{i2} , \ldots T_{g}^{ij} \ldots T_{g}^{iD} } \right) \).

Since its inception, DE has been modified and upgraded extensively [49, 50]. The variants of the DE algorithm mostly differ in the mutation and crossover schemes used. Storn and Price [51, 52] suggested different mutation schemes (as in Eqs. 3 to 7) to generate the mutant vector. Two crossover methods, binomial (as in Eq. 8) and exponential (as in Eq. 9), can be applied to any of the mentioned mutation strategies to generate the trial vector. Thus, a total of \( 2 \times 5 = 10 \) different DE strategies can be used to generate a trial vector. Then, the selection operation (as in Eq. 10) is applied to select the better chromosome between the trial and target vectors for the next generation.

$$ {\mathbf{DE}}/{\mathbf{best}}/1: M_{g}^{i} = C_{g}^{best} + F \times \left( {C_{g}^{r1} - C_{g}^{r2} } \right) $$
(3)
$$ {\mathbf{DE}}/{\mathbf{rand}}/1: M_{g}^{i} = C_{g}^{r1} + F \times \left( {C_{g}^{r2} - C_{g}^{r3} } \right) $$
(4)
$$ {\mathbf{DE}}/{\mathbf{rand}}/2: M_{g}^{i} = C_{g}^{r1} + F \times \left( {C_{g}^{r2} - C_{g}^{r3} } \right) + F \times \left( {C_{g}^{r4} - C_{g}^{r5} } \right) $$
(5)
$$ {\mathbf{DE}}/{\mathbf{best}}/2: M_{g}^{i} = C_{g}^{best} + F \times \left( {C_{g}^{r1} - C_{g}^{r2} } \right) + F \times \left( {C_{g}^{r3} - C_{g}^{r4} } \right) $$
(6)
$$ {\mathbf{DE}}/{\mathbf{target}} - {\mathbf{to}} - {\mathbf{best}}/1: M_{g}^{i} = C_{g}^{i} + F \times \left( {C_{g}^{best} - C_{g}^{r1} } \right) + F \times \left( {C_{g}^{best} - C_{g}^{r2} } \right) $$
(7)
$$ \varvec{T}_{\varvec{g}}^{{\varvec{ik}}} = \left\{ {\begin{array}{*{20}l} {\varvec{M}_{\varvec{g}}^{{\varvec{ik }}} } \hfill & {if rand\left( {0,1} \right) < Cr} \hfill \\ {\varvec{W}_{\varvec{g}}^{{\varvec{ik}}} } \hfill & {otherwise} \hfill \\ \end{array} } \right.{\text{for }}\;k = 1,2, \ldots {\text{D}} $$
(8)
$$ \varvec{T}_{\varvec{g}}^{{\varvec{ik}}} = \left\{ {\begin{array}{*{20}l} {\varvec{M}_{\varvec{g}}^{{\varvec{ik }}} } \hfill & {\varvec{l} \le k \le \left( {l + t} \right)} \hfill \\ {\varvec{W}_{\varvec{g}}^{{\varvec{ik}}} } \hfill & {otherwise} \hfill \\ \end{array} } \right.{\text{for }}\;k = 1,2, \ldots {\text{D}} $$
(9)
$$ \varvec{C}_{{\varvec{g} + 1}}^{\varvec{i}} = \left\{ {\begin{array}{*{20}l} {\varvec{T}_{\varvec{g}}^{{\varvec{i }}} } \hfill & {\varvec{Fitness}\left( {\varvec{T}_{\varvec{g}}^{{\varvec{i }}} } \right)\varvec{ }\;\varvec{is}\;\varvec{ better}\;\varvec{ than }\;\varvec{Fitness}\left( {\varvec{C}_{\varvec{g}}^{{\varvec{i }}} } \right)} \hfill \\ {\varvec{C}_{\varvec{g}}^{\varvec{i}} } \hfill & {\varvec{otherwise}} \hfill \\ \end{array} } \right. $$
(10)

The notations used above are of the form DE/a/b, where DE means differential evolution, a indicates the choice of base vector used in mutation (a randomly chosen vector, the target vector or the best vector) and b indicates the number of difference vectors applied in mutation; \( M_{g}^{i} \) stands for the ith mutant vector in the gth generation; \( C_{g}^{best} \) for the best vector of the population in the gth generation; \( C_{g}^{r} \) for a randomly chosen chromosome from the population of generation g; \( C_{g}^{i} \) is the target vector; Cr is the crossover probability; F is the scale factor; \( T_{g}^{ik} \) is the kth gene of the ith chromosome in the gth generation; rand(0,1) represents a random number between 0 and 1; and the values l and t in Eq. 9 are chosen based on the crossover probability such that \( 1 \le l \le \left( {l + t} \right) \le D \), where \( D \) is the dimension of the chromosome. Note that all the chromosomes selected for any mutation scheme must be distinct from each other.
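As an illustration of the operators above, the following Python sketch performs one generation of DE/best/1/bin (Eqs. 3, 8 and 10) for a generic minimization problem; the fixed values of F and Cr and the sphere test function are illustrative only.

```python
import numpy as np

def de_best_1_bin_generation(pop, fitness_fn, F=0.5, Cr=0.9, rng=None):
    """One generation of DE/best/1/bin (Eqs. 3, 8 and 10) for a minimization problem."""
    if rng is None:
        rng = np.random.default_rng()
    pop_size, dim = pop.shape
    fitness = np.array([fitness_fn(c) for c in pop])
    best = pop[np.argmin(fitness)]
    new_pop = pop.copy()
    for i in range(pop_size):
        # two distinct random indices, both different from the target index i
        r1, r2 = rng.choice([j for j in range(pop_size) if j != i], size=2, replace=False)
        mutant = best + F * (pop[r1] - pop[r2])        # Eq. 3 (DE/best/1 mutation)
        mask = rng.random(dim) < Cr                    # Eq. 8 (binomial crossover)
        trial = np.where(mask, mutant, pop[i])
        if fitness_fn(trial) < fitness[i]:             # Eq. 10 (greedy selection)
            new_pop[i] = trial
    return new_pop

# toy usage: minimize the sphere function
pop = np.random.default_rng(1).uniform(-5, 5, size=(10, 4))
for _ in range(200):
    pop = de_best_1_bin_generation(pop, lambda c: float(np.sum(c ** 2)))
print(pop[np.argmin([np.sum(c ** 2) for c in pop])])   # should be close to the zero vector
```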

3 Proposed ANN Training Algorithm (ADE-ANNT)

In the proposed method, a large number of different ANN architectures need to be trained to obtain the best architecture for TSF. This demands a robust training algorithm that reaches a near-optimal solution quickly. Conventionally, gradient-based algorithms are widely used for training ANNs. However, gradient-based methods have several shortcomings, such as unreliability [53], slow convergence [54], a high chance of becoming trapped in local optima [55] and sensitivity to the initial values of the algorithm parameters [54]. Therefore, several evolutionary ANN training algorithms [18, 38,39,40,41,42,43,44] have recently been developed. Among them, applications of DE to ANN training [38,39,40,41, 44] dominate the literature. Some authors [38,39,40, 44] have used traditional DE variants for ANN training, whereas Slowik [41] proposed the DE-ANNT+ training algorithm with the concept of multiple trial vectors. In the DE-ANNT+ algorithm, multiple mutant vectors are generated for each target vector. The mutant vector having the best fitness is used in crossover to generate a trial vector. Then, the selection operation is carried out between the trial vector and the target vector to generate the chromosomes of the next generation. The control parameters F and Cr are obtained based on the rate of change of the best solution (Eq. 11), where TheBest represents the fitness of the best chromosome. In Slowik [41], the population is said to be stagnated when the value of R (Eq. 11) does not change by a certain percentage. DE-ANNT+ has shown better performance than extended backpropagation (EBP), EA-ANNT and DE-ANNT, and its results are comparable to those of the Levenberg–Marquardt algorithm. However, the major drawback of DE-ANNT+ is the computational overhead incurred by generating multiple mutant vectors for each individual in every generation. Therefore, DE-ANNT+ is not well suited to the proposed self-adaptive modelling scheme.

$$ Rate\; of\;change\;of\;best\;solution \left( R \right) = TheBest_{i} \div TheBest_{i - 1} $$
(11)

To develop a robust DE-based training algorithm, the choice of DE strategy plays a vital role. To address this issue, Mezura-Montes et al. [56] made a comparative study of eight different DE schemes on 13 benchmark problems and suggested that the DE/best/1/bin scheme provides robust results irrespective of the problem at hand. However, the DE/best/1 scheme has a tendency towards premature convergence, which can be effectively avoided by generating multiple trial vectors upon stagnation. Therefore, by integrating the DE/best/1/bin strategy with the concept of multiple trial vectors upon stagnation, an adaptive differential evolution with conditional multiple trial vectors for ANN training (ADE-ANNT) is proposed. In addition, the values of the control parameters F and Cr are determined self-adaptively based on the problem at hand.

Algorithm 1 presents the pseudocode of ADE-ANNT. In this algorithm, the trainable weight set of an ANN is encoded into a chromosome as a list of real numbers (as in Fig. 2), with each gene representing one weight of the ANN. Hence, the length of each chromosome for an ANN with k inputs, q hidden neurons and one output neuron with bias components is \( q \times \left( {k + 2} \right) + 1 \). The first generation of the population consists of PopSize randomly generated chromosomes, each denoting a weight set of the ANN. The fitness of each chromosome (weight set) is obtained using Eq. 12; note that a chromosome with a lower root mean square error (RMSE, as in Eq. 13) is considered fitter. Subsequent generations of the population are repeatedly generated until the termination criteria are satisfied. The next generation is generated by applying mutation, crossover and greedy selection operators to each chromosome of the current generation. The DE/best/1 mutation scheme is applied to generate the mutant vector(s) using the proposed parameter adaptation scheme. The number of mutant vectors depends on the stagnation criterion, i.e. upon stagnation of the population, multiple mutant vectors are generated by using multiple scale factors. Unlike DE-ANNT+ [41], all the mutant vectors are crossed over with the target vector to generate multiple trial vectors (v > 1), and the best trial vector is then used for selection. If stagnation does not occur, one scale factor is used to generate one mutant vector, which consequently generates one trial vector. Moreover, motivated by [57, 58], the scale factor and crossover probability are determined self-adaptively and dynamically according to the problem at hand.

$$ {\text{Fitness}} = - 1 \times {\text{RMSE}} $$
(12)
$$ {\text{RMSE}} = \sqrt {\frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \left( {Y_{i} - T_{i} } \right)^{2} } $$
(13)

where n represents the number of observations, and Ti and Yi are the computed and actual values of the ith observation, respectively.

Fig. 2 Encoding weights of ANN on a chromosome

Algorithm 1 Pseudocode of the proposed ADE-ANNT training algorithm (figure)
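Since Algorithm 1 is available only as a figure, the sketch below gives one possible Python reading of its core generation step as described in the text: DE/best/1 mutation, binomial crossover, and multiple trial vectors generated only upon stagnation. The helper callables (`generate_F`, `generate_Cr`, `ann_eval_rmse`) and the in-place population update are assumptions of this sketch, not the authors' exact implementation.

```python
import numpy as np

def fitness(weights, ann_eval_rmse):
    """Fitness of a weight chromosome (Eq. 12): negative RMSE of the ANN on the training patterns."""
    return -ann_eval_rmse(weights)

def ade_annt_generation(pop, fit, stagnated, generate_F, generate_Cr, ann_eval_rmse, rng=None):
    """One generation of the ADE-ANNT loop (simplified sketch of Algorithm 1).

    pop : (PopSize, D) array of weight chromosomes; fit : (PopSize,) array of their fitness values.
    generate_F(stagnated) returns one or several scale factors; generate_Cr() returns a crossover probability.
    """
    if rng is None:
        rng = np.random.default_rng()
    pop_size, dim = pop.shape
    best = pop[np.argmax(fit)].copy()                     # DE/best/1 uses the current best chromosome
    for i in range(pop_size):
        Fs = generate_F(stagnated)                        # several scale factors only upon stagnation
        Cr = generate_Cr()
        r1, r2 = rng.choice([j for j in range(pop_size) if j != i], size=2, replace=False)
        trials = []
        for F in Fs:
            mutant = best + F * (pop[r1] - pop[r2])       # DE/best/1 mutation (Eq. 3)
            mask = rng.random(dim) < Cr                   # binomial crossover (Eq. 8)
            trials.append(np.where(mask, mutant, pop[i]))
        trial_fits = [fitness(t, ann_eval_rmse) for t in trials]
        best_idx = int(np.argmax(trial_fits))             # keep only the best of the trial vectors
        if trial_fits[best_idx] > fit[i]:                 # greedy selection (Eq. 10)
            pop[i], fit[i] = trials[best_idx], trial_fits[best_idx]
    return pop, fit
```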

3.1 Parameter Adaptation in ADE-ANNT

Algorithm 2 presents the pseudocode of the parameter adaptation used in the proposed ADE-ANNT training algorithm. In every generation, a scale factor Fi and a crossover probability Cri are generated for each target vector. When stagnation occurs, more than one trial vector is generated: v (> 1) scale factors are used to generate v mutant vectors, which after crossover generate v trial vectors. If stagnation does not occur, one scale factor is used to generate one mutant vector, which consequently generates one trial vector.

Algorithm 2 Pseudocode of the parameter adaptation in ADE-ANNT (figure)

where rand(1, v) stands for v randomly generated numbers from a uniform distribution within the range (0, 1), Cauchyrnd(Fm, 0.1) is a randomly generated number from a Cauchy distribution having scale parameter 0.1 and location parameter Fm, and Gaussianrnd(Crm, 0.1) is a randomly generated number from a Gaussian distribution with standard deviation 0.1 and mean Crm.

Initially, the location parameter Fm is set to 0.5 and is then updated after each generation (using Eq. 14).

$$ W_{f} = 0.7 + 0.3 \times {\text{rand}}\left( {0,1} \right) $$
(14)
$$ F_{m} = W_{f} \times F_{m} + \left( {1 - W_{f} } \right) \times mean\left( {F_{success} } \right) $$

where \( W_{f} \) is the weight factor, which varies between 0.7 and 1, and \( F_{success} \) is the set of successful scale factors that generated better trial vectors for the next generation.

The mean of the Gaussian distribution Crm is initially set to 0.5 and is then updated after each generation (using Eq. 15).

$$ W_{cr} = 0.7 + 0.3 \times {\text{rand}}\left( {0,1} \right) $$
(15)
$$ Cr_{m} = W_{cr} \times Cr_{m} + \left( {1 - W_{cr} } \right) \times {\text{mean}}\left( {Cr_{success} } \right) $$

where \( W_{Cr} \) is the weight factor, which varies between 0.7 and 1, and \( Cr_{success} \) is the set of successful crossover probabilities that generated better trial vectors in the current generation.
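A minimal sketch of the parameter sampling and adaptation described by Algorithm 2 and Eqs. 14 and 15 follows; the clipping of the sampled F and Cr values and the use of a tangent transform to draw Cauchy samples are implementation assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng()

def sample_F(Fm, stagnated, v=5):
    """Scale factor(s): v uniform samples in (0, 2) upon stagnation,
    otherwise a single Cauchy(Fm, 0.1) sample (clipped to a sensible range, an assumption)."""
    if stagnated:
        return 2.0 * rng.random(v)
    F = Fm + 0.1 * np.tan(np.pi * (rng.random() - 0.5))   # Cauchy sample via inverse transform
    return np.array([np.clip(F, 0.0, 2.0)])

def sample_Cr(Crm):
    """Crossover probability from a Gaussian with mean Crm and std 0.1, clipped to [0, 1]."""
    return float(np.clip(rng.normal(Crm, 0.1), 0.0, 1.0))

def update_Fm(Fm, F_success):
    """Eq. 14: weighted update of the Cauchy location parameter with the successful scale factors."""
    if len(F_success) == 0:
        return Fm
    Wf = 0.7 + 0.3 * rng.random()
    return Wf * Fm + (1 - Wf) * np.mean(F_success)

def update_Crm(Crm, Cr_success):
    """Eq. 15: weighted update of the Gaussian mean with the successful crossover probabilities."""
    if len(Cr_success) == 0:
        return Crm
    Wcr = 0.7 + 0.3 * rng.random()
    return Wcr * Crm + (1 - Wcr) * np.mean(Cr_success)
```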

3.2 Explanation of Parameter Adaptation

Upon stagnation, multiple scale factors are used to generate multiple mutant vectors. The multiple mutant vectors are crossed over with the target vector to generate multiple trial vectors. The generation of multiple trial vectors around the existing solution assists in getting out of stagnation. This is because (1) if the difference vector \( C_{g}^{r1} - C_{g}^{r2} \) becomes small, i.e. when the vectors converge to a small domain, larger values of the scale factor assist in escaping the suboptimal valleys/peaks causing stagnation; and (2) if the difference vector \( C_{g}^{r1} - C_{g}^{r2} \) is large, the population may overshoot the global minimum due to the larger difference vectors, but the use of scale factors with smaller values assists in overcoming this problem. Thus, the generation of multiple scale factors from a uniform distribution over (0, 2) helps in performing both local and global search.

When stagnation does not occur, only one scale factor is randomly generated from a Cauchy distribution with scale parameter 0.1 and location parameter Fm. The Cauchy distribution varies the value of F around Fm more widely than a conventional normal distribution would. Moreover, during the adjustment of Fm, the use of the arithmetic mean guides Fm towards higher values, which produces larger perturbations of the target vectors; hence, premature convergence to local minima is avoided. Irrespective of the stagnation criterion, Fsuccess records the successful scale factors that generated trial vectors better than their target vectors in the current generation. Thus, it increases the chance of generating better trial vectors as the generations proceed.

The crossover probability is generated independently of the stagnation criterion, i.e. in both cases, it is randomly generated from a Gaussian distribution with standard deviation 0.1 and mean Crm. The adaptation of Crm is likewise based on memorizing the successful crossover probabilities that generated better trial vectors. The use of the arithmetic mean while adjusting the value of Crm avoids biasing Cr towards smaller values [41]. The Gaussian distribution is chosen here because it increases the chance that the majority of the generated Cr values lie within the unit interval [41].

4 Proposed Methodology

This section presents the proposed DE-based modelling scheme (DEMS), which automatically determines the most parsimonious ANN architecture for a time series under study. The DEMS method (as in Fig. 3) follows the wrapper approach of model selection and uses the proposed ADE-ANNT training algorithm to determine the near-optimal weight set and the fitness of candidate ANN architectures. Algorithm 3 presents the proposed DEMS method, which operates in four steps, namely (a) Initialization, (b) Pre-processing, (c) ANN modelling and (d) Forecast.

Fig. 3 Graphical abstract of the proposed method

Algorithm 3 Steps of the proposed DEMS method (figure)

In the initialization step, the problem parameters, such as the lengths of the train (\( l_{tr} \)), validation (\( l_{v} \)) and test (\( l_{te} \)) sets, are initialized. The number of inputs of the ANN is restricted to the range 1 to \( k_{max} \) (computed using Eq. 16), and the number of hidden-layer neurons is restricted to the range 1 to \( q_{max} \) (computed using Eq. 17).

$$ k_{max} = 10 \times \log_{10} \left( {l_{tr} + l_{v} } \right) $$
(16)
$$ q_{max} = 2 \times k_{max} $$
(17)

In the pre-processing step, the time series is normalized to the range (0, 1). To assess the sensitivity to different normalization techniques, the min–max (as in Eq. 18), decimal scaling (as in Eq. 19) and vector (as in Eq. 20) normalization techniques are considered. The effect of treating the trend by taking the first difference of the series is also tested, but differencing is ultimately avoided due to the deterioration in performance it causes (see Sect. 5). Let \( y = \left[ {y_{1} ,y_{2} , \ldots ,y_{n} } \right]^{T} \) be the n-dimensional time series vector and \( y^{\prime} = \left[ {y_{1}^{'} , y_{2}^{'} , \ldots ,y_{n}^{'} } \right]^{T} \) be the normalized series vector, where \( y_{i}^{'} \) is the normalized value of \( y_{i} \).

$$ y_{i}^{\prime } = \frac{{y_{i} - \min_{y} }}{{\max_{y} - \min_{y} }} $$
(18)
$$ y_{i}^{\prime } = \frac{{y_{i} }}{{10^{{\log_{10} \left( {\max_{y} } \right)}} }} $$
(19)
$$ y_{i}^{\prime } = \frac{{y_{i} }}{{\sqrt {\mathop \sum \nolimits_{j = 1}^{n} y_{j}^{2} } }} $$
(20)
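The three normalization techniques of Eqs. 18–20 are straightforward to implement; a minimal sketch follows, together with the min–max inverse used later to de-normalize the forecasts.

```python
import numpy as np

def min_max(y):                                   # Eq. 18
    return (y - y.min()) / (y.max() - y.min())

def min_max_inverse(y_norm, y_min, y_max):        # de-normalization used in the Forecast step
    return y_norm * (y_max - y_min) + y_min

def decimal_scaling(y):                           # Eq. 19
    return y / (10 ** np.log10(y.max()))

def vector_norm(y):                               # Eq. 20
    return y / np.sqrt(np.sum(y ** 2))

y = np.array([120.0, 135.0, 148.0, 163.0, 172.0])
print(min_max(y), decimal_scaling(y), vector_norm(y), sep="\n")
```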

In the ANN modelling step, the proposed adaptive DE algorithm is used to search for the most parsimonious ANN architecture in the space of all possible architecture combinations. The DES is used to encode the ANN architectures into chromosomes. Initially, \( P_{size} \) chromosomes are randomly initialized with gene values between 0 and 2.5, with each chromosome \( C_{j} = \left\{ {g_{1} ,g_{2} ,g_{3} ,g_{4} ,g_{5} ,g_{6} ,g_{7} ,g_{8} } \right\} \) consisting of eight genes (four encode the number of inputs and four encode the number of hidden neurons). Then, the fitness (i.e. the RMSE on the training set) of each chromosome is calculated. To calculate the fitness, the genotypes are first converted to phenotypes (i.e. ANN architectures): the number of inputs (k) is calculated using Eq. 21, and the number of neurons in the hidden layer (q) is computed using Eq. 22.

$$ k = \frac{{k_{max} \times \left( {g_{1} + g_{2} + g_{3} + g_{4} } \right)}}{10} $$
(21)
$$ q = \frac{{q_{max} \times \left( {g_{5} + g_{6} + g_{7} + g_{8} } \right)}}{10} $$
(22)
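A minimal sketch of the genotype-to-phenotype decoding of Eqs. 16, 17, 21 and 22 follows; rounding to the nearest integer and enforcing a lower bound of one input and one hidden neuron are assumptions, since the paper restricts k and q to at least 1.

```python
import numpy as np

def decode_architecture(chromosome, l_tr, l_v):
    """Map an 8-gene chromosome (genes in (0, 2.5)) to a k-q-1 MLP architecture."""
    k_max = 10 * np.log10(l_tr + l_v)              # Eq. 16
    q_max = 2 * k_max                              # Eq. 17
    k = k_max * np.sum(chromosome[:4]) / 10        # Eq. 21: first four genes -> number of inputs
    q = q_max * np.sum(chromosome[4:]) / 10        # Eq. 22: last four genes -> hidden neurons
    # assumed: round and keep at least one input / one hidden neuron
    return max(1, int(round(k))), max(1, int(round(q)))

genes = np.random.default_rng(2).uniform(0, 2.5, size=8)
k, q = decode_architecture(genes, l_tr=100, l_v=20)
print(k, q)   # a k-q-1 architecture
```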

Once the number of inputs of the ANN is determined, the normalized time series is converted into \( n - k \) patterns, with each pattern consisting of k inputs and one target. After the \( k - q - 1 \) ANN architecture and the patterns are determined, the fitness of the ANN architecture is obtained using the proposed ADE-ANNT training algorithm. Then, the DE mutation, crossover and selection operators are repeatedly applied until the termination criterion is satisfied, i.e. until the error on the validation set increases. When the termination criterion is satisfied, the architecture having the best fitness is chosen as the near-optimal ANN architecture for the time series under study.

In the Forecast step, the values for the test set are predicted using the obtained k-q-1 architecture and near-optimal weight set. The predicted values are de-normalized to obtain the actual forecasts, and then forecast accuracy is measured.

5 Experimental Setup and Results

In this section, the simulation results, along with the analysis and discussion, are presented. ARIMA and ETS models can be viewed as linear counterparts of the most popular neural network, the MLP, and are hence considered excellent benchmarks for comparison with ML models [37]. Therefore, in this study, ARIMA and ETS are considered as comparative models. The appropriate ARIMA and ETS model for each time series is determined using the forecast package of R [59]. In addition, to evaluate the effectiveness of the proposed DEMS architecture selection method, another architecture selection method using DBN [27] is considered. Additionally, the MLP model trained using the Levenberg–Marquardt (LM) algorithm (MLP + LM) is used for comparison. The MLP + LM method is implemented using the neural network toolbox of MATLAB. In the MLP + LM method, the number of outputs is fixed to one, whereas the number of inputs is determined by analysing the autocorrelation and partial autocorrelation functions of the time series. To avoid inconsistencies arising from an improper ANN architecture and to make a fair comparison with the proposed DEMS method, the number of hidden neurons in the MLP + LM method is determined by varying the neurons between 1 and 20 (as in [2, 60]), and the architecture having the smallest mean SMAPE over 50 independent simulations is used. In all the MLP models, the binary sigmoid activation function is used in the neurons of the hidden layer, and linear activation is used in the neuron of the output layer.

Twenty-one time series datasets (described in Table 1) from the Time Series Data Library [61] are considered to evaluate the methods using the root mean square error (RMSE, as in Eq. 13) and the symmetric mean absolute percentage error (SMAPE, as in Eq. 23). SMAPE is a scale-free measure that can be used to evaluate forecasting methods across one or more time series datasets and was hence used in the NN3 forecasting competition to assess forecasting methods [13]. Therefore, in this paper, the SMAPE measure is used to evaluate the forecasting methods. Additionally, since RMSE is used to assess fitness in the proposed ADE-ANNT training algorithm, the RMSE measure is also used to evaluate the forecasting methods. In the proposed DEMS method, the population size \( P_{size} \) of DE is fixed to 10, and the number of genes in each chromosome is fixed to 8. The genes are initialized and bounded in the range (0, 2.5). To make a fair comparison of the proposed DEMS method with DBN [27], the number of particles in PSO is set to 10, early stopping is used in the backpropagation algorithm, and the other experimental parameters are kept the same as those of DBN [27]. The parameters used in the proposed ADE-ANNT training algorithm are presented in Table 2. Table 3 presents the most parsimonious ARIMA model, ETS model and ANN architecture obtained and used in the simulations. With this experimental setup, 50 independent simulations are carried out for each method on each time series dataset (as in Table 1). The analysis of the obtained results is presented in two subsections: (a) analysing results using the RMSE measure and (b) analysing results using the SMAPE measure.

Table 1 Description of time series datasets
Table 2 Simulated parameters in ADE-ANNT
Table 3 Most parsimonious models identified and used in the simulation
$$ SMAPE = \frac{1}{n}\mathop \sum \limits_{j = 1}^{n} \frac{{\left| {Y_{j} - T_{j} } \right|}}{{\left( {\left| {Y_{j} } \right| + \left| {T_{j} } \right|} \right)/2}} $$
(23)

where n represents the number of observations, and Tj and Yj are the computed and actual values of the jth observation, respectively.
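The two accuracy measures (Eqs. 13 and 23) in code form, as a minimal sketch; the sample arrays are illustrative.

```python
import numpy as np

def rmse(actual, forecast):                         # Eq. 13
    return np.sqrt(np.mean((actual - forecast) ** 2))

def smape(actual, forecast):                        # Eq. 23
    return np.mean(np.abs(actual - forecast) / ((np.abs(actual) + np.abs(forecast)) / 2))

actual = np.array([102.0, 110.0, 121.0])
forecast = np.array([100.0, 115.0, 118.0])
print(rmse(actual, forecast), smape(actual, forecast))
```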

5.1 Analysing Results Using RMSE Measure

Table 4 presents the mean RMSE over 50 independent simulations for each time series. To make a comparative statistical analysis of the proposed DEMS method against the other methods, the Wilcoxon signed-rank test [62] is applied; Table 5 presents the individual hypothesis test results. It can be observed from Tables 4 and 5 that the DEMS method provides the best RMSE on 13 time series datasets (Acres Burned, Accidental Death, Internet Traffic, Gas Usage, Lake, Lynx, Milk, Mumps, Tasty Cola, Rainfall, Passenger, Temperature, Unemployment), with statistical significance in 11 cases (as in Table 5). Additionally, it provides better RMSE than ETS in 16 cases, ARIMA in 15 cases (14 statistically significant), MLP + LM in 16 cases and DBN [27] in 21 cases (18 statistically significant). It provides inferior RMSE to ETS, ARIMA and MLP + LM in only 4, 5 and 4 time series datasets, respectively, and in the remaining cases, it provides statistically equivalent RMSE.
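For readers who wish to reproduce the per-dataset comparisons, the Wilcoxon signed-rank test can be applied to the paired errors of two methods, presumably over the 50 independent simulations per dataset; a minimal SciPy sketch with illustrative (not actual) values follows.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(3)
# illustrative paired RMSE values of two methods over 50 simulations on one dataset
rmse_dems = rng.normal(0.95, 0.05, size=50)
rmse_arima = rng.normal(1.00, 0.05, size=50)

stat, p_value = wilcoxon(rmse_dems, rmse_arima)
print(f"Wilcoxon statistic = {stat:.2f}, p-value = {p_value:.4f}")
# p-value < 0.05 -> the difference between the two methods is statistically significant
```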

Table 4 RMSE for each dataset (best values are presented in bold face)
Table 5 Individual hypothesis test results using RMSE indicating the superior (+), inferior (−) or equivalent (≈) method with respect to the proposed DEMS method

5.2 Analysing Results Using SMAPE Measure

In this subsection, the results are analysed using the scale-free SMAPE performance measure, both for each time series individually and across all time series jointly. For the dataset-wise evaluation, the mean SMAPE over 50 independent simulations is calculated and presented in Table 6. To statistically compare the proposed DEMS method with the other methods, the Wilcoxon signed-rank test [62] is applied, and the test results are presented in Table 7. It can be observed from Table 6 that the DEMS method provides the best SMAPE on 11 time series datasets (Accidental Death, Chickenpox, Internet Traffic, ITDUK, Lake, Milk, Passenger, Sun Spot, Tasty Cola, Temperature, Unemployment), with statistical significance in 8 cases (as in Table 7). In addition, it provides better SMAPE than ETS in 15 time series datasets, ARIMA in 15, MLP + LM in 16 and DBN [27] in 17 (16 statistically significant). It provides inferior SMAPE to ETS in 4, ARIMA in 5, MLP + LM in 2 and DBN [27] in 1 time series dataset, and in the remaining cases, it provides statistically equivalent SMAPE.

Table 6 SMAPE for each dataset (best values are presented in bold face)
Table 7 Individual hypothesis test results using SMAPE indicating the superior (+), inferior (−) or equivalent (≈) method with respect to the proposed DEMS method

To determine the best method considering all the time series datasets, the Friedman and Nemenyi hypothesis tests [62, 63] are applied to the SMAPEs of all time series datasets. The test results presented in Fig. 4 indicate the superiority of the DEMS method, as it has the lowest mean rank (69.69) among all the methods employed in this study. In addition, since the mean rank of the DEMS method is lower than those of the other methods by at least the critical distance (1.3), the DEMS method is statistically superior to the other methods.

Fig. 4 Ranks of forecasting methods using SMAPE (p = 0.0 and Critical Distance = 1.3)

To verify that the proposed DEMS method is consistently effective, several comparisons between different designs are performed. The design strategies are intended to evaluate the effect of the treatment of the trend component, the sensitivity to different normalization techniques and the effectiveness of the proposed ADE-ANNT training algorithm.

For this, the design configuration used in the proposed method (i.e. min–max normalization and the ADE-ANNT training algorithm, with no detrending) is treated as the baseline. To evaluate the effect of the treatment of the trend component, a detrending step that takes the first difference of the time series data is added before the normalization step. The first difference is considered because detrending by taking the first difference is the most suitable method for building ANN models for nonlinear and stochastic time series [12]. To evaluate the sensitivity to different normalization techniques, the min–max (as in Eq. 18), decimal scaling (as in Eq. 19) and vector (as in Eq. 20) normalization techniques are considered. To evaluate the sensitivity to different training algorithms, the proposed ADE-ANNT training algorithm, TLBO [64] for ANN training (TLBO-ANNT), DE-ANNT+ [41] (with five trial vectors), DE-CRO-HONNT [43] and the Levenberg–Marquardt (LM) algorithm are considered. Additionally, to evaluate the effectiveness of architecture selection using DEMS, results from the MLP + LM method (architecture selection using the ACF and PACF and training with LM) are compared with DEMS + LM (architecture obtained using the DEMS method and training with the LM algorithm). In this comparison of architecture selection methods, the DBN [27] method is not considered because of its inferior statistical performance compared to the MLP + LM method (as in Fig. 4). In addition, to avoid biasing the comparison towards the ADE-ANNT training algorithm (since the DEMS architectures are selected using ADE-ANNT), the LM algorithm is used to train the models in the DEMS + LM method. With these modifications, 50 independent simulations are conducted for each time series dataset. To evaluate the sensitivity of the DEMS method to the different strategies, a one-way analysis of variance (ANOVA) test is conducted on the obtained SMAPEs. The test results are presented in Table 8. They indicate that the architectures selected using the DEMS method are statistically insensitive (p-value > 0.05) to the different normalization techniques. However, the chosen architectures are sensitive to detrending using the first difference of the series (p-value < 0.05) and to the training algorithm (p-value < 0.05). Additionally, the performance of the ANN model is sensitive to the architecture selection method (p-value < 0.05).

Table 8 ANOVA results considering different strategies

To identify the effect of treating the trend component by taking the first difference of the series, the Friedman and Nemenyi hypothesis tests [62, 63] are applied to the obtained SMAPEs, and the test results are presented in Fig. 5. It can be observed that the proposed method performs statistically better when detrending using the first difference is avoided. In contrast to the claim of Qi and Zhang [12] that ANN models perform better after detrending using the first difference, in this study the performance of the ANN model deteriorates significantly when such detrending is used. However, in the method of Qi and Zhang [12], the architecture of the ANN model is chosen in an ad hoc manner, and the authors note that their conclusions may change with different parameters. From the simulation results, it can be claimed that an ANN with an appropriate architecture can effectively handle the trend component of a time series, and in such a case detrending using the first difference deteriorates the performance of the model significantly.

Fig. 5 Ranks of direct versus detrending using the first difference of the series using SMAPE (p = 0.0 and Critical Distance = 0.4)

To identify the best training algorithm among all those considered in this study, the Friedman and Nemenyi hypothesis tests [62, 63] are applied to the obtained SMAPEs. The test results shown in Fig. 6 indicate the clear superiority of the proposed ADE-ANNT training algorithm. The algorithm most competitive with ADE-ANNT is LM; however, the absolute difference in mean rank (1.51) is larger than the critical distance (1.3), so the difference is statistically significant. Moreover, the memory consumption of the proposed ADE-ANNT algorithm is always less than or equal to that of DE-ANNT+ [41], since in ADE-ANNT multiple trial vectors are generated only upon stagnation; in the worst case (i.e. stagnation occurring at every generation, which is rare), both algorithms have the same memory consumption. The DE-ANNT+ [41] algorithm has lower memory consumption than the LM algorithm; thus, the ADE-ANNT algorithm also has lower memory consumption than the LM algorithm. In the proposed wrapper approach-based DEMS method, a variety of ANN architectures need to be trained. Thus, considering both memory consumption and forecasting accuracy, the ADE-ANNT algorithm is well suited to the proposed modelling scheme.

Fig. 6 Ranks of training algorithms using SMAPE (p = 0.0 and Critical Distance = 1.3)

Table 8 indicates that the DEMS + LM and MLP + LM methods provide statistically different overall forecasting accuracy in terms of the SMAPE measure. Both methods use the LM training algorithm, but on different architectures: in the DEMS + LM method, the architectures are obtained using the proposed DEMS method (as in Table 3), whereas in the MLP + LM method, the architectures are obtained using the ACF and PACF functions of the series. To identify the better method, the Friedman and Nemenyi hypothesis tests [62, 63] are applied to the obtained SMAPEs, and the test results are shown in Fig. 7. They indicate the superiority of the proposed DEMS architecture selection method.

Fig. 7 Ranks of DEMS + LM and MLP + LM method using SMAPE (p = 0.0 and Critical Distance = 0.4)

6 Conclusion

In this paper, a systematic investigation is made to explore the full potential of multilayer feedforward ANNs, especially the single hidden layer MLP, in TSF. For this, a DE-based modelling scheme (DEMS) is developed to determine the most parsimonious architecture of the MLP for a time series under study. In addition, an adaptive DE-based ANN training algorithm (ADE-ANNT) is proposed to determine the near-optimal weight set of the ANN. Extensive statistical analyses of the obtained results indicate the statistical superiority of the proposed DEMS method over the established statistical models ARIMA and ETS. The architectures selected using the proposed DEMS method provide statistically superior results compared with the MLP + LM and DBN [27] methods. The results also indicate the statistical superiority of the ADE-ANNT algorithm when compared with LM, DE-ANNT+ [41], DE-CRO-HONNT [43] and TLBO-ANNT [64]. In addition to statistically superior SMAPE, the ADE-ANNT algorithm has lower memory consumption than DE-ANNT+ [41] and the most popular LM algorithm.

The experiments reveal that ANN models are quite capable of handling the trend component when the near-optimal architecture for a time series is obtained, and in such a case detrending by taking the first difference of the series deteriorates the forecasting accuracy. Therefore, in the proposed DEMS method, treatment of the trend component is avoided. It is also observed that, in TSF, the ANN models are insensitive to the different normalization techniques.

The proposed DEMS method is fully automatic and can be applied to various real-world time series with minimal human intervention. The DEMS method is computationally expensive because of the use of hierarchical DE algorithms for the simultaneous optimization of the architecture and weight set of the ANN; however, it supports parallel implementation to reduce computational time. In addition, the proposed DEMS method can also be applied to optimize the architecture of other ANN models such as FLANN, the Pi-Sigma neural network, the RBF neural network, etc.