1 Introduction

The forecasting problem can be addressed with many different methods. One of the most widely used approaches is to build a model over the historical values of the time series using lagged variables. Such models can be linear or non-linear. In particular, forecasting models built with artificial neural networks, whose non-linear structures are flexible and adapt to the data through hidden layers, have become popular in recent years. Although artificial neural networks are flexible, data-driven methods, they have problematic aspects in solving the forecasting problem, such as overfitting, local optimum traps and difficulties in determining hyperparameters. The widely used multilayer perceptron has recently been superseded by deep artificial neural networks. Recurrent deep artificial neural networks are frequently preferred for forecasting because their structure captures the time dependence inherent in time series. Long short-term memory (LSTM), which mitigates the vanishing and exploding gradient problems encountered in recurrent deep networks by using gates in its architecture, is the most commonly used deep recurrent artificial neural network. Although the gates alleviate the problems of derivative-based training algorithms, they also increase the number of parameters and thereby invite overfitting. The LSTM literature reviewed below shows that it has been used to solve forecasting problems arising in many different fields.

The LSTM artificial neural network, which was first introduced by Hochreiter and Schmidhuber (1997), is widely used in forecasting studies. An examination of these studies shows that LSTM has been used intensively in fields such as energy, hydrology, economics and health. Fischer and Krauss (2018), Yao et al. (2018a, b), Seviye et al. (2018), Zhang and Yang (2019), Cen and Wang (2019), Vidya and Hari (2020), Yadav et al. (2020), Liu et al. (2021), Tang et al. (2022), Firouzjaee and Khaliliyan (2022), Kumar et al. (2022a, b) and Peng et al. (2020) used LSTM in their studies. Livieris et al. (2020), Pirani et al. (2022), Ensafi et al. (2022), Freeboroug and Zyl (2022), Arunkumar et al. (2022), Kumar et al. (2022a, b), Sirisha et al. (2022) and Singha and Panse (2022) compared LSTM with different methods in their studies. In addition, hybrid approaches that combine LSTM with classical methods have been used to solve the forecasting problem; a hybrid method was presented by Quadir et al. (2022). Many studies have been carried out in the field of engineering using LSTM, some of which were proposed by Dong et al. (2017), Zhang et al. (2018), Weng et al. (2019), Huang et al. (2022), Senanayake et al. (2022) and Silka et al. (2022). In addition, LSTM was compared with different forecasting methods by Barrera-Animas et al. (2022). LSTM is also used in the field of health, as in many other fields; Chimmula and Zhang (2020) used LSTM to model health data. In the field of tourism, there are many studies in which LSTM is used; the studies of Bi et al. (2020), Zhang et al. (2021) and Kumar et al. (2022a, b) can be given as examples. Solgi et al. (2021), Du et al. (2022a, b), Kilinc and Yurtsever (2022) and Li et al. (2022) used the LSTM method in the field of hydrology. Jiang and Hu (2018), Gong et al. (2022), Bilgili et al. (2022) and Karasu and Altan (2022) used LSTM in the field of energy; Zha et al. (2022), Yazici et al. (2022) and Ning et al. (2022) also used LSTM in this field. Tian et al. (2018), Liu et al. (2018), Kim and Cho (2019), Karevan and Suykens (2020), Veeramsetty et al. (2021), Yu et al. (2022) and Du et al. (2022a, b) used hybrid approaches based on LSTM. Artificial intelligence optimization algorithms have been used in the training of LSTM; genetic algorithm (GA) and particle swarm optimization (PSO) are the most preferred algorithms. Chen et al. (2018), Chung and Shin (2018) and Stajkowski et al. (2020) used GA as an optimization technique for LSTM. Moalla et al. (2017), Yao et al. (2018a, b), Shao et al. (2019), Qiu et al. (2020) and Gundu and Simon (2021) used PSO for the training of LSTM. Bas et al. (2021) used PSO for the training of a Pi-Sigma neural network. Liu and Song (2022) used granular neural networks, and Fan et al. (2021) used a deep-learning approach for forecasting. Fuzzy inference systems and fuzzy time series are also used for forecasting; Chen and Jian (2017), Zeng et al. (2019), Chen et al. (2019), Bas et al. (2022a), Egrioglu et al. (2022), Pant and Kumar (2022a) and Pant and Kumar (2022b) used fuzzy inference systems and fuzzy time series in their studies.

The motivation of this study is that the literature lacks a PSO-based training algorithm for an LSTM network designed for the forecasting problem that both reduces the variation caused by the initial random weights and is less affected by overfitting. The contributions of this study are given below:

  • A training algorithm suitable for the designed LSTM is proposed for the forecasting problem.

  • As the proposed training algorithm is based on PSO, it is less likely to get caught in local optimum traps.

  • Unlike other PSO-based algorithms, the proposed training algorithm reduces the variability due to random starting weights with a restart strategy, making the results more stable.

  • Unlike other PSO-based algorithms, the proposed training algorithm addresses the overfitting problem by defining an early stopping condition based on the proportional change in the error.

  • Since the proposed algorithm does not require derivatives, it is not affected by regions of the search space where the derivative cannot be calculated.

The second section of the study presents the LSTM artificial neural network and the LSTM architecture specific to the forecasting problem. The third section introduces the proposed new training algorithm. The fourth section gives application results for the FTSE time series. The fifth section presents the conclusions and discussion.

2 LSTM artificial neural network for forecasting

LSTM is a deep artificial neural network with feedback connections formed by neurons containing gates. In the LSTM deep neural network, the first and most important feature to understand is the structure of an LSTM neuron. The structure of an LSTM cell is given in Fig. 1. There are input, forget, cell candidate and output gates in an LSTM neuron. Each gate in the LSTM cell has its own input weights, recurrent weights and biases. The input gate, forget gate and cell candidate outputs are calculated as in the following equations.

Fig. 1

The structure of the LSTM cell

$${i}_{t}=\sigma \left({W}_{i}{x}_{t}+{R}_{i}{h}_{t-1}+{b}_{i}\right)$$
(1)
$${f}_{t}=\sigma \left({W}_{f}{x}_{t}+{R}_{f}{h}_{t-1}+{b}_{f}\right)$$
(2)
$${g}_{t}=\sigma \left({W}_{g}{x}_{t}+{R}_{g}{h}_{t-1}+{b}_{g}\right)$$
(3)

After calculating the outputs of the input gate, forget gate and cell candidate gate by using Eqs. (1)–(3), the cell state value is calculated as in Eq. (4) using the previous cell state and the outputs of these gates.

$${c}_{t}={f}_{t}*{c}_{t-1}+{i}_{t}*{g}_{t}$$
(4)

The output of the output gate is calculated with Eq. (5). Then, the hidden state value is calculated by using Eq. (6) from the output gate output and the cell state.

$${o}_{t}=\sigma \left({W}_{o}{x}_{t}+{R}_{o}{h}_{t-1}+{b}_{o}\right)$$
(5)
$${h}_{t}={o}_{t}*\mathrm{tanh}({c}_{t})$$
(6)

As a result, the inputs of an LSTM cell are the model inputs (\({x}_{t}\)) and the cell state and hidden state with a one-step delay (\({c}_{t-1}\), \({h}_{t-1}\)), while the outputs of the LSTM cell are the cell state and hidden state (\({c}_{t}\), \({h}_{t}\)). Different LSTM deep neural networks can be obtained by arranging LSTM cells in different designs. The architecture used in this study for the univariate time series forecasting problem is given in Fig. 2. In this architecture created for the forecasting problem, the inputs of the system are the lagged variables of the time series \(({y}_{t})\).

Fig. 2

The architecture of LSTM deep recurrent artificial neural network for forecasting problem

The input of an LSTM cell is the lagged variable vector \({x}_{t}=({y}_{t},{y}_{t-1},\dots ,{y}_{t-p+1})\) according to its time step in Fig. 2. For example, the input of the LSTM cell in the lower right corner of the architecture in Fig. 2 is \({x}_{t-1}=({y}_{t-1},{y}_{t-2},\dots ,{y}_{t-p})\). The output of the LSTM deep recurrent artificial neural network is the one-step-ahead forecast of the time series (\({\widehat{y}}_{t}\)). The output of the network is calculated with the following formula, where \(\sigma\) represents the logistic activation function.

$${\widehat{y}}_{t}=\sigma \left({W}_{FC}{h}_{t}+{b}_{FC}\right)$$
(7)

In Eq. (7), \({h}_{t}\) is the output of the last thick-edged neuron in the architecture. The architecture in Fig. 2 has \(m\) time steps, \(n\) hidden layers and \(p\) inputs or features. The number of neurons in each hidden layer is equal to \(m\). The weight and bias values of all LSTM cells in the same hidden layer are taken to be equal. This parameter sharing both reduces the number of parameters and provides a common LSTM cell that implements the same mathematical model in all time steps. These weight and bias values differ between hidden layers; that is, increasing the number of hidden layers increases the number of parameters of the network, while the number of time steps does not affect the number of parameters.
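To make Eqs. (1)–(7) concrete, a minimal NumPy sketch of a single LSTM cell step and the fully connected output layer is given below. It is only an illustration of the equations above, not the implementation used in the study; the function and variable names are our own, and the weight shapes follow the dimensions given later in Sect. 3 (\(W: p\times m\), \(R: m\times m\), \(b: 1\times m\)).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W, R, b):
    """One time step of the LSTM cell of Fig. 1 (Eqs. (1)-(6)).

    W, R and b are dicts keyed by 'i', 'f', 'g', 'o' holding the input
    weights (p x m), recurrent weights (m x m) and biases (m,) of each gate.
    """
    i_t = sigmoid(x_t @ W['i'] + h_prev @ R['i'] + b['i'])  # Eq. (1), input gate
    f_t = sigmoid(x_t @ W['f'] + h_prev @ R['f'] + b['f'])  # Eq. (2), forget gate
    # Eq. (3), cell candidate; the text uses the logistic function here,
    # although tanh is the more common choice in standard LSTM formulations.
    g_t = sigmoid(x_t @ W['g'] + h_prev @ R['g'] + b['g'])
    c_t = f_t * c_prev + i_t * g_t                          # Eq. (4), cell state
    o_t = sigmoid(x_t @ W['o'] + h_prev @ R['o'] + b['o'])  # Eq. (5), output gate
    h_t = o_t * np.tanh(c_t)                                # Eq. (6), hidden state
    return h_t, c_t

def forecast_output(h_t, w_fc, b_fc):
    """Eq. (7): one-step-ahead forecast from the last hidden state."""
    return sigmoid(h_t @ w_fc + b_fc)
```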

Stochastic gradient descent algorithms can be used for training the LSTM given in Fig. 2. In stochastic gradient descent algorithms, the true gradient is estimated by sampling: the gradient is calculated over a random subset selected from the training data. One of the most commonly used stochastic gradient algorithms is the AdaGrad algorithm proposed by Duchi et al. (2011). In this algorithm, a scaling factor obtained from the outer product of the gradient vector is used. Another algorithm that can be used in the training of LSTM is the RMSProp algorithm, a minibatch version of the Rprop algorithm that works using the sign of the gradient. The most frequently used training algorithm for LSTM, which is included in many software packages, is the Adam algorithm given in Kingma and Ba (2014). The Adam algorithm has fast convergence properties, and more advanced versions of it are available in the literature. The Adam algorithm is implemented with the following steps.

Algorithm 1. Adam Algorithm

  1. Step 1.

    The algorithm's initial parameter values are determined. As in other studies in the literature, the default values of the Matlab Neural Network Toolbox can be used.

    • \(\alpha\): step size

    • \({\beta }_{1}\) and \({\beta }_{2}\): exponential decay rates for the moment estimates

    • \(\varepsilon\): correction term

  2. Step 2.

    Initial values are set as zero.

    $${m}_{0}={v}_{0}=t=0$$
  3. Step 3.

    Initial weight and bias values (\({\theta }_{0})\) are determined. These values are obtained by generating random numbers within a certain range.

  4. Step 4.

    Do \(t=t+1\). The iteration counter is incremented by one.

  5. Step 5.

    Calculate the following equations:

    $${g}_{t}={\nabla }_{\theta }f({\theta }_{t-1})$$
    (8)
    $${m}_{t}={\beta }_{1}{m}_{t-1}+(1-{\beta }_{1}){g}_{t}$$
    (9)
    $${v}_{t}={\beta }_{2}{v}_{t-1}+(1-{\beta }_{2}){g}_{t}^{2}$$
    (10)
    $${\widehat{m}}_{t}={m}_{t}/(1-{\beta }_{1}^{t})$$
    (11)
    $${\widehat{v}}_{t}={v}_{t}/(1-{\beta }_{2}^{t})$$
    (12)
    $${\theta }_{t}={\theta }_{t-1}-\alpha {\widehat{m}}_{t}/(\sqrt{{\widehat{v}}_{t}}+\varepsilon )$$
    (13)

In Eq. (10), \({g}_{t}^{2}\) denotes the element-wise square of the gradient vector. The calculations in Eqs. (8)–(13) are performed sequentially and in order.

  6. Step 6.

    Repeat Steps 4–5 until convergence is achieved, that is, until the gradient vector approaches zero.
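For readers who prefer code to the steps above, the following Python sketch implements the Adam update of Eqs. (8)–(13) for a generic parameter vector. The gradient function `grad_fn` is an assumed user-supplied routine, and the default coefficient values follow Kingma and Ba (2014); this is a schematic illustration rather than the exact implementation used in the study.

```python
import numpy as np

def adam(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, max_iter=1000, tol=1e-8):
    """Adam optimizer following Algorithm 1 (Eqs. 8-13)."""
    theta = theta0.copy()
    m = np.zeros_like(theta)   # first moment estimate, m_0 = 0
    v = np.zeros_like(theta)   # second moment estimate, v_0 = 0
    for t in range(1, max_iter + 1):
        g = grad_fn(theta)                       # Eq. (8): gradient of the loss
        m = beta1 * m + (1 - beta1) * g          # Eq. (9)
        v = beta2 * v + (1 - beta2) * g**2       # Eq. (10): element-wise square
        m_hat = m / (1 - beta1**t)               # Eq. (11): bias correction
        v_hat = v / (1 - beta2**t)               # Eq. (12): bias correction
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # Eq. (13)
        if np.linalg.norm(g) < tol:              # Step 6: stop when the gradient vanishes
            break
    return theta
```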

3 A new training algorithm for LSTM based on particle swarm optimization

The common point of the AdaGrad, RMSProp and Adam algorithms is that they work with derivative information and operate with a single initial parameter vector. This increases the possibility of the methods being caught in local optimum traps, and the methods can be greatly influenced by the selection of the initial parameters. In this study, a new training algorithm based on particle swarm optimization is designed, which has a lower risk of getting caught in local optimum traps and reduces the variance originating from the random determination of the initial weights. The proposed algorithm works with two strategies called restarting and early stopping. The restarting strategy reduces the variance of the results caused by the random variation of the initial parameter values. The early stopping condition can provide a solution to the overfitting problem by working with a proportional error. Using these strategies, Bas et al. (2022) obtained successful training results for a simple deep recurrent neural network. Moreover, since the proposed training algorithm is based on particle swarm optimization, it explores the search space with a population of parameter vectors instead of a single initial parameter vector, increasing the chance of avoiding the local optimum trap. The proposed training algorithm is presented below in steps.

Algorithm 2. A new training algorithm for LSTM

  1. Step 1.

    The parameters of the PSO algorithm are selected according to the network being trained.

    • \({c}_{1}^{initial}:\) The starting value of the cognitive coefficient, the default value is 1.

    • \({c}_{1}^{final}:\) The ending value of the cognitive coefficient, the default value is 2.

    • \({c}_{2}^{initial}:\) The starting value of the social coefficient, the default value is 2.

    • \({c}_{2}^{final}:\) The ending value of the social coefficient, the default value is 1.

    • \({w}^{initial} :\) The starting value for inertia weight, the default value is 0.4.

    • \({w}^{final} :\) The ending value for inertia weight, the default value is 0.9.

    • \(vmaps\): The bound value for the velocities, the default value is 1.

    • \(limit1:\) The limit value for the re-starting strategy, the default value is 30.

    • \(limit2:\) The limit value for the early stopping rule, the default value is 20.

    • \(maxitr:\) The maximum number of iterations, the default value is 1000.

    • \(pn\): The number of particles, the default value is 30.

The counters for the restarting and early stopping strategies and the iteration counter are reset (\(rsc=0\), \(esc=0\), \(t=0\)). It is also possible to use alternative values for the initial parameters.

  2. Step 2.

    Initial position values and initial velocity values in particle swarm optimization are created.

The positions of a particle constitute all the weights and bias values of an LSTM network. For a network with a single hidden layer, the total number of weights and biases is \(4m\left(p+m+1\right)+m+1\) because the dimensions of the weights and biases are \({W}_{i}:p\times m\), \({R}_{i}:m\times m\), \({b}_{i}:1\times m\), \({W}_{f}:p\times m\), \({R}_{f}:m\times m\), \({b}_{f}:1\times m\), \({W}_{g}:p\times m\), \({R}_{g}:m\times m\), \({b}_{g}:1\times m\), \({W}_{o}:p\times m\), \({R}_{o}:m\times m\), \({b}_{o}:1\times m\), \({W}_{FC}:m\times 1\) and \({b}_{FC}:1\times 1\). For the LSTM network given in Fig. 2, the total number of weights and biases varies according to the number of hidden layers and equals \(4m\left(p+m+1\right)+4\left(n-1\right)m\left(2m+1\right)+m+1\). For example, for \(n=2, m=3, p=4\), the total number of weights and biases in the LSTM network is 184 (a quick numerical check is given in the sketch below). The weights and biases are generated from the \([0,1]\) interval. All velocities are generated from the \([-vmaps, vmaps]\) interval. \({P}_{i,j}^{(t)}\) is the jth position of the ith particle at the tth iteration, and \({V}_{i,j}^{(t)}\) is the jth velocity of the ith particle at the tth iteration.
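The parameter count above can be checked with a short helper; the function name is illustrative, and the sketch simply evaluates the formula given in this step.

```python
def lstm_num_parameters(p, m, n):
    """Total number of weights and biases of the network in Fig. 2:
    the first hidden layer has p inputs, the remaining n-1 layers take the
    m hidden states of the previous layer, plus the fully connected output."""
    first_layer = 4 * m * (p + m + 1)
    other_layers = 4 * (n - 1) * m * (2 * m + 1)
    output_layer = m + 1            # W_FC (m x 1) and b_FC (1 x 1)
    return first_layer + other_layers + output_layer

print(lstm_num_parameters(p=4, m=3, n=2))  # prints 184, as in the example above
```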

  3. Step 3.

    The mean square error criterion given below is chosen as the fitness function to be minimized by PSO. At this stage it is possible to choose a different criterion; different choices will not affect the operation of the algorithm, but may affect the results. The parameter "vmaps" can also be seen as a value that controls how much the parameters are allowed to change in a single update.

    $${MSE}_{j}^{t}=\frac{1}{ntrain}\sum_{t=1}^{ntrain}{\left({y}_{t}-{\widehat{y}}_{t}\right)}^{2}, j=\mathrm{1,2},\dots ,pn$$
    (14)
  4. Step 4.

    The iteration counter is increased \(t=t+1\).

  5. Step 5.

    Pbest and gbest are formed. Pbest is a memory of all particles in the solution set, equal to their initial positions in the first iteration. Pbest is checked at each iteration, and the positions of improving particles are memorized in Pbest. In a given iteration, the elements of Pbest represent the best individual memories of all particles so far. "gbest" is the memory of the swarm and is the best particle of Pbest. In this step, the gbest and Pbest vectors are generated only in the first iteration; in subsequent iterations these vectors are updated.

  6. Step 6.

    The values of the cognitive and social coefficients and the inertia weight are calculated by Eqs. (15)–(17).

    $${w}^{(t)}={(w}^{initial}-{w}^{final})\frac{maxitr-t}{maxitr}+{w}^{final}$$
    (15)
    $${c}_{1}^{(t)}={(c}_{1}^{final}-{c}_{1}^{initial})\frac{t}{maxitr}+{c}_{1}^{initial}$$
    (16)
    $${c}_{2}^{(t)}={({c}_{2}^{initial}-c}_{2}^{final})\frac{maxitr-t}{maxitr}+{c}_{2}^{final}$$
    (17)

Equations (15)–(17) are used to linearly increase or decrease the coefficients over the iterations. In this way, the balance between local and global search changes gradually as the iterations progress instead of being kept fixed.

  7. Step 7.

    The new velocities and positions are calculated by using Eqs. (18)–(20). The random numbers \({r}_{1}\) and \({r}_{2}\) are generated from the \([0,1]\) interval. In addition, Eq. (19) ensures that the velocities are kept within a certain range.

    $${V}_{i,j}^{(t)}={w}^{(t)}{V}_{i,j}^{(t-1)}+{c}_{1}^{(t)}{r}_{1}\left({Pbest}_{i,j}^{(t)}-{P}_{i,j}^{(t)}\right)+{c}_{2}^{(t)}{r}_{2}\left({gbest}_{j}^{(t)}-{P}_{i,j}^{(t)}\right)$$
    (18)
    $${V}_{i,j}^{(t)}=\mathrm{min}(vmaps,\mathrm{max}\left(-vmaps,{V}_{i,j}^{\left(t\right)}\right))$$
    (19)
    $${P}_{i,j}^{(t)}={P}_{i,j}^{(t-1)}+{V}_{i,j}^{(t)}$$
    (20)
  8. Step 8.

    Using Eq. (14), the fitness values of all particles are calculated. A different function could have been selected as the fitness function.

  9. Step 9.

    Pbest and gbest are updated. The fitness values calculated for the new positions of the particles are compared with the corresponding fitness values stored in Pbest, and if a particle improves on its memory in Pbest, the relevant row of Pbest is replaced with its new positions. Otherwise, no changes are made. After Pbest is updated, the fitness value of the best row of the current Pbest is compared with the fitness value of "gbest". If an improvement is achieved, "gbest" is replaced; otherwise, it is kept.

  10. Step 10.

    The restarting strategy counter is increased (\(rsc=rsc+1\)) and checked. If \(rsc>limit1\), all positions and velocities are regenerated and \(rsc\) is reset to zero, while "Pbest" and "gbest" are preserved in this step.

  11. Step 11.

    The early stopping rule is checked. The \(esc\) counter is increased depending on the following condition.

    $$esc=\left\{\begin{array}{ll}esc+1, & \mathrm{if}\ \dfrac{{MSEbest}^{(t)}-{MSEbest}^{(t-1)}}{{MSEbest}^{(t)}}<{10}^{-3}\\ 0, & \mathrm{otherwise}\end{array}\right.$$
    (21)

The early stopping rule is \(esc>limit2\). If the rule is satisfied, the algorithm is stopped; otherwise, go to Step 4.

  12. Step 12.

    The state of reaching the maximum number of iterations is checked. If \(t>maxitr\), the algorithm is stopped and the best solution is taken as “gbest”. Otherwise, it returns to Step 6.

A flowchart is given in Fig. 3 for a better understanding of the proposed new training algorithm.

Fig. 3

Flowchart for the proposed training algorithm
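The following condensed Python sketch illustrates the structure of Algorithm 2. It assumes a user-supplied function `fitness(position)` that decodes a position vector into the LSTM weights of Fig. 2, runs the network over the training data and returns the MSE of Eq. (14); all names are illustrative, and details the text leaves open (for example, the sign convention in Eq. (21)) are marked as assumptions in the comments.

```python
import numpy as np

def pso_train(fitness, dim, pn=30, maxitr=1000, vmaps=1.0,
              c1_init=1.0, c1_final=2.0, c2_init=2.0, c2_final=1.0,
              w_init=0.4, w_final=0.9, limit1=30, limit2=20, rng=None):
    rng = rng or np.random.default_rng()
    P = rng.uniform(0.0, 1.0, size=(pn, dim))          # Step 2: positions in [0, 1]
    V = rng.uniform(-vmaps, vmaps, size=(pn, dim))     # Step 2: velocities
    fit = np.array([fitness(p) for p in P])            # Step 3: Eq. (14)
    pbest, pbest_fit = P.copy(), fit.copy()            # Step 5: individual memories
    g = int(np.argmin(pbest_fit))
    gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]   # Step 5: swarm memory
    rsc = esc = 0
    prev_best = gbest_fit
    for t in range(1, maxitr + 1):                     # Steps 4-12
        # Step 6: linearly varying inertia and acceleration coefficients (Eqs. 15-17)
        w  = (w_init - w_final) * (maxitr - t) / maxitr + w_final
        c1 = (c1_final - c1_init) * t / maxitr + c1_init
        c2 = (c2_init - c2_final) * (maxitr - t) / maxitr + c2_final
        # Step 7: velocity and position updates (Eqs. 18-20)
        r1, r2 = rng.random((pn, dim)), rng.random((pn, dim))
        V = w * V + c1 * r1 * (pbest - P) + c2 * r2 * (gbest - P)
        V = np.clip(V, -vmaps, vmaps)
        P = P + V
        # Steps 8-9: evaluate fitness and update Pbest / gbest
        fit = np.array([fitness(p) for p in P])
        improved = fit < pbest_fit
        pbest[improved], pbest_fit[improved] = P[improved], fit[improved]
        g = int(np.argmin(pbest_fit))
        if pbest_fit[g] < gbest_fit:
            gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
        # Step 10: restart strategy (Pbest and gbest are kept). The text increments
        # rsc every iteration; incrementing only when gbest fails to improve is an
        # alternative reading.
        rsc += 1
        if rsc > limit1:
            P = rng.uniform(0.0, 1.0, size=(pn, dim))
            V = rng.uniform(-vmaps, vmaps, size=(pn, dim))
            rsc = 0
        # Step 11: early stopping, Eq. (21) read as "relative improvement of the
        # best fitness is below 1e-3" (assumption about the intended sign).
        rel_improvement = (prev_best - gbest_fit) / max(gbest_fit, 1e-12)
        esc = esc + 1 if rel_improvement < 1e-3 else 0
        prev_best = gbest_fit
        if esc > limit2:
            break
    return gbest, gbest_fit
```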

Another important problem for the LSTM given in Fig. 2 is hyperparameter selection. The properties of the analyzed data set may be important for the selection of hyperparameters. In general, it is preferred to select the hyperparameter values from a predetermined set by trial and error. In this approach, determining the possible values of the hyperparameters is important for keeping the computation time reasonable. In this study, possible values of the hyperparameters are determined by considering the observation frequency of the time series of interest. The following algorithm, based on data partitioning, is used for hyperparameter selection. The hyperparameters of the LSTM are \(m\), \(n\) and \(p\).

Algorithm 3. Hyperparameter selection processes

  1. Step 1.

    The possible values of the hyperparameters are determined. These values are lower and upper bound values for the hyperparameter values. The possible range of these values can be chosen according to the components contained in the time series. For example, in time series with seasonality, the possible values for the number of inputs of the model can be taken to include the period of the series.

    $$m\in [{m}_{1},{m}_{2}]$$
    $$n\in [{n}_{1},{n}_{2}]$$
    $$p\in [{p}_{1},{p}_{2}]$$
  2. Step 2.

    The learning samples are divided into three parts, training, validation and test sets, in a block structure. The reason for using the block structure is to ensure that the test set consists of more recent data, due to the time dependence in the time series. While the LSTM is trained with the training data, the hyperparameters are selected with the validation set. The test set is used to compare the performance of LSTM with other methods. The data partitioning is shown in Fig. 4, where \(ntrain\), \(nval\) and \(ntest\) are the numbers of observations in the training, validation and test sets, respectively.

Fig. 4

The data partition for the hyperparameter selection

The lengths of the validation and test sets are determined by the observation frequency of the time series. For example, for a series observed five days a week, both the validation and test sets can be taken as 20 observations to cover one month.

  3. Step 3.

    Using Algorithm 2, the LSTM is trained on the training data for all possible hyperparameter values and the root mean square error (RMSE) is calculated for the validation set.

Algorithm 2 is repeated \(({m}_{2}-{m}_{1}+1)\times ({n}_{2}-{n}_{1}+1)\times ({p}_{2}-{p}_{1}+1)\) times in total according to possible parameter values.

$${RMSE}_{val}=\sqrt{\frac{1}{nval}\sum_{t=ntrain+1}^{ntrain+nval}{\left({y}_{t}-{\widehat{y}}_{t}\right)}^{2}}$$
(22)

A different error measure can be chosen instead of RMSE in this step.

  4. Step 4.

    The best hyperparameter values are selected. The best hyperparameter values (\(mbest, nbest, pbest\)) are the hyperparameter values that give the lowest \({RMSE}_{val}\) value.

The flow diagram of the hyperparameter selection algorithm is given in Fig. 5.

Fig. 5

The flowchart of the hyperparameter selection algorithm
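A schematic Python version of Algorithm 3 is given below. The helpers `build_dataset` and `train_lstm` are assumed placeholders (the latter would wrap Algorithm 2), and the hyperparameter ranges shown are illustrative defaults; only the grid search over \((m, n, p)\) and the validation RMSE of Eq. (22) follow the text.

```python
import itertools
import numpy as np

def select_hyperparameters(series, ntrain, nval,
                           m_range=(1, 5), n_range=(1, 3), p_range=(1, 5)):
    """Grid search over (m, n, p) on the block-structured validation set."""
    best = (None, np.inf)
    for m, n, p in itertools.product(range(m_range[0], m_range[1] + 1),
                                     range(n_range[0], n_range[1] + 1),
                                     range(p_range[0], p_range[1] + 1)):
        # Step 2: lagged inputs/targets split in blocks, training data first so
        # that the validation block consists of the more recent observations.
        X, y = build_dataset(series, lags=p)            # assumed helper
        X_tr, y_tr = X[:ntrain], y[:ntrain]
        X_val, y_val = X[ntrain:ntrain + nval], y[ntrain:ntrain + nval]
        model = train_lstm(X_tr, y_tr, m=m, n=n)        # assumed wrapper of Algorithm 2
        y_hat = model.predict(X_val)
        rmse_val = np.sqrt(np.mean((y_val - y_hat) ** 2))   # Eq. (22)
        if rmse_val < best[1]:                          # Step 4: keep the best triple
            best = ((m, n, p), rmse_val)
    return best
```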

4 Applications

The performance of the proposed new training algorithm is investigated over ten time series randomly selected from the closing prices of the FTSE 100 index. The FTSE 100 index is a market-capitalization-weighted index of UK-listed blue-chip companies. The index is part of the FTSE UK Series and is designed to measure the performance of the 100 largest companies traded on the London Stock Exchange. Information about the randomly selected time series is given in Table 1 and their graphs are given in Fig. 6. The random selection of the time series allows the closing values of different periods of the year to be included in the training, validation and test sets each time, enabling a more comprehensive and general comparison.

Table 1 The information about the randomly selected time series from FTSE 100 closing prices
Fig. 6

The randomly selected ten time series graphs from FTSE 100 Closing Prices

In the application, the performance of the proposed method is compared with that of the LSTM artificial neural network trained with the Adam algorithm. Since the Adam algorithm is the most frequently used LSTM training algorithm in the literature, the main success criterion of this study is to obtain a more successful training algorithm than Adam. In addition, the performance of the proposed training algorithm is compared with Pi-Sigma, a high-order artificial neural network, and a simple recurrent deep artificial neural network (SRNN). The Pi-Sigma and SRNN artificial neural networks are also trained with a training algorithm based on PSO, similar to the proposed method.

In the implementation of all methods, hyperparameter selection is performed with the hyperparameter selection algorithm used for the proposed method, ensuring a fair comparison. In the applications, the validation and test set lengths are taken as 60 for all series and all methods.

For the best hyperparameters, all methods are trained 30 times using different random initial weights, and the mean, standard deviation, minimum and maximum statistics of the RMSE values obtained for the test set are calculated over these 30 replicates. In addition, the performance of the proposed method over these 30 replicates is compared with each method separately using the Wilcoxon signed rank test, and the resulting p-values are given in the last column of the table. P-values smaller than the chosen type I error probability \((\alpha )\) indicate a statistically significant difference. All these results are presented in Table 2 for all series. In addition, the best hyperparameter values selected on the validation set for all methods are given in Table 3. Since Pi-Sigma ANN is not a deep artificial neural network, its number of hidden layers is fixed at 1.

Table 2 The RMSE statistics for methods
Table 3 The best hyperparameter values for all methods
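As an illustration of how the pairwise comparison over the 30 replicates can be carried out, the following sketch applies the Wilcoxon signed rank test with SciPy to two hypothetical RMSE samples; the numbers are placeholders, not values from Table 2.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
rmse_proposed = rng.normal(0.020, 0.002, size=30)   # placeholder RMSE values of one method
rmse_other = rng.normal(0.022, 0.002, size=30)      # placeholder RMSE values of another method

stat, p_value = wilcoxon(rmse_proposed, rmse_other)
print(f"Wilcoxon statistic = {stat:.2f}, p-value = {p_value:.4f}")
# A p-value below the chosen type I error probability (e.g. 0.05) indicates a
# statistically significant difference between the two methods' medians.
```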

When Table 2 is examined, it is seen that the most successful method for Series 1 is Pi-Sigma. For Series 1, LSTM-ADAM produces lower mean RMSE values than the proposed method and also has lower standard deviation and maximum statistics. When the proposed method is compared with LSTM-ADAM, it is understood that LSTM-ADAM differs from the proposed method with statistical significance according to the Wilcoxon signed rank test and produces better forecasting results. Besides, for Series 1, the proposed method and SRNN have statistically equivalent performance according to the Wilcoxon signed rank test.

When the results given for Series 2 in Table 2 are examined, it is understood that the proposed method has more successful RMSE statistics than all other methods and that these differences are statistically significant according to the Wilcoxon signed rank test. When the results for Series 3 are examined, it is seen that the most successful method is SRNN, which produces statistically more successful forecasting results than the proposed method. In addition, the proposed method produces more successful forecasting results than the LSTM-ADAM method, and this difference is statistically significant according to the Wilcoxon signed-rank test. When the results for Series 4 are examined, it is seen that the most successful method is the proposed method according to the mean, standard deviation and maximum statistics; only the Pi-Sigma method produces a lower RMSE according to the minimum statistic. Moreover, the results produced by the proposed method have a statistically lower median than all other methods according to the Wilcoxon signed-rank test.

When the results for Series 5 are examined in Table 2, it is seen that the most successful method is the proposed method according to the mean, minimum and maximum statistics; only the Pi-Sigma method has a lower standard deviation than all other methods. The results produced by the proposed method have a statistically lower median than all other methods except SRNN according to the Wilcoxon signed-rank test, and the forecasting performances of the proposed method and the SRNN method are equivalent according to this test.

When the results for Series 6 are examined in Table 2, it is seen that the most successful method is SRNN according to the mean, minimum and maximum statistics. However, the forecasting performance of the proposed method is statistically equivalent to that of SRNN and Pi-Sigma. In addition, the proposed method produces more successful and statistically significantly better results than the LSTM-ADAM method.

When the results for Series 7 are examined in Table 2, it is seen that the most successful method is the proposed method according to the standard deviation, minimum and maximum statistics, while Pi-Sigma ANN produces a lower mean statistic. Pi-Sigma ANN also produces forecasting results with a lower median according to the Wilcoxon signed rank test. In addition, the proposed method produces more successful and statistically significantly better results than the LSTM-ADAM and SRNN methods according to the Wilcoxon signed-rank test. When the results for Series 8 are examined in Table 2, a situation similar to that observed for Series 7 is seen.

When the results for Series 9 are examined in Table 2, it is seen that the best results are obtained from the LSTM-ADAM algorithm. Besides, the performance of the proposed method is equivalent to that of Pi-Sigma and SRNN according to the Wilcoxon signed rank test. Finally, when the results for Series 10 in Table 2 are examined, it is seen that the best mean and maximum statistics are obtained from the proposed method, while the SRNN method produces the best results according to the minimum and standard deviation statistics. The proposed method has the lowest median according to the Wilcoxon signed rank test, and the difference is statistically significant. When Table 3 is examined, it is understood that the best hyperparameter values vary considerably according to the series and the applied method. Although the proposed method and LSTM-ADAM have the same artificial neural network structure, different training algorithms may lead to different hyperparameter selections.

5 Conclusion and discussions

In this study, an LSTM architecture for solving the forecasting problem and a new PSO-based training algorithm for the LSTM in this architecture are proposed. The proposed training algorithm has superior features compared to the Adam algorithm, the most frequently used, derivative-based training algorithm in the literature. Since the proposed training algorithm is based on PSO and does not require derivatives, it does not suffer in regions of the search space where the error function is not differentiable, which is a possible problem for the Adam algorithm.

In addition, the restart strategy in the proposed method reduces the variation caused by differences in the random selection of the starting weights and makes the results more homogeneous. It was expected and observed that the early stopping condition could mitigate the overfitting problem. All these findings are supported by the results obtained in the application. In the applications, the proposed method was the most successful method for 50% of the series, while Pi-Sigma, SRNN and LSTM-ADAM were the most successful for 30%, 10% and 10% of the series, respectively. In addition, when the proposed method is compared with LSTM-ADAM, the proposed method produces more successful forecasting results than LSTM-ADAM in 80% of all series. All these findings are supported by the results obtained using the Wilcoxon signed rank test. As a result, an effective new training algorithm has been introduced for the solution of the univariate time series forecasting problem.

In future studies, it is planned to improve the results further by making adjustments to the LSTM architecture that positively affect the forecasting performance. The use of different artificial intelligence optimization algorithms in the training algorithm is another subject planned for our future studies.