1 Introduction

Multilayer perceptron artificial neural networks (MLPANNs) can be adversely affected by outliers. Because the inputs are combined in a complex way, an outlying input can produce an excessively small or large output. If the dataset contains outliers, either a robust training algorithm or a robust artificial neural network (ANN) should be used to obtain satisfactory forecasting results.

In the literature, Chen and Jain [1] and Hsiao et al. [2] proposed robust learning algorithms based on M-estimators. Lee et al. [3] presented a robust learning algorithm for radial basis function neural networks. El Melegy et al. [4] and Rusuecki [5] put forward least median of squares algorithms for ANNs. Thomas [6] proposed a robust learning algorithm for MLPANNs. Some robust ANNs were also proposed in the literature. Bors and Pitas [7] proposed the median radial basis function ANN for datasets with outliers. Majhi et al. [8] proposed the Wilcoxon ANN as a robust neural network (NN). Aladag et al. [9] introduced the median neural network, in which the median of the incoming signals is used as the aggregation function. Yolcu et al. [10] proposed the trimmed ANN.

ANNs with a multiplicative neuron model are viable alternatives to MLPANNs for forecasting purposes. Ghosh and Shin [11] proposed the Pi-Sigma neural network (PSNN) as a high-order neural network. Yadav et al. [12] put forward the single multiplicative neuron model artificial neural network (SMNMANN). These ANNs can produce more accurate forecasts than other models. There are also different neuron models in the literature. Chen et al. [13] and Zhou et al. [14] used the dendritic neuron model. Attia et al. [15] proposed the generalized neuron model, Aladag et al. [9] proposed the median neuron model and Yolcu et al. [10] proposed the trimmed mean neuron model.

When the multiplicative neuron model (MNM) is used in the architecture, outliers can affect the outputs of the NN more strongly than in NNs with an additive neuron model. Bas et al. [16] proposed a robust training algorithm for the SMNMANN. While some studies have used various heuristic-based approaches to forecast datasets from different areas, such as the study by Barati and Sharifian [17], other studies take advantage of different types of NN, such as those by Berenguer et al. [18] and Haviluddin [19]. Moreover, Chow and Cho [20] described the development of new approaches to rainfall forecasting using NNs. Cogollo and Velásquez [21] analysed the development of new forecasting models based on NNs. Thomas and Suhner [22] proposed a new pruning approach to determine the optimal structure of an NN. Beheshti et al. [23] used meta-heuristic algorithms to train an ANN for improving the accuracy of rainfall forecasting. Dey et al. [24] proposed an approach based on gene expression programming and NNs to forecast unsteady forced convection over a cylinder. Kiakojoori and Khorasani [25] addressed the problem of health monitoring and prognosis of aircraft gas turbine engines by using two different dynamic NNs, and Li [26] presented a new traffic flow prediction tool that combines NNs and fuzzy systems, called dynamic fuzzy NNs.

In this study, a new robust artificial neural network is proposed for forecasting. This new robust artificial neural network uses the MNM and the median neuron model (MdNM) in its architecture. The proposed network is called the Median-Pi artificial neural network (MdPNN). Moreover, the MdPNN is trained by particle swarm optimization (PSO). The proposed network can be used for forecasting, and it can also be modified for other tasks such as classification and prediction problems.

In the second section of the paper, the proposed MdPNN is introduced and an algorithm is given to explain how the output is computed for a learning sample. In the third section, the training algorithm for the proposed network model is introduced. Application results are given in Section 4, and the obtained results are discussed in the last section.

2 Median-Pi artificial neural network

Robust architectures for ANNs can be obtained by using robust statistics such as the median and the trimmed mean. In this paper, the Pi-Sigma neural network is modified by using the MdNM instead of the additive (sigma) neuron model. The proposed ANN is a high-order network and it is less affected by outliers in a dataset owing to the median neuron models used in the hidden layer. Figure 1 shows the architecture of the proposed MdPNN.

Fig. 1
figure 1

Architecture of Median-Pi artificial neural network

In Fig. 1, M and Π represent the MdNM and the MNM, respectively. The architecture given in Fig. 1 represents an MdPNN of order k with m inputs. The inputs of the MdPNN are lagged variables of the time series. W represents the weights between the inputs and the hidden layer and is a matrix of dimension m × k. There are k neurons in the hidden layer, and k represents the order of the network. Median neuron models are employed in the hidden layer of the NN. In a hidden layer neuron (MdNM), the median of the incoming signals constitutes the output of the neuron. Let (s1 , s2 ,  …  , sm) be the incoming signals and yl (l = 1, 2,  ⋯ , k) be the output of the lth neuron. The output of the lth MdNM is computed as in Eq. (1), and its activation is given in Eq. (2).

$$ {y}_l=\mathrm{Median}\left({s}_1,{s}_2,\dots, {s}_m\right) $$
(1)
$$ {f}_1\left({y}_l\right)={y}_l $$
(2)

In the hidden layer neurons, the activation functions are linear, and θ1 , θ2 ,  …  , θk are the bias terms. In the output layer, there is only a single neuron and the MNM is used in this layer. The weights between the hidden layer and the output layer are taken as one, and the bias term is taken as zero. The output of the MdPNN is obtained as in Eq. (3). The activation function in the output layer is the sigmoid function, and the output of the MdPNN is calculated as in Eq. (4).

$$ y=\mathrm{Prod}\left({y}_1,{y}_2,\dots, {y}_k\right) $$
(3)
$$ {f}_2(y)=\frac{1}{1+ \exp \left(- y\right)} $$
(4)

An algorithm for computing the output for a learning sample is given below. In the algorithm, the input values of a learning sample are represented by x1 , x2 ,  …  , xm.

Algorithm 1

Computation of output for MdPNN

Step 1

    Outputs for the hidden layer neurons are computed by using incoming signals as below.

$$ {h}_j={f}_1\left(\mathrm{Median}\left\{{w}_{1 j}{x}_1+{\theta}_j,{w}_{2 j}{x}_2+{\theta}_j,\dots, {w}_{m j}{x}_m+{\theta}_j\right\}\right),\kern0.75em j=1,2,\dots, k $$
(5)
Step 2

    Output of the network is computed by using outputs of hidden layer and sigmoid activation function.

$$ o={f}_2\left({\prod}_{j=1}^k{h}_j\ \right)=\frac{1}{1+\mathit{\exp}\left(-{\prod}_{j=1}^k{h}_j\right)} $$
(6)
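As an illustration, a minimal Python sketch of Algorithm 1 is given below. The function name and the array-based representation of the weights and biases are assumptions made for clarity; they are not prescribed by the architecture itself.

```python
import numpy as np

def mdpnn_output(x, W, theta):
    """Sketch of Algorithm 1: output of the MdPNN for one learning sample.

    x     : input vector of length m (lagged observations)
    W     : weight matrix of shape (m, k); column j holds the weights of hidden neuron j
    theta : bias vector of length k
    """
    # Eq. (5): each median neuron outputs the median of its weighted inputs plus its bias
    # (the activation f1 is linear, so no further transformation is applied).
    h = np.array([np.median(W[:, j] * x + theta[j]) for j in range(W.shape[1])])
    # Eq. (6): the output neuron multiplies the hidden outputs and applies the sigmoid f2.
    return 1.0 / (1.0 + np.exp(-np.prod(h)))
```

For instance, with m = 2 inputs and order k = 2, W is a 2 × 2 matrix and theta has two entries.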

3 Training of MdPNN by particle swarm optimization

Derivative-based algorithms have commonly been used to train ANNs in the literature. Back propagation algorithms are among the most preferred ones for MLPANNs. Artificial intelligence optimization techniques have also been used for the training of NNs, and they have important advantages such as working without derivatives and being less prone to getting trapped in local optima. A training algorithm based on PSO is proposed for the MdPNN. PSO is an artificial intelligence optimization technique and it can provide good results for numerical optimization problems. In PSO, there is no need to compute the derivative of the cost function. PSO was proposed by Kennedy and Eberhart [27]; Shi and Eberhart [28] and Ma et al. [29] made some modifications to the algorithm. The training of the proposed network is performed by using a modified PSO, because the median function depends on the ordering of the data and its derivatives are not easy to compute; PSO is suitable for this kind of objective function. Genetic algorithms or artificial bee colony optimization could be used instead of PSO, but PSO was preferred because of its simple structure.

The training algorithm for the proposed network is given in Algorithm 2. In Algorithm 2, a PSO method with some minor modifications of the modified PSO in Aladag et al. [9] is explained.

Algorithm 2

Training Algorithm for MdPNN

Step 1

    The parameters of the PSO are determined.

pn: Number of particles

vmaps: Upper bound for the velocities

c1i: Lower bound for the cognitive coefficient

c1f: Upper bound for the cognitive coefficient

c2i: Lower bound for the social coefficient

c2f: Upper bound for the social coefficient

w1: Lower bound for the inertia weight

w2: Upper bound for the inertia weight

maxitr: Maximum number of iterations

Step 2

    Initial positions and velocities of particles are generated.

The positions of the particles are composed of the weights and biases of the MdPNN. In Fig. 2, the structure of a particle is presented.

Fig. 2
figure 2

Structure of a particle

There are (k × m + k) positions in total in a particle. The first k × m positions represent the weights between the input and hidden layer neurons; the last k positions represent the biases of the hidden layer neurons. All initial position values are generated from the uniform distribution with parameters (0, 1). Position j and velocity j of particle i are denoted by \( {P}_{i, j}^t \) and \( {V}_{i, j}^t \) (i = 1 , 2 ,  …  , pn ; j = 1 , 2 ,  …  , k × m + k), respectively. Velocities are generated from the uniform distribution with parameters (−vmaps, vmaps).
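A minimal sketch of Step 2 is given below; the exact ordering of the weights inside a particle and the helper names are assumptions made for illustration.

```python
import numpy as np

def init_particles(pn, m, k, vmaps, rng=np.random.default_rng(0)):
    """Step 2: generate initial positions ~ U(0, 1) and velocities ~ U(-vmaps, vmaps)."""
    dim = k * m + k                          # k*m weights followed by k biases
    P = rng.uniform(0.0, 1.0, size=(pn, dim))
    V = rng.uniform(-vmaps, vmaps, size=(pn, dim))
    return P, V

def decode_particle(p, m, k):
    """Split one position vector into the weight matrix W (m x k) and bias vector theta (k)."""
    W = p[:k * m].reshape(m, k)
    theta = p[k * m:]
    return W, theta
```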

Step 3

    Fitness values for each particle are calculated.

To calculate the outputs of the network, Algorithm 1 is applied to each learning sample in the training set. Outputs and target values are represented by \( {\widehat{x}}_t \) and xt, respectively. The root mean squared error (RMSE) over the training set, given in Eq. (7), is used as the fitness function.

$$ {RMSE}_i=\sqrt{\frac{1}{negt}\sum_{t=1}^{negt}{\left({x}_t-{\widehat{x}}_t^i\right)}^2}\kern1em t=1,2,\dots, negt; i=1,2,\dots, pn $$
(7)

where \( {\widehat{x}}_t^i\kern0.5em \) represents the output of the network at time t from particle i and negt is the number of learning samples in the training set.
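A hedged sketch of the fitness computation in Eq. (7), reusing the mdpnn_output and decode_particle functions sketched above:

```python
import numpy as np

def fitness_rmse(p, X, targets, m, k):
    """Step 3 / Eq. (7): training-set RMSE for one particle.

    X       : array of shape (negt, m), one learning sample per row
    targets : array of length negt with the corresponding target values
    """
    W, theta = decode_particle(p, m, k)
    preds = np.array([mdpnn_output(x, W, theta) for x in X])
    return np.sqrt(np.mean((targets - preds) ** 2))
```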

Step 4

    Pbest and Gbest are determined.

Pbest is a matrix whose elements are the positions corresponding to each particle's best individual performance, and Gbest is the best particle found so far, i.e. the one with the best fitness function value. In the first iteration, Pbest is the same as the initial positions of the particles and Gbest is the best particle.

\( {Pb}_{i, j}^t \): Pbest value for the ith particle, jth position in the tth iteration.

\( {Pg}_j^t \): Gbest value for the jth position in the tth iteration.

Step 5

    The velocities and positions are updated.

Firstly, the cognitive and social coefficients and the inertia weight are calculated by using Eqs. (8)–(10).

$$ {c}_1^t=\left({c}_{1 f}-{c}_{1 i}\right)\frac{t}{maxitr}+{c}_{1 i} $$
(8)
$$ {c}_2^t=\left({c}_{2 f}-{c}_{2 i}\right)\frac{t}{maxitr}+{c}_{2 i} $$
(9)
$$ {w}^t=\left({w}_2-{w}_1\right)\frac{maxitr- t}{maxitr}+{w}_1 $$
(10)
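For illustration, Eqs. (8)–(10) can be computed as below; the function is a direct transcription of the formulas and its name is chosen for this sketch.

```python
def pso_coefficients(t, maxitr, c1i, c1f, c2i, c2f, w1, w2):
    """Eqs. (8)-(10): iteration-dependent cognitive/social coefficients and inertia weight."""
    c1 = (c1f - c1i) * t / maxitr + c1i          # Eq. (8)
    c2 = (c2f - c2i) * t / maxitr + c2i          # Eq. (9)
    w = (w2 - w1) * (maxitr - t) / maxitr + w1   # Eq. (10)
    return c1, c2, w
```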

Secondly, the velocities and positions are updated by using Eqs. (11)–(13).

$$ {V}_{i, j}^{t+1}=\left[{w}^t\times {V}_{i, j}^t+{c}_1^t\times {rand}_1\times \left({ P b}_{i, j}^t-{P}_{i, j}^t\right)+{c}_2^t\times {rand}_2\times \left({ P g}_j^t-{P}_{i, j}^t\right)\right] $$
(11)
$$ {V}_{i, j}^{t+1}= \min \left( vmaps,\mathit{\max}\left(- vmaps,{V}_{i, j}^{t+1}\right)\right) $$
(12)
$$ {P}_{i, j}^{t+1}={P}_{i, j}^t+{V}_{i, j}^{t+1} $$
(13)
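A sketch of the update step follows. Whether rand1 and rand2 are drawn once per particle or once per position is not specified above; drawing them per position, as in many PSO implementations, is assumed here.

```python
import numpy as np

def update_particles(P, V, Pb, Pg, c1, c2, w, vmaps, rng=np.random.default_rng(0)):
    """Eqs. (11)-(13): velocity update, velocity clamping and position update."""
    rand1 = rng.random(P.shape)
    rand2 = rng.random(P.shape)
    V_new = w * V + c1 * rand1 * (Pb - P) + c2 * rand2 * (Pg - P)   # Eq. (11)
    V_new = np.clip(V_new, -vmaps, vmaps)                           # Eq. (12)
    return P + V_new, V_new                                         # Eq. (13)
```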
Step 6

    Fitness values for each particle are calculated.

This step is applied in the same way as Step 3.

Step 7

    Pbest and Gbest are updated.

Step 8

The stopping criterion is checked.

The algorithm stops if the maximum number of iterations is reached or if the RMSE value for Gbest is smaller than a predetermined threshold. Otherwise, it goes back to Step 5.
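Putting the pieces together, a hedged end-to-end sketch of Algorithm 2 might look as follows. It reuses the functions sketched earlier; the threshold tol is an illustrative parameter, and the default PSO settings follow those reported in the application section.

```python
import numpy as np

def train_mdpnn(X, targets, m, k, pn=30, vmaps=1.0, c1i=2.0, c1f=1.0, c2i=1.0,
                c2f=2.0, w1=0.4, w2=0.9, maxitr=200, tol=1e-6,
                rng=np.random.default_rng(0)):
    """Sketch of Algorithm 2: PSO training of the MdPNN; returns the best (W, theta) and its RMSE."""
    P, V = init_particles(pn, m, k, vmaps, rng)                          # Step 2
    fit = np.array([fitness_rmse(p, X, targets, m, k) for p in P])       # Step 3
    Pb, Pb_fit = P.copy(), fit.copy()                                    # Step 4
    g = int(np.argmin(fit))
    Pg, Pg_fit = P[g].copy(), fit[g]
    for t in range(1, maxitr + 1):
        c1, c2, w = pso_coefficients(t, maxitr, c1i, c1f, c2i, c2f, w1, w2)  # Step 5
        P, V = update_particles(P, V, Pb, Pg, c1, c2, w, vmaps, rng)
        fit = np.array([fitness_rmse(p, X, targets, m, k) for p in P])   # Step 6
        better = fit < Pb_fit                                            # Step 7: update Pbest
        Pb[better], Pb_fit[better] = P[better], fit[better]
        if Pb_fit.min() < Pg_fit:                                        # update Gbest
            g = int(np.argmin(Pb_fit))
            Pg, Pg_fit = Pb[g].copy(), Pb_fit[g]
        if Pg_fit < tol:                                                 # Step 8: stopping rule
            break
    return decode_particle(Pg, m, k), Pg_fit
```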

4 Application results

The forecasting performance of the proposed network was first investigated on real time series datasets from CIF-2016. There are 72 time series in CIF-2016; they have different numbers of observations and are observed monthly. The first 20 time series, which are used in this paper, have 108 observations each and a seasonal component. The graphs of the time series are given in Figs. 3, 4, 5, 6 and 7.

Fig. 3
figure 3

Line graphs of time series 1–4

Fig. 4
figure 4

Line graphs of time series 5–8

Fig. 5
figure 5

Line graphs of time series 9–12

Fig. 6
figure 6

Line graphs of time series 13–16

Fig. 7
figure 7

Line graphs of time series 17–20

Observation dates of the time series cannot be given in these figures because they are not provided in CIF-2016. It is clear that all time series have different properties. Linear trend, upward trend, downward trend, quadratic trend, seasonality and structural breaks can be seen in Figs. 3, 4, 5, 6 and 7.

The results of the proposed MdPNN were compared with those of MdANN, TrMANN and PSNN, because the algorithms of the proposed NN model and of these methods are robust. Besides, 20 contaminated time series were analysed by adding outliers. In the contamination process, an outlier of five times or 10 times the maximum observation was injected into each time series. After these contamination processes, all time series were analysed by using MdPNN, MdANN, TrMANN and PSNN. All networks were trained by using PSO. The parameters of PSO were taken as pn = 30, vmaps = 1, c1i = 2, c1f = 1, c2i = 1, c2f = 2, w1 = 0.4, w2 = 0.9, maxitr = 200. The first 96 observations were used as the training set, and the last 12 observations were taken as the test set in all applications for all datasets.
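As an illustration of the contamination procedure, the following sketch replaces one observation with five or 10 times the series maximum; the exact placement of the injected outlier is an assumption, since it is not stated above.

```python
import numpy as np

def contaminate(series, factor=5, position=None, rng=np.random.default_rng(0)):
    """Inject a single outlier equal to `factor` times the maximum observation."""
    y = np.asarray(series, dtype=float).copy()
    idx = int(rng.integers(len(y))) if position is None else position
    y[idx] = factor * y.max()
    return y
```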

The inputs of the neural networks were taken as (xt − 1, xt − 2), (xt − 1, xt − 2, xt − 3, xt − 4) or (xt − 1, xt − 2, xt − 3, xt − 4, xt − 5, xt − 6, xt − 7, xt − 8, xt − 9, xt − 10, xt − 11, xt − 12); that is, m was taken as 2, 4 and 12. The order of the networks (the number of hidden layer neurons) was taken as 2. As a result of this experimental design, there are six possible cases for each dataset, as given in Table 1.

Table 1 The cases of applications
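A minimal sketch of how the lagged inputs and the 96/12 train-test split could be constructed; the helper name is illustrative.

```python
import numpy as np

def lagged_samples(series, m):
    """Build (input, target) pairs: inputs are (x_{t-1}, ..., x_{t-m}), target is x_t."""
    X = np.array([series[t - m:t][::-1] for t in range(m, len(series))])
    y = np.asarray(series[m:], dtype=float)
    return X, y

# Example usage following the paper's split:
# series = ...                              # one CIF-2016 monthly series with 108 observations
# X_train, y_train = lagged_samples(series[:96], m=2)   # first 96 observations for training
```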

For each time series, the NNs were trained 50 times with random initial weights in all cases. The RMSE criterion values were computed for the test sets. The means and minimums of the RMSE values over the 50 repetitions are given in Tables 2 and 3 for all time series. The best results in terms of the mean and minimum statistics are highlighted in bold for the five times and 10 times outliers in Tables 2 and 3. Moreover, the success rates of the models in Tables 2 and 3 are summarized in Tables 4 and 5. Detailed tables are given in the supplementary files.

Table 2 RMSE values for test set results for all datasets according to mean statistics
Table 3 RMSE values for test set results for all data sets according to minimum statistics
Table 4 Summarized results for all data sets according to mean statistics
Table 5 Summarized results for all data sets according to minimum statistics

Table 4 presents the success rates with regard to the mean statistic. In this table, it is seen that the MdPNN has the best performance for 17 out of 20 time series (85% success rate) in the case of the five times outlier. The proposed model also has the best performance for nine out of 20 time series (45% success rate) in the case of the 10 times outlier. Moreover, Table 5 presents the success rates with regard to the minimum statistic. This table shows that the MdPNN has the best performance for 19 out of 20 time series (95% success rate) in the case of the five times outlier, and the proposed model also has the best performance for 13 out of 20 time series (65% success rate) in the case of the 10 times outlier. These results are visualized in Figs. 8 and 9.

Fig. 8
figure 8

The graphs of summarized results for mean statistics

Fig. 9
figure 9

The graphs of summarized results for minimum statistics

The Kruskal-Wallis H-test was applied according to the minimum statistics. The results obtained from this test are given in Tables 6 and 7. In Table 6, it is clear that the probability value is smaller than 0.05, and in Table 7 the probability value is smaller than 0.10. So, there are differences among the applied methods in each case. Moreover, the proposed method has the minimum median value.

Table 6 Kruskal-Wallis H-test results obtained from minimum statistics for the five times outlier
Table 7 Kruskal-Wallis H-test results obtained from minimum statistics for the 10 times outlier
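Such a comparison could be carried out, for example, with scipy.stats.kruskal; the sketch below takes one list of per-series minimum RMSE values per method (the actual figures are those summarized in Tables 6 and 7, not reproduced here).

```python
from scipy.stats import kruskal

def compare_methods(rmse_by_method):
    """Kruskal-Wallis H-test across forecasting methods.

    rmse_by_method : dict mapping a method name (e.g. 'MdPNN', 'PSNN') to its list of
                     per-series minimum RMSE values.
    """
    stat, p_value = kruskal(*rmse_by_method.values())
    return stat, p_value
```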

Secondly, the Australian beer consumption data (Janacek [30], p. 84) between 1956 Q1 and 1994 Q1 were used to examine the performance of the proposed method. This time series is a well-known benchmark for time series forecasting in the literature. The last 16 observations of the time series were used as the test set. The graph of the time series is given in Fig. 10.

Fig. 10
figure 10

Australian beer consumption data

The Australian beer consumption time series was contaminated by adding an outlier (five times the maximum value) to the training data. The contaminated time series was analysed by the proposed network, and the results were compared with the results of the other methods, which were taken from Bas et al. [16]. The summarized RMSE values are given in Table 8.

Table 8 The test data results for Australian beer consumption time series

In Table 8, MNM-BP-ANN denotes the SMNMANN trained with the back propagation learning algorithm, MNM-PSO-ANN denotes the SMNMANN trained with PSO, and R-MNM-ANN denotes the robust SMNMANN. The best result was obtained from the proposed MdPNN.

5 Discussion and conclusions

Several kinds of NNs have been successfully used for time series forecasting problems for many years. Nevertheless, there are some issues about them that still need to be solved. One of them is how the performance of the models is affected when the datasets have outliers. In this paper, a new robust artificial neural network is introduced for time series forecasting. Moreover, the proposed NN model is a high-order neural network in addition to having a robust architecture. In the MdPNN, the median neuron model and the multiplicative neuron model are combined. According to the application results, it can be said that the MdPNN provides better forecasting performance than the other robust NNs in the literature. In particular, the performance of the MdPNN is better than that of the other methods in the case of the smaller outlier (five times the maximum value). For the outlier obtained by injecting 10 times the maximum observation into the datasets, the success rates of the MdPNN decrease to 45% and 65%, but they are still the best success rates.

The performance of the ANNs can be improved by using different robust statistics. From this point of view, in future studies, different robust statistics such as the trimmed mean can be used to modify the proposed neural network. Moreover, the training algorithm of the proposed method can be modified to handle a large number of outliers.