1 Introduction

A time series is a sequence of values of the same statistical indicator recorded in time order at a fixed interval; such data are typically large in volume, noisy, and rapidly updated. Time series are an important and complex kind of data that exist widely in many fields, such as gross domestic product (GDP) (Jovic et al. 2019), stock prices (Wang et al. 2011; Zhu and Wang 2010; Gupta and Wang 2010; Fang et al. 2014), population (Chi et al. 2019), unemployment rates (Li et al. 2014), traffic flow (Hou and Mai 2013; Yang and Hu 2016), precipitation (Ramana et al. 2013) and carbon dioxide concentration (Besteiro et al. 2017), among others (Wang et al. 2001).

Time series prediction forecasts future data from existing historical data. Since a time series contains a great deal of information and regularity, it is very important to uncover the hidden rules in the data and predict unknown situations in advance as precisely as possible (Taylor et al. 2006; Li and Shi 2010; Camara 2016). With accurate forecasts, people can plan work and take measures ahead of time to prevent unfavorable situations and minimize losses (Samsudin et al. 2010). For example, stock price forecasts can help investors avoid risks effectively (Selvamuthu et al. 2019); precipitation forecasts allow preventive work to be done in advance (Devi et al. 2017); and power load forecasts can guide the decisions of power market participants (Bozkurt et al. 2017).

Time series prediction uses statistical techniques to build a mathematical model that takes past values as input and future values as output, finds a function that captures how the sequence changes, and then quantitatively estimates the future trend of the data (Ding et al. 2008; Jia 2014). In previous studies, most scholars first judged the nature of the sequence and then selected an appropriate model. If the time series is approximately linear, a linear prediction method can be used, chiefly the autoregressive model, the moving average model, the autoregressive moving average model and so on. These models require a linear functional relationship between the future and historical values of the series; otherwise they produce large prediction errors, so a nonlinear method should be used when the data are nonlinear. Time series collected under real conditions are usually complex and nonlinear. Artificial neural networks are self-organizing and strongly nonlinear, which suits them to complex nonlinear problems (Li et al. 2013; Szoplik 2015; Doucoure et al. 2016): they can discover rules from the sample data autonomously and approximate nonlinear functions to arbitrary precision. These advantages give neural networks good performance in nonlinear prediction, and they are widely used for time series prediction.

The above concerns time series with a purely linear or purely nonlinear relationship. When a nonlinear time series also contains a non-negligible linear component, a BP neural network, with its highly nonlinear fitting characteristics, may fail to express the implicit relationship in the sample data fully, and the accuracy of the prediction declines. To address this, this paper adopts an improved BP neural network (BPNN–DIOC) that adds direct input-to-output connections to the BPNN. Eight time series datasets are then used to compare the prediction performance of the BPNN–DIOC network against the plain BPNN.

2 Description of neural network

2.1 Back-propagation neural network (BPNN)

An artificial neural network (ANN) is a data processing system consisting of a large number of simple, highly interconnected processing elements (Cui et al. 2005). It abstracts and simulates the human brain, imitating its complex parallel information processing to grasp the internal rules of the data and achieve a highly nonlinear mapping.

The most widely used artificial neural network is the BPNN, a multilayer feedforward network trained with the error back-propagation (BP) learning algorithm. In practice, the structure of the BPNN must be determined first: the number of hidden layers and the number of neurons in the input, hidden and output layers. For the number of hidden layers, the universal approximation theorem shows that a three-layer BPNN can approximate any continuous function, so a single hidden layer is generally sufficient (Hornik et al. 1989). For the number of neurons per layer, the input and output node counts follow from the dimensions of the training samples, while there is no definitive method for choosing the number of hidden nodes.

Figure 1 shows the structure of the BPNN: it consists of three parts, the input layer, the hidden layer and the output layer. There are no connections between neurons within the same layer or between non-adjacent layers; only forward connections exist between neurons in adjacent layers. The BPNN therefore has an outstanding nonlinear mapping ability: the relationship between the input and output neurons is represented by n nonlinear terms (n being the number of hidden neurons). The corresponding output of the BPNN is:

$$\begin{aligned} {{O}_{k}}= & {} \sum \limits _{j=1}^{n}{{{{w}}_{kj}}}{{y}_{j}}+{{\gamma }_{k}}\end{aligned}$$
(1)
$$\begin{aligned} {{y}_{j}}= & {} f\left( \sum \limits _{i=1}^{m}{{{w}_{ji}}}{{x}_{i}}+{{\theta }_{j}}\right) \end{aligned}$$
(2)

where \({{O}_{k}}\) is the output vector and \({{y}_{j}}\) is the hidden-layer output; n is the number of hidden nodes and m is the number of input-layer neurons; \({{{w}}_{kj}}\) is the weight between the hidden and output nodes; \({{w}_{ji}}\) is the weight between the input and hidden nodes; \({{\gamma }_{k}}\) is the threshold of the output-layer neurons and \({{\theta }_{j}}\) is the threshold of the hidden neurons; f is the transfer function of the hidden neurons.
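To make Eqs. 1 and 2 concrete, the following minimal sketch computes this forward pass in Python/NumPy. The logistic transfer function and the layer sizes are illustrative assumptions, not a transcription of the experiments reported later.

```python
import numpy as np

def logsig(v):
    """Logistic sigmoid, used here as the hidden-layer transfer function f."""
    return 1.0 / (1.0 + np.exp(-v))

def bpnn_forward(x, W_hid, theta, W_out, gamma):
    """Forward pass of a three-layer BPNN (Eqs. 1 and 2).

    x     : (m,)   input vector
    W_hid : (n, m) input-to-hidden weights w_ji
    theta : (n,)   hidden thresholds
    W_out : (q, n) hidden-to-output weights w_kj
    gamma : (q,)   output thresholds
    """
    y = logsig(W_hid @ x + theta)  # Eq. 2: nonlinear hidden-layer output
    O = W_out @ y + gamma          # Eq. 1: linear output layer
    return O, y

# Illustrative shapes: m = 12 inputs, n = 8 hidden nodes, q = 1 output
rng = np.random.default_rng(0)
m, n, q = 12, 8, 1
O, _ = bpnn_forward(rng.standard_normal(m),
                    rng.standard_normal((n, m)), rng.standard_normal(n),
                    rng.standard_normal((q, n)), rng.standard_normal(q))
```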

Fig. 1 The topology of BPNN

2.2 Back-propagation neural network with direct input-to-output connections (BPNN–DIOC)

The BPNN has so far been used to achieve a purely nonlinear mapping between input and output. In real life, however, most problems combine nonlinear and linear components, so the BPNN may not express the implicit relationship between the input and output samples completely and accurately.

In fact, the learning algorithm is not the only factor affecting the prediction accuracy of a BPNN: the network topology also influences its prediction performance and its ability to generalize to unknown samples.

2.2.1 Overview of previous work

Peng et al. (1992) proposed an improved neural network algorithm featuring a combined representation of linear and nonlinear terms mapping input to output. Pao et al. (1994) proposed the random vector functional-link (RVFL) network, which combines random weights and functional links and has a direct connection from the input layer to the output layer. Looney (1996) extended the radial basis function neural network (RBFNN) architecture to the more robust radial basis function link network (RBFLN), which also has a direct input-to-output connection and obtains more accurate results than RBF neural networks. Such networks, however, were not fully researched and developed thereafter. Ren et al. (2016) and Zhang and Suganthan (2016) demonstrated on prediction and classification examples that the RVFL network, which adds input-to-output connections to the random weight single-hidden-layer feedforward network (RWSLFN), generalizes better than the plain RWSLFN; that is, the input-to-output connections have a significant positive impact on the network's predictions.

2.2.2 Structure of BPNN–DIOC

Inspired by the above work, this paper adopts the BPNN–DIOC model, which augments the highly nonlinear fitting capability of the BPNN so that it can solve problems combining nonlinear and linear components.

Figure 2 shows the structure of the BPNN–DIOC. It adds direct linear input-to-output connections to the conventional BPNN, realizing a combination of linear and nonlinear mappings of the input variables: the relationship between input and output is approximated by m linear terms and n nonlinear terms. The corresponding output of the BPNN–DIOC is:

$$\begin{aligned} {{O}_{k}}=\sum \limits _{i=1}^{m}{{{\beta }_{ki}}}{{x}_{i}}+\sum \limits _{j=1}^{n}{{{w}_{kj}}}{{y}_{j}}+{{\gamma }_{k}} \end{aligned}$$
(3)

where \({{\beta }_{ki}}\) is the linear connection weight from the input to the output neurons; the remaining parameters are as defined in Eqs. 1 and 2.
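Continuing the earlier sketch, the only change Eq. 3 makes to the forward pass is the extra linear term \({\beta }x\); the function below is again illustrative and reuses the names from the BPNN sketch.

```python
import numpy as np

def logsig(v):
    return 1.0 / (1.0 + np.exp(-v))

def bpnn_dioc_forward(x, W_hid, theta, W_out, gamma, beta):
    """Forward pass of BPNN-DIOC (Eq. 3).

    beta : (q, m) direct input-to-output weights beta_ki; the remaining
    arguments are as in the plain BPNN sketch above.
    """
    y = logsig(W_hid @ x + theta)        # nonlinear hidden path (Eq. 2)
    return beta @ x + W_out @ y + gamma  # Eq. 3: linear + nonlinear terms
```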

Like the BPNN, the BPNN–DIOC uses a training algorithm to adjust the network parameters iteratively. The main difference between the two models is that the direct input-to-output connections of the BPNN–DIOC explicitly capture the linear components of the data, which the plain BPNN leaves to the nonlinear hidden layer.

Fig. 2 The topology of BPNN–DIOC

3 Experimental configuration

3.1 Datasets

This paper selects eight common time series datasets to explore the performance of the BPNN–DIOC model in time series prediction. Their statistical features, including the length, minimum/maximum, median, average and standard deviation of each dataset, are shown in Table 1.

Table 1 Summary of the eight time series datasets
Table 2 BPNN with different configurations

3.2 Variations on BPNN

The difference between the BPNN–DIOC and the BP neural network is whether there is a direct mapping from the input layer to the output layer. In this paper, four network models are obtained according to whether input-to-output connections and an output-layer threshold are added to the BPNN. The four configurations and their formulas are shown in Table 2. M1 and M3 denote models whose input layer is not connected to the output layer;

M2 and M4 denote models whose input layer is connected to the output layer. In Table 2, h is the output of the hidden-layer neurons; O is the output of the output-layer neurons; X is the network input; \({W}_{1}\) is the connection weights from the input layer to the hidden layer; \({W}_{21}\) is the connection weights from the input layer to the output layer; \({W}_{22}\) is the connection weights from the hidden layer to the output layer; \(\theta \) is the threshold of the hidden neurons; \(\beta \) is the threshold of the output neurons; and f is the transfer function.
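Since the four formulas of Table 2 differ only in which of the two ingredients they include, all four configurations can be expressed by toggling two options. The sketch below uses the notation defined above; the exact assignment of the output threshold to each of M1–M4 is left to Table 2 itself.

```python
import numpy as np

def logsig(v):
    return 1.0 / (1.0 + np.exp(-v))

def variant_output(X, W1, theta, W22, W21=None, beta=None):
    """Output of one BPNN variant in the notation of Table 2.

    W21 is None  -> no direct input-to-output connections (as in M1, M3)
    beta is None -> no output-layer threshold
    """
    h = logsig(W1 @ X + theta)  # hidden-layer output h
    O = W22 @ h                 # nonlinear path
    if W21 is not None:
        O = O + W21 @ X         # direct linear path (as in M2, M4)
    if beta is not None:
        O = O + beta            # output-layer threshold
    return O
```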

4 Assessment on eight time series data

Time series prediction infers future values from historical data. For a time series \(\{{x}_{n}\}\), its general form can be described as:

$$\begin{aligned} {{x}_{n+k}}=f({x}_{n},{x}_{n-1},\ldots ,{x}_{n-(m-1)}) \end{aligned}$$
(4)

where k is the number of prediction steps and m is the input dimension of the model. When \({{k}=1}\), this is the simplest single-step prediction; when \({k>1}\), it is multi-step prediction. This article discusses only single-step prediction of time series, i.e., multiple past time steps are used to predict the next time step in a rolling manner.

4.1 Select input and output variables

In this paper, eight datasets are selected. Datasets 1–5 are monthly (one value per month); dataset 6 is weekly (one value per week); datasets 7 and 8 contain one value every half hour. The input and output pattern of the samples is shown in Table 3.
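Given the 12-input/1-output pattern of Table 3, training samples can be built from a raw series with a sliding window. The following sketch assumes a window length of m = 12 and single-step targets (Eq. 4 with k = 1); the synthetic series exists only to make the example runnable.

```python
import numpy as np

def make_samples(series, m=12):
    """Slice a 1-D series into (input, target) pairs for single-step
    prediction: X[i] = (x_i, ..., x_{i+m-1}), y[i] = x_{i+m}."""
    series = np.asarray(series, dtype=float)
    X = np.stack([series[i:i + m] for i in range(len(series) - m)])
    y = series[m:]
    return X, y

# Synthetic monthly-style series, for illustration only
t = np.arange(120)
s = np.sin(2 * np.pi * t / 12) + 0.1 * np.random.default_rng(1).standard_normal(120)
X, y = make_samples(s)  # X.shape == (108, 12), y.shape == (108,)
```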

Table 3 The input and output patterns for neural network training
Table 4 Weights and thresholds of linear neural network after training

Following the controlled-variable method, the same initial conditions were used for the different models to remove their influence on the experimental results. The number of hidden neurons was tested from 1 to 30 to find the value giving the best test-set accuracy. Owing to the randomness of neural network training, each network structure was trained 10 times and the average prediction accuracy on the test set was calculated. Finally, the optimal topology was used for rolling prediction on the eight datasets.
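A sketch of this search procedure is given below. scikit-learn's MLPRegressor stands in for the MATLAB networks actually used in the paper, so the training details differ; the 70/30 split, the 1–30 sweep and the 10 repetitions follow the text.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def best_hidden_size(X, y, max_hidden=30, repeats=10):
    """Sweep 1..max_hidden hidden neurons, train `repeats` networks per
    size, and return the size with the lowest mean test RMSE."""
    split = int(0.7 * len(X))  # first 70% for training, rest for testing
    X_tr, y_tr, X_te, y_te = X[:split], y[:split], X[split:], y[split:]
    mean_rmse = {}
    for size in range(1, max_hidden + 1):
        errs = []
        for seed in range(repeats):  # average over repeated random inits
            net = MLPRegressor(hidden_layer_sizes=(size,),
                               activation='logistic',
                               max_iter=2000, random_state=seed)
            net.fit(X_tr, y_tr)
            errs.append(np.sqrt(np.mean((y_te - net.predict(X_te)) ** 2)))
        mean_rmse[size] = np.mean(errs)
    best = min(mean_rmse, key=mean_rmse.get)
    return best, mean_rmse
```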

4.2 Error measures

Many factors affect the data, including unpredictable, unknown and unexpected conditions, so errors inevitably occur in prediction. They cannot be eliminated, only reduced by improving the prediction method or learning algorithm. In this paper, to analyze the prediction performance of the four neural network variants, the root mean square error (RMSE) and the mean absolute percentage error (MAPE) are used to measure the predictive performance of the network. They are defined in Eqs. 5 and 6.

$$\begin{aligned} \hbox {RMSE}= & {} \sqrt{\frac{{1}}{{n}}{{\sum \limits _{k=1}^{n}{\left( {T}_{k}-{O}_{k} \right) }^{2}}}}\end{aligned}$$
(5)
$$\begin{aligned} \hbox {MAPE}= & {} \frac{{1}}{{n}}\sum \limits _{k=1}^{n}{\left| \frac{{T}_{k}-{O}_{k}}{{T}_{k}} \right| }\times \text {100 }\% \end{aligned}$$
(6)

where T is the target vector, O is the output vector and n is the length of the data. RMSE is an extension of the mean square error (MSE), and MAPE is the measure preferred by industry practitioners.
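Both measures are straightforward to implement; a minimal NumPy version:

```python
import numpy as np

def rmse(T, O):
    """Root mean square error, Eq. 5."""
    T, O = np.asarray(T, float), np.asarray(O, float)
    return np.sqrt(np.mean((T - O) ** 2))

def mape(T, O):
    """Mean absolute percentage error, Eq. 6; targets must be nonzero."""
    T, O = np.asarray(T, float), np.asarray(O, float)
    return np.mean(np.abs((T - O) / T)) * 100.0
```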

4.3 Prediction results and analysis

For each time series, the first 70% was used for training and the remaining 30% was used for testing. Due to the randomness of neural network training, each network structure was trained 10 times.

4.3.1 Linear analysis

The linear neural network has a structure similar to the single-layer perceptron: it is also composed of an input layer and an output layer, and the output-layer neurons perform the information processing. The only difference between them is that the perceptron's activation function is a hard-limit transfer function, while the linear neuron uses the linear transfer function \({\hbox {purelin}}\), so the output of the linear neural network can take arbitrary values rather than only two. The output of the linear neural network is calculated by formula 7:

$$\begin{aligned} y=\hbox {purelin}(v)=\hbox {purelin}(\mathbf {w}\cdot \mathbf {p}+b)=\mathbf {w}\cdot \mathbf {p}+b \end{aligned}$$
(7)

It can be seen from the above formula that the linear neural network can approximate a linear function, but cannot approximate a nonlinear one.
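Because purelin is the identity, training the linear neural network is equivalent to a linear least-squares fit. The sketch below recovers the weights w and threshold b of Eq. 7, and hence the per-dataset relation of Eq. 8, in closed form; this replaces the iterative training a MATLAB linear network would perform, but it is the minimum-MSE solution that such training approaches.

```python
import numpy as np

def fit_linear_network(X, y):
    """Least-squares fit of y = X w + b, equivalent to a trained linear
    neuron with the purelin transfer function (Eqs. 7 and 8)."""
    A = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]  # weights w_1..w_m and threshold b

# Usage with the sliding-window samples from Sect. 4.1:
# w, b = fit_linear_network(X, y)
```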

Fig. 3 The RMSE curve of M3 and M4 models in dataset 7

To analyze whether there is a linear factor in the system, this paper first uses linear neural networks to predict the time series. Table 4 shows the weights and thresholds of the network after training on each dataset; that is, each dataset can be expressed by the linear relationship shown in Eq. 8.

$$\begin{aligned} {{x}_{13}}={{w}_{1}}{{x}_{1}}+{{w}_{2}}{{x}_{2}}+{{w}_{3}}{{x}_{3}}+\cdots +{{w}_{12}}{{x}_{12}}+b \end{aligned}$$
(8)

4.3.2 Performance evaluation between different models

In this paper, the experiments were run in MATLAB R2012a. For the four BPNN variants, the transfer functions of the hidden and output layers are logsig and purelin, respectively, and the training function is traingda. The number of hidden neurons was set to 1–30 and the networks were trained in turn; the average RMSE and MAPE were then taken as the prediction accuracy of each network structure. To make the results comparable, the four models were trained under the same initial conditions on each dataset.

Table 5 Optimal hidden nodes

Hidden layer node optimization

The experiment was repeated while adjusting the number of hidden nodes, with the other parameters unchanged, and the optimal number of hidden neurons was determined by the minimum output error during training. The optimal number for each dataset and each of the four network structures is shown in Table 5. Figures 3 and 4 show the RMSE and MAPE curves for dataset 7 as the number of hidden neurons increases during the training of the M3 and M4 models, respectively.

Apparently, the BPNN–DIOC model needs far fewer hidden neurons than the conventional BPNN. Adding input-to-output connections to the BPNN thus reduces the number of hidden neurons required and removes input-to-hidden weights that matter little to the training result. Consequently, the BPNN–DIOC model greatly simplifies the network structure and reduces the amount of weight adjustment.

Fig. 4 The MAPE curve of M3 and M4 models in dataset 7

Table 6 The average RMSE and MAPE of four different models and linear neural network

Performance optimization

The prediction results of the eight datasets using linear neural networks are shown in the right column of Table 6. The performance of the four models was measured by RMSE and MAPE, and the averages are tabulated in Table 6. Compared with the BPNN, the RMSE and MAPE of the networks with input-to-output connections decreased significantly on datasets 1–8, whereas there is no significant difference between networks with and without the output threshold. The prediction structure of the linear neural network is shown in Fig. 5, and Fig. 6 gives the percentage improvement in RMSE of BPNN–DIOC together with the RMSE of the linear neural network.

From Table 6 and Fig. 6, in datasets 1, 3, 4 and 6 the linear neural network predicts about as well as the BPNN, indicating a degree of linear predictability in the data; correspondingly, BPNN–DIOC greatly improves the prediction accuracy over the BPNN, reducing the RMSE from 0.0992 to 0.0329, from 0.0653 to 0.0383, from 0.1348 to 0.1070 and from 0.0784 to 0.0517, respectively. In datasets 2 and 5 the linear neural network predicts poorly and differs considerably from the BPNN, indicating no obvious linear predictability; accordingly, BPNN–DIOC brings a smaller improvement, reducing the RMSE from 0.0821 to 0.0662 and from 0.0592 to 0.0427, respectively. Since datasets 7 and 8 are sampled every half hour, the 12-dimensional-input/1-dimensional-output structure cannot interpret the relationships in the data well, so the linear neural network's results bear little relation to those of BPNN–DIOC; nevertheless, BPNN–DIOC still outperforms the BPNN. Therefore, for data containing a linear relationship, BPNN–DIOC plays an important role in time series prediction and achieves better prediction accuracy than the BPNN.

Fig. 5 Prediction structure of linear neural network

Fig. 6 Improved percentage of RMSE for BPNN–DIOC and the RMSE for linear neural network

The Wilcoxon signed-ranks test is a non-parametric alternative to the paired t test. It ranks the differences in performance of two models on each dataset, ignoring the signs, and compares the ranks of the positive and negative differences. The differences are ranked according to their absolute values; ranks of \({{d_i} = \mathrm{{ }}0}\) are split evenly between the two sums, and if there is an odd number of them, one is ignored. Let N be the number of pairs.

$$\begin{aligned} {R^ + }= & {} \sum \limits _{{d_i} > 0} {\hbox {rank}({d_i}) + \frac{1}{2}} \sum \limits _{{d_i} = 0} {\hbox {rank}({d_i})}\end{aligned}$$
(9)
$$\begin{aligned} {R^ - }= & {} \sum \limits _{{d_i} < 0} {\hbox {rank}({d_i}) + \frac{1}{2}} \sum \limits _{{d_i} = 0} {\hbox {rank}({d_i})}\end{aligned}$$
(10)
$$\begin{aligned} T= & {} \min ({R^ + },{R^ - }) \end{aligned}$$
(11)
$$\begin{aligned} z= & {} \frac{{T - \frac{1}{4}N(N + 1)}}{{\sqrt{\frac{1}{{24}}N(N + 1)(2N + 1)} }},\quad (N(N + 1)/2 > 25)\nonumber \\ \end{aligned}$$
(12)

With \({\alpha = \mathrm{{ }}0.05}\), the null hypothesis can be rejected if z is smaller than the critical value \(-1.96\).
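Eqs. 9–12 can be transcribed directly; the sketch below uses SciPy only for the tie-aware ranking. For the small N of this paper, exact tables (as consulted automatically by scipy.stats.wilcoxon) are more appropriate than the normal approximation, so the z value here is indicative only.

```python
import numpy as np
from scipy.stats import rankdata

def wilcoxon_z(perf_a, perf_b):
    """Wilcoxon signed-rank statistic following Eqs. 9-12.

    perf_a, perf_b: per-dataset errors of two models. With alpha = 0.05
    the null hypothesis is rejected when z < -1.96 (normal approximation).
    """
    d = np.asarray(perf_a, float) - np.asarray(perf_b, float)
    ranks = rankdata(np.abs(d))  # ranks of |d_i|, ties get average ranks
    r_plus = ranks[d > 0].sum() + 0.5 * ranks[d == 0].sum()   # Eq. 9
    r_minus = ranks[d < 0].sum() + 0.5 * ranks[d == 0].sum()  # Eq. 10
    T = min(r_plus, r_minus)                                  # Eq. 11
    N = len(d)
    z = (T - N * (N + 1) / 4) / np.sqrt(N * (N + 1) * (2 * N + 1) / 24)  # Eq. 12
    return T, z
```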

To explore whether the output bias has a significant effect on the prediction results, we applied the Wilcoxon signed-rank test to the two pairs M1/M3 and M2/M4. The p values shown in Tables 7 and 8 are greater than 0.05, indicating that the output bias has no significant effect on prediction.

To explore whether the input-to-output connections have a significant effect on the prediction results, we applied the Wilcoxon signed-rank test to the two pairs M1/M2 and M3/M4. The p values shown in Tables 7 and 8 are smaller than 0.05, indicating that the input-to-output connections have a significant effect on prediction.

In general, BPNN–DIOC improves the prediction accuracy owing to the input-to-output connections, which strengthen the network's description of the linear relationships in the time series data and improve its generalization ability.

Table 7 Wilcoxon signed-rank test of the BPNN whether has the output bias
Table 8 Wilcoxon signed-rank test of the BPNN whether has the input-to-output connections

5 Conclusion

When the target system contains a linear component, the traditional approach is to approximate the system with a BPNN, whose fitting characteristics are highly nonlinear. A nonlinear network will undoubtedly approximate a linear system worse than a linear model does. This paper therefore adopts BPNN–DIOC for time series prediction: by adding a linear connection between the input layer and the output layer of the base BPNN, a combined linear and nonlinear network is formed that enhances the generalization ability of the network and expresses the implicit relationship between the input and output samples more fully. Based on eight datasets, this paper examines the influence of the input-to-output connections and the output-layer bias on the prediction results, and the following conclusions can be drawn:

  1. During network training, BPNN–DIOC needs fewer hidden neurons than BPNN by virtue of its input-to-output connections, deleting input-to-hidden weights that matter little to the training result. The total number of connections in the BP network is reduced whenever \({\left( {m + q} \right) \times p > m \times q}\), where m is the number of input-layer nodes, q is the number of output-layer nodes and p is the reduction in the number of hidden-layer nodes. For example, with \(m = 12\) inputs and \(q = 1\) output, removing \(p = 2\) hidden nodes deletes \((12+1)\times 2 = 26\) connections against the 12 direct connections added, a net reduction. BPNN–DIOC can therefore simplify the network structure greatly.

  2. The direct input-to-output connections improve the prediction accuracy significantly; moreover, the better the linear fit of the data, the better the prediction effect of BPNN–DIOC. The output bias, by contrast, has no significant effect on the network's prediction result.

  3. The prediction result of the linear neural network is related to that of BPNN–DIOC. In general, for data containing a linear relationship, BPNN–DIOC plays an important role in time series prediction and obtains better prediction accuracy than BPNN.

Therefore, adding a connection from input to output to the BPNN maps the input to the output of the network more completely and describes the characteristics of time series data more accurately. The BPNN–DIOC network thus provides a more general framework for prediction models.