1 Introduction

A time series is a sequence of observations collected at a fixed sampling interval. Many dynamic processes in the real world can be modeled as time series, such as stock price movements, weather variables, sunspot numbers, and disease incidence. Predicting these processes to support production and daily life requires knowledge of time series prediction (TSP). TSP refers to predicting future trends by building models from historical time series. Time series can be divided into stationary time series and non-stationary time series (NS-TS). In the narrow (strict) sense, stationarity means that the probability distribution of a sequence does not change over time; in the broad (weak) sense, it means that the first and second moments of the sequence exist, the mean is constant, and the covariance depends only on the time lag between observations. Conversely, non-stationarity means that the mean and covariance of a sequence change over time. These changes are driven by many factors: some play a long-term, decisive role, so that the sequence exhibits a certain trend and regularity, while others play a short-term, less decisive role.
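For reference, the weak (wide-sense) stationarity conditions described above can be stated compactly as follows (a standard textbook formulation rather than notation introduced in this paper):

$$ \mathrm{E}\left[X_t\right]=\mu ,\qquad \operatorname{Var}\left(X_t\right)={\sigma}^2<\infty ,\qquad \operatorname{Cov}\left(X_t,X_{t+k}\right)=\gamma (k)\quad \text{for all } t $$

that is, the mean is constant and the autocovariance depends only on the lag k; a non-stationary series violates at least one of these conditions.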

Many traditional machine learning methods solve problems with batch or off-line learning paradigms. With the explosion of real-time data, it has become increasingly difficult for off-line algorithms to remain effective. To handle data that arrive continuously rather than all at once, the on-line learning paradigm has been proposed. Generally speaking, on-line learning has two main application scenarios: improving existing off-line learning methods to enhance their efficiency and scalability, and applying on-line learning directly to cope with newly arriving samples. As a classic on-line forecasting task, non-stationary time series prediction (NS-TSP) requires, as a basic premise, that any algorithm proposed for it possesses on-line forecasting ability.

NS-TSP poses the following challenges: 1) an NS-TS is often a deterministic chaotic time series that is sensitive to initial conditions, so using historical data for long-term prediction is unrealistic; 2) an NS-TS has non-linear and non-stationary characteristics, which leads to the “stability-plasticity” dilemma, and balancing the two is difficult; 3) an NS-TS is highly noisy, so no model can capture the complete information of the sequence and the influence of noise must be suppressed; 4) some NS-TS are periodic, and responding accurately and quickly when a historical environment reappears is a question worth considering.

Early researchers mainly focused on linear and stationary TSP problems. However, with the development of theory and technology, increasing evidence shows that real-world time series are mostly NS-TS, and prediction methods tailored to NS-TS have therefore been proposed. The existing NS-TSP methods can be roughly divided into three categories:

  1. 1)

    Traditional statistical methods. The traditional statistical methods are the earliest methods applied to NS-TSP, among which the representative methods are the Auto-Regressive Integrated Moving Average (ARIMA) model [1] and the Generalized Auto-Regressive Conditional Heteroscedasticity (GARCH) model [2]. The ARIMA model has been widely used in NS-TSP tasks. However, it must first render the NS-TS stationary through differencing, which inevitably affects its prediction accuracy and efficiency. The GARCH model uses past changes and variances to predict future changes and can fit sequences with fluctuating characteristics well. However, like ARIMA, it also requires the NS-TS to be made stationary first.

  2. 2)

    Computational intelligence methods. Representative computational intelligence methods include Artificial Neural Networks (ANNs) [3], Fuzzy Logic (FL) [4], and Support Vector Machines (SVMs) [5]. As nonlinear prediction models, ANNs have good self-learning and function-approximation abilities and are widely used in NS-TSP. However, ANNs often require a large amount of training data, their structures and parameters are difficult to determine, their convergence is slow, and they easily fall into local optima. FL captures uncertain knowledge by fuzzifying the NS-TS in a way that simulates human reasoning, and then transforms it into precise prediction values to complete the prediction task; however, generating suitable fuzzy rules remains a challenging issue. SVMs have a simple structure and strong generalization ability and have been applied successfully to NS-TSP, but they are only suitable for small-sample learning.

  3. 3)

    Combination prediction methods. Experiments show that it is difficult for a single model or method to fully capture the overall change law of the prediction object, especially for tasks with highly uncertain characteristics such as NS-TSP; therefore, combination prediction methods have emerged and become the most popular methods at present. Combination prediction methods obtain comprehensive information by combining multiple methods, including the traditional statistical methods and computational intelligence methods mentioned above, to improve prediction performance and stability. There are three ways to implement combination forecasting methods: 1) through incremental learning (IL) [6] and ensemble learning (EL) [7], such as the research work carried out in [8, 9]; 2) through a staged approach, such as the research work carried out in [10], whose basic idea is to decompose the NS-TS, use several models to predict the components separately, and then combine their prediction results; 3) through a monitoring mechanism, such as the research work carried out in [11], whose basic idea is to track changes by monitoring the performance of models and to trigger different model update mechanisms according to the degree of change.

    The mainstream IL and EL methods for non-stationary environments include the sliding window technique, the concept drift detection technique, and the data block technique. The sliding window technique is a widely used approach to non-stationary forecasting tasks. It uses a simple forgetting mechanism to maintain a set of the newest data so as to capture data changes in the current environment. Determining an appropriate window size is a very important problem in the sliding window technique. A large window contains enough information for the model to maintain good generalization performance when the data distribution is stable; however, with a large window the model cannot be updated promptly to adapt to data changes. Conversely, a small window can respond to changes in a non-stationary environment in a timely manner, but it may also result in poor performance due to the limited information available. Therefore, adaptive sliding window techniques, whose window size adapts autonomously according to relevant rules, can better adapt to non-stationary environments. Different from the sliding window technique, the concept drift detection technique is an explicit method for dealing with non-stationary environments: it selects an appropriate model update method by measuring the degree of data change. In the data block technique, each base model is trained on a data block, so each base model captures the historical data distribution of its training block, which increases the diversity and stability of the ensemble model.
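As a minimal illustration of the sliding-window idea discussed above (not the mechanism used by DWE-IL), the following Python sketch keeps only the most recent observations in a fixed-size buffer and refits on them; the training routine `fit_fn` and the window size are placeholders for whatever base learner and rule one chooses.

```python
from collections import deque

class SlidingWindowForecaster:
    """Keep only the `window` newest (x, y) pairs and retrain on them.

    `fit_fn` is a user-supplied training routine; it is an assumption of this
    sketch and not part of any algorithm described in this paper.
    """
    def __init__(self, window, fit_fn):
        self.buffer = deque(maxlen=window)   # oldest samples are forgotten automatically
        self.fit_fn = fit_fn
        self.model = None

    def update(self, x, y):
        self.buffer.append((x, y))           # newest sample replaces the oldest when full
        xs, ys = zip(*self.buffer)
        self.model = self.fit_fn(xs, ys)     # retrain only on the data inside the window
        return self.model
```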

IL refers to the process in which a model continuously learns new knowledge from new data while retaining most of the previously learned knowledge, which closely resembles how humans learn. The idea of IL is that, when new data arrive, it is not necessary to re-learn all the data; instead, the original knowledge is updated with the new data. The IL paradigm has been extensively employed in various research areas, such as image recognition [12], document classification [13], and intrusion detection [14, 15]. The IL algorithms used in NS-TSP need to meet the following conditions [16, 17]:

  1. a)

    Any training dataset is learned only once and is not used for subsequent training. The knowledge learned is stored incrementally in the model parameters.

  2. b)

    Because the latest dataset can best represent the current environment, knowledge should be classified according to the relationship between knowledge and the current environment, and knowledge should be dynamically updated.

  3. c)

    The model should have a strategy to coordinate the conflict between old knowledge and new knowledge, that is, there should be a mechanism to monitor the performance of the model on new data and old data.

  4. d)

    The model should forget or discard knowledge that is no longer relevant but can recall it when the environment reappears.

The EL paradigm is a process of generating and combining multiple models according to a certain strategy to solve a specific computational intelligence problem [18]. It has three elements: data sampling, training base models, and combining base models. EL algorithms have been widely utilized in a variety of research areas, such as face recognition [19, 20], text classification [21], expression analysis [22], and several other important research fields. The EL methods used in NS-TSP can be divided into three categories:

  1. a)

    Changing the combination rule of previously trained base models to adapt to the non-stationary environment, such as weighted majority voting and Winnow-based algorithms [23].

  2. b)

    Updating the online model and all ensemble members with the new dataset, such as the online boosting algorithm [24].

  3. c)

    Adding new ensemble members [25] or replacing the minimum contributor or youngest member in the ensemble with the base model generated by utilizing the new dataset [26].

In the EL paradigm, many mature machine learning algorithms can be used as the base learning algorithm, such as SVM and ANNs. Compared with SVM and traditional ANNs, the Extreme Learning Machine (ELM) [27] has high scalability and low computational complexity, which greatly accelerates learning while maintaining generalization performance. The Extreme Learning Machine with Kernels (ELMK) [28] is a further improvement of ELM that overcomes the randomness of ELM, making it a good choice for the base learning algorithm.

From the above, traditional statistical methods assume a linear structure among data variables and ignore the correlations and multi-level nature of the information, which makes it difficult for them to obtain excellent prediction results in practical applications. Computational intelligence methods are nonlinear prediction methods that include some classical machine learning models. However, it is difficult for a single model or method to fully capture the overall change rule of the predicted objects, especially for tasks with high uncertainty. Combination prediction methods combine multiple methods, including traditional statistical methods and computational intelligence methods, to obtain comprehensive information and thereby improve prediction performance and stability. However, many combination prediction methods are designed for specific NS-TSP tasks or for specific influencing factors within the NS-TS.

Aiming at the challenges of NS-TSP and the disadvantages of the above methods, we develop a unified on-line forecasting framework that does not require specifying the internal changes of the NS-TS. This greatly reduces the computational complexity and, simultaneously, increases the practicability of the proposed algorithm. Besides, for base algorithms that lack on-line forecasting ability, IL and EL are important technologies for realizing on-line forecasting. Therefore, this paper proposes an Incremental Learning Algorithm via Dynamically Weighting Ensemble Learning, abbreviated as the DWE-IL algorithm.

The basic principle of DWE-IL is to track real-time data changes by dynamically building and maintaining a knowledge base composed of multiple base learners. DWE-IL assumes that, because the data distribution changes over time, the latest data usually best reflect the current state of the environment. In response to new data, DWE-IL reorganizes and integrates the existing knowledge while updating the knowledge base, so as to accurately reflect the current environment and predict the next one. The reason the proposed method handles non-stationary prediction better than other methods is that DWE-IL provides a unified and efficient on-line forecasting framework for NS-TSP through a double IL mechanism and a double dynamically weighting mechanism tailored to the characteristics of NS-TS. In the learning process, there is no need to consider specific influencing factors in the NS-TS. In addition, DWE-IL not only achieves higher prediction accuracy, but also has lower computational complexity and time cost. Therefore, compared with many traditional statistical methods, computational intelligence methods, and combination prediction methods, DWE-IL possesses excellent generalization performance and wide applicability.

The “dynamical” behavior of DWE-IL is mainly reflected in its weight updates, which involve two aspects: 1) the update of data weights. We initialize the data weights of each newly arrived dataset to a uniform distribution and then dynamically update them according to the performance of the current ensemble model on this dataset. This reflects how well the old knowledge adapts to the new data and also gives more attention to the samples in the new data that are harder to predict; 2) the update of the base model weights. We use a double dynamically weighting mechanism to obtain the final weights of the base models. From the point of view of a single base model, its performance in every environment since its generation is dynamically time-weighted, because its performance in newer environments should receive higher weight. From the perspective of the ensemble model, we dynamically update the weights of the base models mainly according to their performance. In particular, base models are temporarily forgotten when they no longer fit the current environment, but they can be remembered again when their historical training environment returns. Intuitively, the DWE-IL algorithm retains all the acquired knowledge but selectively activates and uses only the part that is effective at the moment, according to the real-time state of the environment.

From the above descriptions, the advantages of this dynamical behavior are as follows: 1) updating the data weights and the base model weights helps coordinate old and new knowledge and cope with the “stability-plasticity” problem in NS-TSP, which improves the generalization performance and efficiency of DWE-IL while ensuring its robustness; 2) updating the base model weights helps deal with the high-noise and periodicity problems in NS-TSP.

Overall, the DWE-IL algorithm can be divided into three parts:

  1. 1)

    Data pretreatment. In this part, we first need to normalize the original NS-TS, and then convert it into the dataset required for the prediction task according to the time window.

  2. 2)

    Models training. This is the most important part, mainly concerning the update of the data weights and base model weights and the training of the base models. The update of the data weights depends on the performance of the current ensemble model on the latest dataset. The update of the weight of a specific base model depends on its comprehensive performance over a period of time in the past, with performance at newer moments contributing more to its weight. In addition, unlike other ensemble learning algorithms, DWE-IL can temporarily forget irrelevant knowledge, but when the historical environment reappears, it can recall this knowledge again. The whole training process is similar to the process of gradual human learning and fully reflects the idea of IL.

  3. 3)

    Models combination. In this part, DWE-IL uses the Weighted Median method [17] to combine the models. The basic idea is to sort the outputs of all base models by value and, according to their respective ensemble weights, select as the ensemble result the predicted value at which the cumulative weight reaches 50%.

In view of the challenges of NS-TSP, we put forward several novel strategies in the models training part of the DWE-IL algorithm. The main innovations and contributions of DWE-IL are summarized as follows:

  1. a)

    DWE-IL can basically ignore the internal changes of NS-TS and provides a general on-line forecasting framework for NS-TSP tasks. Besides, it provides an effective processing mechanism for the periodicity of NS-TS.

  2. b)

    DWE-IL updates the data weights according to the performance of the current ensemble model on the latest dataset. The updated data weights are used for building the new base model, which helps to improve the generalization performance and robustness of the final ensemble model.

  3. c)

    DWE-IL uses a double IL mechanism to train the base model, that is, it trains the base model based on old knowledge and new data blocks at the same time, which strengthens the connection between old and new knowledge. While increasing the diversity of base models, it further improves the prediction performance of the final ensemble model.

  4. d)

    DWE-IL uses a double dynamically weighting mechanism to obtain the comprehensive performance of each base model and dynamically updates each base model's weight accordingly. In this process, the performance of a base model at newer moments receives more attention. This reduces the influence of noise in the NS-TS, maintains the stability of DWE-IL, and improves its plasticity.

The rest of this paper is organized as follows. In Section 2, the theoretical knowledge of ELM and ELMK is described in detail. In Section 3, the details of the proposed DWE-IL algorithm are described. The experimental results of the DWE-IL algorithm on six non-stationary time series datasets are reported in Section 4. Finally, in Section 5, the conclusions and future works are given.

2 Theoretical basis

2.1 Extreme learning machine (ELM)

Extreme Learning Machine (ELM) is a specific Single-hidden Layer Feedforward Neural Network (SLFN) [29] model proposed by Guang-Bin Huang in 2004. Its innovations are: (1) the connection weights between the input layer and the hidden layer, and the biases of the hidden layer, can be generated randomly or set manually; (2) the connection weights between the hidden layer and the output layer do not need to be adjusted iteratively, but are determined directly by solving a system of equations with the least squares method. Its contribution is that, compared with conventional SLFNs and SVM, it greatly improves learning speed and reduces computational complexity while guaranteeing generalization performance. The block diagram of the ELM algorithm is presented in Fig. 1.

Fig. 1 Block diagram of the ELM algorithm

Assume that there are N training samples (xi, ti), i = 1, 2, …, N, where xi = [xi1, xi2, ⋯, xin]T ∈ Rn and ti = [ti1, ti2, ⋯, tim]T ∈ Rm. If an arbitrary training sample is represented as (x, t), then ELM can be represented as follows:

$$ \boldsymbol{y}={f}_L\left(\boldsymbol{x}\right)={\sum}_{i=1}^L{\boldsymbol{\beta}}_i{h}_i\left(\boldsymbol{x}\right)=\boldsymbol{h}\left(\boldsymbol{x}\right)\boldsymbol{\beta} $$
(1)

where β = [β1T, β2T, ⋯, βLT]T ∈ RL × m and βi ∈ Rm is the connection weight vector between the ith node in the hidden layer and each node in the output layer, h(x) = [h1(x), h2(x), ⋯, hL(x)] is the output vector of the hidden layer for specific samples.

h(x) can be expressed as follows:

$$ \boldsymbol{h}\left(\boldsymbol{x}\right)=\left[g\left({\boldsymbol{w}}_1\cdotp \boldsymbol{x}+{b}_1\right),g\left({\boldsymbol{w}}_2\cdotp \boldsymbol{x}+{b}_2\right),\cdots, g\left({\boldsymbol{w}}_L\cdotp \boldsymbol{x}+{b}_L\right)\right]=g\left(\boldsymbol{W}\cdotp \boldsymbol{x}+\boldsymbol{b}\right) $$
(2)

where W = [w1, w2, ⋯, wL] ∈ Rn × L is the weight matrix between the input layer and the hidden layer, and b = [b1, b2, ⋯, bL] ∈ RL is the bias vector of the nodes in the hidden layer.

The target function of ELM is as follows:

$$ {\sum}_{i=1}^N\left\Vert {\boldsymbol{y}}_i-{\boldsymbol{t}}_i^T\right\Vert =0 $$
(3)

We can convert this to solving for Y = Hβ = T, where:

$$ \boldsymbol{Y}={\left[\begin{array}{c}{\boldsymbol{y}}_1\\ {}{\boldsymbol{y}}_{\mathbf{2}}\\ {}\vdots \\ {}{\boldsymbol{y}}_N\end{array}\right]}_{N\times m}\mathrm{and}\kern0.5em \boldsymbol{T}={\left[\begin{array}{c}{{\boldsymbol{t}}_1}^T\\ {}{{\boldsymbol{t}}_2}^T\\ {}\vdots \\ {}{{\boldsymbol{t}}_N}^T\end{array}\right]}_{N\times m} $$
(4)

The ultimate goal of ELM is to find the minimum-norm least squares solution of Hβ = T. The solution that satisfies this condition is β = H+T, where H+ represents the Moore-Penrose generalized inverse of H.
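The training procedure above can be sketched in a few lines of Python. Only the pseudoinverse solution β = H+T follows directly from the text; the sigmoid activation, the standard-normal initialization, and the function names are illustrative assumptions of this sketch.

```python
import numpy as np

def elm_train(X, T, L, seed=0):
    """Basic ELM training: X is (N, n) inputs, T is (N, m) targets, L hidden nodes."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], L))   # random input-to-hidden weights (never tuned)
    b = rng.standard_normal(L)                 # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # hidden-layer output matrix, sigmoid g
    beta = np.linalg.pinv(H) @ T               # beta = H^+ T (minimum-norm least squares)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta                            # y = h(x) beta
```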

2.2 Extreme learning machine with kernels (ELMK)

ELM can effectively overcome the inherent defects of traditional neural networks [30, 31]. However, ELM has one parameter that strongly affects the generalization performance of the model, namely the number of nodes in the hidden layer, and determining it for each task usually requires additional time and effort [32]. ELMK does not require the number of hidden nodes to be specified; it only requires selecting an appropriate kernel function and regularization factor. Because of this property, ELMK effectively avoids the randomness of ELM.

ELMK is defined as follows:

$$ {K}_{ELM}=\boldsymbol{H}{\boldsymbol{H}}^T:{K}_{ELM\left(i,j\right)}=\varphi \left({\boldsymbol{x}}_i\right)\cdotp \varphi \left({\boldsymbol{x}}_j\right)=K\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right) $$
(5)

where KELM represents ELMK and φ(·) represents an unknown feature map.

Then the output function can be defined as follows:

$$ f\left(\boldsymbol{x}\right)=\varphi \left(\boldsymbol{x}\right){\boldsymbol{H}}^T{\left(\boldsymbol{H}{\boldsymbol{H}}^T+\frac{\boldsymbol{I}}{C}\right)}^{-1}\boldsymbol{t}={\left[\begin{array}{c}K\left(\boldsymbol{x},{\boldsymbol{x}}_1\right)\\ {}\vdots \\ {}K\left(\boldsymbol{x},{\boldsymbol{x}}_N\right)\end{array}\right]}^T{\left({K}_{ELM}+\frac{\boldsymbol{I}}{C}\right)}^{-1}\boldsymbol{t} $$
(6)

If the training dataset is represented as (xi, ti), i = 1, …, N, where \( {\boldsymbol{x}}_i\in {\mathfrak{R}}^P \) and \( {\boldsymbol{t}}_i\in \mathfrak{R} \), then the initial optimization problem of ELMK can be written as:

$$ {\displaystyle \begin{array}{c}\operatorname{Min}\kern0.50em {L}_P=\frac{1}{2}{\left\Vert \boldsymbol{w}\right\Vert}^2+\frac{C}{2}{\sum}_{i=1}^N{\xi_i}^2\\ {}\kern4.5em \mathrm{s}.\mathrm{t}.\varphi {\left({\boldsymbol{x}}_i\right)}^T\boldsymbol{w}={t}_i-{\xi}_i\end{array}} $$
(7)

where w is the connection weight vector between the hidden layer and the output layer, C is the regularization parameter, and ξi is the error of (xi, ti).

Convert the above equation into Lagrange duality problem:

$$ {L}_D=\frac{1}{2}{\left\Vert \boldsymbol{w}\right\Vert}^2+\frac{C}{2}{\sum}_{i=1}^N{\xi_i}^2-{\sum}_{i=1}^N{\theta}_i\left(\varphi {\left({\boldsymbol{x}}_i\right)}^T\boldsymbol{w}-{t}_i+{\xi}_i\right) $$
(8)

where θi, i = 1, …, N are the Lagrange multipliers. The KKT conditions corresponding to the above equation are as follows:

$$ \frac{\partial {L}_D}{\partial \boldsymbol{w}}=\boldsymbol{w}-{\sum}_{i=1}^N{\theta}_i\varphi \left({\boldsymbol{x}}_i\right)=0\to \boldsymbol{w}={\sum}_{i=1}^N{\theta}_i\varphi \left({\boldsymbol{x}}_i\right) $$
(9)
$$ \frac{\partial {L}_D}{\partial {\xi}_i}=C{\xi}_i-{\theta}_i=0\to {\theta}_i=C{\xi}_i,i=1,\dots, N $$
(10)
$$ \frac{\partial {L}_D}{\partial {\theta}_i}=\varphi {\left({\boldsymbol{x}}_i\right)}^T\boldsymbol{w}-{t}_i+{\xi}_i=0,i=1,\dots, N $$
(11)
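Solving the KKT conditions above leads to the prediction function of Eq. (6), which reduces to a regularized kernel system. The following sketch is a plain least-squares reading of that formula with an RBF kernel; the kernel choice and the values of C and σ are assumptions of the sketch, not the authors' implementation.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def elmk_train(X, t, C=100.0, sigma=1.0):
    """Solve (K_ELM + I/C) alpha = t, following Eq. (6)."""
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + np.eye(len(X)) / C, t)

def elmk_predict(X_new, X, alpha, sigma=1.0):
    return rbf_kernel(X_new, X, sigma) @ alpha   # f(x) = [K(x, x_1), ..., K(x, x_N)] alpha
```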

3 The proposed incremental learning algorithm via dynamically weighting ensemble learning (DWE-IL)

As an IL algorithm based on a dynamically weighting ensemble scheme, the DWE-IL algorithm proposed in this paper can effectively solve NS-TSP problems. When it comes to EL, we have to consider the choice of base model. For the reasons analyzed above, ELMK is selected as the base model for the DWE-IL algorithm. In the experiments conducted in Section 4, we also select ELM as the base model and compare the results of the two models to reveal the characteristics of the DWE-IL algorithm.

The basic idea of the DWE-IL algorithm is described as follows. Each data subset generates a base model, and the base model set is built in a double incremental manner. Then, the performance of all existing base models on the latest dataset is measured. If the weighted sum of the relative errors of the new base model is greater than 1/2, this proves that it cannot correctly reflect the current environment, so it is discarded and a new base model is regenerated. If the weighted sum of the relative errors of an old base model is greater than 1/2, this proves that the knowledge it has learned is not suitable for the current environment, and it should be temporarily discarded. DWE-IL sets the weighted sum of the relative errors of such old base models to 1/2, so that their weight in the final ensemble equals 0, thereby achieving temporary discarding. However, it is worth noting that when the historical environment in which an old base model was trained reappears, its weight is updated and it is remembered again. Next, the overall performance of each base model over the recent past is evaluated, with its performance in newer environments receiving greater weight. Finally, a new ensemble model is obtained by the weighted median method, and the data distribution is updated for the next round of training according to the performance of the ensemble model on the next dataset.

In Algorithm 1, the specific pseudo-code of the DWE-IL algorithm is presented.


To better demonstrate the process of the DWE-IL algorithm, we present the block diagram of the DWE-IL algorithm in Fig. 2.

Fig. 2 Block diagram of the DWE-IL algorithm

The following is a detailed description of the DWE-IL algorithm:

  1. A)

    Input:

  1. a)

    The original time series DATA = {d1, d2, …, dN}.

  2. b)

    Time window size tw, which is an important hyper-parameter in TSP and determines the dimension of the sample feature space. A value of tw that is too large or too small directly affects the generalization performance of the algorithm.

  3. c)

    The number of data subsets T. The training dataset Train is divided into Traint, t = 1, 2, …, T, to better implement IL and EL. A value of T that is too large makes IL meaningless, while a value that is too small results in poor EL performance.

  4. d)

    The base model τ, for which there are many choices, such as SVR, ELM, and decision trees. Considering that ELMK generalizes better than ELM while having lower computational complexity than SVM and decision trees, we choose ELMK as the base model in this work.

  1. B)

    Pretreatment:

First, in order to improve the accuracy and convergence speed of the model, we normalize the original time series DATA to DATA′. This paper uses min-max normalization to map the data into the interval [0, 1]. The normalization formula is as follows:

$$ {d_i}^{\prime }=\frac{d_i-{d}_{min}}{d_{max}-{d}_{min}} $$
(12)

where di′ is the normalized value, di is the original value, and dmin and dmax are the minimum and maximum values of the original data.

Next, tw is used to convert DATA′ into the training dataset Train and testing dataset Test required by the prediction task, as shown in Fig. 3.

Fig. 3 The translation of time series to training dataset and testing dataset

Finally, divide Train into T subsets \( {\boldsymbol{Train}}^t=\left\{\left({\boldsymbol{x}}_1,{y}_1\right),\left({\boldsymbol{x}}_2,{y}_2\right),\dots, \left({\boldsymbol{x}}_{m^t},{y}_{m^t}\right)\right\},t=1,2,\dots, T \), randomly, where \( {\sum}_{t=1}^T{m}^t=N \).
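The pretreatment stage (normalization, time-window transformation, and random division into T subsets) can be sketched as follows. The Train/Test split shown in Fig. 3 is omitted for brevity, and the helper name `pretreat` is ours, not the paper's.

```python
import numpy as np

def pretreat(series, tw, T, seed=0):
    """Min-max normalize a 1-D series, window it with lag tw, and split it into T subsets."""
    rng = np.random.default_rng(seed)
    d = np.asarray(series, dtype=float)
    d = (d - d.min()) / (d.max() - d.min())              # Eq. (12): map into [0, 1]

    # Sliding time window: the previous tw values predict the next value.
    X = np.array([d[i:i + tw] for i in range(len(d) - tw)])
    y = d[tw:]

    idx = rng.permutation(len(X))                        # the paper divides Train randomly
    return [(X[part], y[part]) for part in np.array_split(idx, T)]
```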

  1. C)

    Procedure:

  1. a)

    Initializing Dt, t = 1, 2, …, T to a uniform distribution. In the next step, different from AdaBoost, which updates the sample weights based on the base model, this work updates the sample weights based on the error of the existing ensemble model Ht − 1 on the new dataset Traint. Ht − 1 is obtained by dynamically integrating each base model generated from Traink, k = 1, 2, …, t − 1.

  2. b)

    Step 1: Calculating the error Et of Ht − 1 on Traint. Et, t = 1, 2, …, T is obtained by summing the product of the relative error of samples in Traint and the initial weights of samples. The initial weights of samples \( \frac{1}{m^t},t=1,2,\dots, T \) can ensure 0 ≤ Et ≤ 1.

  3. c)

    Step 2: Updating the sample weights wt based on Et, t = 1, 2, …, T. The formula for updating the sample weights is as follows:

    $$ {w}^t(i)=\frac{1}{m^t}\times {E^t}^{{\left\{1-\left\{ abs\left({H}^{t-1}\left({\boldsymbol{x}}_i\right)-{y_i}^t\right)/ errormax\right\}\right\}}^2},i=1,2,\dots, {m}^t $$
    (13)

    Then, wt is normalized to ensure that Dt is a distribution. It can be seen from Steps 4 and 5 that the updated sample distribution Dt affects the weight of hk, k = 1, 2, …, t in the ensemble model Ht by affecting the error of hk on the latest dataset.

  4. d)

    Step 3: Training a base model ht as a new ensemble member with Ht − 1 and Traint. Note that when the first data subset arrives, the algorithm skips Steps 1 and 2 and jumps directly to Step 3, because there is no existing ensemble model with which to update the sample distribution. When the second data subset Train2 arrives, h1 serves as the ensemble model H1.

  5. e)

    Step 4: Evaluating the performance of all base models hk, k = 1, 2, …, t on Traint. Since the base models are generated at different times, the number of evaluations received by each base model differs. That is, when the latest dataset is Traint, hk = t obtains its first error, and hk, k = 1, 2, …, t − 1 obtains its (t − k + 1)th error. We use \( {\varepsilon}_k^t \) to represent the error of the kth base model on Traint. \( {\varepsilon}_k^t \) can be expressed as follows:

    $$ {\varepsilon}_k^t={\sum}_{i=1}^{m^t}{D}^t(i)\cdotp {\left\{ abs\left({h}_k\left({\boldsymbol{x}}_i\right)-{y_i}^t\right)/ errormax\right\}}^2,k=1,2,\dots, t $$
    (14)

    If \( {\varepsilon}_{k=t}^t \) is greater than 1/2, then we discard the base model ht, and generate a new base model ht, instead of directly interrupting the training process. If \( {\varepsilon}_{k<t}^t \) is greater than 1/2, \( {\varepsilon}_{k<t}^t \) is set to 1/2, instead of discarding the base model directly. This is because, if the environment changes dramatically, it is reasonable that the base model does not perform well, and this does not mean that the base model will never be useful again in the future. If the environment reappears, the base model error will become smaller, and the model will again contribute to the current overall decision.

    As mentioned in Section 1, several conditions need to be met by IL algorithms for prediction in non-stationary environments, one of which is that there should be a mechanism to coordinate the conflict between old and new knowledge. The above rule embodies this idea by allowing the ensemble model to strengthen its original knowledge while learning new knowledge.

  6. f)

    Step 5: Calculating the weights of all base models hk, k = 1, 2, …, t on Traint. First, calculate the average weighted error of each base model, so that the performance of the base model in the latest environment gets more attention. The process is as follows:

    $$ \left[\begin{array}{ccc}{\omega}_1^1& \cdots & {\omega}_1^t\\ {}\vdots & \ddots & \vdots \\ {}0& \cdots & {\omega}_t^t\end{array}\right]=\left[\begin{array}{ccc}\frac{1}{1+{e}^0}& \cdots & \frac{1}{1+{e}^{-\left(t-1\right)}}\\ {}\vdots & \ddots & \vdots \\ {}0& \cdots & \frac{1}{1+{e}^0}\end{array}\right] $$
    (15)

where \( {\omega}_k^t \) is the weight of \( {\varepsilon}_k^t \).

Row k, 1 < k < t satisfies the following conditions:

$$ {\omega}_k^k=\frac{1}{1+{e}^0}<{\omega}_k^{k+1}=\frac{1}{1+{e}^{-1}}<\cdots <{\omega}_k^t=\frac{1}{1+{e}^{-\left(t-k\right)}} $$
(16)

then normalize \( {\omega}_k^t,k=1,2,\dots t \).

The average weighted error of the base model hk, k = 1, 2, …, t on Traint can be computed as:

$$ {\displaystyle \begin{array}{l}\overline{\beta_k^t}={\sum}_{j=0}^{t-k}{\omega}_k^{t-j}\left({\varepsilon}_k^{t-j}/\left(1-{\varepsilon}_k^{t-j}\right)\right)\\ {}\kern4em ={\omega}_k^t\frac{\varepsilon_k^t}{1-{\varepsilon}_k^t}+\cdots +{\omega}_k^{t-q}\frac{\varepsilon_k^{t-q}}{1-{\varepsilon}_k^{t-q}}+\dots +{\omega}_k^k\frac{\varepsilon_k^k}{1-{\varepsilon}_k^k}\end{array}} $$
(17)

The weight of hk is obtained by normalizing the logarithm of the reciprocal of the average weighted error, i.e., \( {W}_k^t=\log \left(1/\overline{\beta_k^t}\right) \), \( {W}_k^t={W}_k^t/{\sum}_{k=1}^t{W}_k^t \). If the knowledge of the base model hk does not match the current environment, \( {W}_k^t \) will be very small or even zero, thereby achieving the purpose of temporarily “discarding”. If its knowledge becomes relevant again, \( {\varepsilon}_k^t \) will be smaller, so that hk will get a higher weight in the current environment and will be remembered again. This feature is especially useful in periodic environments.

  1. g)

    Step 6: All base models are weighted dynamically to get the latest ensemble model as follows:

    $$ {H}^t(i)=\arg\ {\min}_{h_{k\left({\boldsymbol{x}}_i\right)}}{\sum}_{h_{j\left({\boldsymbol{x}}_i\right)}<{h}_{k\left({\boldsymbol{x}}_i\right)}}{W}_k^t\kern0.75em \ge \frac{1}{2}\kern0.5em {\sum}_{j=1}^t{W}_j^t $$
    (18)

    Note that each new data subset generates a new base model and a new ensemble model. The newly generated ensemble model then influences the weight of the next base model by driving the next sample-weight update (Steps 1 and 2). A compact sketch of Steps 1-6 is given after this procedure.

    From the perspective of the whole algorithm description, the DWE-IL algorithm uses IL and EL technologies to realize the on-line forecasting process of NS-TS. Ideally, as long as the data input is uninterrupted, DWE-IL will continue to iteratively update the on-line forecasting model to adapt to the changing non-stationary environment.
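To make Steps 1-6 concrete, the sketch below implements one reading of the update equations (13)-(18). The handling of errormax (scaling by the largest absolute error in the current batch) and of Et (the mean relative error under the initial uniform weights) are our assumptions where the text and Algorithm 1 leave details open, and the double IL training of the base model itself (Step 3) is not shown.

```python
import numpy as np

def relative_errors(pred, y):
    """Absolute errors scaled by their maximum ("errormax" in Eqs. (13)-(14))."""
    err = np.abs(np.asarray(pred, dtype=float) - np.asarray(y, dtype=float))
    return err / max(err.max(), 1e-12)

def update_sample_weights(ens_pred, y):
    """Steps 1-2 (Eq. (13)): distribution D^t over the newest subset Train^t."""
    rel = relative_errors(ens_pred, y)
    E = max(np.mean(rel), 1e-12)                 # one reading of E^t (uniform initial weights)
    w = (1.0 / len(y)) * E ** ((1.0 - rel) ** 2) # harder samples (large rel) keep larger weight
    return w / w.sum()

def base_model_errors(preds, y, D):
    """Step 4 (Eq. (14)): preds has shape (t, m); row k is h_k's prediction on Train^t."""
    rel = relative_errors(preds, y)
    eps = (rel ** 2) @ D
    eps[:-1] = np.minimum(eps[:-1], 0.5)         # old models: cap at 1/2 (temporary forgetting)
    return eps                                   # if eps[-1] > 1/2 the new model is retrained

def ensemble_weights(eps_history):
    """Step 5 (Eqs. (15)-(17)): eps_history[k] holds [eps_k^k, ..., eps_k^t]."""
    W = np.zeros(len(eps_history))
    for k, eps in enumerate(eps_history):
        ages = np.arange(len(eps))               # 0 = oldest evaluation, larger = newer
        omega = 1.0 / (1.0 + np.exp(-ages))      # newer evaluations weighted more heavily
        omega /= omega.sum()
        eps = np.clip(np.asarray(eps, dtype=float), 1e-12, 0.5)
        beta_bar = np.sum(omega * eps / (1.0 - eps))
        W[k] = np.log(1.0 / beta_bar)
    return W / max(W.sum(), 1e-12)

def weighted_median_predict(preds, W):
    """Step 6 (Eq. (18)): weighted median of the base predictions, column by column."""
    out = np.empty(preds.shape[1])
    for i in range(preds.shape[1]):
        order = np.argsort(preds[:, i])
        cum = np.cumsum(W[order])
        j = np.searchsorted(cum, 0.5 * W.sum())
        out[i] = preds[order[min(j, len(order) - 1)], i]
    return out
```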

4 Experiments

In this paper, six time series datasets are used to evaluate the performance of DWE-IL: three financial datasets, the Sunspot dataset, the Mackey-Glass dataset, and the Lorenz dataset. We consider these datasets for the following reasons: (1) they involve finance, astronomy, and other fields closely related to human life, and are classic datasets in those fields; (2) they include both real-world and artificial datasets, which demonstrates the performance of the proposed DWE-IL algorithm more comprehensively; (3) they are verified to be non-stationary, which is consistent with the topic of this paper. Accordingly, the three financial datasets and the other three datasets are representative choices.

4.1 Datasets

4.1.1 Three financial time series datasets

Since their inception, stocks have attracted wide attention and have a significant impact on the financial markets of individual countries and even the world as a whole. Therefore, this paper studies three classic stock index datasets, namely the Dow Jones Industrial Average Index (DJI), Nikkei 225 Index (N225), and Shanghai Stock Exchange Composite Index (SSE) datasets, all of which are obtained from Yahoo Finance [33].

DJI dataset is composed of monthly samples of Dow Jones Industrial Average Index from February 1985 to March 2015, having a total of 352 data points. N225 dataset is composed of samples of the monthly Nikkei Index from April 1988 to March 2015, containing 324 data points. SSE dataset is composed of monthly samples of Shanghai Stock Exchange Index from December 1990 to January 2015, having a total of 290 data points.

4.1.2 Sunspot dataset

Sunspots are among the most basic and obvious activities on the solar photosphere; they reflect the level of solar activity in a given period and are an important index for studying the solar cycle. Sunspot data are of great significance for studying space physics, the space environment, Earth's climate, and satellite operation. Predicting sunspots is not only significant but also challenging.

4.1.3 Mackey-glass dataset

The Mackey-Glass time series, one of the classical non-stationary chaotic time series, originates from a physiological control system and represents a typical feedback system [34]. It is generated by the following nonlinear differential equation:

$$ \frac{dx}{dt}=\frac{\alpha x\left(t-\delta \right)}{1+{x}^c\left(t-\delta \right)}- bx(t) $$
(19)

If δ > 16.8, then the time series generated by Eq. (19) is chaotic. We set the parameters to α = 0.2, b = 0.1, c = 10, δ = 10, and x(0) = 1.2.
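A simple way to reproduce such a series is to integrate Eq. (19) numerically. The sketch below uses Euler integration with an assumed step size and a constant history before t = 0, sampling one point per unit of time; the paper does not state its integration scheme.

```python
import numpy as np

def mackey_glass(n, alpha=0.2, b=0.1, c=10, delta=10, x0=1.2, dt=0.1):
    """Generate n points of the Mackey-Glass series by Euler integration of Eq. (19)."""
    lag = int(round(delta / dt))                    # number of steps covering the delay
    per_unit = int(round(1.0 / dt))
    steps = n * per_unit
    x = np.full(steps + lag + 1, x0, dtype=float)   # constant history before t = 0
    for t in range(lag, steps + lag):
        x_del = x[t - lag]                          # delayed value x(t - delta)
        x[t + 1] = x[t] + dt * (alpha * x_del / (1.0 + x_del ** c) - b * x[t])
    return x[lag::per_unit][:n]                     # one sample per unit of time
```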

4.1.4 Lorenz dataset

The Lorenz system is a classical three-dimensional dynamical system widely used for studying chaotic time series. The series is generated by the famous Lorenz equations:

$$ \frac{dx(t)}{dt}=\alpha \left[y(t)-x(t)\right] $$
(20)
$$ \frac{dy(t)}{dt}=x(t)\left[\beta -z(t)\right]-y(t) $$
(21)
$$ \frac{dz(t)}{dt}=x(t)y(t)-\gamma z(t) $$
(22)

We set the parameters to: α = 10, β = 28, and γ = 8/3.
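The Lorenz series can likewise be produced by numerically integrating Eqs. (20)-(22). The step size and the initial state in the sketch below are assumptions (the paper lists only α, β, and γ); typically the x(t) component is taken as the univariate time series.

```python
import numpy as np

def lorenz(n, alpha=10.0, beta=28.0, gamma=8.0 / 3.0, dt=0.01, init=(1.0, 1.0, 1.0)):
    """Integrate Eqs. (20)-(22) with a simple Euler scheme; returns an (n, 3) array."""
    xyz = np.empty((n, 3))
    x, y, z = init
    for i in range(n):
        dx = alpha * (y - x)                     # Eq. (20)
        dy = x * (beta - z) - y                  # Eq. (21)
        dz = x * y - gamma * z                   # Eq. (22)
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        xyz[i] = (x, y, z)
    return xyz
```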

Since not all time series are non-stationary, for rigor, before the experiments we use the Autocorrelation Function (ACF) test [35] and the Augmented Dickey-Fuller (ADF) test [36] to verify the non-stationarity of the six experimental datasets.

The ACF test judges the stationarity of the time series through the attenuation of the autocorrelation coefficient. The autocorrelation coefficient γk is calculated as follows:

$$ {\gamma}_k=\frac{\sum_{t=1}^{n-k}\left({X}_t-\overline{X}\right)\left({X}_{t+k}-\overline{X}\right)}{\sum_{t=1}^n{\left({X}_t-\overline{X}\right)}^2} $$
(23)

where n is the size of the time series, k is the lag period, and \( \overline{X} \) is the mean of the time series.

According to Eq. (23), the autocorrelation coefficient γk generally decays as the lag period k increases, gradually tending to 0. Stationary time series exhibit short-term correlation: usually only recent values have a significant effect on the current value, while values farther in the past have a smaller effect. Therefore, γk drops much faster for a stationary time series than for an NS-TS.
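Eq. (23) can be evaluated directly to inspect how quickly the autocorrelation decays. The following sketch is a plain implementation of that formula (the lag range is left to the user), not necessarily the routine used to produce Fig. 4.

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation gamma_k of Eq. (23) for k = 0, ..., max_lag."""
    x = np.asarray(x, dtype=float)
    xm = x - x.mean()
    denom = np.sum(xm ** 2)                      # denominator of Eq. (23)
    return np.array([np.sum(xm[:len(x) - k] * xm[k:]) / denom
                     for k in range(max_lag + 1)])
```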

Figure 4 shows the ACF test result for each experimental dataset. Each ACF plot shows the value of γk for each k, together with the approximate upper and lower confidence bounds around γk = 0, represented by two horizontal lines. As mentioned above, if γk rapidly drops to 0, fluctuates around 0, and gradually converges to 0, the time series is very likely to be stationary. As can be seen from Fig. 4, the autocorrelation coefficient γk of each of the six time series does not conform to these characteristics. Therefore, we have sufficient reason to believe that all six experimental datasets are non-stationary.

Fig. 4 ACF image of six experimental datasets. (a) DJI, (b) N225, (c) SSE, (d) Sunspot, (e) Mackey-Glass, (f) Lorenz

Unlike the somewhat subjective ACF test, the ADF test judges the stationarity of a time series by testing for the existence of a unit root: if a unit root exists, the time series is non-stationary. We verify the non-stationarity of each time series using the ADF test and give the corresponding results in Table 1. The ADF test involves four basic indicators: the test value pValue, the critical value cValue, the significance level α, and the test result h. The null hypothesis H0 of the ADF test is that the time series has a unit root, that is, it is non-stationary. A test result of h = 1 indicates that hypothesis H0 is rejected at the 5% significance level. As can be seen from Table 1, each time series dataset satisfies h = 0 and pValue > α, so it can be concluded that the above six experimental datasets are non-stationary.
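One way to reproduce this check is the adfuller routine from the statsmodels Python package (an assumed tool; the paper does not name its software). Its null hypothesis matches the one stated above, so h = 0 together with pValue > α indicates non-stationarity.

```python
from statsmodels.tsa.stattools import adfuller

def adf_nonstationary(series, alpha=0.05):
    """ADF test: the null hypothesis is that the series has a unit root (non-stationary)."""
    adf_stat, p_value, _, _, critical_values, _ = adfuller(series)
    h = int(p_value < alpha)                    # h = 1 rejects the null at level alpha
    return h == 0, p_value, critical_values    # h = 0 and pValue > alpha: non-stationary
```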

Table 1 The results of ADF test on six datasets

4.2 Experimental setup

4.2.1 Parameters setup

The DWE-IL algorithm mainly involves three parameters, namely the time window size tw, the data subset size K, and the choice of the kernel function Kernel in the base learning algorithm ELMK.

  1. a)

    The time window size tw is a very important parameter for addressing TSP issues. A time series is a sequence of numbers that must be restructured into the form required for supervised learning, i.e., input variables and output variables, before machine learning algorithms can be used for prediction. The time window size tw is the basis of this transformation: its value corresponds to the feature dimension of the samples in supervised learning. Values of tw that are too large or too small have a negative impact on the generalization performance of the algorithm.

  2. b)

    Data subset size K is also a key parameter in TSP. If K is too large, the meaning of incremental learning will be lost. In contrast, if K is too small, underlearning can occur, and the generalization performance of the basic model trained based upon each data subset will be poor. Therefore, it is necessary to choose an appropriate K value. For better representation and analysis of K, as described in Algorithm 1, we use the number of data subsets T to laterally reflect the importance of K. And \( T=\frac{n}{K} \) should balance these two situations, where n represents the size of the non-stationary dataset.

  3. c)

    We have four choices for the Kernel function Kernel in the base learning algorithm ELMK, namely, RBF Kernel, Linear Kernel, Polynomial Kernel, and Wavelet Kernel. Each kernel function has its own parameters. Besides, a regularization coefficient C is employed to balance the model complexity and predictive error. In this work, Cross-Validation method [37] is utilized to select the optimal values for the parameters.

4.2.2 Comparative algorithms setup

In order to evaluate the performance of DWE-IL, we compare it with several excellent classic and recent algorithms; the comparison of experimental results is reported in Section 4.3. The comparative algorithms include: 1) Online Sequential Improved Error Minimized ELM (OSIEM-ELM) [38]; 2) Double Incremental Learning (DIL) [39]; 3) ELMK; 4) DWE-IL (ELM); 5) Temporal Convolutional Network (TCN) [40]; 6) New Online Sequential Learning Algorithm for KELM (NOS-KELM) [41]; 7) Neuron Model based on the Dendritic Mechanism (NBDM) [42]; 8) a new hybrid TSP algorithm that integrates dynamic ensemble pruning (DEP), IL, and kernel density estimation (KDE) (EnsPKDE&IncLKDE) [43]; 9) Competitive Two-Island CC (CICC-two-island) [44]; 10) Adaptive Neuro-Fuzzy Inference System (ANFIS) [45]; 11) co-evolutionary neural network architectures (CCRNN-NL) [46].

OSIEM-ELM is an online learning extreme learning machine that can learn data sample by sample or block by block. The DIL algorithm is similar to the DWE-IL algorithm proposed in this paper: they are both IL and EL algorithms. The base model of the DIL algorithm is the Incremental Support Vector Machine (I-SVM) [47]. By comparison with the DIL algorithm, we can verify whether DWE-IL's use of each base model's average weighted error to calculate its weight has a significant effect. The DWE-IL (ELM) algorithm employs ELM as its base model instead of ELMK. By comparison with the DWE-IL (ELM) and ELMK algorithms, we can verify whether the IL framework based on the dynamically weighting ensemble scheme proposed in DWE-IL has a significant effect.

TCN is a novel temporal convolutional network, which has received widespread attention since it was proposed. Researchers have done some studies on time series prediction and classification tasks using TCN, such as [48]. These studies show that TCN has excellent performance. NOS-KELM is an NS-TSP method that combines the sparsification rule and the adaptive regularization scheme based on KB-IELM, which has excellent performance in artificial and real-world NS-TS.

NBDM is a neuron model based on dendritic mechanism and phase space reconstruction (PSR). It has excellent performance on Non-Stationary Financial Time Series (NS-FTS) datasets. EnsPKDE&IncLKDE benefits from the advantages of integrated DEP scheme, IL paradigm and KDE. It has superior prediction performance in TSP tasks. CICC-two-island is a competitive method used to train recurrent neural networks for chaotic time series prediction. ANFIS is designed based on FL and has good performance on non-stationary artificial and real-world datasets.

4.2.3 Performance measures setup

In order to evaluate the generalization performance of the DWE-IL algorithm and better compare it with the comparative algorithms mentioned in Section 4.2.2, we need performance measures to evaluate each algorithm. In this paper, we use four performance measures, i.e., Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Absolute Error (Error), and Standard deviation (Std). They are calculated as follows:

$$ RMSE=\sqrt{\frac{1}{n}{\sum}_{i=1}^n{\left({x}_i-\hat{x_i}\right)}^2} $$
(24)
$$ MAE=\frac{\sum_{i=1}^n\mid {x}_i-\hat{x_i}\mid }{n} $$
(25)
$$ Error={x}_i-\hat{x_i} $$
(26)
$$ Std=\sqrt{\frac{1}{Q-1}{\sum}_{i=1}^Q{\left({RMSE}_i-\overline{RMSE}\right)}^2} $$
(27)

where xi, i = 1, 2, …, n is the real value of ith sample, \( \hat{x_i},i=1,2,\dots, n \) is the predictive value of ith sample, n is the size of the dataset, RMSEi, i = 1, 2, …, Q is the RMSE value of ith independent repeated experiment, \( \overline{RMSE} \) is the mean of RMSEs of all the independent repeated experiments and Q is the number of independent repeated experiments.
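The measures can be computed directly as below; this is a plain restatement of Eqs. (24)-(27), with Eq. (26) returning the per-sample signed errors.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Eq. (24)."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def mae(y_true, y_pred):
    """Eq. (25)."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def errors(y_true, y_pred):
    """Eq. (26): per-sample signed errors."""
    return np.asarray(y_true) - np.asarray(y_pred)

def std_of_rmses(rmse_runs):
    """Eq. (27): sample standard deviation over Q independent repeated experiments."""
    return np.std(rmse_runs, ddof=1)
```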

In addition, in order to reduce the influence of error caused by accidental factors in the experiments on the algorithm performance evaluation, twenty independent repeated experiments are performed for each algorithm on each dataset, and the average result of twenty experimental results is taken as the final result.

4.3 Experimental results

This section presents the experimental parameters and the corresponding experimental results of the DWE-IL algorithm obtained with these parameters on the six datasets given in Section 4.1, and the comparison results of the DWE-IL algorithm and other excellent algorithms on each dataset. In order to make the experimental comparison of algorithm performances fair and reasonable, experimental settings and conditions of the DWE-IL algorithm and the comparative algorithms are the same.

An algorithm verified only with its optimal hyperparameters may overfit and generalize poorly to out-of-sample data. In our work, we tackle this problem from two aspects: 1) from the theoretical aspect, the overall principle of DWE-IL helps to mitigate overfitting. DWE-IL introduces the ideas of EL and IL, combining multiple base models to obtain more stable and better prediction performance; the risk of overfitting of the final ensemble model is much smaller than that of a single base model; 2) from the experimental aspect, we use cross-validation to determine the optimal hyperparameters. In the experiments, the original dataset is randomly divided into a training dataset and a test dataset; the training dataset is used to train the model, and the test dataset is used to determine the optimal hyperparameters and evaluate the generalization performance.

There are many parameters in the DWE-IL algorithm, among which the time window size tw, the data subset size K, and the kernel function Kernel in ELMK are the three most important parameters. In our experiments, we set Kernel = RBF Kernel. As mentioned in Section 4.2.1, we use the number of data subsets T to analyze K laterally.

Table 2 shows the detailed parameters of DWE-IL and the other three competitors on the six experimental datasets. The NOS-KELM algorithm has four main parameters: the regularization factor vector γ, the learning rate η, the kernel width σ, and the algorithm termination threshold δ. The DIL algorithm has three main parameters: the time window size tw, the number of subsets K, and the number of iterations Tk. The main parameters of the DWE-IL (ELM) algorithm and their value ranges are similar to those of DWE-IL, namely the time window size tw, the number of data subsets T, and the activation function ActivationFunction. In our experiments, we set ActivationFunction = Sigmoid.

Table 2 Experimental parameters of DWE-IL and other three competitors

As described in Section 4.2.1, a value of tw that is too large or too small affects the generalization performance of the DWE-IL algorithm, as shown by the data in Tables 3, 4 and 5. Of course, whether tw is too large or too small depends on the particular dataset. Due to the large span of tw values, the six datasets are divided into groups for convenience of presentation, to show the influence of different values of tw on the generalization performance of DWE-IL. The generalization performance measure used here is RMSE.

Table 3 The average values of RMSEs obtained by DWE-IL with different values of parameter tw on DJI, N225 and Sunspot datasets
Table 4 The average values of RMSEs obtained by DWE-IL with different values of parameter tw on SSE and Lorenz datasets
Table 5 The average values of RMSEs obtained by DWE-IL with different values of parameter tw on Mackey-Glass datasets

Table 6 shows the influence of the number of data subsets T on the generalization performance of the algorithm. It can be seen from Table 6 that T satisfies 2 ≤ T ≤ 6, and that different values of T yield different experimental results. This indirectly verifies the conclusion mentioned in Section 4.2.1 that a value of K that is too large or too small affects the performance of the algorithm, so an appropriate K value must be chosen.

Table 6 The average values of RMSEs obtained by DWE-IL with different values of parameter T on six datasets

Table 7 shows the maximum, minimum, and average values of the RMSEs and MAEs obtained by DWE-IL under the optimal parameter combination in twenty independent repeated experiments on each dataset, together with the average running time (TIME) of each independent repeated experiment. It can be seen from Table 7 that the DWE-IL algorithm trains and predicts well on each dataset within a short time. In addition, except for the Sunspot and Lorenz datasets, the generalization performance of the DWE-IL algorithm on the other four datasets is stable, fluctuating only within a small range. Due to the interference of noise and the complex characteristics of the data themselves, the generalization performance of the DWE-IL algorithm on the Sunspot and Lorenz datasets fluctuates more, but it remains within an acceptable range.

Table 7 The maximum, minimum and average values of RMSEs, MAEs and TIMEs obtained by DWE-IL on six datasets

Tables 8 and 9 show the comparison of the RMSEs and MAEs of six algorithms, including DWE-IL, on each dataset. It can be seen from Tables 8 and 9 that the generalization performance of the DWE-IL algorithm on each dataset is better than that of the other algorithms. This reflects that DWE-IL has good plasticity for NS-TSP. DIL and DWE-IL are both IL and EL algorithms; unlike DWE-IL, DIL updates the weights based only on the performance of each base model. Therefore, we can infer that the double dynamically weighting mechanism of DWE-IL is more suitable for NS-TSP. OSIEM-ELM and TCN are a classic and a novel example, respectively, of computational intelligence methods. The reason their predictive performance and stability are inferior to those of DWE-IL might be that a combination prediction model usually performs better than a single model, as has been demonstrated by many researchers in state-of-the-art works. Although this may cost additional space, it is worthwhile.

Table 8 The average values of RMSEs obtained by different algorithms on six datasets
Table 9 The average values of MAEs obtained by different algorithms on six datasets

Also worth noting in Tables 8 and 9 is the comparison among the three algorithms DWE-IL, DWE-IL (ELM), and ELMK. The experimental results show that the RMSEs of the DWE-IL algorithm and the DWE-IL (ELM) algorithm are close to each other and significantly smaller than those of ELMK. It can be concluded that the IL framework based on the dynamically weighting ensemble scheme proposed in DWE-IL has a significant effect and contributes more to the performance improvement of the algorithm than the choice of base model does.

Table 10 shows the standard deviations (Stds) of the RMSEs of six algorithms, including DWE-IL, on each dataset. A lower Std value means a lower degree of dispersion and higher stability. As can be seen from Table 10, on five out of six datasets DWE-IL has the smallest Std among the compared algorithms. This is because the double dynamically weighting mechanism improves the robustness of DWE-IL, and the selected base model ELMK has high learning efficiency and low randomness compared with some other computational intelligence algorithms. Therefore, we can conclude that DWE-IL has good stability as well as good plasticity.

Table 10 The standard deviations of RMSE obtained by different algorithms on six datasets

Table 11 and Table 12 show the RMSE and MAE t-test results between the DWE-IL algorithm and other algorithms, respectively. From the last column of Table 11 and Table 12, we can see that DWE-IL has no significant improvement over DWE-IL (ELM) on some datasets. This problem can be explained by the previous conclusion that, compared with the choice of the base model, the IL framework based on dynamically weighting ensemble scheme proposed in DWE-IL is more conducive to the improvement of the performance of the algorithm. The choice of the base model is relatively less important.

Table 11 RMSE t-test results between DWE-IL and other comparative algorithms
Table 12 MAE t-test results between DWE-IL and other comparative algorithms

From Table 13 we can see that the time spent by DWE-IL is relatively short. This is because the computational complexity of DWE-IL is relatively low, and the selected base model ELMK has a simpler structure and faster training speed. For these reasons, DWE-IL responds to data changes in NS-TS in a timely manner.

Table 13 The average values of TIME spent by different algorithms on six datasets

In order to show and analyze the prediction performance of DWE-IL and of the comparative algorithms more clearly and intuitively, the predictive results and predictive errors obtained by DIL, DWE-IL (ELM), and DWE-IL on the six experimental datasets are shown in Figs. 5, 6, 7, 8, 9 and 10, respectively. Among the comparative algorithms, DWE-IL (ELM) and DIL have the best and second-best prediction performance, respectively, on the six experimental datasets. Therefore, we consider DWE-IL (ELM) and DIL to be good representatives of all the comparative algorithms. Besides, if the experimental results of all the algorithms were presented, the predicted results and predictive errors shown in the figures would be difficult to observe clearly.

Fig. 5 Predictive results and Predictive errors of different algorithms on DJI dataset. (a) Predictive results, (b) Predictive errors

Fig. 6 Predictive results and Predictive errors of different algorithms on N225 dataset. (a) Predictive results, (b) Predictive errors

Fig. 7 Predictive results and Predictive errors of different algorithms on SSE dataset. (a) Predictive results, (b) Predictive errors

Fig. 8 Predictive results and Predictive errors of different algorithms on Sunspot dataset. (a) Predictive results, (b) Predictive errors

Fig. 9 Predictive results and Predictive errors of different algorithms on Mackey-Glass dataset. (a) Predictive results, (b) Predictive errors

Fig. 10 Predictive results and Predictive errors of different algorithms on Lorenz dataset. (a) Predictive results, (b) Predictive errors

Figure 5(a) - Fig. 10(a) show the predictive results of DIL, DWE-IL (ELM), and DWE-IL on the six NS-TS datasets, respectively, and all the predictive results are compared with the real values. It can be seen from the six figures that the predictive trends of the three algorithms are basically consistent with the real trend, and that the predictive trend of the DWE-IL algorithm is the closest to the real trend.

Figure 5(b) - Fig. 10(b) show the predictive errors of DIL, DWE-IL (ELM), and DWE-IL on the six NS-TS datasets, respectively. Compared with the other algorithms, the predictive error curve of the DWE-IL algorithm is the smoothest on each dataset. However, if only the predictive errors of the DWE-IL algorithm are considered, its stability is not entirely satisfactory. This also confirms the observation from Table 7 that, due to the interference of noise and the complex nature of the data itself, the generalization performance of the DWE-IL algorithm fluctuates more on some datasets, but remains within an acceptable range. In addition, in order to observe the results of each algorithm more clearly, a logarithmic scale is used to represent the predictive errors.

Table 14 shows the comparison results of RMSEs of NBDM, EnsPKDE&IncLKDE, ANFIS, and DWE-IL on three NS-FTS datasets. It can be perceived from Table 14 that, DWE-IL provides a good solution to the prediction of the NS-FTS and has significantly superior performance over other comparative algorithms.

Table 14 The average values of RMSEs obtained by different algorithms on DJI, N225, and SSE datasets

Table 15 shows the comparison results of RMSEs of NOS-KELM, CICC-two-island, CCRNN-NL, and DWE-IL on three NS-TS datasets. It can be concluded that, DWE-IL provides a general prediction framework for NS-TSP tasks and has excellent generalization performance.

Table 15 The average values of RMSEs obtained by different algorithms on Sunspot, Mackey-Glass, and Lorenz datasets

From Section 3, we can know that the computational complexity of DWE-IL is O(TkCh), where T represents the number of data subsets, k represents the number of current base models, and Ch represents the number of base models that are discarded in this iteration. We choose four algorithms mentioned in Section 4.2.2, analyze their computational complexities, and compare their computational complexities with that of DWE-IL.

The first chosen comparative algorithm is DIL, and its computational complexity is O(KTk(Ch + CH)), where K represents the number of divided data subsets, Tk represents the number of iterations, Ch and CH represent the number of base models and ensemble models that are discarded in the process of an iteration, respectively. The second chosen comparative algorithm is CICC-two-island, and its computational complexity is O(T1T2D(S + N)), where T1 represents the global evolution time, T2 represents the island evolution time, D represents the depth of generations, S and N represent the number of subgroups of the synaptic level and the neuron level, respectively.

The third chosen comparative algorithm is CCRNN-NL, and its computational complexity is O(ckm), where c represents the number of cycles, k represents the total number of neurons in the hidden layer and output layer, and m represents the number of offspring. The fourth chosen comparative algorithm is NOS-KELM, and its computational complexity consists of three parts: 1) the sparse dictionary selection part, and the computational complexity of this part is O(m + 1); 2) the adaptive regularization scheme part, and the computational complexity of this part is O(m2) or O(lm) (when l > m); 3) the kernel weight coefficient update part, and the computational complexity of this part is O(m2). In the above three parts, m represents the number of samples, and l represents the number of iterations of the optimization process.

It can be seen that the computational complexity of DWE-IL is similar to that of CCRNN-NL, and lower than that of DIL and CICC-two-island. According to [41], although each part of NOS-KELM has the lowest computational complexity, its overall running time is no less than that of DWE-IL. In addition, according to the previous experimental results, the prediction performance of DWE-IL is the best among these five algorithms. Therefore, we can conclude that DWE-IL has superior generalization efficiency and generalization performance on the NS-TSP tasks.

5 Conclusions and future works

This paper proposes a novel Incremental Learning Algorithm via Dynamically Weighting Ensemble Learning (DWE-IL), which provides a general framework for solving NS-TSP tasks. The basic principle of DWE-IL is to track real-time data changes by dynamically establishing and maintaining a knowledge base composed of multiple base models. It trains a base model for each subset of the NS-TS and finally combines the base models with dynamic weighting rules. In the DWE-IL algorithm, the most critical components are the update of the data weights and base model weights and the training of the base models. According to the characteristics of NS-TS, this paper proposes corresponding weight update methods and base model training methods. Experiments show that the DWE-IL algorithm provides a good solution to the NS-TSP problem and performs significantly better than the other comparative algorithms.

Although the DWE-IL algorithm has shown good results for NS-TSP, some problems remain to be studied. For example, DWE-IL currently achieves only single-step and univariate prediction for NS-TS. In future work, we will try to further extend the DWE-IL algorithm to solve multi-step-ahead NS-TSP and multivariate NS-TSP problems.