
1 Introduction

Time series data consists of successive, equally spaced data instances that vary over time, and arises in domains such as IoT [1], health-care monitoring [1], financial monitoring, anomaly detection [2], and scene labeling [3]. Time series analysis requires studying the behavior of the data and extracting its underlying patterns in order to predict future values accurately. The presence of anomalies and outliers poses a major challenge in this analysis: identifying such conditions accurately allows the underlying model to plan preventive or reactive measures. Because time series data is non-stationary and varies frequently, multiple conditional behaviors can be observed [4]. It is therefore important to build a robust and reliable model that can handle different aspects of the data and make accurate predictions accordingly, while not being biased toward specific behaviors.

An RNN (Recurrent Neural Network) [5] uses its internal memory to process sequences of inputs, unlike a feedforward neural network, which makes it efficient for time series analysis. A basic RNN can be constructed using a single instance of the model, but a single instance may not be capable of extracting all patterns in the data. Moreover, time series analysis is delay intolerant and needs real-time prediction; one network overloaded with the whole analysis takes longer to process and produce results, which makes it inefficient. A possible solution would be to create multiple instances of the model, but identical models learn similar functions and extract the same patterns, so this is also inefficient. A generalized model can instead be constructed by creating separate basic models optimized for different patterns and then forming an ensemble of them. Ensemble learning significantly improves the predictive performance and generalizability of the model [6].

LSTM (Long Short-Term Memory) [7] is one of the most frequently used RNN models for time series analysis, natural language processing, and sequence modelling and learning. RNNs tap into the temporal and sequential aspects of the data in a fixed number of computational steps, making them fast and suitable for delay-intolerant applications. Along with the advantages of a traditional RNN, the LSTM's simple structure, its capability to handle potentially infinite dynamic data, and its ability to store important information for a significantly longer duration make it a preferred model for time series tasks such as time series prediction [8], event forecasting [9], and anomaly detection [10]. Multiple LSTM networks replacing a single deep LSTM network [2, 5] increase accuracy by letting each network analyze a single behavior or a group of similar behaviors. An ensemble LSTM trains each component model independently, so that no two models are identical.

The final prediction from an ensemble of LSTMs requires aggregation. A basic aggregation technique averages the predictions of the different networks; however, since the LSTM networks are trained for different functions, giving them equal importance leads to inaccurate predictions. In this paper, we propose two weighted aggregation techniques that give appropriate importance to the different LSTM networks and thereby further increase the model's accuracy. The rest of the paper is organized as follows: Sect. 2 describes the problem faced in ensemble LSTM aggregation, Sect. 3 presents the proposed intelligent aggregation methods, Sect. 4 reports the results obtained from extensive experiments, and Sect. 5 concludes with the findings of the proposed work.

2 Problem Description

Given a time series data set \( \left\{ {x_{i} , y_{i} } \right\}_{i = 1}^{t - 1} \) containing labels or predictions up to time instance \( (t - 1) \), we need to predict an accurate \( y_{t} \) for each data instance \( x_{t} \) from time \( t \) onwards. An ensemble LSTM network containing \( n \) differently trained LSTMs \( \left\{ {L_{k} } \right\}_{k = 1}^{n} \) is built. The existing aggregation is done by

$$ y_{t} = \frac{1}{n}\mathop \sum \limits_{k = 1}^{n} L_{k} (x_{t} ) $$

Since this aggregation gives equal weight to all networks, even the least accurate network can adversely affect the final prediction. Our aim is to develop an aggregation technique that gives appropriate importance to the underlying LSTMs based on their individual performance. Additionally, the aggregation should give higher preference to recent instances in order to capture the current trend.
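As a concrete baseline, a minimal NumPy sketch of this equal-weight averaging is given below; the `models` list and the `predict` method on each LSTM are hypothetical placeholders for illustration, not part of the paper.

```python
import numpy as np

def mean_aggregate(models, x_t):
    """Baseline: average the predictions of all n LSTMs with equal weight.

    `models` is assumed to be a list of trained LSTM wrappers, each exposing
    a `predict(x)` method that returns the forecast for input x.
    """
    preds = np.array([m.predict(x_t) for m in models])
    return preds.mean(axis=0)
```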

3 Aggregation of Ensemble Thinned LSTMs

A time series predictor harnesses the power of the LSTM and takes advantage of ensemble networks to provide an accurate prediction for the next time instance. Given a time series up to \( (t - 1) \), with the point at time step \( i \) represented by \( x_{i} \), the next predicted data point \( x_{t} \) at time \( t \) is given by

$$ x_{t} = f(x_{1} , x_{2} , \ldots x_{t - 1} :\phi_{t - 1}^{*} ) $$
(1)

Here \( f \) is the function responsible for capturing the behavior of the time series and \( \phi_{t - 1}^{*} \) are the parameters of the model trained up to \( (t - 1) \). The parameters are updated with each new observation: as a data instance arrives, the model parameters are re-estimated by minimizing

$$ \phi_{t - 1}^{*} = \mathop{\arg\min}\limits_{\phi_{t - 1}} \parallel x_{t - 1} - f\left( {x_{1} , x_{2} , \ldots x_{t - 2} :\phi_{t - 2}^{*} } \right)\parallel_{2}^{2} $$
(2)
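A rough sketch of this rolling predict-then-update loop is shown below; the `predict_next`/`update` wrapper interface is a hypothetical assumption for illustration and is not defined in the paper.

```python
def rolling_forecast(model, series):
    """One-step-ahead prediction with online parameter updates (Eqs. 1-2).

    `model` is a hypothetical wrapper: `predict_next(history)` plays the role
    of f(.; phi*) and `update(inputs, target)` re-fits phi by minimising the
    squared error on the newest observation.
    """
    predictions = []
    for t in range(2, len(series)):
        history = series[:t]                              # all observations seen so far
        predictions.append(model.predict_next(history))   # forecast of series[t] (Eq. 1)
        model.update(history[:-1], history[-1])           # refresh phi* on the latest point (Eq. 2)
    return predictions
```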

3.1 Long Short Term Memory

LSTM is an advanced version of the RNN which can retain extracted information for an extended interval of time compared to the basic RNN used for deep learning. LSTM is able to store relevant or indicative information for a sufficiently long time. It also automatically selects important features and discards irrelevant ones to avoid the curse of dimensionality [11].

A single LSTM unit consists of a current-memory gate, an input gate, an output gate and a forget gate. A single layer of LSTM consists of multiple such units, and the whole LSTM network contains several such layers.

  • Current Memory

    It is the information currently stored in the LSTM cell and used to make predictions. The previous cell state \( C_{t-1} \) is updated to produce the new cell state \( C_{t} \):

    $$ C_{t} = f_{t} *C_{t - 1} + i_{t} *\tilde{C}_{t} $$
    (3)

    The output of the current time step is the prediction for the immediately following step. The LSTM also stores the cell state and information necessary for subsequent predictions. After the cell state has been updated, the output prediction for the immediate time step is given using a sigmoid function

    $$ O_{t} = \sigma (W_{o} *\left[ {h_{t - 1} ,x_{t} } \right] + b_{o} ) $$
    (4)

    here, with respect to the output gate, Wo is its weight, bo its bias, and Ot the output given by the cell at time (t).

  • Forgetting Mechanism

    It enables the model to decide whether to remember or discard previously acquired information. The sigmoid function used for the forget gate is

    $$ f_{t} = \sigma (W_{f} *\left[ {h_{t - 1} ,x_{t} } \right] + b_{f} ) $$
    (5)

    here, Wf is the forget gate’s weight, bf denotes its bias, and ht-1 represents the output of the previous time instance (t − 1).

  • Saving Mechanism

    It allows the model to extract new information from newly arrived data. The input gate is computed as

    $$ i_{t} = \sigma (W_{i} *\left[ {h_{t - 1} ,x_{t} } \right] + b_{i} ) $$
    (6)

    here, Wi and bi represent the weight and bias of the input gate respectively. A tanh activation function is used for computing the candidate cell state. The vector of new candidate values to be stored in the LSTM cell is defined as

    $$ \tilde{C}_{t} = tanh(W_{c} *\left[ {h_{t - 1} ,x_{t} } \right] + b_{c} ) $$
    (7)

    here, Wc and bc are the weight and bias of the current memory gate respectively.
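For illustration, a compact NumPy sketch combining Eqs. (3)–(7) into a single cell update is given below; the weight/bias layout and the final step h_t = o_t * tanh(C_t) follow the standard LSTM formulation and are assumptions rather than details taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step. W and b hold the per-gate weights and biases
    (keys 'f', 'i', 'c', 'o'); each W[k] maps the concatenated [h_prev, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W['f'] @ z + b['f'])        # forget gate, Eq. (5)
    i_t = sigmoid(W['i'] @ z + b['i'])        # input (saving) gate, Eq. (6)
    C_tilde = np.tanh(W['c'] @ z + b['c'])    # candidate memory, Eq. (7)
    C_t = f_t * C_prev + i_t * C_tilde        # new cell state, Eq. (3)
    o_t = sigmoid(W['o'] @ z + b['o'])        # output gate, Eq. (4)
    h_t = o_t * np.tanh(C_t)                  # hidden state used for the prediction
    return h_t, C_t
```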

The LSTM network architecture used for prediction is similar to the baseline [12], as shown in Fig. 1.

Fig. 1. LSTM representation

3.2 Ensemble LSTM

Since a single network specializes in a specific pattern or a group of similar patterns, it tends to become less accurate as time grows. An ensemble of weak models is more powerful than an individual deep model. A single deep model takes more time to train and predict, which adversely affects latency-intolerant analysis; it may also overfit to some peculiar trend in the data, and outliers and anomalies have a larger impact on a single instance. A collection of independent, differently trained models solves these problems. Independent models also control bias by introducing variance, making the overall model independent of both the training data and the model's architecture.

In time series prediction with LSTM neural networks, variance among the different models of the ensemble is introduced through dropout regularization [13]. Dropout regularization allows neural network layers to drop units along with their connections during training, as shown in Fig. 2b. These thinned networks are therefore trained independently and differently from each other, since different neurons are dropped from each of them. Because of their different training, the models do not co-adapt, and dropping different neurons also helps avoid overfitting.
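One possible way to construct such an ensemble of thinned networks with a Keras-style API is sketched below; the layer sizes, dropout rate and window length are illustrative assumptions, not the paper's exact configuration (which uses a three-layered LSTM [12]).

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

def build_thinned_ensemble(n_models=5, window=20, units=32, drop_rate=0.2):
    """Create n independently initialised LSTM models; dropout thins each
    network differently during training, so no two learn identical functions."""
    models = []
    for _ in range(n_models):
        m = Sequential([
            LSTM(units, input_shape=(window, 1)),
            Dropout(drop_rate),   # random unit dropping -> a thinned network
            Dense(1),             # next-value prediction
        ])
        m.compile(optimizer='adam', loss='mse')
        models.append(m)
    return models
```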

Fig. 2. Dropout regularization

The final predictions from such thinned networks need an aggregation technique to improve performance. The simplest way to aggregate a fixed-sized model is to average the predictions over all possible settings of the parameters [14]. This increases the combined model's error, as the least accurate network gets the same influence on the prediction as a highly accurate one.

3.3 Intelligent Aggregation

In an ensemble of multiple thinned networks \( \left\{ {L_{k} } \right\}_{k = 1}^{n} \), each \( L_{k} \) gives its prediction for \( x_{t} \), and the final prediction is obtained by aggregating these individual predictions. A simple aggregation averages the predictions, but giving equal importance to all networks defies the logic behind the ensemble LSTM: the worst-performing LSTM gets the same power to affect the final result as the best-performing network. Moreover, not all functions learnt by the thinned networks carry the same information gain, so giving them equal weight does not always increase the model's accuracy.

Weighted Aggregation.

Rather than simply taking the mean of all the predictions and giving them equal importance, an alternative technique is to weight them according to their current performance and take a weighted mean of the predictions to generate the output. Models that perform better, i.e. have a low RMSE (Root Mean Square Error) \( \varsigma \), should be given higher weight than those with a higher \( \varsigma \); hence, the importance coefficient is inversely proportional to \( \varsigma \). The labeled data \( \left\{ {x_{i} ,y_{i} } \right\}_{i = 1}^{t - 1} \) available up to time (\( t - 1) \) is divided into two parts, used for training and coefficient calculation respectively.

$$ \left\{ {x_{i} ,y_{i} } \right\}_{i = 1}^{t - 1} = \left\{ {x_{i} ,y_{i} } \right\}_{i = 1}^{t - l} \mathop \cup \nolimits \left\{ {x_{j} ,y_{j} } \right\}_{j = t - l + 1}^{t - 1} $$

Each \( L_{k} \in \left\{ {L_{k} } \right\}_{k = 1}^{n} \) is trained using the \( (t - l) \) labeled instances \( \left\{ {x_{i} ,y_{i} } \right\}_{i = 1}^{t - l} \). A prediction is then obtained from every thinned LSTM on the remaining labeled instances \( \left\{ {x_{j} ,y_{j} } \right\}_{j = t - l + 1}^{t - 1} \) as

$$ \hat{y}_{kj} = L_{k} \left( {x_{j} } \right) $$

where \( \hat{y}_{kj} \) is the label predicted by LSTM \( L_{k} \) for data instance \( x_{j} \), \( j = t - l + 1, \ldots , t - 1 \). The error \( \varsigma \) is then calculated from

$$ \varsigma_{k} = \sqrt {\frac{{\mathop \sum \nolimits_{j = t - l + 1}^{t - 1} \parallel \hat{y}_{kj} - y_{j} \parallel^{2} }}{l - 1}} $$
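A possible NumPy rendering of this per-network error computation is sketched below; the `predict` interface on each thinned LSTM is an assumption for illustration.

```python
import numpy as np

def per_model_rmse(models, x_val, y_val):
    """Compute each thinned LSTM's error (varsigma_k) on the held-out split
    {x_j, y_j}, j = t-l+1 .. t-1, used for coefficient calculation."""
    errors = []
    for m in models:
        y_hat = np.array([m.predict(x_j) for x_j in x_val])
        errors.append(np.sqrt(np.mean((y_hat - np.asarray(y_val)) ** 2)))
    return np.array(errors)
```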

Weighted m Window Aggregation.

Weighted aggregation considers the whole held-out prediction history of the thinned networks, but in time series analysis recent data is more important than older data, and keeping λ fixed based on past experience alone may lead to erroneous predictions. The models are trained in the same way as in the previous method. The labeled data from t − l + 1 to t − 1 is divided into time windows of size m; ς (and hence the coefficients) is recomputed on each m-sized window, so the importance values adapt to the recent trend. After each m instances, ς is calculated using

$$ \varsigma_{k} = \sqrt {\frac{{\mathop \sum \nolimits_{j = t - m}^{t - 1} \parallel \hat{y}_{kj} - y_{j} \parallel^{2} }}{m}} $$

Since each ς is on a different scale, it needs to be normalized for both aggregation methods. The normalized ς is obtained by

$$ \varsigma_{k} = \frac{{\varsigma_{k} - \mu_{\varsigma } }}{{\sigma_{\varsigma } }} $$

where \( \mu_{\varsigma } \) is the mean of \( \varsigma \) over all n LSTMs and \( \sigma_{\varsigma } \) is its standard deviation. The importance coefficient λ for each thinned network is calculated from

$$ \lambda_{k} = \frac{1}{{\varsigma_{k} }} $$

where λk is the coefficient of the kth LSTM and 1 ≤ k ≤ n. From time instance t onwards, the aggregated prediction for a data instance is obtained using these λ-weighted LSTMs

$$ \hat{y}_{t} = \frac{1}{n}\mathop \sum \limits_{k = 1}^{n} \lambda_{k} L_{k} (x_{t} ) $$
(8)

where \( \hat{y}_{t} \) is the final prediction of the weighted aggregation (Fig. 3).
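Putting the pieces together, the sketch below normalises the errors, converts them to importance coefficients and forms the weighted prediction of Eq. (8); the small epsilon guard against division by values near zero is an added assumption, and the same routine serves the m window variant if the errors are computed on the most recent m instances only.

```python
import numpy as np

def weighted_aggregate(models, x_t, varsigma, eps=1e-8):
    """lambda_k-weighted ensemble prediction (Eq. 8).

    `varsigma` holds each model's RMSE, e.g. from per_model_rmse above
    (or from the most recent m-sized window for the m window variant)."""
    z = (varsigma - varsigma.mean()) / varsigma.std()  # normalised errors
    lam = 1.0 / (z + eps)                              # importance = inverse of error
    preds = np.array([m.predict(x_t) for m in models])
    return (lam * preds).sum() / len(models)           # Eq. (8)
```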

Fig. 3. Weighted aggregation of LSTM

4 Result and Analysis

The analysis was conducted on Yahoo! Webscope [15], a time series benchmark dataset. It consists of four benchmarks (A1–A4). In our analysis, the real (1–3) and synthetic (1–3) datasets were used. Each dataset contains ≈1500 entries distributed among three fields (timestamp, value and is_anomaly).

The real dataset from the A1 benchmark is based on real production traffic to Yahoo! properties, while the synthetic datasets are generated from artificial time series with random seasonality, trend and noise. The former has a periodic interval of one hour, while outliers in the latter are inserted at random positions. The underlying base architecture is a three-layered LSTM network [12]. The proposed ensemble architectures, weighted and m window, are compared against the baseline and the simple mean aggregation model using both the real-world and the synthetic datasets. The parameters used in the experiments are shown in Table 1.

Table 1. Parameters

Figure 4 shows the results obtained from the different aggregation techniques on the A2-synthetic 2 dataset, mapped against the actual data. The performance of a single LSTM is shown in Fig. 4a; this method is unable to predict the ground-truth values and gives an error of ς = 173.439. Simple mean aggregation, on the other hand, predicts more accurate values, as shown in Fig. 4b: its convergence with the actual values is much better than the single model, increasing accuracy by ≈45% over the single LSTM. Weighted aggregation further enhances the prediction accuracy by ≈67% over simple mean aggregation and gives better predictions, as shown in Fig. 4c. The m window variant performs better still, by about ≈65% over the weighted-aggregation-based LSTM. In the case of real data (Table 2), giving preference to the recent data window does not work as well, since the recent trend does not follow the underlying actual annotation closely.

Fig. 4. Aggregation result of synthetic time-series

A comprehensive comparison of the results is given in Table 2. As evident from the results, weighted and m window aggregation clearly outperform the baseline and simple mean aggregation. Overall, the basic single model performs worst, followed by simple mean aggregation. On the real dataset, weighted aggregation performs best of all, as it better approximates the inconsistent manual annotation. Window-based aggregation, on the other hand, works best on synthetic data, as extracting the recent trend helps tackle random seasonality and noise.

Table 2. Prediction error \( (\varvec{\varsigma }) \)

Weighted and m window aggregation help restrict model uncertainty, misspecification and inherent noise [16] to tolerable limits. A model built on incorrect or ignored parameters leads to uncertainty. Dropout regularization helps address this uncertainty because it makes each model different from the others: the probability-based dropping of neurons differs in each network. These thinned models let the overall ensemble avoid wrongly trained or overfitted parameters.

A training dataset containing non-uniform samples that does not cover the whole sample space leads to model misspecification. The proposed methods, by considering the local performance of each thinned network and fine-tuning their importance, help build a generalized model that removes model misspecification. Uncertainty in the data results in inherent noise. This cannot be solved at the model level, since the problem lies with the input data rather than the model itself; it is instead addressed by systematically incorporating online learning, which allows the model to learn with each prediction and fine-tune its parameters for better results on upcoming data. Thus, the dynamic nature of the data is automatically taken into account during model training.

5 Conclusion

The study shows that a single LSTM gives unsatisfactory predictions, while an ensemble of thinned LSTMs increases accuracy. Simple mean aggregation of the ensemble LSTMs gives better results than a single LSTM, but giving equal importance to all individual networks degrades performance. The proposed weighted aggregation method assigns appropriate importance to each thinned network by analyzing its correctness. Especially for manually annotated real data, importance based on exhaustive past data counters the inconsistency in annotation. Weighted aggregation's performance, however, degrades in the presence of random outliers and noise. The m window based aggregation solves this problem by calculating the importance of the thinned networks based on the last m data instances; this recent-trend-based importance factor is able to segregate outliers and noise accurately.