1 Introduction

Streamflow forecasting is necessary because societies have historically settled around, and depended on, river basins. It is fundamental for maintaining essential activities such as agriculture, livestock farming, basic sanitation, hydroelectric power generation, industry, and tourism. Keeping water available requires techniques to identify and predict the behavior of these basins. Such techniques also help to avoid tragedies such as those resulting from floods, droughts, dam ruptures, and disease vectors [14]. From a current societal perspective, improving these techniques is in line with the growth of water resources management and environmental preservation, which are negatively impacted by accelerated urban expansion. Better forecasts support sustainable development and enable decision-making and long-term risk planning by the competent bodies [10].

Historical records contained in time series of water phenomena are often costly and difficult to measure. They also contain noise and missing data, which impair forecasting performance on these time series [6]. The case study of this work, the Paraíba do Sul river basin, has 45 flow measurement stations, all of which have many missing data resulting from station shutdowns or similar events, in addition to hydro-geomorphological modifications and sensor failures that introduce noise into the time series.

The hydro-geomorphological variables of a basin vary in ways that are correlated both temporally and spatially. This means that events such as changes in the records measured by an upstream flow measurement station can influence the forecast for a downstream measurement station. Therefore, it is necessary to consider these phenomena to improve predictive capacity. For example, if a dam is installed in a river basin region, the entire flow downstream of that dam will be affected, so the time series forecasts of stations downstream from the dam need to consider this phenomenon.

The flow time series is susceptible to exogenous and uncertain factors, such as maintenance of the measuring station, typically prompted by sensor measurement failures that require its shutdown. Also, the relationships among the time series distributed along the river basin, when not appropriately used, represent neglected forecasting potential and a waste of the resources spent on flow measurements. Therefore, forecasting methods that are robust to missing data and noise in flow time series are necessary.

MultiTask Learning is an approach to inductive transfer learning that improves generalization by using information from related tasks. Tasks are learned in parallel using a shared representation, so that what is learned for one task can help the others be learned better, as defined in [4]. The MultiTask Learning method can be resilient to missing and noisy data since it considers the temporal and spatial relationships present in the river basin’s flow time series. As a result, the negative effect of missing data or noise that would impair the model’s performance is diminished by the relationships present in the data, since the learning of every time series is combined in a single model. MultiTask Learning also captures information implicit in the relationships between all flow time series along the river basin, making better use of the available data than forecast models applied separately at each measuring station. The motivation of this work is to combine these characteristics of MultiTask Learning with the LSTM recurrent neural network model.

The literature presents promising results in several applications. Jin and Sun [7] showed that multi-task learning (MTL) has the potential to improve generalization by transferring information in training signals of extra tasks. Ye and Dai [15] developed a multi-task learning algorithm, called the MultiTL-KELM, for multi-step-ahead time series prediction. MultiTL-KELM regards predictions of different horizons as different tasks. Knowledge from one task can benefit others, enabling it to explore the relatedness among horizons. Zhao and collaborators [17] introduced a multi-task learning framework that combines the tasks of self-supervised learning and scene classification. The proposed multi-task learning framework empowers a deep neural network to learn more discriminative features without increasing the parameters. The experimental results show that the proposed method can improve the accuracy of remote sensing scene classification. Cao et al. [3] proposed a deep learning model based on LSTM for time series prediction in wireless communication, employing multi-task learning to improve prediction accuracy. Through experiments on several real datasets, the authors showed that the proposed model is effective, and it outperforms other prediction methods.

The prediction of flow time series is widely used for the planning and management of water resources, as evidenced by the work in [13]. That work discusses classic models such as ARIMA and Linear Regression, which are unable to capture the non-stationarity and non-linearity of hydrological time series. It also points to the growing attention given to data-driven models such as neural networks, which have advanced the prediction of non-linear time series by capturing the complexity of hydrological time series. Aghelpour and Varshavian [1] compared two stochastic and three artificial intelligence (AI) models in modeling and predicting the daily flow of a river. The results showed that the accuracy of the AI models was higher than that of the stochastic ones, and that the Group Method of Data Handling (GMDH) and Multilayer Perceptron (MLP) produced the best validation performance among the AI models.

In comparison to several hydrological models, deep learning has made significant advances in methodologies and practical applications in recent years, which have greatly expanded the number and type of problems that neural networks can solve. One of the five most popular deep learning architectures is the long short-term memory (LSTM) network, which is widely applied for predicting time series [11]. LSTM is a specific recurrent neural network architecture that can learn long-term temporal dependencies and be robust to noise. This feature makes it efficient in water resource forecasting problems, as explored in [9], which presented the LSTM model as an alternative to complex models that can include prior knowledge about the behavior of inflows, and in [16], which showed the ability of LSTM to predict water depth for long-term irrigation, thereby contributing to water management for irrigation. However, both works make clear that LSTM needs a considerable amount of data to produce satisfactory results.

The Paraíba do Sul River basin is of great importance for Brazilian economic development and supplies 32 million people [8]. This basin has 45 measurement stations whose captured time series have missing and noisy data, so forecasting this basin’s flow is difficult. The work in [2] showed the efficiency of the LSTM model for flow forecasting in the Paraíba do Sul River basin compared to classic models such as ARIMA, and also pointed out the importance of long-term flow forecasting in this basin. That work used a subset of 4 of the 45 flow measurement stations on the Paraíba do Sul River.

To apply a Machine Learning technique to time series forecasting, it is common to optimize an error measure by training a single forecast model for the desired time series. However, it is sometimes necessary to explore latent information from related series to improve forecasting performance, resulting in a learning paradigm known as Multi-Task Learning (MTL). According to Dorado-Moreno et al. [5], the high computational capacity of deep neural networks (DNN) can be combined with the improved generalization performance of MTL by designing independent output layers for each series and including a shared representation for them. The work of Shireen and collaborators [12], in the context of photovoltaic panels, showed that models using MTL could capture information from several time series simultaneously, with robustness to missing data and noise, making inferences about all historical data and their relationships.

This work proposes a forecasting model that is robust to missing and noisy data and makes long-term flow predictions from information present in the time series of measuring stations located along a hydrographic basin. We use the time series of measuring stations located along the Paraíba do Sul river basin as a case study. The proposed model combines a Deep Learning technique, LSTM, with the transfer learning approach MultiTask Learning (MTL) to take advantage of the implicit relationships between the time series of the measurement stations, making the model robust to missing and noisy data and improving forecast performance.

2 Materials and Methods

2.1 Study Area and Data

The set of series used in this work consists of daily records collected from 1935 to 2016 at 45 flow measurement stations along the Paraíba do Sul River basin, provided by the National Water Agency (ANA). Some measurement stations present missing or noisy data in their collected historical series, as can be seen in Fig. 1, where the non-missing data from a measurement station are shown in red. The missing data in the series come from failures of sensors present in the measuring station or similar problems, which resulted in its shutdown for maintenance.

Fig. 1. Streamflow time series with missing data.

Fig. 2. Percentage of missing data per measurement station.

Missing data and noise are a problem for time series prediction: noise introduces errors in learning the behavior of the time series, while missing data prevents the model from learning what happened when the data were unavailable. Both therefore affect the continuity of the model forecast.

Figure 2 shows the percentage of records missing in the series of flow measurement stations along the basin, whether due to shutdown, maintenance, or defects present in the measurement stations in some period. Three data imputation techniques were used to treat the missing data in the series of this work: the median of the series, the ARIMA method, and the mean of days per month. The MTL-LSTM model, which combines two techniques that are robust to noisy data, MultiTask Learning and LSTM, was used to deal with noise. Because imputing values for missing data itself creates noise, the MTL-LSTM model relies on the correlations among the imputed time series to learn their characteristics.
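As an illustration of the two simpler imputation schemes, the sketch below fills gaps with the series median and with a per-month mean. It is a minimal sketch under stated assumptions, not the procedure used in the experiments: the function names are hypothetical, missing values are assumed to be encoded as NaN, and "mean of days per month" is read here as the mean of the observed values that fall in the same calendar month, which is only one possible interpretation. ARIMA-based imputation is not shown.

```python
import numpy as np


def impute_median(values):
    """Replace every NaN with the median of the observed values (hypothetical helper)."""
    values = np.asarray(values, dtype=float).copy()
    values[np.isnan(values)] = np.nanmedian(values)
    return values


def impute_monthly_mean(values, months):
    """Replace each NaN with the mean of observed values from the same calendar month.

    `months` holds the month number (1-12) of every record; this reading of
    "mean of days per month" is an assumption made for illustration.
    """
    values = np.asarray(values, dtype=float).copy()
    months = np.asarray(months)
    for m in np.unique(months):
        in_month = months == m
        missing = np.isnan(values)
        observed = values[in_month & ~missing]
        if observed.size:  # leave the gap untouched if the month has no data
            values[in_month & missing] = observed.mean()
    return values


flow = [1.0, np.nan, 3.0, 10.0, np.nan, 30.0]
months = [1, 1, 1, 2, 2, 2]
filled_median = impute_median(flow)
filled_monthly = impute_monthly_mean(flow, months)
```

The median fills every gap with the same constant, whereas the per-month mean keeps part of the seasonal pattern of the series.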

2.2 Streamflow Estimation Model

The experiments were carried out with the set of historical series from the 45 flow measurement stations distributed along the Paraíba do Sul river basin, to compare the forecasts made by the MTL-LSTM model with those of LSTM models trained on each series in isolation.

As shown in Fig. 3, the E time series, one per measuring station, are provided as input to the model. They are divided into rolling windows of size j with a step of size 1. At each step, the windows of the E measuring stations are concatenated, forming a matrix with E rows and j columns and a target vector y of size E. These data are then provided to the LSTM, which learns to predict the future behavior of the time series.
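The windowing step described above can be sketched as follows. This is a minimal sketch under assumptions: the function name `make_windows` and the explicit `horizon` argument are hypothetical, and the input is assumed to be an array with one row per day and one column per station.

```python
import numpy as np


def make_windows(series, window, horizon):
    """Build rolling windows with a step of 1 (hypothetical helper).

    series  : array of shape (T, E), one column per measuring station
    window  : window size j
    horizon : number of steps between the last input value and the target

    Returns X of shape (n, E, j) and y of shape (n, E).
    """
    series = np.asarray(series, dtype=float)
    n_steps, n_stations = series.shape
    n_samples = n_steps - window - horizon + 1
    if n_samples <= 0:
        raise ValueError("series too short for this window and horizon")
    X = np.empty((n_samples, n_stations, window))
    y = np.empty((n_samples, n_stations))
    for i in range(n_samples):
        X[i] = series[i:i + window].T            # E rows, j columns
        y[i] = series[i + window - 1 + horizon]  # one target per station
    return X, y


toy = np.arange(40).reshape(20, 2)  # 20 days, 2 stations
X, y = make_windows(toy, window=3, horizon=2)
```

Recurrent layers in Keras expect the time axis before the feature axis, so under this layout `X` would be transposed with `X.transpose(0, 2, 1)` before being passed to an LSTM.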

Fig. 3. MultiTask-LSTM model

The experiments were run in the Google Colab environment with 12 GB of RAM, on GPUs, using the Keras, NumPy and TensorFlow libraries in Python. All results are reported as the average of 30 runs. The MAPE metric, defined in Eq. 1, was chosen to compare the results:

$$MAPE = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right| \tag{1}$$

where \( A_t \) is the historical time series value in time t, \( F_t \) is the value predicted in time t, and n is the size of the time series.
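For reference, the MAPE definition can be computed directly with NumPy. The function name is hypothetical, and the check for zero observed values is an added safeguard, since the ratio is undefined when \( A_t = 0 \).

```python
import numpy as np


def mape(actual, forecast):
    """Mean absolute percentage error, in percent (hypothetical helper)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    if np.any(actual == 0):
        raise ValueError("MAPE is undefined when an observed value is zero")
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))


# Both forecasts are off by 10% of the observed value.
example = mape([100.0, 200.0], [110.0, 180.0])
```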

The LSTM applied in the MTL-LSTM model used the hyperparameters suggested by Campos et al. [2]. The same hyperparameters were used to build a single-task learning LSTM (STL-LSTM) that models each of the time series collected at the 45 measurement stations separately.

The MTL-LSTM model uses 14-day windows as in the work of Campos et al. [2], with 45 reference stations and is written as:

$$\begin{aligned} Q_{1,t+14} &= F(Q_{1,t}, Q_{1,t-1}, \cdots, Q_{1,t-13})\\ Q_{2,t+14} &= F(Q_{2,t}, Q_{2,t-1}, \cdots, Q_{2,t-13})\\ &\;\;\vdots\\ Q_{k,t+14} &= F(Q_{k,t}, Q_{k,t-1}, \cdots, Q_{k,t-13})\\ &\;\;\vdots\\ Q_{45,t+14} &= F(Q_{45,t}, Q_{45,t-1}, \cdots, Q_{45,t-13}) \end{aligned}$$

where \(Q_{k,t+14}\) is the streamflow at station k predicted 14 days ahead.
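One way such a shared model could be expressed in Keras is sketched below. This is an illustrative sketch only, not the architecture used in the experiments: the number of hidden units, the optimizer, the loss, and the use of a single dense output layer with one unit per station are all assumptions, since the hyperparameters of [2] are not reproduced here. Only the 14-day window and the 45 stations come from the text.

```python
import numpy as np
import tensorflow as tf

N_STATIONS = 45    # stations in the basin (from the text)
WINDOW = 14        # input window in days (from the text)
HIDDEN_UNITS = 64  # hypothetical value, chosen only for illustration


def build_mtl_lstm(n_stations=N_STATIONS, window=WINDOW, hidden_units=HIDDEN_UNITS):
    """Shared LSTM representation with one output per station (assumed layout)."""
    # Keras expects (time steps, features): each time step carries all stations.
    inputs = tf.keras.Input(shape=(window, n_stations))
    shared = tf.keras.layers.LSTM(hidden_units)(inputs)
    outputs = tf.keras.layers.Dense(n_stations)(shared)
    model = tf.keras.Model(inputs, outputs)
    # Optimizer and loss are assumptions, not taken from the paper.
    model.compile(optimizer="adam", loss="mae")
    return model


small = build_mtl_lstm(n_stations=3, window=5, hidden_units=4)
pred = small(np.zeros((2, 5, 3), dtype="float32"))
```

Because all stations share the recurrent layer, a single training run updates the representation used for every station, whereas the single-task setting requires one model per series.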

We trained the model on the first 75% of each time series, used the following 10% as a validation set to verify the hyperparameters, and reserved the last 15% as the test set. Each experiment was performed 30 times, and the MAPE was averaged over these runs to assess the final performance of the model.
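The chronological split can be sketched as follows. The function name and the rounding of the boundaries are assumptions, while the 75/10/15 proportions come from the text. The data are not shuffled, so that the validation and test sets always lie after the training period.

```python
import numpy as np


def chronological_split(series, train_frac=0.75, val_frac=0.10):
    """Split a time-ordered array into train, validation and test parts (hypothetical helper)."""
    n = len(series)
    train_end = int(round(n * train_frac))
    val_end = train_end + int(round(n * val_frac))
    return series[:train_end], series[train_end:val_end], series[val_end:]


days = np.arange(100)
train, val, test = chronological_split(days)
```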

3 Computational Experiments

Figure 4 shows that the MTL-LSTM model performs considerably better than the LSTM model when the median is used to impute missing data in the time series presented to the models, except for station 58218000. The results are evidence of the MTL-LSTM model’s capacity to learn hydro-geomorphological relations in the basin while ignoring the noise added by constant median imputation.

Fig. 4. MultiTask-LSTM and LSTM comparison with median missing data imputation

When ARIMA or the mean of days per month is applied to impute missing values, as in Figs. 5 and 6, MTL-LSTM performs considerably better than LSTM at all measurement stations. This behavior shows the robustness of the MTL-LSTM model in learning the relations among series in the basin when the data are more accurate, since these two imputation methods preserve the seasonality and variability of the time series. It also suggests that MTL-LSTM would perform better than LSTM with no missing data.

Fig. 5. MultiTask-LSTM and LSTM comparison with ARIMA missing data imputation

Fig. 6. MultiTask-LSTM and LSTM comparison with mean of days per month missing data imputation

Table 1 summarizes the results of the experiments. We can observe that the MultiTask-LSTM model obtains average percentage errors around half of those achieved with the individual LSTM models. Note that while the LSTM models achieved percentage errors above 40%, the MultiTask-LSTM model achieved MAPEs below 22%. As shown in Fig. 7, the MultiTask-LSTM also has the advantage of a shorter training time, as it places all flow measurement stations in the same model. The approach based only on LSTM, on the other hand, is considerably slower because it trains a model for each time series separately.

Fig. 7. Time comparison between MultiTask-LSTM and LSTM

Table 1. Mean MAPE for each streamflow measurement station by imputation method and model.

4 Conclusion

Flow forecasting in river basins is an essential issue for well-being and social development. To ensure adequate environmental, social and economic conditions, it is necessary to study models that improve long-term flow forecasting, especially for time series with many missing data, noise, and hydro-geomorphological changes, as is the case for flow time series.

Using the MultiTask Learning technique together with the LSTM Deep Learning model makes it possible to absorb the information present in the data of all the time series of the measuring stations of a basin. In other words, it reuses the knowledge learned from the time series of one measuring station in the learning of the other series of that basin. The Paraíba do Sul River Basin, located in Brazil, was used as a case study for this work. However, the model can be applied to forecasting in other basins where multiple flow measurement stations collect data, especially if these measuring stations have time series with noise or missing data.

The study used three missing data imputation techniques to verify the robustness of the MTL-LSTM model against noisy data. As can be seen in Figs. 4, 5 and 6, the MTL-LSTM model achieved considerably better percentage errors in all missing data imputation scenarios than the LSTM models applied to long-term forecasts of each series of the flow measurement stations located along the Paraíba do Sul river basin. The MTL-LSTM model also presented a shorter training time when compared to the LSTM models, as seen in Fig. 7.

The transfer learning approach present in the MTL-LSTM model allowed the improvement of long-term forecasts. Results from all measuring stations in the hydrographic basin demonstrated the robustness of the model to the data imputation procedure, as it maintained a stable performance across the different imputations.