1 Introduction

Basin water resource planning and management require accurate data on rainfall. Understanding and modelling rainfall is one of the most complicated issues of the hydrologic cycle because of the complexity of the atmospheric processes that trigger rainfall and the wide range of scales over which it varies in space and time (Hung et al. 2009; Sumi et al. 2012). Indeed, rainfall forecasting poses a great challenge in operational hydrometeorology, despite several advances in weather forecasting over the last decades (Li and Lai 2004; Hong 2008; Majumdar et al. 2021).

There are two well-known approaches to forecast precipitation (Luk et al. 2001). The first involves modelling the rainfall process through the underlying physical laws, expressed as mathematical equations for the hydrological processes (Solomatine and Ostfeld 2008). The second relies on pattern recognition, which identifies precipitation patterns from their characteristics in historical data and uses them to predict the evolution of rainfall (Ghumman et al. 2011; Diodato and Bellocchi 2018; Ridwan et al. 2021). The latter approach is adopted in this contribution, since the immediate priority is to predict monthly rainfall at specific locations within a basin.

The improvements in pattern recognition methodologies have led to the extensive use of machine learning (ML) tools to solve problems in many research areas, including the spatial prediction of several natural hazards, e.g., flooding (Band et al. 2020a), landslides (Moradi et al. 2019), wildfires (Watson et al. 2019), tropical cyclones (Li et al. 2017; Yang et al. 2020; Yuan et al. 2021) and storm surge (Bai et al. 2022). Ghazvinei et al. (2018) applied an extreme learning machine to describe sugarcane growth and, consequently, to improve agricultural production. Choubin et al. (2019) proposed several ML-based models for predicting earth fissures. Likewise, ML algorithms (e.g., boosted regression tree, random forest, parallel random forest, regularized random forest, randomized trees) have also been applied to flash flood (e.g., Band et al. 2020a) and gully erosion (e.g., Band et al. 2020b) susceptibility modelling. Fernández-Alvarez et al. (2019) calculated the power of an intraocular lens to be implanted after cataract surgery by applying a multilayer perceptron (MLP) neural network. Recently, Shabani et al. (2020) used several ML methods (Gaussian Process Regression, K-Nearest Neighbors, Random Forest and Support Vector Regression) to predict evaporation, a complex and nonlinear phenomenon and one of the most critical components of the hydrological cycle. While Wang et al. (2017a,b) found that an MLP model fed with regional data was skillful in predicting monthly evaporation in different climate regions of China, Ghorbani et al. (2018) noted that a hybrid MLP-Firefly Algorithm performed better. Recently, Fang and Shao (2022) and Li et al. (2022) applied the Long Short-Term Memory (LSTM) network to predict rainfall-runoff in the Han and Elbe river basins, respectively.

The literature reveals that ML has gained popularity for water management, rainfall prediction and the solution of hydrological problems (Ridwan et al. 2021; Barrera-Animas et al. 2022). Several authors (Teschl et al. 2007; Krasnopolsky and Lin 2012; Nastos et al. 2013; Hardwinarto et al. 2015; Kashiwao et al. 2017; Vathsala and Koolagudi 2017; Chao et al. 2018; Haidar and Verma 2018; Benevides et al. 2019; Anochi et al. 2021; Narejo et al. 2021; Ridwan et al. 2021; Sun et al. 2021; Venkatesh et al. 2021) have used artificial neural networks (ANNs) to perform rainfall forecasts. Ridwan et al. (2021) used neural network regression to estimate rainfall in Tasik Kenyir, Terengganu, and Anochi et al. (2021), using a supervised ANN model, proposed a new approach to predict seasonal precipitation over South America. Besides, Krasnopolsky and Lin (2012), based on data learning with the ANN technique, developed a multi-model cluster approach to improve short-term rainfall forecasting over the continental United States, while Teschl et al. (2007) enhanced weather radar forecasts with feed-forward ANNs. Following a similar approach, Benevides et al. (2019) forecast hourly intense rainfall using a time series feed-forward ANN that integrates meteorological data and Global Navigation Satellite System (GNSS) observations. Similarly, Zhao et al. (2021) used the precipitable water vapour predicted from GNSS as input data for a supervised learning algorithm for hourly rainfall forecasting. Kashiwao et al. (2017) applied an MLP ANN to predict local rainfall in areas of Japan with data from the Japan Meteorological Agency. Likewise, Haidar and Verma (2018), applying a one-dimensional deep Convolutional Neural Network (CNN), proposed a new forecasting method to estimate monthly rainfall for a selected area of eastern Australia. Furthermore, Lee et al. (2018) developed an ANN-based model for rainfall forecasting in South Korea’s Geum River basin. The proposals by Haidar and Verma (2018) and Lee et al. (2018) stand out as novel because they use climate indices to predict monthly accumulated precipitation. Narejo et al. (2021) found that a temporal Deep Belief Network (DBN) model outperforms the CNN, particularly in forecasting rainfall time series, but it requires more computational resources than other deep learning architectures. Additionally, Samad et al. (2020) used an LSTM model to forecast rainfall in subsequent days. Using satellite data from the NASA Global Precipitation Measurement mission, Gamboa-Villafruela et al. (2021) trained a convolutional LSTM architecture for precipitation nowcasting. With a similar approach, Bhuiyan et al. (2020) and Derin et al. (2020) used satellite rainfall observations to train random forest and neural network models to improve precipitation estimates in water resources applications.

On the other hand, stochastic modelling techniques are also common for forecasting time series in hydrological applications. Several authors (e.g. Wang et al. 2013; Mahmud et al. 2017; Hernández et al. 2017; Bonakdari et al. 2019; Ebtehaj et al. 2019) have employed Autoregressive Integrated Moving Average (ARIMA) models to forecast precipitation, but these models have limited ability to capture the major nonlinear features of rainfall series (Narejo et al. 2021). Nevertheless, the ability of ANN models to perform a non-linear mapping between inputs and outputs makes them a useful alternative for rainfall forecasting at short and long terms (Ali et al. 2020; Wu et al. 2010; Dounia et al. 2014; Nourani et al. 2019). Several authors (e.g., Zhang 2003; Aladag et al. 2012; Moeeni and Bonakdari 2017) have shown that combining ANNs and ARIMA models can improve forecasting accuracy beyond what either achieves separately (Dodangeh et al. 2020). Indeed, hybrid ANN-ARIMA models have been widely used in several applications, e.g., for estimating PM10 pollution (Wongsathan and Seedadan 2016), modelling the rainfall-runoff process (Nourani et al. 2011), predicting water quality (Faruk 2010) and fuelwood prices (Koutroumanidis et al. 2009), and determining the annual energy cost budget (Jeong et al. 2014). Overall, both ANNs and ARIMA models are especially applicable when: (a) mathematical simulation of the physical phenomena is too difficult or impossible; and (b) the parameters needed for mathematical simulation cannot be determined with adequate precision (Xiaojian and Quan 2009). Such situations arise quite often in water resources management, rainfall-runoff modelling, flood forecasting and related problems, making ANNs and ARIMA models viable options.

Accurate rainfall forecasting for hydrological watershed management remains a challenge because of the spatial and temporal variability of precipitation (Xu et al. 2014). Therefore, given the findings of previous works (e.g., Zhang 2003; Haidar and Verma 2018; Lee et al. 2018; Bonakdari et al. 2019; Ebtehaj et al. 2019; Dodangeh et al. 2020), this study aimed to develop a hybrid model based on ANNs (MLP, CNN and LSTM) and ARIMA models for predicting monthly rainfall totals over hydrological basins. A skillful model for forecasting monthly rainfall will be useful for planning water management to mitigate the impact of dry periods. Furthermore, an accurate forecast over a year will also help decision-makers allocate the available water for agriculture, industry and the population in general. As the study region, we selected the Almendares-Vento catchment, considered one of the most important basins in Cuba because it supplies freshwater to the city of Havana.

2 Materials and methods

2.1 Study Area

The study area spans the Almendares-Vento basin, which is located in the western region of Cuba, on the northern slope of the country (Fig. 1), and is recognized by the National Center for Hydrographic Basins (CNCH, Spanish acronym) as a basin of national interest (García-Fernández and Díaz 2017). The Almendares watershed covers an area of 422 km2. Its hydrographic network comprises intermittent streams, and its main river is the Almendares, with a length of 49.8 km (Rivera 2009). The Vento underground basin is divided into two large regions: karstic and non-karstic. Gentle slopes and undulating plains predominate, and a system of sub-horizontal monocline blocks with well-marked stratification stands out (Valcarce-Ortega et al. 2007).

Fig. 1
figure 1

Almendares-Vento basin. The red points and black numbers represent the gauge station locations and the station IDs, respectively

2.2 Data

Monthly rainfall recorded at gauge stations in the Almendares-Vento basin and climatic indices from different sources were collected as predictors of rainfall. The rainfall data were retrieved from the National Institute of Hydraulic Resources (INRH, Spanish acronym) of Cuba. The spatial distribution of the gauge stations is shown in Fig. 1. The compiled rainfall records contained missing values, which were filled with data from nearby stations or with the station average. The dataset was built from the predictors listed in Table 1: time series of rainfall, sea surface temperature (SST), sunspot number and climatic indices were included as input features.

Table 1 Statistics of each rainfall predictor. Niño indices, AMO and SST are given in ºC and rainfall in mm (σ: standard deviation, µ: mean value; Rainfall ID-N refers to the rainfall at the gauge station with identification number N in Fig. 1)

The dependence of model parameters on climatic features plays an important role in the response of catchments to climatic variability (Sivapalan et al. 2011). Thus, following Haidar and Verma (2018) and Lee et al. (2018), several climatic indices that modulate climatic conditions in the North Atlantic Ocean were used to predict the variability of rainfall in the Almendares-Vento basin. The North Atlantic Oscillation (NAO) index (Jones et al. 1997) was extracted from the Climatic Research Unit (CRU). The NAO is one of the major variability modes of the Northern Hemisphere atmosphere over the Atlantic Ocean (Hurrell 1995); it is often defined as the normalized pressure difference between a station in Iceland and one in the Azores. The Niño 1.2, Niño 3.0, Niño 3.4 and Niño 4.0 indices (Rayner et al. 2003) were extracted from the Global Climate Observing System (GCOS). The Atlantic Multi-decadal Oscillation (AMO) is a coherent mode of natural variability in the North Atlantic Ocean, based on the average SST anomalies over the North Atlantic basin, typically over 0–70º N (Enfield et al. 2001). The Southern Oscillation Index (SOI) is a standardized index based on the observed sea level pressure differences between Tahiti and Darwin, Australia; it measures the large-scale fluctuations in air pressure between the eastern and western tropical Pacific during La Niña and El Niño events. The sunspot values were compiled from the Solar Influences Data Analysis Center (SIDC), and the SST averaged over the Atlantic Ocean between 5 and 50 ºN and 5–100 ºW was taken from the COBE SST2 database (Hirahara et al. 2014) at the National Oceanic and Atmospheric Administration (NOAA). In this dataset, the daily SST field is constructed as the sum of a trend, interannual variability and daily changes, using in situ SST and sea ice concentration observations. All these climatic indices were considered as predictors of the monthly rainfall in the Almendares-Vento basin. The source, minimum, maximum, mean, median and standard deviation (σ) of each predictor are given in Table 1. Figure SM1 in the Supplementary Material (SM) also shows the time series of each rainfall predictor.

2.3 Artificial Neural Networks

ANNs are mathematical models that employ learning algorithms inspired by the functioning of the brain to store information (Keijsers 2010). ANNs are sets of algorithms composed of computational elements, called neurons, which receive signals from the environment or from other neurons, transform those signals and produce an output signal that can be passed to the environment or to other neurons (Fernández-Alvarez et al. 2019). The first neural network, the perceptron, was introduced in 1957 (Hung et al. 2009); it contained a single input layer, and the outputs were obtained directly from the inputs through weighted connections. The MLP neural network was developed in 1960 and gradually became one of the most widely used ANNs for solving various problems (Velasco et al. 2019; Ren et al. 2021). The MLP is a feed-forward network that comprises one or more hidden layers (Vathsala and Koolagudi 2017). Signals are transmitted through the network in one direction, from input to output; because there are no loops, the output of a neuron does not influence the neuron itself. The power of the MLP stems precisely from its non-linear activation functions (Rhee and Shin 2018). Learning is carried out in a supervised manner, in which the desired output must be known in order to update the internal connection weights between layers (Hussain et al. 2020). Connection weights are adjusted iteratively according to the calculated error until the error reaches an adequate minimum, which implies that the final output is close to the target (Ebecken 2011; Panchal et al. 2011; Hossain et al. 2013; Popescu et al. 2009). MLPs are usually trained with the back-propagation algorithm.

The CNN is a deep neural network originally developed for image analysis. Its power stems from a particular type of layer known as the convolutional layer. The structure of a CNN is thus a multi-layered feed-forward neural network built by stacking many hidden layers in sequence. These layers are generally divided into three types: convolutional, pooling and fully connected (Sakib et al. 2018).

The LSTM neural network was proposed by Hochreiter and Schmidhuber (1997) and is widely used today because of its superior performance in accurately modelling both short- and long-term dependencies in data. LSTMs are specifically designed to remember long-term dependencies. As a recurrent neural network, the LSTM contains a self-loop; the distinction between the LSTM and ordinary recurrent networks lies in the internal architecture. The most noteworthy element of the LSTM is the cell state, which carries information along the whole chain with only a few linear interactions (Fathi and Shoja 2018).

2.4 Autoregressive Integrated Moving Average (ARIMA) Models

ARIMA models are among the most frequently used automatic forecasting algorithms (Tseng et al. 2002; Aguado-Rodríguez et al. 2016). They are used to forecast univariate time series, in which the predictions are based only on past values of the analyzed variable. ARIMA models are appropriate for short-term forecasts because most of them give more weight to the recent past than to the distant past.

Box et al. (1994) developed the classical methodology for building time series models (e.g., ARMA, ARIMA) to obtain predictions. According to Hyndman and Khandakar (2008) and Aguado-Rodríguez et al. (2016), a common obstacle when using ARIMA models for prediction is that the order selection process is generally considered subjective and difficult to apply. Readers interested in the details of the method are referred to the specialized literature (Kotu and Deshpande 2019; Hyndman and Athanasopoulos 2018).

2.5 Methodology

The dataset consists of time series built from the gauge station records and the predictors shown in Table 1. To evaluate the skill of the ANN models in forecasting the monthly rainfall, the rainfall records of selected gauge stations (ID-12, ID-15, ID-284, ID-338, ID-441, ID-451; see Fig. 1) in the Almendares-Vento basin and the climatic indices from 2015 to 2019 were used. These gauge stations were selected because their rainfall data were available from 2015 to 2019. About 120 features, representing the values of the predictors over the previous calendar year, were generated to estimate the monthly rainfall totals for the next calendar year, as sketched below.
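A minimal Python sketch (not the authors' code) of how such lagged features could be assembled is given below; the column names and the single-target simplification are hypothetical and serve illustration only. With ten monthly predictors and twelve lags, this construction yields the roughly 120 features mentioned above.

import pandas as pd

def build_lagged_features(df, n_lags=12):
    # df: monthly DataFrame whose columns are the predictors, e.g. 'rain_ID12',
    # 'sst', 'sunspots', 'nao', ... (hypothetical names). For each target month,
    # the previous n_lags monthly values of every predictor form one feature row.
    lagged = {}
    for col in df.columns:
        for lag in range(1, n_lags + 1):
            lagged[f"{col}_lag{lag}"] = df[col].shift(lag)
    X = pd.DataFrame(lagged, index=df.index)
    y = df["rain_ID12"]  # hypothetical target: rainfall at gauge station ID-12
    data = pd.concat([X, y.rename("target")], axis=1).dropna()
    return data.drop(columns="target"), data["target"]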

2.5.1 Parameter Tuning

To train the ANN models, we applied an algorithm similar to that previously described by Haidar and Verma (2018). In summary, the dataset was partitioned into three distinct parts for training, validation and testing. Some researchers (e.g., Wu et al. 2010) have used 50% of the data for training and the remaining 50% for testing, whereas others (e.g., Venkatesh et al. 2021) used 70% and 30% for training and testing, respectively. In this work, after several tests, we used 70% of the dataset to learn the weights and biases of the neurons in each ANN model, 15% for validation to select the proper architecture during training, and the remaining 15% for testing. The training and validation steps are critical phases to optimize the model parameters and to prevent overfitting (Chen et al. 2016; Sadeghi et al. 2020). In addition, a checkpoint was included to save the network weights whenever performance over the validation dataset improved, as sketched below.
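A hedged sketch of the chronological 70/15/15 split and the checkpoint callback follows; the variable names (X, y) come from the feature-building sketch above and are illustrative rather than the authors' code.

from tensorflow.keras.callbacks import ModelCheckpoint

n = len(X)
i_train, i_val = int(0.70 * n), int(0.85 * n)
X_train, y_train = X.iloc[:i_train], y.iloc[:i_train]
X_val, y_val = X.iloc[i_train:i_val], y.iloc[i_train:i_val]
X_test, y_test = X.iloc[i_val:], y.iloc[i_val:]

# Save the network weights whenever the loss over the validation set improves
checkpoint = ModelCheckpoint("best_weights.h5", monitor="val_loss",
                             save_best_only=True, save_weights_only=True)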

The optimal ANN architectures and ARIMA configuration were determined by trial and error. The MLP neural network was configured with two layers: the input and output layers (Fig. 2a). The selected architecture for the CNN is made up of two convolutional layers, average pooling and fully connected layers (Fig. 2b). Additionally, the LSTM network was configured with a single LSTM layer and two fully connected layers (Fig. 2c). Dropout, a regularization technique, was added to the CNN and LSTM architectures (Srivastava et al. 2014; Haidar and Verma 2018). In all implemented ANNs, we used the Rectified Linear Unit (ReLU) as the activation function; its superior performance is commonly attributed to sparsity (Glorot et al. 2011; Sun et al. 2015). Formally, the ReLU activation function returns y = x when x ≥ 0 and y = 0 when x < 0. The networks were trained by minimizing the mean absolute error (MAE), i.e., the MAE was used as the loss function. Additionally, we used the Adaptive Moment Estimation (Adam) optimizer (Kingma and Ba 2017). According to Kingma and Ba (2017), this method is computationally efficient, well suited to problems with large amounts of data and to non-stationary objectives, and has small memory requirements. An illustrative sketch of the CNN configuration is given below.
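The following Keras sketch illustrates the CNN configuration described above (two convolutional layers, average pooling, dropout, fully connected layers, ReLU activations, MAE loss and the Adam optimizer); the filter counts, kernel sizes and dropout rate are placeholders, not the tuned values of this study.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv1D, AveragePooling1D, Dropout,
                                     Flatten, Dense)

n_features = 120  # lagged predictors of the previous calendar year (Sect. 2.5)
n_outputs = 12    # monthly rainfall totals of the next calendar year

cnn = Sequential([
    Conv1D(32, kernel_size=3, activation="relu", input_shape=(n_features, 1)),
    Conv1D(16, kernel_size=3, activation="relu"),
    AveragePooling1D(pool_size=2),
    Dropout(0.2),          # regularization, as in the CNN and LSTM architectures
    Flatten(),
    Dense(64, activation="relu"),
    Dense(n_outputs),
])
cnn.compile(optimizer="adam", loss="mae")  # MAE loss, Adam optimizer

In practice, the inputs would be reshaped to (samples, n_features, 1) for the Conv1D layers and the targets would be the 12 monthly totals of the following year, with the checkpoint callback from the previous sketch passed to fit().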

Fig. 2
figure 2

Artificial Neural Networks architecture for: (a) Multi–Layer Perceptron (MLP); (b) Convolutional Neural Network (CNN); and (c) Long Short-Term Memory (LSTM)

Furthermore, an exploratory analysis of the gauge station rainfall data reveals a certain seasonal periodicity over the period analyzed. To estimate this periodicity, the data were filtered and smoothed with the Fast Fourier Transform (FFT) algorithm (a minimal sketch is given below). Figure 3 displays the temporal evolution of the smoothed mean rainfall in the Almendares-Vento basin, which is characterized by a seasonal period of about 48–60 months (4–5 years); thus, wet and dry periods alternate approximately every 4–5 years. These results were used to configure the ARIMA model with seasonal integration and a seasonal moving average term.
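A rough Python sketch of this smoothing and of reading off the dominant period is shown below; the low-pass cut-off is an illustrative choice rather than the value used in this study.

import numpy as np

def fft_smooth(series, keep_fraction=0.05):
    # Keep only the lowest-frequency Fourier components of a 1-D monthly series
    coeffs = np.fft.rfft(series)
    cutoff = max(1, int(keep_fraction * len(coeffs)))
    coeffs[cutoff:] = 0.0
    return np.fft.irfft(coeffs, n=len(series))

def dominant_period(series):
    # Period (in months) of the largest spectral peak, excluding the mean term
    spectrum = np.abs(np.fft.rfft(series - np.mean(series)))
    freqs = np.fft.rfftfreq(len(series), d=1.0)  # cycles per month
    return 1.0 / freqs[1:][np.argmax(spectrum[1:])]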

Fig. 3
figure 3

Smoothed mean rainfall in Almendares-Vento basin by applying the Fast Fourier Transform algorithm

The rainfall time series of each gauge station were processed in a numerical experiment with and without differencing (D = 1 and D = 0 in the ARIMA model, respectively). The best results (not shown) were obtained for D = 1, considering moving average terms at lags 1, 2, …, 12; therefore, the configuration applied to all gauge stations was ARIMA(0, 1, 12) with seasonal periods of 48 and 60 months.

The Python Keras package (Chollet 2015), running on top of the TensorFlow framework, was employed to implement and train the CNN, MLP and LSTM neural networks, while the arima function in MATLAB was configured with the rainfall dataset of each gauge station as ARIMA(0, 1, 12), with seasonal periods of 48 (ARIMA4) and 60 (ARIMA5) months, respectively.
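For readers working outside MATLAB, an approximately equivalent specification in Python's statsmodels is sketched below; reading the configuration above as seasonal integration plus a seasonal moving-average term is our assumption, not a statement of the authors' exact MATLAB setup.

from statsmodels.tsa.statespace.sarimax import SARIMAX

def fit_seasonal_arima(rain_series, seasonal_period=48):
    # ARIMA(0,1,12) with a seasonal difference and a seasonal MA term at the
    # given period (48 months for ARIMA4, 60 months for ARIMA5)
    model = SARIMAX(rain_series,
                    order=(0, 1, 12),
                    seasonal_order=(0, 1, 1, seasonal_period))
    return model.fit(disp=False)

# Example usage: forecast the next 12 monthly totals at one gauge station
# result = fit_seasonal_arima(rain_id12, seasonal_period=48)
# forecast = result.forecast(steps=12)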

2.5.2 Performance Measurements

To evaluate the accuracy of the models, several statistics widely employed in rainfall forecasting were computed: MAE (Eq. 1), BIAS (Eq. 2), Root Mean Square Error (RMSE, Eq. 3), Pearson correlation coefficient (rp, Eq. 4), Nash-Sutcliffe efficiency (NS, Eq. 5) and coefficient of variation (CV, Eq. 6):

$$MAE=\frac{{\sum }_{i=1}^{n}\left|{x}_{i}-{y}_{i}\right|}{n}$$
(1)
$$BIAS=\frac{{\sum }_{i=1}^{n}\left({x}_{i}-{y}_{i}\right)}{n}$$
(2)
$$RMSE=\sqrt{\frac{{\sum }_{i=1}^{n}{\left({x}_{i}-{y}_{i}\right)}^{2}}{n}}$$
(3)
$${r}_{p}=\frac{{\sum }_{i=1}^{n}\left({x}_{i}-\overline{x}\right)\left({y}_{i}-\overline{y}\right)}{\sqrt{{\sum }_{i=1}^{n}{\left({x}_{i}-\overline{x}\right)}^{2}}\sqrt{{\sum }_{i=1}^{n}{\left({y}_{i}-\overline{y}\right)}^{2}}}$$
(4)
$$NS=1-\frac{{\sum }_{i=1}^{n}{\left({x}_{i}-{y}_{i}\right)}^{2}}{{\sum }_{i=1}^{n}{\left({y}_{i}-\overline{y}\right)}^{2}}$$
(5)
$$CV=\frac{{\sigma }_{x}}{\overline{x}}$$
(6)

In all equations, yi is the observed value, xi is the predicted value, \(\overline{y}\) and \(\overline{x}\) are the averages of the observed and predicted values, respectively; n is the number of elements in the dataset, 1 ≤ i ≤ n; and σy, σx are the standard deviations of the observed and predicted values, respectively. The RMSE and MAE values range between 0 and ∞; the closer a value is to 0, the better the model forecast. Pearson correlation values range between − 1 and + 1, and the closer the value is to 1, the better the forecast. To determine the statistical significance of the Pearson correlation coefficient, we applied the two-tailed significance test (Weathington et al. 2012). The NS values range between -∞ and 1. Negative NS values indicate that the forecasting model is not a better predictor than the mean of the measured values. According to Wu et al. (2010), NS is a good “goodness-of-fit” measure because it is sensitive to differences in the observed and forecasted means and variances. Additionally, NS = 0 is regularly used as a benchmark to differentiate ‘good’ from ‘bad’ models (Houska et al. 2014).
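The error statistics of Eqs. (1)–(6) translate directly into code; a straightforward NumPy implementation (x is the predicted series and y the observed series, as above) is:

import numpy as np

def mae(x, y):
    return np.mean(np.abs(x - y))

def bias(x, y):
    return np.mean(x - y)

def rmse(x, y):
    return np.sqrt(np.mean((x - y) ** 2))

def pearson_r(x, y):
    xd, yd = x - x.mean(), y - y.mean()
    return np.sum(xd * yd) / np.sqrt(np.sum(xd ** 2) * np.sum(yd ** 2))

def nash_sutcliffe(x, y):
    return 1.0 - np.sum((x - y) ** 2) / np.sum((y - y.mean()) ** 2)

def coeff_variation(x):
    return np.std(x) / np.mean(x)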

For calibrating and evaluating hydrological models, the Kling–Gupta efficiency (KGE) (Gupta et al. 2009; Kling et al. 2012) has recently been used to summarize model performance (Knoben et al. 2019). Therefore, to gain a complete overview of the ability of each ANN and ARIMA model to predict the monthly rainfall in the Almendares-Vento basin, we used the KGE index, defined according to Eq. (7):

$$KGE=1-\sqrt{{\left({r}_{p}-1\right)}^{2}+{\left(\frac{{\sigma }_{x}}{{\sigma }_{y}}-1\right)}^{2}+{\left(\frac{\overline{x}}{\overline{y}}-1\right)}^{2}}$$
(7)

Similar to the NS efficiency, KGE = 1 indicates a perfect agreement. Several authors have used positive KGE values as indicative of “good” model simulations and negative KGE values as indicative of “bad” simulations (see references in Knoben et al. 2019). Conversely, Knoben et al. (2019) pointed out that KGE values higher than − 0.41 indicate that a model improves over the mean flow benchmark, although this criterion was derived for streamflow in a basin. Therefore, in this study we assumed that higher KGE values indicate a model that is more skillful at predicting the monthly rainfall. The KGE has been widely applied to evaluate model performance in predicting rainfall time series (e.g., Thiemig et al. 2013; Towner et al. 2019; Gebremichael et al. 2022; Shahid et al. 2021; Girihagama et al. 2022; Li et al. 2022).

Additionally, to evaluate the relative improvement of one model over another, we used the KGE skill score (KGESS), defined according to Eq. (8), in which KGEa and KGEb are the scores of the model of interest and of the comparative or baseline model, respectively:

$${KGE}_{SS}=\frac{{KGE}_{a}-{KGE}_{b}}{1-{KGE}_{b}}$$
(8)

A positive KGESS indicates improved skill, while a negative score indicates that the model of interest performs worse than the baseline. The KGESS was previously used for this purpose by Towner et al. (2019).
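For completeness, Eqs. (7) and (8) can be expressed in code as follows, reusing pearson_r from the sketch above:

import numpy as np

def kge(x, y):
    r = pearson_r(x, y)             # correlation term
    alpha = np.std(x) / np.std(y)   # variability ratio
    beta = np.mean(x) / np.mean(y)  # bias ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

def kge_skill_score(kge_a, kge_b):
    # Improvement of the model of interest (a) over the baseline model (b)
    return (kge_a - kge_b) / (1.0 - kge_b)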

The flowchart shown in Fig. 4 summarizes all the steps followed in this work to identify the models with the highest performance to predict the monthly rainfall in the study region and, accordingly, to generate the rainfall forecast. This schematic representation could be a useful guide for future applications of this methodology in other hydrological basins.

Fig. 4
figure 4

Schematic representation of the methodology followed in this work. MLP: Multi-Layer Perceptron, CNN: Convolutional Neural Network, LSTM: Long Short-Term Memory, FFT: Fast Fourier Transform, ARIMA4 and ARIMA5: Autoregressive Integrated Moving Average models with seasonal periods of 4 and 5 years, respectively

3 Results and Discussion

3.1 ANN Training

The statistics of the ANN architectures during the training and testing procedures are depicted in Fig. 5. The loss function over the training dataset was not very different from that over the validation dataset, except for the LSTM (Fig. 5c), which shows a larger separation than the CNN (Fig. 5a) and the MLP (Fig. 5b), although with similar monotonic behaviour. These results suggest that the networks did not overfit during the training phase (Chollet 2015). According to Barrera-Animas et al. (2022), overfitting is one of the major drawbacks in rainfall forecasting. Table 2 presents the MAE, BIAS, RMSE, rp and NS obtained for each ANN model during the training and testing steps. In the training phase, the best weights were obtained at epochs 2898, 995 and 1805 for the CNN, MLP and LSTM models, respectively. The statistics in Table 2 suggest that the monthly rainfall predicted by the CNN during the training and validation procedures was in good agreement with the observed rainfall.

Fig. 5
figure 5

Loss function while training the neural networks: (a) Convolutional Neural Network (CNN), (b) Multi–Layer Perceptron (MLP), and (c) Long Short-Term Memory (LSTM). The red and blue curves denote losses during training and validations phase, respectively

Table 2 Statistics for each ANN model during the training and testing phases. The epoch denotes when the best loss value was obtained during training. Mean absolute error (MAE), BIAS and Root Mean Square Error (RMSE) are given in mm; the Pearson correlation coefficient (rp) and Nash-Sutcliffe efficiency (NS) are dimensionless. Pearson correlation coefficients higher than 0.27 are statistically significant (p < 0.05)

Each training run took approximately 5, 13 and 19 min for the MLP, CNN and LSTM architectures, respectively. Nevertheless, the training time depends on the available computing resources; in this work, we used a computer with 32 CPUs and 128 GB of RAM. It is worth noting that the time required for the training process also depends on the learning rate, which controls how much the model changes in response to the estimated error each time the model weights are updated. A small learning rate leads to a long training process, whereas too large a value may result in sub-optimal weights. Furthermore, the differences in the training times of the three neural networks are due to their different complexities. As the trial-and-error method was applied, the total training time needed to determine the model with the best-fitted weights also depends on the number of times each ANN architecture was trained. However, once the model that best fits the accumulated monthly rainfall during the training phase has been found, the predictions are obtained relatively quickly.

3.2 Comparative Analysis of the Forecasting Models

The rainfall data in the Almendares-Vento basin and the climate indices from January 2015 to December 2019 were used to assess the ability of the ANN and ARIMA models, as noted above. The validation of model performance was based on several statistical metrics and model performance criteria. Figure 6 and Table SM1 show the statistics of each predicting model. Apparently, the LSTM model exhibits the best performance for the mean rainfall over the whole basin and at the previously selected gauge stations. In agreement with Kim and Bae (2017), Kim and Won (2018) and Kumar et al. (2019), the LSTM performs better in terms of the RMSE, MAE and NS. However, although several studies (e.g., Kim and Bae 2017; Liu et al. 2017; Kratzert et al. 2018; Kim and Won 2018; Kumar et al. 2019) suggest the superiority of LSTM models, we found in this study that it is the worst at predicting the peaks of maximum rainfall, as shown in Fig. 7 (for the mean rainfall) and Figure SM2 (for the monthly total precipitation at each gauge station). The main reason for this behaviour is that heavy rainfall events were scarce in the data, and thus it was difficult for the ANN models, and particularly for the LSTM, to learn such features. Besides, very few rain events might not activate neurons (Zhang et al. 2021). Since the LSTM model is designed to learn long-term dependencies, the size of the training dataset was probably too small for an optimal fit of the model weights. It is important to remark that Fig. 7 also shows that the simulated rainfall series preserve the observed seasonality.

Fig. 6
figure 6

Comparative analysis of the statistics of each forecasting model: (a) Mean Absolute Error, (b) BIAS, (c) Root Mean Square Error, (d) Pearson correlation coefficient, (e) Nash-Sutcliffe coefficient and (f) coefficient of variation. In (d) the horizontal black dashed line indicates the threshold (rp = 0.27) for a statistically significant rp at the 95% confidence level. In (f) the marker “+” denotes the coefficient of variation of the observed data at each gauge station

Moreover, the precipitation time series predicted by the CNN, MLP and ARIMA models are consistent with the observed series, with the CNN model showing the best performance. This result agrees with the findings of Haidar and Verma (2018), who noted the ability of CNN models to predict rainfall time series. From Fig. 6 (see also Table SM1), the CNN predicted the mean rainfall with the lowest MAE (46.41 mm), BIAS (4.22 mm) and RMSE (65.78 mm), and it also showed skill in predicting the monthly rainfall at the gauge stations. It must be highlighted that the CNN showed a statistically significant Pearson correlation coefficient (p < 0.05) between the predicted and observed rainfall, both for the mean rainfall and at the gauge stations. Furthermore, the CNN model predicted the mean rainfall with the smallest difference between the coefficients of variation of the observed and simulated data. Overall, the CNN model is able to cope with the non-stationarity and seasonality of the rainfall time series.

Fig. 7
figure 7

Mean observed rainfall values and the ANN and ARIMA models outputs in the whole basin from January 2015 to December 2019

To assess the performance of each model in predicting the monthly rainfall, we computed the KGE index for the mean rainfall and for the rainfall at the selected gauge stations. By definition, the KGE combines several metrics, namely the Pearson correlation coefficient, the standard deviation and the mean (Eq. 7), and therefore provides a more complete estimate of each model's accuracy (Osuch et al. 2015; Towner et al. 2019). As noted above, higher KGE values are indicative of better model performance. The validation results obtained with the KGE index are shown in Fig. 8. The KGE values of the mean rainfall predictions ranged from 0.53 (for the LSTM model) to 0.65 (for the CNN), which confirms the ability of the CNN to estimate the monthly mean rainfall over the study region. Additionally, our analysis indicates that, in general, the ARIMA models are better at predicting rainfall at the selected gauge stations. It must be noted that the KGE value for the LSTM is the highest of all models at gauge station ID-441. Based on these findings, the CNN and ARIMA4 models again exhibit the best performance. The ability of the CNN model could be attributed to its filters, which capture recurrent patterns that are then used to forecast future values. Apparently, this property of the CNN model favours its application to time series forecasting (Fotovatikhah et al. 2018; Moazenzadeh et al. 2018; Chong et al. 2020). The overall low KGE values at gauge stations ID-284, ID-338, ID-441 and ID-451 can be attributed to the magnitude of the bias of the model predictions. This behaviour agrees with Gebremichael et al. (2022), who found that the KGE value tends to decrease when the model overestimates the precipitation. It is worth noting that the data from stations ID-12 and ID-15 represent 10.5% and 26.4% of the dataset, while the data for the remaining stations account for 8.8% (ID-338), 4.6% (ID-451), 4.1% (ID-441) and 3.4% (ID-284). Therefore, the better performance of all models for the mean rainfall over the whole basin and at gauge stations ID-12 and ID-15 could be attributed to the greater data availability during the training process. Accordingly, the ANN and ARIMA models learned the rainfall variability at stations ID-12 and ID-15 better than at the other stations. This hypothesis agrees with Abiodun et al. (2018) and Hijazi et al. (2020), who noted that ANNs usually need large datasets for training to achieve substantial prediction accuracy.

Fig. 8
figure 8

Kling–Gupta efficiency (KGE) for ANN and ARIMA models to predict the mean rainfall over the Almendares-Vento basin and in gauge stations from January 2015 to December 2019

In order to compare the performance of the models in estimating the mean monthly rainfall, the KGE skill score (Eq. 8) was also used. This parameter quantifies the relative improvement achieved when using one model instead of another: the denominator of the KGESS expression measures the difference between the KGE of a perfect fit (1.0) and the KGE of the baseline model, i.e., the maximum possible improvement, while the numerator measures the difference between the KGE of the model being compared and that of the baseline model, i.e., the actual improvement in fit. The maximum possible KGESS value is 1, attained only if the model compared against the baseline fits the data perfectly (KGE = 1). The magnitude of the KGESS is therefore directly proportional to the degree of improvement in the mean monthly rainfall estimation with respect to the baseline model, and a positive KGESS indicates that the compared model has better predictive performance than the baseline.

Figure 9 displays the KGE skill score of all models compared against each other in predicting the mean rainfall over the whole basin. Simple inspection of the figure reveals that the CNN outperformed all models, followed by ARIMA4 > MLP > ARIMA5 > LSTM. The improvement ranges from 0.25 (CNN over LSTM) to 0.008 (MLP over ARIMA5). Note that the improvement of the CNN over ARIMA4, the second best performing model, was 0.075, a value that represents 27% of improvement relative to the minimum improvement achieved between two models in the range.

The KGE skill score for predicting the rainfall at the gauge stations (Figure SM3) confirmed the results shown in Fig. 8. ARIMA4 showed the highest improvement at station ID-12, ARIMA5 at ID-15, MLP at ID-284, LSTM at ID-441 and ARIMA5 at ID-451, while ARIMA4 and CNN were better at predicting the monthly rainfall time series at ID-338. Interestingly, the improvement of the LSTM over the other models at station ID-441 ranged from 0.26 to 0.38, which means that the performance of the LSTM model at this gauge station is much better than that of the rest of the models. Overall, the KGE skill score confirmed that the CNN and ARIMA4 performed better.

Fig. 9
figure 9

Kling–Gupta efficiency skill score (KGESS) in predicting the mean rainfall over the Almendares-Vento basin from January 2015 to December 2019

To investigate further, we analyzed the ability of each model to predict the rainfall amounts in each month of the year from 2015 to 2019. Table 3 shows the RMSE, BIAS and KGE values of the mean rainfall series predicted for each month of the year. In terms of RMSE, the CNN model is better in February, April, June and November, the LSTM performs best in March, July, September and October, while ARIMA4 is most skillful in January, August and December and ARIMA5 in May and July. In terms of BIAS, the CNN (August, October and November), the MLP (February, April, May and June) and the LSTM (March, July and September) show the best performance. The Wilcoxon Signed Rank Test revealed that the CNN model has the overall best performance in terms of the RMSE, confirming the results discussed above. Additionally, the CNN exhibits improved forecast capacity in terms of the KGE in six months of the year (February, April, June, September, October and November), while ARIMA4 shows the best prediction for August and December. Likewise, ARIMA5 is better for predicting May and July, and the LSTM shows the best prediction of rainfall only in March. Similar results were found for the monthly rainfall prediction in each month at every gauge station (Tables SM2 to SM7). It is striking that the worst performance of all models occurred in January, which can be related to the outlier of 240 mm registered for that month (Fig. 7). January 2016 was described as extremely wet: an average rainfall of 99.7 mm was recorded for the Cuban archipelago (212% of the historical value), with 178.1 mm (339%) in the western region and 298.3 mm (424%) in Havana. This unusual behaviour was conditioned by the presence of a strong El Niño/Southern Oscillation (ENSO) event in the equatorial Pacific Ocean (González-Pedroso and Estévez 2016). Overall, according to these findings, the CNN model performs better than all the other models.

Table 3 RMSE, BIAS and KGE values for the mean rainfall series estimated for each month of the year from 2015 to 2019. The best statistics for each month are highlighted in bold text. The best model to predict the rainfall each month was selected based on the maximum value of the KGE index

Based on the statistical indices shown in Table 3, a hybrid model combining the ANN and ARIMA models was developed in order to exploit the efficiency of each forecasting model, as previously suggested by Zhang (2003) and Haidar and Verma (2018). In this contribution, the model with the maximum KGE value in each month was selected as the decision criterion (a sketch of this selection rule is given below). This strategy aims to increase the forecast accuracy by increasing the correlations and minimizing the errors of the forecasting process. The temporal variations of the rainfall observed and predicted by the hybrid model are presented in Figs. 10 and SM4. Both figures reveal that the hybrid model fits the observed rainfall better than the individual models. The results of the hybrid model implementation are presented in Table 4. The statistical evaluation shows the high accuracy of the hybrid model based on the RMSE, BIAS, rp, NS and KGE indices. The KGE values for the mean rainfall and the selected gauge stations are notably higher than 0.5, and the NS values are always positive, confirming the ability of the hybrid model to predict the monthly rainfall in the study region. Following the breakdown of KGE values into four benchmark categories (Kling et al. 2012), the performance of the hybrid model can be classified from Intermediate (0.5 ≤ KGE < 0.75) to Good (KGE ≥ 0.75). This approach was previously applied by Thiemig et al. (2013) for evaluating satellite-based precipitation products and by Towner et al. (2019) for assessing the performance of global hydrological models in capturing peak river flows in the Amazon basin. On the other hand, Xu et al. (2020) found NS values ranging from 0.35 to 0.75 for runoff prediction in the Hun River basins, while Bhagwat and Maity (2012) previously achieved NS values from 0.58 to 0.68. Overall, the hybrid model captured particular features of the dataset better than the individual models, in agreement with previous works (e.g., Zhang 2003; Aladag et al. 2012; Dodangeh et al. 2020).
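A hedged sketch of this selection rule follows: for every calendar month, the forecast of whichever individual model (CNN, MLP, LSTM, ARIMA4, ARIMA5) achieved the highest KGE for that month is retained. The data structures are illustrative, not the authors' implementation; kge is the function defined in Sect. 2.5.2.

import pandas as pd

def build_hybrid(forecasts, observed):
    # forecasts: dict {model_name: pd.Series of predictions indexed by month};
    # observed: pd.Series with the same DatetimeIndex.
    best_model = {}
    for month in range(1, 13):
        sel = observed.index.month == month
        scores = {name: kge(fc[sel].values, observed[sel].values)
                  for name, fc in forecasts.items()}
        best_model[month] = max(scores, key=scores.get)  # highest KGE wins
    hybrid = pd.Series(index=observed.index, dtype=float)
    for month, name in best_model.items():
        sel = observed.index.month == month
        hybrid[sel] = forecasts[name][sel]
    return hybrid, best_model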

Fig. 10
figure 10

Mean rainfall observed values, the CNN and the hybrid model outputs in the whole basin from January 2015 to December 2019

Table 4 RMSE, BIAS, rp, NS and KGE values obtained using the hybrid model for predicting the mean rainfall over the whole basin and at each gauge station

Figure 11 highlights the improvement of the hybrid model in forecasting the mean monthly rainfall over the whole Almendares-Vento watershed and at each gauge station. The fact that all KGESS values are positive means that the hybrid model predicts the monthly rainfall better than the rest of the models over the whole basin and at each gauge station, except at station ID-441, where the LSTM model has a similar performance. The improvement ranges from 0.48, between the hybrid and LSTM models at gauge station ID-284, to 0, between the same models at ID-441. For the whole basin, the improvement in the mean monthly rainfall prediction ranges from 0.26 with respect to the CNN model to 0.44 with respect to the LSTM model. This improvement can be considered substantial.

Fig. 11
figure 11

Hybrid model improvements in the prediction of the monthly rainfall time series in the Almendares-Vento basin from January 2015 to December 2019, based on the Kling–Gupta efficiency skill score (KGESS)

4 Conclusions

We used Multi-Layer Perceptron (MLP), Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) artificial neural network (ANN) models, together with Autoregressive Integrated Moving Average (ARIMA) models, to develop a hybrid model (ANN + ARIMA) aimed at forecasting monthly rainfall totals in hydrological watersheds. The study region was the Almendares-Vento basin, one of the most important basins in Cuba. The ANN models were trained using the monthly rainfall, sunspot, sea surface temperature and climatic variability mode (NAO, Niño 1.2, Niño 3.0, Niño 3.4, Niño 4.0, AMO, SOI) time series of the previous calendar year to predict the rainfall of the next calendar year, while the ARIMA models were adjusted using the rainfall seasonality.

Additionally, to assess the ability of each model to predict the monthly rainfall in the study region, the ANN and ARIMA models were compared. This study showed that, in general, the CNN model performs well in forecasting monthly rainfall amounts in the Almendares-Vento basin. However, better accuracy was reached through a hybridization of the ANN and ARIMA models, with the maximum monthly KGE as the selection criterion. This approach ensures an efficient performance, as shown by the statistical indices (MAE, BIAS, RMSE, rp, NS and KGE) obtained for the mean rainfall and the selected gauge stations in the Almendares-Vento basin.

Although this work focused on forecasting the monthly rainfall over the Almendares-Vento basin, the results demonstrate the reliability of the hybrid model (ANN + ARIMA) for predicting rainfall time series for water management. The monthly precipitation time series recorded at gauge stations within the basin and the climatic indices used as input data also allow the ANN and ARIMA models to learn the seasonal variations of rainfall, improving the prediction accuracy and consequently the skill of the hybrid model. Additionally, our method can be easily applied to forecast rainfall in other hydrological basins.

In summary, this work proposes a new route to enhance water management in hydrological basins by improving rainfall forecasting. However, the main limitation to the extensive application of the present approach could be the unavailability of long-term time series of monthly rainfall within a basin. In future studies, the network model structure will be further assessed and optimized to achieve more accurate predictions.