Introduction

Wetlands are an important natural resource: they support a wide variety of plants and animals and contribute to local economies through activities such as cattle grazing and sugar cane harvesting (Maxwell 1957; Young 1977; Tiner et al. 2015; Albarakat et al. 2018). Swamps, wetland ecosystems dominated by dense stands of aquatic plants, are among the most densely vegetated parts of the Earth's water surface. Among these plants is Phragmites, which in Australia can be found in almost every swamp (Al-Handal and Hu 2015). Hydrological changes can destroy wetlands, and climate change and human intervention have led to wetland loss in many parts of the world. Land-use and hydrological changes have, in turn, affected climatic conditions at the local level (Albarakat et al. 2018).

The swamps of Mesopotamia are among the oldest ecosystems in the world. They are located across three Iraqi governorates: Basra, Dhi Qar, and Maysan. The Mesopotamian Marshes are the major wetlands of the Middle East and Western Asia and play an important role in the region's ecosystem. The lower Mesopotamian basin between the Euphrates and Tigris rivers contains flat areas known as flood plains, formed by the accumulation of sedimentary material carried from upstream by surface waters. Currently, the combined area of the three marshes ranges from about 10,500 to 20,000 square kilometers. These are the Hammar Marshes, the Central Marshes, and the Al-Hawizeh Marshes. The entire upper Arabian Gulf ecosystem depends on the hydrology of the Mesopotamian marshes (https://earthobservatory.nasa.gov/images/1716) (Albarakat et al. 2018). Owing to their size, abundance of aquatic vegetation, and isolation from other similar systems, the marshes are crucial for maintaining biodiversity in the Middle East (Al-Handal and Hu 2015; Douabul et al. 2013). They also serve as natural water treatment systems for the Tigris and Euphrates rivers, filtering fertilizers out of the water before it is released into the Arabian Gulf. The drying of over 10,000 square kilometers of wetlands and lakes will have a significant impact on the local microclimate: removing vegetation from wetlands will significantly lower evaporation and moisture, leading to changes in precipitation patterns (Partow 2001), as well as continuous temperature increases, particularly during the long, hot summer, when the reed layer will no longer protect the marsh from strong, dry winds above 40 °C (Maltby 1994).

Ecosystem degradation on this scale can seriously harm human health in several ways, including water scarcity and pollution, extreme thermal conditions, and increased vulnerability to toxic dust storms arising from desiccated salt ponds and dry swamp basins (Pörtner et al. 2022). The exposed salt crusts and dry marsh soil will generate higher volumes of dust, and wind erosion will distribute various impurities, affecting thousands of square kilometers beyond Iraq's borders (Partow 2001). Additionally, wind and sand erosion from the dry swamps and surrounding deserts is likely to degrade the fragile arable land near the former swamp, contributing to land degradation and desertification (Meng et al. 2020). The flow of the Tigris and Euphrates rivers changed in the late 1980s and early 1990s after the construction of dams and canals; the swamps dried up as a result of human-made dams and politically motivated drainage practices (Parsaie 2016).

Degradation here refers to the conversion of vegetated area into arid land. All three of the marshes noted above have shrunk, producing a massive increase in arid area and turning much of the swamp into wasteland. The Hammar and Fasat Marshes are the most severely degraded, with a 95% degradation rate. The Karkheh River continues to supply water from Iran to the northeastern portion of the Hawizeh swamp, preserving about 30% of its area (Partow 2001). Large-scale drainage modification is one of the biggest ecological disasters affecting wetlands worldwide (Mohamed and Hussain 2016).

The majority of the embankments and dams on the Tigris and Euphrates rivers were dismantled by the swamp's residents after the regime responsible for these drainage changes was overthrown in late 2003, and water started to flow back into the swamp (Fitzpatrick 2004). After three years of natural flow, Mesopotamia's swamps started to recover. Between 50 and 60% of the original plant and animal species have returned, demonstrating the wetlands' resilience (Richardson 2005; Richardson et al. 2005).

Although drought is a natural and somewhat unpredictable phenomenon, it can be observed, studied, and predicted using contemporary techniques. A catastrophic drought occurs when the precipitation system fails, affecting the water supply for natural and agricultural systems as well as human activities. Because rain is one of the most important sources of water, its presence or absence has a significant impact on wetlands; in this region, reduced rainfall combined with dams built by neighboring countries has caused drought and shrunk the wetland area (Raj et al. 2018; Adham 2018; Awchi and Jasim 2017).

The main goal of our research is to study and predict rainfall. The University of East Anglia Climatic Research Unit (CRU) provides rainfall records from 1901 to 2020, which serve as the basis for long-term rainfall prediction. We analyzed these data via the Google Earth based CRU TS add-on interface. Previous studies have used satellite imagery to examine the impact of rainfall and climate change on the landscape over 16 years (Rabbani et al. 2022; Alhumaima and Abdullaev 2020). We use hybrid deep learning models to model and predict rainfall from univariate time series data. This research aims to improve monthly rainfall forecasts for the Hawizeh, Central, and Al Hammar marshes. For this purpose, we employ data visualization and exploration techniques (identifying patterns, unusual observations, changes over time, and structural breaks). The hybrid machine learning models used in this research rely on different underlying assumptions about how the data are estimated.

Our approach combines different types of deep neural networks with probabilistic approaches to model uncertainty. Deep learning algorithms alone do not model uncertainty the way Bayesian or probabilistic approaches do; hybrid learning models combine the two kinds to leverage the strengths of each. Our approach (CNN-BDLSTMs) combines a CNN with BDLSTMs, and we find that it outperforms the other models.

Materials and proposed algorithm framework

Study area

The Mesopotamian Marshes of Southern Iraq are situated between 46.4° E and 48° E longitude and 30.5° N and 32.2° N latitude. The wetlands consist of shallow freshwater lakes with varying levels of permanence. The mean annual precipitation and mean annual temperature are less than 25 mm/year and 26.5 °C, respectively, based on the GLDAS study, which allows this land area to be classified as arid (Albarakat et al. 2018; Peltier 1950; Fookes et al. 1971).

Figure 1 shows the normalized difference moisture index (NDMI) for the Mesopotamian marshes in southern Iraq in 2000, 2010, and 2020, derived from MODIS satellite data. The images were taken in October, when climatic conditions had improved; despite this, drought rates were high. The lowest soil moisture index was recorded in 2000 (12%) due to drought and lack of rain; in 2010 the value improved due to re-flooding (30%); and the highest value was recorded in 2020 (56%).

Fig. 1
figure 1

Location of the Mesopotamian marshes in Southern Iraq and NDMI for 2000, 2010, and 2020

Dataset

To analyze the CRU data, we installed the CRU TS interface in Google Earth Pro, selected the study area, and loaded the relevant data. The dataset is updated annually and covers 1901 to 2020. The interface is available on the CRU website: https://crudata.uea.ac.uk/cru/data/hrg/cru_ts_4.02/ge/.

The highest monthly average rainfall in the Hawizeh Marsh was recorded in January 1974 (see Fig. 2). We noted high volatility in rainfall after 1940, indicating a structural break in the variance (a shift in scale). The highest rainfall in the Central Marsh was recorded in January 2004 (see Fig. 3). We noted a change in average rainfall after 1940, indicating a structural break in the monthly average (a shift in location). The highest monthly average rainfall in the Al Hammar Marsh was recorded in April 1939; here too, we noted a change in mean rainfall after 1940, indicating a structural break at the mean level (location) (Fig. 4).

Fig. 2
figure 2

Monthly average rainfall in the Hawizeh Marsh

Fig. 3
figure 3

Monthly average rainfall in the Central Marsh

Fig. 4
figure 4

Monthly average rainfall in the Al Hammar Marsh

Figure 5 shows the strength of the relationships between the variables: the correlation is nearly perfect between the Al Hammar and Hawizeh marshes and between the Central and Hawizeh marshes, and perfect between the Central and Al Hammar marshes. This indicates that rainfall in one region coincides with rainfall in the other regions.
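As an illustration, a correlation matrix of the kind shown in Fig. 5 can be produced with pandas and seaborn. The sketch below uses synthetic series that merely stand in for the three marsh datasets, and the column names are placeholders, not the headers of the published files.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
shared = rng.gamma(2.0, 10.0, size=1440)            # stands in for the common rainfall signal
rain = pd.DataFrame({
    "Hawizeh": shared + rng.normal(0.0, 1.0, 1440),
    "Central": shared + rng.normal(0.0, 1.0, 1440),
    "Al Hammar": shared + rng.normal(0.0, 1.0, 1440),
})

sns.heatmap(rain.corr(), annot=True, cmap="Blues")  # pairwise Pearson correlations
plt.title("Correlation heatmap")
plt.show()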

Fig. 5
figure 5

Correlation heatmap

Proposed algorithm framework

The mechanism underlying our proposed approach for modeling and forecasting average rainfall is depicted in Fig. 6. The algorithm consists of the following steps; the CSV data and the developed hybrid deep learning code are available on GitHub (Abotaleb 2022).

Fig. 6
figure 6

Schematic of the proposed algorithm framework

First step: Datasets on average rainfall in the Hawizeh, Central, and Al Hammar marshes are generated in Google Earth Pro to train the algorithm. The "cruts_4.06_gridboxes.kml" add-on interface is launched to display climatic data from January 1901 to December 2020, and the average rainfall data for each marsh are loaded. Each rainfall dataset is stored in a separate CSV file with two columns: the date and the average rainfall value. Each file contains 1440 rows, one per month. The Al Hammar Marsh file is 13.4 KB, the Central Marsh file 13.5 KB, and the Hawizeh Marsh file 13.7 KB.
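A minimal sketch of how one such CSV file can be read into Python is given below; the file name and column labels are placeholders (the actual headers exported by the interface may differ).

import pandas as pd

# "hawizeh_rainfall.csv" is a hypothetical file name standing in for one marsh's export
df = pd.read_csv("hawizeh_rainfall.csv", parse_dates=[0])
df.columns = ["date", "rainfall"]          # two columns: date and average rainfall (mm)
df = df.set_index("date").sort_index()

assert len(df) == 1440                     # 120 years x 12 months (Jan 1901 - Dec 2020)
series = df["rainfall"].astype("float32")  # univariate series used in the later steps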

Second step: The monthly average rainfall time series for the Hawizeh, Central, and Al Hammar marshes are input into our algorithm. The input parameters for the deep learning model (optimizer, loss function, number of epochs, and network configuration) are then entered, and the algorithm is started.

Third step: Preprocessing and training require memory and time, and backpropagating through extended sequences produces a poorly performing model. The data are therefore prepared via normalization and standardization before being input into the neural networks: the mean is set to 0 and the standard deviation to 1.
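A sketch of this zero-mean, unit-variance scaling using scikit-learn's StandardScaler is shown below; the use of scikit-learn here is our assumption (any equivalent implementation works), and the synthetic array stands in for one marsh's monthly series.

import numpy as np
from sklearn.preprocessing import StandardScaler

# placeholder standing in for one marsh's monthly rainfall series (mm)
rainfall = np.random.default_rng(0).gamma(2.0, 10.0, size=1440).astype("float32")

scaler = StandardScaler()                               # mean -> 0, standard deviation -> 1
scaled = scaler.fit_transform(rainfall.reshape(-1, 1))  # column vector expected by the scaler

# scaler.inverse_transform(scaled) recovers the original millimetre values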

Fourth step: The dataset is split into three sets: training, validation, and testing. The test set comprises 20% of the data. The remaining 80% is split into a validation set (20%) and a training set (80%), i.e., 16% and 64% of the full dataset, respectively. The model is trained on the training set to improve its performance, overfitting is monitored on the validation set during training, and the final performance evaluation is carried out on the test set.
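The split can be expressed as simple slicing of the series. Whether the published code splits chronologically or by shuffling is not stated here, so the sketch below assumes a chronological split, which is common for time series.

import numpy as np

scaled = np.arange(1440, dtype="float32")   # stands in for the standardized series

n = len(scaled)
n_test = int(0.20 * n)                      # 20% held out for final evaluation
rest, test = scaled[: n - n_test], scaled[n - n_test:]

n_val = int(0.20 * len(rest))               # 20% of the remainder for validation
train, val = rest[: len(rest) - n_val], rest[len(rest) - n_val:]
# train, val and test are roughly 64%, 16% and 20% of the full series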

Fifth step: The algorithms are executed for the CNN (Convolutional Neural Network), LSTM, LSTMs (Stacked LSTM), BDLSTM (Bidirectional LSTM), BDLSTMs (Stacked Bidirectional LSTM), GRU (Gated Recurrent Unit), Conv-LSTMs, and Conv-BDLSTMs models.

Sixth step: Evaluation of model performance.

Seventh step: Use best models in forecasting.

Methodology and optimization

In contrast to Bayesian and probabilistic approaches, deep learning algorithms do not account for uncertainty in their calculations. In our proposed model, several varieties of deep neural networks are integrated with probabilistic techniques to describe uncertainty. These hybrid models combine the best features of both types of network. We find that our proposed hybrid model, which combines a convolutional neural network (CNN) with bidirectional long short-term memory (BDLSTMs) and which we call CNN-BDLSTMs, outperforms the other models. Both the data and the code are available from Abotaleb (2022).

Methodology

We use eight deep-learning models to forecast rainfall.

A. Convolutional neural network (CNN)

The convolutional neural network has a double-convolutional-layer architecture to facilitate spatial feature extraction. At each time step t, the input data \({x}_{t}^{s}\) is convolved in one dimension with a sliding one-dimensional kernel filter, which acquires the local perceptual domain (Graves et al. 2013). The convolution kernel filter operates as follows:

$${Y}_{t}^{s}=\sigma \left({W}_{s}*{x}_{t}^{s}+{b}_{s}\right),$$
(1)

where \({Y}_{t}^{s}\) is the convolutional layer output; \({W}_{s}\) are the filter weights; \({x}_{t}^{s}\) is the input sequence at time t; and \(\sigma \) is the activation function.

Because CNNs can be trained to recognize patterns in time series data and use that information to make predictions about the future, they are a valuable tool for anyone working with this type of data (Koprinska et al. 2018). CNNs can also automatically recognize and capture features from the data without presuppositions or feature ordering. They also work well with noisy time series by filtering out the noise in each successive layer, generating a set of useful features and extracting only the meaningful ones (Koprinska et al. 2018).

After the input is passed to the Conv layer, a ReLU activation is used to extract patterns. A max pooling layer then reduces the number of parameters and projects the information to a lower dimension, and a Flatten layer (in Keras) reshapes the resulting tensor into a one-dimensional vector. These mechanisms are displayed in Fig. 7 below.

Fig. 7
figure 7

Architecture and hyper parameters of the proposed convolutional neural network (CNN)

Following the processing mechanism shown in Fig. 7, the model is trained to extract data patterns, with training stopping at the point that achieves maximum accuracy and minimum information loss. The model weights are initialized randomly and optimized layer by layer to improve accuracy on the training data, using the Keras time series data generator (Muftah et al. 2022).
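The following Keras sketch reproduces the Conv, ReLU, max pooling, and Flatten pipeline on sliding monthly windows. The window length, filter count, and epoch count are illustrative placeholders, not the exact hyperparameters reported in Fig. 7, and the windows are built directly with NumPy rather than the Keras data generator mentioned above.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

series = np.random.rand(1440).astype("float32")   # stands in for one scaled marsh series
window = 12                                        # look back one year of monthly values

# build (samples, window, 1) inputs and next-month targets
X = np.stack([series[i: i + window] for i in range(len(series) - window)])[..., None]
y = series[window:]

model = keras.Sequential([
    layers.Input(shape=(window, 1)),
    layers.Conv1D(64, kernel_size=3, activation="relu"),  # Eq. (1): sigma(W * x + b)
    layers.MaxPooling1D(pool_size=2),                      # fewer parameters, lower dimension
    layers.Flatten(),
    layers.Dense(1),                                       # next-month rainfall estimate
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, verbose=0)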

B. Long short-term memory (LSTM)

Long short-term memory (LSTM) was one of the earliest and most effective methods developed to address the problem of vanishing gradients (Hochreiter and Hochreiter 1977; Gers et al. 2002). In this context, "long-term" refers to the fact that simple recurrent neural networks store information about their previous decisions as weights; the weights shift gradually throughout training as new information about the data is retrieved and used to calibrate the model. "Short-term" refers to the short-lived activations that pass from one node to the next. In the LSTM paradigm, a memory cell serves as intermediate storage. Memory cells incorporate multiple multiplicative nodes, making them more complicated units. A generic LSTM unit consists of three gates (input, output, and forget) (Huynh et al. 2017). With the input gate, an LSTM may be programmed to either retain existing data or learn new information. This gate is composed of a sigmoid layer and a tanh layer: the tanh layer generates a vector of potential new values to be added to the LSTM (Zhang et al. 2019a, b), whereas the sigmoid layer specifies which values will be modified. The final result is derived from these layers as follows:

$${i}_{t}=\sigma ({W}^{i}{x}_{t}+{U}^{i}{h}_{t-1}+{b}^{i}),$$
(2)
$${u}_{t}=tanh({W}^{u}{x}_{t}+{U}^{u}{h}_{t-1}+{b}^{u}),$$
(3)

where \({i}_{t}\) is the updated value; \({u}_{t}\) is the vector of new candidate values; \(\sigma \) is the sigmoid layer (a nonlinear function); \({x}_{t}\) is the input at time step t; \(b\) is a constant bias; \(h\) is the RNN memory at time step t; and \(W\) and \(U\) are weight matrices.

Forget gates, whose sigmoid functions are used to choose data for deletion from the LSTM, are discussed in detail in Song et al. (2020). This determination relies heavily on the values of \(h\) and \({x}_{t}\). The gate output \(f\) takes values between 0 and 1, where 0 signifies full erasure of the stored value and 1 represents complete preservation. This result is derived by:

$${f}_{t}=\sigma ({W}^{f}{x}_{t}+{U}^{f}{h}_{t-1}+{b}^{f}),$$
(4)

where \({f}_{t}\) is the forget gate output; \(\sigma \) is the sigmoid layer (a nonlinear function); \({x}_{t}\) represents the input at time step t; \(b\) is a constant bias; \(h\) represents the RNN memory at time step t; and \(W\) and \(U\) are weight matrices.

The output gate uses a sigmoid layer to determine which part of the LSTM cell state contributes to the output. A nonlinear tanh function then maps the cell state to values between −1 and 1, and the result is multiplied by the output of the sigmoid layer. The output is determined by the following formulae:

$${o}_{t}=\sigma ({W}^{o}{x}_{t}+{U}^{o}{h}_{t-1}+{b}^{o}),$$
(5)
$${h}_{t}={o}_{t}tanh\left({c}_{t}\right),$$
(6)

where \({o}_{t}\) is the output gate and \({h}_{t}\) is a value between −1 and 1.

The LSTM cell state is kept current by combining these two layers: the previous value \({c}_{t-1}\) is first multiplied by the forget gate output, and the candidate value \({i}_{t}{u}_{t}\) is then added. Specifically, this process follows the equation:

$${c}_{t}={i}_{t}{u}_{t}+{f}_{t}{c}_{t-1},$$
(7)

where \({c}_{t}\) represents the memory cell and \({f}_{t}\) represents a value between 0 and 1 produced by the forget gate: a value of 0 denotes that the previous value is nullified, whereas a value of 1 indicates that it is retained (Van Houdt et al 2020). Figure 8 depicts a potential configuration including these components.

Fig. 8
figure 8

The long short-term memory (LSTM) model (Mohamed and Hussain 2016)

In the LSTM model, the input information is passed to the forget layer, at which point the model decides either to (a) keep past information and use it for prediction, or (b) forget it and rely on the instantaneous state; the information is then passed through a tanh function to normalize it, extract features and patterns, and remove noise (Reddy and Prasad 2018).
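For concreteness, a single LSTM time step following Eqs. (2) to (7) can be written out in NumPy as below; the weights are random placeholders and the layer size is arbitrary, purely to make the gate arithmetic explicit.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_in, d_hid = 1, 4                       # input size (one rainfall value) and cell size
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(d_hid, d_in)) for g in "iufo"}   # input weights
U = {g: rng.normal(size=(d_hid, d_hid)) for g in "iufo"}  # recurrent weights
b = {g: np.zeros(d_hid) for g in "iufo"}                  # biases

def lstm_step(x_t, h_prev, c_prev):
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # Eq. (2) input gate
    u_t = np.tanh(W["u"] @ x_t + U["u"] @ h_prev + b["u"])   # Eq. (3) candidate values
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # Eq. (4) forget gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # Eq. (5) output gate
    c_t = i_t * u_t + f_t * c_prev                           # Eq. (7) cell state update
    h_t = o_t * np.tanh(c_t)                                 # Eq. (6) hidden state
    return h_t, c_t

h, c = np.zeros(d_hid), np.zeros(d_hid)
h, c = lstm_step(np.array([0.3]), h, c)   # one step on a single scaled rainfall value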

Figure 9 shows the characteristics of the kernel used to run the LSTM model, which fits the model to the training data, along with the memory used to store information and the key features of the data used for forecasting.

Fig. 9
figure 9

Schematic and hyper parameters of the proposed LSTM model

C. Stacked long short-term memory (LSTMs):

Graves et al. (2013) first proposed this model after concluding that, for data modeling and pattern extraction, network depth matters more than the number of memory cells in a given layer. The stacked LSTM model contains several nested layers, each housing numerous memory cells. Instead of sending a single value to the LSTM layer below, a stacked LSTM sends a sequence: rather than one output time step for all input time steps, there is one output per input time step (Cui et al. 2020).
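In Keras, this "sequence out" behaviour corresponds to setting return_sequences=True on every LSTM layer except the last, as in the minimal sketch below; the layer sizes are illustrative, not the values shown in Fig. 11.

from tensorflow import keras
from tensorflow.keras import layers

window = 12
model = keras.Sequential([
    layers.Input(shape=(window, 1)),
    layers.LSTM(64, return_sequences=True),   # pass the full sequence to the next layer
    layers.LSTM(32),                          # top layer returns only the final state
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")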

Figure 10 shows the structure of a stacked LSTM; the mechanism is similar to the LSTM model, but with several layers \(f, i,o(\sigma )\), which allows for additional features to be extracted from the data.

Fig. 10
figure 10

A stacked LSTM architecture (Muftah et al. 2022)

Figure 11 shows the properties of the kernel used to run the stacked LSTM model and the size of the memory used to store information in preparation for producing predictions. As shown, the model is fitted to the training data in more than one layer, which allows highly complex information to be extracted (Dikshit et al. 2021).

Fig. 11
figure 11

Architecture and hyper parameters of the proposed stacked long short-term memory (LSTMs) model

D. Bidirectional long-short term memory model (BDLSTM):

The Bi-LSTM model integrates the strengths of two separate RNNs. With this setup, the network can exchange sequence-related information in both directions at each time interval (Fernández et al. 2007). The Bi-LSTM processes the input data in both directions, from past to future and from future to past. Running an LSTM backwards preserves information from the future, and combining the two hidden states at any time step means no knowledge from the past or the future is lost (Shahid et al. 2020). The expression for the output y at time t is:

$${y}_{t}=\sigma ({W}_{y}\left[{h}_{t}^{\to },{h}_{t}^{\leftarrow }\right]+{b}_{y}),$$
(8)

where \(\sigma \) is a nonlinear function; \({W}_{y}\) is the weight matrix used in the deep learning model; \({b}_{y}\) is a constant bias; and \({h}_{t}^{\to }\) and \({h}_{t}^{\leftarrow }\) are the forward and backward hidden states.

In the BDLSTM model, the hidden state \({h}_{t}\) receives information from the past and the future of \({x}_{t}\) and exploits these patterns in the prediction \({y}_{t}\) (see Fig. 12).

Fig. 12
figure 12

Bidirectional long-short term memory model (BDLSTM)

To forecast the future value of a variable \({y}_{t}\), a kernel (non-linear function) extracts features from a time series of inputs \({x}_{t}\), both past and future (see Fig. 13). Bidirectional recurrent neural networks allow full sequence information to be retrieved for all points before or after a given point in the sequence, which helps improve prediction accuracy in applications where both past and future data are informative (Zhang et al. 2022).
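A minimal Keras sketch of Eq. (8) is shown below: the Bidirectional wrapper runs one LSTM forward and one backward over the window and concatenates the two hidden states before the output layer. The unit counts are placeholders.

from tensorflow import keras
from tensorflow.keras import layers

window = 12
model = keras.Sequential([
    layers.Input(shape=(window, 1)),
    layers.Bidirectional(layers.LSTM(64)),  # concatenates forward and backward hidden states
    layers.Dense(1),                        # y_t = sigma(W_y [h_forward, h_backward] + b_y)
])
model.compile(optimizer="adam", loss="mse")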

Fig. 13
figure 13

Architecture and hyper parameters of the proposed bidirectional long-short term memory model BDLSTM

Bidirectional stacked long short-term memory model (BD-LSTM):

This model combines the features of the bidirectional and stacked LSTM models, allowing information about the sequence to be obtained both forward and backward at each time step (Fernández et al. 2007). Each layer passes multiple sequential values, instead of a single output value, to the LSTM layer above (Shahid et al. 2020).

E. Stacked bidirectional long short-term memory model (BDLSTMs)

The BDLSTMs model uses information from the past and future with multiple LSTM layers for processing (see Fig. 14).

Fig. 14
figure 14

Stacked bidirectional long short-term memory model (BDLSTMs)

The BDLSTMs model processes information in the same way as the BDLSTM model but with several stacked LSTM layers (see Fig. 15) (Biswas and Sinha 2021).

Fig. 15
figure 15

Architecture and hyper parameters of the proposed stacked bidirectional long short-term memory model (BDLSTMs)

F. Gated recurrent unit model (GRU):

The gated recurrent unit (GRU) is a simpler, often comparably effective alternative to the LSTM. It is likewise a recurrent neural network, but where the LSTM employs three gates, the GRU needs only two (a reset gate and an update gate) (Dey and Salem 2017). When deciding what data should be transmitted to the output, the update gate and reset gate act as vectors (Gulli and Pal 2017). The reset gate sets how much of the previous state should be retained, and the update gate decides how closely the new state should match the previous one. Two fully connected layers with sigmoid activation functions provide the two gate outputs. The GRU inputs, including those for the reset and update gates, are depicted in Fig. 16 (Wang et al. 2018). For a mathematical analysis of the output, we have:

$${r}_{t}=\sigma ({W}^{r}{x}_{t}+{U}^{r}{h}_{t-1}+{b}^{r}),$$
(9)
$${z}_{t}=\sigma ({W}^{z}{x}_{t}+{U}^{z}{h}_{t-1}+{b}^{z}),$$
(10)

where \({r}_{t}\) represents the reset gate, \({z}_{t}\) the update gate, \({h}_{t-1}\) the hidden state from the previous time step, \(\sigma \) the sigmoid activation function, \(W\) and \(U\) weight parameters, and \(b\) a constant bias. Next, the reset gate is combined with the standard update mechanism:

$${i}_{t}=\sigma ({W}^{i}{x}_{t}+{U}^{i}{h}_{t-1}+{b}^{i}),$$
(11)

which leads to the next candidate hidden state:

$${a}_{t}=tanh(w{x}_{t}+{r}_{t}{U}^{i}{h}_{t-1}+{b}^{h}),$$
(12)

where \({r}_{t}\) is the reset gate, \({h}_{t-1}\) is the hidden state from the preceding time step, \(w\) and \(U\) are weight parameters, tanh is the activation function, and \(b\) is a constant bias. Finally, the update gate's impact must be factored in: it balances how much of the previous hidden state is carried over against how much of the candidate state is adopted, by taking an element-wise convex combination of \({h}_{t-1}\) and the candidate \({a}_{t}\) (Seidu et al. 2022). The following equation represents the final GRU update:

$${h}_{t}={z}_{t}{h}_{t-1}+\left(1-{z}_{t}\right){a}_{t}$$
(13)

where \({z}_{t}\) is the update gate; \({r}_{t}\) is the reset gate; \({a}_{t}\) is the candidate activation; and \({h}_{t}\) is the hidden state output.

The input \({x}_{t}\) is sent to the update gate \({z}_{t}\) and the reset gate \({r}_{t}\), and then through the tanh activation function, where the feature information is extracted (see Fig. 16).

Fig. 16
figure 16

Gated recurrent unit (GRU) layer

The GRU processes information with fewer gates, making it a more compact counterpart of the previous models (see Fig. 17). The most prominent feature shared by the LSTM and GRU is the additive component of their update from t to t + 1, which the traditional recurrent unit lacks. The traditional recurrent unit always replaces the activation, that is, the content of a unit, with a new value computed from the current input and the previous hidden state. In contrast, both the LSTM unit and the GRU keep the existing content and add the new content on top of it (Chung et al. 2014). The two units nevertheless differ in several ways. One feature of the LSTM unit that is missing from the GRU is controlled exposure of the memory content: in the LSTM unit, the amount of memory content that is seen, or used by other units in the network, is controlled by the output gate, whereas the GRU exposes its full content without any control. Another difference is the location of the input gate, or the corresponding reset gate. The LSTM unit computes the new memory content without any separate control of the amount of information flowing from the previous time step; rather, it controls the amount of new memory content being added to the memory cell independently of the forget gate. The GRU, on the other hand, controls the information flow from the previous activation when computing the new candidate activation, but does not independently control the amount of the candidate activation being added (this control is tied to the update gate) (Gaudio et al. 2021).
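A single GRU time step following Eqs. (9), (10), (12), and (13) can be sketched in NumPy as follows; the weights are random placeholders and the layer size is arbitrary, for illustration only.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_in, d_hid = 1, 4
rng = np.random.default_rng(1)
W = {g: rng.normal(size=(d_hid, d_in)) for g in "rza"}   # input weights
U = {g: rng.normal(size=(d_hid, d_hid)) for g in "rza"}  # recurrent weights
b = {g: np.zeros(d_hid) for g in "rza"}                  # biases

def gru_step(x_t, h_prev):
    r_t = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])          # Eq. (9) reset gate
    z_t = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])          # Eq. (10) update gate
    a_t = np.tanh(W["a"] @ x_t + r_t * (U["a"] @ h_prev) + b["a"])  # Eq. (12) candidate state
    h_t = z_t * h_prev + (1.0 - z_t) * a_t                          # Eq. (13) final update
    return h_t

h = gru_step(np.array([0.3]), np.zeros(d_hid))   # one step on a single scaled rainfall value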

Fig. 17
figure 17

Architecture and hyper parameters of the proposed gated recurrent unit (GRU) layer

G. Convolutional neural network long-short term memory model (CNN-LSTM):

Combining Conv and LSTM layers yields the CNN-LSTM model, which takes as input a spatio-temporal matrix of the form \({x}_{t}^{s}\) (Mallah and Bagheri-Bodaghabadi 2022):

$${x}_{t}^{s}=\left[\begin{array}{c}{x}_{t-n}^{s}\\ {x}_{t-(n-1)}^{s}\\ \vdots \\ {x}_{t}^{s}\end{array}\right]=\left[\begin{array}{cccc}{f}_{t-n}^{1}& {f}_{t-(n-1)}^{1}& \dots & {f}_{t}^{1}\\ {f}_{t-n}^{2}& {f}_{t-(n-1)}^{2}& \cdots & {f}_{t}^{2}\\ \vdots & \vdots & \ddots & \vdots \\ {f}_{t-n}^{m}& {f}_{t-(n-1)}^{m}& \cdots & {f}_{t}^{m}\end{array}\right]$$
(14)

where \({x}_{t}^{s}={f}_{t}^{1}\dots {f}_{t}^{m}\) denotes the prediction region's series at time t, representing the historical values of the point of interest (POI) to be forecasted and those of its neighbors (Livieris et al. 2020).

In the CNN-LSTM model, the initial estimation stage is a CNN layer, and the final stage is an LSTM layer followed by a dense layer (see Fig. 18). The LSTM component processes each time unit and is responsible for interpreting the steps, while the CNN component extracts the relevant features. The CNN-LSTM architecture automatically captures hidden relationships and uses them for prediction, which may make the method more applicable and easier to implement (Zha et al. 2022).
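A hedged Keras sketch of this arrangement is shown below, with a Conv1D front end feeding an LSTM and a dense output layer; the filter counts and unit sizes are placeholders rather than the values used in the published code.

from tensorflow import keras
from tensorflow.keras import layers

window = 12
model = keras.Sequential([
    layers.Input(shape=(window, 1)),
    layers.Conv1D(64, kernel_size=3, activation="relu"),  # CNN: extract local features
    layers.MaxPooling1D(pool_size=2),
    layers.LSTM(64),                                      # LSTM: interpret the feature sequence
    layers.Dense(1),                                      # next-month rainfall
])
model.compile(optimizer="adam", loss="mse")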

Fig. 18
figure 18

The CNN-LSTM model is a combination of Conv and LSTM (Livieris et al. 2020)

H. Convolutional neural network bidirectional long-short term memory model (CNN-BDLSTM)

By using a CNN to capture characteristics and then feeding those features into a BDLSTM, this model exploits the capabilities of both. The outputs of each max pooling layer are combined to form the BDLSTM input, and the layer's three gates then perform a recursive filtering operation during backpropagation. The result of this stage is the input to the fully connected layer (Lu et al. 2021), which connects each input to a subset of the output (Nie et al. 2021; Casallas et al. 2022).

Figure 19 shows how the CNN-BDLSTM model works: the CNN component extracts the relevant information, while the BDLSTM component obtains forward and backward information about the sequence at each time step. The left panel shows the initial CNN layer, followed by the subsequent LSTM layers, with the dense layer at the end (right panel).
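A sketch of the proposed CNN-BDLSTMs pipeline in Keras is given below; the exact hyperparameters are those in the published code (Abotaleb 2022), so the values here are placeholders. Replacing the Bidirectional wrapper with a plain LSTM layer recovers the CNN-LSTM of the previous subsection.

from tensorflow import keras
from tensorflow.keras import layers

window = 12
model = keras.Sequential([
    layers.Input(shape=(window, 1)),
    layers.Conv1D(64, kernel_size=3, activation="relu"),   # CNN feature extraction
    layers.MaxPooling1D(pool_size=2),
    layers.Bidirectional(layers.LSTM(64)),                 # forward and backward pass over the features
    layers.Dense(1),                                        # next-month rainfall
])
model.compile(optimizer="adam", loss="mse")
model.summary()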

Fig. 19
figure 19

Mechanism of the CNN-BDLSTM model (Nie et al. 2021)

Optimization: Adam optimization algorithm

Adam optimization is an extension of stochastic gradient descent that allows for more effective updates to network weights. It arises from the interaction of two components, RMSprop and momentum (see the pseudocode below). Adaptive moment estimation is widely employed in stochastic optimization (Jais et al. 2019; Kim and Choi 2021). Since Adam adapts its step sizes over time, understanding how the pace of learning changes during training is crucial.

Adam is the stochastic optimization algorithm used in this work. The element-wise square \({g}_{t}^{2}\) denotes \({g}_{t}\odot {g}_{t}\). The default values are α = 0.001, \({\beta }_{1}\) = 0.9, \({\beta }_{2}\) = 0.999, and ϵ = \({10}^{-8}\). All vector operations are element-wise. \({\beta }_{1}^{t}\) and \({\beta }_{2}^{t}\) denote \({\beta }_{1}\) and \({\beta }_{2}\) raised to the power t (Kingma and Ba 2015).

Require: \(a\): Stepsize

Require:\(f(\theta )\): Stochastic objective function with parameters \(\theta \)

Require: \({\beta }_{1},{\beta }_{2}\in \left[{0,1}\right)\): Exponential decay rates for the moment estimates

Require:\({\theta }_{0}\): Initial parameter vector:

Initialize timestep:\(t\leftarrow 0\)

Initialize 2nd moment vector: \({v}_{0}\leftarrow 0\)

Initialize 1st moment vector: \({m}_{0}\leftarrow 0\)

while \(\theta \) not converged do

\(t\leftarrow t+1\) (Increment timestep)

\({g}_{t}\leftarrow {\nabla }_{\theta }{f}_{t}\left({\theta }_{t-1}\right)\) (Get gradients w.r.t. stochastic objective at timestep \(t\))

\({m}_{t}\leftarrow {\beta }_{1}\bullet {m}_{t-1}+(1-{\beta }_{1})\bullet {g}_{t}\) (Update biased first moment estimate)

\({v}_{t}\leftarrow {\beta }_{2}\bullet {v}_{t-1}+\left(1-{\beta }_{2}\right)\bullet {g}_{t}^{2}\) (Update biased second raw moment estimate)

\({\widehat{m}}_{t}\leftarrow {m}_{t}/(1-{\beta }_{1}^{t})\) (Compute bias-corrected first moment estimate)

\({\widehat{v}}_{t}\leftarrow {v}_{t}/(1-{\beta }_{2}^{t})\) (Compute bias-corrected second raw moment estimate)

\({\theta }_{t}\leftarrow {\theta }_{t-1}-a\bullet {\widehat{m}}_{t}/(\sqrt{{\widehat{v}}_{t}}+\epsilon )\) (Update parameters)

end while

return \({\theta }_{t}\) (Resulting parameters)

Adaptive moment estimation (Adam)

Pseudocode: Adam algorithm for stochastic optimization

Note:

There are two separate beta coefficients → one for each optimization component

We implement bias correction for each gradient

On iteration t:

Compute dW, db for the current mini-batch

# Momentum
v_db = beta1 * v_db + (1 - beta1) * db, v_db_corrected = v_db / (1 - beta1 ** t)
v_dW = beta1 * v_dW + (1 - beta1) * dW, v_dW_corrected = v_dW / (1 - beta1 ** t)

# RMSprop
s_dW = beta2 * s_dW + (1 - beta2) * (dW ** 2), s_dW_corrected = s_dW / (1 - beta2 ** t)
s_db = beta2 * s_db + (1 - beta2) * (db ** 2), s_db_corrected = s_db / (1 - beta2 ** t)

# Combine
W = W - alpha * (v_dW_corrected / (sqrt(s_dW_corrected) + epsilon))
b = b - alpha * (v_db_corrected / (sqrt(s_db_corrected) + epsilon))

Coefficients

alpha: the learning rate. Default 0.001

beta1: momentum weight. Default 0.9

beta2: RMSprop weight. Default 0.999

epsilon: divide-by-zero failsafe. Default 10 ** -8
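These defaults map directly onto the Keras Adam optimizer; a minimal configuration sketch, assuming the TensorFlow/Keras stack used for the models above, is:

from tensorflow import keras

adam = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
# model.compile(optimizer=adam, loss="mse")   # used when compiling any of the models above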

Overfitting and under fitting

Overfitting and underfitting are major contributing factors to poor performance in deep learning models. An overfitted model performs extremely well on the training set while fitting the test set poorly: it begins to fit the noise in the estimation data and parameters, producing predictions with large out-of-sample errors that harm the model's ability to generalize. An overfit model shows low bias and high variance (He et al. 2016). Underfitting refers to the model's inability to capture all of the data's characteristics and features, resulting in poor performance on the training data and an inability to generalize (Zhang et al. 2019).

To avoid and detect overfitting and underfitting, we validated the models by training on 80% of the data and testing on the remaining 20%, using the set of performance indicators (Alqahtani et al. 2022; Abotaleb and Makarovskikh 2021) detailed in the next section.
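One common way to act on the validation signal is early stopping; the paper does not state whether early stopping was used, so the Keras sketch below is illustrative only.

from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the loss on the validation split
    patience=10,                 # stop after 10 epochs without improvement
    restore_best_weights=True,   # roll back to the best weights seen
)
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=200, callbacks=[early_stop])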

Performance indicators:

To compare the prediction performance of the models, we:

Calculated mean square error (MSE):

$$\frac{\sum_{t=1}^{n}{(\hat{{y}_{t}}-{y}_{t})}^{2}}{n}$$
(15)

where \(\hat{{y}_{t}}\) is the forecast value; \({y}_{t}\) is the actual value; and \(n\) is the number of fitted observations.

Calculated root mean square error (RMSE) between the estimated data and actual data:

$$\sqrt{\frac{\sum_{t=1}^{n}{(\hat{{y}_{t}}-{y}_{t})}^{2}}{n}}$$
(16)

where \(\hat{{y}_{t}}\) is the predicted value; \({y}_{t}\) is the actual value; and \(n\) is the number of fitted observations.

Calculated relative root mean square error (RRMSE):

$$\sqrt{\frac{\frac{1}{n}\sum_{t=1}^{n}{(\hat{{y}_{t}}-{y}_{t})}^{2}}{\sum_{t=1}^{n}{\left(\hat{{y}_{t}}\right)}^{2}}}$$
(17)

Calculated mean absolute error (MAE):

$$\frac{1}{n}\sum_{t=1}^{n}\left|{y}_{t}-{\hat{y}}_{t}\right|$$
(18)

Calculated mean bias error (MBE):

$$\frac{\sum_{t=1}^{n}{(y}_{t}-{\hat{y}}_{t})}{n}$$
(19)

Calculated optimum loss error:

$$loss({y}_{t},{\hat{y}}_{t})=\frac{1}{n}\sum_{t=1}^{n}{\left|{y}_{t}-{\hat{y}}_{t}\right|}^{2}$$
(20)

The model with the lowest values of RMSE, RRMSE, MAE, MBE, and loss is the best.
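For reference, the indicators in Eqs. (15) to (20) can be computed directly in NumPy, assuming y_true and y_pred are arrays of actual and forecast rainfall on the test set:

import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_pred - y_true) ** 2)                                    # Eq. (15)

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))                                       # Eq. (16)

def rrmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2) / np.sum(y_pred ** 2))     # Eq. (17)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))                                   # Eq. (18)

def mbe(y_true, y_pred):
    return np.mean(y_true - y_pred)                                           # Eq. (19)

def optimum_loss(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred) ** 2)                              # Eq. (20)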

Results

Table 1 shows that mean > median > mode for all variables, indicating right-skewed distributions: a few large values pull the mean above the median, while most observations lie below the mean. Kurtosis < 3 for all variables, indicating no extreme outliers. The greatest range between maximum and minimum rainfall was noted in the Hawizeh Marsh (0 mm to 142.7 mm), which leads to a larger S.D. and S.E. (and hence greater difficulty in prediction) than for the other variables.

Table 1 Descriptive statistics of rainfall

Table 2 shows that CNN-BDLSTMs was the best model for predicting rainfall: it has the lowest values of MSE, RMSE, RRMSE, MAE, MBE, and optimum loss error and, therefore, the smallest difference between actual and predicted values. The model achieves close agreement between actual and predicted values on both the training and test data, demonstrating its ability to capture the data's features.

Table 2 Comparison of dataset evaluation methods (20%)

Figures 20, 21 and 22 show the convergence between actual monthly rainfall in the Hawizeh, Central, and Al Hammar marshes and the values predicted by the CNN-BDLSTMs model. The agreement between the actual and predicted data is good. The model captures volatility in rainfall and structural breaks, and can thus be used to predict monthly rainfall in this region.

Fig. 20
figure 20

The CNN-BDLSTMs model for forecasting monthly average rainfall in the Hawizeh Marsh

Fig. 21
figure 21

The CNN-BDLSTMs model for forecasting monthly average rainfall in Central Marsh

Fig. 22
figure 22

The CNN-BDLSTMs model for forecasting monthly average rainfall in the Al-Hammar Marsh

Conclusion

Climate change has impacted wetlands through increased annual average maximum temperatures and decreased rainfall. Because Google Earth Pro data has great potential for detecting changes that have already occurred, it can be used to monitor the climatic elements of marshes and water bodies. The Mesopotamian marshes are vital to Iraq's ecology and economy, so it is crucial to take measures to develop them and return them to their original state. We aim to continue our research in this field by developing a model for predicting monthly average rainfall that incorporates data on sea-surface temperature, global wind circulation, and a variety of other climatic variables. We described deep learning approaches for monthly average rainfall forecasting and proposed a hybrid deep learning CNN-BDLSTMs-based model for the Hawizeh, Central, and Al Hammar marshes. The dataset includes monthly average records of meteorological parameters such as maximum and minimum temperatures, precipitation, evaporation, and rainfall from Google Earth Pro for 1901 to 2020. Our tests showed that the proposed prediction model is accurate. Smart farming and other applications that require accurate rainfall forecasts might benefit from this model.