1 Introduction

Numerous coastal cities worldwide grapple with water-related issues arising from the interplay between natural and human factors. These issues range from water scarcity and ground subsidence to seawater intrusion into aquifers, coastal erosion, and flooding (Schuetze and Chelleri 2013). The world's 136 largest coastal cities are particularly vulnerable owing to a combination of factors, including climate change, rising sea levels, land subsidence, low-lying terrain, and ongoing urbanization (Du et al. 2015), together with their high-value assets and bustling economic activities (Meng et al. 2019). Among these issues, flood hazards are especially critical. Cities with high population density and economic significance are particularly vulnerable to flood disasters, which often stem from anthropogenic factors such as decreased urban permeability and rapid runoff, coupled with the effects of global warming and extreme weather events (Chakrabortty et al. 2023; Roy et al. 2020). Therefore, addressing the increasing frequency and intensity of such events is crucial to safeguarding cities from future threats (Chowdhuri et al. 2020; Imteaz and Hossain 2023; Ruidas et al. 2022, 2023).

Flood hazard simulations play a crucial role during intense rainfall, as they aid decision-makers in understanding such disasters. Traditionally, numerous studies have employed physically based numerical models to simulate the extent of flooding at different temporal and spatial scales during storm events. Such models offer several advantages: they accurately simulate flood depths from given initial and boundary conditions, and their outputs are grounded in physical and mathematical theory, lending them considerable reference value. However, they also come with drawbacks, requiring substantial computational time, extensive data storage, and careful management, particularly when the goal is flood forecasting rather than mere simulation. Several studies have indicated that flood prediction is crucial to enhancing the effectiveness of early warning systems (e.g., Plate 2007).

A significant advancement is the application of artificial intelligence (AI) to urban flood prediction. These methods treat the results of two-dimensional (2D) hydraulic simulations as a database for training and testing AI models. Compared with the computational resources a 2D hydraulic model demands, a trained AI model requires significantly fewer resources and can deliver results rapidly, making it well suited for real-time flood prediction in flood prevention systems (Hofmann and Schüttrumpf 2021). Accurate forecasts give both agencies and the public ample time to respond. A variety of machine learning and deep learning methods for predicting flood runoff and water levels have been proposed in recent years (Sit et al. 2020; Sun et al. 2020), including Support Vector Machines (Jhong et al. 2017), the Back Propagation Neural Network (BPNN) (Berkhahn et al. 2019; Chu et al. 2020), stacked autoencoders with Recurrent Neural Networks (RNN) (Kao et al. 2021), Long Short-Term Memory (LSTM) (Nearing et al. 2022), and the Convolutional Neural Network (CNN) (Guo et al. 2021; Hosseiny 2021).

Establishing the complex, non-linear interdependencies among hydrological variables has long been a major research focus in flood forecasting. For instance, using AI to model the non-linear relationships between input variables (such as precipitation, temperature, and evapotranspiration) and output variables such as river discharge or water level enables AI-based forecasting models to predict the future extent of flooding accurately. The BPNN is a supervised learning algorithm that refines network performance by adjusting the weights of the input, hidden, and output layers; it has been widely used in hydrological modeling, particularly for forecasting river discharges and water levels. LSTM, a variant of the RNN, was introduced by Hochreiter and Schmidhuber (1997) specifically to circumvent the vanishing gradient problem. It has found extensive use in water resources management, including rainfall-runoff simulation (Cui et al. 2021; Kratzert et al. 2018), probabilistic streamflow forecasting (Zhu et al. 2020), flood level forecasting (Dazzi et al. 2021), combined sewer overflow monitoring (Palmitessa et al. 2021), and flooding depth prediction (Yang et al. 2023). While LSTM is extensively utilized, its complexity is a significant drawback. The Gated Recurrent Unit (GRU), introduced by Cho et al. (2014), effectively simplifies the LSTM model: unlike LSTM, which has three gates in each module, GRU has only two, the reset and update gates. Numerous researchers have employed GRU for predicting river flow. The Bidirectional LSTM (BiLSTM), a derivative of the bidirectional Elman neural network (Graves and Schmidhuber 2005), establishes both forward and backward hidden layers; the input layer feeds into both, and they jointly compute the predicted value. Kang et al. (2020) utilized BiLSTM to predict urban wastewater flow.
However, despite the emphasis in the literature on applying the above AI methods to various hydrological forecasting problems, the accuracy of most flood prediction models remains limited. Another critical issue is how to further enhance the accuracy of flood prediction models based on the characteristics of flood hydrographs.

In addition to rainfall and flood depth variables, incorporating further input factors is crucial to enhancing the accuracy of flood prediction models, and investigating their impact is essential for improving model performance. In this study, a novel method, the Trend Forecasting Method (TFM), is proposed to improve the accuracy of forecasting models and solve time lag problems. The Annan District of Tainan City, Taiwan, was selected as the study area because its low-lying terrain often experiences flooding during typhoons and heavy rains. This study explored two issues: (1) comparing the accuracy of BPNN, LSTM, GRU, and BiLSTM in forecasting flood depth, and (2) assessing how much the proposed TFM improves forecasting accuracy. The proposed model can be used for urban flood forecasting. In this paper, the first chapter serves as an introduction. The second chapter outlines the methodology, explaining the algorithms and the proposed method. The third chapter details the study area, data used, and model development; the fourth, fifth, and concluding chapters present the results, discussion, and conclusions.

2 Methodology

2.1 Back Propagation Neural Network

A BPNN is an artificial neural network that employs a supervised learning algorithm called backpropagation for training the network (Najafabadipour et al. 2022). BPNN consists of an input layer, one or more hidden layers, and an output layer. Each layer comprises interconnected neurons (nodes or units) that process and transfer information. The network learns by adjusting the weights of the connections between neurons to minimize the error between the forecasted outputs and the actual target outputs. The net input (netj) is calculated for each node in the hidden layer using the formula:

$${net}_{j}=\sum {w}_{ij}\cdot {x}_{i}+{b}_{j}$$
(1)

where wij is the weight from the input node i to the hidden node j, xi is the input value, and bj is the bias for the hidden node j. An activation function σ is applied to netj to obtain the output (yj) of the hidden nodes:

$${{\text{y}}}_{j}=\sigma ({net}_{j})$$
(2)
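As an illustration, the hidden-layer computation of Eqs. (1)–(2) can be sketched in a few lines of NumPy (a minimal toy example with random weights; the sigmoid used here is only one possible choice of σ):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpnn_hidden(x, W, b):
    """Hidden-node outputs: net_j = sum_i w_ij * x_i + b_j (Eq. 1); y_j = sigma(net_j) (Eq. 2)."""
    net = W @ x + b
    return sigmoid(net)

# toy dimensions and random weights, purely for illustration
rng = np.random.default_rng(0)
x = np.array([0.5, -0.2, 0.1])        # three input values x_i
W = rng.standard_normal((2, 3))       # weights w_ij for two hidden nodes
b = np.zeros(2)                       # biases b_j
y = bpnn_hidden(x, W, b)
```

During training, backpropagation would adjust `W` and `b` to minimize the error between these outputs and the targets.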

2.2 Long Short-Term Memory

The LSTM model comprises a forget gate, an input gate, and an output gate, each serving a distinct function (Kao et al. 2020). LSTM networks are constructed from memory blocks, also known as cells; the cell state and hidden state are propagated to the subsequent cell. Initially, the current input at time t, denoted xt, and the output from the previous time step t-1, denoted ht-1, are fed into the activation function σ. This step determines which portion of the previous output should be discarded, a process called the forget gate (ft). The corresponding formula is as follows:

$${f}_{t}=\sigma \left({W}_{f}{x}_{t}+{U}_{f}{h}_{t-1}+{b}_{f}\right)$$
(3)

where σ is the activation function; Wf and Uf are weight matrices; and bf is the bias vector. After xt and ht-1 are introduced into the network, the input gate it employs an activation function to decide whether to disregard or incorporate new information. The candidate cell state \({\widetilde{C}}_{t}\), which signifies the content to be updated, is computed using a tanh function.

$${i}_{t}=\sigma \left({W}_{i}{x}_{t}+{U}_{i}{h}_{t-1}+{b}_{i}\right)$$
(4)
$${\widetilde{C}}_{t}=tanh\left({W}_{c}{x}_{t}+{U}_{c}{h}_{t-1}+{b}_{c}\right)$$
(5)

where Wi, Ui, Wc, and Uc are weight matrices; bi and bc are bias vectors. The previous cell state Ct-1 is multiplied by ft to determine the extent of memory retention from the previous step. This outcome is then added to the new memory, obtained by multiplying it by \({\widetilde{C}}_{t}\). The newly updated memory Ct is then output.

$${C}_{t}={f}_{t}\odot {C}_{t-1}+{i}_{t}\odot {\widetilde{C}}_{t}$$
(6)

where ⊙ denotes the Hadamard product. Upon introducing xt and ht-1 into the network, the activation function σ is employed to decide whether to output new information, a process known as the output gate ot. Ct is then passed through the tanh function and multiplied by ot to yield the output ht at time t.

$${o}_{t}=\sigma ({W}_{o}{x}_{t}+{U}_{o}{h}_{t-1}+{b}_{o})$$
(7)
$${h}_{t}={o}_{t}\odot tanh({C}_{t})$$
(8)

where Wo and Uo are weight matrices; bo is the bias vector.
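The gate computations of Eqs. (3)–(8) can be sketched as a single NumPy time step (an illustrative toy implementation with random weights, not the Keras models developed later in this study):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step implementing Eqs. (3)-(8)."""
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])        # forget gate, Eq. (3)
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])        # input gate, Eq. (4)
    c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])  # candidate state, Eq. (5)
    c = f * c_prev + i * c_tilde                                   # cell update, Eq. (6)
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])        # output gate, Eq. (7)
    h = o * np.tanh(c)                                             # hidden output, Eq. (8)
    return h, c

# random weights for a 4-input, 3-unit cell, purely for illustration
rng = np.random.default_rng(1)
n_in, n_hid = 4, 3
p = {k: rng.standard_normal((n_hid, n_in)) for k in ("Wf", "Wi", "Wc", "Wo")}
p.update({k: rng.standard_normal((n_hid, n_hid)) for k in ("Uf", "Ui", "Uc", "Uo")})
p.update({k: np.zeros(n_hid) for k in ("bf", "bi", "bc", "bo")})
h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid), p)
```

Note that `*` here is element-wise multiplication, matching the Hadamard products in Eqs. (6) and (8).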

2.3 Gated Recurrent Unit

GRU can also be regarded as a simple variant of LSTM (Xie et al. 2022). GRU has two gating layers: the update gate zt and the reset gate rt. The reset gate determines how much information to forget from the previous memory. The function of the update gate is similar to the combined forget and input gates of the LSTM unit: it determines how much information from the previous memory is passed to the future. The formulas for zt and rt are as follows:

$${z}_{t}=\sigma ({W}_{z}{x}_{t}+{U}_{z}{h}_{t-1}+{b}_{z})$$
(9)
$${r}_{t}=\sigma ({W}_{r}{x}_{t}+{U}_{r}{h}_{t-1}+{b}_{r})$$
(10)

where σ is the activation function; ht-1 is the output of the previous unit; Wz, Uz, Wr, and Ur are weight matrices; bz and br are bias vectors. The candidate hidden state (\({\widetilde{h}}_{t}\)) and hidden state (ht) at time t are defined by the following formulas:

$${\widetilde{h}}_{t}=tanh({W}_{h}{x}_{t}+{U}_{h}({r}_{t}\odot {h}_{t-1})+{b}_{h})$$
(11)
$${h}_{t}=\left(1-{z}_{t}\right)\odot {h}_{t-1}+{z}_{t}\odot {\widetilde{h}}_{t}$$
(12)

where Wh and Uh are weight matrices; bh is the bias vector.
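Analogously, Eqs. (9)–(12) reduce to the following NumPy sketch (again a toy single step with random weights, purely to make the gate algebra concrete):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step implementing Eqs. (9)-(12)."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])              # update gate, Eq. (9)
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])              # reset gate, Eq. (10)
    h_tilde = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev) + p["bh"])  # candidate, Eq. (11)
    return (1.0 - z) * h_prev + z * h_tilde                              # hidden state, Eq. (12)

# random weights for a 4-input, 3-unit cell, purely for illustration
rng = np.random.default_rng(1)
n_in, n_hid = 4, 3
p = {k: rng.standard_normal((n_hid, n_in)) for k in ("Wz", "Wr", "Wh")}
p.update({k: rng.standard_normal((n_hid, n_hid)) for k in ("Uz", "Ur", "Uh")})
p.update({k: np.zeros(n_hid) for k in ("bz", "br", "bh")})
h = gru_step(rng.standard_normal(n_in), np.zeros(n_hid), p)
```

Compared with the LSTM step, there is no separate cell state and no output gate; the hidden state itself serves as the output, which is what makes the GRU the lighter model.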

2.4 Bidirectional Long Short-Term Memory

BiLSTM constructs a forward and a backward hidden layer, linking the input layer to these forward and backward hidden layers, respectively (Wu et al. 2023). Subsequently, the forecasted value is computed collectively. Within the BiLSTM model, the output values of the hidden layer are as follows:

$${\overrightarrow{h}}^{(t)}={\overrightarrow{\sigma }}^{(t)}\odot tanh ({\overrightarrow{c}}^{(t)})$$
(13)
$${\overleftarrow{h}}^{(t)}={\overleftarrow{\sigma }}^{(t)}\odot tanh ({\overleftarrow{c}}^{(t)})$$
(14)

In the above, \({\overrightarrow{h}}^{(t)}\in {R}^{{p}_{1}\times 1}\) represents the output of the hidden layer calculated by the forward LSTM, while \({\overleftarrow{h}}^{(t)}\in {R}^{{p}_{2}\times 1}\) denotes the output of the hidden layer calculated by the backward LSTM. The weight matrices of the hidden layer outputs of the forward and backward LSTMs are designated \(V\in {R}^{k\times {p}_{1}}\) and \(\Lambda \in {R}^{k\times {p}_{2}}\), respectively, and the output bias is \({b}_{y}\in {R}^{k\times 1}\). The forecasted value of the BiLSTM model at time t can be expressed as follows:

$${\widehat{y}}^{(t)}={\sigma }_{y}\left({a}_{y}^{\left(t\right)}\right)={\sigma }_{y}(V{\overrightarrow{h}}^{(t)}+\Lambda {\overleftarrow{h}}^{(t)}+{b}_{y})$$
(15)

where \({\sigma }_{y}(\cdot)\) is generally set as the Softmax function.
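Assuming the forward and backward hidden outputs have already been computed, the output layer of Eq. (15) amounts to the following sketch (dimensions and weights are illustrative, not taken from the study):

```python
import numpy as np

def softmax(a):
    """Softmax output activation, the usual choice for sigma_y."""
    e = np.exp(a - a.max())
    return e / e.sum()

# assume the forward and backward LSTMs have already produced their hidden
# outputs h_fwd (p1 x 1) and h_bwd (p2 x 1) at time t; values are random here
rng = np.random.default_rng(2)
p1, p2, k = 3, 3, 2
h_fwd = rng.standard_normal(p1)
h_bwd = rng.standard_normal(p2)
V = rng.standard_normal((k, p1))    # weights on the forward hidden output
Lam = rng.standard_normal((k, p2))  # weights on the backward hidden output
b_y = np.zeros(k)

y_hat = softmax(V @ h_fwd + Lam @ h_bwd + b_y)  # Eq. (15)
```

The forecast thus blends context from both directions of the sequence before the output activation is applied.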

2.5 Trend Forecasting Method

The process flow of the TFM is illustrated in Fig. 1 and comprises four steps, detailed as follows:

Fig. 1

Flowchart for Trend Forecasting Method (TFM)

2.5.1 Select the Appropriate Factors Causing a Flood

First, the main factors contributing to flood depth are identified, which can be expressed by the following formula:

$${\widehat{D}}_{t+\Delta t}=f({X}_{1},{X}_{2},\dots ,{X}_{n})$$
(16)

where \({\widehat{D}}_{t+\Delta t}\) represents the forecasted flood depth at time t + Δt; Δt is the lead time; f denotes a machine learning method; and the terms X1, X2,…,Xn represent the input factors and their lag lengths. For an inundation forecasting model, selecting the input factors, i.e., determining the value of n in Eq. (16), is crucial. In the proposed method, n can be determined through optimization algorithms or by assessing the correlation between the input and output factors of the model.
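The correlation-based screening of candidate input factors might be sketched as follows (the rainfall and depth series here are synthetic placeholders, not the study data, and `lagged_corr` is an illustrative helper):

```python
import numpy as np

rng = np.random.default_rng(3)
# synthetic 10-min series standing in for observed rainfall and flood depth
rain = rng.gamma(2.0, 3.0, size=300)
depth = np.convolve(rain, np.ones(6) / 6.0, mode="same") + rng.normal(0.0, 0.5, 300)

def lagged_corr(rain, depth, lag, lead=1):
    """Pearson correlation between D_{t+lead} and R_{t-lag}."""
    x = rain[: len(rain) - lag - lead]
    y = depth[lag + lead :]
    return np.corrcoef(x, y)[0, 1]

# screen candidate lags; lags with high correlation are kept as input factors,
# which fixes n in Eq. (16)
corrs = [lagged_corr(rain, depth, lag) for lag in range(12)]
```

Lags whose correlation with the target remains high would be retained, with the final count confirmed by trial and error, as done later in this study.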

2.5.2 Select the Best Machine Learning Method

Flood depths are forecasted using the various machine learning methods, and the performance of each is assessed with evaluation indices, leading to the selection of the most suitable method, denoted f’. Commonly used evaluation metrics quantify the errors or correlations between model outputs and observed values to assess model quality. The more diverse the evaluation metrics used, the more representative the selected optimal machine learning method becomes.

2.5.3 Three Flooding Depth Models Based on the Best Machine Learning Method

(a) Typical Flooding Depth Model (Model f’)

Model f’ employs the original inputs and output, which can be represented by the following formula:

$${{\widehat{D}}_{t+\Delta t}}^{(1)}={f}^{\prime}({{X}_{1}, X}_{2},\dots ,{X}_{n})$$
(17)

where \({{\widehat{D}}_{t+\Delta t}}^{(1)}\) denotes the forecasted flood depth by Model f’.

(b) Recurrent Flooding Depth Model (Model f’-RD)

Several studies have indicated that incorporating forecast information from each time step as model input data can enhance the model’s accuracy (Jhong et al. 2017; Yang et al. 2019). Consequently, Model f’-RD can be represented by the following formula:

$${{\widehat{D}}_{t+\Delta t}}^{(2)}={f}^{\prime}({\widehat{D}}_{t+\Delta t-1},\dots {\widehat{D}}_{t+1}{{,X}_{1}, X}_{2},\dots ,{X}_{n})$$
(18)

where \({{\widehat{D}}_{t+\Delta t}}^{(2)}\) denotes the forecasted flood depth by Model f’-RD.

(c) Delta Flooding Depth Model (Model f’-ΔD)

Rainfall variations cause the flood depth to rise or fall. Consequently, the model can establish the relationship between rainfall and the change in flood depth \({\Delta \widehat{D}}_{t+\Delta t}\) at t + Δt, where \({\Delta \widehat{D}}_{t+\Delta t}\) represents the change in flood depth between time steps t + Δt-1 and t + Δt. Model f’-ΔD can be expressed as follows:

$${\Delta \widehat{D}}_{t+\Delta t}={f}^{\prime}({{X}_{1}, X}_{2},\dots ,{X}_{n})$$
(19)

The forecasted depth \({{\widehat{D}}_{t+\Delta t}}^{(3)}\) is obtained by accumulating the forecasted changes in flood depth at subsequent time steps onto the current water depth Dt. The formula is as follows:

$${{\widehat{D}}_{t+\Delta t}}^{(3)}={D}_{t}+\sum\limits_{i=1}^{\Delta t}{\Delta \widehat{D}}_{t+i}$$
(20)

where \({{\widehat{D}}_{t+\Delta t}}^{(3)}\) denotes the forecasted flood depth by Model f’-ΔD.

(d) Select the Forecast Value According to the Trend

Three different forecasted flood depths are thus obtained: \({{\widehat{D}}_{t+\Delta t}}^{(1)}\), \({{\widehat{D}}_{t+\Delta t}}^{(2)}\), and \({{\widehat{D}}_{t+\Delta t}}^{(3)}\). The concept of trend forecasting (step 4 in Fig. 1) is then introduced. If the flood depth at the current moment is rising, it is likely to continue rising at the next moment; consequently, the maximum of the three values is selected. Conversely, if the flood depth is currently falling, it is likely to continue falling, so the minimum of the three is selected. This can be represented by the following equation:

$$\left\{\begin{array}{ll}{\widehat{D}}_{t+\Delta t}={\text{max}}\left({{\widehat{D}}_{t+\Delta t}}^{\left(1\right)}, {{\widehat{D}}_{t+\Delta t}}^{\left(2\right)},{{\widehat{D}}_{t+\Delta t}}^{\left(3\right)}\right), & \text{if } {D}_{t}-{D}_{t-1}\ge 0\\ {\widehat{D}}_{t+\Delta t}={\text{min}}\left({{\widehat{D}}_{t+\Delta t}}^{\left(1\right)}, {{\widehat{D}}_{t+\Delta t}}^{\left(2\right)},{{\widehat{D}}_{t+\Delta t}}^{\left(3\right)}\right), & \text{if } {D}_{t}-{D}_{t-1}<0\end{array}\right.$$
(21)
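The selection rule of Eq. (21) reduces to a few lines of code (a minimal sketch; the depth values in millimetres are illustrative):

```python
def tfm_select(d_t, d_prev, candidates):
    """Eq. (21): on a rising limb (D_t - D_{t-1} >= 0) take the maximum of the
    three candidate forecasts; on a falling limb take the minimum."""
    if d_t - d_prev >= 0:
        return max(candidates)
    return min(candidates)

# rising limb: depth climbed from 80 mm to 95 mm, so take the largest forecast
rising = tfm_select(95.0, 80.0, (110.0, 105.0, 120.0))
# falling limb: depth dropped from 95 mm to 70 mm, so take the smallest
falling = tfm_select(70.0, 95.0, (60.0, 55.0, 65.0))
```

The rule deliberately biases the forecast in the direction of the observed trend, which is what counteracts the time lag of the individual models.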

3 Materials

3.1 Study Area

The research area is located on one of the local traffic arteries in Annan District, Tainan City, Taiwan (23°02′38.6″N, 120°11′35.9″E), as shown in Fig. 2a. The site is centrally located and densely populated with buildings. Although the area lies upstream in the storm sewer system, its low-lying terrain makes it difficult for surface runoff to drain into the stormwater sewers. Flooding often occurs during typhoons and heavy rains, interrupting traffic and inundating buildings.

Fig. 2

a Location of the study area in Annan District, Tainan City, Taiwan b Observed flooding events for model training and testing

3.2 Observed Flooding Depths

The rainfall data in this study were sourced from a rainfall station established by the Tainan City Government, with data recorded at ten-minute intervals. The flood depth data, also recorded at ten-minute intervals, were derived from a Flooding Depth Gauge (FDG) (Model Anasystem SenSmart WLS) installed by the Water Resources Planning Branch, Water Resources Agency. The FDG measures inundation depth via radio frequency admittance, with an accuracy of 0.5% of the sensor length (typically 1.5 to 2.0 m). The observed data are transmitted to a cloud server every 30 s using Long Range (LoRa) technology. The FDG was installed in 2016 and has accumulated 781 records across six rainfall events. Three of these events were caused by typhoons and tropical depressions, and the other three by heavy rains. Five events (592 records) were used for training, and one event (189 records) for testing, as depicted in Fig. 2b. Event 6 was chosen as the test event because it had the second-highest maximum flooding depth.

3.3 Model Development

Figure 2b demonstrates a strong correlation between rainfall and flooding depth. Consequently, this study used \({{D}_{t}, R}_{t},{R}_{t-1},\dots ,{R}_{t-({L}_{R}-1)}\) as X in Eq. (16). Here, Dt represents the real-time observed flood depth, Rt is the real-time observed rainfall, and \({R}_{t-1},\dots ,{R}_{t-\left({L}_{R}-1\right)}\) are the observed antecedent rainfalls. LR is the lag length of rainfall (in 10-min steps). The correlation coefficient between flood depth and rainfall was calculated for different LR values to identify candidates, and trial and error was then employed to determine the optimum; consequently, LR was set to 6. Moreover, the water depth was forecasted for the next 10, 20, 30, 40, 50, and 60 min; hence Δt ranged from 1 to 6. The initial experiment evaluated the flood depth forecasting ability of the BPNN, LSTM, GRU, and BiLSTM models.

This study employed four AI models, specifically BPNN, LSTM, GRU, and BiLSTM, to evaluate flood depth forecasting. The general forms of the four models are presented as follows:

$$\begin{array}{ll}\mathrm{Model\;BPNN} & {\widehat{D}}_{t+\Delta t}={f}_{BPNN}({{D}_{t}, R}_{t},{R}_{t-1},\dots ,{R}_{t-({L}_{R}-1)})\\\mathrm{Model\;LSTM} & {\widehat{D}}_{t+\Delta t}={f}_{LSTM}({{D}_{t}, R}_{t},{R}_{t-1},\dots ,{R}_{t-({L}_{R}-1)})\\\mathrm{Model\;GRU} & {\widehat{D}}_{t+\Delta t}={f}_{GRU}({{D}_{t}, R}_{t},{R}_{t-1},\dots ,{R}_{t-({L}_{R}-1)})\\\mathrm{Model\;BiLSTM} & {\widehat{D}}_{t+\Delta t}={f}_{BiLSTM}({{D}_{t}, R}_{t},{R}_{t-1},\dots ,{R}_{t-({L}_{R}-1)})\end{array}$$
(22)
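Assembling the input vectors of Eq. (22) from the 10-min series amounts to a sliding window, sketched below (the series are placeholders and `build_samples` is an illustrative helper, not code from the study):

```python
import numpy as np

def build_samples(depth, rain, L_R=6, lead=1):
    """Build inputs [D_t, R_t, R_{t-1}, ..., R_{t-(L_R-1)}] with target D_{t+lead}."""
    X, y = [], []
    for t in range(L_R - 1, len(depth) - lead):
        window = rain[t - L_R + 1 : t + 1][::-1]       # R_t first, oldest rainfall last
        X.append(np.concatenate(([depth[t]], window)))  # prepend current depth D_t
        y.append(depth[t + lead])
    return np.array(X), np.array(y)

# placeholder series; real inputs would be the 10-min gauge records
depth = np.arange(20, dtype=float)
rain = np.arange(20, dtype=float)
X, y = build_samples(depth, rain)  # each sample has 1 + L_R = 7 features
```

For the longer lead times, `lead` would be set to 2 through 6, yielding one training set per forecast horizon.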

This study computed the evaluation indicators for the four models, with GRU demonstrating superior performance (refer to Sect. 4.1 for details). Based on Eqs. (18) and (19), Models GRU-RD and GRU-ΔD were respectively developed. The general forms of Models GRU-RD and GRU-ΔD are presented as follows:

$$\mathrm{Model\;GRU}-{\text{RD}}\quad {\widehat{D}}_{t+\Delta t}={f}_{GRU}({{\widehat{D}}_{t+\Delta t-1},\dots {\widehat{D}}_{t+1},{D}_{t},R}_{t},{R}_{t-1},\dots ,{R}_{t-({L}_{RD}-1)})$$
(23)
$$\mathrm{Model\;GRU}-\mathrm{\Delta D}\quad {\Delta \widehat{D}}_{t+\Delta t}={f}_{GRU}({{D}_{t},R}_{t},{R}_{t-1},\dots ,{R}_{t-({L}_{RD}-1)})$$
(24)

Because the actual flooding depth must be greater than or equal to zero, any negative forecasted flood depth in this study was set to 0. The Maximum Forecasting Method (MFM) and Average Forecasting Method (AFM) were adopted as baselines to assess the improvement provided by the proposed TFM. In the MFM, the maximum value among \({{\widehat{D}}_{t+\Delta t}}^{\left(1\right)}\), \({{\widehat{D}}_{t+\Delta t}}^{\left(2\right)}\), and \({{\widehat{D}}_{t+\Delta t}}^{\left(3\right)}\) is used as the forecast value, as represented by the following equation:

$${\widehat{D}}_{t+\Delta t}=max({{\widehat{D}}_{t+\Delta t}}^{\left(1\right)}, {{\widehat{D}}_{t+\Delta t}}^{\left(2\right)},{{\widehat{D}}_{t+\Delta t}}^{\left(3\right)})$$
(25)

The AFM uses the average of the three forecasted values as the final value, as represented by the following equation:

$${\widehat{D}}_{t+\Delta t}=\frac{{{\widehat{D}}_{t+\Delta t}}^{\left(1\right)}+ {{\widehat{D}}_{t+\Delta t}}^{\left(2\right)}+{{\widehat{D}}_{t+\Delta t}}^{\left(3\right)}}{3}$$
(26)
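The two baselines, together with the rule that negative forecasts are set to zero, can be sketched as follows (the candidate values are illustrative, and applying the clamping to each candidate before combining is an assumption, as the text does not specify the stage at which it occurs):

```python
import numpy as np

# three hypothetical candidate forecasts (mm) from Models f', f'-RD, and f'-dD;
# a negative depth is physically impossible, so it is clamped to 0 first
cands = np.maximum([120.0, 135.0, -4.0], 0.0)

mfm = cands.max()   # Maximum Forecasting Method, Eq. (25)
afm = cands.mean()  # Average Forecasting Method, Eq. (26)
```

Unlike the TFM, neither baseline consults the observed trend; MFM always takes the largest candidate and AFM always takes the mean.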

If the warning authorities lean toward safety and conservatism, they will tend to adopt the maximum forecasted values, which motivates the MFM. The AFM, in turn, is justified by the prevalence of the mean as a statistical summary. Finally, the model outputs selected by the TFM, MFM, and AFM are compared.

The models were developed using Python 3.8 and the Keras library. Data were normalized with a max–min scaler. Hyperparameters were optimized through trial and error, i.e., the models were run repeatedly with different settings. All models were assigned the same hyperparameters to facilitate comparison of structural differences between them: each model had three hidden layers of 20 neurons, a batch size of 10, and a dropout rate of 0.2. The loss function was Mean Squared Error (MSE), the activation function was tanh, the optimizer was Adam, and the number of epochs was set to 120.
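The max–min normalization step can be sketched as follows (a minimal illustration with made-up depths; the key point is that the bounds fitted on the training data must be reused for the test data):

```python
import numpy as np

def minmax_scale(x):
    """Fit a max-min scaler on x, returning scaled values and the fitted bounds."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo), lo, hi

train = np.array([0.0, 50.0, 150.0, 200.0])  # illustrative training depths (mm)
train_scaled, lo, hi = minmax_scale(train)

# test data must be scaled with the bounds fitted on the training set
test = np.array([100.0])
test_scaled = (test - lo) / (hi - lo)
```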

3.4 Evaluation of Model Performance

In this study, the Root Mean Square Error (RMSE), coefficient of determination (R2), and Nash–Sutcliffe model efficiency coefficient (NSE) were used to evaluate the overall forecasting performance of the models, while EDp and ETp assessed performance at the peak. RMSE represents the error between forecasted and observed values; the closer the RMSE is to 0, the more accurate the model. R2 assesses the linear correlation between the model output and the target and ranges from 0 to 1; the closer to 1, the more accurate the model. NSE, commonly used to evaluate hydrological forecasting models, ranges from negative infinity to 1; values closer to 1 indicate more accurate forecasts, and models with an NSE below 0 perform worse than simply predicting the observed mean. EDp evaluates the error between the forecasted and observed peaks, while ETp evaluates the error between the forecasted and observed peak occurrence times; the smaller the absolute values of EDp and ETp, the better the model's performance. The evaluation indices are calculated using the following formulas:

$${\text{RMSE}}={\left[\frac{1}{n}\sum_{i=1}^{n}{\left[{D}_{for}({t}_{i})-{D}_{obs}({t}_{i})\right]}^{2}\right]}^{1/2}$$
(27)
$${R}^{2}={\left[\frac{\sum_{i=1}^{n}\left({D}_{for}\left({t}_{i}\right)-{\overline{D}}_{for}\right)\left({D}_{obs}\left({t}_{i}\right)-{\overline{D}}_{obs}\right)}{\sqrt{\sum_{i=1}^{n}{\left({D}_{for}\left({t}_{i}\right)-{\overline{D}}_{for}\right)}^{2}\sum_{i=1}^{n}{\left({D}_{obs}\left({t}_{i}\right)-{\overline{D}}_{obs}\right)}^{2}}}\right]}^{2}$$
(28)
$${\text{NSE}}=1-\frac{\sum_{i=1}^{n}{\left[{D}_{for}\left({t}_{i}\right)-{D}_{obs}\left({t}_{i}\right)\right]}^{2}}{\sum_{i=1}^{n}{\left[{D}_{obs}\left({t}_{i}\right)-{\overline{D}}_{obs}\right]}^{2}}$$
(29)
$${ED}_{p}=\frac{{D}_{p,for}-{D}_{p,obs}}{{D}_{p,obs}}\times 100\%$$
(30)
$${ET}_{p}={T}_{p,for}-{T}_{p,obs}$$
(31)

where \({D}_{for}\left({t}_{i}\right)\) and \({D}_{obs}\left({t}_{i}\right)\) represent the i-th forecasted and observed values, respectively; \({\overline{D}}_{for}\) and \({\overline{D}}_{obs}\) are the means of the forecasted and observed values, respectively; \({D}_{p,for}\) and \({D}_{p,obs}\) represent the forecasted and observed peak values, respectively; and \({T}_{p,for}\) and \({T}_{p,obs}\) denote the occurrence times of the forecasted and observed peaks, respectively.
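The five indices of Eqs. (27)–(31) can be computed directly, as in the following NumPy sketch (the hydrograph is synthetic, and the 10-min step converts the peak-time index difference of Eq. (31) into minutes):

```python
import numpy as np

def metrics(obs, fcst, step_minutes=10):
    """RMSE, R2, NSE (Eqs. 27-29) plus peak errors EDp and ETp (Eqs. 30-31)."""
    err = fcst - obs
    rmse = np.sqrt(np.mean(err ** 2))                                  # Eq. (27)
    r2 = np.corrcoef(fcst, obs)[0, 1] ** 2                             # Eq. (28)
    nse = 1.0 - np.sum(err ** 2) / np.sum((obs - obs.mean()) ** 2)     # Eq. (29)
    edp = (fcst.max() - obs.max()) / obs.max() * 100.0                 # Eq. (30), percent
    etp = (int(np.argmax(fcst)) - int(np.argmax(obs))) * step_minutes  # Eq. (31), minutes
    return rmse, r2, nse, edp, etp

# synthetic hydrograph (mm) purely for illustration
obs = np.array([0.0, 20.0, 80.0, 150.0, 100.0, 40.0])
fcst = np.array([0.0, 15.0, 70.0, 140.0, 110.0, 45.0])
rmse, r2, nse, edp, etp = metrics(obs, fcst)
```

A negative `edp` indicates an underestimated peak, and a positive `etp` indicates a peak forecast that arrives late, matching the sign conventions used in the discussion below.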

4 Results and Discussion

4.1 Comparative Analysis of the Flood Forecasting Models Based on the BPNN, LSTM, GRU, and BiLSTM

Table 1 details the training and testing outcomes of BPNN, LSTM, GRU, and BiLSTM in flood depth forecasting. Figure 3a–c compare these models during the testing phase. As the lead time extended, all models' RMSE increased and their R2 and NSE decreased, reducing forecasting precision. This trend corroborates the findings of Yang et al. (2023), who forecasted flood depth in Rende District, Tainan City, Taiwan, using BPNN, RNN, and LSTM.

Table 1 Evaluation indexes of the models
Fig. 3

a RMSE of BPNN, LSTM, GRU, and BiLSTM; b R2 of BPNN, LSTM, GRU, and BiLSTM; c NSE of BPNN, LSTM, GRU, and BiLSTM; d EDp of BPNN, LSTM, GRU, and BiLSTM; e ETp of BPNN, LSTM, GRU, and BiLSTM

BPNN exhibited notably higher RMSE and significantly lower R2 and NSE than LSTM. In Fig. 4c–f, BPNN's black points lie farther from the observations than LSTM's blue points at water depths below 150 mm, with the two becoming comparable beyond this threshold; below 150 mm in particular, BPNN's forecasts were substantially inferior to LSTM's. This mirrors the similar findings of Yang et al. (2023) on BPNN versus LSTM. The fluctuation in flood depth corresponds to the hydrological process in which initial rainfall runoff is conveyed by ditches and sewers, so no flooding occurs at first; once runoff exceeds their capacity, flooding ensues, and as runoff decreases after the rainfall, the flooding recedes. LSTM, adept at capturing such time series and their long-term dependencies, consistently surpasses BPNN, especially at longer forecast lead times.

Fig. 4

Scatter diagrams of BPNN, LSTM, GRU, and BiLSTM for a T + 1, b T + 2, c T + 3, d T + 4, e T + 5, and f T + 6 forecasting

Figures 3a–c and 4a–f show that GRU, with slightly lower RMSE and marginally higher R2 and NSE than LSTM, forecasted flood depths more accurately. While LSTM units have three gates (input, forget, output) (Ding et al. 2020), GRU units have only two (update and reset), lacking an output gate and using the hidden state as output (Kao et al. 2020). This makes GRUs simpler and potentially less prone to overfitting. Although the choice between LSTM and GRU generally depends on the task and data, our experiments suggest a slight superiority of GRU for flood depth forecasting.

Figures 3a–c and 4a–f show BiLSTM’s slightly higher RMSE and lower R2 and NSE than LSTM, indicating LSTM’s marginally better forecast accuracy. While both models are tailored for sequential data, BiLSTM, which processes data forward and backward, offers enhanced context understanding, which is beneficial for tasks like language modeling (Vatanchi et al. 2023). However, for applications like flood depth forecasting, where future context is less critical and past rainfall predominantly influences changes, our experiments reveal that LSTM outperforms BiLSTM.

Figure 3d–e show EDp and ETp for the four models; most EDp values are negative, indicating underestimated flood peaks. BPNN's average EDp was -3.11%, slightly outperforming LSTM (-9.32%), GRU (-9.89%), and BiLSTM (-8.36%) in peak prediction. ETp for the models ranged from 10 to 60 min across the six lead times, revealing time lags. The inputs included Dt and past rainfall, with Dt having the greater influence. Balancing recent and historical data is essential to avoid overlooking long-term trends while overreacting to short-term noise, which causes forecast lags. While real-time data are critical, incorporating broader historical context is necessary to capture patterns over time; a model overly focused on recent data may miss future trends (Bollerslev et al. 1994; Box et al. 2015; Hyndman and Athanasopoulos 2018; Zhang et al. 1998).

4.2 Improvement of Model Accuracy Due to the Use of GRU, GRU-ΔD, and GRU-RD

Table 1 compares flood depth forecasts for GRU, GRU-RD, and GRU-ΔD. GRU-ΔD had slightly higher RMSE but R2 and NSE values similar to GRU, indicating comparable accuracy. GRU-ΔD's average EDp of -6.0% surpassed GRU's -9.2%, making it more effective in peak depth prediction, though both had similar time lags. Meanwhile, GRU-RD showed lower RMSE and higher R2 and NSE than GRU, enhancing overall accuracy. Although its EDp of -12.6% suggests slightly reduced peak forecasting precision compared to GRU's -9.2%, GRU-RD's smaller ETp indicates a reduced time lag. GRU-RD's feedback loops, which pass previous outputs to the next step as inputs, account for this reduction. These results are consistent with findings by Nanda et al. (2016) and Yang et al. (2019).

In time series forecasting, autoregressive modeling, which uses previous outputs as inputs, effectively reduces time lag by capturing the time-dependent structure. This responsiveness is crucial for non-stationary series where statistics vary over time, enhancing forecast accuracy by integrating recent information and minimizing error. It also captures non-linear relationships in complex dynamics (Brockwell and Davis 2016; Chatfield and Xing 2019; Hyndman and Athanasopoulos 2018; Zhang et al. 1998).

4.3 Improvement of Model Accuracy Due to the Use of the Proposed TFM

Figure 5a–c show GRU's RMSE as notably higher, and its R2 and NSE as significantly lower, than those of AFM, MFM, and TFM. GRU's forecasts, as Fig. 6d–f depict, notably underestimated the actual values, a shortcoming mitigated by applying AFM, MFM, and TFM. TFM yielded smaller RMSE and larger R2 and NSE than AFM, indicating superior performance, as shown in Figs. 5a–c and 6b–f. During the test event, the flood depth continued its rising or falling trend in 49 time steps and reversed it in 23, i.e., a 68% probability of trend persistence, supporting the TFM hypothesis. While AFM and TFM showed similar forecast biases (EDp), AFM exhibited a time lag in peak forecasts, unlike TFM. Consequently, TFM more accurately predicts flood peaks without noticeable time lag.

Fig. 5
figure 5

a RMSE, b R2, c NSE, d EDp, and e ETp of GRU, AFM, MFM, and TFM

Fig. 6
figure 6

Scatter diagrams of GRU, AFM, MFM, and TFM for a T + 1, b T + 2, c T + 3, d T + 4, e T + 5, and f T + 6 forecasting

Figure 5a–c show that TFM outperformed MFM in early forecasts (T + 1 to T + 4) with lower RMSE and higher R2 and NSE, while MFM excelled in later forecasts (T + 5 and T + 6). Figure 6b–f depict MFM’s significant overestimation, which results from its selecting the maximum value among the various forecasts. Despite MFM’s smaller EDp indicating better peak predictions, it exhibited time lags, unlike TFM, which predicted flood peaks without noticeable delays (Fig. 5d–e). Overall, TFM was the most accurate for flood depth, followed by MFM and AFM. While MFM led in peak forecasting, its time lags were a drawback, whereas TFM, though slightly less accurate at the peak, predicted flood peaks without significant time lag.
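The trend-based selection rule underlying TFM can be illustrated as follows. This is a minimal sketch, assuming several candidate forecasts per step (e.g., from the different model branches) and using the sign of the most recent observed change as a proxy for whether the hydrograph is on its rising or falling limb; the limb-detection criterion here is an assumption for illustration:

```python
def tfm_select(prev_depth, curr_depth, candidates):
    """Pick the final forecast from candidate model outputs:
    on a rising limb take the maximum candidate, on a falling
    limb take the minimum, exploiting the tendency of flood
    depth to keep rising or falling between consecutive steps."""
    rising = curr_depth >= prev_depth
    return max(candidates) if rising else min(candidates)
```

For example, when the depth has just risen from 0.4 m to 0.6 m, the rule returns the largest candidate forecast; when it has just fallen, the smallest. This biases the final output toward the continuing trend, which the 68% continuation probability observed in the test event supports.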

4.4 Limitations of the Work and Future Research

This study relied on limited observational data: six rainfall events comprising 781 records. Enhancements could include using additional deep learning models to generate supplementary data. Future work based on the GRU model might explore more complex architectures and extend the current 60-min lead time to longer forecasting horizons.

4.5 Impact and Usefulness of Work Concerning Water Resources Management

The proposed TFM is versatile and applicable to various hydrological forecasts, such as streamflow, flood stage, and sewer water depth, and it is especially effective in predicting continuous hydrograph trends. By integrating TFM, multimodal forecasting becomes more accurate. Furthermore, integrating this model into Tainan City’s flood response system could preemptively warn residents and authorities, triggering preventive actions such as installing barriers, closing roads, and deploying water pumps, thereby enhancing flood management.

5 Conclusions

This study aimed to improve model accuracy and overcome the time lag problem, and the proposed Trend Forecasting Method (TFM) achieved this purpose. First, the appropriate input factors driving flood events and the most suitable AI algorithm were determined to construct the forecasting models. Second, two forecasting models were developed: one using the multi-step-ahead approach and one using the variation in flood depth as input. These were used to investigate the relationships between the input and output variables when predicting flood depths at different lead times, and to assess whether using the variation between the current and previous time step improves flood prediction. Then, depending on whether the flood hydrograph was on its rising or falling limb, the maximum or minimum value predicted by the models above was chosen as the final output, respectively.

The benefits of the proposed method were showcased through its implementation in the Annan District of Tainan City, Taiwan. The research results indicated that the GRU exhibited the highest accuracy among all examined models, followed by LSTM, BiLSTM, and BPNN, in that order. Although all models exhibited time delay issues, GRU showed the best overall empirical performance, while BPNN excelled in peak forecasting. Building on the GRU-based forecasting model, the proposed method enhanced the prediction accuracy of the original GRU model and mitigated the time lag commonly encountered in time series forecasting. Moreover, it enabled a more accurate prediction of flood peaks. Based on the findings of this study, the proposed method is applicable to various hydrologically related time series forecasting domains, particularly those requiring improved accuracy in predictive models for time series or long-term dependencies. Additionally, AI-related models demand substantial amounts of high-quality historical observational data for training and testing. Due to limitations in the available observational data, future research could involve updating datasets with more observations and exploring the integration of new models to refine the proposed method, thereby enhancing predictive accuracy and extending the lead time of predictions.