Introduction

Extreme weather and climatic events, particularly flooding, have had a huge impact on people's lives, property and social development (Jamshed et al. 2021). In the twentieth century, the number of deaths caused by catastrophic floods ranged from 100,000 to 1.4 million, according to published national statistics (Hajat et al. 2003). Recent studies have reported that floods, one of the natural disasters caused by extreme weather and climate events, are becoming more frequent and intense (Hirabayashi et al. 2013; IPCC 2014). It is estimated that from 2000 to 2020, flood events caused economic losses of more than $537 billion globally and affected the normal life of 1.6 billion people (EM-DAT 2020). Increasing impervious cover in urban areas drives dramatic changes in rainfall infiltration and storage capacity (Mu et al. 2020), so that urban floods occur suddenly and frequently (Ward 1978), posing severe challenges to urban flood control and drainage. Cities concentrate large populations and account for a dominant share of overall economic output, so urban floods affect a large number of people worldwide, causing human fatalities and significant damage (Rahmati et al. 2020). For example, the Louisiana floods in 2016, the floods in Shouguang and Zhengzhou in China in 2018, and the floods in Iran on March 25, 2019 caused considerable economic losses and casualties, and such heavy rains and floods have become prominent bottlenecks affecting the healthy development of cities (Yazdi et al. 2019). The main reason for urban flooding is that urbanization increases the hardened area, reduces infiltration, increases runoff and triggers higher and faster peak flows (Loperfideo et al. 2014; Ferreira et al. 2016). These changes have a considerable impact on the hydrological process when rainfall occurs: runoff is generated rapidly and in large volumes and, coupled with the failure of storm drainage systems (GebreEgziabher and Demissie 2020), this results in a higher probability of urban flood occurrence and a higher recurrence rate (Braud et al. 2013; Miller et al. 2014; Jongman 2018).

To help decision makers anticipate potential flooding and take preemptive measures to alleviate the pressure brought by urban floods, promote the steady development of cities and ensure the safety of people's lives and property, researchers and practitioners have carried out extensive research on urban flood prediction (Bhan and Team 2001; Diaz-Nieto et al. 2012; Gain and Hoque 2013; Kong et al. 2017). Hydrological and hydrodynamic models and data-driven models are the most widely used tools for the early warning and forecasting of urban floods (White and Greer 2006; Bubeck et al. 2016).

Hydrological and hydrodynamic models are based on hydrological characteristics and describe runoff and confluence physically by combining the laws of mass, momentum and energy conservation (Vojinovic and Tutulic 2009). SWMM (Zhao et al. 2009; Huong and Pathirana 2013), MIKE (Zoppou 2001; Zolch et al. 2017) and InfoWorks (Schmitt et al. 2004) are widely used hydrological and hydrodynamic models in flood prediction. Zhang and Li (2019) built an urban flood model based on SWMM to predict the flood disaster and pipeline drainage process under different types of design rainfall, using topographic maps, the underground drainage network, urban land use and rainfall data. Their results demonstrate the applicability of SWMM to urban rainstorm flood simulation and drainage analysis of pipe networks. Wu et al. (2017) established a two-dimensional hydrodynamic inundation model by coupling SWMM with the LISFlood-FP model, and on this basis revealed how inundation evolves in the Shiqiaoxi District (SCD) of Dongguan City under heavy rain for different scenarios of sea level rise and subsidence. Patro et al. (2009) used the results of MIKE 11 as the input of the two-dimensional model MIKE 21, coupled the two models laterally to form the two-dimensional flood inundation model MIKE FLOOD for the study area, and numerically simulated the flood inundation extent and depth. Bisht et al. (2016) used the two-dimensional (2D) MIKE model to overcome the limitations of the one-dimensional (1D) SWMM model in simulating flood extent and inundation, and simulated flooding in a small urbanized area in West Bengal, India. Cheng et al. (2017) used the InfoWorks ICM 2D hydrodynamic model to simulate historical and design rainfall events in the “Sponge City Construction” pilot area of Jinan City. The simulated water depth and flow velocity were recorded for flood risk zoning, and the results show that the InfoWorks ICM 2D model performed well.

Data-driven intelligent models do not need to represent the specific physical process. They analyze and learn from existing observation data to establish a mapping between input and output, which is then used to predict the target variables (Nourani et al. 2009; Jhong et al. 2016). Ding et al. (2020) proposed an interpretable spatiotemporal attention long short-term memory model (STA-LSTM), which combines LSTM with a dynamic attention mechanism to make flood predictions interpretable. Granata et al. (2016) predicted rainfall-induced runoff using support vector regression (SVR) and compared the results with those of the SWMM model; SWMM overestimated the runoff compared with SVR. Kim and Han (2020) established flood prediction models for various basins by combining a nonlinear autoregressive model with a self-organizing map (NARX-SOM), and predicted the floods caused by the extremely heavy rainstorms in Seoul, South Korea in 2010 and 2011 with high predictive ability. She and You (2019) combined the architectural advantages of the radial basis function neural network (RBFNN) and the nonlinear autoregressive with exogenous input neural network (NARXNN) and proposed the RBFM prediction model to predict urban drainage system flow, which demonstrated the great potential of RBFM in urban runoff prediction and management. Wu et al. (2020a) established a real-time prediction model of flood depth at waterlogging points using the GBDT algorithm based on multi-factor analysis, and verified the validity and applicability of the model for real-time prediction of the waterlogging process. However, the model of Wu et al. can only make predictions while rainfall is occurring and cannot predict the flood depth after rainfall ends.

The above studies have achieved good results in the field of flood prediction. However, current research still focuses on predicting a single aspect of urban floods, such as depth, extent or the duration of water retention, which makes it difficult to take appropriate decisions in time to avoid the damage caused by flood disasters (Yazdi and Neyshabouri 2012; Wu et al. 2020b). Moreover, studies on spatial flood prediction for large urban basins are insufficient. This paper therefore uses two machine learning algorithms, Naive Bayes and random forest, to forecast urban waterlogging generated by rainfall. The specific contents are as follows: 1. Based on rainfall data, a classifier model using the Naive Bayes algorithm is constructed to analyze and predict urban waterlogging points; 2. A regression prediction model based on the random forest algorithm is constructed, which uses real-time rainfall and water accumulation information to make short-term real-time predictions of the water accumulation process at waterlogging points. From the identification of waterlogging points to the prediction of the water level at those points, the whole process of urban waterlogging is predicted, which provides technical support for urban flood control management.

Materials and methods

Study area

Zhengzhou, the capital of Henan Province in Central China, covers an area of approximately 7446 km² (Fig. 1). Its permanent resident population reached 10.352 million by the end of 2019, ranking 14th in China. Of these, the urban population was 7.721 million, giving an urbanization rate of 74.6%. As an important hub city on the “new Silk Road” between Europe and Asia, Zhengzhou had a total GDP (Gross Domestic Product) of 177.3 billion dollars in the same year. Zhengzhou (34°16′–34°58′ N; 112°42′–114°14′ E) has a continental monsoon climate, in which 60% of its 524.1 mm annual average rainfall occurs during the summer months from June to September, when the risk of urban flooding is higher. For example, heavy rains on August 19, 2018 and August 1, 2019 caused widespread flooding in the city; some waterlogging-prone points experienced serious water accumulation, which disrupted regular traffic (Figs. 2, 3).

Fig. 1 Location of the study area

Fig. 2 The structure of the Naïve Bayes

Fig. 3 Model construction of urban flood water accumulation process prediction based on RF regression algorithm

Data and material

Analysis of the hydrological process of water accumulation and confluence shows that both rainfall and geographical factors play an indispensable role in the formation of standing water. Ignoring either of these factors may distort and bias the predicted results. Thus, considering the suitability of machine learning algorithms for multidimensional data, and in combination with previous research results (Choubin et al. 2019; Vafakhah et al. 2020) and the data available in the study area, three main groups of prediction features, namely geographical characteristics, rainfall characteristics and flood characteristics, were selected for training and verifying the model. Geographical characteristics describe land use (the proportions of roads, woodlands, grasslands and buildings) and geographical structure (permeability, catchment area and slope), and were extracted from Pleiades satellite imagery acquired in May 2014 with a high spatial resolution of 0.5 m. Rainfall characteristics include three rainfall indexes, namely rainfall amount, rainfall duration and peak rainfall, which were obtained from the Henan Meteorological Service. Because the rainfall characteristics vary across different parts of the Zhengzhou urban area during a rainfall event, the rainfall data were processed by Kriging spatial interpolation to refine them and increase the diversity of rainfall intensity. Flood characteristics comprise the locations and depths of flooded urban areas, which were collected from the monitoring equipment at each intersection administered by the Zhengzhou Municipal Urban Management Bureau.

Naive Bayes (NB) algorithm

Among classical machine learning algorithms, the Naive Bayes classifier is one of the few classification algorithms based on probability theory (Perez et al. 2009). It does not consume as much computation time as k-nearest neighbor, support vector machine and other methods, nor does it require any parameters to be determined and input (Patil and Atique 2020). Model training and testing are therefore relatively fast, which is an outstanding advantage because it leaves urban flood control work sufficient time to deal with the damage caused by urban floods. It has also been reported to outperform five other classifiers: decision tree, logistic regression, k-nearest neighbor, support vector machine with polynomial kernel, and support vector machine with radial basis function (Lou et al. 2014). The NB classifier predicts the probability of class membership, that is, the probability that a given set of variables (features) belongs to a particular class (Omran and El Houby 2020). The NB classifier works as shown in the following steps:

The NB classifier assigns X to the class \(Y_{i}\) that has the highest posterior probability conditioned on X, which means that:

$$P\left( {\mathop Y\nolimits_{i} |X} \right) > P\left( {\mathop Y\nolimits_{j} |X} \right)\;\;{\text{for}}\;\;1 \le j \le m,j \ne i$$
(1)

Based on Bayes’ theorem, P (Y|X) can be written as formula:

$$P\left( {Y\left| X \right.} \right) = \frac{{P\left( Y \right)P\left( {X\left| Y \right.} \right)}}{P\left( X \right)}$$
(2)

For a given sample, P(X) is independent of the class label and is the same for all classes, so P(Y|X) depends only on P(Y) and P(X|Y). Under the assumption that each feature attribute influences the prediction result independently, the formula can be rewritten as:

$$h_{nb} \left( X \right) = \arg \max P\left( Y \right)\prod\nolimits_{i = 1}^{d} {P\left( {x_{i} \left| Y \right.} \right)}$$
(3)

where d is the number of feature attributes and \(x_{i}\) is the value of the i-th attribute of X. The training set is then used to estimate \(P(x_{i}|Y)\) for each class. Finally, the model takes the class with the highest probability as the output.
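The decision rule in Eq. (3) can be illustrated with a minimal sketch. The paper does not state which form of \(P(x_{i}|Y)\) was used, so a Gaussian likelihood per feature is assumed here purely for illustration; the function names and the toy samples (rainfall in mm, impervious fraction) are hypothetical and are not taken from the study data. Log-probabilities are summed instead of multiplying probabilities, which leaves the arg max unchanged.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate P(Y) and per-feature Gaussian parameters for P(x_i | Y)."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        prior = len(rows) / len(X)
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))  # small floor avoids division by zero
        model[label] = (prior, stats)
    return model


def predict_gaussian_nb(model, x):
    """Return the class maximising log P(Y) + sum_i log P(x_i | Y) (Eq. 3 in log form)."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mean, var) in zip(x, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy samples: [rainfall in mm, impervious fraction]; label 1 = flooded.
X_train = [[60, 0.8], [75, 0.9], [55, 0.7], [10, 0.3], [15, 0.4], [20, 0.2]]
y_train = [1, 1, 1, 0, 0, 0]
nb_model = fit_gaussian_nb(X_train, y_train)
```

Calling `predict_gaussian_nb(nb_model, [70, 0.85])` classifies a new sample by comparing the two class scores. No hyperparameters need tuning, which is consistent with the speed advantage described above.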

Random Forest (RF) algorithm

The Random Forest algorithm is an ensemble machine learning algorithm for classification or regression (Prajwala 2015; Kabir et al. 2018). It was first introduced by Breiman (2001) and has been widely used in recent years in geography (Gislason et al. 2006; Guo et al. 2020), bioecology (Parkhurst et al. 2005; Smith et al. 2010), medicine (Chen and Liu 2006; Lee et al. 2010) and other fields. RF is a tree-structured algorithm that combines multiple decision trees, each of which generates a prediction from different characteristics of the same phenomenon. Compared with other current machine learning models, the RF algorithm has three obvious advantages (Malekipirbazari and Aksakalli 2015; Li et al. 2020): 1. it can deal with problems that have high-dimensional independent variables; 2. it is able to fit and predict nonlinear problems; 3. the learning process is fast, and it can deal with a large amount of data efficiently. The important steps to implement the RF regression algorithm are presented below:

  1. K data sets are drawn at random from the input data set by Bootstrap sampling. Each of the K data sets has the same size as the original data set, and its records may be repeated. This step is the first “Random” in the RF model.

  2. Assuming that the number of variables in a data set is M, \(\mathop M\nolimits_{try}\) variables are randomly selected at each node of each regression tree as candidate branching variables, and the optimal branch is then selected according to the branching criterion. This step is the second “Random” in the RF model.

  3. \(\mathop K\nolimits_{tree}\) decision trees are constructed and trained using the data selected in steps 1 and 2. Each decision tree grows as much as possible without pruning, and the K decision trees together form a random forest. This step is the “Forest” in the RF model.

  4. The prediction for a new sample is obtained by averaging the predictions of all the individual fully grown regression trees in the RF regression model:

    $$f = \frac{1}{{\mathop K\nolimits_{tree} }}\sum\limits_{i = 1}^{{\mathop K\nolimits_{tree} }} {\mathop f\nolimits_{i} \left( x \right)}$$
    (4)

where \(\mathop K\nolimits_{tree}\) is the total number of trees and \(\mathop f\nolimits_{i} \left( x \right)\) is the prediction of the i-th fully grown regression tree trained on the training data set.
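The four steps above can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not the implementation used in the study: the branching criterion is assumed to be the total squared error after a split, split thresholds are taken as midpoints between observed values, and all names (`build_tree`, `fit_forest`, `k_tree`, `m_try`) and the sample data are hypothetical.

```python
import random


def build_tree(rows, targets, m_try, rng):
    """Grow an unpruned regression tree; each node tries m_try random features."""
    mean = sum(targets) / len(targets)
    if len(set(targets)) == 1:
        return mean  # pure leaf
    best = None
    for feature in rng.sample(range(len(rows[0])), m_try):  # second "Random"
        values = sorted({row[feature] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [t for r, t in zip(rows, targets) if r[feature] < threshold]
            right = [t for r, t in zip(rows, targets) if r[feature] >= threshold]
            if not left or not right:
                continue
            # Assumed branching criterion: total squared error after the split.
            sse = sum((t - sum(left) / len(left)) ** 2 for t in left)
            sse += sum((t - sum(right) / len(right)) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, feature, threshold)
    if best is None:
        return mean  # the sampled features cannot split this node
    _, feature, threshold = best
    go_left = [r[feature] < threshold for r in rows]
    left_rows = [r for r, g in zip(rows, go_left) if g]
    left_targets = [t for t, g in zip(targets, go_left) if g]
    right_rows = [r for r, g in zip(rows, go_left) if not g]
    right_targets = [t for t, g in zip(targets, go_left) if not g]
    return (feature, threshold,
            build_tree(left_rows, left_targets, m_try, rng),
            build_tree(right_rows, right_targets, m_try, rng))


def predict_tree(node, x):
    """Walk from the root to a leaf and return the leaf mean."""
    while isinstance(node, tuple):
        feature, threshold, left, right = node
        node = left if x[feature] < threshold else right
    return node


def fit_forest(rows, targets, k_tree=25, m_try=1, seed=0):
    """Train k_tree trees, each on a bootstrap sample (first "Random")."""
    rng = random.Random(seed)
    forest = []
    for _ in range(k_tree):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sampling with replacement
        forest.append(build_tree([rows[i] for i in idx],
                                 [targets[i] for i in idx], m_try, rng))
    return forest


def predict_forest(forest, x):
    """Average the individual tree predictions, as in Eq. (4)."""
    return sum(predict_tree(tree, x) for tree in forest) / len(forest)


# Hypothetical toy data: [rainfall intensity, impervious fraction] -> depth in cm.
rows = [[1, 0.1], [2, 0.2], [3, 0.1], [4, 0.3],
        [20, 0.8], [21, 0.9], [22, 0.7], [23, 0.8]]
depths = [1, 1, 2, 2, 10, 11, 10, 11]
forest = fit_forest(rows, depths, k_tree=25, m_try=1, seed=0)
```

Because every tree sees a different bootstrap sample and a different random subset of features at each node, the individual trees disagree slightly, and averaging them smooths out the errors of any single tree.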

The above modeling steps show that the RF model increases the diversity of its component trees, which can effectively reduce overfitting and improve the predictive performance of the model (Table 1).

Table 1 The impact factors and flood data in the study

Evaluation of model accuracy

Model evaluation is an important step in the modeling and prediction process; it indicates the accuracy of the results obtained by the model and the degree of trust that can be placed in it. For the prediction of flood susceptibility at waterlogging points based on NB theory, Precision, Recall, Accuracy and F1 score are used as indicators of model performance (Table 2).

Table 2 Categories of result and evaluation indexes of NB classification model
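As a hedged illustration of these four indicators, the sketch below computes them from binary labels using the standard confusion-matrix definitions, which Table 2 is assumed to follow. The function name and the toy label vectors are hypothetical (1 = flooded, 0 = not flooded).

```python
def classification_metrics(actual, predicted):
    """Compute precision, recall, accuracy and F1 from binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # flooded, predicted flooded
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # dry, predicted dry
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false alarm
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # missed flood
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Hypothetical labels for eight waterlogging points.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(actual, predicted)
```

When flooded points are rare, accuracy can look high even if many floods are missed, which is one reason to report precision, recall and F1 alongside it.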

For the short-term real-time prediction of the flood process based on the RF algorithm, Mean Absolute Error (MAE), Mean Relative Error Ratio (MRER) and Root Mean Square Error (RMSE) are used as indicators of model performance, which are calculated by the following formulas:

$$MAE = \frac{1}{n}\sum\nolimits_{i = 1}^{n} {\left| {\left( {\mathop y\nolimits_{si} - \mathop y\nolimits_{oi} } \right)} \right|}$$
(5)
$$MRER = \frac{1}{n}\sum\nolimits_{i = 1}^{n} {\frac{{\left| {\mathop y\nolimits_{si} - \mathop y\nolimits_{oi} } \right|}}{{\mathop y\nolimits_{oi} }} \times 100\% }$$
(6)
$$RMSE = \sqrt {\frac{1}{n}\sum\nolimits_{i = 1}^{n} {\left( {y_{si} - y_{oi} } \right)}^{2} }$$
(7)

where \(y_{si}\) and \(y_{oi}\) are the simulated value and the measured value of the flood at point i, respectively, and n is the number of points.
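Eqs. (5) to (7) translate directly into code. The sketch below is a minimal illustration with hypothetical depth values; it assumes every measured value \(y_{oi}\) is non-zero, since MRER in Eq. (6) divides by it.

```python
import math


def regression_errors(simulated, observed):
    """Return (MAE, MRER in percent, RMSE) following Eqs. (5)-(7)."""
    n = len(observed)
    abs_err = [abs(s - o) for s, o in zip(simulated, observed)]
    mae = sum(abs_err) / n
    # Assumes every observed value is non-zero (Eq. 6 divides by y_oi).
    mrer = sum(e / o for e, o in zip(abs_err, observed)) / n * 100
    rmse = math.sqrt(sum(e ** 2 for e in abs_err) / n)
    return mae, mrer, rmse


# Hypothetical measured and simulated water depths.
observed = [10.0, 20.0, 40.0]
simulated = [11.0, 18.0, 40.0]
mae, mrer, rmse = regression_errors(simulated, observed)
```

RMSE squares each error before averaging, so it penalises a few large deviations more heavily than MAE does, while MRER expresses the error relative to the measured depth.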

The closer the four classification indexes (P, R, A and F1) are to 1, the more accurate the NB model is in predicting waterlogging points. The smaller the three regression indicators (MAE, MRER and RMSE) are, the better the flood depth predicted by the RF model agrees with the actual situation.

Results and discussions

Predictive analysis of flood susceptibility of urban waterlogging points based on Naive Bayes classification model

Previous studies and transport project appraisals (Dalziell and Nicholson 2001; Chang et al. 2010; Pregnolato et al. 2017) have shown that when the flood depth is 3–5 cm, urban vehicles can pass normally without being affected, so the threshold for determining a flood is set to 5 cm in this study. When the maximum flood depth in a waterlogging area is greater than the 5 cm threshold, flooding is considered to occur in this area and the area is treated as a positive sample; otherwise, the area is considered not to be at risk of flooding. Data on 10 historical rainfall events and the corresponding flood depths are available, which occurred on July 26th, 2011; August 2nd, 2012; May 26th, 2013; June 9th and 19th, 2014; July 22nd, 2015; June 11th, July 19th and August 5th, 2016; and July 20th, 2017. SQL Server Data Tools was used to process the heterogeneous data on geographical, rainfall and flood characteristics and to build the database. The geographical feature information of Zhengzhou city and the data of 7 of the rainfall-flood events were used as the training data set to train the model, and the remaining 3 events (August 2nd, 2012; June 19th, 2014 and July 19th, 2016) were used to verify the model (Table 3).

Table 3 The prediction result by the NB model (one of three validation events)
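The labelling rule and the event split described above can be written down compactly. In this sketch the 5 cm threshold and the event dates come from the text, while the function name, the variable names and the example depths are hypothetical.

```python
FLOOD_THRESHOLD_CM = 5.0  # threshold stated in the text


def label_flooded(max_depths_cm, threshold=FLOOD_THRESHOLD_CM):
    """Label a point 1 (positive sample) when its maximum depth exceeds the threshold."""
    return [1 if depth > threshold else 0 for depth in max_depths_cm]


# The 10 rainfall events listed in the text, as ISO dates.
events = ["2011-07-26", "2012-08-02", "2013-05-26", "2014-06-09", "2014-06-19",
          "2015-07-22", "2016-06-11", "2016-07-19", "2016-08-05", "2017-07-20"]
validation_events = {"2012-08-02", "2014-06-19", "2016-07-19"}
training_events = [e for e in events if e not in validation_events]

# Hypothetical maximum depths (cm) at four monitoring points.
labels = label_flooded([3.0, 5.0, 5.1, 12.0])
```

Splitting by whole rainfall events, rather than by individual records, keeps all observations from one storm on the same side of the split, so the validation events are genuinely unseen by the model.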

Short real-time prediction of water accumulation process based on random forest regression algorithm

After the flood susceptibility analysis and the waterlogging points had been obtained with the NB model, one waterlogging point in the city was randomly selected, and data from 6 previous rainfall and water accumulation events at this point were collected (Fig. 4). Through linear interpolation in data preprocessing, the rainfall data and water accumulation data were unified to the same time scale, with a time granularity of 1 min.

Fig. 4 Time series diagram of rainfall intensity and flood depth
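The resampling to a 1 min time scale can be sketched as follows. This is a minimal illustration of linear interpolation only: it assumes integer timestamps in minutes that are strictly increasing, and the function name and sample record are hypothetical.

```python
def resample_linear(times_min, values, step=1):
    """Linearly interpolate irregular (time, value) records onto a regular grid.

    Assumes integer, strictly increasing timestamps and at least two records.
    """
    grid = list(range(times_min[0], times_min[-1] + 1, step))
    resampled = []
    j = 0
    for t in grid:
        # Advance to the record interval [times_min[j], times_min[j + 1]] containing t.
        while times_min[j + 1] < t:
            j += 1
        t0, t1 = times_min[j], times_min[j + 1]
        v0, v1 = values[j], values[j + 1]
        resampled.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return grid, resampled


# Hypothetical water depth record: readings at 0, 5 and 10 minutes.
grid, depth_1min = resample_linear([0, 5, 10], [0.0, 10.0, 5.0])
```

Applying the same function to the rainfall record and to the water depth record puts both series on an identical 1 min grid, which is what the windowing step requires.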

The time series of rainfall and water accumulation at this waterlogging-prone point are divided and the data set is constructed using the moving window method, which is a common method for constructing data sets (Wang et al. 2005; Jing et al. 2020). A moving window of 2 × w grids rolls through the rainfall-flood data grid of size 2 × J (rows × columns) at a step of 1. Determining the number of input variables (the width w of the moving window) is the most important task in RF model development. To determine the value of w, samples of 12 different combinations of input data were arranged as provided in Table 4. Figure 5 shows the results of training the RF regression model with the different input combinations: beyond A9, the model's OOBS (out-of-bag score) increases only marginally. Thus, considering both the accuracy of the model and the complexity of the model input, the width of the moving window is set to 9, that is, each sample contains 9 consecutive rainfall records and 9 consecutive water accumulation records. The prediction time step is set to 5 min.

Table 4 Model structure with a different input combination
Fig. 5 The OOBS of RF regression model with 12 different input combination models
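The moving window construction can be illustrated with a short sketch. It assumes, as one reading of the description above, that each sample concatenates w consecutive rainfall values and w consecutive depth values (18 features for w = 9) and that the target is the depth `horizon` steps after the end of the window (5 steps at 1 min granularity). The function name and the toy series are hypothetical.

```python
def moving_window_samples(rainfall, depth, w=9, horizon=5):
    """Slide a 2 x w window over the series at a step of 1.

    Returns feature rows (w rainfall values followed by w depth values) and the
    depth observed `horizon` steps after the last record in each window.
    """
    samples, targets = [], []
    for start in range(len(rainfall) - w - horizon + 1):
        end = start + w
        samples.append(list(rainfall[start:end]) + list(depth[start:end]))
        targets.append(depth[end - 1 + horizon])
    return samples, targets


# Hypothetical 20-minute series at 1 min granularity.
rainfall = list(range(20))
depth = [10 * i for i in range(20)]
samples, targets = moving_window_samples(rainfall, depth, w=9, horizon=5)
```

A wider window gives the model more history per sample but also more input features and fewer samples per event, which is the trade-off behind choosing the width from the OOBS curve.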

The RF model contains several built-in parameters, of which three mainly affect the accuracy of the model: the number of trees, the number of features considered at each split and the maximum depth of each decision tree (Liu et al. 2020). These three parameters were selected and optimized, and the best parameter combination was obtained, by means of a traversal search and tenfold cross validation (Table 5). The sampling ratio parameter represents the proportion of prediction features selected for each sample. Each sample contains 18 prediction features, so a ratio of 0.7 means that 12 prediction features are selected per sample. At the beginning and end of each record, the rainfall and water accumulation values were padded with 0 for input to the model. The data of the first five rainfall and water accumulation events were used as the training data set, and the data of the last rainfall event were used to verify the prediction performance of the model (Table 6).

Table 5 Parameters optimization results of the RF model
Table 6 Simulation result of the last rainfall event by RF model

Evaluating the performance of the model

The accuracy of the NB classifier was evaluated by comparing the observed flooding with the predicted classification of urban waterlogging points under real rainfall events (Table 7). To make the predicted results more intuitive, the predicted flood susceptibility of the waterlogging-prone points, combined with geographic location information, was imported into GIS and compared with the actual flood locations, giving the distribution diagram of the indicators for the waterlogging-prone points (Fig. 6). According to the indexes obtained from the results, the precision, recall, accuracy and F1 score all exceeded 90%, indicating that the analysis and prediction of flood susceptibility at urban waterlogging points are reliable. In case of rainfall, the NB model can predict the areas where urban flooding is likely to occur in Zhengzhou city and thus provide reliable information support for city flood control workers. The figure shows that waterlogging in Zhengzhou is less serious than in southern cities and that the waterlogged area is relatively concentrated in the southwest of the city, which may be caused by the early construction of the drainage system in this area and a long-term lack of maintenance and repair.

Table 7 Performances of the NB classification prediction model
Fig. 6 Distribution and number of waterlogging points of the rainfall event on June 19th, 2014

The accuracy of the RF regression prediction model was assessed using the MAE, MRER and RMSE between the simulated and measured values (Table 8). As shown in Table 8, the MAE, MRER and RMSE of the predicted water accumulation depth are 0.95%, 9.53% and 1.21% respectively, which indicates that the water depth predicted by the RF model is close to the measured value and that the RF prediction model is feasible for predicting water accumulation processes. To compare the difference between the predicted and actual water levels over time more intuitively, the regression curve of the water level was fitted (Fig. 7).

Table 8 Performances of the RF regression prediction model
Fig. 7 Fitting curves between predicted and measured values

The figure shows that the trend of the water depth predicted by the RF model is synchronized with the trend of the measured water depth. Together with the values of the three indexes (MAE, MRER and RMSE), this supports the applicability of the RF model to predicting the process of water accumulation.

Conclusion

In this study, to predict the whole process of urban waterlogging, the Naive Bayes and random forest algorithms were used to forecast the waterlogging points and the waterlogging process at a waterlogging point, respectively. Four classification evaluation indexes (P, R, A and F1) and three regression evaluation indexes (MAE, MRER and RMSE) were used to evaluate the prediction performance of the NB classification model and the RF regression model.

The results show that the NB model predicted waterlogging points with good performance. The four classification evaluation indexes (P, R, A and F1) are 91%, 90.5%, 98.9% and 90.7% respectively. These findings demonstrate the validity of the model for predicting water accumulation points when specific rainfall information is available. Therefore, given relatively accurate rainfall forecast information, the NB classification algorithm can be used to predict waterlogging points, giving urban flood control workers more time to respond to urban waterlogging. The input data set of the RF model was constructed using a sliding window. By comparing the OOBS obtained from 12 different input models, the optimal input model of the RF model was determined to be A9. The data of the first 5 rainfall events were used to train the model, and the last rainfall event was simulated and predicted. The three regression indexes (MAE, MRER and RMSE) were 0.95%, 9.53% and 1.21% respectively, which demonstrates the validity of the RF regression model for predicting the water accumulation process at the water accumulation point.

The results show that the NB model and the RF model can be used to predict flood and waterlogging information under urban rainfall. They provide effective technical support for urban flood control and forecasting, giving the city's flood control work enough time and sufficiently accurate flood information to prevent, and make decisions about, the damage caused by floods in advance.