1 Introduction

Rainfall is a critical factor for human life and the national economy. Changes in rainfall patterns can lead to extreme events such as floods and droughts, which greatly affect agriculture, water resources, and the ecological environment [1]. The increase in extreme rainfall events in recent years, driven by global climate warming, highlights the need for accurate rainfall information to prevent water disasters and manage water resources more effectively. Many studies on rainfall prediction have been conducted both domestically and internationally, and the commonly used methods fall mainly into two categories: probability statistical methods and time series analysis [2]. However, these methods have limitations. Probability statistical methods, including the gray GM(1,1) model, exponential smoothing, and Markov models, can only predict data with large random fluctuations. Time series prediction methods, including autoregressive (AR), autoregressive moving average (ARMA), and autoregressive integrated moving average (ARIMA) models, tend to predict values close to the average and are inaccurate in predicting extremes [3]. Thus, traditional rainfall prediction models struggle to forecast rainfall precisely.

With the advent of the big data era, machine learning methods have become increasingly mature, offering significant advantages in prediction problems due to their high flexibility and data-driven learning ability. Machine learning is widely applied in water conservancy engineering for agricultural irrigation, water quality testing, and reservoir scheduling [4]. In the stock market, it is used to predict future stock prices or returns [5], while in social sciences, it is used in intelligent homes, smart transportation, medical inspection technology, and other fields [6]. These applications underscore the critical role of machine learning in driving progress in these areas. Because it can discover hidden patterns in meteorological data, machine learning can be used to establish appropriate regression models for predicting future rainfall. Therefore, the use of machine learning for rainfall prediction is a meaningful topic worth exploring [7].

In recent years, researchers from both domestic and foreign institutions have studied the use of machine learning methods for rainfall prediction. Gocic and Shamshirband used linear regression to predict the rainfall trend based on monthly data from 1946 to 2012 in 29 regions of Serbia. They divided the 66-year record into three time series combinations to investigate changes in rainfall from 1986 to 2012 [8]. Similarly, Sulaiman predicted heavy rainfall occurrences in the same region using an artificial neural network (ANN) model. That study sorted precipitation data from local weather stations between 1965 and 2015 and divided the monthly precipitation values from the past 50 years into different combinations before using them as prediction input data. The predictive performance of the ANN model was evaluated using the mean square error and the correlation coefficient [9]. However, relatively few studies have been conducted in this field in China. The constructed rainfall prediction models are mostly based on specific regions, making it difficult to verify their universality and generalization ability. There is also a lack of comparisons between different machine learning models. No scientific standard for model selection has been established, and selecting the factors that affect rainfall is complicated, which makes collecting meteorological data difficult [10].

In light of the existing literature and the outstanding issues in rainfall prediction, this paper proposes a practical solution. Firstly, by combining the causes of the flood at Zhengzhou on July 20, 2021 with the necessary conditions for heavy rainfall, this study selects five variables: specific humidity, relative humidity, horizontal water vapor flux, vertical water vapor flux, and lifting index. This approach achieves accurate rainfall prediction with fewer variables, reducing the difficulty of data collection. Secondly, the grid search method is used for hyperparameter tuning on the initial machine learning regression models to find the parameter combination that maximizes the prediction accuracy within a certain range, thus improving the model's prediction accuracy and generalization ability. Finally, the adjusted goodness of fit and mean square error are used as evaluation indicators to measure the model's performance. By comparing the rainfall prediction results of several typical regions in China, such as Zhengzhou, Beijing, and Chengdu, the optimal model for the rainfall prediction problem is selected. These findings offer new ideas and methods for resolving other practical regression problems and serve as a reference for further exploration of regression analysis algorithms.

The remainder of this article is organized as follows. Section 2 elaborates on the theoretical basis of machine learning prediction, including linear regression (LR), random forest regression (RFR), support vector regression (SVR), and Bayesian ridge regression (BRR). In Sect. 3, the application of machine learning methods in rainfall prediction is discussed, including the selection of variables and data acquisition and processing. Then we employ the four aforementioned models to predict rainfall results in different regions. Section 4 compares the prediction results of the different methods. Finally, limitations of this study and its future research are also discussed.

2 Prediction theory and methods of machine learning

In this section, theory and methods of machine learning for prediction are introduced.

2.1 Linear regression model

The linear regression (LR) algorithm is a common machine learning algorithm that is easy to understand, convenient to execute, and widely applicable [11]. The basic principle of the linear regression model is to analyze the relationship between the dependent variable and one or more independent variables and to fit, from the training data, a linear equation that approximates the sample data. The goal is to predict the target value as accurately as possible from the input variables [12]. When there is only one independent variable in the regression model, it is called a simple linear regression model. When the regression model contains two or more independent variables, it is called a multiple linear regression model. The specific steps for constructing a linear regression model are as follows:

  1. Determine the number of variables and select appropriate independent variables based on the scenario.

  2. Determine the error measurement standard and select the loss function.

  3. Find the optimal model performance by changing the optimizer for the scenario.

2.2 Random forest regression model

Ensemble learning is a method of combining multiple learners to improve prediction performance [13]. Ensemble methods can be roughly divided into two categories: Bagging (parallel) and Boosting (serial). The random forest algorithm, known for its simplicity and efficiency, is one of the representative ensemble learning algorithms. It was proposed by Breiman in 2001 by combining the Bagging ensemble learning theory with the random subspace method [14]. In classification problems, the final classification result of the random forest model is determined by voting based on the output results of individual decision trees, following the principle of majority rule [15]. In prediction problems, the random forest model integrates the prediction results of many decision trees and outputs their average [16]. Due to its good performance, the random forest model has achieved great success in many application fields. The training process of the random forest model is shown in Fig. 1 [17].

Fig. 1 The process of constructing the random forest model
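As a minimal sketch of the averaging step described for prediction problems, the following Python snippet checks that the output of a random forest regressor equals the mean of the outputs of its individual trees. The use of scikit-learn's RandomForestRegressor, the synthetic data, and all parameter values are assumptions made for illustration and are not taken from this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): 200 samples with 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)

# Bagging ensemble of 50 decision trees (illustrative size).
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Prediction of every individual tree, shape (n_trees, n_samples).
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])

# The regression output of the forest is the arithmetic mean over trees.
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X)

print(np.allclose(manual_average, forest_prediction))
```

For classification the aggregation would instead be a vote over the trees, which is not shown here.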

2.3 Support vector regression model

Support vector regression (SVR) is a machine learning method that uses statistical learning theory to perform regression calculations based on the idea of support vector machine (SVM). It is widely used to solve nonlinear problems and is suitable for finite sample studies, with the ability to theoretically obtain globally optimal solutions [18].

This model transforms the actual problem into a high dimensional feature space through nonlinear transformation and constructs a linear decision function in the high dimensional space to achieve the nonlinear decision function in the original space, effectively solving the dimensionality problem [19]. Additionally, this model has the advantages of simple structure, strong generalization ability, and high prediction accuracy [20]. It has a wide range of applications in function approximation, regression prediction, and other areas. The training process of SVR is displayed in Fig. 2 [21].

Fig. 2 The process of constructing the SVR model

2.4 Bayesian ridge regression model

Ridge Regression is a model tuning method specialized in collinearity data analysis [22]. It is essentially an improved least squares estimation method that sacrifices some information and reduces accuracy to obtain more realistic and reliable regression coefficients, and it fits ill-conditioned data better than the least squares method [23].

Bayesian linear regression is a linear regression model solved using Bayesian inference methods in statistical learning [24]. Bayesian linear regression views the model's stochastic parameters as random variables, calculates their posterior probabilities based on the prior probabilities of the model's parameters (weight coefficients), and uses the maximum likelihood method to estimate unknown parameters.

The Bayesian ridge regression model combines the advantages of the ridge regression model and Bayesian linear regression model and introduces an L2 regularization term on the basis of Bayesian linear regression estimation [25]. It is not only suitable for predicting normal data but also has a good fitting effect when there is multicollinearity among independent variables. Due to the model's adaptive ability to the data, it can reuse data, which largely solves the problem of overfitting in maximum likelihood estimation [26].

3 Machine learning based rainfall prediction

3.1 Data acquisition and processing

3.1.1 Data description

In this paper, we use a dataset acquired from the Google Earth Engine platform, including monthly rainfall data of nine regions in China: Zhengzhou, Beijing, Hohhot, Kunming, Lhasa, Lanzhou, Chengdu, Jinan, and Xi'an, from January 1, 1980 to April 1, 2013.

3.1.2 Feature selection

The essence of rainfall is the condensation of water vapor in the air, and the amount of rainfall depends on the water vapor content, condensation efficiency, and duration. Therefore, analyzing the water vapor conditions in the atmosphere mainly involves the distribution of water vapor content and water vapor flux [27].

There are various ways to express water vapor content: specific humidity \((q)\) represents the mass of water vapor per unit mass of moist air, while water vapor pressure \((e)\) represents the partial pressure of water vapor in moist air. Both specific humidity and water vapor pressure represent the absolute water vapor content in the atmosphere and are therefore referred to as absolute humidity. Relative humidity \((\mathrm{RH})\) is also a fundamental physical parameter that characterizes water vapor content and is widely used in rainfall prediction [28].

Water vapor flux, also known as water vapor transport, is an important parameter that indicates both the intensity and direction of water vapor transport [29]. It can be divided into horizontal water vapor flux and vertical water vapor flux. The horizontal or vertical transport of water vapor is an essential component of the atmospheric water cycle and is closely related to the formation of precipitation. Therefore, long-term observations of water vapor flux are of considerable research value for predicting rainfall.

The lifting index is a measure of convective instability. It represents the difference between the actual atmospheric temperature at 500 hPa and the temperature of an air parcel that starts from the observed surface, rises along the dry adiabat to the lifting condensation level, and then ascends along the moist adiabat to 500 hPa [30]. A negative lifting index indicates atmospheric instability, and the larger the magnitude of the negative value, the greater the degree of instability and the more likely precipitation becomes. Conversely, positive values indicate atmospheric stability and a lower likelihood of precipitation.

Therefore, taking into consideration the main factors influencing rainfall and the necessary conditions for heavy rainfall, which include abundant water vapor, atmospheric dynamics, and atmospheric stability, we ultimately select five physical quantities: specific humidity \((q)\), relative humidity \((RH)\), horizontal water vapor flux \(\left|{F}_{H}\right|\), vertical water vapor flux \({F}_{z}\), and lifting index \((LI)\) as feature indicators for predicting rainfall. They can be calculated as follows:

$$\text{Specific humidity}: q=\frac{0.6220e}{P-0.3780e},$$
(1)
$$\text{Relative humidity}: RH=\frac{e}{{e}_{w}},$$
(2)
$$\text{Horizontal water vapor flux}: \left|{F}_{H}\right|=\frac{\left|V\right|q}{g},$$
(3)
$$\text{Vertical water vapor flux}: {F}_{z}=-\frac{wq}{g},$$
(4)
$$\text{Lifting index}: LI={T}_{500}-{T^{\prime}}_{\rm suf}.$$
(5)

In the formulas above, \(e\) represents water vapor pressure, \({e}_{w}\) represents saturated water vapor pressure, \(P\) is the atmospheric pressure, \(g\) is the gravitational acceleration, \(u\) is the longitudinal wind, \(v\) is the latitudinal wind, \(w\) is the vertical wind, \(V\) is the total wind speed, \({T}_{500}\) is the actual temperature at 500 hPa, and \({T^{\prime}}_{\rm suf}\) is the temperature of an air parcel that rises from the surface along the dry adiabat to the lifting condensation level and then along the moist adiabat to 500 hPa.
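Equations (1) to (5) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the value used for \(g\), the computation of \(\left|V\right|\) from \(u\) and \(v\), and the sample inputs are hypothetical and are not taken from the dataset. The lifting index follows the sign convention described above, with \({T}_{500}\) taken as the environmental temperature and \({T^{\prime}}_{\rm suf}\) as the temperature of the lifted parcel, so that a negative value indicates instability.

```python
import numpy as np

G = 9.8  # gravitational acceleration in m/s^2 (assumed value)

def specific_humidity(e, p):
    """Eq. (1): e and p must be given in the same pressure unit."""
    return 0.622 * e / (p - 0.378 * e)

def relative_humidity(e, e_w):
    """Eq. (2): ratio of vapor pressure to saturated vapor pressure."""
    return e / e_w

def horizontal_flux(u, v, q):
    """Eq. (3): |V| is taken here as sqrt(u^2 + v^2) (an assumption)."""
    return np.hypot(u, v) * q / G

def vertical_flux(w, q):
    """Eq. (4): vertical water vapor flux."""
    return -w * q / G

def lifting_index(t_500, t_parcel):
    """Eq. (5): environment temperature minus lifted-parcel temperature."""
    return t_500 - t_parcel

# Hypothetical sample values, chosen only to exercise the functions.
q = specific_humidity(e=10.0, p=1000.0)
rh = relative_humidity(e=10.0, e_w=20.0)
f_h = horizontal_flux(u=3.0, v=4.0, q=q)
f_z = vertical_flux(w=0.2, q=q)
li = lifting_index(t_500=260.0, t_parcel=263.0)
print(q, rh, f_h, f_z, li)
```

In this example the parcel is warmer than its environment at 500 hPa, so the lifting index is negative, which corresponds to the unstable case.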

3.1.3 Data processing

Python is used to process and analyze the retrieved dataset. Based on the variable parameters in the dataset, the five variables are calculated as input data for the model, with the daily average of monthly rainfall as the output data. A total of 400 sets of sample data are obtained after data processing. The train_test_split function is called to divide the sample data into 90% training samples and 10% testing samples before the data is standardized. The standardization formula is as follows:

$${x}^{*}=\frac{x-\mu }{\sigma },$$
(6)

where \(\mu\) and \(\sigma\) are the mean and standard deviation of the variable \(x\).

The training sample is used to build the model and the testing sample is used to evaluate the model. To effectively compare the predictive performance of each model and ensure that the prediction results are not affected by the randomness of the data set division, a random seed is selected for each region, and the same sample sequence is used each time the model is constructed.
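The splitting and standardization steps can be sketched as follows. The synthetic arrays, the seed value, and the choice to estimate \(\mu\) and \(\sigma\) from the training samples only are assumptions for illustration; the study states only that the data are split 90%/10% with a fixed random seed and then standardized.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 400 processed samples with 5 features (assumption).
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(400, 5))
y = rng.gamma(shape=2.0, scale=1.5, size=400)

SEED = 2021  # hypothetical per-region seed

# 90% training samples, 10% testing samples, reproducible via the seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=SEED
)

# Eq. (6): x* = (x - mu) / sigma, with mu and sigma taken from the training set.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Repeating the split with the same seed returns the same sample sequence.
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.1, random_state=SEED)
print(X_train.shape, X_test.shape, np.array_equal(X_train, X_train_again))
```

Reusing the seed is what keeps the comparison between models independent of the randomness of the split.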

3.2 Model construction

The model construction in this study is based on the Python learning library. Four models, namely linear regression, random forest, SVR, and Bayesian ridge regression, are selected for regression prediction [31]. To compare the predictive performance and stability of each model, the adjusted R squared, mean squared error (MSE), and RPD (relative percent deviation) are chosen as evaluation indicators, and their calculation formulas are as follows:

$$\text{Adjusted } {R}^{2}=1-\frac{\left(1-{R}^{2}\right)\left(n-1\right)}{\left(n-k-1\right)},$$
(7)
$$MSE=\frac{\sum_{i=1}^{n}{(\widehat{{y}_{i}}-{y}_{i})}^{2}}{n},$$
(8)
$$RMSE=\sqrt{\frac{\sum_{i=1}^{n}{(\widehat{{y}_{i}}-{y}_{i})}^{2}}{n}},$$
(9)
$$SD=\sqrt{\frac{\sum_{i=1}^{n}{({y}_{i}-\overline{y })}^{2}}{n}},$$
(10)
$$RPD=\frac{SD}{RMSE}.$$
(11)

The adjusted R-squared measures the goodness of fit of the model, where \(n\) is the number of samples and \(k\) is the number of independent variables; the larger the value, the better the model fits the data. The mean squared error (MSE) is the average of the squared differences between the predicted and actual values and is commonly used in statistical modeling and machine learning to evaluate the performance of regression models. RPD (relative percent deviation) is an indicator of the stability of the model [32], and the stability level corresponding to each RPD value is shown in Table 1.
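For illustration, Eqs. (7) to (11) can be written directly in NumPy. The function names and the small example arrays below are hypothetical; the standard deviation uses the divisor \(n\), as in Eq. (10).

```python
import numpy as np

def adjusted_r2(y_true, y_pred, k):
    """Eq. (7): R^2 penalized for the number of independent variables k."""
    n = len(y_true)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def mse(y_true, y_pred):
    """Eq. (8): mean of the squared prediction errors."""
    return np.mean((y_pred - y_true) ** 2)

def rmse(y_true, y_pred):
    """Eq. (9): square root of the MSE."""
    return np.sqrt(mse(y_true, y_pred))

def sd(y_true):
    """Eq. (10): standard deviation of the observations (divisor n)."""
    return np.sqrt(np.mean((y_true - np.mean(y_true)) ** 2))

def rpd(y_true, y_pred):
    """Eq. (11): ratio of SD to RMSE."""
    return sd(y_true) / rmse(y_true, y_pred)

# Hypothetical observations and predictions (errors alternate +0.5 / -0.5).
y_true = np.arange(1.0, 9.0)
y_pred = y_true + np.array([0.5, -0.5] * 4)

print(adjusted_r2(y_true, y_pred, k=2), mse(y_true, y_pred), rpd(y_true, y_pred))
```

With the divisor \(n\) in both Eq. (9) and Eq. (10), \(RPD^{2}=1/(1-{R}^{2})\), so under these definitions RPD carries the same information as the unadjusted \({R}^{2}\).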

Table 1 RPD Degree of Stability

3.2.1 Linear regression model

The linear regression model can be written in the following matrix form [33].

$$ Q(X) = \beta_{1} x_{1} + \beta_{2} x_{2} + \cdots + \beta_{5} x_{5} + \beta_{0} , $$
(12)
$$ Q(X) = \sum_{j = 0}^{5} \beta_{j} x_{j} = X^{\prime}\beta , \quad x_{0} = 1, $$
(13)
$$ X^{\prime} = \left[ x_{1}, x_{2}, \cdots , x_{5}, 1 \right], \quad \beta = \left( \beta_{1}, \beta_{2}, \cdots , \beta_{5}, \beta_{0} \right)^{\mathrm{T}} , $$
(14)

where \(Q(X)\) is the predicted value, \({x}_{j}\) are the variables, and \({\beta }_{j}\) are the weights corresponding to the features. Here \({x}_{1}\)= specific humidity \(\left(q\right), {x}_{2}\)= relative humidity(\(RH\)), \({x}_{3}=\) horizontal water vapor flux \((\left|{F}_{H}\right|)\), \({x}_{4}=\) vertical water vapor flux(\({F}_{z})\), and \({x}_{5}=\) lifting index(\(LI)\).

To evaluate the model's predictive performance, the following loss function is utilized,

$$ J\left( \beta \right) = \sum_{i = 1}^{n} \left( Q(X)_{i} - q_{i} \right)^{2} . $$
(15)

Here \(J(\beta )\) is the loss value, \(n\) is the number of training samples, \({Q(X)}_{i}\) is the predicted value of the \(i\)th training sample, and \({q}_{i}\) is the true value of that training sample.

The normal equation method [34] and the gradient descent method [35] are used to find the optimal solution of the loss function, \(\min J\left(\beta \right)\).
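The two optimization routes can be compared with a short NumPy sketch. It uses full-batch gradient descent on synthetic data rather than the stochastic variant used in this study, and the learning rate, iteration count, and coefficient values are assumptions chosen only so that the example converges.

```python
import numpy as np

# Synthetic training data (assumption): 360 samples, 5 standardized features.
rng = np.random.default_rng(0)
n, k = 360, 5
X = rng.normal(size=(n, k))
X_aug = np.hstack([X, np.ones((n, 1))])  # X' = [x1, ..., x5, 1]
true_beta = np.array([2.0, -1.0, 0.5, 0.0, 1.5, 3.0])  # last entry is beta_0
q = X_aug @ true_beta + rng.normal(scale=0.1, size=n)

def loss(beta):
    """Eq. (15): sum of squared residuals."""
    return np.sum((X_aug @ beta - q) ** 2)

# Normal equation: solve (X'^T X') beta = X'^T q in closed form.
beta_ne = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ q)

# Gradient descent: the gradient is scaled by 1/n, which changes the
# step size but not the minimizer of J(beta).
beta_gd = np.zeros(k + 1)
learning_rate = 0.1
for _ in range(1000):
    gradient = 2.0 * X_aug.T @ (X_aug @ beta_gd - q) / n
    beta_gd -= learning_rate * gradient

print(loss(beta_ne), loss(beta_gd))
```

On a well-conditioned problem such as this one both routes reach essentially the same coefficients; differences in practice would come from stopping criteria, step size, and any regularization applied by the iterative solver.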

Comparison of the prediction results indicates that the gradient descent method has a better optimization effect. The final model's adjusted R-squared value on the test set is 0.836, and the mean squared error is 7.971. The Intercept and Coefficients are shown in Table 2.

Table 2 Intercept and Coefficients

We also tested the significance of the regression model and of the regression coefficients for the five independent variables. In the linear regression model for the Zhengzhou region, an F-test was conducted on the regression equation. The results, evaluated at a significance level of 0.05, indicate a statistically significant linear relationship between \(Q(X)\) and the independent variables \({x}_{1}, {x}_{2}, {x}_{3}, {x}_{4}\), and \({x}_{5}\) as a whole, confirming the significance of the regression equation. Furthermore, a t-test was performed on the regression coefficients, revealing that, at the significance level of 0.05, the independent variables \({x}_{1}, {x}_{2}, {x}_{3}\), and \({x}_{4}\) passed the significance test, while \({x}_{5}\) did not, suggesting that \({x}_{5}\) may not have a significant impact on \(Q(X)\) for the Zhengzhou region. However, in the linear regression models for the remaining eight regions, \({x}_{5}\) was statistically significant in some cases. To ensure consistency across the regions so that the results can be directly compared, \({x}_{5}\) was kept in the linear regression model for all nine regions.
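The overall F-test and the coefficient t-tests described above can be sketched with NumPy and SciPy as follows. The data are synthetic, with the fifth coefficient set to zero by construction, and do not reproduce the Zhengzhou results; all variable names are hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic data (assumption): x5 has no true effect on the response.
rng = np.random.default_rng(1)
n, k = 360, 5
X = rng.normal(size=(n, k))
beta_true = np.array([2.0, 1.5, -1.0, 0.8, 0.0])
y = 3.0 + X @ beta_true + rng.normal(size=n)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ beta_hat
dof = n - k - 1
ss_res = residuals @ residuals
ss_tot = np.sum((y - y.mean()) ** 2)
sigma2 = ss_res / dof  # unbiased estimate of the error variance

# F-test of the regression equation as a whole.
f_stat = ((ss_tot - ss_res) / k) / sigma2
f_pvalue = stats.f.sf(f_stat, k, dof)

# Two-sided t-test of each coefficient (index 0 is the intercept).
cov = sigma2 * np.linalg.inv(A.T @ A)
t_stats = beta_hat / np.sqrt(np.diag(cov))
t_pvalues = 2.0 * stats.t.sf(np.abs(t_stats), dof)
significant = t_pvalues[1:] < 0.05  # one flag per x1..x5

print(f_pvalue, significant)
```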

The comparison of predicted values and true values based on the test set is described in Fig. 3. The vertical axis is rainfall, and the horizontal axis is serial number.

Fig. 3 Forecast comparison results of the LR model

The linear regression model is implemented using the LinearRegression and SGDRegressor classes in Python.

3.2.2 Random forest model

The RandomForestRegressor class is imported and the initial model is trained on the preprocessed data set, giving a goodness of fit of 0.746 and a mean squared error of 12.375 on the test set. To improve the predictive performance of the model and reduce the generalization error, we use grid search to tune the model [36].

Grid search is an exhaustive search method that involves specifying parameter values. It optimizes the estimation function's parameters through cross-validation to obtain the optimal learning algorithm [37]. It involves creating a "grid" by listing all possible combinations of parameter values. After each training iteration, the best estimators are used to obtain the optimal combination [38]. This process is repeated, gradually narrowing down the parameter range, to obtain the best parameter combination.

Compared to traditional machine learning algorithms, random forest models have more complex hyperparameters, which can be roughly divided into three categories. The first category is the number of weak learners. When the number of weak learners increases, the complexity of the model also increases, and the generalization ability of the model initially increases but then decreases or remains unchanged. In other words, the learning ability of the model becomes stronger as n_estimators increases, but the risk of overfitting also increases. The second category is the structure of the weak learners. The weak learner in the random forest model is the decision tree, whose hyperparameters mainly comprise two parts: the branching criterion and the tree structure parameters. Generally, the more complex the structure of a single tree, the higher the overall complexity of the ensemble algorithm and the more prone the model is to overfitting. Therefore, it is necessary to adjust the parameters appropriately to prune the decision tree and find its optimal structure. In addition, a single decision tree has poor resistance to interference when abnormal data appear in the training samples, and the random forest algorithm enhances this resistance to some extent. The third category is the data used to train the weak learners. By controlling the random selection of data samples and feature variables, the risk of overfitting is reduced, thereby improving the generalization ability of the model.

The adjustable parameters and parameter importance of random forest are presented in Table 3.

Table 3 Parameters of Random Forest

Three rounds of parameter optimization are conducted for this model. For the 7 adjustable parameters, three values are selected for each parameter in each round of model training, resulting in 2187 possible combinations for evaluation.
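One round of such a search can be sketched with scikit-learn's GridSearchCV. The seven parameter names and candidate values below are assumptions standing in for the tuned parameters, and the search itself is run on a reduced grid and synthetic data so that the example finishes quickly; only the count of \(3^{7}=2187\) combinations comes from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Hypothetical grid: 7 parameters with 3 candidate values each.
full_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [4, 8, 12],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [1, 3, 5],
    "max_leaf_nodes": [20, 50, 100],
    "min_impurity_decrease": [0.0, 0.01, 0.1],
}
n_combinations = len(ParameterGrid(full_grid))  # 3 ** 7 combinations

# Synthetic stand-in data (assumption): 120 samples, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=120)

# Reduced grid so the exhaustive, cross-validated search runs in seconds.
small_grid = {"n_estimators": [20, 40], "max_depth": [3, 6], "max_features": [1, 3]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    small_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
best_params = search.best_params_

print(n_combinations, best_params)
```

In a multi-round scheme, the candidate values of the next round would be re-centered around `best_params`, which narrows the range while keeping each round at the same grid size.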

After each training, the optimal hyperparameter combination for the random forest model based on the Zhengzhou region is presented in Table 4.

Table 4 Optimal Hyperparameter Combination

The model's goodness of fit on the test set is 0.827, with a mean squared error of 8.389. Compared to the initial model, there is an improvement in the goodness of fit by 0.081 and a decrease in mean squared error by 3.986, indicating an improved fitting performance.

The predicted values and actual values are compared in Fig. 4.

Fig. 4 Forecast comparison results of the random forest model

3.2.3 Support vector regression model

The LinearSVR package is imported, the model is trained on the preprocessed data set, and all kernel functions, namely 'linear', 'poly', 'rbf', 'sigmoid', and 'precomputed', are tested in turn. Comparison of the predicted results shows that the LinearSVR model has the best prediction performance, with a goodness of fit of 0.830 and a mean squared error of 8.246 on the test set. The predicted values and actual values are displayed in Fig. 5. The vertical axis is rainfall, and the horizontal axis is the sample number [39].

Fig. 5 Forecast comparison results of the SVR model
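The kernel comparison in this subsection can be sketched as follows, assuming scikit-learn's SVR and LinearSVR estimators. The data are synthetic, default hyperparameters are used, and the 'precomputed' option is left out because it expects a kernel matrix rather than a feature matrix; which model comes out best here says nothing about the results of this study.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR, LinearSVR

# Synthetic standardized data (assumption): 400 samples, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = X @ np.array([1.0, 0.5, -0.5, 0.2, 0.0]) + rng.normal(scale=0.2, size=400)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# One kernel SVR per kernel name, plus the dedicated linear implementation.
models = {f"SVR({kernel})": SVR(kernel=kernel)
          for kernel in ("linear", "poly", "rbf", "sigmoid")}
models["LinearSVR"] = LinearSVR(max_iter=10000, random_state=0)

# Fit every candidate and record its test-set MSE.
test_mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_mse[name] = mean_squared_error(y_test, model.predict(X_test))

best_model = min(test_mse, key=test_mse.get)
print(best_model, test_mse)
```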

3.2.4 Bayesian ridge regression model

The linear_model package is imported, and linear_model.BayesianRidge is called to train the model on the preprocessed data set, with the regularization strength parameter tuned [40]. The final model has a goodness of fit of 0.841 and a mean squared error of 7.703 on the test set. The predicted values and actual values are compared in Fig. 6.

Fig. 6 Forecast comparison results of the Bayesian ridge regression model
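A possible form of the Bayesian ridge tuning step is sketched below. Which hyperparameters were actually tuned is not stated, so the choice of `lambda_1` and `lambda_2` (the Gamma prior parameters of the weight precision in scikit-learn's BayesianRidge), the candidate values, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import GridSearchCV

# Synthetic standardized data (assumption): 360 samples, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(360, 5))
y = X @ np.array([1.2, -0.7, 0.4, 0.0, 0.9]) + rng.normal(scale=0.3, size=360)

# Hypothetical grid over the prior on the weight precision.
param_grid = {"lambda_1": [1e-6, 1e-3, 1.0], "lambda_2": [1e-6, 1e-3, 1.0]}
search = GridSearchCV(
    BayesianRidge(), param_grid, cv=5, scoring="neg_mean_squared_error"
)
search.fit(X, y)
model = search.best_estimator_

# Besides the point prediction, the model returns a predictive
# standard deviation for each sample.
mean_pred, std_pred = model.predict(X[:5], return_std=True)
print(search.best_params_, mean_pred, std_pred)
```

The predictive standard deviation is one practical difference from ordinary ridge regression, since it gives a per-sample measure of uncertainty alongside each forecast.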

Following the above steps, linear regression (LR), random forest regression (RFR), support vector regression (SVR), and Bayesian ridge regression (BRR) are used to predict the rainfall in the eight other regions, including Beijing. The comparisons of predicted and actual results are presented in Figs. 7, 8, 9, 10, 11, 12, 13, and 14.

Fig. 7 Forecast results comparison of the LR, RF, SVR, and BRR models for Beijing in (a), (b), (c), and (d), respectively

Fig. 8 Forecast results comparison of the LR, RF, SVR, and BRR models for Hohhot in (a), (b), (c), and (d), respectively

Fig. 9 Forecast results comparison of the LR, RF, SVR, and BRR models for Kunming in (a), (b), (c), and (d), respectively

Fig. 10 Forecast results comparison of the LR, RF, SVR, and BRR models for Lhasa in (a), (b), (c), and (d), respectively

Fig. 11 Forecast results comparison of the LR, RF, SVR, and BRR models for Lanzhou in (a), (b), (c), and (d), respectively

Fig. 12 Forecast results comparison of the LR, RF, SVR, and BRR models for Chengdu in (a), (b), (c), and (d), respectively

Fig. 13 Forecast results comparison of the LR, RF, SVR, and BRR models for Jinan in (a), (b), (c), and (d), respectively

Fig. 14 Forecast results comparison of the LR, RF, SVR, and BRR models for Xi'an in (a), (b), (c), and (d), respectively

3.3 Comparison of Predicted Results

Four machine learning models are used to predict rainfall in nine regions of China, and the results are summarized in Tables 5, 6, and 7. Table 5 shows the adjusted R-squared values, Table 6 the mean squared error (MSE), and Table 7 the relative percent deviation (RPD). The optimal model for each region, based on both fitting performance and stability, is shown in Table 8.

Table 5 Comparison of fitting performance of LR, RF, SVR, BRR models
Table 6 Comparison of MSE of LR, RF, SVR, BRR models
Table 7 Comparison of RPD of LR, RF, SVR, BRR models
Table 8 Optimal rainfall prediction model for each region

From Tables 5 and 6, it can be seen that, based on the predicted rainfall in the nine regions, the random forest model has the best prediction performance, with an average adjusted R-squared of 0.801 over the nine regions. The SVR model is second, with an average adjusted R-squared of 0.785, followed by the Bayesian ridge regression model and the linear regression model, with average adjusted R-squared values of 0.783 and 0.781, respectively. In addition, the mean MSE values of the four regression models (in the order LR, RF, SVR, and BRR) are 11.319, 9.893, 11.191, and 11.245, respectively, which again shows that the random forest model outperforms the other models.

As described in Table 7, all the constructed models have good stability: except for Xi'an, the RPD values of the other eight regions reach 2.0, which corresponds to a high stability level.

Out of the nine regions analyzed, the random forest model exhibits superior performance in five regions. However, in Xi'an, the random forest model deviates significantly from the actual values, indicating that it is not suitable for predicting rainfall in that region [41]. The SVR model, on the other hand, demonstrates higher accuracy in predicting rainfall for that region. In Zhengzhou and Jinan, the Bayesian ridge regression model provides predictions closer to the actual values [42]. Similarly, in Lanzhou, the SVR model achieves better accuracy. The differences in adjusted R-squared values across the four models are relatively small for these three regions, so the random forest model still shows a degree of universality. Based on the stability index (RPD), the random forest model is the most stable model for six regions, indicating its ability to produce stable predictions [43].

3.4 Prediction model analysis

Based on the prediction results, the random forest model demonstrates higher accuracy compared to other models in rainfall prediction. That is probably due to the following characteristics of the model:

  1. The random forest model utilizes ensemble learning, employing bagging methods during the sampling process to extract training subsets. This ensures independence among the individual decision trees and effectively avoids overfitting.

  2. When each decision tree is constructed in the random forest model, not all attributes are involved in the node splitting process of every tree. Instead, a random selection of a few attributes is used as feature indicators for attribute evaluation. The introduction of randomness guarantees the diversity among the sub-models. The greater the differences between the sub-models, the better the effect of model fusion, effectively enhancing the model's tolerance to noise and outliers. Therefore, compared to other algorithms, the random forest algorithm demonstrates higher predictive accuracy and stronger model generalization ability in forecasting future rainfall. Moreover, the robust predictive performance of the random forest model also holds value for exploring other regression problems in real-world applications.

4 Conclusions

Through the analysis of rainfall data from nine typical regions in China, four regression models are constructed to predict rainfall, with the grid search technique used to optimize their performance. Based on the experimental results, the following conclusions can be drawn.

  1. The four machine learning regression models yield good results overall. Among them, the random forest model stands out with its high accuracy in predicting rainfall, achieving a goodness of fit of 0.8 or higher for seven regions and demonstrating good model stability with a small mean square error. However, for some cities in northern China with significant fluctuations in rainfall, such as Lanzhou and Xi'an, the support vector regression model exhibits better fitting and demonstrates unique advantages in solving small sample and nonlinear problems.

  2. The prediction accuracy of the random forest model is heavily influenced by its parameters, and using the grid search method to fine-tune these parameters significantly improves the model's performance. However, this approach requires considerable training time to explore all points on the grid. Further refinement of the grid search method is needed to optimize the model and improve its training speed and prediction efficiency.

This study combines the causes of the July 20, 2021 Zhengzhou flood with the necessary conditions for heavy rainfall to select five variables, achieving accurate rainfall prediction with few variables and significantly reducing the difficulty of data collection. Furthermore, the study employs the grid search method to optimize the parameters of the initial four regression models, which substantially enhances their prediction accuracy and generalization ability. By comparing the rainfall prediction outcomes of nine typical regions in China, the study evaluates the fitting effect and stability of the four machine learning regression models. The results indicate that the random forest model is more appropriate for predicting rainfall than the other models. This research can serve as a reference for machine learning researchers developing methods that predict rainfall with higher accuracy.

In recent years, the China Meteorological Administration has adopted artificial intelligence for rainfall forecasting. As a core field of artificial intelligence, machine learning has demonstrated its importance in predicting extreme weather and in other fields. However, some limitations have been identified in these methods, such as issues with data quality, feature selection, and model selection. Improvements in the monitoring of meteorological data and refinements of the models can lead to more accurate rainfall predictions, enabling more rational management and integrated utilization of water resources and providing scientific guidance for the prevention of flood disasters.