Introduction

Disinfection is a conventional step for ensuring water safety during water treatment. Chlorine disinfection is widely used in drinking water systems because of its low cost and simple operation. However, chlorine reacts with certain organic compounds in water to produce disinfection by-products (DBPs) (Chen et al., 2023; Yang et al., 2021). Studies have shown that DBPs exhibit cytotoxicity and reproductive toxicity and can lead to abnormal embryonic development (Egwari et al., 2020; Kar & Senthilkumaran, 2020; Srivastav & Kaur, 2020; Zhang et al., 2023). Hundreds of DBPs have been identified to date, among which trihalomethanes (THMs) and haloacetic acids (HAAs) are of greatest concern and are the most commonly monitored (Dubey et al., 2020; Ozgur & Kaplan-Bekaroglu, 2022). Policies and water quality standards worldwide strictly limit the concentration of DBPs. For example, the US Environmental Protection Agency has set a maximum level of 80 μg/L for total trihalomethanes (T-THMs), defined as the sum of trichloromethane (CHCl3), bromodichloromethane (BDCM), dibromochloromethane (DBCM), and tribromomethane (TBM). China also specifies that the sum of the ratios of the measured concentration of each THM to its respective limit should not exceed 1 (CHCl3: 60 μg/L, DBCM: 100 μg/L, BDCM: 60 μg/L, TBM: 100 μg/L).
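The ratio-sum criterion can be illustrated with a short Python sketch. The limits are those quoted above; the function name `thm_ratio_sum` and the sample concentrations are hypothetical and serve only to show the arithmetic.

```python
# Limits quoted in the text, in ug/L.
LIMITS_UG_PER_L = {"CHCl3": 60.0, "DBCM": 100.0, "BDCM": 60.0, "TBM": 100.0}


def thm_ratio_sum(conc):
    """Sum of measured concentration / limit over the four THMs."""
    return sum(conc[name] / limit for name, limit in LIMITS_UG_PER_L.items())


# Hypothetical sample (ug/L), not taken from the study's dataset.
sample = {"CHCl3": 15.0, "DBCM": 10.0, "BDCM": 12.0, "TBM": 5.0}
ratio = thm_ratio_sum(sample)   # 0.25 + 0.10 + 0.20 + 0.05
compliant = ratio <= 1.0        # criterion: the sum must not exceed 1
```

For this made-up sample the ratios sum to about 0.60, so the sample would satisfy the criterion even though no single compound is close to its own limit.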

Although THMs testing is critical, the process is usually cumbersome and time-consuming. It requires not only expensive instruments and reagents but also complex sample pretreatment (Liu et al., 2022; Mohammadi et al., 2020; Pérez-Lucas et al., 2022; Shi et al., 2023).

It is well known that the generation of THMs is closely related to disinfection conditions, including pH, disinfection time, and chlorine dosage, and to water quality parameters such as UV254, dissolved organic carbon (DOC), temperature, and Br (Hong et al., 2020; Liang & Singer, 2003). Therefore, many researchers have used relatively easy-to-obtain water quality parameters and disinfection conditions to establish predictive models of DBPs (Albanakis et al., 2021; Peng et al., 2020; Uyak et al., 2005) so that DBP concentrations can be monitored more easily. Multiple linear and nonlinear regression models are frequently used to predict THM concentrations. Albanakis et al. (2021) established a predictive model for trihalomethanes from influencing factors and found that the stepwise regression model was superior to the least squares regression model and the multiple linear regression model. Hong et al. (2016) established a multiple regression model to predict the generation of THMs and haloacetonitriles (HANs) during the chlorination of low-SUVA source water, but the predictive performance of such models was unsatisfactory. Lin et al. (2018) built multiple regression models for THMs, HANs, and HAAs in source water and found that some models had high predictive accuracy while others had low predictive accuracy. Uyak et al. (2005) established a multiple regression model to predict the THM concentration in the effluent; the result was accurate, but the formula was relatively complex.

Recently, machine learning (ML) methods such as artificial neural networks (ANN), support vector machines (SVM), and random forests (RF) have become a research hotspot in many fields. However, there is little research on predicting THMs using ML methods. The research on DBPs prediction by different ML methods is summarized in Table 1. A few studies have used ANN to predict THMs, including backpropagation neural networks (BPNN) and radial basis function neural networks (RBF ANN) (Hong et al., 2020; Liu et al., 2023). Compared to BPNN, RBF ANN performs better in predicting THM concentration. Hong et al. (2020) predicted THMs by combining RBF ANN with gray relational analysis (GRA). Compared with linear regression models, GRA can establish a well-performing RBF ANN model with fewer factors. Liu et al. (2023) compared the prediction effects of BPNN, genetic algorithm backpropagation neural network (GA-BPNN), and generalized regression neural network (GRNN) models on THMs. The study showed that BPNN had average prediction performance, GA-BPNN had good performance but a long prediction time, and GRNN had the best prediction performance. SVM is a supervised ML algorithm, and support vector regression (SVR) is a regression algorithm based on SVM. SVR can learn relationships in complex data and solve nonlinear problems and is therefore increasingly applied in water quality monitoring (Zheng et al., 2013). Many studies have found that the SVR model performs better than traditional multiple linear regression models. Meanwhile, RF is an ensemble algorithm (Ma et al., 2023) that can compute in parallel, greatly shorten the running time, and effectively handle noisy data and outliers. Based on the RF algorithm, Peng et al. (2023) compared the effects of support vector machine, random forest, and stepwise multiple linear regression on the prediction of emerging DBPs. The results showed that ML methods were more suitable for managing the generation of DBPs than stepwise multiple linear regression. Hu et al. (2023) studied the predictive performance of eleven machine learning models for emerging DBPs, and the results showed that RF was the best model among the regression tree categories. However, RF has rarely been used to predict DBPs, even though, unlike SVM and ANN, it can determine the importance of input variables. Therefore, RF was further applied to the prediction of DBPs in this study. Moreover, the building and parameter optimization processes of most ML models are poorly documented, which makes it difficult to apply the models under other conditions. These processes are explained in detail in this study, which is conducive to the further popularization and application of the model. Additionally, although some studies (Okoji et al., 2022; Singh et al., 2012) have achieved good predictive performance, their datasets were small and could not fully represent the overall sample distribution, which limits the reliability and generalization ability of the results. Therefore, this study expanded the sample size of the dataset, making the prediction results more convincing.

Table 1 Summary of the application of ML methods in DBPs

To further investigate the performance of ML on THMs, this study utilized water quality parameters collected from a water plant in Taihu Lake to establish Log-linear regression models, SVR models, and RFR models for predicting the concentration of T-THMs, DBCM, and BDCM. The models were then compared with actual measurements using evaluation indicators to assess their predictive effectiveness.

Materials and methods

Datasets and grouping

The data used in this study, comprising 175 samples collected from Water Plant B between 2016 and 2021, were all taken from Liu et al. (2023). Water quality data were collected from the inlet pipe of Water Plant B. THMs levels were sampled from the pipeline connected to the clean water reservoir of Plant B. The concentration ranges of THMs and corresponding water quality parameters are shown in Table 2. The plant draws its source water from Taihu Lake and treats it by traditional processes including coagulation, sedimentation, filtration, activated carbon treatment with ozone, and chlorination. The data set includes T-THMs, DBCM, and BDCM concentrations and nine water quality parameters (temperature, pH, UV254, TOC, COD, NH4+–N, NO3–N, NO2–N, Br). The USEPA 551.1 method was used to measure THMs, and the water quality parameters were determined by the standard method (APHA, 1998).

Table 2 Concentration range of THMs and corresponding water quality index of waterworks B

THMs generation is related to multiple water quality parameters. Temperature reflects the seasonal changes in water temperature, and pH affects the removal of organic matter, thereby affecting the generation of THMs. Organic matter is an important precursor to the formation of THMs, and its concentration is often characterized by COD, TOC, and UV254. NH4+–N, NO3–N, and NO2–N reduce the concentration of THMs by consuming free chlorine in water, while the presence of Br increases the concentration of brominated THMs. Therefore, nine water quality parameters were selected as input variables to predict THMs concentrations.

The data set was divided into a training set and a test set: approximately 80% (141 samples) were used as the training set and 20% (34 samples) as the test set. The samples containing the maximum and minimum values of T-THMs, DBCM, and BDCM were placed in the training set to ensure the predictive ability of the model, while the remaining data were randomly assigned to the training and test sets.
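The grouping rule can be sketched in Python as follows. This is a minimal illustration rather than the procedure actually used: the function name `split_with_extremes`, the random seeds, and the synthetic concentrations are assumptions introduced here.

```python
import random


def split_with_extremes(targets, n_test, seed=0):
    """Return (train_idx, test_idx).

    Rows holding the minimum or maximum of any target column are forced
    into the training set; the remaining rows are shuffled and split.
    """
    forced = set()
    for col in range(len(targets[0])):
        column = [row[col] for row in targets]
        forced.add(column.index(min(column)))
        forced.add(column.index(max(column)))
    rest = [i for i in range(len(targets)) if i not in forced]
    random.Random(seed).shuffle(rest)
    test_idx = sorted(rest[:n_test])
    train_idx = sorted(forced | set(rest[n_test:]))
    return train_idx, test_idx


# Synthetic stand-in for 175 samples x 3 targets (T-THMs, DBCM, BDCM).
rng = random.Random(1)
targets = [[rng.uniform(5, 60), rng.uniform(1, 20), rng.uniform(1, 20)]
           for _ in range(175)]
train_idx, test_idx = split_with_extremes(targets, n_test=34)
```

With 175 rows and `n_test=34`, the sketch reproduces the 141/34 split sizes reported above. Keeping the extremes in the training set means the test samples never fall outside the concentration range the model was fitted on.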

Establishment of Log-linear regression model

The multiple nonlinear regression model for THMs was established using SPSS 26.0. The logarithms of the concentrations (\(\log_{10} Y\)) of T-THMs, DBCM, and BDCM from the 141 training samples were set as the dependent variables, and the logarithms (\(\log_{10} X_{i}\)) of the nine water quality parameters as the independent variables. The equation was established using a stepwise regression method. For example, the equation for T-THMs was \(\log_{10}\)T-THMs = \(k_{0}\) + \(k_{1} \log_{10} X_{1}\) + \(k_{2} \log_{10} X_{2}\) + ... + \(k_{i} \log_{10} X_{i}\) (\(i\) = 1, 2, ..., 9). Finally, the equation was transformed into T-THMs = \(10^{k_{0}} X_{1}^{k_{1}} ... X_{i}^{k_{i}}\), where \(k_{0}\), \(k_{1}\), ..., \(k_{i}\) are constants.
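A minimal Python sketch of the log-linear form is given below. It uses ordinary least squares on all supplied predictors, not the stepwise selection performed in SPSS, and the helper names (`solve`, `fit_log_linear`, `predict_power_law`) and the synthetic data are assumptions made for illustration only.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit_log_linear(X, y):
    """Least-squares fit of log10(y) = k0 + sum_i k_i * log10(X_i)."""
    rows = [[1.0] + [math.log10(v) for v in x] for x in X]
    target = [math.log10(v) for v in y]
    p = len(rows[0])
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Atb = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(p)]
    return solve(AtA, Atb)


def predict_power_law(k, x):
    """Back-transformed form: y = 10**k0 * prod(x_i ** k_i)."""
    y = 10.0 ** k[0]
    for ki, xi in zip(k[1:], x):
        y *= xi ** ki
    return y


# Synthetic data generated from y = 10**0.5 * x1**0.8 * x2**-0.3 (made-up values).
X = [(10.0, 1.0), (20.0, 2.0), (15.0, 0.5), (25.0, 1.5), (30.0, 3.0)]
y = [10 ** 0.5 * a ** 0.8 * b ** -0.3 for a, b in X]
k = fit_log_linear(X, y)
```

Because the synthetic data are noise-free, the fit recovers the generating exponents, and `predict_power_law` shows that the logarithmic equation and the power-law form describe the same model.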

Establishment of SVR model

SVM is a binary classification algorithm based on statistical learning theory that can effectively find the globally optimal solution (Vapnik, 1995; Zheng et al., 2023). As shown in Fig. 1, the basic principle is to map the sample data through a non-linear kernel function to a high-dimensional space, find a few effective support vectors to represent the entire dataset, and make the data linearly separable in the high-dimensional space. By learning from a limited amount of data, the optimal solution can be obtained (Q. Xu et al., 2015).

Fig. 1 Support vector machine (SVM) schematic diagram

SVR differs from SVM because SVM primarily finds an optimal hyperplane that separates two or more classes of samples, commonly used for data classification. On the other hand, SVR requires that the total deviation of all sample points from the optimal hyperplane is minimized (Peng & Chen, 2018). For a given dataset \((x_{i} ,y_{i} )\)(with \(x_{i}\) as input variables, \(y_{i}\) as corresponding output variables, \(i = 1,2, \ldots ,l\)), the optimal linear regression function is constructed in a high-dimensional space:

$$f(x) = \omega^{T} \varphi (x) + b$$
(1)

where \(\varphi (x)\), \(\omega\), and \(b\) represent the nonlinear mapping function, the weight vector, and the threshold value, respectively.

SVR introduces the \(\varepsilon\)-insensitive loss function on the basis of SVM, where \(\varepsilon\) represents the boundary error of the decision function in high-dimensional space. The prediction accuracy of the SVR model mainly depends on the penalty factor \(c\), the kernel parameter \(g\), and \(\varepsilon\). \(c\) represents the penalty applied to sample errors exceeding \(\varepsilon\): a large \(c\) emphasizes errors but may result in overfitting, while a small \(c\) tolerates errors but may lead to underfitting. \(g\) adjusts the shape and range of the radial basis function and thus affects the decision boundary of the model. \(\varepsilon\) affects the accuracy of the regression model (Abu Awad et al., 2017): the smaller the value, the lower the error tolerance and the closer the fit to the training data, whereas a larger value tolerates larger errors. An optimal training model with specific values of \(c\) and \(g\) was obtained through grid search and manual tuning. The search range for \(c\) was set to [0.1, 2] with a step size of 0.1, while the search range for \(g\) was [0.1, 8] with a step size of 0.1. The R2 and RMSE of the training set and test set of the models under each parameter combination were calculated and exported to Excel tables. For further comparison, the five parameter combinations with R2 closest to 1 and the smallest RMSE were selected, and the optimal parameter combination of the model was then found by manual tuning.
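The roles of \(\varepsilon\) and \(g\) and the size of the search grid can be illustrated with a short Python sketch. It assumes the common RBF form \(\exp(-g\lVert x - z\rVert^{2})\), which the text does not state explicitly, and the helper names are hypothetical; no SVR model is trained here.

```python
import math
from itertools import product


def rbf_kernel(x, z, g):
    """RBF kernel exp(-g * ||x - z||^2); this form of g is an assumption."""
    return math.exp(-g * sum((a - b) ** 2 for a, b in zip(x, z)))


def eps_insensitive_loss(y, f, eps):
    """Errors inside the eps-tube cost nothing; beyond it the cost is linear."""
    return max(0.0, abs(y - f) - eps)


def frange(start, stop, step):
    """Inclusive float range that avoids accumulating rounding error."""
    n = round((stop - start) / step)
    return [round(start + i * step, 10) for i in range(n + 1)]


c_grid = frange(0.1, 2.0, 0.1)         # 20 candidate values of c
g_grid = frange(0.1, 8.0, 0.1)         # 80 candidate values of g
grid = list(product(c_grid, g_grid))   # every (c, g) pair to be evaluated

# The same residual of 1.0 is penalized less as eps grows.
loss_small_eps = eps_insensitive_loss(10.0, 11.0, eps=0.1)
loss_large_eps = eps_insensitive_loss(10.0, 11.0, eps=0.5)
```

Under these ranges the grid contains 1600 combinations per target, which is why the candidates are first narrowed down by R2 and RMSE before any manual tuning.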

The SVR model was constructed using MATLAB R2018a. The modeling process comprises seven steps, the main ones of which are shown in Fig. 2. The “L2 regularization” in the third step is a mathematical technique employed to mitigate model overfitting and enhance generalization capability. It is frequently employed to fine-tune the trade-off between the error term and regularization coefficient in SVR.

Fig. 2 SVR model construction method

Establishment of RFR model

The RF algorithm is an ensemble learning method that builds multiple decision trees related to random vectors to classify or regress samples. It is capable of handling outliers and noise effectively. As shown in Fig. 3, the basic principle of the RF algorithm is to generate multiple decision trees by random sampling from the training set. When making regression predictions, every decision tree predicts each sample, and the average is taken as the predicted result for that sample (Karabadji et al., 2023).

Fig. 3 Random forest regression (RFR) schematic diagram

The steps of the RF algorithm are as follows: (1) Assume a training set \(T = \{ (x_{1} ,y_{1} ),...,(x_{n} ,y_{n} )\}\), then generate a random sequence \(\delta_{i} (i = 1,...,k)\), and re-sample \(k\) new sample sets from the training set \(T\) through bootstrapping. (2) Generate a decision tree \([g(X,\delta_{i} ),i = 1,...,k]\) for each new sample set, where matrix \(X\) is the independent variable of the model. (3) Generate all decision trees completely and obtain the random forest model sequence \([g_{1} (X),g_{2} (X),...,g_{k} (X)]\). (4) Input the new sample into each decision tree, obtain the predicted result of each decision tree, and take the average value to obtain the predicted value of the random forest:

$$G(x) = 1/k\sum\limits_{i = 1}^{k} {g_{i} (x)}$$
(2)

where \(G(x)\), \(g_{i} (x)\), \(k\) are the final prediction result, the prediction results for a single decision tree, and the number of decision trees, respectively.
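Steps (1) to (4) and Eq. (2) can be sketched in Python as follows. This is a deliberately simplified illustration: each "tree" is a one-split regression stump on a single input, the random feature selection of a full random forest is omitted, and the function names and toy data are assumptions, not the TreeBagger implementation.

```python
import random


def fit_stump(xs, ys):
    """One-split regression tree: pick the threshold minimizing squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:  # bootstrap sample contained a single distinct x value
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr


def fit_bagged(xs, ys, k=25, seed=0):
    """Steps (1)-(3): draw k bootstrap samples and grow one tree on each."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def forest_predict(trees, x):
    """Step (4), Eq. (2): G(x) is the mean of the k tree predictions g_i(x)."""
    return sum(g(x) for g in trees) / len(trees)


# Toy data: the response jumps from 10 to 30 once x exceeds 5.
xs = [float(v) for v in range(1, 11)]
ys = [10.0 if v <= 5 else 30.0 for v in xs]
trees = fit_bagged(xs, ys)
low, high = forest_predict(trees, 2.0), forest_predict(trees, 9.0)
```

Averaging over bootstrap-trained trees is what dampens the influence of individual noisy samples: a tree that happens to see an unrepresentative resample contributes only one of the \(k\) terms in Eq. (2).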

The RFR model was established using MATLAB R2018a. The same nine water quality parameters used in the SVR model were selected as input variables for the model, while three types of THMs were used as output variables. Based on the RF algorithm principle, the TreeBagger function was used to train the RFR model, with parameters such as the number of decision trees \(M\) and the minimum leaf number \(N\) set explicitly. A grid search method was used to determine values for \(M\) and \(N\); the range for \(M\) was [10, 60] with a step size of 1, and the range for \(N\) was [1, 10] with a step size of 0.1. The subsequent steps were the same as those for the SVR model.

Model evaluation

To evaluate the error between the predicted and measured values of the three models, the mean absolute error (MAE) and Pearson correlation coefficient (rp) were used as the model evaluation indicators. Following Xu et al. (2022), the absolute relative error (E) between the predicted and measured values was calculated, and the accuracy of the predictions was defined as N25, the percentage of samples with E < 25%. A low MAE and high N25 and rp values indicate high model accuracy.

$$\begin{aligned}E &= abs(\{ [THMs]_{predicted} - [THMs]_{measured} \} /[THMs]_{measured} )\\ &\quad \times 100\%\end{aligned}$$
(3)
$$MAE = 1/m\sum\limits_{i = 1}^{m} {(\left| {y_{i} - f(x_{i} )} \right|)}$$
(4)

where \(y_{i}\), \(f(x_{i} )\), and \(m\) represent the measured values, the predicted values, and the number of samples, respectively.
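The evaluation indicators can be computed with a few lines of Python. The sketch below follows Eqs. (3) and (4) and the standard Pearson formula; the function names and the four-sample example are hypothetical.

```python
import math


def abs_relative_error(pred, meas):
    """Eq. (3): absolute relative error of each sample, in percent."""
    return [abs((p - m) / m) * 100.0 for p, m in zip(pred, meas)]


def n25(pred, meas):
    """Percentage of samples whose absolute relative error is below 25%."""
    errors = abs_relative_error(pred, meas)
    return 100.0 * sum(e < 25.0 for e in errors) / len(errors)


def mae(pred, meas):
    """Eq. (4): mean absolute error."""
    return sum(abs(m - p) for p, m in zip(pred, meas)) / len(meas)


def pearson_r(pred, meas):
    """Pearson correlation coefficient between predictions and measurements."""
    n = len(meas)
    mp, mm = sum(pred) / n, sum(meas) / n
    cov = sum((p - mp) * (m - mm) for p, m in zip(pred, meas))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    sm = math.sqrt(sum((m - mm) ** 2 for m in meas))
    return cov / (sp * sm)


# Made-up concentrations (ug/L) used only to exercise the functions.
measured = [20.0, 40.0, 10.0, 50.0]
predicted = [22.0, 30.0, 14.0, 50.0]
errors = abs_relative_error(predicted, measured)   # 10%, 25%, 40%, 0%
```

In this example two of the four samples have E strictly below 25%, so N25 is 50%, the MAE is 4.0 μg/L, and rp is about 0.945. Note that a sample with E exactly equal to 25% does not count toward N25 under the strict inequality.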

Results and discussion

Log-linear regression model

Analysis of Log-linear regression model

Regression models of various THMs generated by training sets are shown in Table 3.

Table 3 Log-linear regression models of various THMs

The F-test statistics of all models (S-Table 1) exceeded the critical values at the 99% confidence level (\(F_{\alpha = 0.01}(3, 137)\) = 3.928, \(F_{\alpha = 0.01}(4, 136)\) = 3.460), indicating a significant linear relationship between THMs and water quality parameters. Moreover, all models had variance inflation factors (VIF) much less than 5, indicating the absence of multicollinearity problems among the variables.

Models (1) and (2) each included three significant water quality parameters, while model (3) included four. In this study, T-THMs, DBCM, and BDCM showed a positive correlation with temperature, as increasing temperature accelerates the reaction rate and thus affects THM generation. T-THMs, DBCM, and BDCM were negatively correlated with NO3–N, indicating that NO3–N affected THMs concentration by consuming available chlorine. T-THMs and DBCM were positively correlated with Br, and Br had a greater impact on DBCM than on T-THMs. Only the regression model for BDCM included UV254, indicating that UV254 has a certain influence on BDCM formation, which is consistent with the results of Xu et al. (2022). Furthermore, model (3) also included COD, whose impact on BDCM formation was lower than that of temperature.

Evaluation of log-linear regression model

The testing set, training set, and complete set were input into models (1)–(3), and the predictions were compared with the measured values. The correlation analysis is shown in Fig. 4.

Fig. 4 Comparison of the predicted and measured values of the Log-linear regression model. a, b, and c represent T-THMs, DBCM, and BDCM; 1, 2, and 3 represent the complete set, training set, and test set, respectively

As shown in Fig. 4(a1), (b1), and (c1), for the complete set, the N25 values of T-THMs, DBCM, and BDCM were 77–81%, while the rp values were 0.65–0.78. As shown in Fig. 4(a2), (b2), and (c2), for the training set, the N25 values of T-THMs, DBCM, and BDCM were 77–79%, and the rp values were 0.66–0.79, which is comparable to the performance on the complete set. As shown in Fig. 4(a3), (b3), and (c3), for the testing set, the N25 value of T-THMs reached 85%, while those of DBCM and BDCM were both 76%. However, the rp values, ranging from 0.62 to 0.76, were not ideal, and the overall predictive performance was not satisfactory. The predictive ability of the Log-linear model was therefore only moderate.

Support vector regression model (SVR)

Determination of model parameters

Nine water quality parameters were selected as input variables, and T-THMs, DBCM, and BDCM were selected as output variables. The parameters obtained from the grid search and the corresponding test-set RMSE are shown in Fig. 5. The parameters with the smallest RMSE and the R2 closest to 1 for both the training and testing sets were selected. When \(c\) was in the range of [0.3, 2] and \(g\) was in the range of [6, 8], the RMSE of T-THMs and DBCM in the testing set could be minimized. When \(c\) was in the range of [0.5, 1] and \(g\) was in the range of [8, 10], the RMSE of BDCM in the testing set could be minimized.

Fig. 5 RMSE under different c, g parameter combinations

For T-THMs, when \(c\) = 1.1 and \(g\) = 8, the RMSE of both the training set and the test set was small, and R2 was closest to 1. The same parameter optimization method was used for DBCM. For BDCM, the search range of \(g\) was extended to [0.1, 10] with a step size of 0.1. The best values of \(c\), \(g\), and \(p\) for T-THMs, DBCM, and BDCM are listed in S-Table 2.

Prediction of THMs concentration

The fitting effect of the SVR model with optimal parameters on THMs is shown in Fig. 6(a1–c1), (a2–c2). The fit for the complete set and training set was better than that of the Log-linear regression model. For the complete set, the N25 values of T-THMs, DBCM, and BDCM were 94–96%, and the rp between predicted and measured values was 0.94, both higher than the N25 (77–81%) and rp (0.65–0.78) of the Log-linear regression model. For the training set, the N25 values of T-THMs, DBCM, and BDCM were 96–98%, and the rp values between predicted and measured values were 0.95–0.97.

Fig. 6 Comparison of the predicted and measured values of the SVR model. a, b, and c represent T-THMs, DBCM, BDCM; 1, 2, and 3 represent the complete set, training set, and test set, respectively

The performance of the trained model on the test set is shown in Fig. 6(a3), (b3), and (c3). Compared with the Log-linear regression model, the SVR model showed improved N25 on the test set (82–85%), but its rp (0.69–0.75) was comparable. In addition, the SVR model exhibited overfitting, possibly due to suboptimal settings of the parameters \(c\) and \(g\), which resulted in insufficient generalization ability.

Random forest regression model (RFR)

Determination of model parameters

The RFR model has two main adjustable parameters, namely the number of decision trees (\(M\)) and the minimum number of leaf nodes (\(N\)). Because the predictive performance of the RFR model is stochastic, three parameter combinations that gave good predictive performance were selected for further training. The optimal parameter combinations for THMs are shown in S-Table 3.

Prediction of THMs concentration

The fitting effect of the RFR model on THMs under the optimal parameter combination is shown in Fig. 7. As shown in Fig. 7(a3), (b3), and (c3), the test-set predictions of the RFR model had a high correlation coefficient rp (0.70–0.80) and N25 (82–88%) with respect to the measured values. As shown in Fig. 7(a1), (b1), and (c1), for the complete set, rp = 0.92–0.95 and N25 = 91–96%. As shown in Fig. 7(a2), (b2), and (c2), for the training set, rp = 0.96–0.98 and N25 = 94–98%. Overall, the predictive ability of the RFR model for THMs was better than that of the Log-linear regression model, which obtained an rp of 0.62–0.78 and an N25 of 76–85%.

Fig. 7 Comparison of the predicted and measured values of the RFR model. a, b, and c represent T-THMs, DBCM, BDCM; 1, 2, and 3 represent the complete set, training set, and test set, respectively

Model comparison

Based on the evaluation indicators in Table 4, the prediction accuracy of the RFR model for T-THMs, DBCM, and BDCM in the test set (N25 = 82–88%, rp = 0.70–0.80) was higher than that of the Log-linear regression model (N25 = 76–85%, rp = 0.62–0.76), indicating that the prediction ability of the RFR model was greater than that of the Log-linear regression model. For the complete set and training set, the prediction effect of the RFR model (N25 = 91–98%, rp = 0.92–0.98) was also better than that of the Log-linear regression model (N25 = 77–81%, rp = 0.65–0.79), which further indicated that the RFR model could better predict the concentration of THMs under the influence of multiple water quality factors. In the complete set, the MAE for T-THMs in the RFR model was 1.016, which was lower than the 5.843 of the Log-linear regression model and the 1.848 of the SVR model. However, for DBCM and BDCM, the MAE of the SVR model was lower than that of the other two models. For BDCM, the SVR model performed better than the RFR and Log-linear regression models, which obtained MAEs of 0.702 and 1.334, respectively.

Table 4 Evaluation indicators of three models under optimal parameter combination

To better compare the performance of the Log-linear regression model, SVR model, and RFR model, the prediction results on the test sets of T-THMs, DBCM, and BDCM are shown in Fig. 8. All three models conformed to the overall trend of observation. Among them, the RFR model was more accurate in fitting T-THMs and DBCM compared to other models. For BDCM, the deviation of the three models from the measured values was relatively large, especially for noise points with a large deviation, and it was not enough to determine which model was superior based on the figure.

Fig. 8 Comparison of the measured and predicted results on the test sets for the three models

In Fig. 9, the box plot displayed the absolute relative errors between predicted and measured values for the three models on the test set. The values at the bottom of the figure represent the mean and median values of the absolute relative errors. A box-plot generally measures the dispersion of data using the interquartile range (IQR). It can be observed that BDCM had the most outliers among the three THMs, with three outliers, while T-THMs had one outlier. However, the number of outliers for all THMs was within a reasonable range. For BDCM, the RFR model had the smallest mean value of 0.141 and median value of 0.107. However, the IQR value of the RFR model was higher than the SVR model, indicating that predictions for BDCM in the RFR model were more dispersed than those of the SVR model. For DBCM, the RFR model had the smallest mean of 0.127 and a median of 0.09. The IQR value was also the lowest, indicating that the RFR model had the highest overall prediction quality compared to the other two models. For T-THMs, the RFR model had a smaller mean and lower IQR value than the SVR model, indicating higher stability of the prediction results. However, the median value was larger than that of the SVR model. Further analysis of Figs. 6 and 7 revealed that for T-THMs, the RFR model's N25 (88%) was greater than the SVR model's N25 (85%), and the rp (0.79) was also greater than the SVR model's rp (0.75). Therefore, for T-THMs, the RFR model's predictive performance was superior to the SVR model.

Fig. 9 The absolute relative error of the three models applied to the test set

The predictive accuracy of models is generally positively correlated with the technical complexity of their construction (Hu et al., 2023). Although the accuracy of Log-linear regression models is not high, these models can provide specific formulas and good interpretability, which helps explain the relationship between different water quality parameters and DBPs. On the other hand, ML models such as BPNN, CNN, SVR, and RFR do not possess interpretability, but they are able to provide higher accuracy.

In this study, compared to the Log-linear regression model, the SVR and RFR models were relatively simple but had better predictive performance for three types of THMs. For T-THMs, the average absolute relative errors of the SVR and RFR models were reduced by 21% and 38%, respectively. For DBCM, the SVR and RFR models resulted in a decrease of 22% and 66%, respectively. For BDCM, the average absolute relative error of the SVR and RFR models decreased by 43% and 79%, respectively. Based on the above analysis, the RFR model showed better performance than the SVR and Log-linear regression models for DBCM and T-THMs. However, for the prediction of BDCM, the SVR and RFR models each had their own strengths. Overall, for the THMs studied in this research, the performance of the RFR model was better than that of the SVR and Log-linear regression models. In addition, the rp values of the SVR, RFR, and Log-linear regression models were 0.70, 0.70, and 0.62, respectively, all of which were higher than that of the GANN model (rp = 0.473, DWTPs B) (Liu et al., 2023). This shows that the SVR and RFR models perform well in predicting DBPs and further indicates that the parameter optimization method used in this study is feasible.

Conclusion

This study utilized nine water quality parameters (temperature, pH, UV254, TOC, COD, NH4+–N, NO3–N, NO2–N, Br) as input variables to establish the Log-linear regression model, SVR model, and RFR model to predict the concentrations of THMs (T-THMs, DBCM, and BDCM). The performance of the three models was analyzed by comparing the predicted values with the measured values. The comparison of the model evaluation indicators (R2, MAE, and RMSE) showed that the prediction accuracy of the Log-linear regression model was not ideal. However, this model can explain the relationship between water quality parameters and THMs, which helps in taking targeted measures to control the generation of THMs. The SVR model achieved a good prediction effect for the training set, but its prediction effect for the test set was not ideal, while the RFR model had an excellent prediction effect for both the training set and the test set. The RFR model had a better predictive ability for T-THMs and DBCM than the other models, with higher correlation coefficients and prediction accuracy, indicating that it could handle the complex relationship between the formation of T-THMs and DBCM and the water quality parameters. For BDCM, the prediction effects of the RFR model and the SVR model were similar, and each model had its own strengths. Although the average error of the RFR model was smaller than that of the SVR model, its prediction errors were more dispersed, which means that the prediction accuracy and stability of the RFR model may carry potential risks. Overall, the prediction effect of the RFR model was better than that of the SVR model.