Introduction

The disinfection process is one of the crucial steps in drinking water treatment plants (DWTPs) to reduce waterborne diseases by inactivating harmful pathogens and microorganisms. Chlorine is widely used as a disinfectant in South Korea and elsewhere because it is highly efficient at inactivating pathogens and microorganisms and is cost-effective (Abdullah et al. 2003; Uyak et al. 2005). However, chlorine reacts with natural organic matter (NOM) present in source water and forms undesirable carbonaceous and nitrogenous disinfection by-products (C- and N-DBPs) (Sérodes et al. 2003; Chowdhury et al. 2010; Maeng et al. 2018). Epidemiological studies conducted repeatedly in laboratory animals have shown that ingestion of water containing chlorinated by-products causes bladder, colon, and rectal cancer (Morris et al. 1992; WHO 2005; Brown et al. 2011). In addition, toxicological studies have shown that ingestion of certain DBPs causes cancer in the liver and kidneys, as well as adverse reproductive and developmental disorders in laboratory animals (Babaei et al. 2015; Krasner et al. 2017). Among all the DBPs formed, C-DBPs, i.e., trihalomethanes (THMs) and haloacetic acids (HAAs), are found at higher concentrations in DWTPs (Richardson et al. 2007; Hua et al. 2015) and are considered potentially carcinogenic (Golfinopoulos et al. 1998; Golfinopoulos and Arhonditsis 2002; Uyak et al. 2005; Platikanov et al. 2012). Moreover, bromine-containing species are reported to be more geno- and cytotoxic than their chlorinated forms and are of concern (Krasner et al. 2017). These DBPs can enter the human body through ingestion of drinking water, inhalation, and dermal contact during regular indoor activities such as showering, bathing, swimming, and cooking (Chowdhury et al. 2010; Chowdhury et al. 2011; Domínguez-Tello et al. 2017). Thus, several DBPs are regulated by regulatory agencies worldwide.
The US Environmental Protection Agency developed a Disinfectants/DBP (D/DBP) rule in 1998 and set maximum contaminant levels of 80 μg/L for THMs (Uyak et al. 2005; Singh and Gupta 2012) and 60 μg/L for HAAs (Ged et al. 2015). The formation of THMs and HAAs depends on the quality of the source water and the treatment process, i.e., chlorine dose, contact time between chlorine and organic matter, pH, water temperature, and other factors (Sadiq and Rodriguez 2004; Fooladvand et al. 2011). Continuous monitoring throughout the operation of a DWTP is needed to ensure compliance with the guidelines. More than 100 predictive models have been developed because of active research on DBPs (Chowdhury et al. 2009; Domínguez-Tello et al. 2017). These models are based on either laboratory or field scale data and have shown varying levels of predictive capability. Most of the mathematical models developed are empirical in nature and site specific, which means their predictive capabilities for different environmental conditions and treatment processes remain limited (Elshorbagy et al. 2000; Uyak et al. 2007; Ata et al. 2015). On the other hand, seasonal, locational, and temporal factors, together with the complexity of the reaction between organic matter and chlorine and of the formation of DBPs, make it very difficult to develop mechanistic models (Semerjian et al. 2009; Kulkarni and Chellam 2010). Most of the models lack interacting parameters (Sohn et al. 2004). Mathematical models that are based on real DWTPs and distribution systems, and that consider all the water quality parameters and operating variables that can predict THMs and HAAs, are very useful tools as alternatives to field measurements. Laboratory tests for the measurement of DBPs are very time consuming and expensive. Predictive models can provide quick and reasonable estimates and can help in making decisions to optimize the treatment process (Westerhoff et al. 2000; Mukundan and Van Derson 2014; Lin et al. 2018).

The aim of this study was to develop a mathematical model for predicting THMs and HAAs, based on multiple regression analysis and using water quality parameters of both raw and treated water and operational conditions from three DWTPs located in Seoul, South Korea. Models were validated using more recent data from the three treatment plants and were applied to two different DWTPs to evaluate their predictive capability. Most of the mathematical models that were reported previously lack principal factors such as dissolved organic carbon (DOC), bromide ion (Br), and chlorine dose (Chowdhury et al. 2009; Bond et al. 2014). Besides these, most of the models do not consider interactive variables (the effect of two or more variables varying together). The THM models suggested by Amy et al. (1987), Golfinopoulos et al. (1998) and Uyak et al. (2005) have good predictability (R2 = 0.90, 0.98, and 0.98, respectively) and are based on raw water characteristics. Raw water characteristics do not represent treated water characteristics. In addition, the models do not consider reaction time and chlorine dose. The objective of this study was to develop models that would address these shortcomings of previously published research. In addition, this work is the first of its kind to develop mathematical models using multiple regression analysis for both THMs and HAAs in South Korea. The model developed in this research can be considered robust because both raw water and treated water characteristics, along with most of the water quality and operational parameters with interactive parameters, are incorporated.

Materials and methods

Description of DWTPs

Seoul has a population of 10.178 million and is served by six DWTPs (SMG 2017). The Han River, which is the second longest river, serves as the main source of raw water to all these DWTPs. More than 3 million m3 of water is needed daily for citizens residing in Seoul from all six DWTPs (Fig. 1). Each day, a total of 4.44 million m3 of water is processed and supplied by all DWTPs, as shown in Table S1. The Seoul Metropolitan Government (SMG) monitors water quality and operation parameters every day to ensure compliance with the National Drinking Water Standard Guideline for the safety of the citizens. The treatment process combines conventional processes, i.e., prechlorination, coagulation, flocculation, sedimentation, filtration, and postchlorination, with advanced treatment processes, i.e., ozonation and powdered activated carbon treatment (Fig. S1).

Fig. 1 Study area with all five drinking water treatment plants considered

Mathematical model development

For model development, water quality and operational data for both raw water and treated water were collected from 2015 to 2016 for three DWTPs (Gangbuk, Gwangam, and Yeongdeungpo). Data obtained were based on monthly analyses of water samples. These datasets included 120 and 66 measured values for THMs and HAAs, respectively, along with other water quality and operational parameters. Water quality parameters include DOC, ultraviolet absorbance (UV254), Br ion concentration, temperature and residual chlorine, THMs, and HAAs for treated water. Likewise, operational parameters such as pH and prechlorine dose for raw water, postchlorine dose, pH, temperature, and contact time were included. Models that include at least five of the seven principal factors, i.e., predictor variables (DOC, UV254, Br, chlorine dose, temperature, contact time, and pH), have shown high predictability for THMs and HAAs (Ged et al. 2015). In addition, the effect of two or more variables (interactive) and higher-order variables needed to be incorporated. Multiple regression analysis was carried out using Minitab 18 statistical software (Minitab, LLC, USA) and Excel (Microsoft Office 2016) to develop both linear and nonlinear (power) models. For the THMs, a forward selection process was used, and for the HAAs, a backward elimination process was carried out. Before multiple regression analysis, the statistical significance of each direct, quadratic, and interactive predictor variable was verified using a Pearson correlation matrix at a 95% confidence level. The models investigated here include the principal factors, and interactive and higher-order factors for both linear and nonlinear forms. The principal factor models are direct and in their very simplest form (Chowdhury et al. 2011). The generalized forms of the mathematical models for predicting the THMs and HAAs values are presented in Eqs. 1 and 2, where y represents the THMs and HAAs, β represents the model coefficient, x represents the predictor variables, and ε represents the residuals or errors.

$$ y={\beta}_0+{\sum}_{j=1}^k{\beta}_j{x}_{ij}^{\prime }+\varepsilon $$
(1)
$$ y={\beta}_0\cdotp {\prod}_{j=1}^k{x_{ij}^{\prime}}^{\beta_j}+\varepsilon $$
(2)

where i = 1, 2, …, n and j = 1, 2, …, k, with n > k, and x′ij denotes the ith observation of independent variable xj. The independent variables x′ij include principal factors (e.g., DOC, pH, and T), interactive variables (the effect of two or more variables varying together, e.g., T × t, UV × DOC × logClT), and higher-order variables (e.g., quadratic terms such as T2 and Cl2). The models for both THMs and HAAs were developed based on the particular values of the independent variables x′ij (xi1, xi2, …, xik). Equations 1 and 2 are the generalized forms of the linear and nonlinear models, respectively, for prediction of both THMs and HAAs. Their goodness-of-fit and performance were compared using F tests, Student’s T test, the coefficient of determination (R2), the standard error (SE, Eq. 3), the mean square error (MSE, Eq. 4), and the Durbin–Watson statistic (d, Eq. 5).
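A power-type model is commonly fitted by taking logarithms, which turns the product of powers into a sum that ordinary least squares can handle. The following is a minimal sketch of that idea for a single predictor, assuming a multiplicative error term and strictly positive data; the function name `fit_power_model` and the synthetic data are illustrative and are not taken from the study, which used Minitab and Excel.

```python
import math

def fit_power_model(x, y):
    """Fit y = b0 * x**b1 by least squares on log-transformed data.

    Assumes all x and y values are strictly positive.
    """
    log_x = [math.log(v) for v in x]
    log_y = [math.log(v) for v in y]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    b1 = sxy / sxx                         # slope in log space = exponent
    b0 = math.exp(mean_y - b1 * mean_x)    # intercept in log space = ln(b0)
    return b0, b1

# Synthetic, noise-free data generated from y = 2 * x**1.5 (illustrative only).
x = [1.0, 2.0, 4.0, 8.0]
y = [2.0 * v ** 1.5 for v in x]
b0, b1 = fit_power_model(x, y)
```

With several predictors the same transformation applies term by term, and the fit becomes a multiple linear regression on the logarithms of the predictors.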

$$ \mathrm{SE}=\sqrt{\frac{\sum_1^{\mathrm{n}}{\left(\mathrm{Observed}-\mathrm{Predicted}\right)}^2}{\mathrm{n}}} $$
(3)
$$ \mathrm{MSE}=\frac{1}{\mathrm{n}}{\sum}_1^{\mathrm{n}}{\left(\mathrm{Observed}-\mathrm{Predicted}\right)}^2 $$
(4)
$$ d=\frac{\sum_{i=2}^n{\left({e}_i-{e}_{i-1}\right)}^2}{\sum_{i=1}^n{e}_i^2} $$
(5)

In Eq. 5, e is the residual value and is calculated by subtracting the predicted value from the observed value.
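The three statistics in Eqs. 3 to 5 can be computed directly from paired observed and predicted values. The sketch below follows the equations as printed (both SE and MSE divide by n); the function name `fit_statistics` and the sample numbers are hypothetical and serve only to illustrate the arithmetic.

```python
import math

def fit_statistics(observed, predicted):
    """Return (SE, MSE, Durbin-Watson d) for paired observed/predicted values."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    n = len(residuals)
    sse = sum(e * e for e in residuals)      # sum of squared residuals
    mse = sse / n                            # Eq. 4
    se = math.sqrt(sse / n)                  # Eq. 3 as printed
    # Eq. 5: squared successive differences of residuals over the sum of squares
    successive = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, n))
    d = successive / sse
    return se, mse, d

# Illustrative values only, not data from the study.
observed = [10.0, 12.0, 15.0, 11.0]
predicted = [9.0, 13.0, 14.0, 12.0]
se, mse, d = fit_statistics(observed, predicted)
```

In this toy example the residuals alternate in sign, so d comes out well above 2, which is the pattern associated with negative autocorrelation.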

To determine the significance of the difference between the measured and predicted values, an F test was performed. For the F value > 0.5, Student’s T test with equal variance was performed. In contrast, if the F value < 0.5, Student’s T test with unequal variance was performed. If the p value from the T test is < 0.05, the two datasets, i.e., measured and predicted, do not have statistical similarities or are not equivalent. On the other hand, if the p value is > 0.05, the two datasets are equivalent or do not have significant statistical differences.

Model validation and applicability

The validation process determines or confirms how sound and effective the models are. In this study, it shows the stability and reasonableness of the THM and HAA models. For validation and applicability, the developed models were subjected to two different types of tests: (i) comparisons of measured and predicted values by performing internal evaluations, i.e., on more recent or additional datasets from the same DWTPs on which the models were based (calibration), and (ii) comparisons of measured and predicted values by performing external evaluations, i.e., on datasets from different DWTPs. The developed models were used to predict both THMs and HAAs for the additional datasets (January 2017 to July 2017) obtained from three DWTPs, as well as external datasets (January 2016 to December 2016). Analyses were done to calculate the R2, SE, and MSE values. A T test was performed on the predicted models to determine the biases by calculating the t value and t critical or p value. The values were compared and if t-calculated < t-critical or the p value > 0.05, bias was considered not significant and vice versa.
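The bias check described above can be sketched as a two-sample T test with pooled (equal) variance, followed by a comparison of the calculated and critical t values. This is a sketch under stated assumptions: the equal-variance form and the use of the absolute t value for the two-tailed comparison are assumptions, the function names are hypothetical, and the critical value must be supplied from a t table for the relevant degrees of freedom.

```python
import math

def pooled_t_statistic(a, b):
    """Return (t, degrees of freedom) for an equal-variance two-sample T test."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    var_a = sum((v - mean_a) ** 2 for v in a) / (na - 1)
    var_b = sum((v - mean_b) ** 2 for v in b) / (nb - 1)
    pooled_var = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    t = (mean_a - mean_b) / math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return t, na + nb - 2

def bias_is_significant(t_calculated, t_critical):
    """Two-tailed decision: bias is significant if |t| reaches the critical value."""
    return abs(t_calculated) >= t_critical

# Illustrative values only, not data from the study.
measured = [20.0, 25.0, 30.0, 22.0]
predicted = [21.0, 24.0, 29.0, 23.0]
t_calc, dof = pooled_t_statistic(measured, predicted)
# 2.447 is the two-tailed critical value at the 0.05 level for 6 degrees of freedom.
significant = bias_is_significant(t_calc, 2.447)
```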

Results and discussion

The occurrence of THMs and HAAs

The range and average levels of THMs and HAAs in treated water from the three DWTPs, based on data collected from 2015 to 2016, are summarized in Fig. 2. The formation of THMs is ranked for the DWTPs as Gangbuk DWTP > Gwangam DWTP > Yeongdeungpo DWTP. For the HAAs, there were no significant differences between the three DWTPs. The observed maximum values of THMs were 33 μg/L, 39 μg/L, and 29 μg/L for Gwangam, Gangbuk, and Yeongdeungpo, respectively. Very low values for HAAs were observed in all three DWTPs. The maximum values were found to be 6 μg/L, 6 μg/L, and 7 μg/L in Gwangam, Gangbuk, and Yeongdeungpo, respectively. The measured values for THMs were higher and more dispersed than those for HAAs throughout the year because of the high hydrophobic fraction of NOM compared with the hydrophilic fraction (Bond et al. 2012). Figure 3 shows seasonal variations of THMs and HAAs in treated water. For THMs, high values were observed during summer (June to August) and at the beginning of the autumn season (September) for all three DWTPs, especially in Gangbuk DWTP. This may be because of temperature changes and organic matter present in the source water. Although the temperature in autumn is slightly lower than in summer, the water is rich in organic matter. The main reason could be the rapid decay of vegetation (Kumari and Gupta 2015). Similarly, lower values were observed during the winter season (December to February). In contrast, HAA values were observed to be higher during spring (March to May). The major HAA species contributing to the higher concentrations is dichloroacetic acid (DCAA) (Rodriguez et al. 2004), which shows high concentration levels during spring. In addition, DCAA is not affected by the pH levels of the source and treated water. The decrease in the concentration of HAAs during the summer and autumn seasons may be attributed to microbial activities. It has been reported that microorganisms degrade HAAs over time (Zhou and Xie 2002; Rodriguez et al. 2004).

Fig. 2 THMs and HAAs concentration in treated water from three DWTPs. HAAs is HAAs+2

Fig. 3 Monthly variation of THMs and HAAs in three DWTPs

Correlation of independent variables with THMs and HAAs formation

In this research, the models were built by considering principal factors and interactive and higher-order variables. The correlation matrices for both THM and HAA formation with selected variables were obtained using Pearson’s correlation test and are shown in Table 1. A positive and very strong correlation was observed for temperature (quadratic form) and an interactive variable (UV254, temperature, reaction time, and total chlorine dose, i.e., UV254 × T2 × t × ClT) with THMs formation (r = 0.888 and 0.878, respectively). This indicated that higher-order and interactive variables have the largest influence on the formation of THMs. Besides this, temperature is the variable that has the highest influence compared with other variables. This observation was also reported in other studies (Babaei et al. 2015; Kumari and Gupta 2015). An increase in temperature increases the reaction rate between organic matter and residual disinfectant. Temperature acts as an energy source and activates the reaction (Kumari and Gupta 2015). A negative and strong correlation was observed between pHavg (average value of raw water and treated water pH) and THMs formation (r = −0.709). The residual chlorine (ClR) and postchlorine dose (Clpost) showed moderate correlation (r = 0.581 and 0.509, respectively). Compared with other variables, the Br concentration does not show good correlation and was found to be negative (r = −0.124). This result may be attributed to the very low concentration of Br in treated water. However, models that included Br as an independent variable exhibit a better degree of accuracy than models that excluded Br (Ged et al. 2015). The models excluding the Br concentration resulted in the overprediction of THMs for low concentrations of Br and underprediction for high concentrations of Br. The interactive variable with higher order (quadratic form) shows a negative and very low correlation with THMs formation.
Similarly, an attempt was made to find the correlation of principal factors and interactive and higher-order variables with HAA formation in treated water. For HAAs, interactive variables such as log (ClT × DOC/(T × t)) and log (ClT × DOC/(T × t × pHavg)) show positive and very good correlation (r = 0.707 and 0.706, respectively). For principal factors such as DOC and pHavg, the correlation was found to be positive and moderate (r = 0.47 and 0.448, respectively). The ratio of DOC to residual chlorine (DOC/ClR) and the ratio of bromide ion to residual chlorine ((Br + 1)/ClR) show positive correlation (r = 0.576 and 0.325, respectively). In contrast with THMs, temperature shows negative and moderate correlation (r = −0.482) with HAA formation. The logClR also shows a negative and weak correlation (r = −0.213).
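To illustrate why a higher-order variable can correlate more strongly with the response than its principal factor, the sketch below computes Pearson’s r for a predictor and for its square against a synthetic response. The data are constructed so that the response is exactly linear in T2; they are not measurements from the study, and the function name `pearson_r` is hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Synthetic example: the response depends on the square of temperature.
temperature = [5.0, 10.0, 15.0, 20.0, 25.0]
response = [0.05 * temp ** 2 + 3.0 for temp in temperature]

r_linear = pearson_r(temperature, response)                        # principal factor
r_quadratic = pearson_r([temp ** 2 for temp in temperature], response)  # higher-order term
```

Here `r_quadratic` is essentially 1, while `r_linear` is slightly lower, which mirrors the kind of comparison reported in Table 1.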

Table 1 Correlation of THMs and HAAs with independent variables (principal factors and interactive and higher-order variables)

Mathematical models for DBPs within DWTPs

The variables (principal factors and interactive and higher-order) that were considered for the mathematical models are shown in Table 2. Before selecting variables, different variables and their combinations were tried to develop both linear and nonlinear models with the best statistical outputs and the accuracy of predictions vs observed values for THMs and HAAs (Table S2). Based on the results, both linear and nonlinear models for THMs and HAAs were developed and a comparative analysis (statistical test) was carried out to determine the best model.

Table 2 Models coefficients and statistical analysis results for the linear and nonlinear models for prediction of THMs and HAAs

Trihalomethane models

The linear and nonlinear models for THMs are shown in Eqs. 6 and 7, respectively.

$$ \mathrm{THMs}={\upbeta}_0+{\upbeta}_1{\left(\mathrm{UV}\times \mathrm{DOC}\times {\mathrm{logCl}}_{\mathrm{T}}\right)}^2+{\upbeta}_2\left(\mathrm{Br}+2\right)+{\upbeta}_3{\left({\mathrm{Cl}}_{\mathrm{R}}\right)}^2+{\upbeta}_4{\left({\mathrm{Cl}}_{\mathrm{Post}}\right)}^2+{\upbeta}_5\left({\mathrm{pH}}_{\mathrm{avg}}\right)+{\upbeta}_6{(T)}^2+{\upbeta}_7\left(\mathrm{UV}\times {T}^2\times t\times {\mathrm{Cl}}_{\mathrm{T}}\right), $$
(6)
$$ \mathrm{THMs}={\upbeta}_0\times {\left({\left(\mathrm{UV}\times \mathrm{DOC}\times {\mathrm{logCl}}_{\mathrm{T}}\right)}^2\right)}^{\upbeta_1}\times {\left(\mathrm{Br}+2\right)}^{\upbeta_2}\times {\left({\left({\mathrm{Cl}}_{\mathrm{R}}\right)}^2\right)}^{\upbeta_3}\times {\left({\left({\mathrm{Cl}}_{\mathrm{post}}\right)}^2\right)}^{\upbeta_4}\times {\left({\mathrm{pH}}_{\mathrm{avg}}\right)}^{\upbeta_5}\times {\left({(T)}^2\right)}^{\upbeta_6}\times {\left(\mathrm{UV}\times {T}^2\times t\times {\mathrm{Cl}}_{\mathrm{T}}\right)}^{\upbeta_7}, $$
(7)

where THMs, DOC, Br + 2, ClR, Clpost, and ClT are in μg/L, T is in degrees Celsius, t is in hours, and β0, β1, β2, β3, β4, β5, β6, and β7 are model statistical coefficients. The effect of bromide ion was expressed as Br + 2 to avoid zero prediction values for THMs when the value of bromide ion was under the detection level. The size of the sample (n), F test, T test, coefficient of determination (R2), standard error (SE), mean square error (MSE), Durbin–Watson statistic (d), and the model statistical coefficient values are summarized in Table 2. Student’s T test results for both the linear and nonlinear models (p value > 0.05, i.e., 1 and 0.803, respectively) show no significant statistical difference between observed and predicted values. In addition, the analysis of variance (ANOVA) result showed that both linear and nonlinear models were statistically significant (p value = 0.000). The coefficient of determination for the linear model (R2 = 0.915) was found to be greater than for the nonlinear model (R2 = 0.852). Likewise, the observed values of SE and MSE for the linear model (2.085 and 4.06, respectively) were found to be lower than the values for the nonlinear model (2.350 and 5.52, respectively). This suggests that the linear model performs better than the nonlinear model for THMs. This result is supported by the d value. The value of d is preferred to be in the range between 1.5 and 2.5 for a statistically best-fit model (Uyak et al. 2007; Kumari and Gupta 2015; Domínguez-Tello et al. 2017). The value of d was found to be 1.554 for the linear model and 1.207 for the nonlinear model. This indicated that the linear model is statistically the best-fit model. Figure 4a shows the plot for the observed and predicted THMs values in the three DWTPs. The model predicted most of the peak observed values consistently.
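The predictor terms of the linear THM model (Eq. 6) can be assembled from the measured quantities before the coefficients are applied. The sketch below is illustrative only: it assumes that logClT denotes a base-10 logarithm, the function names are hypothetical, the input values are invented, and the coefficients β0 to β7 must be taken from Table 2 because they are not reproduced in the text.

```python
import math

def thm_terms(uv, doc, cl_total, br, cl_residual, cl_post, ph_avg, temp, time_h):
    """Return the seven predictor terms of Eq. 6, in the order beta1..beta7."""
    return [
        (uv * doc * math.log10(cl_total)) ** 2,   # (UV x DOC x logClT)^2, log10 assumed
        br + 2.0,                                 # Br + 2
        cl_residual ** 2,                         # (ClR)^2
        cl_post ** 2,                             # (Clpost)^2
        ph_avg,                                   # pHavg
        temp ** 2,                                # T^2
        uv * temp ** 2 * time_h * cl_total,       # UV x T^2 x t x ClT
    ]

def predict_linear(beta, terms):
    """Evaluate beta0 + sum(beta_j * term_j); beta holds beta0..beta7."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], terms))

# Invented input values, for illustration only.
terms = thm_terms(uv=0.02, doc=1500.0, cl_total=1000.0, br=10.0,
                  cl_residual=500.0, cl_post=600.0, ph_avg=7.2,
                  temp=20.0, time_h=4.0)

# Placeholder coefficients (not the fitted values): only beta0 and beta2 are nonzero.
placeholder_beta = [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
example_value = predict_linear(placeholder_beta, terms)
```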

Fig. 4 Calibration of predicted vs. observed concentration: (a) THMs linear model and (b) HAAs linear model

Haloacetic acid model

Table 2 summarizes the statistical coefficients and regression analysis results for both the linear and nonlinear HAA models. Although the dataset used for these models was relatively small because of the unavailability of all the independent predictive variables, and the treated water concentrations were low, this study still attempted to develop a model for HAAs. Very limited research has been conducted to develop HAA models. The model suggested by Sérodes et al. (2003) has good predictive capability (R2 = 0.92), but the model did not consider the pH levels and was not validated. Similarly, the model developed by Nikolaou et al. (2004) has very low predictability (R2 = 0.28) and did not incorporate TOC, DOC, and temperature. The ANOVA of the models formulated in this research shows the models to be statistically significant (p value = 0.000). The models formulated are as follows:

$$ \left(\mathrm{HAAs}+2\right)={\upbeta}_0+{\upbeta}_1{\mathrm{logCl}}_{\mathrm{R}}+{\upbeta}_2\mathrm{DOC}+{\upbeta}_3{\left(\mathrm{DOC}-100\right)}_{\mathrm{For}\ \mathrm{DOC}>1000}+{\upbeta}_4\left(\frac{\mathrm{DOC}}{{\mathrm{Cl}}_{\mathrm{R}}}\right)+{\upbeta}_5\left(\frac{\left(\mathrm{Br}+1\right)}{{\mathrm{Cl}}_{\mathrm{R}}}\right)+{\upbeta}_6{\mathrm{pH}}_{\mathrm{avg}}+{\upbeta}_7{\left(T-5\right)}_{\mathrm{For}\ T>5}+{\upbeta}_8\log \left({\mathrm{Cl}}_{\mathrm{T}}\times \mathrm{DOC}/\left(T\times t\right)\right)+{\upbeta}_9\log \left({\mathrm{Cl}}_{\mathrm{T}}\times \mathrm{DOC}/\left(T\times t\times {\mathrm{pH}}_{\mathrm{avg}}\right)\right), $$
(8)
$$ \left(\mathrm{HAAs}+2\right)={\upbeta}_0\times {{\mathrm{logCl}}_{\mathrm{R}}}^{\upbeta_1}\times {\mathrm{DOC}}^{\upbeta_2}\times {\left(\mathrm{DOC}-100\right)}_{\mathrm{For}\ \mathrm{DOC}>1000}^{\upbeta_3}\times {\left(\frac{\mathrm{DOC}}{{\mathrm{Cl}}_{\mathrm{R}}}\right)}^{\upbeta_4}\times {\left(\frac{\left(\mathrm{Br}+1\right)}{{\mathrm{Cl}}_{\mathrm{R}}}\right)}^{\upbeta_5}\times {{\mathrm{pH}}_{\mathrm{avg}}}^{\upbeta_6}\times {\left(T-5\right)}_{\mathrm{For}\ T>5}^{\upbeta_7}\times \log {\left({\mathrm{Cl}}_{\mathrm{T}}\times \mathrm{DOC}/\left(T\times t\right)\right)}^{\upbeta_8}\times \log {\left({\mathrm{Cl}}_{\mathrm{T}}\times \mathrm{DOC}/\left(T\times t\times {\mathrm{pH}}_{\mathrm{avg}}\right)\right)}^{\upbeta_9}, $$
(9)

where β0, β1, β2, β3, β4, β5, β6, β7, β8, and β9 are statistical model coefficients, HAAs, DOC, Br + 1, ClR, and ClT are in μg/L, T is in degrees Celsius, and t is in hours. The HAAs concentration values were expressed as HAAs + 2 to avoid zero values of prediction and to enhance correlation with independent variables. Student’s T test results for both the linear and nonlinear models (p value > 0.05, i.e., 1 and 0.652, respectively) show no significant statistical difference between the observed and predicted values. The coefficient of determination for the linear model (R2 = 0.772) is higher than for the nonlinear model (R2 = 0.652). The SE and MSE were found to be 0.997 and 0.843, respectively, for the linear model and are lower than for the nonlinear model. Because the d values for both models are in the range of 1.5 to 2.5, the model selection was made based on the R2, SE, and MSE values. Based on the statistical analysis, the linear model was adopted in this research study. Figure 4b shows the predicted vs observed HAAs.
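Two small operations are implied by the HAA model form: the threshold terms such as (T − 5) for T > 5, and the conversion of a predicted (HAAs + 2) value back to a concentration. The sketch below is a hedged illustration; it assumes that a threshold term contributes zero at or below its threshold, which the text does not state explicitly, and the function names are hypothetical.

```python
def threshold_term(value, threshold):
    """Return value - threshold above the threshold, else 0 (assumed behaviour)."""
    return max(value - threshold, 0.0)

def haas_from_shifted(shifted_prediction):
    """Convert a predicted (HAAs + 2) value back to HAAs in ug/L."""
    return shifted_prediction - 2.0

# Illustrative values only.
cold_term = threshold_term(3.0, 5.0)    # T = 3 degrees C, below the 5 degree threshold
warm_term = threshold_term(18.0, 5.0)   # T = 18 degrees C
haas = haas_from_shifted(6.5)           # model output of 6.5 for (HAAs + 2)
```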

Validation of THM and HAA model

To confirm that the models (Eqs. 6 and 8) are effective for their intended purpose, they must be validated. For validation, data from the same DWTPs for January 2017 to July 2017 were used. The concentrations of THMs and HAAs were predicted using new independent datasets and were compared with measured values. Analyses were done to determine R2, SE, and MSE for the validation dataset and the results are summarized in Table 3. The R2 values for THMs and HAAs were found to be 0.914 and 0.794, respectively, for the validation dataset. Both models showed a good correlation between the measured and predicted values. The bias of both THM and HAA models was assessed using Student’s T test. The t-critical values for a two-tailed test at the significance level of 0.05 were found to be 1.998 and 2.032 with degrees of freedom of 64 and 34, respectively. The t-calculated values for THMs (−0.164) and HAAs (−0.601) from the T test results were found to be less than the t-critical values. Moreover, the p values for both THMs (0.87) and HAAs (0.552) were greater than 0.05. This suggests that the measured and predicted values did not display significant differences. The plots of measured against predicted values are shown in Fig. 5a, b for THMs and HAAs, respectively.

Table 3 Validation and application of linear models for THMs and HAAs

Fig. 5 Predicted vs. observed concentration: (a) THMs validation, (b) HAAs validation, (c) THMs application Amsa DWTP, (d) HAAs application Amsa DWTP, (e) THMs application Guui DWTP, and (f) HAAs application Guui DWTP

Application of THMs and HAAs to different DWTPs

The models developed were applied in two different DWTPs (Amsa and Guui) to evaluate their suitability, soundness, and effectiveness. The data from these two DWTPs were not used during model development and validation. The statistical results are shown in Table 3. A similar approach to the model validation was applied for the application of the models to these two DWTPs. Both THM and HAA models showed slight decreases in performance, as indicated by the coefficient of determination values. The values obtained were 0.840 and 0.872 for the THM model, and 0.684 and 0.704 for the HAA model at the Amsa and Guui DWTPs, respectively. Compared with the HAA model, the THM model showed better performance. The T test was performed to determine bias. For both DWTPs, the t-calculated values were found to be less than t-critical, and the p values were also greater than 0.05. The t-calculated values obtained for the THM model were 1.372 for the Amsa DWTP and 1.358 for the Guui DWTP. Likewise, for the HAA model, the values obtained were − 1.786 for the Amsa DWTP and − 0.861 for the Guui DWTP. These statistical results suggest that the measured and predicted values do not display significant differences. The measured and predicted value plots for the THMs and HAAs for the Amsa and Guui DWTPs are shown in Fig. 5c–f. However, the data generation for this model application study was limited. Overall, the models for both THMs and HAAs showed moderate to very good predictability.

Conclusion

This research study developed predictive models for both THMs and HAAs. Stepwise multiple regression analysis was used to develop both models. The approach used for the model development provided critical information regarding predictor variables. The quadratic form (temperature) and interactive variable (UV254, temperature, reaction time, and total chlorine dose, i.e., UV254 × T2 × t × ClT) for THMs formation, and interactive variables such as log (ClT × DOC/(T × t)) and log (ClT × DOC/(T × t × pHavg)) for HAAs, show more significant effects than their respective principal variables. The incorporation of higher-order and interactive variables enhances the predictability of the models. This study also indicated that a better understanding of the effects of interacting and higher-order variables is needed. For both THMs and HAAs, linear models were found to show better performance than nonlinear models. The validation and application of the models showed no significant differences between measured and predicted values. These models may be useful in the identification of strategies and decision-making to improve the treatment and disinfection process of drinking water in South Korea and in other places with similar climatic conditions and treatment processes.