Introduction

Disinfection is one of the most important stages in water treatment for reducing pathogen content. In most of the world, chlorination is the most widely used disinfection method because of its high effectiveness in inactivating pathogenic microorganisms and its low cost (Mazhar et al. 2020). However, chlorine can react with natural organic matter (NOM) present in source water and generate disinfection by-products (DBPs) such as trihalomethanes (THMs) (Richardson and Plewa 2020). The formation of THMs is influenced by several factors: operational variables (e.g., pH, disinfectant type and dose, residence time), environmental conditions (e.g., water temperature and seasonal variation), and water characteristics (e.g., type and concentration of NOM, bromide ion concentration) (Al-Tmemy et al. 2018).

Various researchers have reported adverse human health effects from exposure to THMs, for example, bladder cancer (Costet et al. 2011), colorectal cancer (Rahman et al. 2010), miscarriage, and congenital anomalies (Wright et al. 2017). In addition, some THMs are classified as possibly carcinogenic (IARC 2021). Therefore, maximum contaminant levels (MCLs) have been established for drinking water. The U.S. Environmental Protection Agency (US EPA) sets an MCL of 80 μg/L for total THMs (TTHMs), which comprise chloroform, bromoform, dibromochloromethane, and bromodichloromethane (US EPA 1998). In Costa Rica, MCLs of 200 μg/L, 100 μg/L, 100 μg/L, and 60 μg/L, respectively, are established for these four compounds (MINSA 2018).

Monitoring of THMs is important to avoid the aforementioned adverse effects and to comply with legislation. However, the most common method for THM determination, gas chromatography, is expensive and time-consuming (Mukundan and Van Dreason 2014). As a tool for decision-making, multiple prediction models have been developed. These models can be generated from laboratory data or from field data obtained by collecting samples at the treatment plant and/or distribution network (Sadiq et al. 2019). Laboratory-based models have the advantage that many variables can be controlled; however, they do not capture certain aspects that occur at full scale (Chowdhury et al. 2009). Models obtained with field data have the advantage of capturing variables such as the influence of the distribution network infrastructure; however, they are specific to each site (Shahi et al. 2020) and therefore cannot be generalized to any context (Semerjian et al. 2009). Prediction models can be classified into mechanistic ones, based on the kinetics of chlorine reactions, and empirical ones (Kumari and Gupta 2015). Empirical DBP models are based on the water quality, operational, and environmental conditions that influence DBP formation, and are developed using statistical regression or artificial neural networks (Sadiq et al. 2019). According to the same study, empirical models help in understanding the factors that contribute to the formation of THMs and are a tool for decision-making.

In the literature, most models predicting the formation of THMs have been developed in temperate and urban zones, for example, in Quebec, Canada (Rodriguez et al. 2000); New York, USA (Mukundan and Van Dreason 2014); and Seoul, South Korea (Shahi et al. 2020). Moreover, models have been reported for systems located in semi-arid areas such as the city of Ahvaz, Iran (Babaei et al. 2015) and Wassit Province, southeastern Iraq (Al-Tmemy et al. 2018), in the Mediterranean region in Lebanon (Semerjian et al. 2009), and in a few cases in tropical regions, for example, in Thailand (Feungpean et al. 2015). In general, because the NOM present in different water sources is influenced by autochthonous and allochthonous production, the nature of NOM is expected to differ between regions (Edzwald and Tobiason 2011). Therefore, models need to be developed for different sites. The present research is the first attempt to develop a THM prediction model in Costa Rica and, to the best of the authors’ knowledge, in the Central American and Caribbean region. Furthermore, this study focused on rural and semi-urban areas, for which no studies were found in the literature.

In Costa Rica, 93% of the population received drinking water in 2019 (PEN and CONARE 2020). In the same year, 19.4% of homes in rural and semi-urban areas were supplied with water by local Associations Administrators of Aqueduct and Sewerage Systems (ASADAs in Spanish) (Sánchez-Hernández 2019). In addition, in 2016, 14.3% of the population was supplied by 24 municipalities and the rest by duly organized public companies (AyA 2016). The main water sources used are groundwater, springs, surface water, and mixtures of the latter two. In all cases, chlorine disinfection is the method used (Arellano-Hartig et al. 2020). In general, due to economic and analytical capacity limitations, monitoring of THMs is scarce, mainly at the ASADA and municipal levels. Thus, the objective of this study was to develop a series of prediction models for TTHMs in the distribution systems of rural and semi-urban areas supplied by springs, surface water, and a mixture of both sources. This is the first study of its kind carried out in the country and is expected to serve as a tool for decision-making in the aqueducts regarding their operation and the parameters to be monitored.

Materials and methods

Study site and drinking water systems

The study was performed in three different zones of the country (Fig. 1). The sites have a dry season from December to March, a rainy season from May to October, and two months of transition, April and November (Manso et al. 2005). Nineteen small distribution systems in rural or semi-urban areas were selected. The population of most of the systems ranges from 328 to 8000 inhabitants. The length of the distribution networks ranges from 1.2 to 13 km. The raw water sources of the systems were surface water (6), springs (6), and a mixture of both (7). The surface water and the mixed water sources were treated with conventional treatment systems (2), slow sand filtration (1), screening or sedimentation (5), multi-stage filtration (1), and coarse-layered filtration (2). The water was chlorinated with solid Ca(ClO)2 in 16 systems, with liquid NaClO in one system, and with chlorine generated in situ by electrolysis in two systems. In some systems, mainly those supplied by spring water, chlorination was the only treatment; therefore, water subjected solely to chlorination was also considered treated water.

Fig. 1 Study site including all the drinking water systems in three provinces of the country: (a) Alajuela, (b) Puntarenas, and (c) Cartago

Water sampling and analytical procedures

Water samples from the 19 systems were collected in three sampling campaigns, in the dry, transition, and rainy seasons, respectively. The study period was between April 2018 and September 2019. On each sampling day, four samples were taken at different points of the distribution network, as recommended by the local legislation (MINSA 2018). Specifically, the sampling points were the exit of the chlorination storage tank (minimum estimated design contact time of 30 min) and the beginning, the middle, and the end of the distribution network.

Total organic carbon (TOC) and dissolved organic carbon (DOC) were determined using a Teledyne Tekmar TOC Fusion analyzer following method 5310 C of the Standard Methods (APHA et al. 2017). The limits of detection and quantification were 0.03 and 0.05 mg C/L, respectively. For the determination of DOC, the samples were filtered through a 0.45-μm cellulose nitrate membrane. The ultraviolet absorbance at 254 nm (UV254) was determined using a Shimadzu UV 1800 ENG120V spectrophotometer with a 1-cm optical path length, following method 5910 B of the Standard Methods (APHA et al. 2017). The specific ultraviolet absorbance (SUVA) was calculated as the ratio of UV254 to the DOC concentration.
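The SUVA calculation can be illustrated with a minimal sketch. It assumes the common convention in which UV254 is expressed in cm⁻¹ and DOC in mg C/L, so that multiplying the ratio by 100 gives SUVA in L/mg·m; the function name and the sample values are hypothetical and are not taken from the study data.

```python
def suva(uv254_per_cm: float, doc_mg_per_l: float) -> float:
    """Return SUVA in L/(mg*m) from UV254 (1/cm) and DOC (mg C/L)."""
    if doc_mg_per_l <= 0:
        raise ValueError("DOC must be positive")
    # The factor 100 converts absorbance per cm into absorbance per m.
    return 100.0 * uv254_per_cm / doc_mg_per_l


# Hypothetical sample: UV254 = 0.015 1/cm, DOC = 1.0 mg C/L
print(f"SUVA = {suva(0.015, 1.0):.2f} L/mg·m")
```

Under this convention, a sample is below the 2 L/mg·m level when its UV254 (in cm⁻¹) is less than 0.02 times its DOC (in mg C/L).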

Total THMs (TTHMs) were calculated as the sum of chloroform, bromoform, dibromochloromethane, and bromodichloromethane. These substances were determined following method 6040 D (APHA et al. 2017) using an Agilent 7890A gas chromatograph with an electron capture detector (ECD) and solid phase microextraction with a polydimethylsiloxane (PDMS) fiber. The THMs were quantified using a calibration curve of 6 standards in the range 0–10 µg/L (r2 > 0.995). Helium was used as the carrier gas (4 mL/min) with a ZB-624 capillary column (length: 105 m, ID: 0.53 mm, film thickness: 3.00 µm). The initial oven temperature was 35 °C and the final temperature was 250 °C, with a ramp of 5 °C/min. The detection limits of chloroform, bromoform, dibromochloromethane, and bromodichloromethane were 0.2 μg/L, 0.06 μg/L, 0.07 μg/L, and 0.06 μg/L, respectively; the corresponding quantification limits were 0.6 μg/L, 0.2 μg/L, 0.2 μg/L, and 0.2 μg/L.

In the field, pH was determined at all sampling points using Hanna HI 8–124 equipment and free chlorine was determined using a colorimeter (Pocket Colorimeter II, Hach) following the DPD method (N, N-diethyl-p-phenylenediamine). Turbidity and apparent color were determined in the laboratory in less than 24 h after sampling using 2100Q and DR900 equipment (both Hach). In all cases, the methods of the Standard Methods (APHA et al. 2017) or those recommended by the equipment manufacturers were followed.

Mathematical model development

The models were developed using the data from the water samples taken at the exit of the chlorinated water storage tank and in the distribution network of each system. The models were developed from four data sets according to the source water of the systems: (1) all sources, (2) spring, (3) surface, and (4) a mixture of surface and spring waters, referred to as mixed. Before the analysis, a random code was assigned to each sample, and with the help of Minitab 17 statistical software, each database was randomly divided into two groups: calibration data (70% of the total) and validation data (30% of the total). A similar procedure was reported by Golfinopoulos and Arhonditsis (2002) for the development of multivariate regression models for the prediction of THMs in a water treatment plant in Greece.
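The random 70/30 division described above can be sketched as follows. The study performed this step in Minitab 17; the Python version below is only an illustrative equivalent, and the function name, the fixed seed, and the sample codes are assumptions made for the example.

```python
import random


def split_calibration_validation(samples, calibration_fraction=0.7, seed=0):
    """Randomly split samples into calibration and validation subsets."""
    shuffled = list(samples)
    # A seeded generator keeps the (hypothetical) split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_cal = round(calibration_fraction * len(shuffled))
    return shuffled[:n_cal], shuffled[n_cal:]


sample_codes = list(range(1, 21))  # 20 hypothetical sample codes
calibration, validation = split_calibration_validation(sample_codes)
print(len(calibration), len(validation))  # prints "14 6"
```

Each sample ends up in exactly one subset, so the validation data play no part in fitting the models.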

Initially, the normality of TTHMs and of variables such as temperature, pH, turbidity, color, free residual chlorine, TOC, DOC, and UV254, reported by Sadiq et al. (2019) as potentially influential in the formation of THMs, was evaluated using the Anderson–Darling test (Ryan 2007). As discussed later, the variables presented a non-normal distribution, as shown in Table S1 (Online Resource 1); therefore, as recommended by Kargaki et al. (2020) for non-parametric data, the Spearman correlation test with a significance level (α) of 0.05 was used. Using this test, the Spearman correlation coefficient (rs) and its p-value were determined. Following the criteria that Chowdhury et al. (2008) applied to Pearson’s correlation coefficient in THM model development, in the present research an rs below 0.3 indicates a weak correlation, between 0.3 and 0.7 a moderate correlation, and greater than 0.7 a strong correlation. Furthermore, the correlation was considered statistically significant if the p-value was < 0.05 and non-significant otherwise.
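As a rough illustration of the correlation step, the sketch below computes the Spearman coefficient as the Pearson correlation of the ranks (with tied values given their average rank) and then applies the weak/moderate/strong thresholds quoted above. It does not compute the p-value, applying the thresholds to the absolute value of rs is an assumption, and the DOC and TTHM values are invented for the example.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """Spearman coefficient: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def strength(rs):
    """Classify |rs| using the 0.3 and 0.7 thresholds (assumed on |rs|)."""
    magnitude = abs(rs)
    if magnitude < 0.3:
        return "weak"
    if magnitude <= 0.7:
        return "moderate"
    return "strong"


doc = [0.5, 0.8, 1.1, 1.6, 2.0]      # hypothetical DOC, mg C/L
tthm = [4.0, 9.0, 7.0, 15.0, 22.0]   # hypothetical TTHMs, µg/L
rs = spearman_rs(doc, tthm)
print(round(rs, 3), strength(rs))
```

Because only the ranks enter the calculation, the coefficient is unaffected by the positive skew reported for the data.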

Multiple regression analysis was performed in the Minitab 17 statistical software program for the development of linear and non-linear models. TTHM concentrations were considered the dependent variable, while the other water quality parameters were considered the independent variables. Once the potential variables to include in the models were identified, the stepwise method was used, as recommended by Feungpean et al. (2015), to identify the variables that contribute significantly to the variability explained by the model. In the stepwise method, each variable is included or excluded by comparing the p-value of its F test against the alpha values to enter or leave the model, at a significance level of 0.05.

To find the model with the best performance and goodness-of-fit, linear and non-linear models were generated for each data set. Transformations were applied to the dependent and/or independent variables (e.g., square root, exponential, logarithmic) (Pardoe 2012). In all cases, data exclusion criteria were used, such as deleted studentized residuals greater than 3, high leverage points, Cook’s distance, and DFFITS (Acuña-Fernández 2004).

Subsequently, for the models obtained, the statistical assumptions were evaluated: normality, constant variance or homoscedasticity, and independence (Acuña-Fernández 2004) (Figs. S1–S4, Online Resource 1). In addition, to compare the performance of the models, the following statistics were analyzed: R2, adjusted R2, the significance of the model (F test), the Durbin–Watson statistic, standard error (SE; Eq. (1)), mean square error (MSE; Eq. (2)), and mean absolute error (MAE; Eq. (3)).

$$\mathrm{SE}=\sqrt{\frac{\sum_1^{\mathrm n}\left({\mathrm{TTHM}}_{\mathrm M}-{\mathrm{TTHM}}_{\mathrm P}\right)^2}{\mathrm n}}$$
(1)
$$\mathrm{MSE}=\frac1{\mathrm n}\sum_1^{\mathrm n}\left({\mathrm{TTHM}}_{\mathrm M}-{\mathrm{TTHM}}_{\mathrm P}\right)^2$$
(2)
$$\mathrm{MAE}=\frac{\sum_1^{\mathrm n}\left|{\mathrm{TTHM}}_{\mathrm M}-{\mathrm{TTHM}}_{\mathrm P}\right|}{\mathrm n}$$
(3)

where $\mathrm{TTHM}_{\mathrm M}$ indicates the measured TTHMs, $\mathrm{TTHM}_{\mathrm P}$ indicates the TTHMs predicted by the models, and n is the number of observations evaluated. SE and MAE are expressed in µg/L, the TTHM units, whereas MSE is expressed in (µg/L)².
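Equations (1) to (3) translate directly into code. The sketch below is a plain-Python rendering in which the function name and the measured and predicted values are hypothetical; note that SE as defined in Eq. (1) is the square root of MSE.

```python
from math import sqrt


def error_metrics(measured, predicted):
    """Return SE (Eq. 1), MSE (Eq. 2), and MAE (Eq. 3) as a dict."""
    n = len(measured)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [m - p for m, p in zip(measured, predicted)]
    mse = sum(r * r for r in residuals) / n
    mae = sum(abs(r) for r in residuals) / n
    return {"SE": sqrt(mse), "MSE": mse, "MAE": mae}


measured = [10.0, 20.0, 30.0]   # hypothetical measured TTHMs, µg/L
predicted = [12.0, 18.0, 33.0]  # hypothetical predicted TTHMs, µg/L
for name, value in error_metrics(measured, predicted).items():
    print(f"{name} = {value:.3f}")
```

For these three hypothetical pairs the residuals are -2, 2, and -3 µg/L, so MSE = 17/3 and MAE = 7/3.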

Models’ validation and applicability

The best model obtained for each data set was validated using the data excluded from model calibration (30% of the total data). For validation, predicted and measured TTHMs were compared using the criteria R2, SE, and MSE (Shahi et al. 2020). In addition, as in that study, a T-test was performed to determine whether there was a significant difference between the mean of the measured TTHMs and the mean of those predicted by the models. First, a test of equal variances was performed to determine whether equal variance could be assumed in the T-test. Next, the T-test was performed by calculating the t-value and its p-value. If the p-value was > 0.05, the difference between the measured and predicted values was considered non-significant; otherwise, it was considered significant.
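The t-value used in this comparison can be sketched as below, assuming an independent two-sample test (which the preceding test of equal variances suggests, although the exact Minitab settings are not stated). The function returns the t statistic and its degrees of freedom for either the pooled-variance or the unequal-variance (Welch) form; obtaining the p-value requires the t distribution from a statistics package or table and is left out. All names and values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(a, b, equal_variance=True):
    """Return (t statistic, degrees of freedom) for two independent samples."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    diff = mean(a) - mean(b)
    if equal_variance:
        # Pooled-variance form, used when equal variances can be assumed.
        pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
        return diff / sqrt(pooled * (1 / na + 1 / nb)), na + nb - 2
    # Welch form with Welch-Satterthwaite degrees of freedom.
    se2 = va / na + vb / nb
    dof = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return diff / sqrt(se2), dof


measured = [10.0, 20.0, 30.0]   # hypothetical measured TTHMs, µg/L
predicted = [12.0, 18.0, 33.0]  # hypothetical predicted TTHMs, µg/L
t_value, dof = two_sample_t(measured, predicted)
print(f"t = {t_value:.3f}, degrees of freedom = {dof}")
```

A t-value close to zero, as in this example, corresponds to a large p-value and therefore to no significant difference between the measured and predicted means.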

Results and discussion

Water quality parameters

Table 1 presents the main characteristics of the treated/chlorinated water of the 19 systems. In general, the water quality was maintained from the outlet of the chlorinated water storage tank to the end of the network. The temperature range is typical for tropical countries and the pH values were close to 7. The turbidity and color of all samples were relatively low, indicating efficient treatment and/or good source water quality. Similarly, in most cases, TOC and DOC were quite low. Moreover, UV254 indicates a low presence of humic substances, and SUVA, in most cases less than 2 L/mg·m, suggests non-humic NOM and low molecular weight aliphatic compounds. Furthermore, only slight seasonal variation was found in the NOM-related water parameters (Fig. S5, Online Resource 1).

Table 1 Water characteristics of the data sets

The low values of the above NOM-related parameters explain the low concentrations of TTHMs, with only two samples slightly exceeding the 80 μg/L regulated by the US EPA (US EPA 1998), despite the relatively high free chlorine (within the local regulation, i.e., 0.3 to 0.6 mg/L). Moreover, chloroform, even though at low concentration (10.60 ± 13.86 μg CHCl3/L), accounted for around 62% of the THM species in most of the samples. In addition, the species CHBrCl2, CHBr2Cl, and CHBr3 were frequently found, but at much lower concentrations (i.e., < 2 μg/L). Such speciation of THMs has been reported in other studies (Sérodes et al. 2003). In general, for all the parameters (except pH and free residual chlorine), surface water values were at least double those of spring water, and the values of the mixed and whole data sets were in between. That is expected, as surface water is highly influenced by allochthonous and autochthonous production, and the effect is also observed in the whole and the mixed water data sets. Furthermore, the higher concentration of precursors (e.g., TOC, UV254) is reflected in a higher THM concentration.

Correlation of independent variables with THMs in treated water

The Anderson–Darling statistical test (Ryan 2007) showed that the dependent variable (TTHM concentration) and most of the independent variables presented a non-normal distribution across all data sets (p-value < 0.05) (Table S1, Online Resource 1). This is expected because the data come from systems with different operational characteristics. The data presented a positively skewed distribution, which is characterized by a large amount of data in the low ranges of the parameter compared to the higher ranges. Therefore, to evaluate the correlation between the variables, Spearman’s non-parametric test was used (Kurajica et al. 2020).

Temperature and pH showed non-significant and weak correlations (p-value > 0.05, rs < 0.3) in all data sets (Table 2), which is expected because both parameters were relatively stable (Table 1). This differs from the results reported by Al-Tmemy et al. (2018) for treated water from five treatment plants in Iraq, where a significant and moderate correlation was found for both parameters. According to those authors, an increase in temperature tends to increase the reaction rate between organic matter and chlorine, and the THM concentrations increase with pH because many hydrolysis reactions, which occur in basic medium, promote their formation.

Table 2 Spearman correlation between TTHMs and the independent variables

Turbidity presented a weak correlation in all data sets (rs < 0.3), which was significant (p-value < 0.05) only in the whole data set and the surface water data set (Table 2). Tsitsifli and Kanakoudis (2020) reported a greater correlation between turbidity and TTHMs (r = 0.553) for two treatment plants using surface sources. Regarding apparent color, a low and significant positive correlation was observed in the surface water data set; in the others, the correlation was not significant (Table 2). Abdel Azeem et al. (2014) reported that the Pearson correlation coefficient between THMs and color was between 0.87 and 0.93 for treated water at four treatment plants in Egypt.

Free residual chlorine showed a significant correlation in the whole data set and in the spring and mixed water data sets (Table 2). In addition, the correlation was moderate and positive in all data sets. In contrast, some authors reported negative correlations between this parameter and TTHMs (Feungpean et al. 2015; Kumari and Gupta 2015). This inverse correlation can be attributed to radial diffusion and wall consumption of residual chlorine while THMs form (Kumari and Gupta 2015). However, similar to the present study, positive and significant correlations have been attributed to the covariance of operational parameters or interactions between parameters (Salam et al. 2020).

The NOM indicators TOC and DOC presented a moderate, positive (0.3 < rs < 0.7), and significant (p-value < 0.05) correlation in all data sets (Table 2), which agrees with the correlation values between 0.47 and 0.57 reported by several authors (Kumari and Gupta 2015; Shahi et al. 2020). Considering that chlorine reacts with NOM to produce THMs, the trend is that as TOC and DOC increase, the concentration of THMs increases, as long as sufficient free residual chlorine is available (Kumari and Gupta 2015). UV254 also presented a significant and moderate positive correlation in the whole data set and the surface water data set; however, in the other data sets, the correlation was weak and not significant. Similarly, significant and moderate correlations between UV254 and THMs were reported by other researchers (Semerjian et al. 2009; Kumari and Gupta 2015). Finally, SUVA presented a significant but low negative correlation only in the mixed water data set (Table 2). Other studies have reported low and negative, but not significant, correlations for SUVA (Babaei et al. 2015).

Modeling THM formation within the distribution system

As shown in Table 3, linear, logarithmic, and exponential models were developed for each type of water. All models were significant (p-value < 0.05 in the F test), and in most cases, the Durbin–Watson value was between 1.5 and 2.5, the range recommended in the literature to avoid autocorrelation problems (Tsitsifli and Kanakoudis 2020). The models presented a wide range of adjusted R2, from 0.132 to 0.687, indicating varied performance and fit to the data.

Table 3 TTHM predictive models for various data sets

The most appropriate models (in bold in Table 3) were selected based not only on the coefficient of determination but also on the error-related statistics (i.e., SE, MSE, MAE). For the whole, spring, and mixed water data sets, models 1, 4, and 10, respectively, presented the lowest values of SE, MSE, and MAE and were selected although they presented a slightly lower R2. However, in these models, the R2 values of 0.448, 0.657, and 0.531, respectively (Table 3), remain satisfactory and comparable to those reported by several authors (Babaei et al. 2015; Feungpean et al. 2015; Tsitsifli and Kanakoudis 2020). In the surface water data set, model 7 presented the lowest values of SE, MSE, and MAE, and the highest value of R2 (Table 3). Therefore, models 1, 4, 7, and 10, all linear, were selected as the ones with the best performance and goodness-of-fit. Among those models, the greatest goodness-of-fit is observed for spring water (of higher quality), followed by the mixed water data set and then the whole data set, with the lowest performance for the surface water data set. In general, those models can be considered moderately robust and could be improved by including some parameters and operational variables that affect the formation of THMs in distribution networks (e.g., bromide ion, contact time, chlorine dose) (Nikolaou et al. 2004).

Through a more detailed analysis of each of the chosen models, it can be determined which are the most influential variables in the formation of THMs by type of water source. Thus, model 1, similar to models reported by Kumari and Gupta (2015), includes the variables pH, free residual chlorine, DOC, and UV254. In the case of the spring water data set, model 4, free residual chlorine, DOC, and turbidity were included; the latter variable has also been used in THM prediction models (Al-Tmemy et al. 2018). Finally, in the surface and mixed water data sets, models 7 and 10, free residual chlorine and organic matter content such as DOC and TOC, respectively, are observed as influential.

Validation of THM models

Table 4 presents the validation results, R2, SE, MSE, and MAE, as well as the results of the T-test for each model. The values of R2 were between 0.359 and 0.772, which demonstrated a satisfactory level of explanation of the observed variability and are comparable with those reported by Golfinopoulos and Arhonditsis (2002) (i.e., 0.37 to 0.54). Similar to the calibration phase, SE, MSE, and MAE results showed that models 4 and 10 (spring and mixed water, respectively) performed better. Also, the bias of the four models determined by a T-test (Shahi et al. 2020) indicated no statistically significant difference between the predicted and measured average values (p-value > 0.05; Table 4). Furthermore, Fig. 2 shows that most of the data are within the prediction interval for all the models. In the case of the whole data set and surface water (Fig. 2a and c) the data tend to move away from the line of best fit above 30 µg/L. In the case of the models for spring water and the mixed water (Fig. 2b and d), with lower TTHM concentrations, the data tend to distribute more evenly. Therefore, these models seem to perform better at TTHM concentrations lower than 30 µg/L.

Table 4 Validation of proposed models for the prediction of TTHMs in the distribution systems
Fig. 2 Validation of models and comparison of measured vs. predicted TTHM concentrations for (a) model 1, (b) model 4, (c) model 7, and (d) model 10. CI confidence interval, PI prediction interval

Conclusions

Several TTHM models were developed for chlorinated water in rural and semi-urban areas of tropical Costa Rica. The TTHM concentrations ranged between < 0.20 and 91.31 μg/L, with chloroform (CHCl3) accounting on average for 62% of the total. Depending on the data set, several parameters, including turbidity, total organic carbon (TOC), dissolved organic carbon (DOC), free residual chlorine, and ultraviolet absorbance at 254 nm (UV254), presented a significant correlation with TTHMs (p-value < 0.05). Four linear models presented the best goodness-of-fit and were moderately robust. In the validation stage, according to the criteria of R2, standard error (SE), mean square error (MSE), and mean absolute error (MAE), the spring water and mixed spring/surface water models showed a satisfactory level of explanation of the variability of the data. Moreover, all the models seem to better predict TTHM concentrations below 30 µg/L. Therefore, considering the specific characteristics of the chlorinated water (low NOM and low TTHMs produced), the models developed could be useful for decision-making in drinking water supply systems.