Introduction

As one of the fastest developing countries, India holds only 4% of global potable water resources while supporting around 17% of the earth’s population (Chakraborty 2017). The major sources of drinking water supply in India are surface reservoirs, which makes providing potable water even more challenging because of microbiological contamination (Mahato and Gupta 2020; Marais et al. 2019). Chlorine remains the predominant disinfectant, and its reaction with natural organic matter (NOM) forms trihalomethanes (THMs) (Fig. 1) (Al-Tmemy et al. 2018; Mahato et al. 2019; Hur et al. 2014; Hong et al. 2008). These compounds, which include chloroform (CHCl3) (CF), bromoform (CHBr3) (BF), dibromochloromethane (CHBr2Cl) (DBCM), and bromodichloromethane (CHBrCl2) (BDCM), are of great concern because they are classified as potential human carcinogens (Padhi et al. 2019; Li and Mitch 2018). Previous findings showed that the concentration of THMs (231–511 μg/l) in Indian drinking water distribution systems is greatly influenced by seasonal and spatial variations (Kumari and Gupta 2015, 2018; Mishra and Dixit 2013; Thacker et al. 2002). Similarly, wide fluctuations in THMs levels were recorded in the water supplies of other countries, such as Pakistan (575–595 μg/l) (Abbas et al. 2015), Japan (378 μg/l) (Imo et al. 2007), Canada (137.8–141 μg/l) (Milot et al. 2000; Rodriguez et al. 2003a), Turkey (96–102 μg/l) (Uyak et al. 2005), and China (92.77 μg/l) (Ye et al. 2011).

Fig. 1 Reaction mechanism of CHCl3 formation in chlorinated drinking water

These variations also arise from changes in precursor (NOM) concentration and other operational parameters (pH, residual chlorine [RC], temperature) (Padhi et al. 2019; Kumari and Gupta 2015; Rathbun 1996). The Bureau of Indian Standards (BIS) (BIS 2012) set permissible limits for individual THMs, i.e., CF (200 μg/l), BDCM (60 μg/l), DBCM (100 μg/l), and BF (100 μg/l), which are similar to those proposed by the World Health Organization (WHO) except for CF (300 μg/l) (Cotruvo 2017). The United States Environmental Protection Agency (USEPA) established a guideline value (80 μg/l) only for total THMs (TTHMs) (USEPA 2018).

Monitoring THMs throughout the treatment process is vital for quality control and for ensuring compliance with regulatory standards. Predictive models have proven to be an effective and rapid approach (Rodriguez et al. 2000). Usually, predictive models establish an empirical or mechanistic relationship between the level of THMs and the operational parameters (Ye et al. 2011; Di Cristo et al. 2013), whereas some models are based on statistical regression equations and describe the kinetics of THMs formation (Rathbun 1996; Rodriguez et al. 2000; Elshorbagy et al. 2000; Sadiq and Rodriguez 2004; Milot et al. 2000; Amy et al. 1987). Previous findings demonstrated that the artificial intelligence (AI)-based modeling approach can provide greater prediction accuracy than the conventional multiple linear regression (MLR) model, even with small data sets (Peleato et al. 2018; Kulkarni and Chellam 2010; Uyak et al. 2005). However, the application of an artificial neural network (ANN) and its comparative assessment with support vector machine (SVM) and MLR models to predict THMs in drinking water have not been explored earlier. It was also noticed that most studies were done on laboratory-generated simulated water, which differs from the water of actual drinking water utilities (Ye et al. 2011; Milot et al. 2000). Hence, our study emphasizes models based on real water collected from different water treatment plants (WTPs) located in various regions of India. The objectives of the study are to (1) develop AI-based THMs models from field-scale real data, (2) compare these machine learning approaches with the conventional MLR model, and (3) investigate the correlation of various operational parameters with THMs formation.

The study was carried out during the pre-monsoon (PrM) (April to June) and post-monsoon (PoM) (October to December) seasons of 2016–2018 in five states of India, i.e., Jharkhand, Uttar Pradesh, Chhattisgarh, Orissa, and West Bengal.

Materials and methods

Sampling protocol

Five major drinking water utilities, located in cities of Jharkhand and four contiguous states, were considered for this study, i.e., (1) Water Treatment Plant, Belatand, Dhanbad, Jharkhand (DWTP), (2) Water Treatment Plant, Bhelupur, Varanasi, Uttar Pradesh (VWTP), (3) Water Treatment Plant, Ravanbhata, Raipur, Chhattisgarh (RWTP), (4) Water Treatment Plant, Palasuni, Bhubaneshwar, Orissa (BWTP), and (5) Indira Gandhi Water Treatment Plant, Barrackpore, West Bengal (IGWTP). Triplicate samples of raw (intake) and treated water (supply water) from these WTPs were collected during the PrM and PoM seasons of 2016–2018. A total of 150 samples were first analyzed to establish THMs levels. The description and location details of the study area are given in Table 1 and Fig. 2, respectively. These utilities follow conventional water treatment processes comprising coagulation–flocculation, sedimentation, sand filtration, and chlorination.

Table 1 General characteristics of utilities under the study
Fig. 2 Location details of drinking water utilities selected for the study

Analytical method

Physicochemical parameters were monitored as per the standard protocols of APHA (2012). Total organic carbon (TOC) and dissolved organic carbon (DOC) (sample filtered through a 0.45 μm filter) were analyzed with a TOC analyzer (TOC-L CSH; Make: Shimadzu, Japan). Specific ultraviolet absorption (SUVA), an indicator of the aromatic character of NOM, was determined as the ratio of UV254 to DOC concentration, expressed as L mg−1 m−1. The concentration of THMs was determined by USEPA method 552.1 using a combination of liquid–liquid extraction and a gas chromatograph with electron capture detector (GC-ECD) (Thermo Fisher, CERES 800 plus) (Hodegeson 1990). The GC-ECD conditions used for analysis are given in Table 2.
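The SUVA calculation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that UV254 is reported in cm−1 and DOC in mg/l, so that a factor of 100 converts the result to L mg−1 m−1; the function name and the example values are hypothetical and are not taken from the study.

```python
def suva(uv254_per_cm: float, doc_mg_per_l: float) -> float:
    """Specific UV absorbance in L mg^-1 m^-1.

    Assumes UV254 in cm^-1 and DOC in mg/l (an assumption, not stated
    in the text); the factor 100 converts cm^-1 to m^-1.
    """
    if doc_mg_per_l <= 0:
        raise ValueError("DOC must be positive")
    return 100.0 * uv254_per_cm / doc_mg_per_l


# Hypothetical sample: UV254 = 0.08 cm^-1, DOC = 2.5 mg/l
example_suva = suva(uv254_per_cm=0.08, doc_mg_per_l=2.5)
```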

Table 2 Operating conditions for analysis of THM through the GC

Quality assurance and quality control (QA/QC)

To ensure the consistency of the analytical results, a blank sample was prepared and analyzed to check for background contamination. Each sample was injected in triplicate for measurement precision, and the average was taken as the final value. If the relative percent difference between two samples exceeded ±10%, the instrument was considered out of calibration and was recalibrated.
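The recalibration check can be illustrated as follows. The text does not give the formula for the relative percent difference, so this sketch assumes the conventional definition (absolute difference divided by the mean of the two readings); the function names and the 10% default are illustrative only.

```python
def relative_percent_difference(a: float, b: float) -> float:
    """Relative percent difference between two readings.

    Assumed definition: |a - b| divided by the mean of a and b, times 100.
    """
    mean = (a + b) / 2.0
    if mean == 0:
        raise ValueError("mean of the two readings is zero")
    return 100.0 * abs(a - b) / mean


def needs_recalibration(a: float, b: float, limit_percent: float = 10.0) -> bool:
    """True when the two readings differ by more than the allowed limit."""
    return relative_percent_difference(a, b) > limit_percent
```

For example, readings of 100 and 110 μg/l differ by about 9.5% and would pass, whereas 100 and 120 μg/l differ by about 18% and would trigger recalibration.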

Modeling approach

For this study, two machine learning techniques, viz., ANN and SVM, were employed to predict THMs formation and were compared with the conventional MLR model. A set of five water quality parameters, namely pH, temperature, RC, TOC, and UV254, were used as independent variables and total THMs (TTHMs) as the dependent variable. To maintain measurement accuracy, triplicate samples were collected at each sampling event, and the average of the three readings was reported as a single observed value for each parameter. The 150 triplicate samples from the various WTPs were thus averaged into 50 observations, which served as input data for model development and validation: 60% of the data set was used for modeling and 40% for validation, i.e., the subsets had the dimensions of 30 samples × 5 independent variables × 1 dependent variable and 20 samples × 5 independent variables × 1 dependent variable, respectively. To meet the algorithm requirements and facilitate network learning, data normalization is essential before starting the training process. There are many methods for normalizing input data, such as external normalization, along channel, across channel, and mixed channel. In the present work, the input data were normalized using the min–max normalization method as stated in Eq. (1). This method has the advantage of preserving exactly all relationships in the data while rescaling the raw values to the range of 0–1 for better prediction.

$$ {\text{Normalized data}} = \frac{L - {\text{Min}}}{{\text{Max}} - {\text{Min}}} $$
(1)

where L is the raw value, Max and Min are the maximum and minimum of raw values, respectively.
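The normalization of Eq. (1) and the 60/40 division of the 50 observations can be sketched with NumPy. The data below are random placeholders, not measurements from the study, and shuffling the observations before the split is an assumption, since the text does not say how the two subsets were selected.

```python
import numpy as np


def min_max_normalize(values):
    """Rescale each column to the range 0-1, as in Eq. (1)."""
    values = np.asarray(values, dtype=float)
    col_min = values.min(axis=0)
    col_max = values.max(axis=0)
    span = col_max - col_min
    if np.any(span == 0):
        raise ValueError("a constant column cannot be min-max normalized")
    return (values - col_min) / span


# Placeholder data standing in for the 50 averaged observations of the
# five predictors (pH, temperature, RC, TOC, UV254). Not real data.
rng = np.random.default_rng(0)
X = rng.uniform(low=0.0, high=10.0, size=(50, 5))

X_norm = min_max_normalize(X)

# 60% for model development, 40% for validation.
# The random shuffle is an assumption of this sketch.
n_train = (len(X_norm) * 60) // 100
order = rng.permutation(len(X_norm))
X_train = X_norm[order[:n_train]]
X_val = X_norm[order[n_train:]]
```

Normalizing each column separately keeps predictors with large numeric ranges from dominating those with small ranges during training.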

The ANN is inspired by the complex biological neural systems of the human brain and has certain theoretical advantages over the conventional modeling approach (MLR). It arises from the field of artificial intelligence and consists of several layers of processing elements, or nodes (neurons) (Rodriguez et al. 2003b; Singh and Gupta 2012). These neurons are arranged in an input layer that receives the input signal, one or more hidden layers that actively process the information, and an output layer that delivers the network response (Fig. 3). Elements of different layers are highly interconnected by weighted links through which information passes. The number of elements in the input and output layers depends mainly on the number of input and output variables in the specific problem to be solved. In the present work, a three-layer ANN was implemented with the backpropagation algorithm in Python (3.7.1) using the Jupyter Notebook integrated development environment (IDE) with the Sklearn library. The backpropagation algorithm has demonstrated several advantages, including the potential to determine networks with arbitrary mapping projections (Cook and Wolfe 1991; Rodriguez et al. 2003b). Hence, this algorithm was used for supervised learning, with the hyperparameters, viz., learning rate (LR) and momentum term (MT), varied to yield the best convergence. Moreover, the logistic relu activation function was used to activate the hidden and output layers. The number of nodes in the hidden layer of the optimal neural network was determined by trial-and-error optimization of the hyperparameters (Azadi and Karimi-Jashni 2016). The input layer consists of five nodes, i.e., pH, Temp., RC, TOC, and UV254, whereas the output layer has one node, TTHMs. Practical applications of ANNs require the correct selection of LR and MT to separate noisy data and avoid over-fitting. The degree of correlation (R2) and MSE at various LR and MT are given in Table 3. A maximum of 10,000 iterations was performed to achieve the optimum network. The in-depth theory and mathematical details of learning and parameter estimation are explained in the previous literature (Peleato et al. 2018; Singh and Gupta 2012).
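A three-layer network of this kind can be sketched with scikit-learn's MLPRegressor. This is an illustration under stated assumptions, not the configuration used in the study: the hidden-layer size, learning rate, and momentum values are hypothetical, the data are synthetic placeholders, and MLPRegressor always applies an identity activation at the output layer, so only the hidden layer uses the chosen activation here.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic placeholder data: 50 observations, 5 normalized predictors.
rng = np.random.default_rng(42)
X = rng.uniform(size=(50, 5))
y = X @ np.array([0.1, 0.2, 0.2, 0.4, 0.1]) + rng.normal(scale=0.01, size=50)

ann = MLPRegressor(
    hidden_layer_sizes=(8,),   # one hidden layer; 8 nodes is a hypothetical choice
    activation="relu",         # hidden-layer activation
    solver="sgd",              # gradient descent with backpropagated gradients
    learning_rate_init=0.01,   # learning rate (LR), hypothetical value
    momentum=0.9,              # momentum term (MT), hypothetical value
    max_iter=10000,            # iteration cap matching the text
    random_state=0,
)

# 30 observations for training, 20 for validation.
ann.fit(X[:30], y[:30])
ann_pred = ann.predict(X[30:])
```

In a trial-and-error search such as the one described above, the same fit would be repeated over a grid of LR and MT values, and the combination with the best validation R2 and lowest MSE would be retained.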

Fig. 3 Basic structure of the ANN model

Table 3 Degree of correlation and MSE at various MT and LR

SVM is a well-known supervised machine learning technique based on structural risk minimization (SRM), a principle of statistical learning theory (Singh and Gupta 2012; Vapnik 2013). It acts as a binary classifier that finds the maximal-margin hyperplane between two classes. In this approach, the original data points from the input space are mapped into a high- or even infinite-dimensional feature space using a suitable kernel function (a class of algorithms for pattern analysis), where the hyperplane is constructed. It can deal with a large number of features to find the optimal hyperplane from which the distance to all the data points is minimum, reducing the model dimensions and estimated errors. The theory and mathematical concept of the SVM model are described in detail by Haykin (1999). The SVM was implemented in MATLAB (9.5).
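For readers who want a comparable model in Python, the regression form of SVM can be sketched with scikit-learn's SVR. This swaps in scikit-learn for the MATLAB implementation used in the study; the kernel choice and the values of C and epsilon are hypothetical, and the data are synthetic placeholders.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic placeholder data: 50 observations, 5 normalized predictors.
rng = np.random.default_rng(7)
X = rng.uniform(size=(50, 5))
y = X @ np.array([0.1, 0.2, 0.2, 0.4, 0.1]) + rng.normal(scale=0.01, size=50)

# RBF kernel maps the inputs into a high-dimensional feature space.
# C and epsilon are illustrative values, not those of the study.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.01)
svr.fit(X[:30], y[:30])
svr_pred = svr.predict(X[30:])
```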

MLR extends simple linear regression and is used in various research fields to establish the strength of a linear relationship between a set of independent variables and a dependent variable (Rodriguez et al. 2003b). This relationship can be described by the following equation (Rodriguez et al. 2003b).

$$ Y = \beta_{o} + \sum\limits_{i = 1}^{m} {\beta_{i} X_{i} } $$
(2)

where Y and Xi represent the dependent and independent variables, respectively, m denotes the number of independent variables considered, and βo and βi are the intercept and partial slope coefficients, respectively, which together provide the prediction of Y. In this approach, predictor variables were first ranked according to their statistical significance and then included one variable at a time in successive steps. The MLR model is fitted by the ordinary least squares (OLS) method, which minimizes the sum of squared vertical distances between the observed data points and the fitted line (Neter et al. 1990). SPSS (IBM, 21.0) was used to implement the MLR model.
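The OLS fit behind Eq. (2) can be illustrated with NumPy. This sketch fits all predictors at once and does not reproduce the stepwise variable selection or the SPSS workflow described above; the function names and the demonstration data are hypothetical.

```python
import numpy as np


def fit_mlr(X, y):
    """Ordinary least squares fit of Y = b0 + sum(b_i * X_i).

    Returns the intercept b0 and the array of slope coefficients b_i.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]


def predict_mlr(intercept, slopes, X):
    """Evaluate Eq. (2) for new observations."""
    return intercept + np.asarray(X, dtype=float) @ slopes


# Noise-free demonstration data with known coefficients (not study data).
rng = np.random.default_rng(1)
X_demo = rng.uniform(size=(30, 5))
true_slopes = np.array([1.0, -1.0, 0.5, 3.0, 0.0])
y_demo = 2.0 + X_demo @ true_slopes

b0, b = fit_mlr(X_demo, y_demo)
```

Because the demonstration data contain no noise, the fit recovers the intercept and slopes used to generate them, which is a quick check that the design matrix is set up correctly.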

Sensitivity analysis

Sensitivity analyses were carried out using various statistical metrics, namely R2 (predicted vs. observed), root mean square error (RMSE), mean absolute percentage error (MAPE), and index of agreement (IA), to evaluate the performance of the developed models. The equations for MAPE, RMSE, and IA are given in Eqs. (3), (4), and (5), respectively.

$$ {\text{MAPE}} = \frac{100}{n}\mathop \sum \limits_{i = 1}^{n} \left| {\frac{{Y_{i} - \hat{Y}_{i} }}{{Y_{i} }}} \right| $$
(3)
$$ {\text{RMSE}} = \sqrt {\frac{1}{n}\sum\limits_{i = 1}^{n} {\left( {\hat{Y}_{i} - Y_{i} } \right)^{2} } } $$
(4)

where Yi and Ŷi are the observed and predicted value of TTHMs, and n is the number of samples.

$$ {\text{IA}} = 1 - \frac{{\sum\nolimits_{i = 1}^{N} {(O_{i} - P_{i} ){}^{2}} }}{{\sum\nolimits_{i = 1}^{N} {\left[ {\left| {O_{i} - O_{m} } \right| + \left| {P_{i} - P_{m} } \right|} \right]^{2} } }} $$
(5)

where Oi and Pi are the observed and predicted TTHMs concentrations, N is the number of samples tested, and Om and Pm are the means of the observed and predicted TTHMs concentrations.
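Eqs. (3), (4), and (5) translate directly into NumPy. The sketch below follows the equations as written in this section, including the use of the predicted mean Pm in the IA denominator; the function names and the three-point example are illustrative and are not data from the study.

```python
import numpy as np


def mape(observed, predicted):
    """Mean absolute percentage error, Eq. (3). Observed values must be non-zero."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 100.0 / len(observed) * np.sum(np.abs((observed - predicted) / observed))


def rmse(observed, predicted):
    """Root mean square error, Eq. (4)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((predicted - observed) ** 2))


def index_of_agreement(observed, predicted):
    """Index of agreement, Eq. (5), using Om and Pm as defined in the text."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    o_mean = observed.mean()
    p_mean = predicted.mean()
    numerator = np.sum((observed - predicted) ** 2)
    denominator = np.sum((np.abs(observed - o_mean) + np.abs(predicted - p_mean)) ** 2)
    return 1.0 - numerator / denominator


# Illustrative values in ug/l (hypothetical)
obs = [100.0, 200.0, 300.0]
pred = [110.0, 190.0, 300.0]
```

For these three points, MAPE is 5% and RMSE is about 8.2 μg/l, while IA stays close to unity because the errors are small compared with the spread of the data.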

Results and discussion

Concentration range of THMs species at five water utilities

The descriptive statistics of all the THMs species under the present study are given in Table 4. During this investigation, the highest concentration of TTHMs was found in VWTP in both seasons. This may be attributed to differences in THMs precursor content, RC, temperature, and other operational parameters (Padhi et al. 2019; Rathbun 1996). It may also be greatly affected by the geographical distribution and climatic conditions of the WTPs. The observed range of TTHMs is consistent with the results obtained by Kumari and Gupta (2015) in their study of various water utilities situated in the eastern part of India. Similarly high TTHMs concentrations were also reported in other countries, such as Pakistan (575–595 μg/l) (Abbas et al. 2015) and Japan (378 μg/l) (Imo et al. 2007). Throughout the study, CF was the predominant compound among the four THMs species and surpassed the WHO (300 μg/l) and BIS (200 μg/l) drinking water guideline values. The other two detected species (BDCM and DBCM) were well within the BIS and WHO standards, i.e., 60 and 100 μg/l, respectively. BF was not detected in any of the water utilities because bromide ions were below the detectable limit (BDL) (<0.1 mg/l). Source water with BDL bromide ions forms more chlorinated THMs than brominated THMs (Barrett et al. 2000; Nikolaou et al. 1999; Chowdhury et al. 2011; Imo et al. 2007; Lebel and Williams 1995), which can also be seen in the present study.

The percentage distribution of THMs species in the various utilities and the GC chromatograms (Fig. 4a, b) illustrate that CF accounted for more than 90% of TTHMs, followed by BDCM and DBCM. More than 90% of the THMs in chlorinated drinking water supplies typically consist of CF, while BDCM and DBCM contribute 2.1–14%. This observation agrees well with Zhang et al. (2011), who reported that CF contributed up to 94% of total THMs in 13 WTPs of China.

Table 4 Descriptive statistics of THM species at five utilities
Fig. 4 a Percentage distribution of THMs species and b representative chromatogram of THMs of VWTPs

Periodic fluctuations of THMs and their precursors in processed water

Periodic fluctuations in TTHMs and their precursors observed in all the water utilities are shown in Fig. 5a–d. Substantial variation was observed in the mean values of these species. The concentration of TTHMs was 1.12 ± 0.074 times higher in PrM than in PoM. The TTHMs were in the order of VWTP followed by DWTP, IGWTP, RWTP, and BWTP for PrM, while in PoM, it was again VWTP followed by IGWTP, BWTP, RWTP, and DWTP. Organic content (TOC, UV254) and temperature also appeared to be higher during PrM, which may favor higher THMs formation (Nikolaou et al. 1999). According to Rodriguez and Serodes (2001), rates of chlorine decay are high at elevated temperatures. Hence, higher doses of chlorine are required for treatment in this season; the chlorine ultimately reacts with the available NOM, producing more THMs (Uyak et al. 2008) in processed water. Besides, high organic content in water also requires a higher chlorine dose (Rodriguez and Serodes 2001). Temperature and NOM were slightly lower during PoM, resulting in lower chlorine demand (Rodriguez et al. 2003b); thus, comparatively less THMs formed in this season. These observations agree well with the findings of Wei et al. (2010), Rodriguez et al. (2004), and Toroz and Uyak (2005) for drinking water distribution systems.

Fig. 5 Box and whisker plot of periodic fluctuations of a TTHMs, b TOC, c UV254 and d temperature at various utilities

Correlation analysis

To investigate the effects of NOM and other operational parameters on THMs formation, a Pearson correlation matrix was established (Kumari and Gupta 2015) (Table 5a, b).

Table 5 Pearson correlation matrix of variables with TTHMs
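A Pearson correlation matrix of this kind can be computed with NumPy's corrcoef. The values below are hypothetical and serve only to show the mechanics; TTHMs is constructed to be perfectly linear in TOC, so that pair correlates at exactly 1, which real measurements would not do.

```python
import numpy as np

# Hypothetical values for illustration only; not data from the study.
toc = np.array([2.0, 2.5, 3.1, 3.8, 4.4])                # mg/l
temperature = np.array([30.0, 24.0, 29.0, 25.0, 31.0])   # deg C
tthm = 100.0 * toc + 50.0                                 # linear in TOC by construction

variable_names = ["TOC", "Temp", "TTHMs"]
data = np.column_stack([toc, temperature, tthm])

# rowvar=False treats each column as one variable, so the result is a
# 3 x 3 matrix whose entry [i, j] is the Pearson r between columns i and j.
r = np.corrcoef(data, rowvar=False)
```

Reading the last row or column of the matrix gives the correlation of each variable with TTHMs, which is how a table such as Table 5 is typically laid out.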

Effect of NOM (TOC, DOC, and UV254)

TOC, DOC, and UV254 are essential surrogate measures of NOM, which acts as a key precursor for THMs formation (Padhi et al. 2019; Li and Mitch 2018; Sung et al. 2000). The Pearson correlation test confirmed that all these surrogates are strongly and significantly correlated with TTHMs and with each other. The THM formation rate equals the rate of TOC consumption; thus, an increase in the organic content of water raises the formation of THMs (Chang et al. 2001; Hasani et al. 2010; Arora et al. 1997). It was reported previously that a water sample with high TOC can produce more THMs if enough RC is available (Babcock and Singer 1979). DOC constitutes approximately 83–98% of TOC in water and is generally more representative of the soluble organic carbon than TOC (Owen et al. 1993). The strong and significant correlation between TOC and DOC in this study also supports this observation. Thus, with respect to THMs formation, DOC follows the same trend as TOC (Westerhoff et al. 2000; Müller 1998). UV254 is another key surrogate of NOM after TOC and DOC; it provides insight into the nature of the organic content that is liable to form THMs (Edzwald et al. 1985). The correlation coefficient of TOC with TTHMs was slightly higher than those of DOC and UV254, indicating that TOC is the more influential parameter. Moreover, it was also noticed that a slow reaction between chlorine and NOM results in the formation of THMs under second-order reaction kinetics with respect to TOC, especially over the long term (Draper and Smith 1998). Thus, it is a multistage process that operates through an initial reaction of TOC with residual chlorine followed by many possible pathways to produce THMs. The second step, which proceeds from the reactive chlorinated intermediates formed in the initial step, is found to be rate determining (Trussell and Umphres 1978). With respect to NOM, DOC and UV254 were found to be the second and third most influential parameters after TOC for THMs formation, respectively. A similar observation was also reported by Hua et al. (2015).

Effect of pH and alkalinity

In the present investigation, pH and alkalinity showed a moderate and a statistically significant correlation with TTHMs, respectively (Table 5a, b). pH showed a positive correlation with THMs; in other words, THMs formation increases as pH increases (Roccaro et al. 2014; Hong et al. 2013; Kim et al. 2003). Oxidation by chlorine is more prevalent at alkaline pH and requires more chlorine, which may support greater THMs formation. In contrast, acidic pH lowers the reactivity of the chlorine pathway and strongly disfavors THMs formation (Navalon et al. 2008). Besides, during chlorination, chlorine in contact with water forms hypochlorous acid (HOCl) and the hypochlorite ion (OCl−). The formation of these two species is pH-dependent: HOCl dominates in acidic conditions, whereas OCl− dominates at alkaline pH (Uyak et al. 2005). It is also widely accepted that base-catalyzed reactions play a major role in THM formation (Reckhow et al. 1990). In this regard, pH and alkalinity seem to be important operational parameters in controlling THMs formation. The observations of the present study are well supported by the findings of Kim et al. (2003) and Oliver and Lawrence (1979).
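The pH dependence of the HOCl/OCl− balance can be illustrated with the Henderson-Hasselbalch relationship. The pKa of about 7.5 (near 25 °C) is a textbook value assumed for this sketch and is not reported in the study; the function name is hypothetical.

```python
def hocl_fraction(ph: float, pka: float = 7.5) -> float:
    """Fraction of free chlorine present as HOCl at a given pH.

    Uses [OCl-]/[HOCl] = 10 ** (pH - pKa). The default pKa of 7.5 is an
    assumed textbook value for about 25 deg C.
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# One pH unit below the pKa, HOCl dominates; one unit above, OCl- dominates.
acidic_share = hocl_fraction(6.5)
alkaline_share = hocl_fraction(8.5)
```

Under this assumption, roughly 91% of free chlorine is HOCl at pH 6.5 and roughly 9% at pH 8.5, which matches the qualitative statement that HOCl dominates in acidic conditions and OCl− at alkaline pH.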

Effect of temperature

THMs formation increases with temperature; the higher the temperature, the greater the formation (Hua and Reckhow 2008). It was observed that every 10 °C increase in temperature doubles the rate of the reaction between organic matter and residual disinfectant, as more of the reactants overcome its activation energy (Engerholm and Amy 1983; Chowdhury and Champagne 2008). During the study period, a moderate correlation was obtained between temperature and TTHMs. This observation also agrees well with the seasonal variation, in which PrM gives rise to greater THMs formation than PoM owing to the difference in temperature. Krasner (1999) also reported that THMs formation was higher during summer, when the temperature was high.
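The doubling of the rate for every 10 °C can be written as a simple temperature coefficient (Q10) rule. Extending the reported doubling to arbitrary temperature differences through an exponent is an assumption of this sketch, and the function name and temperatures are hypothetical.

```python
def rate_ratio(t_low_c: float, t_high_c: float, q10: float = 2.0) -> float:
    """Ratio of reaction rates at two temperatures under a Q10 rule.

    q10 = 2.0 encodes a doubling of the rate per 10 deg C increase.
    """
    return q10 ** ((t_high_c - t_low_c) / 10.0)


# Hypothetical comparison: 20 deg C versus 30 deg C and 35 deg C
ratio_10_degrees = rate_ratio(20.0, 30.0)
ratio_15_degrees = rate_ratio(20.0, 35.0)
```

A 10 °C rise gives a ratio of 2, and a 15 °C rise gives about 2.8, so even a modest seasonal temperature difference can change the formation rate noticeably under this rule.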

Effect of RC

An elevated level of RC in treated water consequently increases the formation of chlorinated THMs (Chowdhury and Champagne 2008). However, the availability of organics beyond the chlorination breakpoint is so low that THMs were not found to increase significantly after that point (Sung et al. 2000; Chowdhury and Champagne 2008). The Pearson correlation test in this study indicated that RC is positively correlated with TTHMs. Hence, the THMs yield attains a higher value with greater availability of RC (El-Dib and Ali 1995). This result appeared to be inconsistent with the findings of Al-Tmemy et al. (2018), Uyak et al. (2005), and Wei et al. (2010). The Pearson correlation matrix of variables with TTHMs during PrM exhibited trends similar to those of PoM.

A seasonal modeling approach for THMs formation

Modeling plays a crucial role in predicting THMs formation in water supply systems. The study uses both conventional and artificial intelligence-based models to explore their accuracy and feasibility. The traditional modeling approach is based on multiple linear regression, while the machine learning approach employs ANN and SVM to model THMs formation in drinking water. At first, data from the PrM season were used for model development and PoM data for validation. Surprisingly, all three models failed, giving significantly lower values of R2 = 0.5619 (ANN), R2 = 0.5678 (SVM), and R2 = 0.5670 (MLR) (Fig. 6a–c). This indicates that a model developed from the PrM season cannot predict THMs in PoM owing to the seasonal variation in water quality parameters, especially the change in temperature, which largely influences the rate of THM formation (Rodriguez et al. 2003b; Hua and Reckhow 2008). To overcome this limitation, separate models were developed to predict THMs during the PrM and PoM seasons (Fig. 7a–f). The performance data indicated that, of the three models, ANN gave the most promising results with R2 = 0.9621, followed by SVM (R2 = 0.9554) and MLR (R2 = 0.9553). The applicability of ANN is further supported by its significantly lower values of RMSE and MAPE than the other models (Table 6). Moreover, the observed value of IA, close to unity (0.99), also confirmed the better agreement of ANN than the SVM and MLR models. This may be attributed to the higher generalization capacity of ANN and its increased tolerance to noisy data (Rodriguez et al. 2003b; Milot et al. 2002; Hashem and Karkory 2007). Ye et al. (2011) also modeled DBPs in the drinking water of China using artificial neural networks and reported that the performance of the ANN model was excellent (r > 0.84). The significantly higher correlation for ANN in our study may be attributed to the precise calculation in neural networking, which eliminates the chances of any biased prediction on account of uneven distribution of the modeling and testing data sets.

SVM and MLR models were also used in the study to model THMs in drinking water. Both performed less well than ANN; however, close linearity between observed and predicted values was obtained for both SVM (R2 = 0.9554) and MLR (R2 = 0.9553). The MAPE and RMSE values (Table 6) were also comparatively higher for SVM and MLR, indicating lesser suitability of these models than ANN. The variation in the models' performance may be due to the different prediction algorithms used in the machine learning-based models. Hong et al. (2016) developed MLR models for predicting THMs in a water distribution network of China and observed that the regression models exhibited good accuracy and precision, with 86–97% of the calculated values falling within ±25% of the measured values. However, it is essential to note that the developed models are site-specific, and their predictive capabilities may vary with changes in environmental conditions.

Fig. 6 Model plots of TTHMs using various approaches: a ANN, b SVM, and c MLR

Fig. 7 a–f Season-wise validation model plots of TTHMs using various approaches: ANN, SVM, and MLR

Table 6 Descriptive performance of ANN, SVM, and MLR for both seasons

Conclusion

The study established the concentration range of THMs and their precursors in drinking water utilities of five Indian states. It highlighted the need to adopt effective control measures to bring the high concentrations of THMs down to their permissible limits. THMs concentration showed a strong correlation with temperature, followed by pH and NOM. The performance data of the various models showed that the prediction of THMs through ANN was relatively more precise than through the SVM and MLR models; hence, ANN can be adopted for quality control in drinking water supplies.