1 Introduction

Rapid changes in land use and agricultural practices can alter stream water quality (Castela et al., 2008; Mishra et al., 2020; Walsh et al., 2005; Wang et al., 2021a, b). Maintaining healthy streams poses a challenge, mainly because of the many pollutant sources and the complex interactions between different watershed characteristics (Waite et al., 2010; Walsh & Webb, 2016; Yu et al., 2014). Increases in total nitrogen (TN) and total phosphorus (TP) concentrations in rivers are often linked to high percentages of urban land within a watershed (Bucak et al., 2018; Castela et al., 2008; Johnson & Ringler, 2014; Mattsson et al., 2005; Walsh et al., 2005). Such high levels of pollutants in streams can lead to eutrophication and water quality degradation (Correll, 1999; Hecky & Kilham, 1988; Paerl, 1988).

Previous studies highlighted significant correlations between anthropogenic variables (e.g., urbanization and agricultural activities) and the concentrations of TN and TP in a watershed (Allan, 2004; Giri & Qiu, 2016; Lintern et al., 2018). Watershed characteristics and climatic variables (e.g., topography, soil, climatic data) can also influence stream water quality (Alnahit et al., 2020; Lintern et al., 2018; Tramblay et al., 2010). For example, a steep slope may influence stream water quality by mobilizing pollutants into streams, leading to water quality degradation (Alnahit et al., 2020; Kang et al., 2010; Lintern et al., 2018). Similarly, soil properties can affect water quality (Alnahit et al., 2020; Lintern et al., 2018; Varanka et al., 2015). For instance, watersheds dominated by parent rock showed low values of dissolved ions, whereas watersheds with soft sedimentary rocks showed high values of dissolved ions (Young et al., 2005). Furthermore, a high phosphorus level in rivers was observed in a watershed with high sediment deposition (Dillon & Kirchner, 1975). Different watershed characteristics can potentially influence water quality because they affect the mobilization and delivery of indicators into rivers (Granger et al., 2010; Lintern et al., 2018).

Overall, there are two commonly used modeling strategies for predicting stream water quality in ungauged watersheds: (1) deterministic physically-based models (e.g., distributed hydrologic and water quality models) and (2) statistical and machine learning methods (e.g., decision tree models). This study uses machine learning methods to estimate long-term median stream water quality indicators from several climate and watershed characteristics. Linear regression models are commonly used to explore the relationship between water quality and different land-use variables (Seber & Lee, 2012; Tong & Chen, 2002; Zampella et al., 2007). However, the effects of watershed characteristics on water quality indicators are often complex and nonlinear. Recent machine learning algorithms can handle nonlinear relationships associated with complex watershed processes (Alpaydin, 2020; Konapala & Mishra, 2020; Shen et al., 2020). Moreover, these algorithms determine the relationship between response variables (e.g., water quality indicators) and predictors (e.g., land-use variables) from the data instead of relying on an a priori assumption, which can improve prediction accuracy. Several studies have applied machine learning techniques to understand the relationships between water quality and land use variables (e.g., Bui et al., 2020; Castrillo & García, 2020; Fatehi et al., 2015; Ko et al., 2015; Puissant et al., 2014; L. Q. Shen et al., 2020; Singh et al., 2017; Tu & Xia, 2008; R. Wang et al., 2021a, b). These studies highlighted that these algorithms are more suitable than linear models such as Bayesian linear regression, stepwise linear regression, and partial least squares regression, especially when human/landscape interactions are complex (Giri et al., 2019; Mouazen et al., 2010).

Among the previously used machine learning algorithms, the boosted regression tree (BRT) algorithm and the random forest (RF) algorithm have recently gained considerable attention (Chen et al., 2020; Fang et al., 2021; Knierim et al., 2020; Konapala & Mishra, 2020; Shen et al., 2020; Veettil & Mishra, 2020). Both BRT and RF can provide estimates of the hierarchy of variables in the classification (Everingham et al., 2016). Additionally, RF and BRT algorithms (1) have fewer user-defined parameters; (2) are flexible in handling nonlinear relationships, missing values, and outliers; (3) can limit model overfitting; (4) are capable of incorporating qualitative and quantitative variables; and (5) have been applied successfully in different areas (Giri et al., 2019; Konapala & Mishra, 2020; Veettil & Mishra, 2020; Yang et al., 2016; Shen et al., 2020).

Many recent studies highlighted the use of machine learning algorithms to study the potential influence of human activities on water quality parameters (e.g., Giri et al., 2019; Jeung et al., 2019; Onderka et al., 2012; Tramblay et al., 2010; Tung & Yaseen, 2021; Wang et al., 2021a, b). However, prior studies have used a limited number of watersheds and associated variables. Additionally, no prior study has performed a comprehensive analysis using RF and BRT algorithms to predict water quality indicators (TN, TP, and turbidity, TUR) for a large number of watersheds (97) based on a combination of climate, watershed, and morphological variables in the southeastern USA.

This study complements previous studies that used only a limited number of watersheds and associated variables. The median values of water quality indicators are selected for individual watersheds, and the corresponding 28 variables describing watershed, climate, topographic, and soil characteristics are used for model development. The selected watersheds represent a range of land use, climate, and watershed characteristics with different watershed areas, which helps improve our understanding of the predictive power of the two selected machine learning algorithms in capturing the linkage between climate-watershed characteristics and water quality indicators. The RF and BRT algorithms use an ensemble of many simple tree models to optimize predictive performance instead of the single tree model used in traditional simple regression. The water quality indicators investigated in this study are TN, TP, and TUR, while the predictors (independent variables) represent a combination of climatic and watershed characteristics.

Overall, this study has two objectives: (1) to compare and identify the best machine learning algorithms based on the classification and decision tree approach for water quality (TN, TP, and TUR) prediction in streams; and (2) to investigate the functional relationships and interactions among dominant variables influencing stream water quality using interpretive machine learning techniques (i.e., partial dependence analysis). The remainder of the manuscript is organized as follows: Sect. 2 introduces the study area and data used in the study. The methods employed in the study are discussed in Sect. 3. Section 4 presents the results, while the discussion is provided in Sect. 5. The conclusions drawn from this study are summarized in Sect. 6.

2 Study area and data

2.1 Study area

This study includes 97 watersheds located in North Carolina, South Carolina, and Georgia (Fig. 1a). These watersheds are located in three main physiographic regions: the Coastal Plain, the Blue Ridge, and the Piedmont (Turner & Ruscher, 1988). There are more than 250 watersheds with water quality monitoring stations in the region; however, only 97 watersheds were selected based on the following criteria: (1) nested watersheds were not included to avoid pollutant transfer from other watersheds; (2) watersheds with reservoirs covering more than 25% of the watershed were excluded; and (3) water quality stations located less than 50 km downstream of a reservoir outlet were eliminated.

Fig. 1 (a) Selected watersheds located within the southeastern part of the USA. (b) Examples of watershed characteristics: land use/land cover and the digital elevation model (DEM) over the selected watersheds

The watersheds were delineated using a 10 m Digital Elevation Model (DEM). The latitude and longitude of each watershed outlet were located, and then the Soil and Water Assessment Tool (SWAT) was used to generate the watershed boundary (Arnold et al., 2012). The selected watersheds vary in size from 72 to 5786 km². In addition, the selected watersheds experience different degrees of human activity (urbanization and agricultural activities) (Fig. 1b). The primary form of urbanization is the expansion of low-density, medium-density, and high-density residential areas. Such changes in land use have altered watershed hydrology and the environmental conditions of streams in the study area.

The study area has a humid subtropical climate, with hot summers and mild winters. The mean annual temperature is 20 °C, while the mean annual evapotranspiration is 635 mm/year (SCDHEC, 2016). The study area runs from north to south, with elevation ranging from 2035 to 0 m above sea level (Fig. 1b). Land use is dominated by forest (approximately 55%), which is mainly located in the northern part of the study area (Fig. 1b).

2.2 Datasets

For each watershed, the water quality monitoring data from 2000 to 2019, including TN, TP, and TUR, were downloaded using data retrieval tools from the R software package "dataRetrieval" (https://github.com/USGS-R/EflowStats). The water quality monitoring data were expressed as concentrations (mg/l), or in NTU in the case of TUR. Because stationarity of the time series is crucial, it was checked at each site using two methods. Specifically, each time series was split into four sections, and the mean and variance were computed for each section. In addition, the augmented Dickey–Fuller (ADF) unit root test and the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test were applied (Vazifehkhah et al., 2019). Most of the time series passed the stationarity tests. For the time series that did not pass, we applied first-order differencing to generate stationary time series (Mishra & Desai, 2005). Furthermore, a t-test at the 95% confidence level was performed to exclude outliers from each time series.
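The split-sample check and the first-order differencing described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names and the synthetic series are assumptions introduced here, and the ADF and KPSS tests themselves are not reproduced.

```python
from statistics import mean, variance

def section_stats(series, n_sections=4):
    """Split a series into contiguous sections and return (mean, variance) for each."""
    size = len(series) // n_sections
    stats = []
    for j in range(n_sections):
        start = j * size
        # The last section absorbs any remainder so that no observation is dropped.
        end = (j + 1) * size if j < n_sections - 1 else len(series)
        chunk = series[start:end]
        stats.append((mean(chunk), variance(chunk)))
    return stats

def first_difference(series):
    """First-order differencing: y[t] - y[t-1]."""
    return [b - a for a, b in zip(series, series[1:])]

# Hypothetical series with a linear trend (non-stationary in the mean).
trend = [0.5 + 0.1 * t for t in range(20)]
differenced = first_difference(trend)

print(section_stats(trend))        # section means drift upward
print(section_stats(differenced))  # section means are stable after differencing
```

If the section means or variances drift noticeably from one section to the next, the series is a candidate for differencing; what counts as "noticeably" is a judgment call that the text does not specify.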

The watershed characteristics selected in this study were land use, topography, geology, and climatic data (Table 1). The land use data were obtained from the National Land Cover Dataset (NLCD) for the year 2011. The 2011 land use data were used to represent the whole period (2000 to 2019) to capture the broad impacts of land use on water quality. The soil data were downloaded from the Soil Survey Geographic (SSURGO) database (SSURGO, 2018). The climate data (precipitation and temperature) from 2000 to 2019 over the study area were downloaded from the Parameter-elevation Relationships on Independent Slopes Model (PRISM) (Daly et al., 2008). PRISM was developed using ground rain gauge data, a DEM, and interpolation schemes (Daly et al., 2008). The precipitation and temperature data were averaged over each watershed (areal average) using the Zonal Statistics tools in ArcMap (Esri, 2014). The topographic data for each watershed (e.g., mainstream length–width ratio, watershed slope, and watershed elevation) were extracted from a 10 m DEM using the SWAT model. Twenty-eight different watershed/climatic characteristics were obtained from these datasets (Table 1). Following previous research, these characteristics were selected to identify the essential predictors influencing the water quality indicators (Alnahit et al., 2020; Lintern et al., 2018; Mainali & Chang, 2018; Varanka & Luoto, 2012). Based on EPA criteria, the concentrations of TN and TP should be about 0.90 mg/l and 0.04 mg/l, respectively (US EPA, 2002; Ice & Binkley, 2003). The water quality indicators and land use vary within the selected watersheds (Fig. 2). For example, the median TN across the 97 watersheds ranged from 0.54 to 1.9 mg/l, while the overall median for all watersheds is about 0.9 mg/l (Fig. 2a). FRST land has the highest percentage among the different types of land use, followed by URBAN, AGRL, GRAS, HAY, and WTLN (Fig. 2b).

Table 1 Definitions of the selected independent variables to quantify relationships between watershed characteristics and climatic variables on the mean water quality indicators
Fig. 2 Box plots showing the range of (a) water quality constituents (TN, TP, and TUR) and (b) land-use types. Definitions of land-use variables are shown in Table 1

3 Model development

The Classification and Regression Tree (CART) method (Breiman, 2001; Friedman & Meulman, 2003; Golden et al., 2016; Yang et al., 2016) is a flexible, nonparametric method that is implemented in this study. The CART method can handle outliers, missing values, multicollinearity, and heteroscedasticity in the datasets. It is commonly used to investigate complex datasets with numeric and/or categorical predictor variables that interact with each other nonlinearly (De'ath & Fabricius, 2000). Both RF and BRT belong to the CART family, which has been implemented in different disciplines, such as species distributions (Shabani et al., 2017), groundwater mapping (Naghibi et al., 2016), water quality (Golden et al., 2016; Povak et al., 2014), aquatic ecosystems (Elith et al., 2008; Smucker et al., 2013; Tonkin et al., 2014), and environmental modeling (Giri et al., 2019; Strobl et al., 2008).

Watershed characteristics and climatic variables (a total of 28 characteristics) were chosen as predictor variables (independent variables), while the water quality indicators (TN, TP, and TUR) were chosen as dependent variables. The median values of the TN, TP, and TUR time series at each watershed outlet were calculated and used as the dependent variables. The one-way variance test indicated significant differences in the median values of the water quality indicators among the watersheds [confidence level of 95%; α = 0.05; n = 97]. The overall modeling framework is shown in Fig. 3, and its components are discussed in the following sections.

Fig. 3 The modeling framework to model the median water quality constituents in streams

3.1 Variables selection

Three different approaches were used to select the predictor variables (Fig. 3b). In addition to using all 28 predictor variables, a stepwise linear regression (SR) was used to select the smallest number of relevant variables that provide the best linear combination (Lima et al., 2016; Wang et al., 2018). However, SR may have statistical deficiencies, such as biased estimates, standard errors, and p-values (Harrell, 2001; Mo et al., 2016); therefore, the Least Absolute Shrinkage and Selection Operator (LASSO) was also used for variable selection (Bardsley et al., 2015; Tibshirani, 1996). LASSO uses a cross-validation technique to find a set of significant variables with the optimal performance; it shrinks a regression coefficient to zero if the variable is strongly correlated with another variable (Bardsley et al., 2015). Furthermore, a non-linear method (genetic algorithm, GA) was included to choose the most significant climatic/watershed characteristics (Huang et al., 2016; Taghizadeh-Mehrjardi et al., 2016). GA is an adaptive optimization search method that mimics Darwinian natural selection to find optimal values of a function (Huang et al., 2016; Taghizadeh-Mehrjardi et al., 2016). Three standard parameter settings were defined for the GA: a population size of 50, a crossover rate of 0.80, and a mutation rate of 0.1, based on the recommendation of Welikala et al. (2015). The relevant variables from the four different datasets (ALL, SR, LASSO, and GA) were used to develop predictive models based on the RF and BRT algorithms.
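To make the GA settings concrete, the following is a minimal sketch of GA-based predictor selection using the reported population size (50), crossover rate (0.80), and mutation rate (0.1). Everything else is an assumption made for illustration and is not taken from the study: the toy fitness function, tournament selection, single-point crossover, elitism, and the interpretation of the mutation rate as a per-gene bit-flip probability.

```python
import random

# Settings reported in the text.
POP_SIZE, CROSSOVER_RATE, MUTATION_RATE = 50, 0.80, 0.1
N_PREDICTORS = 28

def toy_fitness(mask, relevant=frozenset({0, 3, 7})):
    """Hypothetical score: +1 per relevant predictor kept, -0.2 per irrelevant one kept.
    A real application would score a mask by cross-validated model skill instead."""
    return sum(1.0 if i in relevant else -0.2 for i, bit in enumerate(mask) if bit)

def tournament(population, fitness, rng, k=3):
    """Pick the fittest of k randomly drawn individuals (assumed selection scheme)."""
    return max(rng.sample(population, k), key=fitness)

def evolve(fitness, generations=40, seed=0):
    rng = random.Random(seed)
    # Each individual is a 0/1 mask over the predictors.
    population = [[rng.randint(0, 1) for _ in range(N_PREDICTORS)]
                  for _ in range(POP_SIZE)]
    history = []
    for _ in range(generations):
        best = max(population, key=fitness)
        history.append(fitness(best))
        next_gen = [best[:]]  # elitism: carry the best mask forward unchanged
        while len(next_gen) < POP_SIZE:
            p1 = tournament(population, fitness, rng)
            p2 = tournament(population, fitness, rng)
            child = p1[:]
            if rng.random() < CROSSOVER_RATE:
                cut = rng.randint(1, N_PREDICTORS - 1)  # single-point crossover
                child = p1[:cut] + p2[cut:]
            # Bit-flip mutation, applied independently to each gene (assumption).
            child = [1 - g if rng.random() < MUTATION_RATE else g for g in child]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history

best_mask, history = evolve(toy_fitness)
selected = [i for i, bit in enumerate(best_mask) if bit]
print(selected, history[-1])
```

Because the best mask is always carried forward, the best fitness never decreases from one generation to the next; without elitism, crossover and mutation could discard the best solution found so far.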

3.2 Random forests (RF) model

The RF algorithm uses an ensemble of regression (or classification) tree models (Breiman, 2001). Specifically, a series of individual trees is built based on random subsamples of the original data. Each subsample provides a decision tree, and each decision tree is used to predict the response variable (or a class). In the end, an ensemble average of all individual trees is computed. The inclusion of several trees increases the probability of deriving an effective prediction model (Breiman, 2001; Strobl et al., 2008). The accuracy of the random forests algorithm relies mainly on the strength of the individual tree classifiers and the dependency between the classifiers (Amit & Geman, 1997). Therefore, the key parameters for RF models are the number of trees and the number of predictor variables used to determine the split at each node (Vorpahl et al., 2012). Figure 3d illustrates the steps used to develop the RF prediction model for each watershed's median water quality indicators. RF modeling requires two parameters: the number of trees (ntree) and the number of variables at each tree node (mtry). To optimize the two parameters, a grid search was performed using different combinations of ntree and mtry. The value of ntree ranged from 100 to 2000 with an increment of 50. The number of selected independent variables (mtry) ranged from 1 to 28 (or the total number of significant variables based on SR, LASSO, and GA) with an increment of 1 (Rodriguez-Galiano et al., 2015). The data were split into 10 folds for cross-validation, and the error rates for each of the 10 cross-validation partitions were aggregated into a mean percentage error. Three replicates of the tenfold cross-validation were performed, and the process was repeated 50 times to evaluate the reliability of the predicted model (Fig. 3d).
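The size of the search described above can be made explicit with a short sketch that enumerates the (ntree, mtry) grid for the ALL case and generates three replicates of tenfold cross-validation over 97 watersheds. The helper name `repeated_kfold` and the shuffling scheme are assumptions; the model fitting itself is omitted.

```python
import random
from itertools import product

ntree_grid = list(range(100, 2001, 50))  # 100, 150, ..., 2000
mtry_grid = list(range(1, 29))           # 1, 2, ..., 28 (ALL case)
rf_grid = list(product(ntree_grid, mtry_grid))

def repeated_kfold(n_samples, k=10, repeats=3, seed=0):
    """Yield (train_idx, test_idx) pairs for `repeats` independently shuffled k-fold splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[f::k] for f in range(k)]  # k nearly equal-sized folds
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, test

splits = list(repeated_kfold(97))
print(len(ntree_grid), len(mtry_grid), len(rf_grid))  # 39 28 1092
print(len(splits))                                    # 30
```

Under these settings each run evaluates 1092 parameter combinations on 30 train/test partitions, and with 97 watersheds each test fold holds 9 or 10 watersheds.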

The relative importance of each variable was calculated based on the mean decrease in accuracy (%IncMSE), as suggested by Genuer et al. (2010). The mean decrease in accuracy was calculated as the percentage increase in mean square error (MSE) when removing that variable from the prediction set. A higher value of %IncMSE for a variable indicates that the predictor has higher relative importance than other predictors. Partial dependence plots for the RF model were also calculated for each independent variable.

3.3 Boosted regression trees (BRT) model

The Boosted regression trees (BRT) technique is an improvement of the regression trees model. BRT uses a boosting technique to combine decisions from a sequence of base models to enhance the accuracy of the final model (Elith et al., 2008; Naghibi et al., 2016; Yang et al. 2016). BRT is a forward and stagewise procedure, where a subsample of the original data is randomly selected to fit new tree models to minimize a loss function (Golden et al., 2016). The final fitted model is a linear function of the sum of all trees multiplied by the contribution of each tree used to build the model (Elith et al., 2008). The bag fraction (BF) in BRT is the proportion of the training set used for each model fit, learning rate (LR) is the contribution of each tree to the model development, and tree complexity (TC) is the number of nodes in a tree. The number of trees (NT) required for the best model prediction is calculated based on LR and TC (Elith et al., 2008).

In BRT modeling, four parameters (LR, TC, NT, and BF) need to be defined. To optimize these parameters, several experiments were conducted using different combinations of LR, TC, and NT. The values of LR varied from 0.001 to 0.03 at 0.002 increments; the values of TC varied from 1 to 7 with an increment of 1; and the NT values varied from 100 to 2000 at an increment of 100. From these combinations, an optimal BRT model was identified using three repetitions of tenfold cross-validation. As in the case of the RF model, the process was repeated 50 times (Fig. 3d). Variable importance was determined from the number of times a variable appeared in all trees. The mean of the relative importance of each variable across the various trees was calculated and used to build a hierarchy of overall relative importance (Elith et al., 2008; Friedman & Meulman, 2003; Golden et al., 2016; Yang et al., 2016). Partial dependence plots were generated to determine the effect of the individual independent variables on the fitted function.

Both the BRT and RF algorithms use several decision trees to enhance predictive performance. However, they use different techniques (boosting in the case of BRT and bagging in the case of RF), which may lead to different results. Specifically, the boosting method builds trees sequentially, while the bagging approach builds trees in parallel (independently). In addition, boosting is an iterative process in which each new tree is built to improve on the weak learners before it, enhancing the overall prediction accuracy of the model (Elith et al., 2008). In the boosting method, the fitted values in the final model are the sum of all trees multiplied by the contribution of each tree (Elith et al., 2008). In the bagging method, on the other hand, trees are grown independently, which means that each event has an equal probability of being selected in subsequent samples. Each tree is given equal weight in the final decision, whereas the boosting method gives higher weight to trees that perform better during training (Breiman, 2001; Yang et al., 2016).
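The contrast between the two ensemble strategies can be illustrated with one-split regression trees (stumps) on a single predictor. This is a simplified sketch, not the RF or BRT implementation used in the study: the bagging function omits the random predictor subsets of a full random forest, the boosting function fits each tree to the current residuals under a squared-error loss with no bag fraction, and the data are synthetic.

```python
import random

def fit_stump(x, y):
    """Fit a one-split regression tree on one predictor; return a predict function."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    if best is None:  # all x values identical: fall back to the mean
        m = sum(y) / len(y)
        return lambda v: m
    _, t, lm, rm = best
    return lambda v: lm if v <= t else rm

def bagging(x, y, n_trees=50, seed=0):
    """Trees are fit independently on bootstrap samples and averaged with equal weight."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(x)) for _ in x]  # bootstrap sample with replacement
        trees.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return lambda v: sum(t(v) for t in trees) / len(trees)

def boosting(x, y, n_trees=50, learning_rate=0.1):
    """Trees are fit sequentially, each to the residuals left by the previous ones."""
    base = sum(y) / len(y)
    residual = [yi - base for yi in y]
    trees = []
    for _ in range(n_trees):
        tree = fit_stump(x, residual)
        trees.append(tree)
        # Shrink each tree's contribution by the learning rate.
        residual = [r - learning_rate * tree(xi) for r, xi in zip(residual, x)]
    return lambda v: base + learning_rate * sum(t(v) for t in trees)

def mse(model, x, y):
    return sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(x)

# Synthetic nonlinear relationship.
x = [i / 10 for i in range(40)]
y = [xi ** 2 for xi in x]
bagged = bagging(x, y)
boosted = boosting(x, y)
print(mse(bagged, x, y), mse(boosted, x, y))
```

In the bagging function the order in which trees are built does not matter, whereas in the boosting function each tree depends on all earlier ones through the residuals. That dependence is why the number of trees and the learning rate must be tuned together.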

3.4 Partial dependence

Partial dependence quantifies the functional relationship between dominant predictors and the water quality indicators in streams. It is evaluated by integrating out the effects of all predictors other than the covariate of interest (Breiman, 2001). The partial dependence of a variable \({\mathrm{x}}_{\mathrm{k}}\) is computed by averaging the model output over the input predictors \(\left\{{\mathrm{X}}_{\mathrm{i}},\mathrm{ i}=1,\dots ,\mathrm{n}\right\}\) with \({\mathrm{x}}_{\mathrm{k}}\) held fixed, as

$$\widehat{f}_{k}(x_{k}) = \frac{1}{n}\sum\limits_{i = 1}^{n} {\widehat{f}(x_{{i,C_{k} }} ,x_{k} )}$$
(1)

where \(\widehat{f}\) is the output based on the RF and BRT models, and \(x_{i,C_{k}}\) denotes the values of all predictors other than \(x_{k}\) in the \(i\)th observation. This partial dependence estimate is usually constructed to understand the functional relationship between the variable \({x}_{k}\) and its potential influence on the water quality indicators. Here, we assessed partial dependence for a subset of dominant predictors for each model (RF and BRT) to visualize the effect of a single predictor on the model outcomes. For a given value of the predictor, the prediction is obtained by averaging the predictions over the values of all other predictors in the dataset.
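The averaging in Eq. 1 can be sketched in a few lines. The function below works with any fitted model exposed as a `predict(row)` callable; the toy model and the small dataset are hypothetical and stand in for a fitted RF or BRT model and the watershed predictors.

```python
def partial_dependence(predict, X, k, grid):
    """For each grid value, force column k to that value in every row of X
    and average the model predictions over all rows."""
    curve = []
    for value in grid:
        total = 0.0
        for row in X:
            modified = list(row)  # copy so the original data are not altered
            modified[k] = value
            total += predict(modified)
        curve.append(total / len(X))
    return curve

def toy_model(row):
    """Hypothetical fitted model: linear in predictor 0, quadratic in predictor 1."""
    return 2.0 * row[0] + row[1] ** 2

X = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
pd_curve = partial_dependence(toy_model, X, k=0, grid=[0.0, 1.0, 2.0])
print(pd_curve)
```

For this toy model the curve rises by 2.0 per unit of predictor 0, recovering the model's linear effect of that predictor, while the contribution of predictor 1 enters only as a constant offset (its average squared value).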

3.5 Model validation

BRT and RF models were evaluated using a tenfold cross-validation method. The final models for each of the water quality indicators were evaluated using three statistical measures: Nash–Sutcliffe efficiency (NSE), mean absolute error (MAE), and root mean square error (RMSE) (Eqs. 2–4, respectively).

$$NSE = 1 - \frac{{\sum\nolimits_{i = 1}^{n} {(O_{i} - P_{i} )^{2} } }}{{\sum\nolimits_{i = 1}^{n} {(O_{i} - \overline{O} )^{2} } }}$$
(2)
$$MAE = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left| {O_{i} - P_{i} } \right|}$$
(3)
$$RMSE = \sqrt {\frac{1}{n}\sum\limits_{i = 1}^{n} {(O_{i} - P_{i} )^{2} } }$$
(4)

where n is the number of watersheds, \(O_{i}\) is the observed water quality variable at the watershed \(i\), \(\overline{O}\) is the mean of the observed data, \(\overline{P}\) is the mean of the predicted data, and \(P_{i}\) is the predicted water quality constituent at the watershed \(i\).
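Equations 2 to 4 translate directly into code. The sketch below uses only the standard library; the observed and predicted values are made up for illustration.

```python
from math import sqrt

def nse(obs, pred):
    """Nash-Sutcliffe efficiency (Eq. 2)."""
    mean_obs = sum(obs) / len(obs)
    numerator = sum((o - p) ** 2 for o, p in zip(obs, pred))
    denominator = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - numerator / denominator

def mae(obs, pred):
    """Mean absolute error (Eq. 3)."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def rmse(obs, pred):
    """Root mean square error (Eq. 4)."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

# Hypothetical median concentrations (mg/l) at four watersheds.
observed = [0.8, 1.0, 1.2, 1.4]
predicted = [0.9, 1.0, 1.1, 1.5]
print(nse(observed, predicted), mae(observed, predicted), rmse(observed, predicted))
```

A model that always predicts the observed mean gives NSE = 0, so positive NSE values indicate skill beyond that baseline.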

NSE indicates how well the plot of observed versus predicted data fits the 1:1 line, and the prediction becomes optimal as NSE approaches 1.0. MAE indicates how close the prediction is to the observation, while RMSE is the standard deviation of the residuals. MAE and RMSE are computed and reported in the same units as the variable being evaluated (Moriasi et al., 2015). Empirical relationships were categorized by \(R^{2}\) values as weak (\(R^{2}\) ≤ 0.25), moderate (0.25 < \(R^{2}\) < 0.75), and strong (\(R^{2}\) ≥ 0.75) correlation following the recommendation of Hair et al. (2013).

4 Results

4.1 Variables selection using three methods

We performed a preliminary analysis based on Spearman's correlation matrix (31 × 31) for the water quality indicators (TN, TP, and TUR) and the watershed/climatic characteristics (28 variables) (Fig. 4). A cell with a white color indicates that the correlation is statistically insignificant (p > 0.05). TN is positively correlated with URBAN and with SOL_AWC. TN is also negatively correlated with FRST and with WT_S, and weakly correlated with the other watershed/climatic characteristics. Similarly, TP is positively correlated with both URBAN and SOL_AWC, negatively correlated with both FRST and GRAS, and weakly correlated with the other watershed/climatic characteristics. TUR is positively correlated with URBAN, FRST, and HAY across the watersheds. Additionally, the proportion of clay and silt and the SOL_pH in a watershed are positively correlated with the median TUR values. On the other hand, TUR shows a negative correlation with GRAS, AGRL, and WTLN lands in each watershed. Similarly, the proportion of sand, SOL_OM, and SOL_K show a negative correlation with TUR. Elev and CH_S show a positive correlation with TUR. All climatic variables (except MeanRain and DryMRain) exhibit a negative correlation with TUR values in streams.
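Spearman's correlation is the Pearson correlation of the ranked data, which is why it captures monotonic but nonlinear associations. A minimal standard-library sketch follows; the function names and the example values are illustrative and are not taken from the study's data.

```python
from math import sqrt

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for m in range(i, j + 1):
            result[order[m]] = average_rank
        i = j + 1
    return result

def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(a), ranks(b))

# Hypothetical example: a monotonic but nonlinear relationship.
urban_pct = [1.0, 2.0, 3.0, 4.0]
tn_median = [1.0, 4.0, 9.0, 16.0]
print(spearman(urban_pct, tn_median))
```

Because only ranks are used, any strictly increasing relationship yields a coefficient of 1, regardless of its shape.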

Fig. 4 The correlation matrix showing Spearman's correlation analysis between the median water quality constituents and watershed characteristics/climatic variables. [Note: A cell with a white color indicates that the correlation is statistically insignificant (p > 0.05). The definition of predictors is shown in Table 1]

Interestingly, while the concentrations of TN and TP are negatively correlated with FRST according to the Spearman correlation (Fig. 4), FRST is positively correlated with TUR. This is likely due to the spatial correlation between FRST lands and the climatic and topographic watershed characteristics. Specifically, watersheds with higher elevations and steep slopes are dominated by FRST (which is positively correlated with elevation and mean channel slope). Hence, FRST under these conditions may lead to more sediments and particulates being transported into receiving streams, resulting in higher TUR values (Lintern et al., 2018; Alnahit et al., 2020).

Figure 5 shows the significant predictors for each water quality indicator (TN, TP, and TUR) selected by the SR, LASSO, and GA methods. Based on the SR approach, three significant predictors are found for TN, four for TP, and five for TUR. The LASSO approach, on the other hand, identifies eleven significant predictors for TN and ten for TP, but only eight for TUR. The GA approach selects a higher number of predictors: sixteen significant predictors for TN and TP, and nine for TUR (Fig. 5). Specifically, URBAN and AGRL are identified for TN by all methods. Other predictors, such as GRAS, HAY, and WTLN, were selected for TN by all methods except SR.

Fig. 5 Variable selection based on stepwise regression (SR), least absolute shrinkage and selection operator (LASSO), genetic algorithm (GA), and all 28 predictor variables (ALL). [Note: A cell with a gray color indicates that the variable is selected by the variable selection method]

The mean SOL_OM, a soil parameter, was selected by all the methods as an important predictor for TN and by two methods for TP. Similarly, MeanRain and mean channel slope (CH_S) are significant variables based on the LASSO and GA methods (Fig. 5a). URBAN and AGRL are selected by all methods for TP, while FRST, GRAS, Elev, and DryMRain are significant predictors based on the LASSO and GA methods (Fig. 5b). For TUR, all three methods identified WTLN and the mean SOL_K as significant predictors, while all methods except SR selected URBAN and CH_L (Fig. 5c). These results show that the choice of predictors can vary with the selection method (SR, LASSO, or GA); it is therefore important to evaluate how well each set of predictors performs in water quality prediction, as discussed in the following section.

4.2 Evaluation of RF and BRT models

We evaluated the performance of the selected climate and watershed variables for water quality prediction over the 97 watersheds. The input (predictor) variables for the RF and BRT models are either all 28 predictor variables (ALL) or those identified by the SR, LASSO, and GA methods. These four types of input variables (ALL, SR, LASSO, and GA) are compiled for the individual watersheds, and the median values of the water quality indicators for the same watersheds are considered as the model output. For each water quality constituent, eight models are evaluated (four using RF and four using BRT). The models are named RF_selection method (BRT_selection method). For example, RF_LASSO represents a random forest model developed using the variables selected by LASSO. Model performance is quantified using the three goodness-of-fit statistics (NSE, MAE, and RMSE). The box plots of the goodness-of-fit statistics for the selected watersheds are shown in Fig. 6.

Fig. 6 The models' performance for predicting the median values of TN, TP, and TUR using random forest (RF) and boosted regression tree (BRT) with 50 runs for the different variable selection methods. The selected input variables for each method are shown in Fig. 5

Figure 6 shows that all models (except the SR models) predicted the TN, TP, and TUR concentrations moderately well based on the median values of NSE, MAE, and RMSE. Additionally, the models based on LASSO, GA, and ALL show similar levels of prediction accuracy based on the median values of NSE. The selected climatic and watershed characteristics explained at least 48% of the TN, TP, and TUR variation in streams (as indicated by the NSE values). Specifically, the median NSE values correspond to approximately 53% of the variability in TN, 55% of the variability in TP, and 48% of the variability in TUR in streams for both the RF and BRT algorithms. Additionally, the random forest algorithm performed slightly better than the boosted regression tree algorithm for the TN, TP, and TUR models (Fig. 6). For example, when using the predictors selected by the GA method for TN, the RF_GA model has a higher median NSE (0.56) and a lower median MAE (0.022), with the same median RMSE (0.061), compared to the BRT_GA model (NSE = 0.53, MAE = 0.024, and RMSE = 0.061).

The relative importance of the top five predictors for the TN, TP, and TUR models using RF and BRT are presented in Fig. 7 and Fig. 8, respectively. The relative importance of each predictor is calculated as the mean value of the 50 runs of each model. The TN variability in streams is influenced mainly by the presence of URBAN lands, AGRL lands, and GRAS lands, as well as the mean total rainfall (MeanRain) over a watershed. URBAN lands show the highest relative importance for all TN models, followed by AGRL lands for RF_SR, RF_LASSO, and RF_GA methods and FRST lands in the case of RF_ALL model. On the other hand, the TP variability is influenced by URBAN, AGRL, GRAS, and watershed soil properties (the proportion of CLAY/SILT within a watershed in the SR and LASSO models). URBAN lands have the highest relative importance for all TP models, followed by MeanRain in the case of RF_SR, RF_LASSO, and RF_GA models and by HAY in the case of RF_SR model. For TUR, WTLN shows the highest relative importance for all TUR models (Fig. 7). The mean watershed slope (WT_S) appeared as an important variable in TUR_SR and TUR_ALL models.

Fig. 7 The relative influence of the top 5 predictors of the median TN, TP, and TUR models based on the random forests (RF) algorithm

Fig. 8 The relative influence of the top 5 predictors of the median TN, TP, and TUR models based on the boosted regression tree (BRT) algorithm

The RF and BRT models identified similar top five predictors with a high relative influence on the water quality indicators (Figs. 7 and 8). For instance, the five predictors of the TN models for RF_GA and BRT_GA are the same; however, their relative importance is slightly different. In the RF_GA model, URBAN is the most important predictor for TN, followed by AGRL, MeanRain, GRAS, and HAY, while in the BRT_GA model, URBAN is the most important predictor for TN, followed by HAY, MeanRain, GRAS, and AGRL across the selected watersheds.

Overall, the results from both the RF and BRT models suggest that the top five influential predictors for TN, TP, and TUR in streams are similar; however, the relative influence of each predictor differs between models. This is expected, as each machine-learning algorithm has a different inherent model structure. Specifically, the RF algorithm generates trees independently (in parallel), and each tree is assigned equal weight in the final prediction. In contrast, BRT develops trees in a stagewise manner, giving higher weight to better-performing trees. Besides, the bagging method in the RF algorithm aims to minimize the variance in model fitting, while the boosting algorithm in BRT focuses on improving on the weak learners at each tree. Additionally, the RF algorithm performed slightly better than the BRT algorithm. This may be due to a greater tendency to overfit in BRT than in RF, likely because, in the boosting algorithm, trees are grown adaptively to reduce bias, which may increase the variance and result in model overfitting.
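The structural difference between bagging and boosting can be illustrated with a deliberately simplified Python sketch. It is not the implementation used in this study: each "tree" is a one-split stump on a single predictor, the boosting variant is plain least-squares gradient boosting with a fixed learning rate, and the random feature subsetting of a full random forest is omitted. The functions `fit_stump`, `bagging`, and `boosting` and the `x` and `y` values are hypothetical.

```python
import random

def fit_stump(x, y):
    """Fit a one-split regression tree (stump) on one predictor by least squares."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - lmean) ** 2 for v in left) + sum((v - rmean) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    if best is None:  # all x values identical: fall back to the mean
        mean_y = sum(y) / len(y)
        return lambda v: mean_y
    _, t, lmean, rmean = best
    return lambda v: lmean if v <= t else rmean

def bagging(x, y, n_trees, rng):
    """Trees are fitted independently on bootstrap samples and averaged with equal weight."""
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(x)) for _ in x]  # bootstrap sample with replacement
        trees.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return lambda v: sum(tree(v) for tree in trees) / len(trees)

def boosting(x, y, n_trees, learning_rate):
    """Trees are fitted one after another, each to the residuals left by the previous ones."""
    base = sum(y) / len(y)
    residuals = [yi - base for yi in y]
    trees = []
    for _ in range(n_trees):
        tree = fit_stump(x, residuals)
        trees.append(tree)
        residuals = [r - learning_rate * tree(xi) for r, xi in zip(residuals, x)]
    return lambda v: base + learning_rate * sum(tree(v) for tree in trees)

def mse(model, x, y):
    return sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(x)

# Hypothetical predictor and response values for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0.1, 0.2, 0.2, 0.3, 0.8, 0.9, 1.0, 1.1]

bagged = bagging(x, y, 25, random.Random(0))
boosted_1 = boosting(x, y, 1, 0.5)
boosted_20 = boosting(x, y, 20, 0.5)
print(mse(bagged, x, y), mse(boosted_1, x, y), mse(boosted_20, x, y))
```

In this sketch the training error of the boosted model keeps falling as stages are added, because every new stump targets the remaining residuals. The bagged model instead averages independently fitted stumps, which smooths the prediction rather than chasing the residuals.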

5 Partial dependence plots

Partial dependence plots can provide the functional relationship between an individual climate/watershed variable and the predicted water quality indicators. We assessed the partial dependence of the top dominant variables on water quality indicators for both RF and BRT models (Figs. 9 and 10, respectively). The partial plots are developed based on the key variables, which include URBAN, FRST, AGRL, GRAS, WTLN, and SOL_K.
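As a rough illustration of how a partial dependence curve is obtained, the Python sketch below forces one predictor to each value on a grid for every watershed, leaves the other predictors unchanged, and averages the model predictions. It follows the general definition of partial dependence, not this study's code. The function `toy_predict` is a hypothetical stand-in for a fitted RF or BRT model, and the rows, the grid, and the jump above 40% URBAN are invented for illustration.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows when `feature` is set to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the input rows are not mutated
            modified[feature] = value  # force the predictor of interest to the grid value
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

def toy_predict(row):
    """Hypothetical model: response rises with URBAN, jumps above 40% URBAN, falls with FRST."""
    jump = 0.5 if row["URBAN"] > 40 else 0.0
    return 0.01 * row["URBAN"] + jump - 0.005 * row["FRST"]

# Hypothetical watersheds (land-use percentages), for illustration only.
rows = [
    {"URBAN": 10.0, "FRST": 60.0},
    {"URBAN": 30.0, "FRST": 40.0},
    {"URBAN": 55.0, "FRST": 20.0},
]
grid = [20.0, 40.0, 60.0]

pd_urban = partial_dependence(toy_predict, rows, "URBAN", grid)
print(list(zip(grid, pd_urban)))
```

Plotting `pd_urban` against `grid` gives the kind of curve shown in the partial plots; an abrupt change in the curve is what is read as a threshold.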

Fig. 9 Partial dependence plot based on random forests (RF) for TN, TP, and TUR in the streams

Fig. 10 Partial dependence plot based on boosted regression trees (BRT) for TN, TP, and TUR in the streams

Among the most important predictors, TN and TP show a positive trend with URBAN and AGRL and a negative trend with the percentage of FRST and GRAS lands in the study area (Fig. 9). Specifically, for the RF models, TN and TP values in streams decrease linearly as the GRAS cover increases in the watershed. On the other hand, in the BRT models, the GRAS cover is nearly linearly related to TP in streams when the percentage of GRAS is above approximately 9% of the watershed, while TN values in streams show a slight increase when the GRAS cover is around 10% of the watershed and then level out when the GRAS cover is above approximately 21% of the watershed area (Fig. 10). Overall, for the RF models, the partial plots suggest that TN and TP increase abruptly when the percentage of URBAN is above approximately 40% and 55% of the watershed, respectively, and when AGRL land is above 43%.

TUR shows a negative trend with the percentage of WTLN and the mean values of SOL_K, while it exhibits a positive trend with the percentage of URBAN and FRST. Specifically, the TUR levels in streams in both the RF and BRT models tend to increase as URBAN and FRST land cover increases, but only below values of about 50% of the watershed area.

6 Discussion

This study shows that urban and agricultural lands are the largest contributors to nutrient loads (TN and TP) delivered to streams. The relative importance analyses and partial dependence plots suggest that the increase in human activities (e.g., urbanization and cropping) in a watershed has led to greater TN and TP concentrations in streams. The high TN and TP in streams associated with a larger proportion of urban land in the watershed may be due to the increased use of fertilizer on urban lawns, the presence of treatment plants, and stormwater discharges (Perry & Vanderklein, 1996; Polsky et al., 2014; Tasdighi et al., 2017). These findings are expected and agree with previous studies that noted a positive correlation between TN/TP and the percentage of cropping and urbanization in the watershed (Agouridis et al., 2005; Pratt & Chang, 2012; Tasdighi et al., 2017; Wan et al., 2014; Wilson & Weng, 2010).

Additionally, WTLN lands appeared as a significant predictor for all TUR models, with the highest relative importance value for most TUR models (Figs. 7 and 8). This is expected and may be associated with wetlands near streams, which act as a sink for particulate matter (Cui et al., 2016; Shen et al., 2019; Suzuki et al., 2018). On the other hand, FRST and GRAS have higher relative importance for TN and TP predictions in most models (Figs. 7 and 8). The negative correlation of TN and TP with GRAS and FRST is expected, as GRAS and FRST can potentially decrease nutrients in streams (Giri & Qiu, 2016; Tu & Xia, 2008).

Soil characteristics appear in all the models for TUR. For example, the proportion of clay in the watershed and the SOL_K appeared in the TUR models. When there is a high percentage of clay and silt in soils, hydraulic conductivity (SOL_K) can be lower, leading to more runoff. Particulates are transported mainly from the watershed into streams by runoff. This high rate of runoff can lead to more particulates being transported over longer distances (Charlton, 2007; Wood, 1977), thus contributing to increased TUR and TP in streams. In addition, the positive relationship of TUR with the SOL_OM (organic matter) is expected, as many previous studies have indicated that organic matter can increase TUR in streams (Lenhart, 2008; Lenhart et al., 2010; Waters, 1995).

Moreover, the RF algorithm was easier to calibrate and more robust to overfitting than BRT, which is partly associated with the bagging method that reduces the variance of the prediction model. These findings are consistent with previous findings showing that RF performed better than BRT (Giri et al., 2019; Park & Kim, 2019; Shabani et al., 2017; Wang et al., 2018). For example, Park and Kim (2019) found that RF was slightly better than BRT in mapping landslide susceptibility using different variables, such as topography and land use variables. Additionally, Shabani et al. (2017) showed that RF outperforms BRT when predicting the best location to distribute date palm trees under different climate change scenarios. Overall, one of the advantages of using these machine learning algorithms (RF and BRT) compared to traditional approaches (linear regression) is their ability to handle nonparametric datasets as well as nonlinear relationships (Grömping, 2009; Noi et al., 2017; Trawiński et al., 2012).

Previous studies used stepwise regression (SR) to identify the most significant watershed characteristics influencing stream water quality (Hajigholizadeh & Melesse, 2017; Shrestha & Kazama, 2007; Wang et al., 2018). However, in this study, SR selected fewer predictors compared to the LASSO and GA methods (Fig. 5) and did not perform well for the RF and BRT models (Fig. 6). This may be due to the statistical deficiencies in the SR method, such as the distribution of test statistics, biased estimates, and standard errors (Mo et al. 2016). Specifically, the regression error in the SR procedure is assumed to follow a Gaussian distribution, so the predictors and response variables are usually transformed toward a Gaussian distribution. This may influence the interpretation of the regression coefficients (Hastie et al., 2017). More importantly, when solving a non-convex optimization problem, the SR procedure often fails to find a globally optimal set of variables and stays at a local optimum (Hastie et al., 2017). On the other hand, LASSO uses cross-validation to find predictors with the optimal generalization performance (Arlot & Celisse, 2010), which enhances the selection capability and provides better results compared to SR (Hammami et al., 2012). Overall, the model performance results showed that the GA models performed slightly better than the LASSO and ALL models. These findings agree with Xie et al. (2015) and Wang et al. (2018), where the GA model was found to improve soil type recognition accuracy by 3–10%. These studies highlighted that SR models performed the worst, as SR chooses a predictor based on the strength of the correlation and ignores interaction effects between predictors.

The models developed in this study can improve water-quality management decisions. Water-quality managers can use the partial plots to identify the impact thresholds for different land use and watershed characteristics, to formulate watershed regulations, and to find impaired water bodies that have not yet been assessed. Although this study focused on the southeastern part of the United States, the methodology can be extended to other regions of the United States to evaluate the long-term median stream water quality.

7 Conclusion

Understanding the variability of water quality in rivers is essential to improve and predict water quality and environmental conditions in watersheds. Random forests and boosted regression tree algorithms were evaluated to determine the most reliable model to predict the long-term median water quality indicators (TN, TP, and TUR). Different climatic and watershed characteristics across 97 watersheds located in the southeastern US were used as predictor variables. The results showed that the random forests algorithm performed slightly better than the boosted regression tree algorithm for predicting the median values of TN, TP, and TUR. The cross-validation results suggested that the random forest models explained 53%, 55%, and 48% of the variation in TN, TP, and TUR in streams, respectively. The RF algorithm was easy to train because it has fewer user-defined parameters than BRT. Additionally, RF addressed the model overfitting issue slightly better than BRT, as it uses a bagging algorithm that reduces the variance of the predictive function. Because of this, the relative importance of predictors (climatic and watershed characteristics) for the response variables (TN, TP, and TUR) was slightly different for the two algorithms, leading to slight differences in model predictability.

The results also highlighted the importance of forest and grasslands within a watershed to sustain healthy streams. Identifying a threshold can help water quality and watershed managers develop watershed regulations or design a restoration program based on scientific criteria. While the partial plots can be useful to identify key variables to enhance stream water quality management, additional research is needed to evaluate the effect of different hotspots (e.g., septic tanks, industries, biogeochemical hotspots, and the distance of pollutant sources from the streams) within the watersheds on long-term spatio-temporal water quality changes.