Introduction

Rivers are one of the main natural sources used for drinking and utility water supply, aquaculture, agricultural irrigation, and energy production (Gupta and Gupta 2021a). Nevertheless, rivers are not simply biophysical phenomena that provide services to people. There is a connection between cities and rivers, a bi-directional or reciprocal relationship maintained by humans and rivers since they are social–ecological systems (Anderson et al. 2019). These systems contribute to improving the landscape of the city with parks and riverside forests, but they also become sources of waste and pollution, since they can be used as landfill for sewage and industrial effluents (Gatica et al. 2012). These freshwater resources also play important roles in urban areas, such as serving as carriers of water and suspended solids, providing habitats for a diverse and productive biota, and acting as social and cultural elements for the human inhabitants living in the watershed (Walsh et al. 2005). The relationships between cities and water bodies, as well as the issues related to the quantity and quality of urban water, change in response to the expansion and development of cities (Schirmer et al. 2013; Walsh et al. 2005). In developing countries, urban waterfronts have been transformed into mixed residential and commercial areas with high-density infrastructure and services. Currently, rivers and their banks are considered natural urban ecosystems that provide a wide development potential in cities (Khairabadi et al. 2023).

When there are urban uses in the hydraulic public domain or protection zones of a section of a river and/or the section is immersed in an urban matrix, that section should be considered an urban river (Duran Vian et al. 2020). Urban rivers are important not only because they constitute the habitat of species that control and maintain the ecological state of the basin in which they are located but also because they represent a considerable resource that is part of the productive, social, and cultural activities of the surrounding community (Obisesan and Christopher 2018).

Urban rivers face their own challenges: they are in transition with numerous tensions (social, political, economic, ecological, and environmental tensions) leading to water scarcity, decreasing water quality, and climatic variability (Ghimire et al. 2022). The water quality of an urban river is affected by several variables, i.e., geography, topography, atmosphere, anthropogenic activities, etc. (Gupta and Gupta 2021b). The impact of anthropogenic activities on urban rivers (irregular urbanization and industrialization, population growth, use of fertilizers and pesticides in agriculture, discharge of domestic wastewater, removing sand-gravel from stream beds, etc.) has increased substantially in recent years, leading to the rapid loss of their natural character (Schirmer et al. 2013; Ustaoğlu et al. 2020; Das et al. 2023; Ghimire et al. 2022; Gupta et al. 2022). This has become a serious problem for many other related aquatic environments, such as streams, lakes, and estuaries (Walsh et al. 2005). This situation is one of the main environmental problems in the world, as the increase in human population raises the need for freshwater (Ustaoğlu et al. 2020).

In urban areas, streams receive groundwater, treated and untreated wastewater, and industrial waste, among others, and are often degraded by a multitude of stressors (Obisesan and Christopher 2018; Ghimire et al. 2022). This degradation is called the “syndrome of urban rivers” (Schirmer et al. 2013; Walsh et al. 2005). Urban rivers are very sensitive to changes in land use, and the response is a warning signal of potential downstream water deterioration (Kominkova 2013). Anthropogenic activities deteriorate the quality of river water and prevent their utilization for other purposes such as drinking water or irrigation (Gupta and Gupta 2021b).

As highly dynamic systems, water quality estimation in rivers is a complex assignment but very significant in order to generate pollution mitigation strategies (Gupta et al. 2022). Regular monitoring of the water quality of rivers, including the control and evaluation of physicochemical and/or biological parameters, became extremely important to protect this resource (Ustaoğlu et al. 2020). These parameters can be combined and transformed into simple dimensionless numbers, called water quality indices, that are effective in monitoring river pollution by transforming multifaceted water variable information into usable and intelligible data (Gupta et al. 2022).

Currently, there are general water quality indices and specific indices tailored to particular purposes (Almeida et al. 2012; del Corigliano 2008; Unda-Calvo et al. 2020). For example, the Simplified Index of Water Quality (SIWQ) combines five physicochemical parameters to determine water quality (Queralt 1982). The numerical value of the SIWQ defines five levels of use of water according to its quality: all uses, swimming and fishing, irrigation and industry, forest irrigation, navigation, and very restricted use (Bustamante 1989; Bustamante et al. 2002, Losada Benavides et al. 2020).

A complete assessment of water quality requires extensive long-term data collection over space and time, generating huge data sets. Evaluation of spatiotemporal variation using statistical approaches is an effective method for monitoring the water quality of a river (Gupta et al. 2022). Multivariate statistical techniques, chemometric or environmetric methods (e.g., hierarchical cluster analysis (HCA), principal component analysis (PCA), factor analysis (FA), multivariate analysis of variance (MANOVA), correlation matrices (CM), regression analysis (RA), artificial intelligence (AI), etc.) are the most widely used ones in the analysis and classification of water qualities (Gupta et al. 2022; Fletcher et al. 2013). They also allow the interpretation of complex data sets and the understanding of temporal and spatial variations (Gatica et al. 2012; Ouali et al. 2009; Unda-Calvo et al. 2020), which can help to visualize possible water pollution sources from raw analytical data and reduce the subjectivity in the use of water quality indexes (S. Gupta and Gupta 2021a). The multivariate exploratory statistical techniques (PCA, HCA, etc.) identify the natural clustering pattern and associate variables based on similarities between samples (Kannel et al. 2007). RA also allows the developing prediction models, classification models, and time series for monitoring river water quality (Gupta and Gupta 2021a). Multiple linear regression (MLR) allows for analyzing the relationship between variables, providing more user interaction and control over predictive analytics compared to other prediction models such as machine learning models (Yildiz et al. 2017).

MLR is the most widely used statistical technique to estimate relationships between dependent and independent variables in regression modeling. It is used considering two main objectives: to interpret data and to predict future response values (Etemadi and Khashei 2021).

These models make it possible to accelerate and lower the costs of water quality evaluations (Ewaid et al. 2018), considering the expensive task of determining numerous parameters to monitor rivers. To develop a model, it is necessary to carry out a comprehensive study of the system to select, using statistical tools, those parameters indicative of contamination (Valentini et al. 2021). Regression models have been successfully used in recent years for modeling hydrological processes (S. Gupta and Gupta 2021b; Gupta et al. 2022).

Problem statement

Currently, there are numerous worldwide rivers affected by anthropogenic activities, and, therefore, researchers are dedicated to studying them through quality indices and multivariate statistics (Alvareda et al. 2020; Barakat et al. 2016; Brilly et al. 2006; Bu et al. 2019; Carrasco et al. 2019; Connor et al. 2014; Dimri et al. 2021; Edokpayi et al. 2015; Fan et al. 2010; Hernandez-Ramirez et al. 2019; Howladar et al. 2021; Keupers and Willems 2017; Pinto and Maheshwari 2011; Valentini et al. 2021; Varol 2020). In Argentina, studies of urban rivers in large cities seek to establish their water quality and the degree of anthropogenic impact (Casares and De Cabo 2018; Bonansea et al. 2013; Merlo et al. 2011; Mgelwa et al. 2020; Nimptsch et al. 2005; Rautenberg et al. 2015; Valdés et al. 2021; Gatica et al. 2012; Lupi et al. 2019; del Corigliano 2008; Cazenave et al. 2009).

Similar studies have been conducted in the province of San Luis (Almeida et al. 2012; González et al. 2014), using statistics to evaluate the water quality of mountain rivers affected by tourism. The San Luis River is an urban river that flows through two developing cities, San Luis and Juana Koslay, and is affected by anthropic activities. Scarce research has been performed in order to evaluate the water quality of this river, with studies conducted over short periods of time and focused mainly on the industrial section of the river without developing a quality monitoring model (Castro et al. 2021). However, anthropic activities are developed not only in the industrial zone but also in the entire surrounding area and the riverbank itself, in both San Luis and Juana Koslay cities. The anthropic disturbances that affect this river encompass effluent discharges, runoff from animal husbandry, land removal through sand mining, uncontrolled riverbank vegetation management, and the accumulation of solid wastes (Borgatello 2014; Calderon et al. 2014; Ortiz 2017; Castro et al. 2021). Moreover, in recent years, the population of San Luis City has significantly grown, resulting in the discharge of inadequately treated wastewater into the San Luis River, as reported by the municipality (Giorda 2021). Simultaneously, the water of this river infiltrates and replenishes underground aquifers that eventually flow into the Salinas del Bebedero. In this area, there is a salt exploitation company that provides a significant portion of Argentina with this compound. The water quality of this river is relevant not only from an ecological point of view but also from social, cultural, and economic approaches, which makes its monitoring relevant.

Considering the context outlined above, this study has three primary objectives:

  1. 1)

    To estimate the SIWQ index and classify the quality of river water for different uses.

  2. 2)

    To assess the spatiotemporal variations of water quality of the San Luis River using multivariate statistical analysis.

  3. 3)

    To develop and evaluate a multiple linear regression model to monitor an urban river.

Material and methods

Study area

The study area comprises the San Luis River, an urban river located in the province of San Luis, Argentina. This river, also called the Chorrillos River, courses through two developing cities: Juana Koslay and San Luis, where significant urbanization has occurred over the past decade. Data obtained from the Instituto Nacional de Estadística y Censos show that the Department of Pueyrredón—an area encompassing San Luis, Juana Koslay, and La Punta cities and some other smaller localities—increased its population from 204,019 inhabitants in 2010 (Instituto Nacional de Estadística y Censos (INDEC) 2010; Población Provincial Por Localidades Años 1869-2010 n.d.) to 260,295 in 2022 (Instituto Nacional de Estadística y Censos (INDEC) 2023), which implies a population growth of 25%.

This river originates at the foot of the sierras of San Luis, specifically in Juana Koslay City (33° 16′ 59,32″ south latitude, 66° 14′ 15,03″ west longitude), at the confluence of the Las Chacras and Cuchi Corral streams. It receives contributions from smaller basins located on the southern flank of Sierra de los Venados, as well as from an unnamed stream that collects water from its eastern sector. The San Luis River is a third-order urban stream, and it is part of the Bebedero Basin. It meanders through Juana Koslay City and is impounded at Dique Chico. From there, it flows southwestward through the city of San Luis, encompassing the industrial zone, and ultimately reaches an area where effluents from the municipal sewage treatment plant are discharged. Subsequently, the river traverses flat terrain, and during periods of low flow, its volume decreases significantly, with infiltration into the subsurface occurring approximately 15 km away from the city of San Luis. The San Luis River ultimately drains into the Bebedero Basin, approximately 42 km southwest of the capital city, near Salinas del Bebedero (Calderon et al. 2014). This body of water can be regarded as one of the most severely impacted fluvial systems in the province of San Luis.

Sampling sites and regime

Eleven sampling sites were selected considering the level of disturbance of the river along its course through the cities of Juana Koslay and San Luis. The first two sites, represented by JA_w and JB_w, were considered the lowest disturbance level sites since they were located at a short distance from the river origin in the city of Juana Koslay. The next six sites, represented by JC_w, JD_w, JE_w, JF_w, JG_w, and JPO_w, are located in San Luis City, and the last three sites, JZ1_w, JZ2_w, and JZ3_w, are located near the end of the river course, prior to infiltration. These sites are related to different anthropogenic impacts: rainwater (JE_w), truck washes (JG_w) or water waste (JPO_w) discharges, recreational purposes (JD_w, JE_w, and JF_w), river sand mining (JC_w and JG_w), or animal husbandry (JZ1_w, JZ2_w and JZ3_w). Figure 1 shows the sampling sites that were monitored, and Table 1 shows the location and description of studied sampling sites. Selected sites represented a gradient of urbanization and human disturbance, ranging from areas with minimal disturbance—where the environmental baseline was determined—to sites with increasing levels of human impact.

Fig. 1
figure 1

Sampling sites along the San Luis River. SL, San Luis City; JK, Juana Koslay City

Table 1 Location and description of studied sampling sites from the San Luis River

The climate of the region is continental, with an annual rainfall regime of 595.4 mm and a monthly average of 49.7 mm (Ledesma and Arrellano 2022). Sampling was conducted during the 2015–2017 period based on rainfall measurements that defined dry and wet seasons. The period from May to October was defined as the dry season, and the period from November to April was the wet season. A total of 80 samples were collected.

Water sample analysis

Water samples were collected, preserved, transported, and analyzed according to Standard Methods for Water and Wastewater (APHA 2017). Parameters such as pH, conductivity (CONDUCT.), total dissolved solids (TDS, SM 45040), temperature (T), and dissolved oxygen (DO) were measured in situ with portable equipment. Other five parameters were determined in the laboratory: turbidity (TURBIDITY) (S.M. 2130-B); total suspended solids (TSS, S.M. S.M. 2540-F); organic matter (OM) (DQO S.M.5220-B); phosphorus (P, S.M. 4500-PO43−E); NITRATE (S.M-4500-NO3E). The parameters were expressed in milligrams per liter, except conductivity (μS cm−1), pH, and turbidity (NTU).

Simplified index of water quality

SIWQ is obtained from a simple formula that combines five physicochemical parameters: OM, TSS, DO, conductivity, and T to provide a quick and intuitive idea of water quality. SIWQ varies between 0 (minimum water quality) and 100 (maximum water quality) and was proposed by Queralt (1982). SIWQ can be calculated according to the equation: SIWQ = E (A + B + C + D); where E is a factor that depends on temperature and its values range from 0.8 to 1, A is a factor that depends on OM, its values range from 0 to 30, B is a factor that depends on TSS, its values range from 0 to 25, C is a factor that depends on DO, its values range from 0 to 25, and D is a factor that depends on conductivity, its values range from 0 to 20 (Queralt 1982). Other authors (Alonso Duré 2013; I. Bustamante et al. 2002; Bustamante 1989; Losada Benavides et al. 2020) modified the order and number of categories, and currently, water quality can be classified into five categories according to its potential uses: (a) 0 ≤ SIWQ ≤ 30 Navigation, very restricted use; (b) 30 ≤ SIWQ ≤ 45 water only suitable for forest irrigation; (c) 45 ≤ SIWQ ≤ 60 water suitable for irrigation and industry; (d) 60 ≤ SIWQ ≤ 85 water suitable for swimming and fishing; and (e) 85 ≤ SIWQ ≤ 100 is considered all-purpose water (Losada Benavides et al. 2020). A SIWQ of 60 is the minimum acceptable value for the water quality of a river (López Fernández et al. 1998).

Statistical analysis

Exploratory/descriptive multivariate statistical analysis

Exploratory multivariate analyses were conducted with the opensource software R (version 4.1.0). Multivariate statistics were used to analyze environmental data (Aldás and Uriel 2017; Dormann 2020) and compare the spatial and temporal values of the studied physical–chemical variables.

Principal component analysis

PCA (Aldás and Uriel 2017; Dormann 2020) was performed aiming to extract significant components, reducing the contribution of variables with lesser significance. This technique reduces information provided by the variables to their essential features and transforms them into components (linear combinations of the original variables) that represent the proportion of variance explained by all variables (Greenacre et al. 2022).

Hierarchical cluster analysis

This method helps to identify natural associations in the dataset of river water quality and uses dendrograms to represent the similarity pattern between sampling sites (Aldás and Uriel 2017; Dormann 2020; Gupta et al. 2022). This descriptive statistical technique was applied by using Ward’s method (Ward 1963; Barakat et al. 2016) and the Manhattan distance technique in order to maximize homogeneity within the groups. Ward’s method considers the increase in the squared error as the proximity between two clusters and is the most common method to categorize groups. A dendrogram represents the clusters and their proximity with a reduction in the dimensionality of the original data (Barakat et al. 2016).

Heatmaps

Heatmaps were also used to simultaneously visualize groups of samples and their characteristics. A heatmap is another way to show hierarchical clustering, in which data values are transformed into a color scale. Dendrograms were combined with a heatmap to allow visual identification of possible characteristic patterns within each cluster.

Inferential statistical analysis

Correlation matrix

Inferential statistical techniques and individual and multiple correlation analyses were performed to explore the relationships between physical–chemical and elemental variables, resulting in correlation matrices. Pearson’s correlation coefficient (r) was employed to measure the extent of linear association between the parameters, which is useful for assessing the degree of dependence of one variable on others (Singh et al. 2020). Independent variables were selected by eliminating those that showed a strong correlation with each other (collinearity).

Multivariate analysis of variance

The MANOVA technique was carried out for an inferential study. However, not all parameters met the normality and homoscedasticity conditions for the application of MANOVA. Thus, the MANOVA.WIDE function in R was employed. This more robust technique calculates Wald-type statistics (WTS) and modified statistic ANOVA type (MATS), which are versions of these statistical tests designed for semiparametric multivariate data (Aldás and Uriel 2017; Dormann 2020).

Multiple linear regression models

Finally, multiple linear regression models (MLRM) (Harrel 2015) were applied in order to study the potential relationship between a dependent variable and a series of explanatory variables. The multivariate regression model can be expressed using the equation: Y = β0 + β1 X1 + β2 X2 + … + βm Xm, where Y is the dependent variable, X1, X2, …, Xm are the independent variables, predictor variables, or regressors, β1, β2, …, βm are the coefficients of the model, and m is the number of independent parameters considered in the regression.

The selection of the regressor variables was carried out by progressive elimination, until a model with an adjusted coefficient of determination (R2) greater than 50%, and a confidence level of 90% or 95% was obtained. R2 determines the percentage of the variance of the dependent variable explained by the regression model. The corresponding residual tests were carried out: the Durbin–Watson test was used to measure independence, the Shapiro–Wilk test was applied to measure normality, the Breusch–Pagan test was used to measure homoscedasticity, and the Variance Inflation Factor (VIF) test was applied. The Grubbs test was applied to detect outliers. The results of these tests were verified by graphs like the normal Q–Q plot and histogram to assess normality, residuals versus fitted values to test homoscedasticity, boxplots, and distance of Cook to detect outliers and influential values in the model. Standard residual error and adjusted R2 were also taken into account to determine an optimal fit.

The accuracy of the model was tested using cross-validation methods, specifically the Validation Set Approach. This method evaluates the ability of the model to predict the outcome of new unseen observations not used to build the model. The validation set approach consists of dividing the data into two sets: the training set, used to train or build the model, and the testing or validation set, used to test the model by estimating the prediction error (new data from SLR and data from different rivers with similar characteristics). Subsequently, the prediction error is quantified as the mean squared difference between the observed and the predicted outcome value (Kassambara 2018). The statistics Root Mean Square Error (RMSE), R2, and error rate were used to predict error. RMSE measures the average error of the model in predicting the outcome of an observation and is selected because it is an absolute measure of goodness of fit; the lower the RMSE values, the better the model. R2 was chosen based on the fact that it is a relative measure of the fit of the model to the dependent variables. It represents the correlation between the observed values and the predicted values. Higher R2 values indicate a better model (James et al. 2017). The learning curves of the model were plotted as log RMSE vs. iterations. Log RMSE was used for a better appreciation of the differences between both curves (Viering and Loog 2021).

Results and discussion

Water samples and SIWQ

As described before, water samples were obtained from the San Luis River, and ten physicochemical variables were determined on each sample. The average values of all variables determined are shown in Table 2.

Table 2 Average and standard deviation values of physicochemical variables determined in water samples from sites of the San Luis River

Average values of OM, TSS, DO, conductivity, and T were used to calculate an average SIWQ in order to classify water quality at each sampling site. Table 3 displays the obtained SIWQ values and the classification of water samples from each studied site.

Table 3 Average SIWQ determined and classification of water for each studied sampling site of the San Luis River

SIWQ values indicate a greater deterioration in water quality at sites JPO_w, JZ1_w, JZ_w, and JZ3_w. Water quality degradation is mainly affected by variables related to the discharge of municipal effluents from the city of San Luis (OM, conductivity, P, and nitrate). Additionally, medium-quality site JG_w could be affected by specific anthropogenic activities such as sand mining, truck wash discharges, and runoff (Borgatello 2014; Castro et al. 2021). The average SIWQ value for the JC_w site is barely higher than the medium quality limit, meaning that a small variation of any physicochemical variable can affect its water quality.

Considering the information provided by the SIWQ index and the recreational use of the San Luis River in some sections, it would be advisable to conduct further studies by applying the Recreational Water Quality Index (Almeida et al. 2012).

Physicochemical variables determined in water samples were analyzed using multivariate statistics to reduce the number of significant variables and develop a faster and more cost-effective predictive model for monitoring river water quality. Correlation matrix and exploratory analysis were performed using average data (Table 2), whereas inferential analysis was performed using the entire dataset (data not shown), not just average data.

Physicochemical variables correlation

A correlation matrix was calculated to explore associations among the different physicochemical variables in the dataset under study (Fig. 2). There was a positive and statistically significant correlation between the parameters P and OM (0.98), conductivity and P (0.89), and P and TDS (0.90) with p values < 0.001, between pH and DO (0.84), OM and TDS (0.83), and OM and conductivity (0.82) with p values < 0.01, between OM and nitrate (0.72) with p value < 0.05, and finally, between P and nitrate (0.60) with p values < 0.1. Figure 2 also depicts a negative and statistically significant correlation between the parameters DO and OM (−0.92), P and DO (−0.97), DO and conductivity (−0.95), MO and pH (−0.96), pH and P (−0.93), and OD and TDS (−0.95) all with p values < 0.001, and between pH and nitrate (−0.78), pH and conductivity (−0.74), and pH and TDS (−0.74) both with p values < 0.01. A correlation coefficient of 1 was observed between the variable conductivity and TDS, as well as between turbidity and SST.

Fig. 2
figure 2

Correlation matrix of physicochemical variables of water samples from the San Luis River: scatterplots, values, and statistical significance of each correlation coefficient (Pearson correlation coefficients [r]). Signification codes: 0 ***0.001; **0.01; *0.05; “.” 0.1; “ ” 1

Spatial and temporal similarity and site grouping

PCA was then applied using average data to associate the parameters that characterized the groups (Fig. 3).

Fig. 3
figure 3

Biplot showing the projections of the variables in the first two PCs and the distribution of the sampling sites from the San Luis River

The PCA resulted in the selection of two principal components, which represent 88.1% of the variability data. PC1 (Dim1) explains 72% of the variability. The linear combination of the original parameters that originated this component is:

$$\textrm{PC}1=-0.4243\ pH-0.1099\ \textrm{Turbidity}+0.3880\ \textrm{Conductivity}-0.4251\ \textrm{DO}+0.4407\ \textrm{OM}+0.2940\ \textrm{Nitrate}+0.4427\ P$$

Data show that conductivity, OM, P, DO, and pH are the variables with the greatest contribution to PC1. Figure 3 shows conductivity, OM, and P present a positive correlation with each other, with a higher correlation between OM and P, while DO and pH present a positive correlation with each other and a negative correlation with the parameters conductivity, OM, and P.

The second component, PC2 (Dim2), represents 16.1% of the variability of the data and is the result of the following linear combination of the original parameters:

$$\textrm{PC}2=0.2388\ \textrm{pH}-0.7527\ \textrm{Turbidity}+0.2687\ \textrm{Conductivity}-0.1995\ \textrm{DO}-0.0846\ \textrm{OM}-0.5100\ \textrm{Nitrate}+0.0320\ \textrm{P}$$

Turbidity and nitrate contribute significantly to this component. As indicated by the length of the vectors representing these parameters in Fig. 3, turbidity and nitrate exhibit a greater variability compared to the other variables. The parameters that contribute to PC1 are related to water pollution resulting from the organic load, while those contributing to PC2 are linked to pollution originating from land use in areas near the river (Rentier and Cammeraat 2022), with turbidity being the parameter with a major contribution to PC2.

The grouping of sites in Fig. 3 reveals that post-discharge sites (JZ1_w, JZ2_w, and JZ3_w) are characterized by OM, P, and conductivity. Conversely, JA_w, JB_w, JD_w, JE_w, and JF_w sites are associated with DO and pH, indicating a stronger connection to PC1. The discharge site JPO_w is related to the nitrate variable, while the JC_w and JG_w sites are linked to the turbidity parameter, representing PC2.

As part of the initial exploratory analysis, a dendrogram was created to determine site grouping. We used the Ward method with the Manhattan distance technique to produce the dendrogram shown in Fig. 4.

Fig. 4
figure 4

Dendrogram based on hierarchical clustering (Ward’s method) for all studied sites in the San Luis River

Figure 4 shows that all monitoring sites were classified according to four clusters: the first group includes JC_w and JG_w sites and is related to the second group, which consists of JA_w, JB_w, JE_w, JD_w, and JF_w sites. JPO_w site, the only one in the third group, is related to the fourth group, which consists of JZ1_w, JZ2_w, and JZ3_w sites.

The first and second clusters constitute the first branch of this dendrogram. Both clusters present similarities, although to a lesser degree. Sites grouped in the first cluster present similar characteristics due to the presence of comparable anthropogenic disturbances in the upstream sampling sites, such as sand mining, truck washes, and household sewage effluent discharges. In the second cluster, JA_w and JB_w sites, located in the city of Juana Koslay, show greater similarity with each other rather than JE_w, JD_w, and JF_w sites located in a river segment that flows through a highly urbanized area of the city of San Luis.

The third and fourth clusters constitute the second branch of this dendrogram. The JPO_w site corresponds to an effluent discharge and integrates a conglomerate by itself. However, the municipal effluent discharge site JPO_w and sites downstream, JZ1_w, JZ2_w, and JZ3_w, exhibit a great similarity, suggesting that the river cannot naturally recover from the impact of organic pollution.

The clustering outcomes were illustrated using both a dendrogram tree and a heatmap. The heatmap technique combines information from dendrograms obtained for sites and variables. Results performed using average data are shown in Fig. 5, allowed us to determine which parameters characterized or gave rise to the clusters obtained in the dendrogram technique.

Fig. 5
figure 5

Heatmap with dendrogram tree represents physicochemical variables values among sampling sites in the San Luis River

Figure 5 shows that JA_w and JB_w, reference sites with low-moderate anthropogenic activity, have nearly all the analyzed parameter values lower or equal to those of sites upstream of the municipal effluent discharge (JC_w, JD_w, JE_w, JF_w, and JG_w). Sites JC_w and JG_w exhibit higher turbidity values compared to other sites located prior to the aforementioned effluent discharge. This can be attributed to activities related to sand mining (Da and Le Billon 2022; Rentier and Cammeraat 2022) that exist in areas upstream. The exploitation of sand has increased in the area due to the high demand for this resource, primarily driven by its various uses, especially in construction (Da and Le Billon 2022). This increased demand can be attributed to the growth in population (Kemgang Lekomo et al. 2021). The continued resuspension of sediments resulting from in-stream sand removal affects local and downstream water quality. The presence of a certain amount of suspended solids is beneficial for the river system since suspended sediment and nutrients contribute to the ecosystem of the river. However, a large concentration of suspended solids leads to increased turbidity in the main channel, negatively affecting the stream of the river. Furthermore, during sand mining, fuel spills and leaks from excavation machinery and transport vehicles can occur. These contaminants can be adsorbed by sediments and transported downstream (Juez and Franca 2022; Rentier and Cammeraat 2022).

Data show that land uses and anthropogenic activities (sand mining, effluent discharges, etc.) are related to water quality, particularly with changes in variables such as turbidity (Lu et al. 2023; Nasrabadi et al. 2016; Obisesan and Christopher 2018; Yan et al. 2023), nutrient concentrations (P, nitrate, among others), etc. (Lu et al. 2023). Other authors who studied ponds receiving clandestine discharges (Fontanarrosa et al. 2023) proposed similar results.

The group of sites located downstream of the JPO_w site is associated with higher values of parameters like OM, conductivity, and P, inferring that the discharge of poorly treated municipal effluents (Giorda 2021) has an impact on water quality. Nitrate is the most relevant parameter at the JPO_w site. High concentrations of nitrate are mainly due to household effluents, according to various studies (Chilundo et al. 2008; Pisani et al. 2020; Schirmer et al. 2013; Sikakwe et al. 2020). Both urban areas and crops are important contributors of nitrogen and phosphorus in water bodies (Wang et al. 2023).

This same technique was used with average data for wet and dry season parameters separately (data not shown). We observed that the JPO_w site presented the highest concentration of nitrate compared to the other sites for both dry and wet seasons. P and MO variables exhibited higher concentrations for sites after JPO_w for both periods. In addition, the conductivity of sites downstream of JPO_w is higher in the wet season. This could be explained by considering land use in the area associated with livestock.

Inferential multivariate analysis

After conducting exploratory or descriptive multivariate analysis and inferring differences in site groupings and parameter influences, we proceeded with inferential multivariate analysis. This study was performed in order to define whether variations in the studied parameters for the sampled sites were statistically significant or not.

Inferential analysis (Dormann 2020) requires, as a necessary condition, to verify that the data have a normal distribution and homogeneity of variance. These requirements were verified using Royston’s multivariate normality tests (Trujillo-Ortiz and Hernandez-Walls 2007), obtaining a p value of 1.17 × 10−36. In addition, the Lilliefors (Kolmorogov–Smirnov) univariate normality test was applied for samples with n > 50, obtaining p values < 0.01 for conductivity, MO, OD, turbidity, nitrate, and P variables, and a p value of 0.39 for pH variable. The Cramer–von Mises test (Baringhaus and Henze 2017) was also applied, obtaining the same results and a p value of 0.226 for the variable pH. Considering this, we can conclude that only this variable exhibits a univariate normal distribution at the significance level of 0.01.

A multivariate homoscedasticity test was performed by applying Box’s M-test for homogeneity of covariance matrices, both for site and period, obtaining a p value < 2.2 × 10−16 for both cases. Considering p values are less than 0.05, they indicate there is no homogeneity in site or period.

The MANOVA.WIDE function from R software was used to calculate WTS and MATS. Both statistics reported a significant difference between period, site, and the site–period interaction, with a significance level of less than 0.001.

Multiple linear model to monitoring San Luis River

Based on the multivariate statistical analysis carried out, a model that allows monitoring the water quality of the San Luis River was proposed. This model, named the multiple linear model for monitoring San Luis River (MRSL), was developed considering OM as the dependent variable. Two reasons lead us to select this variable as a dependent one: on the one hand, it represents organic load; and on the other, this variable was clearly modified in river sections most affected by anthropogenic disturbances, as can be seen in PCA, heatmap, and inferential analysis.

The proposed multiple linear regression model evaluates the impact of each predictor (independent variable), considering the influence of all other predictors simultaneously. The categorical variable site was included in the model since it was significant. Values of categorical variable site are: 0 for site JA_w, 0.1429 for site JB_w, 0.9171 for site JC_w, −0.5170 for site JD_w, −0.4093 for site JE_w, −0.7866, −0.7866 for site JF_w, 0.9135 for site JG_w, −0.9654 for site JPO_w, 10.6203 for site JZ1_w, 2.6424 for site JZ2_w, and 7.394 for site JZ3_w.

The achieved model is determined by the following equation:

$$\textrm{OM}=1.631+1.47\kern0.75em \textrm{Nitrate}+4.204\kern0.75em P-0.005\kern0.75em \textrm{Turbidity}+\textrm{Site}$$

The statistics of the multiple linear models obtained reveals that when all the physicochemical variables and the categorical variable site as predictors are included, the model can effectively account for 97.06% of the OM variance in the San Luis River. Furthermore, the F-test statistic of 178.8 indicates a high significance of the model, supported by a p value of 2.2 × 10−1, which demonstrates a good association between the predictors and the OM variable. The physicochemical variables P, nitrate, turbidity, and the sites JZ1_w and JZ3_w have a statistically significant relationship with the response variable OM (p values < 0.001).

As evidenced by R2, the multiple linear regression model formulated by MRSL demonstrated that it effectively captured 97.06% of the overall variability of the variable explained by de-regression. The analysis of variance pertaining to the dependent variable OM (P: p value < 2.2 × 10−16, nitrate: p value < 2.2 × 10−16, turbidity: p value = 0.000596, and site: p value = 6.084 × 10−7) indicated that all physicochemical variables and the categorical variable site were statistically significant at the 0.01 level. Several authors have documented models that employ organic load as a basis for predicting parameters of water bodies, with the selection of dependent variables associated with organic load (Jiang et al. 2021) or parameters such as DO (Ahmed 2017; Ouma et al. 2020) or BOD (Vigiak et al. 2019), which serve as dependable indicators of water quality.

One noteworthy aspect of this work is that it helps bridge the gap in scientific data on the physicochemical parameters of the San Luis River, making this research a valuable study to initiate analysis of the area. The effort and costs related to sampling in order to obtain data from each site can be reduced by using this model capable of predicting parameter values such as OM.

The ability to predict OM is of great importance because it poses challenges for drinking water treatment plants. The main issues associated with OM include degradation of organoleptic quality, bacterial growth in the distribution network, and significant chlorine consumption during disinfection (LeChevallier 1990).

As shown in Fig. 6, the obtained model adheres to the necessary assumptions of normality, homoscedasticity, and independence of residuals. This is because it does not contain outliers, and there is no discernible pattern in the residual distribution. These assumptions were confirmed using the Kolmogorov–Smirnov test (D = 0.51, p value = 1), the studentized Breusch–Pagan test (BP = 20.52, df = 13, p value = 0.08), and the Durbin–Watson test (DW = 1.86, p value = 0.06). In all cases, the tests confirmed normality, homogeneity of variance, and residual independence, with p values greater than the 0.01 significance level. The correlation matrix showed a correlation of 0.14 between nitrate and turbidity and a correlation of 0.50 between P and nitrate, confirming the absence of multicollinearity. The calculated VIF values were all below 10, providing further evidence against multicollinearity. Cook’s distance values were also below 1.5, indicating no significant influence of the variables. Additionally, the Grubbs test did not detect any outliers (p value = 0.2845).

Fig. 6
figure 6

Verification of residuals homoscedasticity and independence (top left), residuals normality: Q–Q plot (top right), residual boxplot (bottom left), and influence plot (bottom right)

The MRSL model obtained is parsimonious (few parameters, but it fits well, is easier to understand, and has strong predictive ability), and it satisfies all the necessary assumptions for this type of multiple linear regression. Table 3 shows the minimum and maximum values of each physicochemical variable that must be considered to apply the MRSL model successfully.

The independent variables in the MRSL model, nitrate, turbidity, and P, are associated with the anthropogenic activities discussed throughout the study. This suggests the model allows for the rapid identification of increases in the quantity or frequency of effluent discharge or activities related to soil removal impacting the river.

The MRSL model fulfilled the assumptions of normality, homoscedasticity, linearity, and independence; hence, it was used to predict OM in different rivers of the province. The evaluated rivers were San Luis (SLRL2022, SLRPO2022, SLRA2021, SLRI2021, SLRL2021, SLRAG2019), Los Molles (LMRHP2019), Rosario (RRLT2022, RRLT2019), Conlara (CRC2023, CRSR2023, CRPG2023, CRPG2022, CRPG2021, CRPG2019), Quines (QR2022B, QR2022A, QR2019, QR 2021), Nogolí (NR2022b NR2022a), and San Francisco (SFR2021, SFR2022). For this purpose, river sites with similar characteristics to those studied in this work were selected. The physicochemical parameters experimentally determined at each site were within the range stipulated in Table 4. OM values estimated using the model were compared to OM determined experimentally (Fig. 7). The results in the prediction of OM values were satisfactory.

Table 4 Statistics of each physicochemical variable used to develop the MRSL model
Fig. 7
figure 7

Dotchart comparing experimental values of OM (turquoise) and estimated values of OM using the MRSL model (blue) for sites located at different rivers in the Province of San Luis

Normality and homoscedasticity of the data were evaluated in order to verify if the difference between the experimentally determined OM values (OMExp) and the model-estimated OM values (OMEst) (Fig. 7) was significant. Shapiro–Wilks test was used to test normality, obtaining a p value = 4.322 × 10−08 for estimated OM and a p value = 1.655 × 10−07 for the experimentally determined OM, indicating the data were not normal. Homoscedasticity was evaluated using the F-test, obtaining a p value = 0.995. This p value indicates the data are homoscedastic. Although the data present homogeneity of variances when graphing the boxplots (Fig. 8), the presence of outliers is observed.

Fig. 8
figure 8

Box plots comparing experimental values of OM and estimated values of OM using the MRSL model for different rivers in San Luis Province

The non-normal distribution and the presence of atypical data (outliers) required the application of a robust test to detect differences between the OMExp and OMEst groups. The Yuen test (Mair and Wilcox 2020) was used for paired samples, obtaining a p value = 0.179. A p value > 0.05 indicates that there are no significant differences between the trimmed means of OMExp and OMEst (trimmed means difference: −0.32; 95% confidence interval between −0.7955 and 0.1555). Based on this, we can conclude that the proposed multiple linear models satisfactorily predicts OM, not only for samples from the San Luis River but also for samples from different rivers with similar characteristics.

The accuracy of the model was tested using the cross-validation (set approach method). Two different train and test datasets were selected, and R2, RMSE, and error rate were determined after 75 iterations. The values of these statistics were R2 = 0.999, RMSE = 0.44, and error rate = 3.9% for the training dataset, and R2 = 0.99, RMSE = 0.584, and error rate = 5.11% for the test dataset. The learning curves of the training and validation datasets (Fig. 9) indicate the model is neither under-fit nor over-fit. These statistics exhibit the accurate prediction power of the model for this urban river since an error of 0.58 mg L−1 of OM in a range of values from 2 to 50 mg L−1 is acceptable.

Fig. 9
figure 9

Learning curves of MRSL model (logRMSE vs. iteration).

Conclusions

In this study, it was evidenced through the SIWQ that the quality of the water in the San Luis River is affected by anthropogenic activities. Furthermore, the application of multivariate statistics made it possible to detect significant differences in the temporal and spatial variation of the variables under study.

The multiple linear regression model developed demonstrates robustness in predicting and monitoring the water quality of the San Luis River and different rivers in semi-arid areas with similar characteristics affected by organic load pollution. Results provide reference information for authorities responsible for the environmental management of the San Luis River regarding how anthropogenic activities in the river and its banks affect water quality.