Introduction

Water is vital to sustaining human activities. Human communities were first established and developed near riverbanks and lakes, where water and other natural resources were abundant and accessible (Mendes and Oliveira 2004). Since then, access to water has been necessary for human activities such as navigation, crop irrigation, power generation, agriculture, and domestic and industrial supply, and the relationship between rivers and cities permeates urban history. Because of terrain characteristics, urban development and the occupation of river basins tend to proceed from downstream to upstream, as can be observed in Europe and Brazil (Febvre 1994; Le Goff 1998; Tucci 2005; Magris 2007).

Increasing urban, economic, and technological expansion has caused a series of changes in natural ecosystems and, consequently, environmental instability. One consequence of this dynamic is the emergence of urban voids (Busquets 1996; Santos 2005). These areas end up being used for rainwater retention but also become sites of improper waste disposal, from which pollutants can reach rivers. Adequate public policies for land use and occupation and for urban water management are therefore necessary.

Different methods have been used to classify water quality, the main one being the comparison of physicochemical and biological parameters to identify pollution sources (Sotomayor et al. 2018; Yaseen et al. 2018; Şener et al. 2017; Gupta et al. 2017; Studer et al. 2017; Chapman et al. 2016; Almeida et al. 2008; Kannel et al. 2007a, b; Lumb et al. 2006; Debels et al. 2005; Abbasi 2002; Wills and Irvine 1996; Palupi et al. 1995; Sharifi 1990). Evaluating water quality indirectly, through representative parameters and their constituents, makes it possible to identify the likely polluter.

A number of studies have been carried out using the water quality index (WQI) (Acharya et al. 2018; García-Ávila et al. 2018; Ewaid and Abed 2017; Gupta et al. 2017; Abdel-Satar et al. 2017), including spatial analysis (Hoover et al. 2018; Abbasnia et al. 2018; Xu et al. 2012), correlation between the parameters and variables (Sallam and Elsayed 2015; Xia et al. 2012), and multivariate statistics (Ni et al. 2018; Diamantini et al. 2018; Salomons and Ostfeld 2017; Zeinalzadeh and Rezaei 2017; Azhar et al. 2015), with the objective of evaluating, classifying, and identifying the main sources and constituents of pollution in surface water and groundwater. However, one difficulty is that missing data values (data gaps) may limit the quality of the statistical analysis. Most statistical software packages either replace missing values with the mean of the variable or prompt the user for case-wise deletion of analytical data, neither of which is desirable. Either approach can bias statistical analyses if the missing values represent a significant share of the data being analyzed (Cüneyt Güler et al. 2002).

In assessing water quality in a lake, Zhao et al. (2012) used principal component analysis (PCA) to obtain information about water quality and pollution sources. Guedes et al. (2012) used multivariate factor analysis/principal component analysis (FF/PCA) to identify the groups of pollutants present. To characterize groundwater quality, Bodrud-Doza et al. (2016) used water assessment indices, multivariate statistics, and geostatistics. Barakat et al. (2016) evaluated the contribution of surface water quality parameters and identified the contamination that affects water quality and its potential sources using a correlation matrix and the multivariate techniques of PCA and cluster analysis (CA).

Most studies assess water quality using one or two statistical techniques, and only in rural or urban regions, without integrating them to identify pollution sources. This study therefore seeks to relate the water quality index (WQI) to land use and occupation and to the seasons of the year, using a correlation matrix and multivariate techniques (PCA and CA), not only to identify pollution sources but also to evaluate and classify surface water quality. The aim is to propose a methodology for integrated decision-making in territorial and environmental public policies, even with data gaps.

Materials and methods

Region of study

The municipality of Campo Grande, with an area of 8092.95 km², is located in the State of Mato Grosso do Sul in the Center-West Region of Brazil (Fig. 1). The current estimated population is 874,210 inhabitants, with a population density of 97.22 inhab·km⁻²; the urban population represents 98.66% and the rural population 1.34% (IBGE 2017).

Fig. 1 Location and sampling sites in Campo Grande. Source: The authors

The local economy is mainly based on agribusiness, with soybeans and corn as the main agricultural crops. Cattle raising is another important activity, supplying meat to local slaughterhouses and to other Brazilian states. The city has industrial and business centers dedicated to commerce and to the manufacture of various products such as food, beverages, leather and tannery goods, non-metallic mineral products, and fertilizers.

Pastures occupy large extensions of land in the municipality. However, the physiognomy has been changing due to the introduction of soybean and corn cultivation. Elevation in the municipality varies between 500 and 675 m. The municipality seat is located in the vicinity of the Paraná and Paraguay River Basins, predominantly in the Paraná River Basin, except for a small northwest portion of its territory located in the Paraguay River Basin (CAMPO GRANDE 2017).

According to the Köppen classification, the city has a rainy tropical savanna climate (subtype Aw), characterized by two well-defined seasons: a dry winter with an average temperature of 15 °C and a rainy summer with an average temperature of 36 °C. Annual rainfall is 1400 mm. Approximately 75% of the rain falls between October and April, and water deficits are greatest in June, July, and August, with August being the driest month (Peel et al. 2007).

Water quality samples and measurements

The water samples at the 80 monitoring points were collected every 3 months between 2010 and 2011 under Córrego Limpo (Clean Stream), a water quality program for urban and peri-urban areas. The program selected the collection points based on the interest of the public administration in identifying potential clandestine effluents from industries and urban subdivisions, and on proximity to rural areas with agriculture and livestock, among others. However, during the monitoring period, limited resources for collection and analysis made it necessary to reduce the number of analyses or campaigns at some monitoring points. To assess the municipality’s surface water quality, the analyses followed the Standard Methods for the Examination of Water and Wastewater (APHA 2012). The following parameters were analyzed: pH, temperature (T), dissolved oxygen (DO), concentration of total nitrogen (TN), total phosphorus (P), electrical conductivity (EC), chemical oxygen demand (COD), biochemical oxygen demand (BOD), total solids (TS), turbidity (Tu), total coliforms (TC), and E. coli. In addition to these parameters, the following were considered: the presence or absence of odor, rainfall in the last 24 hours (RAIN 24hs), the condition of the river banks (LEFT.M; RIGHT.M), classified as preserved or deforested, and the elevation of each point (ELEV).

The waters in the national territory are divided by salinity into freshwater, brackish, and saline waters, and by use into 13 classes. Freshwater is divided into a special class, intended for nobler uses such as domestic supply with prior or simple disinfection, and classes 1 to 3, intended for domestic supply after conventional treatment and for less noble uses such as fishing, landscape harmony, and crop irrigation, among others. CONAMA Resolution 357/2005 provides the classification of water bodies and environmental guidelines and establishes conditions and standards for effluent discharge; all of the streams in Campo Grande are classified under this legislation (CONAMA 2005). The water quality standards for each intended use, as defined in current Brazilian legislation, are presented in Table 1.

Table 1 Values referring to CONAMA Resolution 357/2005

WQI calculation

The water quality index (WQI) was calculated following the National Sanitation Foundation (NSF) method, as adapted by the Environmental Company of São Paulo (CETESB). Nine parameters representing the characterization of water quality were considered. The WQI takes values between 0 and 100, divided into five groups: 0–19, Poor; 20–36, Bad; 37–51, Acceptable; 52–79, Good; and 80–100, Great. The index is the product of the quality scores (qi) of the variables, each raised to its respective weight (wi), as shown in Eq. (1) (CETESB 2018):

$$ \mathrm{WQI}=\prod \limits_{i=1}^n{q}_i^{w_i} $$
(1)

The qi was obtained as a function of the concentration of each parameter, based on the CETESB average quality variation curves. The wi is the weight attributed to each parameter according to its importance for quality. The weights sum to 1, as shown in Eq. (2), where n is the number of parameters:

$$ \sum \limits_{i=1}^n{w}_i=1 $$
(2)
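
As an illustration of Eqs. (1) and (2), the weighted product can be sketched in Python. The weights below are those commonly reported for the CETESB adaptation of the NSF index; they are an assumption here, since this section does not list them, and should be checked against CETESB (2018). The qi scores in the usage lines are hypothetical (in practice each qi is read from the CETESB quality curves), and the handling of non-integer values at class boundaries is likewise an assumption.

```python
import math

# Assumed weights (w_i) for the nine parameters; verify against CETESB (2018).
WEIGHTS = {
    "DO": 0.17,
    "coliforms": 0.15,
    "pH": 0.12,
    "BOD": 0.10,
    "T": 0.10,
    "TN": 0.10,
    "P": 0.10,
    "Tu": 0.08,
    "TS": 0.08,
}


def wqi(q_scores, weights=WEIGHTS):
    """Eq. (1): product of q_i ** w_i, with each q_i on a 0-100 scale."""
    if set(q_scores) != set(weights):
        raise ValueError("q_scores and weights must cover the same parameters")
    # Eq. (2): the weights must sum to 1.
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1")
    return math.prod(q_scores[name] ** w for name, w in weights.items())


def classify(value):
    """Map a WQI value to the five classes listed in the text.

    Boundary handling for non-integer values is an assumption.
    """
    if value > 79:
        return "Great"
    if value > 51:
        return "Good"
    if value > 36:
        return "Acceptable"
    if value > 19:
        return "Bad"
    return "Poor"


# Hypothetical q_i scores for one sample (not taken from the study).
sample = {"DO": 60, "coliforms": 15, "pH": 90, "BOD": 45, "T": 94,
          "TN": 70, "P": 40, "Tu": 75, "TS": 85}
print(round(wqi(sample), 1), classify(wqi(sample)))
```

Because the index is a weighted geometric mean, a single very low qi (here the hypothetical coliform score) pulls the result down more than it would in an arithmetic average.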

Correlation analysis

The Pearson correlation coefficient (r), which ranges from −1 to 1, measures the degree of linear correlation between two quantitative variables (Pearson 1895) and is widely used in the sciences. This coefficient was chosen because of its wide use in studies that evaluate the correlation between water quality and its variables (Hamzaoui-Azaza et al. 2011; Parizi and Samani 2013; Yu et al. 2016). A coefficient close to −1 or 1 indicates a strong negative or positive relationship between the variables, respectively, and 0 indicates no linear relationship. The coefficient is given by Eq. (3):

$$ r=\frac{\sum \left({x}_i-\overline{x}\right)\left({y}_i-\overline{y}\right)}{\sqrt{\left(\sum {\left({x}_i-\overline{x}\right)}^2\right)\left(\sum {\left({y}_i-\overline{y}\right)}^2\right)}} $$
(3)
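
Eq. (3) translates directly into a few lines of Python. The sketch below uses only the standard library; the DO and P values in the usage lines are hypothetical and serve only to show the call.

```python
import math


def pearson_r(x, y):
    """Eq. (3): Pearson correlation between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("r is undefined when a variable is constant")
    return numerator / denominator


# Hypothetical paired measurements (not taken from the study).
do_mg_l = [7.1, 6.4, 5.0, 3.8, 2.9]
p_mg_l = [0.05, 0.08, 0.15, 0.22, 0.30]
print(round(pearson_r(do_mg_l, p_mg_l), 2))
```

A constant variable has zero variance, so the function raises an error rather than returning a misleading value.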

Correlation analysis requires quantitative data. The categorical rainfall and riverbank variables were therefore transformed into dummy variables, with each category assigned a numeric value of 0 or 1.
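
The dummy coding can be sketched as follows. Which category receives the value 1 is an assumption made for illustration (the text states only that the values are 0 or 1), and the category labels and the helper name `to_dummy` are hypothetical.

```python
def to_dummy(values, positive_label):
    """Return 1 where the category equals positive_label, otherwise 0."""
    return [1 if value == positive_label else 0 for value in values]


# Hypothetical field records for four sampling points.
left_margin = ["preserved", "deforested", "preserved", "deforested"]
rain_last_24h = ["yes", "no", "no", "yes"]

LEFT_M = to_dummy(left_margin, "preserved")   # assumed: preserved -> 1
RAIN_24HS = to_dummy(rain_last_24h, "yes")    # assumed: rain -> 1
print(LEFT_M, RAIN_24HS)
```

Swapping the coding (for example, deforested as 1) only flips the sign of the resulting correlation coefficients, not their magnitude.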

Statistical Analysis

Principal component analysis (PCA) is a multivariate technique that quantifies the significance of variables and helps explain clusters by transforming the data into a reduced set of new, uncorrelated (orthogonal) variables called principal components (PCs) (Shrestha and Kazama 2007; Osei et al. 2010). The results are interpreted through the projection of each sample onto the axes defined by the principal components (Dim), called the “score,” and the coefficient of each variable in the linear combination, called the “loading” (Gibson et al. 2018). The physicochemical parameters used in water quality assessment differ greatly in their measurement and concentration units, which influences the statistical results. Therefore, before the PCA was conducted, the parameters were standardized (z-scale) to minimize the influence of the different variables and their units. Cluster analysis (CA), another multivariate technique, was applied for pattern recognition. CA classifies the data into categories, and its results show the homogeneity or heterogeneity of the data within the clusters formed (Vega et al. 1998; Kazi et al. 2009).
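
The z-scale standardization applied before the PCA can be sketched as below. The use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not say which form was used, and the turbidity values are hypothetical.

```python
import statistics


def z_scale(values):
    """Standardize a variable to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumed)
    if sd == 0:
        raise ValueError("a constant variable cannot be standardized")
    return [(value - mean) / sd for value in values]


# Hypothetical turbidity readings (NTU) for five samples.
turbidity = [12.0, 35.0, 8.0, 60.0, 20.0]
print([round(z, 2) for z in z_scale(turbidity)])
```

After this step every parameter is dimensionless, so a variable measured in large units (for example, coliform counts) no longer dominates the components simply because of its scale.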

Results and Discussion

Quality analysis in accordance with Brazilian legislation

Table 2 compares the results obtained with CONAMA Resolution 357/2005, which classifies surface water bodies. All the streams within the municipality are classified as Class 2, except for the Imbirussu stream (IMB) and its tributaries, which are classified as Class 3. Less than 4% of the Tu results were above the established level, as were less than 2% of the pH results, less than 16% of the TN results, and less than 25% of the DO results. On the other hand, 84% of the E. coli results, 64% of the BOD results, and 53% of the total phosphorus (P) results were above the established level. Fecal pollution and domestic and industrial wastes (mainly of organic origin) were responsible for the high concentrations of these parameters, which are in disagreement with the legislation.
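
Percentages of this kind can be computed as the share of valid results above a limit. The sketch below is illustrative only: the BOD values and the limit of 5.0 are hypothetical (the applicable limits are those in Table 1), missing results are represented by `None` and skipped, which is an assumption about how data gaps are handled, and parameters with a minimum limit or an allowed range would need a different comparison.

```python
def exceedance_percent(results, upper_limit):
    """Percentage of non-missing results strictly above upper_limit."""
    valid = [r for r in results if r is not None]
    if not valid:
        raise ValueError("no valid results to evaluate")
    above = sum(1 for r in valid if r > upper_limit)
    return 100.0 * above / len(valid)


# Hypothetical BOD results (mg/L) with one missing campaign.
bod = [3.0, 6.0, None, 8.0, 4.0]
print(exceedance_percent(bod, upper_limit=5.0))
```

Dividing by the number of valid results, rather than by the number of planned campaigns, keeps the percentage comparable between points that were sampled a different number of times.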

Table 2 Relation between the collection sites of the 80 samples and the 12 physicochemical, bacteriological, and topographic parameters analyzed. Source: The authors.

The water quality characterization confirms a predominance of the Good WQI class; however, the Poor and Acceptable classes are also present, a characteristic observed in Class 3 water bodies. Souza et al. (2015) also analyzed water quality in the municipality of Campo Grande through temporal and spatial diagnosis and observed that, among the physicochemical variables analyzed, only fecal coliforms, total phosphorus, and total nitrogen presented values above the maximum levels established by legislation.

High concentrations of total phosphorus and fecal coliforms were also reported by Zucco et al. (2012) and Capoane et al. (2014), who attributed them to the presence of agriculture and livestock production in the river basin. Ferreira et al. (2017) analyzed the water quality in a quilombola community, found a high coliform index, and attributed this result to the lack of basic sanitation and inappropriate waste disposal.

Correlation between water quality parameters

The correlation coefficients confirmed the pattern of interrelation between the water quality parameters. Correlation values range from 0.10 to 0.90 (Fig. 2). Data related to odor and rain do not present any significant correlation. Strong (p < 0.01) and significant (p < 0.05) correlations were observed between pH and TN (0.53), LEFT.M and RIGHT.M (0.86), TC and E. coli (0.90), DO and TN and DO and P (−0.55), WQI and P (−0.6), WQI and ELEV (0.56), and WQI and DO (0.64). These results indicate that the nutrients phosphorus and total nitrogen contribute to DO decay. Fan et al. (2010) and Sharif et al. (2015) likewise characterized water quality conditions and confirmed a negative correlation between DO and the nutrients P and TN. The two margins generally have the same physical appearance, that is, both are deforested or both are preserved.

Fig. 2 Correlation matrix: the red and blue dots correspond to negative and positive correlations, respectively. Small dots with light colors represent lower intensity correlations, and larger dots with darker colors correspond to higher intensity correlations. Source: The authors

The parameters that most favored a higher WQI were dissolved oxygen and elevation, while total phosphorus contributed to a lower WQI. Sharma and Kansal (2011) also observed that DO positively impacts the WQI. Urbanization changes the landscape and hydrological dynamics: the impermeabilization of basins increases flooding, which occurs downstream from the urbanized areas, at lower elevations, and affects watercourse quality (Tucci 2002, 2008; Vargas et al. 2008). Kannel et al. (2007a, b) found that water quality downstream from urban areas is affected by urban discharges and that these areas are characterized by higher pollution, whereas the areas upstream from urban areas are rural regions where water quality is affected mainly by chemical fertilizers.

The coliforms present are of the fecal type, indicating fecal pollution from wastewater. Young and Thackston (1999), Mallin et al. (2000), and Schoonover et al. (2005) observed high coliform levels related to population density, with coliform counts in urban streams higher than those found in rural streams.

The pH is affected by nitrogen, which contributes to its decrease. Other studies also show this relationship, in which changes in the concentration of the various nitrogen forms cause pH changes and result in ammonia nitrogen levels that are toxic to fish, hindering gas exchange between the animals and the water (Alabaster et al. 1979; Arana 2004; Camargo and Alonso 2006).

Parameters selected by the PCA

Table 3 shows the statistical loading and score of each water quality PC for the sampled points, which help to explain the relationships between the variables and their impacts. The score is classified as negative or positive, and the loading can be classified, according to Liu et al. (2003), into three classes: “strong,” values higher than 0.75; “moderate,” 0.75–0.50; and “weak,” 0.50–0.30. The components were selected following Jolliffe (2002), who suggests that the cumulative percentage of the total variance should be between 70 and 90% to give a reasonable representation of the original variance. Based on this criterion, the principal components selected had a cumulative variance of 86.25%.
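
The loading classes of Liu et al. (2003) quoted above can be expressed as a small helper. Three points are assumptions made for this sketch: the sign of the loading is taken as what the text calls a positive or negative score, values exactly at 0.75 or 0.50 fall in the lower-named class, and the label "negligible" for magnitudes below 0.30 is hypothetical.

```python
def classify_loading(loading):
    """Return (sign, class) for a PCA loading using the thresholds in the text."""
    sign = "negative" if loading < 0 else "positive"
    magnitude = abs(loading)
    if magnitude > 0.75:
        strength = "strong"
    elif magnitude >= 0.50:
        strength = "moderate"
    elif magnitude >= 0.30:
        strength = "weak"
    else:
        strength = "negligible"  # hypothetical label for values below 0.30
    return sign, strength


# Hypothetical loadings (not taken from Table 3).
for name, value in {"DO": 0.81, "TN": -0.62, "Tu": -0.34, "pH": 0.12}.items():
    print(name, classify_loading(value))
```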

Table 3 Vectors of the covariance matrix and cumulative variance. Source: The authors

The first factor (PC1) accounts for 26.86% of the total variance, which showed a positive score and high, moderate, and weak loading of the DO, ELEV, TC, and E. coli, as well as a negative score and moderate loading of the TN and P, and a weak loading of the EC, Tu, and BOD. With 16.90% of the total variance, the PC2 presented a positive score and weak loading of TN and P and moderate loading of TC, E. coli, and pH, as well as a negative score and weak loading of Tu and BOD and moderate loading of T. The third factor (PC3) accounted for 11.10% of the total variance pointing to a positive score and weak loading of the ELEV and TS, and a negative score and weak loading of Tu, BOD, and TC and moderate loading of E. coli.

PC4 accounted for 8.99% of the total variance: the EC parameter had a negative score and weak loading, pH had a positive score and moderate loading, and the DO, Tu, and TS parameters had a positive score and weak loading. PC5, with 8.17% of the total variance, had a positive score and weak loading of pH and a negative score and strong loading of TS. The sixth factor (PC6), with 7.80% of the total variance, had a positive score with a strong loading of EC and a weak loading of Tu. PC7, with 6.43% of the total variance, had only a positive score and moderate loading of BOD.
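
The selection criterion can be checked by accumulating the variance percentages reported for PC1 to PC7 in the text. The sketch below uses only those seven figures; the variable names are illustrative.

```python
from itertools import accumulate

# Explained variance (%) of PC1..PC7 as reported in the text.
explained = [26.86, 16.90, 11.10, 8.99, 8.17, 7.80, 6.43]

cumulative = list(accumulate(explained))
total = round(cumulative[-1], 2)

# Cumulative-variance range of 70-90% attributed to Jolliffe (2002) in the text.
within_range = 70.0 <= total <= 90.0

# Number of components needed to first reach the 70% lower bound.
first_k = next(k for k, c in enumerate(cumulative, start=1) if c >= 70.0)

print(total, within_range, first_k)
```

The seven percentages add up to the 86.25% cumulative variance stated above, and the 70% lower bound is first reached at the fifth component (72.02%).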

Analyzing only the PCs with the highest variance (Figs. 3 and 4), it can be seen that PC1 is composed of the DO, TN, ELEV, P, and TC parameters with corresponding contributions of 20.84%, 16.89%, 16.53%, 14.44%, and 9.14%, respectively. PC2 is composed of E. coli, TC, T, and pH, with contributions of 19.12%, 19.12%, 16.60%, and 14.1%, respectively. Thus, the parameters that are close to the center have a lower score and contribution, and those that are far from the center have a higher score and contribution.

Fig. 3 Main components with greater significance. Source: The authors

Fig. 4 Clusters and contributions of variables. Source: The authors

In the study by Fan et al. (2010), which identified surface water quality characteristics, the PCA results likewise explained 86% of the variance, and the principal components were represented mainly by the DO, BOD, TN, and P parameters. Barakat et al. (2016) also used PCA to characterize water quality with 12 physicochemical parameters; the result represented 63% of the variance in the data set, with parameters such as pH, TN, E. coli, BOD, P, EC, and Tu contributing to the results.

Cluster Analysis

Of the 12 variables employed in this study, 7 are more significant, that is, they have a stronger relationship with the dimensions of the data. The data were grouped (Fig. 4) into 4 clusters, represented by the following categories: physical, chemical, biological, and topographical. Variables that point in the same direction and lie close together are highly correlated. Cluster 1 is formed by Tu, BOD, and T, representing the characteristics of effluents with a high organic load, a high suspended solid load, and a temperature above ambient (Benka-Coker and Ojior 1995; Haydar and Aziz 2009; Suthar et al. 2010).

Cluster 2 is formed only by the physical parameter TS, to whose load all contaminants present in the water, except dissolved gases, contribute. Cluster 3 is formed by the TC, E. coli, DO, and ELEV parameters, which have high loading and a positive score; that is, they are of great importance in the composition of the PCs, have dissimilar values between the samples, and are positively correlated. The cluster also reflects the relationship between DO and elevation. Dissolved oxygen decreases as elevation increases because of the decreasing relative pressure (Garcia 2015; Jacobsen et al. 2003). Barakat et al. (2016) observed that E. coli, P, BOD, and N parameters, which are organic and nutrient variables, may be associated with the influence of domestic, industrial, and livestock operations.

Cluster 4 was formed by the TN, P, EC, and pH parameters, which are related to agricultural, domestic, and industrial waste carrying a considerable load of these nutrients. The transformation of these nutrients in the aquatic environment, together with the presence of dissolved salts, alters pH conditions. In a study identifying water quality conditions, Sharif et al. (2015) also formed clusters from the DO, pH, and TN parameters, as well as from EC and micronutrients such as Na, Mg, and K, which are characteristic of agricultural regions. Parameters such as TS, Tu, and pH are attributed to water properties and the natural weathering of basins, while EC, BOD, and TN are indicators of contamination sources and anthropogenic inputs (Barakat et al. 2016).

Analysis of the PCs in the sample

When water quality behavior was examined by contributing region (rural, industrial, and urban), by season, and by WQI classification (Acceptable, Bad, or Good), the organic load was the characteristic parameter of samples from industrial areas, nutrients of those from rural areas, and pathogenic organisms of those from urban areas. More samples classified as Acceptable and Bad occurred during the winter and spring months and in rural and industrial areas.

In the urban area of contribution (Fig. 5), positive PCs have a high degree of influence: the ELEV, DO, TC, and E. coli parameters contributed the most to distinguishing the samples, along with TS, which had weak loading. This area has collection points near rainwater, effluent, and spring release points. Von Sperling (2014) states that the main sources of pollution in urban areas are related to wastewater and rainwater, with the main representative parameters being TS, BOD, TN, P, and TC, possibly leading to polluting effects such as sludge deposition, pathogens, and waterborne diseases.

Fig. 5 PCA in urban, rural, and industrial areas. Source: The authors

In the rural regions (Fig. 5), the positive and negative PCs of DO and ELEV, TN, and P, as well as lower loading parameters such as TS and EC, helped to distinguish samples from this area. The collection points are close to domestic dumps but, to a large extent, lie within native vegetation. According to Sopper (1975) and Pinto et al. (2012), areas with natural vegetation cover are important for maintaining good water quality, as well as for protecting against erosion and excessive leaching of nutrients from the soil. In rural areas, the main sources of pollution are related to rainwater, whose main constituent is non-biodegradable organic matter (Von Sperling 2014).

In the areas with industrial dominance (Fig. 5), the negative PCs of P, T, and TN, the positive PC of pH, and the EC and BOD parameters were responsible for distinguishing the samples. The collection points are located near industrial discharges from leather and tannery, food, and beverage processes. Effluents from leather and tannery have high concentrations of organic matter and numerous toxic chemicals (Pascoal et al. 2007; Zupancic and Jemec 2010). The food industry is characterized by high concentrations of organic matter and low biodegradability due to the use of several additives such as dyes (Huang et al. 2002; Nigam et al. 2000). Effluents from beverage industries have high organic loads (BOD, COD, and total solids) and an alkaline pH (Sereno Filho et al. 2013).

In distinguishing the samples by qualitative classification (Fig. 6a), the positive PCs of DO, TC, E. coli, and ELEV and the negative PC of T contributed to the discrimination. The differentiation of the Acceptable and Bad classes is influenced by negative PCs such as P, TN, and EC, and by the positive PC of pH. Many of the points in the Acceptable and Bad classes belong to rural and industrial areas of contribution. Singh et al. (2005), assessing water quality and the distribution of pollution sources, concluded that soil weathering, the discharge of municipal and industrial effluents, and leaching from solid waste disposal sites were among the main sources responsible for the deterioration of surface water quality.

Fig. 6 PCA classification for quality and seasons. Source: The authors

Regarding the seasons (Fig. 6b), winter (dry season) and spring were influenced by the DO, pH, EC, TC, E. coli, TN, and P parameters, which belong to PCs with higher loading, and by TS, which has a lower loading. Similarly, in the study by Offem et al. (2011) on the effect of the seasons on water quality, the DO, TS, and pH parameters were influential in the dry season. Summer (rainy season) and autumn were influenced by PCs with lower loadings, such as Tu, BOD, T, and TS. Thus, winter and spring are mostly linked to parameters of great importance in the composition of the PCs and show heterogeneity in their sample values. In summer and autumn, the situation is the opposite: these seasons are linked to parameters of small importance in the PC clustering, and their sample values are homogeneous. Vasco et al. (2011) also observed variations between rainy and dry periods in relation to BOD, temperature, total solids, and turbidity parameters.

It was also verified that the winter and spring samples included a greater number of Acceptable and Bad classifications. Gonçalves and Rocha (2002) analyzed water quality indicators and land use patterns in river basins and also found that surface water quality was lower in low-rainfall periods. At the beginning of the rainy season, water begins to flow through the soil profile, mobilizing accumulated nitrate, which then reaches watercourses (Holloway and Dahlgren 2001). In the other seasons, the Good classification predominates.

Conclusion

According to the physicochemical and microbiological data obtained, 84% of the E. coli results, 64% of the biochemical oxygen demand results, and 53% of the total phosphorus results exceeded the levels allowed by Brazilian legislation.

This indicates that the water from the streams is affected by sources with high organic load, nutrients, and pathogenic organisms. The main parameters that impacted the water quality index were dissolved oxygen, elevation, and total phosphorus. The PCA components explained 86.25% of the variance in the data set, using 11 physicochemical parameters and 1 topographic parameter. In the cluster analysis, of the 12 parameters used in the study, the following 7 were most significant: dissolved oxygen, elevation, total coliforms, E. coli, total phosphorus, total nitrogen, and temperature. In the samples from the industrial region, the organic load (BOD) was the characteristic parameter; in the rural region, the nutrients P and TN were the characteristic parameters; and in the urban region, pathogenic organisms (total coliforms and E. coli) were the characteristic parameters.

The winter and spring seasons were influenced by the DO, pH, EC, TS, TC, E. coli, TN, and P parameters, characterized by their variability and their importance in the grouping of the principal components. Summer and autumn were influenced by parameters with uniform sample values and little importance in the composition of the principal components, such as Tu, BOD, T, and TS. In the classification of the samples, Good quality predominated; however, the samples classified as Acceptable and Bad occurred in winter and spring in rural and industrial regions. Water pollution is therefore related to land use and occupation, population density, and a lack of sanitation, and measures are needed to reduce pollution and revitalize the water downstream, given the better quality in the higher areas of the basin and the worse quality in the lower areas.

The use of multivariate analysis and correlation made it possible to identify the parameters that negatively and positively impact water quality, as well as the main contamination sources, in different seasons of the year and under different land uses. The approach can serve as an effective tool, even with data gaps, for decision-making in public policies to improve water quality.