Introduction

The monitoring of surface water quality is of extreme importance for understanding the current situation of water resources and the main changes that have occurred over time. It also allows for detecting the temporal and spatial trends of watersheds. Systematic and periodic monitoring of water resource conditions supports the planning of interventions for improvement, the identification of clandestine discharges, inspection, environmental licensing, and the formulation of environmental policies (Finotti et al. 2009; Bilgin and Konanç 2016).

However, the large number of water quality parameters sampled at multiple monitoring stations results in a high cost to the environmental agency. Therefore, strategies are needed to guide the optimization of the monitoring network, such as the identification of relevant parameters and stations, which must be maintained. In this sense, several methodologies are used for the evaluation and resizing of surface water quality monitoring networks (Dong et al. 2015) such as modeling (Paliwal et al. 2007), numerical models and algorithms (Chen et al. 2012; Lee et al. 2014; Park et al. 2006; Strobl and Robillard 2008), fuzzy logic (Huang et al. 2010; Xu et al. 2012; Wu et al. 2010; Zhao et al. 2011), artificial neural networks (Khalil et al. 2011), entropy method (Almeida 2013; Markus et al. 2003; Soares 2001), and proposal of new methods (Telci et al. 2009; Wang et al. 2006).

The relevance of this work is in the use of multivariate statistical techniques for the proposed adaptation of the water quality monitoring network of a river basin in Brazil. These techniques have been applied to water quality monitoring networks throughout the world with satisfactory results (Gamble and Babbar-Sebens 2012; Shrestha and Kazama 2006; Vieira et al. 2012; Zhang et al. 2009).

The water quality monitoring network of the Paraopeba river basin (Minas Gerais, Brazil), the focus of this study, is located in an important economic center of the state of Minas Gerais. Over the last decade, some water quality parameters, such as the fecal contamination indicator, phosphorus, and manganese, have frequently violated the values permitted by local legislation (Costa et al. 2017). Thus, the several economic activities developed in this area, together with the intense urbanization, may be responsible for different impacts on the water resources.

The monitoring network of the Paraopeba river basin has been active since 1997, and its main objectives are (1) to know and to evaluate the surface water quality, disseminating it to users and guiding the establishment of quality goals; (2) to provide subsidies for the planning of water resources management, and (3) to verify the effectiveness of environmental control actions implemented and to propose priorities for action. In almost 20 years of the monitoring program, no evaluation or resizing tests were conducted for this basin. The high cost of the monitoring program is an obstacle to its maintenance and expansion. Thus, optimization of the network may reduce costs by eliminating unnecessary sampling or less important parameters and equivalent monitoring points. Furthermore, some resources may be applied to other areas in the basin or to the measurement of parameters of greatest need. Thus, the results of this work contribute to the planning and integrated management of water resources in the basin and can serve as a guide for work to adapt networks in other river basins.

Materials and methods

Study area

The Paraopeba river basin is an important sub-basin of the São Francisco river basin. The São Francisco river basin is one of the most important in Brazil because of its large water contribution to six states and the Federal District, and the São Francisco is one of the largest rivers in Brazil. The Upper São Francisco region, in which the Paraopeba river basin is located, is the most degraded region in relation to surface water quality. The drainage area of the Paraopeba river basin is 12,054 km², which corresponds to 2.5% of the state of Minas Gerais, and it is slightly larger than the land area of countries like Jamaica and Qatar. The Paraopeba river has a length of 510 km, and almost 2.5 million people live in the basin, which covers 48 municipalities (Fig. 1).

Fig. 1 Localization of Paraopeba river basin, Brazil

The uses of water in the basin range from the dilution of domestic and industrial effluents to uses that require higher water quality, such as water supply, fish farming, and recreation. Land use and occupation in the Paraopeba river basin are quite diverse, which has generated large impacts on surface waters due to different anthropogenic pressures. This basin also includes one of the most important economic centers of Minas Gerais, which forms the second industrial pole of the country. There are also activities such as cattle raising, iron and steel, food and automobile industries, mining, sand and clay extraction, extensive animal husbandry, and eucalyptus monocultures.

Monitoring of the surface water quality in the Paraopeba river basin has been carried out since 1997 by the Minas Gerais Water Management Institute (IGAM), through the Project “Minas Waters”. From the implementation of the network in 1997 until 2013, 1509 water samples were collected, and 62 water quality parameters were monitored at 33 monitoring sites, with quarterly and semi-annual sampling frequencies, yielding a total of 59,860 valid data points. Because the database is extensive and complex, a prior selection of data was required for the application of multivariate statistics.

Selection of the data for application of multivariate statistical techniques

Preparation of the database for application of statistical techniques consisted of the selection of parameters and monitoring stations while minimizing the missing data. Monitoring sites were selected that presented continuous quarterly sampling. Parameters with more than 10% missing data were disregarded. Parameters whose environmental and sanitary significance was low for the interpretation of water quality were also disregarded. Parameters with more than 90% censored data (concentrations below the limit of detection of the analytical method of measurement) were not included, as they did not express variation in water quality during the monitoring period. Therefore, these parameters were highlighted as the least relevant for monitoring in this basin. The monitoring database selected covered 720 samples of 19 parameters in 30 monitoring sites, resulting in an analysis of 13,560 valid data points from the period of 2008 to 2013. The selected parameters included ammonia-nitrogen (N-NH4+), chemical oxygen demand (COD), chlorophyll-a (Cl-a), dissolved iron (Fe), dissolved oxygen (DO), electrical conductivity (EC), 5-day biochemical oxygen demand (BOD5), nitrate nitrogen (N-NO3-), organic nitrogen (Norg), pH, temperature (TEMP), thermotolerant coliforms (T. colif.), total chloride (Cl-), total manganese (Mn), total phosphorus (P), total solids (TS), total suspended solids (TSS), true color (COL), and turbidity (TURB). The descriptive statistics (minimum, maximum, median, mean, and standard deviation) of the water quality parameters were calculated for each monitoring site. Distribution of the data was verified by applying the Shapiro-Wilk normality test at a 5% significance level.
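The screening rules described above (more than 10% missing data, more than 90% censored data) can be illustrated with a minimal sketch. The function name, the data layout, the toy values, and the convention that censored results are stored at the detection limit are all assumptions made for illustration; the text does not describe how the database was actually processed.

```python
# Thresholds quoted in the text; everything else here is hypothetical.
MAX_MISSING = 0.10    # drop parameters with more than 10% missing data
MAX_CENSORED = 0.90   # drop parameters with more than 90% censored data

def screen_parameters(records, detection_limits):
    """Split parameters into kept and dropped ones.

    records: dict of parameter name -> list of values (None marks a missing value).
    detection_limits: dict of parameter name -> limit of detection.
    Assumption: a censored result is stored as a value at or below the limit.
    """
    kept, dropped = [], {}
    for name, values in records.items():
        measured = [v for v in values if v is not None]
        missing_fraction = (len(values) - len(measured)) / len(values)
        censored = sum(v <= detection_limits[name] for v in measured)
        if missing_fraction > MAX_MISSING:
            dropped[name] = "missing"
        elif measured and censored / len(measured) > MAX_CENSORED:
            dropped[name] = "censored"
        else:
            kept.append(name)
    return kept, dropped

# Invented toy data for three parameters.
records = {
    "DO": [6.1, 5.8, None, 7.0, 6.5, 6.9, 7.2, 6.0, 5.5, 6.3],    # 10% missing
    "Cd": [0.0005] * 10,                                           # all censored
    "Cl-a": [1.2, None, None, 3.4, 2.2, 1.9, 2.8, 3.0, 2.5, 2.1],  # 20% missing
}
detection_limits = {"DO": 0.5, "Cd": 0.0005, "Cl-a": 0.1}

kept, dropped = screen_parameters(records, detection_limits)
print(kept, dropped)
```

Whether the censored fraction is computed over all samples or only over measured ones is not stated in the text; the sketch uses measured samples.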

Data treatment and multivariate statistical techniques

Cluster analysis

Cluster analysis (CA) was used to group the 30 monitoring sites according to the spatial similarity of water quality. The technique forms groups with high homogeneity within each group and high heterogeneity between the groups (Lattin et al. 2011). Hierarchical agglomerative clustering was used, with the Euclidean distance as the measure of similarity. Prior to the analysis, the data were standardized. This evaluation sought to identify areas of the river basin where the water quality is similar; this statistical technique is commonly used to assess the water quality in watersheds throughout the world (Zhang et al. 2009; Gamble and Babbar-Sebens 2012; Vieira et al. 2012; Guigues et al. 2013; Bhuiyan et al. 2016; Bilgin and Konanç 2016). In order to identify significant differences between the concentrations of water quality parameters for the groups formed by the CA, non-parametric statistical tests were applied (Mann-Whitney for two groups or Kruskal-Wallis for more than two groups), followed by the multiple comparison test (when applicable), at a significance level (α) of 5%. In this way, it was possible to identify which parameters were responsible for differentiating the groups formed by the CA.
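A minimal sketch of the standardization and hierarchical agglomerative clustering steps is given below, using SciPy. The Ward linkage rule is an assumption (the text specifies only the Euclidean distance), and the site-by-parameter matrix is invented toy data, not values from the monitoring database.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def standardize(x):
    """Column-wise z-scores, so each parameter has mean 0 and unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

# Hypothetical matrix: rows are sites, columns are parameters (e.g. DO, EC, TURB).
sites = np.array([
    [7.1,  60.0,  12.0],
    [6.9,  65.0,  15.0],
    [7.3,  58.0,  10.0],
    [2.0, 480.0,  90.0],
    [1.5, 510.0, 110.0],
    [2.4, 450.0,  95.0],
])

z = standardize(sites)

# Agglomerative clustering on Euclidean distances.
# method="ward" is an assumption; the linkage rule is not stated in the text.
tree = linkage(z, method="ward", metric="euclidean")

# Each row of `tree` is one merge: the two clusters joined, the merge height,
# and the number of sites in the newly formed cluster.
print(tree)
```

Standardizing first keeps parameters with large numeric ranges, such as conductivity, from dominating the Euclidean distance.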

Principal component analysis

Principal component analysis (PCA) is a complementary analysis that seeks to reorganize and display multivariate data in order to reduce their dimensionality, facilitating visualization and analysis (Mingoti 2005; Abdi and Williams 2010). This reduction is possible when there is substantial redundancy in the data, so that a reduced number of components can often explain most of the information (İşçen et al. 2009). In this study, the PCA was used to reduce the complexity of the monitoring database, identifying the main parameters responsible for explaining the greater variability of water quality and the different sources of pollution acting in each group formed by the CA.
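To make the idea of redundancy concrete, the following sketch runs a PCA on the correlation matrix of a small synthetic data set with NumPy. It is an illustration under stated assumptions, not the software or settings used in the study: the function name and the random data are hypothetical, and working on the correlation matrix is equivalent to running the PCA on standardized variables.

```python
import numpy as np

def pca_loadings(data):
    """PCA on the correlation matrix of `data` (rows: samples, columns: variables).

    Returns the eigenvalues in descending order, the fraction of variance
    explained by each component, and the loadings (correlations between the
    original variables and the components).
    """
    x = np.asarray(data, dtype=float)
    corr = np.corrcoef(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return eigvals, explained, loadings

# Synthetic data: two strongly related variables plus one independent variable.
rng = np.random.default_rng(0)
base = rng.normal(size=(40, 1))
data = np.hstack([
    base + 0.1 * rng.normal(size=(40, 1)),
    2.0 * base + 0.1 * rng.normal(size=(40, 1)),
    rng.normal(size=(40, 1)),
])

eigvals, explained, loadings = pca_loadings(data)
print(np.round(explained, 3))   # the first component captures the shared signal
print(np.round(loadings, 2))
```

Because the first two variables carry nearly the same information, a single component accounts for most of their joint variance, which is the kind of redundancy the PCA exploits.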

Evaluation and proposal for optimization of the monitoring network

The violation percentage of the water quality parameters was calculated in relation to the standards set in COPAM (State Environmental Policy Council) and CERH (State Water Resources Council) Normative Deliberation 01/2008 (MINAS GERAIS 2008), legislation that provides environmental guidelines on water bodies in the state of Minas Gerais, Brazil. The water quality parameters that presented 50% or more violation in the analyzed period, called “critical parameters,” were considered the most important parameters, as well as the parameters which explained the greater variability of water quality, as indicated by the PCA. The CA allowed for identifying differences in water quality monitoring sites, guiding their selection in the monitoring network. The parameters which had more than 90% censored data were considered the least relevant for the monitoring network.
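The violation percentage and the 50% rule for critical parameters can be sketched as follows. The function names, the limits, and the sample values are hypothetical; they are not the standards of Normative Deliberation 01/2008, which should be taken from the legislation itself.

```python
def violation_percentage(values, limit, kind="max"):
    """Percentage of samples that violate a legal limit.

    kind="max": a sample violates when it exceeds the limit (e.g. total phosphorus).
    kind="min": a sample violates when it falls below the limit (e.g. dissolved oxygen).
    Missing samples (None) are ignored.
    """
    measured = [v for v in values if v is not None]
    if not measured:
        return 0.0
    if kind == "max":
        violations = sum(v > limit for v in measured)
    else:
        violations = sum(v < limit for v in measured)
    return 100.0 * violations / len(measured)

def is_critical(values, limit, kind="max", threshold=50.0):
    """A parameter is 'critical' when 50% or more of its samples violate the limit."""
    return violation_percentage(values, limit, kind) >= threshold

# Invented samples and limits, for illustration only.
phosphorus = [0.05, 0.20, 0.15, 0.08]          # hypothetical maximum limit: 0.1
dissolved_oxygen = [6.0, 4.0, 5.5, 7.0]        # hypothetical minimum limit: 5.0

print(violation_percentage(phosphorus, 0.1))               # 50.0
print(is_critical(phosphorus, 0.1))                        # True
print(violation_percentage(dissolved_oxygen, 5.0, "min"))  # 25.0
print(is_critical(dissolved_oxygen, 5.0, "min"))           # False
```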

Results and discussion

Descriptive statistics

The results of descriptive statistics analysis are presented in Table 1. It is possible to note a large amplitude for most parameters, resulting in high values of standard deviation and skewness. This asymmetry is common in environmental data and it was confirmed by the Shapiro-Wilk test (p < 0.05). In addition, this variation reinforces the importance of the spatial analysis of the concentration of these parameters in the Paraopeba river basin.

Table 1 Descriptive statistics of water quality parameters in the Paraopeba river basin, from 2008 to 2013
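The descriptive statistics and the normality check can be reproduced for any single parameter with a short sketch such as the one below, which uses SciPy's Shapiro-Wilk test. The function name and the right-skewed toy series are assumptions for illustration; they are not data from Table 1.

```python
import numpy as np
from scipy import stats

def describe_parameter(values, alpha=0.05):
    """Summary statistics plus a Shapiro-Wilk normality check for one parameter."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]                      # ignore missing values
    _, p_value = stats.shapiro(x)
    return {
        "min": float(x.min()),
        "max": float(x.max()),
        "median": float(np.median(x)),
        "mean": float(x.mean()),
        "std": float(x.std(ddof=1)),
        "skewness": float(stats.skew(x)),
        "p_value": float(p_value),
        "normal": bool(p_value >= alpha),    # False means normality is rejected
    }

# Invented, strongly right-skewed concentrations (a few very high values).
turbidity = [1, 1, 2, 2, 2, 3, 3, 4, 5, 8, 15, 40, 120, 300]

summary = describe_parameter(turbidity)
print(summary)
```

A large gap between mean and median together with positive skewness is the pattern described above, and it is the reason non-parametric tests are used in the group comparisons.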

Cluster analysis and the Kruskal-Wallis test

For grouping of the 30 surface water quality monitoring sites of the Paraopeba river basin, the Euclidean distance of 21.27 was considered, which resulted in the formation of eight groups as shown in the dendrogram in Fig. 2.

Fig. 2 Dendrogram showing the clustering of sampling sites in the Paraopeba river basin
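Cutting a dendrogram at a chosen Euclidean distance, as done above at 21.27 to obtain eight groups, can be sketched with SciPy's `fcluster`. The toy data, the Ward linkage rule, and the cut height of 1.0 are assumptions chosen for this small example; they do not reproduce the study's dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cut_dendrogram(tree, distance):
    """Assign a flat group label to every site by cutting the tree at `distance`."""
    return fcluster(tree, t=distance, criterion="distance")

# Hypothetical standardized values for six sites on two parameters:
# three sites with low values and three with high values.
z = np.array([
    [-1.0, -1.1], [-0.9, -1.0], [-1.1, -0.9],
    [ 1.0,  1.1], [ 0.9,  1.0], [ 1.1,  0.9],
])

# Ward linkage is an assumption; the text only specifies Euclidean distance.
tree = linkage(z, method="ward", metric="euclidean")

# All merges inside each trio happen well below a height of 1.0 and the final
# merge happens well above it, so this cut separates the two trios.
labels = cut_dendrogram(tree, distance=1.0)
print(labels)
```

Lowering the cut height produces more, smaller groups; raising it merges groups, so the chosen distance directly controls how many groups the network is divided into.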

The dendrogram allows for identifying the monitoring sites that present high homogeneity in water quality (monitoring sites in the same group) and those with high heterogeneity (monitoring sites belonging to different groups). The monitoring sites of groups 1, 2, 3, 4, 5, and 6 (BP073, BP071, BP098, BP084, BP080, BP086, and BP069) can be considered the most relevant in the Paraopeba river basin because they have unique characteristics, which are different from most other monitoring sites. Not coincidentally, these sites represent the bodies of water that obtained the highest frequencies of “very bad” and “bad” on the water quality index (WQI) among the monitoring sites of the Paraopeba river basin in the period from 1997 to 2012 (IGAM 2013). Most of these monitoring sites are located in the industrial pole region, in the city of Betim, where water quality is mainly affected by domestic sewage and industrial effluents. The groups formed by the CA are identified geographically in Fig. 3.

Fig. 3 Identification of the groups of monitoring sites in the Paraopeba river basin formed by the cluster analysis

The differences between the groups formed by the CA are confirmed by the results of the Kruskal-Wallis non-parametric test, whose objective was to find significant differences in the concentrations of the parameters analyzed. The results of the multiple comparison test also confirm that monitoring stations BP073 and BP071 (groups 1 and 2, respectively) have the most degraded surface water quality in the Paraopeba river basin. These stations have significantly higher concentrations of Cl-, Cl-a, EC, COD, P, N-NO3-, Norg, and TS and the lowest concentrations of DO in relation to the other stations (Fig. 4). Concerning domestic sewage contamination, the concentration of T. colif. was significantly lower for groups 3, 7, and 8. Most of the T. colif. observations in group 1 were concentrated at the detection limit of the analytical method, which is 160,000 MPN.100 mL−1. This indicates that BP073 is highly contaminated by domestic sewage, and it presented the highest frequency (100%) of “very bad” and “bad” classifications of the water quality index (WQI) in the Paraopeba river basin from 1997 to 2012 (IGAM 2013).

Fig. 4 Box-plot of the concentrations of the water quality parameters and results of the Kruskal-Wallis test and the multiple comparisons test (p < 0.05), considering the groups formed by the cluster analysis
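The group comparison can be sketched as a Kruskal-Wallis test followed by pairwise comparisons. The text does not name the multiple comparison procedure, so the pairwise Mann-Whitney tests with a Bonferroni correction used here are an assumption, as are the function name, the group labels, and the invented concentrations.

```python
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

def compare_groups(groups, alpha=0.05):
    """Kruskal-Wallis test across groups, then pairwise follow-up comparisons.

    groups: dict of group label -> list of concentrations.
    Returns the H statistic, the overall p-value, and a dict of
    Bonferroni-adjusted pairwise p-values (empty if the overall test
    is not significant).
    """
    h_stat, p_value = kruskal(*groups.values())
    pairwise = {}
    if p_value < alpha:
        pairs = list(combinations(groups, 2))
        for a, b in pairs:
            _, p = mannwhitneyu(groups[a], groups[b], alternative="two-sided")
            # Bonferroni correction (assumed procedure): multiply by the number of pairs.
            pairwise[(a, b)] = min(1.0, p * len(pairs))
    return h_stat, p_value, pairwise

# Invented concentrations: G7 and G8 overlap, G1 is far higher.
groups = {
    "G7": [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0],
    "G8": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0],
    "G1": [101.0, 102.0, 103.0, 104.0, 105.0, 106.0, 107.0, 108.0],
}

h_stat, p_value, pairwise = compare_groups(groups)
print(round(h_stat, 2), p_value)
for pair, p in pairwise.items():
    print(pair, p)
```

Running the follow-up comparisons only after a significant overall test mirrors the "when applicable" wording of the methods.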

From this analysis, it can be inferred that, if there is a need to interrupt or permanently stop the water monitoring at some sites, these sites should be chosen from group 7 or 8, because these groups contain a larger number of monitoring sites with similar water quality and present the best water quality in the basin.

Principal component analysis

The results of the PCA explained 63.1 to 82.9% of the surface water quality variability of each group formed by the CA, retaining five principal components (PC), considering factors with eigenvalues greater than one (Hair et al. 2005). The correlations between variables and the PC of each group formed by the CA are presented in Table 2.

Table 2 Principal components analysis for the monitoring sites in the Paraopeba river basin, considering groups formed by the cluster analysis
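The retention rule mentioned above (keeping components with eigenvalues greater than one) can be illustrated with a few lines of NumPy. The eigenvalues below are invented for a six-variable example and are not taken from Table 2; the function name is hypothetical.

```python
import numpy as np

def kaiser_retained(eigenvalues):
    """Count components with eigenvalue > 1 and the share of variance they explain."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    k = int(np.sum(ev > 1.0))
    cumulative = float(ev[:k].sum() / ev.sum())
    return k, cumulative

# Hypothetical eigenvalues of a 6-variable correlation matrix (they sum to 6).
k, cumulative = kaiser_retained([2.7, 1.5, 0.9, 0.5, 0.3, 0.1])
print(k, round(cumulative, 3))   # 2 components explaining 70% of the variance
```

An eigenvalue above one means the component explains more variance than a single standardized variable, which is the rationale for the cutoff.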

It can be observed that groups 1, 2, 3, 4, 5, and 6 are similar in relation to the variables that best correlated with the first factors. The variables that best explain the variability of the water quality of these monitoring sites are related to the presence of organic matter, mainly from domestic sewage and solids in the water body. The monitoring site BP073 (group 1) receives domestic sewage and industrial effluents from the city of Betim, a large industrial region (IGAM 2014). Station BP071 (group 2) is located just downstream of monitoring site BP073, so it is also affected by the water quality of the previous site. EC is also highly correlated with the first factors in these groups, indicating a high concentration of dissolved solids in the water body. Groups 1, 2, and 3 present a higher concentration of total solids in relation to the other monitoring sites as presented by the non-parametric Kruskal-Wallis test results.

The monitoring sites belonging to group 4 (BP084 and BP080) are located on the same tributary in the beginning of the river basin. In addition to the parameters related to organic matter and solids, a high correlation of dissolved iron is highlighted in this group, which can be explained by its proximity to mining activity. The BP084 site is the only one in the basin in which the iron concentration is above the limit allowed by the law in more than half of the samples.

The PC1 of group 5 (BP086) also highlights a correlation with total manganese. According to the Kruskal-Wallis non-parametric test, this group presents a higher concentration of manganese in relation to the other monitoring stations in the basin, except for group 4, which presents a higher concentration than group 5. Total manganese may be the result of mining activities, but it is also important to consider that the background concentration of this element in the river is not known and that its concentration can change considerably during the rainy seasons. The BP069 site constitutes group 6, and the high negative correlation with dissolved oxygen indicates the high degree of deterioration of the watercourse. This monitoring site receives the impacts from most of the sewage from the cities of Juatuba and Mateus Leme, where at most 65% of the sewage is collected (IGAM 2013).

Group 7, consisting of 11 monitoring sites, has a strong correlation with manganese and parameters related to the presence of dissolved and suspended solids in water (COL, COD, TSS, TS, and TURB), indicating that the main source of pollution in these areas is the transport of solids to the river, mainly from non-biodegradable organic materials. Note that in this group the concentration of biodegradable organic matter is lower than in the groups previously analyzed. These monitoring sites are located predominantly on the main river channel, in the lower course, downstream from the region where the main contaminations of domestic sewage and industrial effluents are introduced. Therefore, this better condition of water quality is probably due to dilution of the contaminants and pollutants with the increase of the river flow.

Correlations similar to those identified for group 7 were found in the characterization of the water quality of the 12 monitoring sites in group 8. However, for most of the analyzed parameters, the concentrations in group 7 are significantly higher than those in group 8. Therefore, the stations of group 8 can be considered those with the best water quality in the basin. These monitoring sites are located mainly in the upper course of the river basin and in tributaries of the main river.

Therefore, it can be observed that the association of the results of the CA, the PCA, and the non-parametric Kruskal-Wallis test can lead to an understanding of the sources of pollution in different regions of the river basin, as well as to the identification of the areas most critical to water quality. These techniques allow an in-depth study of the surface water quality of the river basin, orienting studies on the resizing of water quality monitoring networks.

Analysis of violation of the water quality parameters with regard to the standards set in ND COPAM/CERH-MG 01/08 and analysis of censored data

The concentrations of water quality parameters measured in the monitoring program were compared to the standards set in ND COPAM/CERH-MG 01/2008 (Table 3). It was verified that T. colif. was a critical parameter at 82% of the monitoring sites analyzed. That is, the discharge of untreated domestic sewage into the water bodies is still the main problem with regard to water quality in the Paraopeba river basin. It is important to emphasize that this fact does not necessarily result from a lack of investment in sewage treatment in this period, but from the fact that the investments made were not sufficient to substantially reduce the sewage volume. Additionally, the type of treatment employed may not be effective in recovering the water quality of the basin. Total manganese (critical at 67% of the monitoring stations) and total phosphorus (critical at 30% of the monitoring stations) also stand out. These results coincide with those obtained by the PCA.

Analyzing the number of critical parameters in the monitoring sites of the Paraopeba river basin, it is verified that the monitoring sites belonging to the first groups of the CA (1, 2, 3, 4, 5, and 6) were those that presented the most critical parameters, as shown in Table 4. Thus, the analysis of critical parameters coincides with the groups formed by the CA, demonstrating the efficiency of the multivariate technique for water quality analysis.

Table 3 Surface water quality standards set in Normative Deliberation COPAM/CERH-MG 01/2008

Optimizing the Paraopeba river basin surface water quality monitoring network

The parameters total chromium, total cadmium, total selenium, dissolved copper, and total boron have 90% or more censored data at more than 85% of the monitoring sites evaluated. Therefore, they can be considered the least relevant in monitoring the quality of surface waters in the Paraopeba river basin, and their sampling frequency may be reduced. BP092, BP094, and BP090 are the monitoring sites with the highest number of sampled parameters with 90% or more censored data (12, 13, or 14), considering the monitoring period from 2008 to 2013, indicating low variability of water quality. The results also suggest an evaluation of the maintenance of BP092 and BP094 in the monitoring network of the Paraopeba river basin. These monitoring sites are geographically close, have the lowest Euclidean clustering distance according to the CA, have a low number of critical parameters, and have a large amount of censored data. It is therefore suggested that the state agency (IGAM) evaluate the relevance of maintaining both stations in the monitoring network. Perhaps only one of them would be sufficient to evaluate the background concentration of some of the elements in the basin and to serve as a basis for comparing water quality with other monitoring sites more impacted by anthropogenic activities. A permanent stop to water monitoring at BP094 could therefore be indicated, because of its geographical proximity to BP092, the similarity of water quality at both stations, and the absence of violations of the standards in more than 13 years of monitoring at the BP094 site.

Table 4 Number of critical parameters for each of the monitoring sites in the Paraopeba river basin

Table 5 summarizes the results of the violation and censored data analyses for each of the 30 monitoring sites in the Paraopeba river basin, presenting the most important water quality parameters for the monitoring site (50% or greater violation of legal standards), “V”; the least relevant (90% or more of censored data), “C”; and those that did not have violations during the study period, “N”. For the elaboration of this table we considered all parameters sampled in the monitoring network (including parameters not considered in the multivariate analysis) that have legal standards in ND COPAM/CERH-MG 01/2008, totaling 29 water quality parameters.

Table 5 Summary table of the analysis of violations of the standards defined in ND COPAM/CERH-MG 01/08, for each monitoring site of the Paraopeba river basin
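The coding used in this summary table can be expressed as a small rule. The sketch below is an illustration under assumptions: the function name is hypothetical, and returning several flags when more than one condition holds is a choice made here, since the text does not state how overlapping conditions are reported.

```python
def table5_flags(violation_pct, censored_pct):
    """Return summary flags for one parameter at one monitoring site.

    "V": 50% or more of the samples violate the legal standard.
    "C": 90% or more of the samples are censored.
    "N": no violations during the study period.
    Assumption: flags are not mutually exclusive, so a list is returned.
    """
    flags = []
    if violation_pct >= 50.0:
        flags.append("V")
    if censored_pct >= 90.0:
        flags.append("C")
    if violation_pct == 0.0:
        flags.append("N")
    return flags

print(table5_flags(75.0, 0.0))    # ['V']
print(table5_flags(0.0, 95.0))    # ['C', 'N']
print(table5_flags(20.0, 10.0))   # []
```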

Conclusions

The optimization and resizing of a surface water quality monitoring network for a river basin should be done with caution and planning. This work showed that the association of multivariate statistical techniques with non-parametric tests and the analysis of violations of the legal standards allows for identifying monitoring sites with similar water quality, prioritizing the maintenance of monitoring stations in the monitoring network, and identifying the main parameters that explain the water quality variability and the main sources of pollution in different regions of the river basin. This association was efficient in generating a proposal for optimization and resizing of the monitoring network.

The parameters thermotolerant coliforms, total manganese, and total phosphorus were considered the most relevant for the characterization of water quality in the Paraopeba river basin. On the other hand, total chromium, total cadmium, total selenium, dissolved copper, and total boron were considered less relevant. Discharges of domestic sewage and industrial wastewater are the main sources of pollution responsible for the surface water quality deterioration in the basin. The monitoring site BP073 has shown the most intense degradation, while the BP094 site monitors the best water quality in the basin. It is suggested that IGAM assess the need to maintain BP094 in the monitoring network, since it is located geographically close to BP092, which has similar water quality, and both may be expressing the same information. In this way, the resources spent on monitoring BP094 could be allocated to areas of greater need in the basin. It is our wish that this work may guide the decision-making of public managers in the Paraopeba river basin, as well as work in other river basins around the world.