Introduction

The water quality of a water body can be influenced by several factors, which is why it shows great variability (Fritzsons et al. 2009; Singh et al. 2009; Soares et al. 2020). In natural environments, water quality can be influenced by climatic factors, weathering of rocks, and soil erosion. In basins under anthropogenic pressure, the effects of agricultural expansion and accelerated population and industrial growth are evident (Ajorlo et al. 2013; Dupas et al. 2015; Muangthong and Shrestha 2015). According to the National Water Agency (ANA 2013a), such factors change the levels of nutrients, sediments, toxins, and heavy metals, among others, causing serious damage to human health and the aquatic ecosystem.

Because water quality reflects the environmental conditions of the river basin, it is becoming increasingly necessary to diagnose and predict the impacts of certain actions. A qualitative monitoring program is the first step in establishing a reliable and representative water quality database (Simeonov et al. 2003; Shrestha and Kazama 2007). Such a database allows the detection of spatial and temporal variations in the variables and supports the implementation of water resources management instruments, such as the granting of permits for water use, charging for water use, and the framing of water bodies in classes of use (ANA 2013a).

In the Minas Gerais portion of the Doce River basin, qualitative monitoring began in 1997 through the Waters of Minas Project, under the responsibility of the Minas Gerais Water Management Institute (IGAM). Water quality is one of the main vulnerability aspects of the basin, since several determinants in the occurrence of specific and diffuse contaminations are observed, such as discharge of domestic wastewater without proper treatment, inadequate disposal of solid waste, high effluent generation, and inadequate land use (ECOPLAN-LUME 2010a).

In the current monitoring network, the Minas Gerais portion of the Doce River basin has 65 stations in operation. Most stations are sampled in four campaigns per year, at quarterly intervals: two complete and two partial. In the complete campaigns, carried out every 6 months, 56 water quality variables are analyzed, of which 51 are common to the group of stations. In the partial campaigns, carried out between the complete campaigns, 19 variables common to the set of stations and to the four monitoring campaigns are analyzed. For the stations located in the main channel of the Doce River, the campaigns are monthly (IGAM 2016).

In general, the monitoring campaigns analyze variables that make it possible to characterize the water quality and the degree of contamination of the water bodies. However, the spatiotemporal variability of water quality parameters and the difficulty of interpreting the extensive database generated make it difficult to control water pollution (Chowdhury and Al-Zahrani 2014; Soares et al. 2020). Therefore, given the large amount of information that the monitoring campaigns have been generating and the lack of specific studies on this subject for the Doce River basin, it is necessary to use statistical tools to analyze this database and identify the main variables that explain the variability of water quality, the main sources of pollution, and the best sampling frequency.

Among the methodologies available to interpret qualitative data sets, multivariate statistical techniques such as principal component analysis followed by factor analysis (FA/PCA) and cluster analysis (CA) have been widely used in recent years to support the management of water resources (Tanrıverdi et al. 2010; Zhang et al. 2011; Ajorlo et al. 2013; Yu et al. 2013; Sabino et al. 2014; Lopes et al. 2014; Chowdhury and Al-Zahrani 2014; Rocha and Costa 2015; Mohamed et al. 2015; Finkler et al. 2015; Ji et al. 2016; Rocha and Pereira 2016; Varekar et al. 2016; Zeinalzadeh and Rezaei 2017; Herojeet et al. 2017; Le et al. 2017; Calazans et al. 2018a, 2018b; Zhong et al. 2018). In water quality studies, FA/PCA uses the correlation structure among the variables analyzed to produce a small number of new independent variables that contain most of the information in the original dataset (Olsen et al. 2012). This makes it possible to relate water quality variables to their possible sources of pollution and to select the most important variables for characterizing water quality. CA makes it possible to identify the best sampling frequency, based on the similarity of the analyzed water quality data.

In addition to the multivariate techniques, other analyses can complement such studies, for instance, the calculation of the percentage of monitored samples that do not comply with the standards established by law (Sabino et al. 2014; Martins et al. 2017; Oliveira et al. 2017). Variables with a high degree of framing class violation may be considered indicative of deterioration of the water quality of the river basin.

In this context, statistical techniques were applied to a data set containing physical, chemical, and biological characteristics of water, with the objective of analyzing the qualitative monitoring network in the Minas Gerais portion of the Doce River basin. The analysis identifies the main variables to be maintained in the monitoring network, the possible sources of pollution, and the best sampling frequency, thus providing guidance to the management bodies for planning and managing water resources with the aim of improving water quality.

Materials and methods

Characterization of the study area

The study was carried out in the Minas Gerais state portion of the Doce River basin, which corresponds to 87% of the total area of approximately 82,427 km² (ANA 2013b). The Doce River originates in the state of Minas Gerais, in the Mantiqueira and Espinhaço Mountains, and its waters flow approximately 850 km until they reach the Atlantic Ocean, in the city of Linhares, Espírito Santo state (ECOPLAN-LUME 2010a). The basin is part of the Southeast Atlantic River Basin and is located between the parallels 17° 30′ 00″ and 21° 30′ 00″ S and the meridians 39° 30′ 00″ and 44° 00′ 00″ W.

In its entirety, the Doce River basin comprises 228 cities whose territories are totally or partially within the basin, 200 in Minas Gerais and 28 in the state of Espírito Santo (CBH-Doce 2016a). There are 209 city offices located in the territory of the basin, with a resident population of approximately 3.6 million (IBGE 2010). In the context of water quality, this population puts pressure on the basin through the precarious treatment of domestic sewage, one of its main problems. The negative impact on water quality is observed in some stretches of the basin’s rivers, especially in the Doce River tributaries, because in the main channel this impact is reduced by the higher river flow (ANA 2016).

The economic activity of the basin is highly diversified and includes agriculture (reforestation, traditional crops, coffee beans, sugar cane, farming); agribusiness (sugar and ethanol); mining (iron, gold, bauxite, precious stones, etc.); industry (pulp, steel, and dairy); trading and support services for industrial complexes; and electricity generation (ECOPLAN-LUME 2010a). The region has the largest steel mill complex in Latin America, which is associated with mining and reforestation companies (CBH-Doce 2016a).

In the state of Minas Gerais, the Doce River basin is subdivided into six Water Resources Management Units (UGRHs), which correspond to UGRH1 Piranga, UGRH2 Piracicaba, UGRH3 Santo Antônio, UGRH4 Suaçuí, UGRH5 Caratinga, and UGRH6 Manhuaçu (CBH-Doce 2016b). The UGRHs are characterized by physical, socio-cultural, economic, and political aspects (IGAM 2016). In Fig. 1, it is possible to observe the separation of the Doce River basin into UGRHs in the state of Minas Gerais.

Fig. 1 Geographic location of the Doce River basin and the separation into each UGRH

Within the economic and environmental context, the basin was the site of a major environmental crime in Brazil. On November 5, 2015, the Fundão tailings dam, operated by Samarco Mineração SA, collapsed. It was located in the district of Bento Rodrigues, municipality of Mariana, state of Minas Gerais. The dam, classified as Class III, with high environmental damage potential, was designed to receive and store the waste generated by iron ore beneficiation (IGAM 2017a). The dam contained 56.4 million m³ of tailings, of which 43 million m³ (approximately 80% of the total volume) were released into the environment. This material reached 668 km of rivers and streams of the Doce River basin, in the states of Minas Gerais and Espírito Santo (Carmo et al. 2017), resulting in several impacts on water resources and their uses, such as public supply, irrigation, industrial use, electric power generation, leisure, and fishing, as well as the destruction of permanent preservation areas and the silting and morphological alteration of water bodies (ANA 2016).

Database used

The water quality data used in the study came from the monitoring campaigns carried out by the “Waters of Minas Project”. The water quality analyses are performed by a laboratory accredited by the National Institute of Metrology, Quality, and Technology (INMETRO), which regularly participates in analytical quality control assessments and follows standardized methods for water and sewage analysis (APHA, AWWA, WEF 2012). The monitoring results are available on the IGAM website (IGAM 2018).

For the analysis, only the variables common to the set of stations were used, and the variable “air temperature” was removed at the outset, giving a total of 50 variables (Table 1).

Table 1 Total of 50 water quality variables used in the study

Although the campaigns have been carried out since 1997 and currently the network has 65 monitoring stations, the data used in the study are those from the data collections conducted in the period from 2010 to 2017 in 64 stations. Such action was taken due to the following factors: (i) there is no complete set of data that includes the majority of monitoring stations available in the pre-2010 period; (ii) multivariate analysis does not allow missing values in the dataset; and (iii) the station code RD011, located at UGRH1 Piranga, was recently implemented in 2016, thus providing a small database that made it impossible to use it in the analysis.

In order to carry out the analysis, the whole database with the 50 water quality variables was divided into three data sets: (i) partial campaigns, comprising the data of the 64 stations and the 18 water quality variables monitored in these campaigns once “air temperature” was removed; (ii) total (complete) campaigns, comprising the data of the 64 stations and the 50 water quality variables common to the set of stations; and (iii) monthly campaigns, comprising the data of the 12 stations located in the main channel of the Doce River and the 18 water quality variables monitored monthly, which are the same variables monitored in the partial campaigns.

It is noteworthy that, due to the database used, the results were partially affected by the collapse of the Fundão tailings dam in Mariana in 2015, since the IGAM historical series include variables sensitive to the impacts of the accident, such as turbidity, solids, total manganese, and dissolved iron. It is also worth noting that, of the 64 IGAM monitoring stations evaluated, only 13, among them the RD011 station, were affected by the collapse, corresponding to approximately 20% of the stations evaluated.

Analysis methods used

The identification and selection of the determining variables of the water quality variability of the Doce River was based on the application of two analyses: principal component analysis followed by factor analysis (FA/PCA) and the analysis of violation of the limits established by the current class of framing. The analysis of the best sampling frequency was performed using cluster analysis (CA). The statistical software XLSTAT® was used to perform all the analyses.

Factor analysis/principal component analysis

Factor analysis/principal component analysis (FA/PCA) was used to select the most significant water quality variables in the interpretation of the data set analyzed. Since FA/PCA does not allow missing values in the data set, the percentage of missing data was calculated for each water quality variable, and variables with more than 10% of missing data were disregarded (Calazans et al. 2018a). The remaining missing values were replaced by the mean value of the variable (Olsen et al. 2012).
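The screening and imputation rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the study itself used XLSTAT, the function name `screen_and_impute` is hypothetical, and the example values are invented, not taken from the IGAM database.

```python
from statistics import mean

MISSING_LIMIT = 0.10  # variables with more than 10% missing values are dropped


def screen_and_impute(table):
    """table maps a variable name to a list of values; None marks a missing value."""
    cleaned = {}
    for name, values in table.items():
        observed = [v for v in values if v is not None]
        missing_share = 1 - len(observed) / len(values)
        if missing_share > MISSING_LIMIT:
            continue  # too many gaps: the variable is left out of FA/PCA
        fill = mean(observed)  # remaining gaps receive the variable mean
        cleaned[name] = [fill if v is None else v for v in values]
    return cleaned


# Hypothetical records (12 samples per variable), not data from the study
example = {
    "turbidity": [12.0, None, 15.0, 18.0, 11.0, 14.0,
                  13.0, 16.0, 12.0, 19.0, 10.0, 14.0],
    "total_boron": [0.1, None, None, 0.2, None, 0.1,
                    0.1, 0.2, 0.1, 0.1, 0.2, 0.1],
}
result = screen_and_impute(example)
```

In this invented example, "turbidity" has 1 missing value out of 12 (about 8%) and is kept with the gap filled by the mean, while "total_boron" has 3 out of 12 (25%) and is dropped.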

The FA/PCA was carried out in two rounds. The first used only the water quality variables monitored in the partial campaigns, since they have a quarterly frequency and, consequently, better data representativeness. The second used the water quality variables monitored in the complete campaigns, which, although carried out only every 6 months, include a greater number of variables.

The Kaiser-Meyer-Olkin (KMO) and Bartlett’s sphericity tests were first performed to confirm the adequacy of FA/PCA to the water quality data. The KMO test measures the degree of correlation among the variables. Its value varies from 0 to 1, and values below 0.5 indicate that the application of FA/PCA is inappropriate. Bartlett’s test of sphericity evaluates whether the correlation matrix is an identity matrix, which would indicate that there is no correlation between the variables and that the factorial model is inappropriate (Muangthong and Shrestha 2015; Jung et al. 2016). FA/PCA was performed by decomposing the correlation matrix into its eigenvalues and eigenvectors. Spearman’s correlation (Spearman R coefficient) was used because the water quality variables are not normally distributed (Sabino et al. 2014; Winter et al. 2016), as verified with the Shapiro-Wilk normality test at a significance level of 5%.
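As an illustration of why a rank-based coefficient suits skewed water quality data, the following minimal sketch computes Spearman’s correlation as the Pearson correlation of average ranks. The function names and the sample values are assumptions made for this example; they do not come from the study, which relied on XLSTAT.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's coefficient: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values with one extreme observation in the second series
conductivity = [55.0, 80.0, 62.0, 120.0, 95.0]
dissolved_solids = [30.0, 51.0, 40.0, 400.0, 60.0]
rho = spearman(conductivity, dissolved_solids)
```

Because only the ordering of the values matters, the extreme observation in the second series does not distort the coefficient, which is the property that motivates the choice for non-normal data.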

In FA/PCA, PCA provides information on the most meaningful parameters, which describe the whole data set while allowing data reduction with minimum loss of original information (Helena et al. 2000; Shrestha and Kazama 2007). The principal component (PC) can be expressed as:

$$ {z}_{ij}={a}_{i1}{x}_{1j}+{a}_{i2}{x}_{2j}+{a}_{i3}{x}_{3j}+\dots +{a}_{im}{x}_{mj} $$
(1)

where “z” is the component score, “a” is the component loading, “x” is the measured value of the variable, “i” is the component number, “j” is the sample number, and “m” is the total number of variables.
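Equation 1 can be illustrated with a minimal sketch that extracts the first component of a small correlation matrix by power iteration and then computes one component score as the weighted sum of standardized values. Several points are assumptions of this example rather than statements about the study: the 2 × 2 matrix and the sample are invented, the normalized eigenvector entries are used as the coefficients “a”, and power iteration stands in for whatever decomposition routine XLSTAT applies.

```python
import math


def dominant_component(corr, iterations=200):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    p = len(corr)
    # Arbitrary non-uniform start; it must not be orthogonal to the dominant eigenvector
    v = [1.0 / (k + 1) for k in range(p)]
    for _ in range(iterations):
        w = [sum(corr[r][c] * v[c] for c in range(p)) for r in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector gives the eigenvalue
    eigenvalue = sum(v[r] * corr[r][c] * v[c] for r in range(p) for c in range(p))
    return eigenvalue, v


def component_score(coefficients, standardized_sample):
    """z_ij = a_i1 * x_1j + ... + a_im * x_mj for one component i and one sample j."""
    return sum(a * x for a, x in zip(coefficients, standardized_sample))


# Hypothetical correlation matrix of two variables and one standardized sample
corr = [[1.0, 0.8], [0.8, 1.0]]
eigenvalue, coefficients = dominant_component(corr)
score = component_score(coefficients, [1.0, 0.5])
```

For two variables with correlation r, the eigenvalues of the correlation matrix are 1 + r and 1 − r, so the first component here carries an eigenvalue of 1.8 out of a total variance of 2.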

FA follows PCA. The main purpose of FA is to reduce the contribution of less significant variables in order to further simplify the data structure coming from PCA. This purpose can be achieved by rotating the axes defined by PCA, according to well-established rules, and constructing new variables, also called factors (F). The PCs were subjected to varimax rotation, which is often used in water quality studies (Zhang et al. 2011; Guedes et al. 2012; Ajorlo et al. 2013; Rocha et al. 2014; Mohamed et al. 2015; Barakat et al. 2016; Villas-Boas et al. 2017), to generate the Fs. The final effect of rotating the factorial matrix is to redistribute the variance from the first factors to the later ones, with the objective of achieving a simpler and theoretically more significant factorial pattern (Hair Jr. et al. 2009). The FA can be expressed as:

$$ {z}_{ij}={a}_{f1}{f}_{1i}+{a}_{f2}{f}_{2i}+{a}_{f3}{f}_{3i}+\dots +{a}_{fm}{f}_{mi}+{e}_{fi} $$
(2)

where “z” is the measured variable, “a” is the factor loading, “f” is the factor score, “e” the residual term accounting for errors or other source of variation, “i” the sample number, and “m” the total number of factors.
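For reference, the quantity that a varimax rotation maximizes can be written as follows. This is the raw varimax criterion as it is usually stated in the literature, given here as a sketch; the text does not say whether the software applied Kaiser normalization of the loadings before rotation, so that detail is left open.

$$ V=\sum_{k=1}^{m}\left[\frac{1}{p}\sum_{j=1}^{p}{a}_{jk}^{4}-{\left(\frac{1}{p}\sum_{j=1}^{p}{a}_{jk}^{2}\right)}^{2}\right] $$

where “a” with subscripts “jk” is the loading of variable “j” on factor “k”, “p” is the number of variables, and “m” is the number of retained factors. Each term in brackets is the variance of the squared loadings within one factor, so maximizing “V” pushes the loadings of each factor toward values that are either large in absolute value or close to zero.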

In FA/PCA, the factors are extracted in order from the most to the least explanatory, and the number of factors extracted is always equal to the number of variables. However, only factors with an eigenvalue greater than 1 were considered, so that each retained factor explains more variance than a single variable (Hair Jr. et al. 2009). To select the variables that characterize the factors, the classification of factor loadings proposed by Liu et al. (2003) was adopted: strong (> 0.75), moderate (< 0.75 and > 0.50), and weak (< 0.50 and > 0.30). Variables with a factor loading ≥ 0.7 were selected, a threshold widely used by other authors (Chowdhury and Al-Zahrani 2014; Rocha and Pereira 2016; Shrestha and Kazama 2007).
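The retention and selection rules above can be sketched as follows. The eigenvalues and loadings are invented for illustration, the function names are hypothetical, and two details are assumptions of this sketch: loadings that fall exactly on a class boundary are placed in the lower class, and loadings of 0.30 or less are labeled "negligible", a label that is not part of the classification cited in the text.

```python
def retained(eigenvalues):
    """Keep only factors with an eigenvalue greater than 1."""
    return [ev for ev in eigenvalues if ev > 1.0]


def explained_share(eigenvalues):
    """Share of total variance explained by the retained factors."""
    return sum(retained(eigenvalues)) / sum(eigenvalues)


def loading_class(loading):
    """Classify a loading by absolute value (boundary handling is an assumption)."""
    size = abs(loading)
    if size > 0.75:
        return "strong"
    if size > 0.50:
        return "moderate"
    if size > 0.30:
        return "weak"
    return "negligible"  # label introduced for this sketch only


def select_variables(loadings, threshold=0.7):
    """Select variables whose largest absolute loading reaches the threshold."""
    return [name for name, row in loadings.items()
            if max(abs(a) for a in row) >= threshold]


# Hypothetical eigenvalues (four variables) and rotated loadings on two factors
eigenvalues = [2.4, 1.1, 0.3, 0.2]
loadings = {
    "turbidity": [0.91, 0.10],
    "pH": [0.20, -0.62],
    "BOD": [-0.15, 0.78],
}
chosen = select_variables(loadings)
```

With a correlation matrix, the eigenvalues sum to the number of variables, so the two retained factors in this example explain 3.5 / 4 of the total variance. The absolute value is used because a strongly negative loading is as informative as a strongly positive one.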

Framing class violation analysis

In addition to the FA/PCA, the percentage of framing class violation was also calculated for the water quality variables that have concentration limits established by the COPAM/CERH-MG Normative Resolution No. 01/2008 (Minas Gerais 2008), considering the water body framing in the location of the monitoring stations. In the Minas Gerais portion of the Doce River basin, only the Piracicaba River Basin has a framing approved by the State Council for Water Resources (CERH-MG), so the Class 2 framing was adopted for the other water bodies as approved by CNRH Resolution n° 91/2008.

In the selection of variables to be prioritized, a framing class violation percentage equal to or greater than 20% was used. According to the Integrated Water Resources Plan of the Doce River basin (PIRH-Doce), variables that reach this percentage are indicative of deterioration of water quality, and it is essential to keep them in the monitoring program (ECOPLAN-LUME 2010a).

All the monitored variables were considered in this analysis; the only preliminary step was to filter those with limits established in the legislation, since this is the only requirement for applying the method.
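The violation percentage and the 20% priority rule can be sketched as below. The sample values and the limits are placeholders chosen for illustration and are not the limits of the Normative Resolution; the flag `limit_is_minimum` is an assumption of this sketch to cover variables such as dissolved oxygen, for which a standard sets a lower rather than an upper bound.

```python
PRIORITY_CUTOFF = 20.0  # percentage of violations at or above which a variable is prioritized


def violation_percentage(values, limit, limit_is_minimum=False):
    """Percentage of samples outside the limit of the framing class."""
    if limit_is_minimum:
        count = sum(1 for v in values if v < limit)
    else:
        count = sum(1 for v in values if v > limit)
    return 100.0 * count / len(values)


# Hypothetical samples and placeholder limits (not the legal values)
coliforms = [800, 1200, 3000, 500, 950, 1500, 700, 400, 2000, 600]
coliform_share = violation_percentage(coliforms, limit=1000)

oxygen = [6.2, 5.5, 4.8, 7.0, 6.1]
oxygen_share = violation_percentage(oxygen, limit=5.0, limit_is_minimum=True)

priority = {
    "thermotolerant coliforms": coliform_share >= PRIORITY_CUTOFF,
    "dissolved oxygen": oxygen_share >= PRIORITY_CUTOFF,
}
```

In practice the limit would be looked up per station, since the applicable class depends on the framing of the water body where the station is located.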

Cluster analysis

Cluster analysis is a group of multivariate techniques whose primary purpose is to group objects based on the characteristics they possess (Shrestha and Kazama 2007). In this study, cluster analysis (CA) was used to evaluate data from the monthly campaigns conducted only in the main channel of the Doce River, aiming to gather the 12 months of the year into groups (clusters) according to the similarities of the water quality variables, so that the months within a group are similar to each other but different from those in other groups. Hierarchical clustering was applied to the normalized data set using Ward’s method, with the Euclidean distance as the measure of dissimilarity (linkage distance), as in several other studies (Zhang et al. 2011; Ajorlo et al. 2013; Muangthong and Shrestha 2015). The result of the CA made it possible to verify in which months the water quality is most similar and, therefore, to evaluate the monthly sampling frequency adopted by the IGAM and compare it with the quarterly frequency adopted for the partial campaigns.
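A minimal sketch of this kind of grouping is given below: each variable is standardized, and the two clusters whose merger causes the smallest increase in the within-cluster sum of squared Euclidean distances (Ward’s criterion) are joined repeatedly. The z-score standardization is an assumption, since the text only says the data were normalized; the monthly values and function names are invented; and the naive pairwise search is meant for a handful of objects such as 12 months, not for large data sets.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column (variable); assumes no column is constant."""
    columns = list(zip(*rows))
    centers = [mean(col) for col in columns]
    spreads = [pstdev(col) for col in columns]
    return [[(v - c) / s for v, c, s in zip(row, centers, spreads)] for row in rows]


def centroid(points, members):
    return [mean([points[i][d] for i in members]) for d in range(len(points[0]))]


def ward_groups(points, n_groups):
    """Agglomerate with Ward's criterion until n_groups clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_groups:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = centroid(points, clusters[a])
                cb = centroid(points, clusters[b])
                na, nb = len(clusters[a]), len(clusters[b])
                gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
                cost = na * nb / (na + nb) * gap  # increase in within-cluster sum of squares
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Hypothetical monthly means (turbidity, conductivity) for four months
monthly = [[120.0, 40.0], [110.0, 42.0], [15.0, 70.0], [18.0, 68.0]]
groups = ward_groups(standardize(monthly), n_groups=2)
```

In this invented example the first two months (high turbidity, low conductivity) end up in one group and the last two in another. Months that fall in the same group carry similar information, which is the basis for judging whether a monthly frequency adds much over a quarterly one.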

Results and discussion

Analysis of water quality through FA/PCA

In the evaluation of data suitability for FA/PCA, Bartlett’s sphericity test confirmed the existence of significant correlations between the variables for both datasets (p value < 0.05). Regarding the KMO test, the value found was 0.74 for the data of the partial campaigns and 0.85 for the data of the complete campaigns, demonstrating correlation among the variables. The results of both tests confirmed the adequacy of applying FA/PCA to the data sets.

When extracting the factors of water quality variables from the partial campaigns, 18 factors were found, six of them with an eigenvalue greater than one, explaining 71% of the total variability of the data. Fig. 2 shows the eigenvalues in descending order and the cumulative variance among the obtained factors.

Fig. 2 Eigenvalues and percentage of the cumulative variance of the factors when analyzing the water quality variables of the partial campaigns

Table 2 shows the unrotated factor weight matrix for the water quality variables of the partial campaigns. Absolute values of factor loading ≥ 0.7 indicate the most significant variables in each factor.

Table 2 Unrotated factor weight matrix of water quality variables analyzed in partial campaigns

Based on the factor weight matrix (Table 2), it can be observed that only factors F1 and F6 presented factor loadings greater than or equal to 0.7, while the others presented loadings close to this value. In addition, some variables, such as total chloride and electrical conductivity, had similar loadings in more than one factor, which hinders the interpretation of the results. It was therefore reasonable to rotate the factors, since varimax rotation maximizes the variance of the loadings within each factor without affecting the proportion of the total variance explained by the set (Hair Jr. et al. 2009).

Table 3 shows the contribution of each component after the redistribution of the total variance among the factors by applying the varimax algorithm, without changing the total variance explained. Considering loadings greater than or equal to 0.7 as indicative of a strong association between a water quality variable and a factor, 12 variables were selected.

Table 3 Rotated factor weight matrix of water quality variables analyzed in partial campaigns

Table 3 shows that the rotation of the factors provided significant improvements in the results, since parameters that did not present a high factor loading in any factor of the unrotated matrix (Table 2) did so after the varimax rotation. Another positive aspect of the rotation was the better distribution of the factor loadings among the factors: each variable had a high loading in only one factor, which facilitates the interpretation of the result and the identification of possible sources of pollution in each factor. Improvements from the rotation process have also been observed in other studies on water quality (Guedes et al. 2012; Lopes et al. 2014; Rocha and Pereira 2016; Villas-Boas et al. 2017).

The first factor (F1) was responsible for 20% of the total variance of the data. The high factor loadings of total suspended solids (TSS), total solids, and turbidity can be interpreted as reflecting the high susceptibility of the basin to erosion. According to the PIRH-Doce, the characteristics of soils and relief make the Doce River basin fragile in terms of susceptibility to erosion, which is classified into four levels: very strong; strong; moderate; and low or zero. Of the total area, 58% is classified as strong and 30% as moderate (ECOPLAN-LUME 2010a). Also according to the PIRH-Doce, the most problematic areas in the Doce River basin are the upper reaches of the Piracicaba River and the Suaçuí Grande River basin.

As for the Piracicaba River basin, the elevated portions of the unit produce the largest amount of sediment, with values varying between 100,000 and 200,000 kg km−2 year−1. From the confluence of the Piracicaba River with the Doce River, the production decreases to 50,000 kg km−2 year−1. Among the aggravating factors of high sediment generation rates are the torrential rains, susceptible soils, and the land use in the basin, which has about 60% of anthropogenic areas (ECOPLAN-LUME 2010b). Regarding the Suaçuí River basin, the values also vary between 100,000 and 200,000 kg km−2 year−1. The extensive areas of the basin occupied by animal husbandry and mining (ECOPLAN-LUME 2010c) contribute to the erosive process.

The F2 was responsible for 16% of the total variability of the data and reflects the inorganic material dissolved in the water through the variables total chloride and electrical conductivity. According to Barakat et al. (2016), these variables may reflect the natural conditions of the basin through the weathering of rocks and consequent surface runoff. In addition to their natural origin, total chloride levels may also be related to releases of industrial and domestic effluents (Ramesh kumar and Anbazhagan 2018; Rocha and Pereira 2016). F2 also showed that pH and TDS, although not selected, presented moderate factor loadings, since electrical conductivity is positively correlated with dissolved solids (R = 0.73), a result consistent with several other studies in the literature (Zhang et al. 2011; Frančišković-Bilinski et al. 2013; Muangthong and Shrestha 2015; Barakat et al. 2016; Pavlidis et al. 2018).

The F3, accounting for 14% of the total variability of the data, is represented by total ammoniacal nitrogen and BOD, indicating that the water bodies of the basin suffer variation due to contamination by organic fertilizers from agricultural areas and by the discharge of untreated or partially treated domestic effluents. It can be observed in Table 3 that the variables total ammoniacal nitrogen and BOD presented a positive factor loading, while dissolved oxygen presented negative factor loading. In other words, F3 also shows the inverse relationship between dissolved oxygen and other variables, since BOD is related to the amount of oxygen required to degrade organic matter (Obade and Moore 2018). F4 is basically explained by total coliforms and thermotolerant coliforms, again indicating the precariousness of domestic sewage treatment and its release in the basin’s water bodies.

According to the Water Resources Action Plan of the Piracicaba River basin (PARH-Piracicaba), coliform contamination in UGRH2 Piracicaba is above the standards in almost all the stations in the basin, demonstrating that the discharge of domestic sewage is a constant problem (ECOPLAN-LUME 2010b). According to the Water Resources Action Plan of the Caratinga River basin (PARH-Caratinga), the overload of domestic effluents in the surface waters of UGRH5 Caratinga is clear from the results that do not conform to the class 2 limit for thermotolerant coliforms (61%), total phosphorus (32%), BOD (13%), and dissolved oxygen (11%), as well as from small violations of total ammoniacal nitrogen detected at isolated monitoring stations (ECOPLAN-LUME 2010d).

The fifth (F5) and the sixth (F6) factors are represented by chlorophyll a and pheophytin a, respectively. These variables refer to the primary productivity in the water bodies, being indicative of the physiological state of the phytoplankton and the degree of eutrophication of the aquatic environment (Giovanardi et al. 2018; Sun et al. 2018), which again results from the overload of untreated sanitary sewage and diffuse pollution from agricultural areas.

For the second round of FA/PCA, which considered the water quality variables monitored in the complete campaigns, total boron, dissolved copper, oils and grease, and total selenium had to be removed on the basis of the missing data percentage. The total cadmium variable was also removed because all values were equal to the minimum detection limit of the laboratory test (0.0005 mg L−1), resulting in a standard deviation of zero. In addition, the variables calcium hardness and magnesium hardness were removed because they carried essentially the same information as total calcium and total magnesium, respectively, as shown by the high correlation between them (~ 1.0). Therefore, 43 out of 50 water quality variables monitored in the complete campaigns were analyzed. Fig. 3 shows the eigenvalues in descending order and the cumulative variance among the 43 factors obtained for the 43 variables analyzed in the total campaigns.

Fig. 3 Eigenvalues and percentage of the cumulative variance of the factors when analyzing the water quality variables monitored in the total campaigns

Considering the factors with an eigenvalue greater than one, the 43 variables analyzed were reduced to 12 uncorrelated factors, which together explain 76% of the total variance of the data (Fig. 3). Table 4 presents the matrix of factor weights after the varimax rotation. Considering loadings equal to or greater than 0.7 as indicative of a strong association between a variable and a factor, 29 water quality variables were selected.

Table 4 Rotated factor weight matrix of water quality variables monitored in total campaigns

As can be seen in Table 4, the FA/PCA results for the total campaigns reinforce those obtained in the partial campaigns. However, because more variables were analyzed in the total campaigns, additional variables were selected as representative of the water quality variability in the Minas Gerais portion of the Doce River basin and, consequently, additional sources of pollution were identified that had not been considered in the analysis of the partial campaign data.

F1, which previously represented the high susceptibility of the basin to erosion, now also includes variables representative of heavy metal pollution: total barium, total lead, total chromium, total manganese, and total nickel. Several studies have shown the relationship between heavy metals and solids, demonstrating that only a small portion of the metals remains in the liquid mass and most is deposited in the sediments (Thuong et al. 2013; Malvandi 2017; Zhuang et al. 2018). Thus, sediments in the aquatic environment may play an important role in the deposition and transmission of heavy metals, which explains why both have high factor loadings in F1. In the Doce River basin, heavy metals are associated with regional geology; however, their concentration in surface waters is increased by the release of domestic effluents, by the use of agrochemicals, and by mining and metallurgy. They are all dominant economic activities in the Doce River basin (ECOPLAN-LUME 2010a).

A study on the water quality of the Doce River after the collapse of the iron ore tailings dam in the municipality of Mariana reinforces the strong correlation found among the variables in F1. According to ANA (2016), when analyzing only the monitoring stations affected by the collapse of the dam, a strong correlation between the turbidity and the concentration of total suspended solids, total solids, and total manganese was verified, since these variables presented increase in the same order of magnitude. The same study showed that variables such as total lead, total chromium, total arsenic, and total mercury also had the highest maximums above acceptable limits, according to the current legislation, after the occurrence of the dam collapse. However, although they were also linked to mining, total arsenic and total mercury only showed high factor loadings in F9 and F11, respectively.

Although the disaster discharged 34 million m³ of iron ore tailings into the waters of the Doce River basin, the dissolved iron variable presented a low correlation with the other variables and, consequently, a low factor loading in all 12 selected factors. ANA (2016) also found that, although its concentration increased after the dam collapse, dissolved iron showed a different dynamic from the other variables analyzed. This may explain the low factor loading of dissolved iron and its non-selection by FA/PCA in the present work.

The F2 continued to represent the natural conditions of the basin through the weathering of rocks and the consequent surface runoff; however, in this second analysis, we can note the addition of other variables associated with the same causes: total alkalinity, total calcium, total hardness, total magnesium, dissolved potassium, and dissolved sodium. Although the dissolved aluminum only presents a high factor loading in F4, it is also influenced by the natural conditions of the basin, since the soil of the region has in its chemical composition large concentrations of aluminum (ECOPLAN-LUME 2010a).

The other factors basically represent contamination by organic fertilizers originating from agricultural areas and the discharge of untreated or partially treated domestic effluents into the water bodies of the basin. The following variables were selected: BOD, surface active substances, total coliforms, thermotolerant coliforms, fecal streptococci, nitrate, total phenols, water temperature, chlorophyll a, and pheophytin a.

Comparing the first and second rounds of the FA/PCA, the difference lies basically in the variability of the water quality explained by the heavy metals: total arsenic, barium, chromium, lead, manganese, mercury, and nickel. Several studies emphasize the importance of monitoring these variables because of their bioaccumulative capacity, since they disturb metabolic processes and damage the biological systems of living beings (Lozano et al. 2010; Riguetti et al. 2015; Zapata et al. 2017). This demonstrates the importance of the second round of FA/PCA with the data from the complete campaigns, as well as the need to include some of these variables in the partial campaigns.

Analysis of violations of the limits established by the COPAM/CERH-MG Normative Resolution No. 01/2008

The analysis of violations of the framing class made it possible to identify the variables that most represented deterioration of water quality (20% of violations or more) in each UGRH of the Minas Gerais portion of the Doce River basin (Table 5).

Table 5 Percentage of violation of the framing class in the Doce River basin considering the variables with limits established by the COPAM/CERH-MG Normative Resolution No. 01/2008

As can be seen in Table 5, only thermotolerant coliforms, dissolved iron, total phosphorus, and total manganese presented framing class violation rates higher than 20% among the UGRHs. Keeping these variables in the monitoring program of the UGRHs is therefore a priority, as is keeping the variables indicated by the FA/PCA. Thermotolerant coliforms and total manganese were also pointed out in the second round of the FA/PCA, meaning that, in addition to presenting a high rate of violation of the framing class, they are among the main variables responsible for the variability of water quality in the Minas Gerais portion of the Doce River basin.
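The percentage of violation and the 20% criterion used above can be sketched as follows. This is a minimal illustration, not the procedure actually applied to the IGAM data: the limits are placeholders rather than the values of the COPAM/CERH-MG Normative Resolution No. 01/2008, and the sample records, variable names, and UGRH labels are made up.

```python
from collections import defaultdict

# Placeholder limits for illustration only; NOT the values of the resolution.
# "max": a sample violates when it is above the limit.
# "min": a sample violates when it is below the limit (as for DO).
LIMITS = {
    "total_manganese": ("max", 0.1),
    "dissolved_oxygen": ("min", 5.0),
}


def violation_percentages(samples, limits):
    """Return {(ugrh, variable): percentage of samples outside the limit}.

    samples: iterable of (ugrh, variable, value) records.
    Variables without a defined limit are ignored.
    """
    total = defaultdict(int)
    violated = defaultdict(int)
    for ugrh, variable, value in samples:
        if variable not in limits:
            continue
        kind, limit = limits[variable]
        key = (ugrh, variable)
        total[key] += 1
        if (kind == "max" and value > limit) or (kind == "min" and value < limit):
            violated[key] += 1
    return {key: 100.0 * violated[key] / total[key] for key in total}


def critical(percentages, threshold=20.0):
    """Return the (ugrh, variable) pairs at or above the violation threshold."""
    return sorted(key for key, pct in percentages.items() if pct >= threshold)


# Made-up records: (UGRH label, variable, measured value).
samples = [
    ("UGRH1", "total_manganese", 0.25),
    ("UGRH1", "total_manganese", 0.05),
    ("UGRH1", "total_manganese", 0.30),
    ("UGRH1", "total_manganese", 0.08),
    ("UGRH1", "dissolved_oxygen", 6.8),
    ("UGRH1", "dissolved_oxygen", 7.2),
    ("UGRH2", "total_manganese", 0.04),
    ("UGRH2", "total_manganese", 0.06),
]

percentages = violation_percentages(samples, LIMITS)
for key in critical(percentages):
    print(key, f"{percentages[key]:.1f}% of samples in violation")
```

Storing the direction of each limit next to its value keeps variables with a minimum requirement, such as DO, from being counted with the wrong inequality.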

The high rates of violation of thermotolerant coliforms and total phosphorus variables characterize the release of untreated domestic effluent as the main source of pollution that affects the quality of the water of the Doce River basin, a result that has also been found in several other studies in Brazilian basins (Souza and Gastaldini 2014; Oliveira et al. 2017, 2018; Costa et al. 2017; Vargas et al. 2018; Fraga et al. 2019; Soares et al. 2020).

In a study on water quality in the Xopotó River basin, a sub-basin of the Doce River, it was found that the microbiological quality of the water is deteriorating (Drumond et al. 2018). In addition to high concentrations of thermotolerant coliforms, a variety of bacterial genotypes were found that represent a potential risk of diarrheagenic diseases, emphasizing the poor microbiological quality of the water bodies of the basin, mainly due to the absence of sewage treatment plants (Drumond et al. 2018). According to ANA (2017), only 31 of the 200 municipalities in the Minas Gerais portion of the Doce River basin have some percentage of sewage treatment, and many of the effluent treatment plants are unable to remove microorganisms since they do not have tertiary treatment processes. Despite the inadequacy of the effluent collection and treatment system, the surface water presented low levels of DO and BOD violation in all UGRHs. This can be explained by the self-purification process, which re-establishes DO levels but does not reduce coliform levels (Andrade et al. 2018).

The high percentage of class violation of the dissolved iron and total manganese variables reflects the impacts of mining and of the release of steel industry wastes. The largest steel complex in Latin America is located in the Doce River basin. The extraction of iron ore is the main mineral exploration activity, with approximately 20% of the mining concessions in Minas Gerais. This entire industrial complex is responsible for most of Brazil's iron ore and steel exports (ECOPLAN-LUME 2010a). Total manganese is also related to mining; this metal is widely used in steelmaking, in the manufacturing of metal alloys and batteries, in textile industries (fabric paints), and in other chemical industries (varnishes, fireworks, and fertilizers) (CETESB 2016).

Previous studies have shown that these variables already presented problems related to violation of the framing class. When evaluating the IGAM data from 1997 to 2008 using fewer stations, it was found that the variables thermotolerant coliforms and total manganese had the highest violation rates in all the UGRHs. For the dissolved iron variable, it was verified that the variable exceeded the established limit of 20% in the UGRHs 1, 2, 4, and 5. For the total phosphorus, this limit was exceeded only in the UGRHs 1 and 5 (ECOPLAN-LUME 2010a).

It is worth mentioning that the high violation rates of the total manganese and dissolved iron variables are also associated with the concentration peaks caused by the dam collapse in Mariana in 2015. For these variables, the peaks significantly exceeded the values of the historical data series prior to the event (IGAM 2017b). Despite the trend of the analyzed variables returning to their previous conditions, the disturbances imposed on the affected ecosystems left significant damage in the Doce River. Much of the material leaked after the dam collapse is still deposited in the water bodies, which potentially compromises various water uses. In addition, the large volume of tailings accumulated in the water bodies affects the balance of aquatic ecosystems, compromising fauna, flora, and ecological processes such as self-purification (ANA 2016). Besides total manganese and dissolved iron, turbidity and total solids also presented peaks that exceeded the maximum values of the historical data series before the dam collapse.

Even though it presented violation values of less than 20%, dissolved aluminum proved to be an important variable and, although it is associated with the regional geology, its transport to the surface waters is intensified by the dominant economic activities in the basin. Violations of other heavy metals such as total arsenic, total lead, dissolved copper, total chromium, total mercury, and total nickel were also observed. On the other hand, some variables did not present violations: total boron, total cadmium, total chloride, nitrite, total selenium, total sulfate, and sulfide.

In order to prioritize the variables with the greatest impact in the basin and reduce the costs associated with monitoring, the variables that presented no violations may be sampled semiannually, which would result in the analysis of total chloride only in the complete campaigns. Although total chloride had a high factor loading in F2 in the first round of the FA/PCA, this change would not pose major problems, since electrical conductivity was also selected in the same factor and both represent the same pollution group.

Monitoring frequency analysis through CA

In the analysis of the monthly monitoring frequency, performed using only data from the stations installed in the Doce River riverbed, the CA gathered the 12 months of the year into four groups, as can be seen in the dendrogram in Fig. 4.

Fig. 4 Dendrogram resulting from CA, showing the grouping of the 12 months of the year

The dendrogram (Fig. 4) shows the influence of seasonality on the clusters formed. Clusters 1 (January and December) and 2 (February, March, and April) correspond to the rainy season, and clusters 3 (September and October) and 4 (May, June, July, August, and November) correspond to the dry season. This grouping resembles the distribution of the months in which the quarterly campaigns are carried out, demonstrating that, for the Doce River basin, the quarterly frequency can be satisfactory. On the other hand, greater Euclidean distances are observed between the months of the rainy season (clusters 1 and 2), showing that water quality is less similar from month to month in this period, which emphasizes the importance of adopting a monthly sampling frequency. Similar results were found by Calazans et al. (2018b) when evaluating the water quality monitoring network in the Velhas River basin, also located in the state of Minas Gerais.
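The grouping of months by similarity can be sketched with a small agglomerative clustering example. It is only an illustration under stated assumptions: the monthly means are made up, the two variables are hypothetical stand-ins, average linkage is assumed because this section does not state the linkage method, and the tree is cut at two groups for simplicity rather than the four groups of Fig. 4. In practice a library routine such as `scipy.cluster.hierarchy.linkage` would normally be used.

```python
import math
from itertools import combinations

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Made-up monthly means: (turbidity, total solids). Not measured data.
MONTHLY_MEANS = [
    (200, 300), (150, 250), (140, 240), (120, 210),
    (20, 80), (15, 70), (12, 65), (14, 68),
    (18, 75), (25, 85), (30, 95), (210, 310),
]


def zscore_columns(rows):
    """Standardize each column to zero mean and unit sample standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / (len(c) - 1))
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def agglomerate(points, labels, n_clusters):
    """Average-linkage agglomerative clustering with Euclidean distance.

    Starts with one cluster per point and repeatedly merges the two closest
    clusters until n_clusters remain. Returns a list of sets of labels.
    """
    clusters = [[i] for i in range(len(points))]

    def average_distance(a, b):
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: average_distance(clusters[pair[0]],
                                                     clusters[pair[1]]))
        clusters[a] = clusters[a] + clusters[b]   # a < b, so deleting b is safe
        del clusters[b]
    return [{labels[i] for i in cluster} for cluster in clusters]


standardized = zscore_columns(MONTHLY_MEANS)
groups = agglomerate(standardized, MONTHS, n_clusters=2)
for group in groups:
    print(sorted(group, key=MONTHS.index))
```

Standardizing the variables before computing distances prevents the variable with the largest numerical range from dominating the Euclidean distance.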

Due to the high cost of monitoring campaigns, the monthly frequency can only be maintained in the riverbed of the Doce River. However, it is also recommended that the changes suggested for the variables monitored in the partial campaigns be applied to the monthly campaigns. Thus, the monthly and partial campaigns would monitor the most representative variables of the water quality in the Minas Gerais portion of the Doce River basin.

Conclusions

A total of 14 out of 50 variables were identified as priority variables in the monitoring network: chlorophyll a, total coliforms, electrical conductivity, BOD, thermotolerant coliforms, pheophytin a, dissolved iron, total phosphorus, total manganese, total ammoniacal nitrogen, DO, total suspended solids, total solids, and turbidity. Contamination in the Doce River basin is due to a series of factors, including natural processes and anthropogenic activities, such as the high susceptibility of the basin to erosion; contamination by heavy metals, which are associated with the economic activities and the soils of the region; and the release of untreated or partially treated domestic effluents into the water bodies of the basin.

The high values of framing class violation for thermotolerant coliforms and total phosphorus indicate inappropriate sanitary conditions in the Doce River basin. The percentages of violation of total manganese and dissolved iron were also significant, intensified by the economic activities of the basin.

Based on the analyses, it is recommended to include the dissolved iron and total manganese variables in the partial campaigns and to sample total chloride only in the complete campaigns. This change would make the partial campaigns represent all sources of pollution in the Doce River basin.

The cluster analysis showed that the water quality variation of the Doce River is determined in part by the seasonality, reiterating the importance of monthly frequency monitoring in the stations of the Doce River basin.