Introduction

Surface water quality is an issue of worldwide concern due to the vulnerability of water bodies to environmental pollution. The surface water quality of a river basin is largely influenced by natural and anthropogenic processes such as the dissolution of geological deposits, biological degradation of organic matter, wet and dry atmospheric deposition, urban and industrial development, agriculture, and the increasing exploitation of water resources. These processes can adversely alter the physical and chemical characteristics of the aquatic environment (Ruždjak and Ruždjak 2015; Singh et al. 2005; Zhang et al. 2009).

The degradation of water quality can affect species diversity, decrease the overall health of aquatic ecosystems, and cause serious harm to human health and to the environment. Because of these harmful effects, regular water quality monitoring programs are needed as a basis for planning pollution control and prevention actions, and to provide reliable information on water quality for effective management. Furthermore, the characterization of seasonal changes in surface water quality is an important strategy for assessing temporal variations of the pollution caused by both natural and anthropogenic sources, which are season dependent in most cases (Ouyang et al. 2006; Ruždjak and Ruždjak 2015; Singh et al. 2005; Zhang et al. 2009). However, given the size and complexity of the resulting multivariate monitoring data sets, overall conclusions regarding the importance of the pollution sources and environmental patterns are difficult to reach when the pollutants are assessed one by one. Thus, multivariate statistical methods can be a useful tool for extracting important information from large data sets (Olsen et al. 2012).

The integrated use of multivariate statistical techniques such as cluster analysis (CA), principal component analysis (PCA), and positive matrix factorization (PMF) provides better interpretation and understanding of complex water quality data sets. These techniques permit the source identification and source apportionment of pollutants as well as the identification of clusters within the whole monitoring period, providing valuable information for the development of appropriate strategies for effective management of the water resources (Bilgin 2015; Li et al. 2015; Sergeant et al. 2016; Simeonov et al. 2003; Singh et al. 2005; Zhang et al. 2009). Different multivariate statistical techniques have often been combined in the assessment of surface water quality. Simeonov et al. (2003), Singh et al. (2005), Zhang et al. (2009), and Yang et al. (2010) employed at least two statistical techniques (PCA and CA) in studies related to water quality assessment. PMF, however, has had limited application to water systems (Li et al. 2015). Li et al. (2015) applied PMF to monitoring data for source apportionment of surface water pollution in the Daliao River basin, Northeast China, but did not combine it with any other technique, such as PCA or CA.

The Sinos River is the fourth most polluted river in Brazil. Its pollution comes from three main components, which are related to the land use along the three stretches of the Sinos River basin (SRB). In the upper stretch, there is intensive rice cultivation, which reduces water availability and also acts as a source of chemical contaminants, due to the use of agrochemicals. In the middle stretch, the pollution is mainly related to rural and urban occupation, in the form of point and non-point contamination by animal wastes and untreated domestic sewage. Industrial wastewater is added to domestic sewage in the lower stretch of the SRB, which is densely populated and has only a few domestic sewage treatment plants (Benvenuti et al. 2015; Blume et al. 2010; Nascimento et al. 2015; Petry et al. 2016).

The aim of this study is to use a combination of multivariate statistical techniques (CA, PCA, and PMF) to extract information about (1) the seasonal variations in water quality, (2) the parameters contributing most to seasonal water quality variation, (3) the influence of natural and anthropogenic sources on the water quality, and (4) the source identification and source apportionment of water pollution of the Sinos River. This is the first study carried out using this approach for the assessment of water quality in the SRB. This methodology can help to identify the main parameters responsible for most of the seasonal variations in the water quality and, on this basis, to optimize environmental monitoring programs. It also provides valuable data for the development of new and effective strategies for surface water quality management, in order to reduce environmental pollution problems.

Materials and methods

Sampling site

The Sinos River is located in the southern region of Brazil, in the state of Rio Grande do Sul. It rises in the hills of the municipality of Caraá, at an altitude of approximately 800 m above sea level, and runs for about 190 km until it reaches the Jacuí delta, in the municipality of Canoas, at an altitude of only 5 m above sea level (Fig. 1). The SRB covers two geomorphological provinces, the southern plateau and the central depression, and occupies an area of 3696 km2, comprising 32 municipalities (Petry and Schulz 2006).

Fig. 1 Map of the sampling site location within the SRB

The sampling site (latitude 29.730761° S and longitude 51.082849° W) is located in the rural area of the municipality of Novo Hamburgo. The water samples were collected from the raw water inlet of the municipal water treatment plant (WTP) operated by COMUSA. The site is located 55 km from the Sinos River mouth (Jacuí Delta) and 135 km from its spring, with an upstream area of approximately 2980 km2. Upstream of the site, the river crosses the municipalities of Santo Antônio da Patrulha, Taquara, Parobé, Sapiranga, and Campo Bom, receiving domestic sewage, industrial wastewater, and agricultural runoff, which contribute to the degradation of the water quality. This water is abstracted, treated, and distributed by COMUSA to a population of 238,940 inhabitants (IBGE 2017). Although the sampling site is located in the lower stretch of the SRB, which is characterized by anthropogenic pressure from highly urbanized and industrialized areas (Benvenuti et al. 2015; Blume et al. 2010), Fig. 2, which highlights the drainage area of the sampling site, also shows a large contribution from agricultural areas, with irrigated agriculture (cultivation of rice), forestry (planting of exotic species such as Pinus sp. and Eucalyptus sp.), and areas of mixed use, characterized by small plots of plantation and livestock activities.

Fig. 2 Land uses within the SRB affecting the Sinos River water quality in the sampling site

The main land uses of the study area result in the discharge of agrochemicals and in point and non-point sources of pollution, such as animal feces and untreated sewage (Benvenuti et al. 2015; Blume et al. 2010; Nascimento et al. 2015). In addition to the problems related to agricultural activities and the growth of urbanization, the vegetation cover has been reduced to 10% of the total basin area, which also affects water resources. The deforestation on slopes, springs, and water courses is visible, and the remaining vegetation consists only of small patches of native forest in adjacent areas (Matzenauer 2003).

Data set preparation

The data set with the measurements of water temperature, pH, electrical conductivity, turbidity, alkalinity, dissolved oxygen, biochemical oxygen demand, total coliforms, Escherichia coli (ECOLI), and manganese was provided by COMUSA and is the result of a monitoring program of weekly sample collection, comprising a period of almost two years (May 2013 to April 2015). The data on river water level were obtained from the Sinos River Basin Management Committee (COMITESINOS 2017). The data sets were integrated and used to assess the changes caused by the seasonal hydrological cycles in the Sinos River water quality. The water quality parameters, their units, abbreviations, equipment/methods used, detection limits (DL), missing data (MD), and data below detection limit (BDL) are summarized in Table 1.

Table 1 Water quality parameters, units, abbreviations, equipment/method, DL, MD, and BDL

The first treatment given to the data set was an evaluation to identify outliers, carried out using IBM SPSS Statistics 24. A total of 10 samples were excluded from the data set due to the presence of outliers, as multivariate statistical analyses, especially PCA, are sensitive to outliers (Olsen et al. 2012). This resulted in a data set of 95 samples. Then, the data set was split according to the seasons autumn (AUT) (March 21st to June 20th), winter (WIN) (June 21st to September 20th), spring (SPR) (September 21st to December 20th), and summer (SUM) (December 21st to March 20th), and the MD and data BDL were treated. The issue of MD and data BDL can be addressed in several ways, e.g., by eliminating samples (rows of the data matrix), by eliminating parameters (columns of the data matrix), or by imputing a value. As the highest percentages of MD and data BDL (6.3 and 14.7%, respectively) were observed only for the parameter BOD, we decided to impute the parameter-specific median for the MD and half of the DL (DL/2) for the data BDL (Polissar et al. 1998; Reff et al. 2007), rather than eliminating samples or parameters, as such elimination may result in loss of useful information and may leave the data set too small for an adequate PCA or PMF (Olsen et al. 2012; Reff et al. 2007).
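The imputation rule described above can be illustrated with a minimal Python sketch. The encoding of the input (None for a missing value, the string "BDL" for a result below the detection limit), the function name, and the example numbers are assumptions made for illustration; the sketch also assumes that the median is taken over the measured values only, which the text does not state.

```python
from statistics import median

def impute(values, detection_limit):
    """Replace missing values with the parameter median and BDL values with DL/2.

    Assumed encoding: None marks missing data, "BDL" marks a result below
    the detection limit. The median is computed from measured values only.
    """
    measured = [v for v in values if isinstance(v, (int, float))]
    med = median(measured)
    imputed = []
    for v in values:
        if v is None:
            imputed.append(med)                  # missing datum -> median
        elif v == "BDL":
            imputed.append(detection_limit / 2)  # below DL -> DL/2
        else:
            imputed.append(v)
    return imputed

# Hypothetical BOD series with a detection limit of 2.0 (illustrative units)
bod = [3.0, None, "BDL", 5.0, 4.0]
print(impute(bod, detection_limit=2.0))  # [3.0, 4.0, 1.0, 5.0, 4.0]
```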

Cluster analysis

Cluster analysis groups objects (cases) into classes (clusters) on the basis of similarities within a class and dissimilarities between different classes. The results of CA are useful for the interpretation of the data through the indication of patterns. Hierarchical clustering is the most common approach: clusters are formed sequentially, starting with the most similar pair of objects and forming higher clusters step by step (Singh et al. 2004; Vega et al. 1998). The Euclidean distance usually gives the similarity between two samples, and a “distance” can be represented by the “difference” between analytical values from both samples (Singh et al. 2004). Hierarchical CA was performed on the data set (mean of observations over the seasonal periods) by means of the between-groups linkage method, using squared Euclidean distances as a measure of similarity. CA was applied to the data set in order to assess the seasonal similarity and to group similar seasons into clusters, which reduces the number of statistical treatments and simplifies the interpretation of the results. CA was carried out using IBM SPSS Statistics 24.
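As an illustration of the procedure, the following Python sketch performs agglomerative clustering with between-groups (average) linkage on squared Euclidean distances. It is not the SPSS implementation used in the study; the function names are hypothetical and the seasonal values are made-up placeholders, not the monitoring data.

```python
from itertools import combinations

def sq_euclid(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cluster_distance(points, c1, c2):
    """Between-groups linkage: mean distance over all cross-cluster pairs."""
    total = sum(sq_euclid(points[p], points[q]) for p in c1 for q in c2)
    return total / (len(c1) * len(c2))

def average_linkage(points):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [(label,) for label in points]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(points, *pair))
        merges.append((a, b, cluster_distance(points, a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Made-up standardized seasonal means for two parameters (placeholders only)
seasons = {"AUT": [0.9, 0.8], "SUM": [1.0, 1.0],
           "WIN": [0.2, -0.6], "SPR": [-1.8, 0.9]}
for a, b, d in average_linkage(seasons):
    print(a, "+", b, "at distance", round(d, 3))
```

With these placeholder values, AUT and SUM merge first, which mirrors the way a dendrogram is read: the lower the distance at which two branches join, the more similar the seasons.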

Principal component analysis

PCA is a powerful multivariate technique for pattern recognition that can be applied to a data set in order to evaluate the relations between the parameters and to find groups of parameters with similar behavior, which also enables the identification of their emission sources. Through PCA, it is possible to reduce a large number of parameters to a smaller set of factors (principal components (PC)) which describe the whole data set, retaining most of the original information (variability) (Field 2009; Hair et al. 2005). Sampling adequacy was verified using the Kaiser-Meyer-Olkin (KMO) test and Bartlett’s test of sphericity. The extraction was based on eigenvalues greater than 1. The varimax rotation method with Kaiser normalization was applied to simplify the interpretation of the PC. Only the parameters with communalities greater than 0.5 were used (Singh et al. 2005). PCA was carried out using IBM SPSS Statistics 24.
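The extraction step can be sketched in Python with NumPy as follows. This is a simplified illustration, not the SPSS procedure: it extracts components from the correlation matrix, keeps those with eigenvalues greater than 1, and computes unrotated loadings and communalities. The varimax rotation and the KMO and Bartlett tests are omitted, and the input is synthetic stand-in data.

```python
import numpy as np

def pca_kaiser(X):
    """PCA on the correlation matrix, keeping components with eigenvalue > 1."""
    X = np.asarray(X, dtype=float)
    R = np.corrcoef(X, rowvar=False)          # correlations between parameters
    eigvals, eigvecs = np.linalg.eigh(R)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0                      # eigenvalue-greater-than-1 rule
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])  # unrotated loadings
    communalities = (loadings ** 2).sum(axis=1)
    explained = eigvals[keep] / eigvals.sum() # variance share per retained PC
    return eigvals, loadings, communalities, explained

# Synthetic stand-in data: 60 samples of 6 parameters driven by two hidden signals
t = np.linspace(0.0, 1.0, 60)
f1, f2 = np.sin(2 * np.pi * t), np.cos(6 * np.pi * t)
X = np.column_stack([f1, f1 + 0.3 * f2, -f1 + 0.2 * f2,
                     f2, f2 + 0.3 * f1, 0.5 * f1 - f2])

eigvals, loadings, communalities, explained = pca_kaiser(X)
print(loadings.shape[1], "components retained")
print("communalities:", np.round(communalities, 3))
print("explained variance (%):", np.round(100 * explained, 1))
```

A parameter whose communality falls below 0.5 would be dropped and the analysis repeated, following the criterion stated above.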

Positive matrix factorization

PMF is a multivariate receptor modeling tool that decomposes a matrix of speciated sample data into two matrices (factor contributions and factor profiles) in order to understand the factors or sources impacting the speciated sample data (Brown et al. 2015; EPA 2014). PMF is especially suitable for environmental data because it incorporates the parameter uncertainties often associated with measurements of environmental samples and forces all of the values in the solution profiles and contributions to be nonnegative, which is more realistic than the solutions provided by previously used methods such as PCA (Reff et al. 2007). For the uncertainty file, 4 times the parameter-specific median was used as the uncertainty for missing data and 5/6 of the method detection limit (MDL) was used as the uncertainty for data below the DL (EPA 2014; Li et al. 2015; Polissar et al. 1998; Reff et al. 2007). The uncertainty for parameters with concentrations greater than the MDL was calculated according to Eq. 1 (EPA 2014):

$$ \mathrm{Unc}=\sqrt{{\left(\mathrm{Error}\ \mathrm{fraction}\times \mathrm{Concentration}\right)}^2+{\left(0.5\times \mathrm{MDL}\right)}^2} $$
(1)

EPA PMF 5.0 was used for PMF analysis. The error fraction was set at 10% for all the parameters (Li et al. 2015).
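The uncertainty rules described in this section can be written as a short Python sketch. The function names and example values are hypothetical; the three branches follow the rules stated above (4 times the median for missing data, 5/6 of the MDL for data at or below the MDL, and Eq. 1 otherwise), with the 10% error fraction as the default.

```python
import math

def eq1_uncertainty(conc, mdl, error_fraction=0.10):
    """Eq. 1: uncertainty for a concentration greater than the MDL."""
    return math.sqrt((error_fraction * conc) ** 2 + (0.5 * mdl) ** 2)

def uncertainty(conc, mdl, median, error_fraction=0.10):
    """Uncertainty for one entry of the PMF uncertainty file.

    conc is None for a missing datum (assumed encoding).
    """
    if conc is None:
        return 4.0 * median            # missing data: 4 x parameter median
    if conc <= mdl:
        return (5.0 / 6.0) * mdl       # at or below the detection limit
    return eq1_uncertainty(conc, mdl, error_fraction)

# Illustrative values only (not taken from the data set)
print(uncertainty(3.0, mdl=0.8, median=2.0))   # Eq. 1: sqrt(0.3^2 + 0.4^2), about 0.5
print(uncertainty(None, mdl=0.8, median=2.0))  # 8.0
print(uncertainty(0.4, mdl=0.6, median=2.0))   # 5/6 of 0.6, about 0.5
```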

Results and discussion

Temporal grouping by seasonal similarity

CA was applied to the data set to detect seasonal similarities in water quality. The dendrogram resulting from CA (Fig. 3) shows the similarity between autumn (AUT) and summer (SUM), which makes it possible to group these two seasons into a single cluster (SUM-AUT). It also shows that winter (WIN) is dissimilar from both SUM-AUT and spring (SPR), and that SPR is dissimilar from both WIN and SUM-AUT.

Fig. 3 Dendrogram using average linkage (between groups) of seasonal river water quality

The results provided by CA indicate that only three seasonal periods are sufficiently robust for a temporal assessment of the Sinos River water quality at the sampling site. Thus, three periods are used in the statistical treatments: SUM-AUT, WIN, and SPR.

Seasonal variations of water quality

The data sets were not normally distributed according to the Kolmogorov-Smirnov test (at a significance level of 5%). The differences between the parameters within the SUM-AUT, WIN, and SPR periods were therefore determined by the nonparametric Kruskal-Wallis test (at a significance level of 5%), followed by the Student-Newman-Keuls post hoc test. Descriptive statistical analyses were applied to the data sets to determine the median, mean, standard deviation (SD), minimum (Min), and maximum (Max) values (Table 2). Kolmogorov-Smirnov and Kruskal-Wallis tests were carried out using IBM SPSS Statistics 24 and BioEstat 5.3, respectively.
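For readers unfamiliar with the Kruskal-Wallis test, the following Python sketch computes its H statistic with tie correction for three groups. It is an illustration only, not the BioEstat procedure: the p value uses the chi-square approximation, which for three groups (2 degrees of freedom) reduces to exp(-H/2) and is rough for small samples. The post hoc comparison is omitted, and the DO values are made up.

```python
import math
from collections import Counter

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def kruskal_wallis(*groups):
    """Tie-corrected Kruskal-Wallis H statistic for two or more groups."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    return h / (1 - tie_term / (n ** 3 - n))

def p_value_three_groups(h):
    """Chi-square survival function with 2 degrees of freedom."""
    return math.exp(-h / 2)

# Made-up DO concentrations (mg/L) for the three periods
sum_aut = [5.1, 5.4, 5.0, 5.6]
win = [7.2, 7.5, 7.0, 7.8]
spr = [6.1, 6.3, 6.0, 6.5]
h = kruskal_wallis(sum_aut, win, spr)
print(round(h, 3), round(p_value_three_groups(h), 4))
```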

Table 2 Descriptive statistical summary showing the statistical differences of the Sinos River water quality during the study period

Statistically significant differences were identified for the parameters water level (WL), water temperature (WT), electrical conductivity (EC), dissolved oxygen (DO), biochemical oxygen demand (BOD), and ECOLI. WL is significantly higher in WIN and SPR, and EC and ECOLI are significantly lower in these periods. This is possibly related to the increase in river water volume and the consequent dilution of salts and bacterial colonies, which are directly related to EC and ECOLI, respectively. In the WIN period, WT is significantly lower, whereas the concentration of DO is significantly higher and the concentration of BOD is significantly lower. This demonstrates the action of physical variables (such as temperature) on DO concentrations, since DO concentrations tend to increase at lower temperatures. The decrease in BOD may be related to the decrease in temperature and also to the dilution of the organic matter by the increase in river water volume.

To better understand the influences of WL and WT on the remaining parameters, Spearman’s nonparametric correlation test was carried out using IBM SPSS Statistics 24 with both WL and WT as reference parameters (RP), in order to assess the seasonal influences of these two physical parameters on water quality. Table 3 shows the correlation matrix of WL and WT with each other and with the remaining monitored parameters.

Table 3 Spearman’s correlation coefficient matrix with WL and WT as reference parameters

Table 3 confirms the seasonal influences of the physical parameters WL and WT on the physicochemical parameters, especially on DO, which presented significant correlations (at the significance levels of 1 and 5%) with WL and WT in all of the SUM-AUT, WIN, and SPR periods. These data can be useful to predict water quality behavior with a high degree of confidence and also to develop working strategies in WTP, as is the case of COMUSA.
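Spearman’s coefficient is the Pearson correlation computed on the ranks of the two variables. A minimal Python sketch, with hypothetical function names and made-up WT and DO values, is given below; the significance test reported in Table 3 is not reproduced here.

```python
import math

def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    ordered = sorted(values)
    n = len(ordered)
    return [(ordered.index(v) + 1 + n - ordered[::-1].index(v)) / 2
            for v in values]

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranked data."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Made-up paired observations: water temperature (deg C) and DO (mg/L)
wt = [14, 16, 19, 22, 25, 27]
do = [8.9, 8.5, 7.8, 7.9, 6.4, 6.0]
print(round(spearman(wt, do), 3))  # -0.943: DO falls as WT rises
```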

Principal component analysis

PCA was applied to the SUM-AUT, WIN, and SPR data sets in order to assess the temporal variation of the monitored parameters and to identify the main parameters responsible for variations in water quality. The results of the KMO test were 0.607, 0.658, and 0.642 for SUM-AUT, WIN, and SPR, respectively, which indicate sampling adequacy for PCA. The significance of Bartlett’s test of sphericity was < 0.001 for all the data sets, indicating that they are factorable. Mn and BOD were excluded from all the data sets and pH was excluded from the SPR data set, since these parameters showed communalities less than 0.5. PCA resulted in three PC for both SUM-AUT and WIN and two PC for SPR, which explained 77.6, 81.1, and 84.3% of the data set variance, respectively. The rotated component matrix is shown in Table 4.

Table 4 Component matrix after varimax rotation with Kaiser normalization

PC1 explained 41.5, 35.7, and 45.6% of the total variance of SUM-AUT, WIN, and SPR, respectively. PC1 showed high positive loadings for WL, COLOR, TURB, and OM and high negative loadings for EC and ALK in SUM-AUT and high positive loadings for WL, COLOR, and TURB and high negative loadings for pH, EC, and ALK in WIN. During periods of increasing WL, COLOR and TURB increase abruptly, probably because of the surface runoff from watersheds and the movement of the sediment layers of the river bed, which frequently contain large amounts of hazardous pollutants (heavy metals, oil products, dioxins, etc.) (Dolgonosov and Korchagin 2005). Such an increase in TURB is commonly accompanied by a deterioration of some organoleptic (color index and odor) and generalized (permanganate oxidability) characteristics (Dolgonosov and Korchagin 2005). While COLOR and TURB increase with increasing WL, the opposite occurs with EC and ALK, which are probably diluted by the higher volume of water. This suggests that the largest sources of EC and ALK are present along the river, as these pollutants are not being brought in with the surface runoff from watersheds. These patterns are clearly demonstrated in PC1 of SUM-AUT and WIN. PC1 showed high negative loadings for WL and DO and high positive loadings for WT, EC, TCOLI, and ECOLI in SPR, representing the physicochemical influence of WT (Jha et al. 2007; Vega et al. 1998; Yang et al. 2010), especially on the concentrations of DO. The main processes that affect the oxygen concentration in water are the water temperature, as the solubility of oxygen decreases with increasing temperature, and the oxygen consumption by the microorganisms living in the water while they decompose biodegradable organic matter (Jha et al. 2007; Vega et al. 1998). This explains the opposite signs of the WT and DO loadings in the SUM-AUT, WIN, and SPR periods.
Also, as shown in Table 2, the median concentration of DO was significantly higher in WIN, when WT was significantly lower, which clearly confirms the action of WT on the variation of DO concentrations. Similar PC loading patterns were obtained for SUM-AUT and WIN, as these two periods show the same block of parameters (EC, COLOR, TURB, and ALK) in PC1. The similarity between SUM-AUT and WIN can also be observed in Fig. 3, where SUM-AUT and WIN are close to each other and far away from SPR.

PC2 explained 20.0, 25.8, and 38.7% of the total variance of SUM-AUT, WIN, and SPR, respectively. PC2 showed high negative loadings for WT and high positive loadings for pH and DO in SUM-AUT, representing the physicochemical variations (Jha et al. 2007; Vega et al. 1998; Yang et al. 2010), similarly to what was described for PC1 in SPR. PC2 showed high positive loadings for OM, TCOLI, and ECOLI in WIN, and high positive loadings for COLOR, TURB, and OM and high negative loadings for ALK in SPR.

PC3 explained 16.1 and 19.6% of the total variance of SUM-AUT and WIN, respectively. PC3 showed high positive loadings for TCOLI and ECOLI in SUM-AUT and high positive loadings for WT and negative loadings for DO in WIN. PC3 of SUM-AUT and PC2 of WIN showed almost the same parameters (except for the parameter OM in WIN), and PC3 of WIN and PC2 of SUM-AUT also showed almost the same parameters (except for the parameter pH in SUM-AUT), which once more shows the similarity between these two periods.

In this study, any water quality parameter with an absolute correlation coefficient value > 90% was considered to be an important parameter contributing to seasonal variations of the Sinos River water quality. In SUM-AUT, only TCOLI (90.1%) was identified as an important parameter contributing to seasonal variations of the Sinos River water quality, representing the influences of bacteriological inputs. In WIN, the parameters WL (91.2%), WT (93.5%), and EC (93.8%) were identified as the most important parameters for the river water quality variations, representing the environmental influences (WL and WT) and the mineral-related influences (EC). In SPR, the most important parameters contributing to water quality variations were COLOR (97.6%) and TURB (97.2%), which is probably related to organic inputs. This demonstrates that a water quality parameter that contributes strongly to water quality variation in one period may not be important in another period. Therefore, the seasonal variation in the water quality parameters should be considered when selecting water quality parameters for the establishment of strategies for monitoring programs and also for the management of WTP along the river (Ouyang et al. 2006).

Positive matrix factorization

In order to have an overview of the entire data set profile, EPA PMF 5.0 was applied to the source identification and source apportionment of surface water pollution of the Sinos River. Except for WT, WL, and pH, which were categorized as “bad”, all the remaining parameters were included in the analysis. The signal-to-noise (S/N) ratios of the parameters ranged from 3.0 to 9.0. The scaled residual distribution plots were used to determine how well the model fit each parameter. Parameters showing large scaled residuals or displaying nonsymmetrical distributions, which was the case of EC, OM, DO, BOD, TCOLI, and Mn, were categorized as “weak” before the analysis, to avoid a poor model fit (EPA 2014; Li et al. 2015). EPA PMF 5.0 was run with two to five factors to find the optimal number of factors. A solution of three factors met the requirements of the three levels of the EPA PMF 5.0 model (Base Model Displacement Method, Base Model Bootstrap Method, and Base Model BS-DISP Method) for a convergent solution. Figure 4 shows the factor profiles of water pollution of the Sinos River.

Fig. 4 Factor profiles of water pollution of the Sinos River from EPA PMF 5.0 base run number 20 (convergent run)

As shown in Fig. 4, Factor 1 is mainly composed of contributions of EC, ALK, OM, DO, BOD, and Mn and represents the anthropogenic contributions (domestic sewage, industrial wastewaters, and agricultural runoff). OM in surface water is derived both from natural degradation of some organic substances within the ecosystem and from anthropogenic activities, such as the discharge of domestic sewage and industrial wastewaters. OM is not easily removed from the water using normal treatment techniques such as coagulation, and therefore, it enters the treatment system and part of it can remain in the treated water, which is distributed through the water supply network. Moreover, during the disinfection process with chlorine, which is a common treatment technique in municipal water supply facilities, OM can be converted to potentially harmful disinfection by-products such as trihalomethanes (THM), haloacetic acids (HAA), and haloacetonitriles (HAN) (Panyapinyopol et al. 2005). Factor 2 is mainly composed of COLOR and TURB, but with a considerable contribution of OM (36%) and DO (45%), representing the anthropogenic (surface runoff from watersheds) and natural (movement of the sediment layers of the river bed) contributions. Factor 3 is mainly composed of TCOLI and ECOLI, representing the anthropogenic/environmental contributions (fecal and environmental contamination). The microbiological contamination of surface water can be assessed by means of TCOLI and ECOLI, although only ECOLI can indicate fecal contamination. The fecal pollution of surface water is basically of human and animal origin, through point and non-point emission sources (Garcia-Armisen and Servais 2007).
The major point source of TCOLI and ECOLI in areas such as the SRB is domestic sewage, since only a small part of the population is connected to a sewage collection network that could deliver the wastewater to treatment plants, while the majority discharges untreated wastewater directly into the river. The non-point sources are represented by both contaminated surface runoff and soil leaching, originating mainly from the feces of wild animals and grazing livestock and also from the cattle manure spread on cultivated areas.

Conclusions

This study comprises a seasonal assessment of a surface water quality data set resulting from a monitoring program of almost two years, carried out at the raw water intake of a WTP, using statistical techniques such as CA, the Kruskal-Wallis test, Spearman’s correlation analysis, PCA, and PMF. CA showed that the seasons summer and autumn could be grouped due to their similarity in relation to the monitored parameters. Statistically significant seasonal variations were identified through the Kruskal-Wallis test for WL, WT, EC, DO, BOD, and ECOLI. Therefore, seasonal variations should be taken into consideration when these parameters are used as indicators in the evaluation of the Sinos River water quality. Moreover, WL and WT play a prominent role in the seasonal variation of water quality and should be considered as key parameters in the evaluation of other parameters of water quality.

PCA resulted in three PC for SUM-AUT and WIN and two PC for SPR, explaining 77.6, 81.1, and 84.3% of the data variance, respectively. The results showed that the importance of the parameters is season dependent, although similar behavior could be observed among the PC distributions across the SUM-AUT, WIN, and SPR periods. Applied to the whole data set, PMF resulted in three factors, which are mostly in accordance with the results provided by PCA, representing the discharge of domestic sewage, industrial wastewater, and agricultural runoff. In summary, the results provided by PCA/PMF suggested that organic pollution, biochemical pollution, agricultural runoff, physicochemical processes, and soil weathering are the potential pollution sources affecting the Sinos River water quality.

Thus, this study demonstrates the usefulness of multivariate statistical techniques such as CA, PCA, and PMF in the analysis and interpretation of complex data sets such as those resulting from water quality monitoring programs of urban rivers. These techniques support the identification and apportionment of pollution sources and the understanding of temporal variations in water quality, and can therefore be an important tool for water resources management and, to some extent, a guide for proper WTP management. Finally, considering the representative concentrations of fecal bacteria (ECOLI) and their high capacity to contaminate water, it is suggested that the treatment of domestic sewage should be a priority action aimed at the recovery of the environmental quality of the SRB, in order to avoid larger environmental problems and their harmful consequences to human health. In addition, it is important to mention that pollution increases water treatment costs and that, despite the high pollution loads of the Sinos River, COMUSA is currently still able to treat and supply drinking water to the community.