Introduction

Surface water quality is very important and very sensitive issue. It is determined by natural processes such as dissolution of geological deposits, biological degradation of organic matter, and atmospheric deposition. In addition to natural sources, urban and industrial development, farming, building of dams, and diversion of flow and other human activities within the aquatic environment alter the physical and chemical composition of water (Meybeck 1998; Malmqvist and Rundle 2002; Mayer et al. 2002). Since degradation of water quality can result in altered species diversity, decrease the overall health of aquatic ecosystem, and cause a serious harm to human heath and the environment, it is necessary to monitor the water quality regularly. Long-term water quality monitoring programs (Dixon and Chiswell 1996) often provide large sets of data which are difficult to interpret. In attempt to draw meaningful conclusions and achieve better understanding of water quality, different multivariate techniques, such as principal component analysis (PCA), factor analysis (FA), or cluster analysis (CA), have been applied to these complex data matrices (see, e.g., Vega et al. 1998; Simeonov et al. 2003; Bouza-Deano et al. 2008; Li et al. 2011; Pinto and Maheshwari 2011; Olsen et al. 2012; Wang et al. 2012; Mei et al. 2014).

PCA is widely used technique in science. It was often applied to exploratory data analysis in water quality research as has been recently summarized by Olsen et al. (2012). PCA is the simple non-parametric method which can be used to reduce a complex data set to a lower dimension and capture sometimes hidden, underlying patterns present within the original data (Wold et al. 1987). It is optimal in terms of mean squared error, and the model parameters can be computed directly from the data. Despite these attractive features, PCA models have several shortcomings. First, it is not obvious how to deal properly with incomplete data set (what is often the case). Second, the standard PCA algorithm is based on the assumption that data are not spoiled by outliers. In case of outliers, robust version of PCA has to be developed (Ruymgaart 1981; Maronna et al. 2006; Stanimirova et al. 2007). To account for outliers, the most commonly used approach is to replace the standard estimation of the covariance matrix with a robust one (Campbell 1980; Croux and Haesbroeck 2000; Hubert et al. 2002).

The main goal of this study is to determine the state of the Sava River in the transition period between the end of the war in Croatia and the adjustment of its legislation to the European standards. Therefore, the results of this study can be used as ‘baseline’ condition when examining the impacts of those changes. For this purpose, different multivariate statistical approaches were applied to the data matrix obtained during the 7-year water quality monitoring of the Sava River in Croatia with the goal to detect and quantify the water quality trends. Pearson correlation coefficient, PCA, and CA were used to obtain the information about the latent factors explaining the data set, to describe the seasonal and spatial variations, the influence of possible sources (natural or anthropogenic) on the parameters, and the similarities between sampling sites. Robust PCA algorithm was used to check the influence of the outliers in the data set on the results. All mathematical and statistical computations were made using MATLAB version 7.7.0.471 (R2008B).

Materials and methods

Study area

Water samples used in this work were collected from the Sava River in Croatia. The river is located in Southeast Europe. It has its source in Slovenia, flows through Croatia, Bosnia and Herzegovina, and on to Serbia where it joints, as side tributary, river Danube in Belgrade. It is suitable for navigation downstream of Sisak, Croatia. The river is 945 km long; in Croatia, it flows in the length of 510 km and for the most part constitutes the border with Bosnia and Herzegovina (313 km). The average flow of the Sava River at its entry into Croatia (Jesenice) and at its exit from Croatia amounts to 300 and 1200 m 3/s, respectively. The main tributaries in this segment of the basin are Kupa, Una, Vrbas, Sutla, Krapina, Lonja, Ilova, Orljava, and Bosut. Major towns along the river that have considerable industrial capacities are Zagreb, Sisak, Slavonski Brod, and županja. The main sources of pollution come from untreated wastewaters discharged from municipalities and industries along the Sava River as well as the tributaries.

In its middle course, downstream of Zagreb, Sava becomes a very large lowland river; it is unique example of a river with flood plains that are still intact and support flood alleviation and biodiversity. It has been considered as one of the key areas in the Pan European Biological and Landscape Diversity Strategy (PEBLDS) of the Council of Europe. The Lonjsko Polje Nature Park and Special ornithological reserve Crna Mlaka (Crna Mlaka Fishponds) have been included on the Ramsar List of Wetlands of International Importance and on the list of internationally Important Bird Areas. The Lonsko Polje has also been included in the ecological network in its entirety as an area important for the conservation of species and habitats. In the Sava River Basin in Croatia, almost entire public water supply has been based upon groundwater exploitation. Since the water of the Sava River is in the hydraulic connection with groundwater, the quality and the quantity of the Sava River are very important.

Monitoring sites

In Croatia, the water quality monitoring network is operated primarily by Croatian Waters. The Sava River monitoring system covers 18 sample sites (Fig. 1) where the water quality is tested twice per month. Water quality monitoring stations are located upstream and downstream of the tributaries, developed areas and inputs from drains. The first and the last station (S1, S18) are located just downstream/upstream of the international border with Slovenia/Serbia. Stations S6, S8, S10, and S14 are located upstream while the stations S7, S9, S11, and S15 downstream of the right tributaries Kupa, Una, Vrbas, and Bosna, respectively. Stations S6 and S7 are also covering an area of Sisak. Stations S2, S12, and S16 are located upstream while the stations S3, S4, S5, S13, and S17 downstream of Zagreb, Slavonski Brod and županja. Station S4 is located downstream from the mouth of the Main Drainage Channel of Zagreb, where the impact of discharged Zagreb wastewaters was significant. Twelve Trans National Monitoring Network (TNMN) stations are operating in the Sava River Basin, among them three stations are on the Sava River in Croatia (S1, S8, and S17).

Fig. 1
figure 1

Map of the sampling site locations

Sampling and chemical analysis

Water quality of the Sava River is monitored regularly in accordance with national regulations. In this study, 18 physico-chemical parameters obtained at each station were used for analysis. Data for the sampling period from 01 January 2000 to 31 December 2006 are presented when samples were collected twice per month (every other week). Sampling, preservation, and chemical analyzes were performed according to standard analytical methods for the examination of surface waters (ISO; APHA 1995; EPA 1999) which are routinely applied in the water quality monitoring laboratories. Air and water temperatures were measured on the site while pH, electrical conductivity, and dissolved oxygen were measured both on the site (by using standard electrochemical techniques; portable pH, dissolved oxygen and conductivity meters, HANNA instruments) and in the laboratory. For meaningful data interpretation, water quality parameters measured at various temperatures need to be transformed to values corresponding to a standard temperature (Hayashi 2004). Therefore, only data measured in the laboratory were used here. Unless otherwise specified in the methods, water samples were collected in 3–5 L polyethylene containers, stored in the dark at 4 C, and analyzed within 24 h. For the total suspended solids determination, samples were filtered through 0.45 cellulose nitrate membrane filter (Whatman, Springfield Mill, England). The water quality parameters, their units and abbreviation and methods of analysis are summarized in Table 1.

Table 1 Determined water quality parameters and analytical methods used in the study

Although at most stations, samples have been collected over 30 years and there are more than 50 water quality parameters available (including detergents, pesticides, and other organic parameters, biological and microbiological parameters, heavy metals, etc.), only 26 parameters have been selected due to the continuity in measurement at all water quality monitoring stations. Constituents that were not routinely analyzed and/or analyzed at all stations were excluded. This data set was further refined by retaining only 18 parameters; highly redundant constituents (such as multiple forms of nitrogen) and constituents that have nearly constant concentrations over the entire watershed were excluded. Finally, we have 18 parameters at 18 locations over the 7-year period.

Data and statistical methods

Data set

The data set used in this work consisted of 3164 samples by 18 variables. This rather large set of data (a total of 56,952 values) contained only eight missing data (0.01 %) which were replaced with median values of appropriate variables (Olsen et al. 2012) and 234 data (0.4 %) which were found to be below detection limit and were replaced with half of the detection limit value (Farnham et al. 2002). Basic statistical properties of the data set used are summarized in Table 2. Along with the distribution of missing data (MD) and the data below the detection limit (DL), the moments of the data distributions are given. The values of skewness and kurtosis of all variables but A-TEMP, W-TEMP, EC, and DO depart significantly from the normal distribution. Hence, to describe the distribution of the data in this work, the robust statistic, median, and interquartile range (IQR) are used (right hand-side of the Table 2). All data whose values lay outside the interval (Q 1−3⋅IQR, Q 3+3⋅IQR), where Q 1 and Q 3 are the first and third quartile, respectively, were considered as outlying. According to this criterion, there were 1.7 % univariate and 3.8 % multivariate outliers in the data set. The number of multivariate outliers was determined by applying described criterion on the distribution of Mahalanobis distances from common center (Mahalanobis 1936). We do not want to suggest that those values are indicative of measurement errors or bias in the data but to emphasize that populations have skewed and heavy-tailed distributions and that tools that assume a normal distribution should be used with caution.

Table 2 Descriptive statistics of the water quality parameters

Principal components analysis (PCA)

PCA is a mathematical procedure that uses orthogonal transformation to convert a set of observations into a set of linearly uncorrelated variables called principal components. Due to the quadratic error criterion, standard PCA algorithm is very sensitive and its output can change dramatically in the presence of only a few outliers (Maronna et al. 2006). The data set used in this work contained variables whose distributions strongly deviate from normal distribution, with some having strongly skewed and heavy tailed distributions and/or non negligible number of outliers (see Table 2). Therefore, in addition to standard algorithm, the robust PCA algorithm was used. The PCA was performed on covariance matrix that was calculated by weighting the mean and the products which form it. The data were weighted with respect to their distance from the center. Two ways of weighting were used: univariate where the variables were weighted independently and multivariate where all variables were weighted simultaneously. In the first case, the weight of each data point was calculated as inverse value of absolute distance from median in the units of median of absolute distances for a given variable:

$$ w_{i}=\frac{median(|x_{i}-median(x_{i})|)}{|x_{i}-median(x_{i})|} $$
(1)

where w i is the weight, x i are measurements of given variable, and || denotes absolute value. In this way, matrix of weights w i j of the same dimension as the data matrix was obtained. In the multivariate case, the data were weighted with the normalized reciprocal value of Mahalanobis distance. In this way, obtained vector w i obtained in this way was used in calculation of covariance matrix.

Cluster analysis (CA)

CA is a common technique for statistical data analysis and exploratory data mining used in many fields as well in water quality assessment (Shrestha and Kazama 2007; Kazi et al.2009). With the aim of examining spatial variability (grouping similar sampling sites and spreading them over the river), in this study, hierarchical agglomerative clustering was performed on the normalized data set by the Ward’s method of linkage with squared Euclidean distances as a measure of similarity. Unlike PCA which uses a number of principal components for display purposes, CA uses all the information contained in the original data set.

Results and discussion

Correlation

The correlation matrix of the 18 variables was calculated. The obtained correlation coefficients should be interpreted with caution because they reflect both spatial and temporal variations (see Fig. 2). Strong correlations between A-TEMP and W-TEMP (R = 0.89), COD Mn and COD Cr (R = 0.71), TSS and both CODs (R = 0.70, R = 0.54), as well as DO and OS (R=0.69) and DO and temperatures (R=-0.60, R = -0.51) can be observed. Somewhat weaker correlations are observed between both CODs and BOD 5, TO and MO, NH\(_{4}^{+}\) and TKN, and NO\(_{3}^{-}\) and temperatures.The obtained results are expected and consistent with the fact that these variables are dependent on each other.

Fig. 2
figure 2

Temporal and spatial variations of DO

The relationship between climatic, hydrological, and water quality parameters was recently studied by other authors. Most of them analyzed influence of air temperature on river water quality. Our results match well with the results of those studies. Analyzing meteorological and river water quality data Ozaki et al. (2003) found that an increase in A-TEMP resulted in rises in BOD and SS and a drop in DO. From the analyzes of the four seasons, the gradients differed significantly over the seasons for W-TEMP, BOD, DO, DO saturation ratio, SS and pH, and enhanced dependencies were observed in summer. In our work, enhanced A-TEMP dependencies of W-TEMP and BOD are observed in spring and DO and OS in autumn. It should be mentioned that Ozaki et al. (2003) used slightly different boundaries for seasons than the ones used here. Studying the lower Mekong River Prathumratana et al. (2008) found positive correlation between A-TEMP and TSS, NO\(_{3}^{-}\), TP and COD and negative correlations with DO and ALK. Similar results were obtained in this paper except for the negative correlation between A-TEMP and NO\(_{3}^{-}\). However, studying climate-water quality relationships in large rivers on a global scale by investigating the climate elasticity of river water quality Jiang et al. (2014) concluded that some parameters can be site specific (in their work NO\(_{2}^{-}\) and ortho-P). They also showed that elasticity is basically consistent with the results of statistical correlation on the direction of response. Correlations obtained here are in the same direction as the elasticity obtained by Jiang et al. (2014).

Multivariate analysis of the entire data set

Prior to PCA, data were normalized to have zero means and unit variances to account for different measurement units and to equalize the impact of all variables on the total variance in the data set. Seven principal components having eigenvalues larger than one were obtained explaining over 75 % of the total variance (with individual components explaining 20.0, 13.4, 13.1, 8.8, 7.7, 6.4, and 6.2 %). To simplify the structure of the obtained solution, varimax rotation of the seven component subspace was performed and the resulting variable loadings are given in Table 3.Simultaneously with simplifying the loadings structure of the components, varimax rotation equalizes their variances making previously more important components less important and vice versa. The percent of total variance explained by first and seventh component changed from 20.0 and 6.2 % to 14.3 and 8.3 %, respectively.

Table 3 Loadings of 18 variables on Varimax rotated principal components

The first rotated component (VF1) is responsible for 14.3 % of the total variance and has strong (Liu et al.2003) negative loadings for air and water temperatures and moderate positive loadings for DO and NO\(_{3}^{-}\) suggesting a physico-chemical source of variability. This varifactor can be explained that temperature increase leads to decrease the amount of DO and acceleration of nitrification in water. It also indicates that the most of the variability in the data is due to the temperature changes. VF2 accounts for 14.2 % of the total variance and is correlated primarily with TSS, COD Mn, and COD Cr. This component represents influences from organic pollutants related mostly to human activities, such as domestic and industrial discharges, and also to decaying plant and animal matter. VF3 accounting for 10.9 % of the total variance is largely contributed by DO and OS and represents physico-chemical source of the variability and influences from natural inputs. The total explained variance of VF4 is 10.8 %. It has a strong loading of EC and moderate of inorganic nutrients (NO\(_{2}^{-}\) and TP). This inorganic nutrient-type component could be interpreted as representing the influences from agricultural chemical application (excessive use of fertilizers), domestic and industrial discharges, and the erosion of natural deposits. VF5 (8.6 % of the variance) that is weighted on water-soluble nitrogen species (TKN and NH\(_{4}^{+}\)) is likely to represent the sewage and manure discharges to water bodies, influences from agricultural runoff and byproducts from the industrial manufacturing processes. VF6 (8.5 % of the variance) is negatively loaded on ALK and positively on pH, representing physico-chemical source of the variability (mineral component of the water). VF7 explaining 8.3 % of the total variance is correlated with oils. This component could be interpreted as representing influences from municipal and industrial sewage, runoff from roads and municipal areas. Also, ships and motorboats might contribute significantly to water pollution with oils.

These results are consistent with the results of other authors which have used PCA for water quality investigations. In order to characterize the nature of the water quality impairment in the Wen-Rui Tang River watershed (China) and the relationships among the water quality parameters, Mei et al. (2014) selected the ten correlated parameters for FA. FA identified three factors with eigenvalues >0.96 (1) summing to 70.9 % of the total variance in the water quality dataset. Results revealed that parameters related to organic pollutants (VF1), nutrients (VF2), and salt concentration (VF3) were the most important parameters contributing to water quality variation. Pinto and Maheshwari (2011) employ FA to considerably reduce the number of variables obtained in a routine monitoring program in the Hawkesbury-Nepean River system in New South Wales, Australia and identify the latent factors relative to river health in peri-urban landscapes. Out of 40 water quality variables measured on monthly basis during 2008 and 2009 included in the analysis, the FA identified nine key variables, under three varifactors (VFs), explaining 50 % of the variance in the river water quality. Anaerobic fermentation, microbial pollution, and eutrophication are three key environmental problems faced by peri-urban rivers. In an attempt to differentiate between sources of variation in the water quality of the Ebro River (Spain), Bouza-Deano et al. (2008) carried out exploratory analysis of data by CA and PCA. In their study, PCA has allowed the identification of the following factors: geologic (VF1 and VF2), climatic (VF3), and anthropogenic (VF4). Simeonov et al. (2003) employed PCA for the interpretation of a large and complex data matrix (21 parameters determined at 25 sampling sites for a period of 36 months) obtained during a monitoring program of surface waters in Northern Greece. Six principal components were obtained. VF1, VF2, and VF3 represent organic, nutrient, and physico-chemical source of the variability and are similar to components obtained in our study. Similar results were obtained by other authors which were studying surface water quality using multivariate techniques (Ouyang et al. 2006; Fan et al. 2010; Razmkhah et al. 2010; Nasir et al. 2011).

In the Fig. 3, the VF1/VF2 scores plot is given. Individual scores (a) as well as sampling stations (b) and seasonal (c) medians are shown. The variances of the VFs are denoted with the white ellipse in the panel (a) whose semimajor axes are equal to standard deviations of VFs. It can be noted that all of the points are located roughly within 3 σ in the VF1 direction while this is not the case in the VF2 direction. This reflects the structure of those components: VF1 is composed from variables whose distributions are close to normal distribution, while VF2 contains variables whose distributions differ significantly from the normal one. More negative values of VF1 indicate higher temperature, while more negative values of VF2 represent higher organic pollution. It can be readily seen that S4 has highest organic pollution. As already mentioned, station S4 is located downstream from Zagreb, the largest municipal center in the watershed. The temperatures and the organic pollution are higher in summer and spring than in autumn and winter, as expected. It can be noted on the figure that upper stations (S1–S4) have higher values of VF1 than the downstream stations indicating lower temperatures at those stations. This can be explained by the fact that samples from the stations near Zagreb have been collected in the morning while the stations downstream have been sampled mostly in the afternoon when the air temperature is higher. Stations S1–S4 have higher NO\(_{3}^{-}\) values and stations S1, S2, and S3 additionally higher DO median what also leads to increased VF1 median for these stations.

Fig. 3
figure 3

VF1 versus VF2 scores for the Sava River: (a) individual scores, (b) sampling stations, and (c) seasonal medians

On the banks of the Sava River and its tributaries are located cities with developed industries such as Zagreb - capital city of Croatia with significant population and diverse industry, Krško - nuclear power plant, Sisak - river port, refinery, food factory, factory of alcoholic beverages, Kutina - petrochemistry, Slavonski Brod - food processing, metal processing, Bosanski Brod - oil refinery, županja - sugar factory, wood industry, food industry, etc. Also, agriculture and animal breeding followed by food industry are well developed especially in the middle and lower part of the Sava watercourse. The main sources of organic and nutrient pollution come from untreated wastewaters discharged from municipalities and industries along the Sava River as well as the tributaries. For the year 2007, in Croatia, 104 agglomerations ≥2000 PE in the Sava River Basin were present out of which 89 agglomerations emitted wastewaters into the environment without any treatment.

Examination of spatial changes of rotated components reveals that all VF, except VF1, have large extreme values at station S4 which in most cases approach median through station S5 and S6. Therefore, we can conclude that the only notable (significantly above the error) source of pollution is Main Drainage Channel of Zagreb and its untreated wastewaters. All other changes of VF are of the order or below the error level so the impact of corresponding sources of pollution cannot be undoubtedly identified in our results.

To further analyze spatial variations of water quality and the similarity of the sampling sites, data were subjected to CA. The dendrogram obtained by Ward’s method is shown in Fig. 4. The resulted dendrogram grouped all 18 sampling sites into four statistically significant clusters at (D link/D max) ⋅100<70%. As in the work Wang et al. (2012), Li et al. (2011), Bouza-Deano et al. (2008), and Shrestha and Kazama (2007), the clustering procedure highlighted groups in which the sites have similar characteristics and natural source types. The cluster I (stations S10–S18) is situated at the most downstream site of the river and corresponds to relatively low polluted regions. On the other hand, cluster II is situated at the most upstream site of the river; it covers an area upstream of Zagreb (S1 and S2) and Zagreb (S3) and corresponds to moderately polluted region. Cluster III, i.e., station S4 where the impact of discharged Zagreb wastewaters were significant, corresponds to the lowest water quality site. Cluster III is associated with cluster IV (sites S5–S9) suggesting the impact of pollution on downstream stations. The cluster IV corresponds to moderately polluted sites and represents stations which are influenced by upstream pollution and also by the pollution of the city of Sisak.

Fig. 4
figure 4

Dendrogram obtained by cluster analysis

The results obtained in this section can be used to reduce the number of variables and/or stations in order to reduce the number of analyzes and the costs. For rapid quality assessment studies, number of the sampling sites could be reduced and only representative sites from each cluster identified by CA could be used. The same was already suggested by, e.g., Simeonov et al. (2003) and Bouza-Deano et al. (2008).

The PCA combines variables of ’similar’ patterns into groups indicating that those variables are dependent on each other or may have the same background. For each group of variables, the key or principal variable can be chosen as an representative. In this way, the number of variables could be reduced. In our case, we can identify pairs: COD Mn and COD Cr, TO and MO, and DO and OS. One member could be used as representative for each pair, so we could try to omit COD Cr, TO, and OS. However, one has to be very careful when applying this procedure because the variables with different causes may be combined in a single group as pointed out by Weilguni and Humpesch (1999).

Multivariate analysis of the seasonal subsets

To further examine the influence of seasonal changes of water quality parameters, the above described analysis was repeated for each season separately. Similar analysis was performed by Wang et al. (2012), Li et al. (2011), Ouyang et al. (2006), and Razmkhah et al. (2010). Ouyang et al. (2006) who investigated the seasonal variations in water quality of the lower St. Johns River in Florida, USA using the PCA and PFA techniques found that water quality parameter that is most important in contributing to water quality variation for one season may not be important for another.

The data set was divided into four subsets (spring, summer, autumn, and winter) which were again normalized to have zero mean and unit variance. PCA was performed and, most probably due to smaller data set size, only six PCs had eigenvalues greater than one explaining 73.6 % (spring), 71.4 % (summer), 70.3 % (autumn), and 71.3 % (winter) of the total variance.

The results for individual seasons are similar to those obtained for the whole data set. The most important parameters of varimax rotated solutions in all seasons are TSS and CODs (see Table 4) in contrast to the temperatures which were the most important parameters for the whole data set. The decrease of contribution of physical factor to the total variance in season subsets is expected as the temperature changes are significantly smaller within seasons (during spring or summer) than between the seasons (summer and winter). On the other hand, the importance of other parameters (TKN, NH\(_{4}^{+}\), and TP) has increased.

Table 4 Rotated factor correlations with water quality parameters for whole dataset, univariate and multivariate weighted data and for seasonal subsets

Furthermore, the results for seasonal subsets are qualitatively similar with the results recently obtained by Wang et al. (2012) who used PCA to explore the most important factors determining the spatiotemporal dynamics of water quality in Xiangxi River. They analysed a 5-year (2002–2006) continual monitoring data (14 parameters at 12 sites). PCA of the four data sets yielded six PCs for spring and autumn and five PCs for summer and winter with eigenvalues >1, explaining 74.70, 74.47, 71.02, and 69.86% of the total variance in respective water quality data sets. The VFs obtained from the PCs suggested that the parameters responsible for water quality variations are mainly related to the dilution of salt (natural), the point source pollution of phosphorus and the non-point pollution of nitrogen (anthropogenic).

To investigate water quality in the rivers along the water conveyance canal of the Middle Route of China’s interbasin South to North Water Transfer Project and to assess the spatial and temporal patterns of water quality in the rivers Li et al. (2011) used multivariate statistical analyses. PCA extracted four, five, five, and three principal components (PCs) for the measurement data in September, December, April, and June, respectively, and these PCs with eigenvalue >1 explained 79, 83, 88, and 83% of the total variance in the respective data sets. Seasonal FA/PCA allowed four categories of parameters such as mineral composition (primarily natural), toxic metals (industrial), nutrients (agricultural, domestic, and industrial), and organic pollutants (domestic, municipal, and industrial sources).

Robust PCA

To investigate the influence of the skewed, heavy tailed data distributions containing outliers on the results, the robust PCA algorithm was applied. This was performed by calculating the robust estimates of covariance matrix which were then subjected to PCA. Since covariance and correlation differ by the factor 1/(σ x σ y ), and the data set was normalized to have unit variances (and therefore standard deviations), the calculated covariances can be compared to the correlations obtained earlier and examined for the effects of weighting. The largest differences are in correlations between TSS and both CODs which decreased for about 0.2 in all cases. The decrease is more pronounced in the case of univariate weighting with maximum near 0.3 for the correlation between TSS and COD Mn. Similar but somewhat smaller change of correlation (about 0.1 decrease) is observed for correlation between COD Mn and COD Cr. The correlation between DO and temperatures and correlation between NH\(_{4}^{+}\) and TKN increased for about 0.1 with the increase more pronounced in the multivariate case for the former and in the univariate case for the later correlation. Other significant correlations remain unchanged.

The PCA yielded seven principal components explaining 75 % of the total variance, with individual components explaining 21.6, 14.1, 11.6, 8.3, 7.5, 6.8, and 5.7 % (univariate weighting) and 20.4, 14.6, 11.5, 8.1, 7.5, 6.7, and 6.2% (multivariate weighting). These results are the same as the result obtained by standard PCA algorithm. Again, the varimax rotation of seven component subspace was performed. Correlations between obtained rotated varifactors and water quality parameters are given in Table 4. Figure 5 shows the PC1 versus PC2 variable loading vectors (prior to varimax rotation) for data without weights (a), univariate (b), and multivariate (c) weighted data and for the spring data subset (d). The results of standard and robust PCA are fairly similar. Actually, the differences between the standard and robust PCA algorithm (univariate or multivariate weighting) are smaller than the differences between any of them and spring subset.

Fig. 5
figure 5

PC1 versus PC2 loadings for the Sava River (see Table 1 for acronym identification); a) data without weights, b) univariate and c) multivariate weighted data, d) spring data subset

Conclusion

In this study, surface water quality data for 18 parameters collected from 18 monitoring stations along the Sava River in Croatia from 2000 to 2006 were analyzed using multivariate statistical techniques.

Most of the variables were found to have skewed heavy tailed distributions containing outliers. It should be emphasized that the standard statistical methods assume normal distribution and should be used with caution when analysing the data whose distributions are significantly departing from normal distribution.

When analysing the correlations between variables, correlations between air temperature and some water quality parameters were found in agreement with the previous studies of relationship between climatic and hydrological parameters. Further significant correlations between TSS and CODs were found but they are diminishing when the weighting procedure is applied.

PCA has determined a reduced number of seven principal components that explain over 75 % of the data set variance. Varimax rotation of the seven components subspace resulted in simpler structure of rotated components each of them related to small group of measured water quality parameters.

Cluster analysis has found similarities in sampling stations across the river. All 18 sampling sites have been grouped into four statistically signi?cant clusters. The first group is located at the end and the second group at the top of the river. These two groups correspond to relatively low and moderately polluted sites. The third and fourth group which correspond to highly and moderately polluted sites are located in the middle of the river.

The PCA analysis of the seasonal subsets showed that the importance of the parameters is changing from season to season and that the parameters which are contributing most to the water quality variation in one season could be contributing less (or not at all) in another season. However, no significantly different results were found for any season when compared with the results for the whole data set. Changes of order or sign in some components like the increased importance of organic component and the decrease of the physical component in all seasons does not represent significantly different result. Therefore, these results give rise to the reliability of the obtained results.

Those results are similar to those obtained by other authors who applied statistical analysis techniques such as Pearson’s correlation, PCA, and CA for the analysis of the data obtained in the water quality monitoring programs.

The temperatures and CODs are found to be the parameters which are responsible for most of the variance in the data set indicating that the physical- and organic-related sources contribute mostly to the water quality variations.

Finally, to check the influence of the outliers in the data set whose distribution strongly deviates from the normal one, two robust estimates of covariance matrix were calculated and subjected to PCA. Again, no significant differences between the obtained results were found. Hence, in the case of the data set with small number of missing data, non-detect values and outliers (less than 4 %), the usage of standard PCA algorithm is justified.