Introduction

Human activity has negatively affected water quality and aquatic ecosystems, particularly in urban areas. Rivers passing through cities receive many contaminants from the release of domestic sewage and from agricultural activity. This has imposed great pressure on ecosystems, resulting in a reduction of water quality and biodiversity (Wang et al. 2013). Surface water suffers from a variety of practices that introduce high nutrient loads, hazardous chemicals, and disease-causing pathogens (Singh et al. 2005; Sayadi and Sayyed 2011). The quality of raw water in rivers must be preserved to ensure its safety, because deterioration of its quality reduces its usability (Fulazzaky 2005; Sayadi et al. 2010). Advanced water quality or ecologically based standards that integrate physical, chemical, and biological criteria have the potential to enable better understanding, management, protection, and restoration of water bodies (Magner and Brooks 2008). Characterization and interpretation of physicochemical river water data require the handling of large datasets. The complexity arises mainly from the interpretation of many measured variables, which vary widely as a result of a variety of natural and anthropogenic factors (Simeonov et al. 2002; Sayyed and Sayadi 2011). Multivariate statistical techniques, for example cluster analysis (CA), factor analysis (FA), and principal-components analysis (PCA), together with box and whisker plots, have been used to evaluate water quality (Singh et al. 2004, 2005; Sayadi et al. 2008; Wang et al. 2013; Shrestha and Kazama 2007). In this paper we report the usefulness of multivariate statistical techniques for evaluation and interpretation of large, complex water-quality datasets and for identification of sources of pollution, with the intention of acquiring better information about water quality and designing a monitoring network for effective management of water resources.

The principal objectives of the study were:

  1. assessment of surface water quality changes over a period of 40 years;

  2. identification of sources of contamination;

  3. evaluation of the effect of such human activities as urban development, agriculture, and industry; and

  4. determination of the effect of rainfall in the catchment area.

Materials and methods

Study area

The length of the River Gharasoo is approximately 20.7 km. The river runs through the city of Kermanshah, which is situated in the diplomatic area in Iran. The study area lies between latitudes 46° 36′ and 47° 37′ N and longitudes 34° 00′ and 34° 91′ E, at a height of 1,322 m above sea level (Fig. 1).

Fig. 1 Location of sample sites on the Gharasoo River

Datasets

Datasets from two sampling stations which included results for 10 water-quality characteristics monitored monthly over a period of 40 years were obtained from the city's Hydraulic Works. The monitoring stations are shown in Fig. 1. All samples were taken monthly for each year from 1970 to 2009. The samples were stored in pre-cleaned, distilled water-rinsed plastic bottles. pH and EC were noted at the sampling site by use of portable meters. The other characteristics of the water samples were analyzed in the laboratory. Chemical analysis was performed in triplicate in accordance with standard recommended methods (Eaton et al. 1994), using double glass-distilled water and analytical-grade chemicals. Total dissolved solids were estimated gravimetrically; bicarbonate and chloride content were estimated by volumetric analysis; calcium, magnesium, sodium, and sulfate were determined by use of photometric methods. Sodium adsorption ratio (SAR) and percent sodium (%Na) were calculated as reported by Richards (1954), Wilcox (1948), and Paliwal (1972):

$$\text{SAR} = \frac{\text{Na}^{+}}{\sqrt{\left(\text{Ca}^{2+} + \text{Mg}^{2+}\right)/2}}$$
$$\%\,\text{Na} = \frac{\left(\text{Na}^{+} + \text{K}^{+}\right) \times 100}{\text{Ca}^{2+} + \text{Mg}^{2+} + \text{Na}^{+} + \text{K}^{+}}$$
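As an illustration of the two indices, the following minimal sketch evaluates SAR and %Na for a single sample. It assumes that all ion concentrations are expressed in meq/l, which is the usual convention for these indices but is not stated explicitly here; the function names and the sample values are hypothetical and are not taken from the monitoring data.

```python
import math


def sodium_adsorption_ratio(na, ca, mg):
    """SAR = Na / sqrt((Ca + Mg) / 2); inputs assumed to be in meq/l."""
    return na / math.sqrt((ca + mg) / 2.0)


def percent_sodium(na, k, ca, mg):
    """%Na = 100 * (Na + K) / (Ca + Mg + Na + K); inputs assumed to be in meq/l."""
    return 100.0 * (na + k) / (ca + mg + na + k)


# Hypothetical sample (not measured data): Na = 1, K = 0, Ca = 3, Mg = 1 meq/l
sar = sodium_adsorption_ratio(na=1.0, ca=3.0, mg=1.0)
pct_na = percent_sodium(na=1.0, k=0.0, ca=3.0, mg=1.0)
print(round(sar, 4), round(pct_na, 1))
```

For this hypothetical sample the sum of Ca and Mg is 4, so SAR is 1/√2, and sodium plus potassium makes up one fifth of the cation total.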

All statistical computations (Table 1) were performed by use of SPSS 16 and Excel 2010.

Table 1 Mean Gharasoo River water-quality characteristics

Principal-components analysis (PCA)

PCA is a mathematical procedure in which the original variables are orthogonally transformed into new variables called principal components, which are linear combinations of the original variables. The number of principal components is less than or equal to the number of original variables. PCA defines a new orthogonal coordinate system that optimally describes the variance in a single dataset. Use of PCA enables the number of variables in a multivariate dataset to be reduced while retaining as much as possible of the variation present in the data (Huang et al. 2010; Helena et al. 2000; Singh et al. 2005). The principal components can be expressed as:

$$z_{ij} = a_{i1} x_{1j} + a_{i2} x_{2j} + \ldots + a_{im} x_{mj}$$
(1)

where z is the component score, a is the component loading, x is the measured value of the variable, i is the component number, j is the sample number, and m is the total number of variables.
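The linear combination in Eq. 1 can be sketched in a few lines of code. The example below is only an illustration of how a component score is formed from loadings and measured values; the loadings and sample values are hypothetical, chosen by hand rather than obtained from an eigen-decomposition of the river data, and the function name is an assumption.

```python
def component_scores(loadings, samples):
    """Return z[i][j] = sum_k a[i][k] * x[k][j] as in Eq. 1.

    loadings[i][k] is the loading of variable k on component i;
    samples[j][k] is the (standardized) value of variable k in sample j.
    """
    return [
        [sum(a * x for a, x in zip(component, sample)) for sample in samples]
        for component in loadings
    ]


# Hypothetical loadings for two components and three variables
loadings = [
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
# Hypothetical standardized measurements for two samples
samples = [
    [1.0, 3.0, 2.0],
    [2.0, 0.0, -1.0],
]

scores = component_scores(loadings, samples)
for i, row in enumerate(scores, start=1):
    print("component", i, [round(z, 3) for z in row])
```

In practice the loadings would come from the statistical package used for the PCA; the sketch only shows the arithmetic that turns them into scores.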

Factor analysis (FA)

FA is conducted after PCA and is used:

  1. to reduce the number of variables; and

  2. to detect structure in the relationships between variables, i.e. to classify variables.

FA is used as a method of data reduction or structure detection. This can be achieved by rotating the axes defined by PCA according to well-established rules, and constructing new variables, also called factor variables. A small number of factors will usually account for approximately the same amount of information as the much larger set of original observations (Varol and Sen 2009; Shrestha and Kazama 2007). FA can be expressed as:

$$z_{ji} = a_{f1} f_{1i} + a_{f2} f_{2i} + a_{f3} f_{3i} + \ldots + a_{fm} f_{mi} + e_{fi}$$
(2)

where z is the measured value of a variable, a is the factor loading, f is the factor score, e is the residual term accounting for errors or other sources of variation, and m is the total number of factors.

Cluster analysis (CA)

CA is a multivariate procedure for detecting natural groupings of data. It is based on placing objects into more or less homogeneous groups in a manner such that the relationship between the groups is revealed. CA requires the user to make decisions about the calculation of clusters, and these decisions have a substantial effect on the results of the classification. CA was applied to surface water-quality data by use of the single-linkage method. In the single-linkage method, the distance between two clusters A and B is defined as the minimum distance between a point in A and a point in B:

$$D\left(A, B\right) = \min \left\{ d\left(x_{i}, x_{j}\right),\quad \text{for } x_{i} \text{ in } A \text{ and } x_{j} \text{ in } B \right\}$$
(3)

where d(x_i, x_j) is the Euclidean distance. At each step the distance is found for every pair of clusters and the two clusters with the smallest distance are merged. If more than two clusters remain, the procedure is repeated for the next step: the distances between all pairs of clusters are calculated again, and the pair with the minimum distance is merged into a single cluster. The result of this hierarchical clustering procedure can be displayed graphically by use of a tree diagram, also known as a dendrogram, which shows all the steps in the hierarchical procedure (Juahir et al. 2011; Alkarkhi et al. 2008; Johnson and Wichern 2002).
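The merging procedure described above can be sketched as follows. This is a minimal, unoptimized illustration of single-linkage agglomeration with Euclidean distance, not the routine of the statistical package used in the study; the function names and the four two-dimensional points are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_linkage_distance(cluster_a, cluster_b):
    """Eq. 3: minimum distance between a point in A and a point in B."""
    return min(euclidean(p, q) for p in cluster_a for q in cluster_b)


def single_linkage(points):
    """Merge the closest pair of clusters until one cluster remains.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    which is the information a dendrogram displays.
    """
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history


# Hypothetical observations in two dimensions
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 3.0)]
history = single_linkage(points)
for a, b, d in history:
    print(a, "+", b, "at distance", round(d, 3))
```

The heights at which successive merges occur (the third element of each tuple) are the values plotted on the distance axis of a dendrogram.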

Box plot

A box plot is a convenient means of graphical depiction of groups of numerical data by use of their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points. Box and whisker plots are uniform in their use of the box: the bottom and top of the box are always the first and third quartiles, and the band inside the box is always the second quartile (the median). Any data not included between the whiskers should be plotted as outliers with a dot, small circle, or star, but occasionally this is not done (Vega et al. 1998).
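The quantities drawn in a box plot can be computed as in the sketch below. The whisker rule is an assumption: the text does not say how far the whiskers extend, so the common convention of 1.5 times the interquartile range is used here, and the quartiles are obtained with the inclusive method of the Python standard library. The function name and the data values are hypothetical.

```python
import statistics


def box_plot_summary(values, whisker_factor=1.5):
    """Quartiles, whisker ends, and outliers for one variable.

    Assumes whiskers reach the most extreme observations lying within
    whisker_factor * IQR of the box; points beyond that are outliers.
    """
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker_factor * iqr
    upper_fence = q3 + whisker_factor * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical measurements with one extreme value
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 30])
print(summary)
```

With these hypothetical values the box spans 3 to 7 with the median at 5, the whiskers end at 1 and 8, and the value 30 is flagged as an outlier.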

Results

Climate

Integrated rainfall data were collected from daily synoptic meteorological data published by the Iran Meteorological Organization (IRIMO) for one station within the Gharasoo River basin for the years 1970–2009; these are presented in Fig. 2. Mean annual temperature and rainfall were 14 °C and 456.8 mm, respectively. Rainfall occurs mainly during the autumn and winter (December to March). Year-to-year differences in rainfall at Kermanshah station between 1970 and 2009 are slight and variable. In recent years the Gharasoo River basin received less rainfall than the long-term average. Rainfall seems to have decreased recently; because of the short period during which the data were recorded, however, it is not known whether this is a result of the normal variability of rainfall, drought, or climate change. Fluctuations of the flow of the Gharasoo River follow a specific cycle and are closely related to precipitation (Zhang et al. 2012).

Fig. 2 Annual mean rainfall (mm) in the Gharasoo River basin for the years 1970–2009

Chemical and physical characteristics of the river water

The mean pH of the river water was neutral to slightly alkaline at the two stations for the study period; the range was 7.08–8.24 at station 1 and 7.47–8.71 at station 2. The pH fell largely within the range associated with most natural waters, which is between 6.5 and 8.5 (Sundaray et al. 2006). The mean EC of the river water ranged between 412 and 684.60 mS/cm−1 at station 1 and from 312 to 450 mg/l at station 2; in order of magnitude: downstream > selected site > upstream. This may be because of dilution by the river water, which has much higher conductivity. Mean HCO3 content of the river water ranged between 3.00 and 5.33 mg/l at station 1 and between 2.29 and 4.46 mg/l at station 2. Mean Cl ranged between 0.20 and 0.97 mg/l at station 1 and between 0.13 and 0.90 mg/l at station 2. Mean SO4 2− ranged between 0.20 and 0.94 at station 1 and between 0.10 and 1.06 mg/l at station 2. Mean Ca2+ ranged between 2.32 and 3.61 mg/l at station 1 and between 1.47 and 3.25 mg/l at station 2. Mean Mg2+ ranged between 1.12 and 2.44 mg/l at station 1 and between 0.54 and 2.05 mg/l at station 2. Mean Na+ ranged between 0.37 and 1.42 mg/l at station 1 and between 0.20 and 0.47 mg/l at station 2.

Discussion

PCA was applied to standardized log-transformed data (10 variables) not only to examine differences between stations 1 and 2 but also to identify latent factors underlying the spatial variability, as shown in Table 2. PCA of the datasets furnished three factors each for stations 1 and 2, explaining 90.36 and 79.52 % of the total variance in the respective water quality datasets. Singh et al. (2004) classified factor loadings as “strong”, “moderate”, or “weak” corresponding to absolute loading values of >0.75, 0.75–0.50, and 0.50–0.30, respectively. The corresponding factor variables explained by loading and variance are presented in Table 2.
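The loading classification of Singh et al. (2004) can be written as a simple rule, sketched below. The treatment of values that fall exactly on a boundary (0.75 and 0.50) and the label for loadings below 0.30 are assumptions, because the quoted ranges do not settle them; the function name is hypothetical. The example loadings are values reported for stations 1 and 2 in this section.

```python
def classify_loading(value):
    """Classify a factor loading by its absolute value (Singh et al. 2004).

    Assumption: boundary values 0.75 and 0.50 are assigned to "moderate".
    """
    magnitude = abs(value)
    if magnitude > 0.75:
        return "strong"
    if magnitude >= 0.50:
        return "moderate"
    if magnitude >= 0.30:
        return "weak"
    return "below weak range"


# Loadings quoted in the text (variable, loading)
examples = [("EC", 0.723), ("Cl", 0.773), ("pH", -0.912), ("HCO3", 0.694), ("pH, station 2", 0.110)]
labels = {name: classify_loading(value) for name, value in examples}
for name, label in labels.items():
    print(name, label)
```

Because the rule uses the absolute value, a loading of −0.912 is classified as strong in the same way as a positive loading of that size.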

Table 2 Rotated patterns of factors after varimax rotations

As shown in Table 2, for station 1, among the three factors, the first factor, explaining 52.11 % of the total variance, had strong positive loadings on Cl, SO4 2−, Na+, %Na, and SAR (0.773, 0.781, 0.969, 0.963, and 0.974, respectively) and a moderate positive loading on EC (0.723). This first factor could be explained by point source and non-point source pollution arising as a result of anthropogenic activity, especially from industrial and domestic waste and agricultural runoff. The concentration of Cl is higher in wastewater than in raw water because sodium chloride, a common component of the human diet, passes unchanged through the digestive system (WHO 2008). The high loading of EC and Cl may be attributed to anthropogenic input, i.e. industrial wastewater discharge and domestic sewage, into the Gharasoo River. The increase of Cl concentrations, particularly the peak value, seems to be attributable to industrial and domestic sewage. Cl increases the EC of water and its corrosiveness (WHO 2008). SO4 2− from different sources can have different isotopic profiles (Grasby et al. 1997). Sources of SO4 2− include atmospheric sulfur compounds, soil sulfur compounds, sulfur minerals in rocks, sulfur in hydrocarbon deposits, and sulfur in fertilizers, for example ammonium sulfate, (NH4)2SO4 (Krouse and Grinenko 1991). High loading of SO4 2− in some areas may be related to farmers’ use of sulfate fertilizers, with the river receiving sulfate via surface runoff and irrigation water. Agricultural activities lead to accumulation of fertilizer in the soil, and use of ammonium sulfate fertilizers is high in this region. Fertilizer surface runoff contributes to the abundance of sulfate in the river water (Grasby et al. 1997). It is well reported that agricultural land use substantially affects river sulfate levels.
The second factor (23.37 % of the total variance) had a strong negative loading on pH and moderate positive loadings on HCO3 and Ca2+, which were −0.912, 0.694, and 0.676, respectively. This factor represents the contribution of point and non-point pollution and the physical chemistry of the stream. Point pollution arises from domestic wastewater, non-point pollution from agriculture and livestock farms. Hydrolysis of acidic materials causes a decrease in pH (Vega et al. 1998; Singh et al. 2004). The negative loading of pH may be attributed to anthropogenic input, namely discharges of industrial wastewater and domestic sewage into the Gharasoo River. The third factor (14.88 % of the total variance) had a strong positive loading on Mg2+ (0.985). This factor represents the contribution of point-source pollution of the stream.

At station 2, among the three factors, the first factor, explaining 29.82 % of the total variance, had strong positive loadings on EC, HCO3, and Ca2+ (0.926, 0.941, and 0.883, respectively); the loading on pH was only 0.110. This factor can be interpreted as physicochemical variability. The second factor (29.70 % of the total variance) had strong loadings on Na+, %Na, and SAR, which were 0.942, 0.910, and 0.924, respectively. Factor 3 (19.99 % of the total variance) had moderate loadings on Cl, SO4 2−, and Mg2+, which were 0.655, 0.560, and 0.708, respectively. Increases in the amounts of major chemical species, especially EC, Cl, SO4 2−, and Na+, from upstream to downstream are related to releases from agricultural, industrial, and domestic areas into the river network (Chen et al. 2002). Therefore, it may be concluded that point-source pollution was stronger than non-point-source pollution in the study area. Moreover, the water quality upstream was better than that downstream.

Correlation between environmental variables

Pearson correlation coefficients (r) were calculated for the water-quality data. The results are listed in Table 3. At station 1, EC was negatively correlated with pH (r = −0.543) and positively correlated with HCO3 (0.873), Cl (0.823), SO4 2− (0.456), Ca2+ (0.802), Mg2+ (0.483), Na+ (0.770), %Na (0.582), and SAR (0.711). Ion chemistry can be affected by human activity (Meybeck and Helmer 1989). The wide range of EC values, reflecting dissolved solutes, is related to lithology, land use, and human activity in the basin (Li and Zhang 2008; Li et al. 2008). EC depends on discharge into the river, via surface runoff, of domestic waste and fertilizer used in agricultural activity. pH was negatively correlated with HCO3 (r = −0.62), Cl (r = −0.51), and Ca2+ (r = −0.70). HCO3 was positively correlated with Cl (r = 0.62), Ca2+ (r = 0.79), Mg2+ (r = 0.52), Na+ (r = 0.59), and SAR (r = 0.51). Cl was positively correlated with SO4 2− (0.43), Ca2+ (0.63), Na+ (0.81), %Na (0.68), and SAR (0.78). The presence of Cl is a result of discharge into the river, via surface runoff, of domestic sewage and industrial waste. A high positive correlation between Cl and Na+ concentrations has been reported (Giridharan et al. 2009). Similarly, SO4 2− was positively correlated with Na+ (0.62), %Na (0.68), and SAR (0.62). A high positive correlation between sulfate and sodium concentration has been reported (Li and Zhang 2009). Ca2+ was positively correlated with Na+ (0.72), %Na (0.53), and SAR (0.66), and Na+ was positively correlated with %Na (0.95) and SAR (0.99). A high positive correlation between Na+ concentration and SAR has been reported (Rasouli et al. 2012). At station 1, rain was significantly negatively correlated with HCO3 only, at a significance level of 0.05.
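For reference, the Pearson coefficient used throughout this section can be computed as in the following minimal sketch. It shows only the coefficient itself, not the significance test, and the two short series are hypothetical values rather than the monitoring data; the function name is an assumption.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired observations of two variables
series_a = [1.0, 2.0, 3.0, 4.0]
series_b = [1.0, 3.0, 2.0, 4.0]
r = pearson_r(series_a, series_b)
print(round(r, 2))
```

Values of r always lie between −1 and 1; a value close to 1, such as that reported between Na+ and SAR, indicates an almost perfectly linear positive relationship.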

Table 3 Pearson correlation coefficients for water quality data and rain

At station 2, EC was significantly positively correlated with HCO3 (0.87) and Ca2+ (0.77). pH was significantly negatively correlated with Cl (−0.40). HCO3 was significantly positively correlated with Ca2+ (0.76) and Mg2+ (0.41) and significantly negatively correlated with Na+ (−0.47). Na+ was significantly positively correlated with %Na (0.82) and SAR (0.84). Notably, there were no significant correlations between the water-quality data and rain at station 2.

Box and whisker plots

Box and whisker plots of the water quality data over the 40-year period (1970–2009) are shown in Fig. 3. The trend for pH suggested that the average pH at station 1 was lower than at station 2. Except for pH, the normalized data revealed higher pollution at station 1. In statistics, an outlier is an observation that is numerically distant from the rest of the data (Grubbs 1969). Among large datasets, some data points will be farther from the sample mean than is deemed reasonable. This can be because of incidental systematic errors or flaws in the theory that generated an assumed family of probability distributions, or it may be that some observations are far from the center of the data (Vega et al. 1998; McGill et al. 1978). As Fig. 3 shows, Mg2+ and Cl deviated most from normal distribution and included outliers and extremes.

Fig. 3 Box and whisker plots of the water parameters for Gharasoo River over the 40-year period

Hierarchical cluster analysis (HCA)

Spatial similarity and grouping of monitoring data at stations 1 and 2 are shown in Fig. 4. In this study, monitoring data at stations 1 and 2 were classified by use of HCA, and a dendrogram was produced. The clustering procedure generated two clear groups from the data for each of stations 1 and 2. At station 1, cluster 1 (Cl, SAR, SO4 2−, Na+, Mg2+, HCO3, Ca2+, and pH) and cluster 2 (EC and rainfall) correspond to relatively low pollution in the high-pollution region. Hence, the temporal variation of water quality at station 1 was greatly shaped by industrial and municipal activities, agriculture, and climate change (Sayadi et al. 2014; Sundaray et al. 2006), which confirms the outcome of the PCA. At station 2, cluster 1 (Na+, SAR, Cl, SO4 2−, Mg2+, HCO3, Ca2+, pH, and %Na) and cluster 2 (EC and rainfall) correspond to relatively low pollution in the high-pollution region. Hence, the temporal variation of water quality at station 2 was greatly shaped by agricultural activities and lithogenic sources, which confirms the outcome of the PCA (Sayadi et al. 2014; Sundaray et al. 2006).

Fig. 4 Hierarchical cluster analysis of Gharasoo River water parameters and rain at the stations

Conclusion

Data from 1970 to 2009 revealed distinctly different water-quality characteristics at the two stations on the River Gharasoo. PCA of the two datasets revealed three factors each for stations 1 and 2; these showed that water-quality indicators vary substantially from upstream to downstream. The increase of solute concentrations from the upper basin downward is a result of anthropogenic input. HCA revealed that only EC was strongly affected by rain; Pearson correlation coefficients revealed that only HCO3 was significantly negatively correlated with rainfall. Increases of EC, Cl, SO4 2−, and Na+, especially, from upstream to downstream are related to release of agricultural, industrial, and domestic waste into the river network. River water quality is better upstream of the city than downstream. To improve the quality of water of the Gharasoo River basin it is essential to treat industrial wastewater and municipal sewage and to adopt the best management practices to control diverse pollutants from agricultural land and urban surface runoff.