Introduction

Surface water hydrochemistry depends on several natural factors like changes in precipitation, erosion and weathering of crustal materials as well as anthropogenic factors such as industrial and agricultural activities and increased water consumption due to rapid urbanization. Furthermore, urbanization has severely affected the water resources by increasing the pressure on urban hydrology (Mishra 2011). These factors have degraded the quality of surface water for drinking purposes and are a threat to the existing water bodies (Singh et al. 2005a, b, c; Mishra 2010).

The discharge of waste water from municipal, industrial, and agricultural areas is an issue of serious concern as it affects river ecology (Varol et al. 2013). Surface runoff from agricultural and urban areas varies mostly with seasonal changes in precipitation, ground water flow, water interception, and abstraction. These factors strongly affect the concentration of pollutants in river water, which in turn deteriorates river water quality (Vega et al. 1998).

The quality of river water at a point is a function of several factors, including the lithology of the basin, atmospheric and anthropogenic inputs, and climatic conditions (Yu et al. 2010). Developmental pressures and an increasing human population have made the river of the study area vulnerable to sewage flow, solid waste dumping, etc., and consequently exert pressure on the percolation and infiltration processes responsible for groundwater recharge (Kumar et al. 2012). These pressures therefore affect ground water conditions, which makes it imperative to analyse the hydrochemical properties of both the stream and the aquifers in order to understand the interactions between them.

The interaction between ground water and surface water (GW-SW) has been studied from different perspectives. For example, Sophocleous (2002) studied this interaction with a focus on geomorphological, hydrogeological, and climatic controls; Valett and Sheibley (2009) studied geomorphic and biological structure and how they influence the direction of GW-SW interaction; Kikuchi et al. (2012) performed a spatial telescopic study for improved characterization of GW-SW interaction; Lu et al. (2011) analysed interactions between atmosphere, surface, and ground water using the EARTH model in the unsaturated zone; Anibas et al. (2011) adopted a simple thermal mapping method for studying seasonal spatial patterns of GW-SW interactions; Harvey et al. (2006) modeled decadal time scale GW-SW interactions; and Wade et al. (2006) studied nitrate dynamics in these systems.

Multivariate statistical analysis techniques such as principal component analysis (PCA) and cluster analysis (CA) have been applied to characterize and evaluate surface freshwater quality because they are useful in verifying the temporal and spatial variations caused by natural and anthropogenic factors linked to seasonality (Helena et al. 2000; Shrestha and Kazama 2007; Singh et al. 2005a, b, c). These techniques effectively compress the data and reveal correlations between the parameters analysed. Earlier studies have examined GW-SW interactions by evaluating hydrochemical properties through such methods (e.g., Woocay and Walton 2008; Mencio and Mas-Pla 2008). Most previous studies, however, have focused on surface water quality (e.g., Sundaray et al. 2006; Sojka et al. 2008; Selle et al. 2013) and hydrochemistry. For example, Singh et al. (2014) used an ensemble learning approach to evaluate the influence of seasonal variations and anthropogenic activities on alluvial groundwater hydrochemistry, and Isa et al. (2012) evaluated the extent and severity of groundwater contamination based on the hydrochemistry mechanism of sandy tropical coastal aquifers. Various other studies have also focused on aquifer dynamics and hydrochemistry (Stigter et al. 2006; Soulsby et al. 2002). Multivariate statistical analysis techniques make it possible to draw an appropriate profile of the stream hydrology when insufficient data are available or when the end members are imprecise (Mencio and Mas-Pla 2008).

In the present study, the river Varuna, a minor tributary of the river Ganges, was monitored. This river receives huge quantities of untreated sewage and agricultural runoff carrying pesticides, fertilizers, etc. from its catchment areas, which leads to degradation of water quality. Singh and Dwivedi (2007) and Kumar et al. (2012) reported that the river Varuna receives sewage from twenty-two municipal drains located on both sides of its 15-km-long passage through Varanasi city. There is therefore an urgent need to identify the pollution sources and estimate their quantitative contributions in order to achieve effective pollution control. The objective of the present study is to establish correlations between various sites on the basis of CA and to identify the major hydrochemical factors based on PCA. These results will further help in investigating the human pressures contributing to degradation of water quality and will also aid in drawing conclusions about GW-SW interactions. Periodic monitoring and assessment of water quality thus helps to develop management strategies to control surface water pollution in spite of increasing urbanization and anthropogenic pressure on surface waters.

Multivariate statistical analysis was employed to examine GW-SW interactions and the dominant hydrochemical factors responsible for altering the quality of the river water in Varanasi. This work is structured as follows: (1) sampling and analysis of physiochemical parameters at different sites selected throughout the city; (2) PCA of the collected data to identify the extent of the ground water or surface water contribution to pollution at each site; and (3) CA to show the correlations between sites.

Materials and methods

Hydrology and geology of Varuna River

The Varuna River basin is situated in the Indo-Gangetic plain (Fig. 1), which is underlain by Quaternary alluvial sediments of Pleistocene to Recent age. The basin of this river consists of three main geological formations: newer alluvium and older alluvium, ranging in age from upper Pleistocene to Recent and from middle to upper Pleistocene, respectively (Raju et al. 2009). The newer alluvium is composed mainly of unconsolidated sand, silt, and clay, while the older alluvium consists mainly of consolidated clay with kankar, fine to medium grain, and some gravel. In the study area, the unconsolidated sediments form a sequence of clays and sands of various grades. The presence of clay and sand zones controls the availability of ground water in alluvial zones. The sand beds, with or without kankar, form the main aquifer zones of the multi-tier aquifer system in the study area (Shukla and Raju 2008). Near-surface groundwater occurs under water table conditions, while deeper aquifers occur in semi-confined to confined conditions. Ground water is extracted through both dug wells and deep bore wells. Rainfall is the main source of recharge to groundwater. The average hydraulic gradient is 0.35 m/km, which indicates the porous nature of the near-surface formations of the area (Pandey 1993).

Fig. 1 Geographical locations and photographs of water quality monitoring sites

Sampling and qualitative analysis

Seven sampling sites were randomly selected across the width of the river for the assessment of water quality. Each sampling site is in close proximity to a ground water pumping station to make the analysis more credible. Samples were collected during winter (2010–12) over a period of 3 months to monitor changes caused by anthropogenic as well as natural sources. Sampling, preservation, and transportation of the water samples to the laboratory were done as per the standard methods (APHA 1998). Samples were qualitatively analysed for eight different physiochemical parameters, which are listed in Table 1. Triplicate samples were collected at each selected site at intervals of 1 month for three consecutive months. The pH, temperature, and electrical conductivity (EC) were determined immediately at the collection site by a portable pH meter to minimize errors arising over time from biological and chemical reactions between the atmosphere and the sample (Hutton 1983). The mean and standard deviation of each water quality parameter for ground water (G) and surface water (S) at the different sites on the Varuna River during the winter season are given in Table 2.
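The summary statistics reported in Table 2 can be illustrated with a minimal sketch. The function name `summarize`, the site label, and the readings below are hypothetical and are not the study's data; the use of the sample standard deviation (n - 1 in the denominator) is also an assumption, since the text does not say which form was used.

```python
from statistics import mean, stdev


def summarize(readings):
    """Return {(site, water_type, parameter): (mean, sample SD)}.

    `readings` maps each key to the list of repeated measurements
    taken for that site, water type (G or S), and parameter.
    """
    return {key: (mean(values), stdev(values)) for key, values in readings.items()}


# Made-up readings for one hypothetical site (not data from the study).
demo_readings = {
    ("SiteA", "S", "pH"): [7.0, 8.0, 9.0],
    ("SiteA", "G", "pH"): [7.2, 7.4, 7.6],
}
summary = summarize(demo_readings)
```

Each entry of `summary` then holds the pair that would populate one cell of a mean ± SD table.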

Table 1 The associated water quality parameters, with abbreviation and their units used in study
Table 2 Mean and standard deviation of water quality parameter studied both on ground (G) and surface (S) at different sites on Varuna River during winter season

Statistical analysis

PCA extracts meaningful information from raw data by exploiting the correlations among the monitored parameters, which aids interpretation (e.g., Chaturvedi and Raghubanshi 2015). Hierarchical CA groups sites into clusters based on the similarity of the studied parameters.

The statistical analysis was done using the IBM SPSS Statistics software package. PCA extracts the eigenvalues and eigenvectors from the covariance matrix of the original variables to find associations between the variables and thereby reduce the dimensionality of the data set. The principal components (PCs) are uncorrelated (orthogonal) variables obtained by multiplying the original correlated variables by the eigenvectors (loadings or weightings). The eigenvalues of the PCs measure their associated variance, the participation of the original variables in the PCs is given by the loadings, and the individual transformed observations are called scores (Singh et al. 2005a, b, c; Arslan 2013). The PCs are thus weighted linear combinations of the original variables whose axes lie along the directions of maximum variance; by capturing maximum variance they extract meaningful information and compress the data (Vega et al. 1998; Helena et al. 2000). PCA therefore provides an objective way of finding indices of this type, such that the variation in the data is accounted for as concisely as possible. Interpretation can be simplified further by rotating the axes defined by PCA according to well-established rules and constructing new variables called varifactors (VFs). A PC is a linear combination of observable water quality variables, whereas a VF can include unobservable, hypothetical, and latent variables (Vega et al. 1998; Helena et al. 2000). PCA of the normalized variables was performed to extract significant PCs, and, to further reduce the contribution of variables with minor significance, these PCs were subjected to varimax rotation (raw), generating VFs. Consequently, a small number of factors will usually account for approximately the same amount of information as the much larger set of original observations.
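The PCA steps described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the SPSS procedure used in the study: the function names `pca_kaiser` and `varimax` are hypothetical, the input is synthetic random data (42 rows by 8 columns, chosen only to mirror the number of observations and parameters mentioned in the text), the variables are z-scored so that their covariance matrix equals the correlation matrix, and loadings are taken as eigenvectors scaled by the square root of the eigenvalue.

```python
import numpy as np


def pca_kaiser(data):
    """PCA of z-scored columns, keeping components with eigenvalue > 1."""
    x = np.asarray(data, dtype=float)
    # Normalize each variable to zero mean and unit variance (z-scores).
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    # Covariance of z-scored variables equals the correlation matrix.
    corr = np.cov(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)       # returned in ascending order
    order = np.argsort(eigvals)[::-1]             # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = 100.0 * eigvals / eigvals.sum()   # % of total variance per PC
    keep = eigvals > 1.0                          # eigenvalue > 1 criterion
    scores = z @ eigvecs[:, keep]                 # transformed observations
    # Assumed convention: loadings = eigenvectors * sqrt(eigenvalue).
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    return eigvals, explained, loadings, scores


def varimax(loadings, max_iter=100, tol=1e-8):
    """Raw varimax rotation of a (variables x components) loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    previous = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        column_ss = np.diag((rotated ** 2).sum(axis=0))
        gradient = loadings.T @ (rotated ** 3 - rotated @ column_ss / p)
        u, s, vt = np.linalg.svd(gradient)
        rotation = u @ vt                          # nearest orthogonal matrix
        total = s.sum()
        if previous and total / previous < 1.0 + tol:
            break                                  # criterion stopped improving
        previous = total
    return loadings @ rotation


# Synthetic data only: two hidden factors mixed into eight variables.
rng = np.random.default_rng(0)
latent = rng.normal(size=(42, 2))
mixing = rng.normal(size=(2, 8))
demo = latent @ mixing + 0.3 * rng.normal(size=(42, 8))

eigvals, explained, loadings, scores = pca_kaiser(demo)
varifactors = varimax(loadings)
```

Because the rotation is orthogonal, the communality of each variable (the row sum of squared loadings) is the same before and after rotation; only the distribution of loadings across the retained factors changes.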

CA is an unsupervised pattern recognition technique that uncovers the intrinsic structure or underlying behavior of a data set without any prior assumption, classifying objects into categories based on similarity (Vega et al. 1998). Hierarchical clustering is the most commonly used approach; clusters are formed sequentially, starting with the most similar pair of objects and forming higher clusters step by step. The Euclidean distance usually gives the similarity between two samples, and a “distance” can be represented by the “difference” between the analytical values of the two samples (Otto 1998). In this study, CA was performed on the normalized data set by means of Ward’s method, using Euclidean distances as a measure of similarity. Ward’s method uses an analysis of variance approach to evaluate the distances between clusters, attempting to minimize the sum of squares of any two clusters that can be formed at each step. CA was applied to assess the quality of river water in relation to its spatial variability at selected sites spread over the river stretch. The linkage distance is reported as (Dlink/Dmax) × 100, i.e., the quotient of the linkage distance for a particular case and the maximal distance, multiplied by 100, to standardize the linkage distance represented on the y-axis (Wunderlin et al. 2001; Simeonova et al. 2003).
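The clustering procedure can be sketched with SciPy's hierarchical clustering routines. The function name `ward_site_clusters`, the z-score normalization, and the synthetic 7 × 8 site matrix are assumptions made for illustration; the cut-off is left as a parameter on the (Dlink/Dmax) × 100 scale, with a default of 75 chosen here only as an example value.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def ward_site_clusters(site_values, threshold_pct=75.0):
    """Ward clustering of sites with linkage distances rescaled to 0-100."""
    x = np.asarray(site_values, dtype=float)
    # Normalize each parameter so no single unit dominates the distance.
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    merges = linkage(z, method="ward", metric="euclidean")
    d_link = merges[:, 2]                       # linkage distance of each merge
    rescaled = 100.0 * d_link / d_link.max()    # (Dlink / Dmax) x 100
    # Cut the dendrogram at the chosen percentage of the maximal distance.
    cutoff = threshold_pct / 100.0 * d_link.max()
    labels = fcluster(merges, t=cutoff, criterion="distance")
    return rescaled, labels


# Synthetic example: seven sites, eight parameters, two sites shifted away.
rng = np.random.default_rng(1)
demo_sites = rng.normal(size=(7, 8))
demo_sites[5] += 6.0
demo_sites[6] -= 6.0

rescaled, labels = ward_site_clusters(demo_sites)
```

Here `rescaled` holds the heights that would be drawn on the y-axis of the dendrogram, and `labels` assigns each site to a flat cluster at the chosen cut-off.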

Results and discussion

Principal component analysis

PCA was performed on the combined data set of 3 years (between the sites and parameters), and PCs with eigenvalue >1 were retained. The results of this analysis include the loadings, eigenvalues, and variance explained by each PC. The three extracted PCs explain 78.5 % of the total variance, as shown in Table 3. VF1 explains 46 % of the total variance and has strong positive loadings of TDS, chloride, and EC in both ground water and surface water. VF2 explains 18.7 % of the total variance and has strong positive loadings of PO₄³⁻ in ground and surface water, strong negative loadings of pH, and moderate positive loadings of total hardness. VF3 (13.7 % of the total variance) has strong negative loadings of total hardness in both ground and surface water, as shown in Table 4.

Table 3 The PCA axes explaining complete variance of the data
Table 4 Loading of variables on significant principal component of surface and ground water data

From the PCA (Fig. 2), we observed that Chaukaghat, Hukulganj, Nakhighat, and Kachahari are affected by total hardness and PO₄³⁻, whereas the Rajghat site is influenced by TDS, EC, and NO₃⁻. The pollutants at these sites are contributed by anthropogenic sources such as small dyeing industries, laundries, small-scale foundries, battery industries, and haphazard dumping of municipal solid waste along the river (Barai and Kumar 2013).

Clustering of the sites based on spatial similarity

Cluster analysis (CA) was applied to detect spatial similarity among sites under the monitoring network. It rendered a dendrogram, shown in Fig. 3, grouping all seven sampling sites on the river into three statistically significant clusters at (Dlink/Dmax) × 100 ≤ 75. The clustering procedure generated three groups of sites in a convincing way, as the sites in each group have similar characteristic features and the same source of contamination; the ranges of their correlation coefficients are shown in Table 5. Cluster 1 (Nakhighat, Hukulganj, Old Bridge, Kachahari, Dhobighat), cluster 2 (Chaukaghat), and cluster 3 (Rajghat) are the clusters that correspond to contamination of water. This technique is thus useful in offering a reliable classification of surface and ground water in the whole region. Instead of monitoring seven sites, only three sites could be selected in a future spatial sampling strategy without affecting the result (Fig. 2). Reducing the number of sampling sites in the monitoring network will be more economical without losing any significance of the outcome (Singh and Singh 2010; Singh et al. 2005b, c).

Fig. 2 Result of principal component analysis (PCA) for the sites and physiochemical parameters of surface and ground water

Table 5 The range of correlation coefficient of cluster analysis
Fig. 3 Dendrogram showing the clusters of sampling sites on Varuna River (linkage method: Ward’s method; distance measure: Euclidean (Pythagorean); percent chaining = 85.71)

Correlation analysis

As the raw data are not normally distributed, the correlation among the eight variables was determined by estimating Spearman’s rank correlation coefficient (r), which is a non-parametric measure of the degree of correlation (Shrestha and Kazama 2007). Because of the known association among the variables used in the study and the fairly large number of observations (42), the critical value of r, which is 0.259 at p < 0.05, was replaced by 0.424 at p < 0.001 as the critical value for this study. This is expected to highlight the stronger positive or negative correlations that are likely to be of more utility (Helena et al. 2000). It is also of interest to see how these correlations are reflected in the multivariate statistics, especially PCA, and subsequently in the VFs after varimax rotation of the PCs. Data in Table 4 provide the correlation matrix of the quality parameters obtained from the PCA. Among the various physiochemical parameters of surface and ground water, only a few exhibited significant correlations (Table 6). Strong negative correlations were observed between pH and PO₄³⁻ and between NO₃⁻ and hardness (Table 6). We observed a significant positive relationship of temperature with PO₄³⁻ (r = 0.54, p < 0.01). It has been reported that the solubility of calcium phosphate minerals increases with increasing temperature (House 1999). This implies that the degree of contamination also depends on the season of sampling. There is a strong positive correlation of TDS with Cl⁻ (r = 0.81, p < 0.01) and EC (r = 0.87, p < 0.01), and a moderate correlation with NO₃⁻ (r = 0.47, p < 0.01), which is understandable as TDS accounts for both organic and inorganic impurities. An increase in impurities raises the number of free ions in the water, leading to increased EC. NO₃⁻ was also positively correlated with EC (r = 0.47, p < 0.01). Overall, the correlation analysis showed the degree of positive and negative correlation between the parameters and supports the dependence of EC on free ions. The correlation coefficients of the various parameters are presented in Table 6.
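The rank correlation screening described above can be sketched in plain Python. This is only an illustration: the helper names (`average_ranks`, `spearman`, `significant_pairs`) are hypothetical, the readings are made up rather than taken from Table 2, and the threshold of 0.424 is simply the critical value quoted in the text.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                      # extend the run of tied values
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's r is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def significant_pairs(columns, critical_r):
    """Return {(name_a, name_b): r} for pairs with |r| >= critical_r."""
    names = list(columns)
    result = {}
    for idx, a in enumerate(names):
        for b in names[idx + 1:]:
            r = spearman(columns[a], columns[b])
            if abs(r) >= critical_r:
                result[(a, b)] = r
    return result


# Made-up readings for illustration only (not data from the study).
demo_columns = {
    "TDS": [300, 320, 410, 450, 500, 610],
    "EC": [450, 500, 640, 700, 760, 930],
    "pH": [7.9, 7.2, 8.1, 7.4, 7.8, 7.5],
}
strong = significant_pairs(demo_columns, critical_r=0.424)
```

Working on ranks rather than raw values is what makes the coefficient insensitive to the non-normal distribution of the data.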

Table 6 Correlation coefficient for the various physiochemical parameters of surface and ground water

Multivariate statistical techniques (PCA and CA) effectively evaluated the variations in surface water and ground water quality of the river Varuna. PCA helped in identifying the factors or sources responsible for water quality variations, explaining about 78.5 % of the total variability. CA reduced the sampling sites to three groups based on similarity in water quality, which could help minimize the number of sampling locations in future monitoring. The interactions between hydrochemical parameters of ground water and surface water are evident from the loadings of variables on the significant principal components of ground and surface water.

Previous studies establishing this kind of interaction were carried out in different environments but used the same methodology. Mencio and Mas-Pla (2008) performed a similar analysis in urbanized Mediterranean streams, and Woocay and Walton (2008) followed the same methodology and successfully demonstrated the interactions that we hypothesized for this study area. Furthermore, various studies have analysed ground water using multivariate statistical techniques, but only a few of them proposed any kind of interaction.

Conclusions

The present study took into account only one season (winter) across 3 years in drawing its conclusions. The data collected at the different sampling sites over this period established an interaction between ground water and surface water. The techniques further helped to identify the relevant parameters to monitor at every site. In conclusion, the interaction hypothesized in our proposal was exhibited by the statistical techniques. Moreover, this analysis will help future water quality management programs, as it has outlined the parameters contributing to pollution at every site. This will make future monitoring more economical and easier to interpret. It is therefore necessary to develop a comprehensive river water quality monitoring program all over the world (Sharma and Kansal 2011).