Introduction

The quality of water in a river is strongly affected by both natural and anthropogenic sources, since a river system, comprising both the main course and the tributaries, carries the one-way flow of a significant load of matter in dissolved and particulate phases. This load reflects the major influences on the system: the lithology of the basin, atmospheric inputs, climatic conditions and anthropogenic inputs (Bricker and Jones 1995). Human activities are the major factors determining the quality of river surface water through effluent discharges, use of agricultural chemicals, erosion of soils and land use (Niemi et al. 1990), whereas rainfall, which causes surface water runoff, is a seasonal phenomenon largely affected by climate within the basin (Karbasi et al. 2008; Najafpour et al. 2008). Since the system of natural environment and anthropogenic sources is multivariate and complex, river quality assessment and management require a fundamental understanding of spatial and temporal variations in the water characteristics, including hydro-morphological, chemical and biological parameters. For this purpose, a monitoring program that provides reliable estimates of surface water quality parameters is necessary. However, such a program requires a large number of parameters to be measured, analysed and interpreted using multivariate methods (Shrestha and Kazama 2007; Simeonov et al. 2004).

Multivariate statistical methods, comprising cluster analysis (CA), principal component analysis (PCA), factor analysis (FA) and discriminant analysis (DA), have been reported to be effective for handling complex data matrices when characterizing and evaluating surface water quality, and they are useful in interpreting temporal and spatial variations of water quality parameters caused by natural and anthropogenic factors (Wunderlin et al. 2001; Helena et al. 2000; Lee et al. 2001; Adam et al. 2001; Simeonov et al. 2004; Singh et al. 2004; Reghunath et al. 2002; Papatheodorou et al. 2007). These methods are therefore considered valuable tools for reliable assessment and management of water resources and for developing rapid solutions to pollution problems. Several recent studies have applied multivariate statistical approaches to evaluate river water quality. Singh et al. (2004, 2005) used CA, PCA/FA and DA to evaluate spatial and temporal variations in water quality of the Gomti River (India). Similarly, Shrestha and Kazama (2007) used multivariate statistical techniques to evaluate the surface water quality of the Fuji River Basin in Japan. Qadir et al. (2008) applied multivariate statistical techniques to assess spatio-temporal variations in water quality of the river Chenab, Pakistan, and Pejman et al. (2009) evaluated spatial and seasonal variation in surface water quality using the same kind of techniques.

This study aims to assess the spatial and temporal variations of surface water quality at 38 sampling sites of the Mekong River throughout Can Tho City, in the central area of the Mekong Delta Basin in Vietnam. The data set, comprising 11 water quality parameters monitored quarterly at key points throughout the city during the period of 2008–2011, was analysed using different multivariate statistical techniques to (i) examine the similarities or dissimilarities between sampling sites and between seasons, (ii) identify water quality variables responsible for spatial and temporal variations in river water quality and (iii) evaluate the influence of possible sources on the water quality parameters in the research location.

Methods

Study area and monitoring sites

The study was conducted in Can Tho City, a typical city in the Mekong Delta in Vietnam. Can Tho is the fourth most populous city in Vietnam with a total area of 1,411.49 km², a population of 1,188,390 people and a population density of 842 people/km² (Huyen 2012). In the heart of the Mekong Delta, where the river system is intertwined, residential life in Can Tho is strongly affected by hydrological factors. Can Tho City has a tropical monsoon climate with two main seasons annually: the dry season (December–April) and the wet season (May–November).

In the heavily urbanized areas of Can Tho City, there are reticulated water supplies but there is no centralized wastewater treatment, with sewers discharging into local waterways. In areas outside of the urban districts, there is inadequate access to water and sanitation services. Many households in rural and peri-urban areas depend on a range of water sources including rainwater, groundwater, piped water and river surface water. Similar to other lower Mekong Delta Basin localities, the main sources of surface water pollution in Can Tho comprise urban waste, industrial waste water and agricultural activities.

Surface water quality in Can Tho was monitored quarterly by the Can Tho Centre for Natural Resources and Environment, and sampling followed the guidelines of the National Technical Regulation on Surface Water Quality in Vietnam (VNMONRE 2008) (Table 1). Water samples from the 38 sites (Fig. 1), which were located at key points throughout the city, were collected quarterly from January 2008 to December 2012. The data sets of the 38 water quality monitoring stations comprise 11 water quality parameters: pH, dissolved oxygen (DO) (i.e. in situ DO readings given in milligrams per litre and auto-compensated or corrected for variations), biochemical oxygen demand (BOD), chemical oxygen demand (COD), turbidity in nephelometric turbidity units (NTU), suspended solids (SS), nitrate (NO3), nitrite-nitrogen (NO2), ammoniacal nitrogen (NH3), total Fe and total coliform.

Table 1 Mean (standard deviation) values of water quality parameters at different sampling sites of the Mekong River in Can Tho City, Mekong Delta, Vietnam (2008–2012)
Fig. 1 Sampling sites for monitored surface water in Can Tho City, Mekong Delta, Vietnam

Statistical analysis

Multivariate analysis of the river water quality data was conducted using four statistical techniques: cluster analysis (CA), factor analysis (FA), principal component analysis (PCA) and discriminant analysis (DA).

Cluster analysis (CA)

Cluster analysis is a group of multivariate techniques whose primary purpose is to assemble objects based on the characteristics they possess (Shrestha and Kazama 2007). The similarity between two samples is usually evaluated by the Euclidean distance, which can be represented by the difference between analytical values derived from the samples (Otto 1998). This study performed hierarchical agglomerative CA on the normalized, log-transformed water quality parameters using Ward’s method, with squared Euclidean distance as the measure of similarity (Singh et al. 2004). The spatial variability of water quality in the river basin was identified from CA using the linkage distance, reported as D link/D max, the quotient of the linkage distance for a given case and the maximal linkage distance. The quotient was then multiplied by 100 to standardize the linkage distance represented on the y-axis (Simeonov et al. 2004; Wunderlin et al. 2001). Previous studies indicated that CA reliably classifies surface water quality and can guide future sampling strategies (Simeonov et al. 2004; Singh et al. 2004; Wunderlin et al. 2001).
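The clustering procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the software or settings used in the study: the function name `ward_clusters`, the synthetic `site_means` array and the 60 % cutoff default are hypothetical, z-scoring is assumed as the normalization, and SciPy reports Ward merge heights on the unsquared Euclidean scale, so the heights are squared here to approximate a squared-distance dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def ward_clusters(site_means, cutoff_pct=60.0, squared=True):
    """Cluster sites with Ward's method and cut at (Dlink/Dmax) x 100 < cutoff_pct."""
    logged = np.log10(site_means)  # log transformation
    z = (logged - logged.mean(axis=0)) / logged.std(axis=0, ddof=1)  # z-scores
    merges = linkage(z, method="ward")  # column 2 holds the merge heights
    heights = merges[:, 2] ** 2 if squared else merges[:, 2]
    d_pct = 100.0 * heights / heights.max()  # (Dlink/Dmax) x 100
    # Convert the percentage cutoff back to SciPy's unsquared height scale.
    frac = cutoff_pct / 100.0
    threshold = merges[:, 2].max() * (np.sqrt(frac) if squared else frac)
    labels = fcluster(merges, t=threshold, criterion="distance")
    return labels, d_pct


# Synthetic data only: 12 hypothetical sites, 3 parameters, three groups that
# are each elevated tenfold in a different parameter.
rng = np.random.default_rng(0)
base = np.full((12, 3), 10.0)
for g in range(3):
    base[4 * g: 4 * g + 4, g] *= 10.0
site_means = base * rng.lognormal(mean=0.0, sigma=0.02, size=base.shape)

labels, d_pct = ward_clusters(site_means)
print(labels)           # cluster label per site
print(d_pct.round(1))   # standardized linkage distances, maximum is 100
```

Because the standardized scale depends on whether heights are squared, a cutoff such as 60 should be read against the same convention as the dendrogram it is taken from.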

Factor analysis/principal component analysis

Factor analysis/principal component analysis (PCA/FA) was used to identify the important quality parameters of surface water and to investigate the possible sources of different pollutants. The technique was applied to the entire research location and to each cluster identified by the preceding CA. PCA (Wunderlin et al. 2001; Singh et al. 2004) starts with the covariance matrix, which describes the dispersion of the original variables (water quality parameters), and extracts its eigenvalues and eigenvectors. Principal components (PCs) are obtained by multiplying the original correlated variables by an eigenvector, which is a list of coefficients (loadings or weightings). A PC is therefore the product of the original data and an eigenvector, that is, the data projected onto a new axis. The number of PCs is the same as the number of original variables, but the PCs provide information about the most meaningful parameters (Vega et al. 1998). In the following step of PCA/FA, FA gives the factors meaning through varimax rotation: it reduces the contribution of the less significant variables obtained from PCA, and new groups of variables, called varimax factors or varifactors (VFs), are generated by rotating the axes defined by PCA. The VFs can include unobservable, hypothetical and latent variables (Vega et al. 1998; Wunderlin et al. 2001; Singh et al. 2004; Helena et al. 2000).
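The PCA and rotation steps can be illustrated with a short NumPy sketch. It assumes z-scored variables (so the covariance matrix equals the correlation matrix) and uses a commonly published SVD-based varimax iteration; the function names `pca` and `varimax` and the synthetic data are hypothetical and are not taken from the study.

```python
import numpy as np


def pca(data, n_components):
    """Eigen-decompose the covariance matrix of z-scored data."""
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    cov = np.cov(z, rowvar=False)            # equals the correlation matrix here
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = z @ eigvecs[:, :n_components]   # data projected onto the new axes
    loadings = eigvecs[:, :n_components] * np.sqrt(eigvals[:n_components])
    return eigvals, loadings, scores


def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonally rotate loadings to maximize the varimax criterion."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        target = rotated**3 - rotated @ np.diag((rotated**2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt                    # nearest orthogonal matrix
        if s.sum() < criterion * (1.0 + tol):
            break
        criterion = s.sum()
    return loadings @ rotation


# Synthetic data only: six variables driven by two hypothetical latent factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))
weights = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
data = latent @ weights.T + 0.3 * rng.normal(size=(200, 6))

eigvals, loadings, scores = pca(data, n_components=2)
rotated = varimax(loadings)
print(eigvals.round(2))
print(rotated.round(2))
```

The rotation is orthogonal, so each variable's communality (the row sum of squared loadings) is unchanged; only the distribution of loading across factors shifts.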

Discriminant analysis

DA was used to determine the surface water quality parameters which best discriminate between groups identified by CA. The principle of DA is to determine whether groups differ with regard to the mean of a variable, and then to use that variable to predict membership (Qadir et al. 2008; Singh et al. 2004). DA was applied to the water quality data set without any standardization to define spatial and temporal variations in these parameters. DA was performed to construct discriminant functions (DFs) using standard, forward stepwise and backward stepwise modes. DFs were calculated using the equation below:

$$ f\left({G}_i\right)={k}_i+{\displaystyle \sum_{j=1}^n}{w}_{ij}{P}_{ij} $$

where i indexes the groups (G), k_i is the constant inherent to each group, n is the number of parameters used to classify a set of data into a given group and w_ij is the weight coefficient assigned by DA to a given parameter (P_ij).
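As an illustration of how such a function could be evaluated, the sketch below estimates a constant and a weight vector per group using the standard linear classification functions based on a pooled within-group covariance matrix, then assigns each case to the group with the highest score. The source does not state how the constants and weights were estimated, so this estimator, the priors proportional to group size, the function names and the synthetic data are all assumptions.

```python
import numpy as np


def classification_functions(X, groups):
    """Estimate constants k_i and weights w_ij of linear classification functions."""
    labels = np.unique(groups)
    n, p = X.shape
    pooled = np.zeros((p, p))
    means = []
    for g in labels:
        Xg = X[groups == g]
        means.append(Xg.mean(axis=0))
        pooled += (Xg - means[-1]).T @ (Xg - means[-1])
    pooled /= n - len(labels)                      # pooled within-group covariance
    means = np.array(means)
    weights = np.linalg.solve(pooled, means.T).T   # one row of w_ij per group
    priors = np.array([(groups == g).mean() for g in labels])  # assumed priors
    constants = -0.5 * np.sum(weights * means, axis=1) + np.log(priors)
    return labels, constants, weights


def classify(X, labels, constants, weights):
    """Assign each case to the group with the largest f(G_i)."""
    scores = constants + X @ weights.T             # f(G_i) = k_i + sum_j w_ij P_ij
    return labels[np.argmax(scores, axis=1)]


# Synthetic data only: two hypothetical groups with two parameters each.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(4.0, 1.0, size=(50, 2))])
groups = np.array(["dry"] * 50 + ["wet"] * 50)

labels, constants, weights = classification_functions(X, groups)
predicted = classify(X, labels, constants, weights)
print((predicted == groups).mean())  # fraction of cases assigned correctly
```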

In DA, the site groups identified by CA were taken as the spatial grouping variable and the seasons (wet and dry) as the temporal grouping variable. These grouping variables (spatial and temporal) were used in the analysis as dependent variables, and the water quality parameters were considered independent variables. In forward stepwise mode, variables were added step by step, beginning with the most significant, until no significant change occurred, while in backward stepwise mode, variables were removed step by step, beginning with the least significant, until a significant change occurred.

Results and discussion

Overall water quality evaluation

A total of 608 water samples were collected from 38 sampling sites during the 4 years (2008–2012) and analysed for 11 parameters (pH, NTU, SS, Fe, DO, COD, BOD, NH3, NO3, NO2, coliform). Descriptive statistics (mean and standard deviation) of the water quality parameters are summarized in Table 1. The results show that NO3 is within the permissible levels, while all other parameters fail on average to meet the levels set by national guidelines for residential use and other purposes: BOD (mean range, 8.8–26.2 mg/l), COD (mean range, 12.5–38.7 mg/l), DO (mean range, 2.7–5.3 mg/l), SS (mean range, 31.7–61.9 mg/l), Fe (mean range, 0.37–0.84 mg/l), NH3 (mean range, 0.24–2.7 mg/l), NO2 (mean range, 0.04–0.19 mg/l) and total coliform (mean range, 8 × 10³–3.9 × 10⁵ MPN/100 ml). It is noteworthy that BOD levels are high, with a minimum site mean of 8.8 mg/l, while dissolved oxygen is low, with a minimum site mean of 2.7 mg/l.

Spatial similarity of monitoring sites (CA)

Cluster analysis (CA) rendered a dendrogram which grouped the 38 sites into three clusters at (D link/D max) < 60 (Fig. 2). The clustering procedure produced three groups of sites convincingly, since the sites in each group have similar features and natural background source types. Cluster 1 (C1) comprises 18 monitoring sites in five districts (Thot Not, Co Do, Thoi Lai, Phong Dien), which are mixed urban and rural areas. Cluster 2 (C2) is a small cluster comprising only two monitoring sites in Vinh Thanh District, where agricultural land use is predominant. Since this cluster lies in a predominantly agricultural district, the number of monitoring sites appears reasonable, with any increase in pollutants depending on the agricultural seasons and crops. Local authorities, experts and field technicians will therefore decide the number and location of additional sampling sites based on the real field situation. Cluster 3 (C3) comprises 18 monitoring sites in four districts (Cai Rang, Ninh Kieu, Binh Thuy) where urban and industrial zones are predominant. This grouping implies that, for rapid assessment of water quality, a reduced number of monitoring sites in each cluster may serve for spatial assessment of the water quality of the whole network. It also offers an opportunity to design a future spatial sampling strategy in an optimal manner, leading to a more cost-effective water monitoring program in this Mekong Delta area. These results are consistent with the successful application of this approach in water quality programs reported in previous studies (Wunderlin et al. 2001; Simeonov et al. 2004).

Fig. 2 Dendrogram showing clustering of sampling sites according to surface water quality characteristics of the Mekong Delta River in Can Tho City

Identification of potential sources (PCA/FA)

Waters are generally low in dissolved oxygen (means, 2.7–4.57 mg/l), reflecting organic loads, as indicated by BOD5 and COD levels. The mean values of these parameters appear to be correlated. Elevated nutrient measurements in the surface water are limited to inorganic forms of nitrogen. Clearly, there are abundant non-point and point sources of nitrogen (and phosphorus) nutrients in a mixed urban and agricultural environment. Ubiquitous microbiological contamination is indicated by coliform counts, presumably related to faecal sources.

Principal component analysis/factor analysis (PCA/FA) was conducted on the normalized, log-transformed water quality parameters (11 variables) separately for the entire research location and for the individual clusters delineated by CA, to compare the compositional pattern between analysed water samples and to identify the factors influencing each one. The input data matrices (variables × cases) for PCA/FA were [11 × 760] for the whole research location, [11 × 460] for C1, [11 × 40] for C2 and [11 × 260] for C3.

PCA generated three PCs for the entire research location, three PCs for cluster 1, four PCs for cluster 2 and four PCs for cluster 3 with eigenvalues >1, explaining 60, 60.2, 80.9 and 70 % of the total variance in the respective water quality data sets. An example of the scree plot representing eigenvalues for the entire research location is shown in Fig. 3. An eigenvalue gives a measure of the significance of a factor, and a factor is considered significant if its eigenvalue is greater than 1.0 (Kim and Muller 1987). The factors with the highest eigenvalues are the most significant. Equal numbers of varifactors (VFs) were then identified for all cases through FA performed on the PCs. Corresponding VFs, variable loadings and explained variance are presented in Table 2. Liu et al. (2003) classified the factor loadings as ‘strong’, ‘moderate’ and ‘weak’ corresponding to absolute loading values of >0.75, 0.75–0.50 and 0.50–0.30, respectively.
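The two decision rules in this paragraph, retention of components with eigenvalues above 1 and the loading classes of Liu et al. (2003), can be written as a small Python sketch. The eigenvalues below are hypothetical numbers chosen only so that they sum to 11 (one unit of variance per standardized variable); they are not the study's results, and the "negligible" label for loadings below 0.30 is an added assumption.

```python
def kaiser_summary(eigenvalues):
    """Count components with eigenvalue > 1 and the % of variance they explain."""
    kept = [ev for ev in eigenvalues if ev > 1.0]
    return len(kept), 100.0 * sum(kept) / sum(eigenvalues)


def classify_loading(loading):
    """Label a loading by absolute value using the thresholds of Liu et al. (2003)."""
    size = abs(loading)
    if size > 0.75:
        return "strong"
    if size >= 0.50:
        return "moderate"
    if size >= 0.30:
        return "weak"
    return "negligible"  # assumed label; the source defines no class below 0.30


# Hypothetical eigenvalues for 11 standardized variables (illustration only).
example_eigenvalues = [2.9, 2.6, 1.1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
n_kept, pct_explained = kaiser_summary(example_eigenvalues)
print(n_kept, round(pct_explained, 1))
print([classify_loading(x) for x in (0.82, -0.61, 0.35, 0.10)])
```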

Fig. 3 Scree plot of eigenvalues for the entire research location, Can Tho City, Mekong Delta, Vietnam

Table 2 Loading of 11 experimental variables on factor analysis parameters for entire research location and three clusters

For both the entire research location and the individual clusters, the VF1(s), accounting for 26.4, 25.9, 34.5 and 25.3 % of the total variance, have a strong positive loading (>0.7) on nephelometric turbidity units (NTU) and suspended solids (SS) and a moderate loading (>0.6) on Fe and nitrates. These factors reflect soil erosion from disturbed land or inflow of effluent from sewage plants and industry, and a high loading on turbidity may also be attributed to the presence of algae and plankton. The VF2(s), accounting for 23.5, 23.9, 24.4 and 24.5 % of total variance, have a strong positive loading on biochemical and chemical oxygen demand (BOD and COD) and only a moderate loading on ammonia. These factors represent influences from point sources, such as discharges from wastewater treatment plants, domestic wastewater, agricultural activities and industrial effluents. The VF3 of cluster 1, explaining 10.4 % of total variance, has a strong negative loading on pH. This factor can be explained by the anaerobic conditions in the river caused by the high loading of dissolved organic matter, which leads to formation of ammonia and organic acids and thus to a decrease in pH. In contrast, the VF3(s) of cluster 2 and cluster 3, accounting for 12.4 and 9.4 % respectively, have a strong positive loading on pH and represent the physicochemical source of variability. The VF4(s) of cluster 2 and cluster 3, accounting for the lowest total variance (9.5 and 9.4 %), have a strong positive loading on coliform bacteria. These factors indicate contamination by sewage waste carrying faecal coliform bacteria through sewer and septic systems, feedlot and dairy runoff, runoff from broad acre farming, stormwater, and livestock and humans defecating directly into the water.

Spatial and temporal variation in river water quality (DA)

Spatial DA was conducted with the same raw data set comprising 11 parameters after grouping into three clusters as obtained through the CA technique. The clusters were the dependent variable, while all the measured parameters constituted the independent variables. Discriminant functions (DFs) and classification matrices (CMs), obtained from the standard, forward stepwise and backward stepwise modes of DA, are presented in Tables 3 and 4. The results of the standard, forward stepwise and backward stepwise modes constructed for the 11 parameters (Table 3) indicate that the coliform bacteria group coefficients are zero. The classification matrices assign more than 67 % of cases correctly for the standard and forward stepwise modes using 11 discriminant parameters, and the CM gave a similar result (66.8 % of cases) for the backward stepwise mode using three parameters (Tables 3 and 4). Backward stepwise DA indicates that NTU, COD and NH3 are the discriminating parameters in space.
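For readers unfamiliar with classification matrices, the following sketch shows how the percentage of correctly assigned cases is derived from actual and predicted group labels. The function name and the ten example assignments are hypothetical and bear no relation to the values in Tables 3 and 4.

```python
import numpy as np


def classification_matrix(actual, predicted, labels):
    """Cross-tabulate actual (rows) against predicted (columns) group labels."""
    index = {label: i for i, label in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for a, p in zip(actual, predicted):
        cm[index[a], index[p]] += 1
    pct_by_group = 100.0 * np.diag(cm) / cm.sum(axis=1)  # % correct per group
    pct_overall = 100.0 * np.trace(cm) / cm.sum()        # % correct overall
    return cm, pct_by_group, pct_overall


# Hypothetical assignments for ten cases in three clusters (illustration only).
cluster_names = ["C1", "C2", "C3"]
actual = ["C1"] * 4 + ["C2"] * 2 + ["C3"] * 4
predicted = ["C1", "C1", "C1", "C3", "C2", "C2", "C3", "C3", "C1", "C3"]

cm, pct_by_group, pct_overall = classification_matrix(actual, predicted, cluster_names)
print(cm)
print(pct_by_group, pct_overall)
```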

Table 3 Classification function for discriminant analysis of spatial variation in Mekong River in Can Tho City, Mekong Delta, Vietnam
Table 4 Classification matrix for discriminant analysis of spatial variations in Mekong River, Can Tho City, Mekong Delta, Vietnam

Figure 4a–c shows box and whisker plots of the discriminating parameters identified by spatial DA (backward stepwise mode), which were used to examine different patterns associated with spatial variations in river surface water quality in Can Tho City. The average COD is highest in cluster 3, as these urbanized and industrial zones harbour domestic wastewater, wastewater treatment plants and industrial effluents. Ammonia levels are higher in clusters 3 and 1, which, combined with the high COD, suggests a high load of dissolved organic matter at these sites contributed by domestic wastewater, wastewater treatment plants and industrial effluents located upstream from the monitoring sites. This results in anaerobic conditions in the river, leading to formation of ammonia and organic acids. NTU levels are also higher in clusters 1 and 3 compared with cluster 2. This suggests soil erosion from disturbed land or inflow of effluent from sewage plants and industry, contributing to the presence of algae and plankton. Although turbidity, COD and ammonia are mostly attributed to urbanization and industrial activities in this study, the question of why there is less loading of these parameters in the predominantly agricultural area compared with urbanized and industrial zones needs to be addressed by further studies.

Fig. 4 Spatial variations: a COD, b NH3 and c NTU in surface water quality of the Mekong River in Can Tho City, Mekong Delta, Vietnam

Temporal DA was also performed on the same raw data set comprising 11 parameters after grouping into seasonal groups (dry and wet season). Discriminant functions (DFs) and classification matrices (CMs) obtained from the standard, forward stepwise and backward stepwise modes of DA are shown in Tables 5 and 6. As with spatial DA, in the DFs constructed by the standard and forward stepwise modes using 11 parameters, the coliform bacteria group coefficients are again zero. Both the standard and forward stepwise mode DFs using 11 parameters yielded CMs assigning approximately 60 % of cases correctly, whereas the backward stepwise mode yielded CMs with 56 % correct assignation using only two discriminant parameters (Table 5). Backward stepwise DA shows that pH and NO2 are the discriminating parameters according to season. Box and whisker plots of the discriminating parameters identified by the seasonal backward stepwise DA are presented in Fig. 5a, b. The pH is slightly higher during the dry season than during the wet season, whereas NO2 levels are significantly higher during the wet season. The higher NO2 levels can be attributed to the runoff of nitrogenous fertilizers from agricultural activities during the wet season in the research location. In addition, nitrite-nitrogen might also come from other sources such as anaerobic wastewaters, low dissolved oxygen levels and urban runoff containing poorly degraded organic and ammonia-nitrogen waste during the wet season. The seasonal variation of pH values observed in this study agrees with the results of previous similar studies (Ekeh and Sikoki 2003; Ansa 2005), in which pH values in the dry season were higher than those in the late rainy season. This may be due to the influx and decay of debris in the area as well as an imbalance in the level of H+ ions from surface runoff during the rains. In addition, the pH monitoring likely indicates the effects of acid sulphate soils, which are an issue in the Mekong Delta area (Neumann et al. 2013).

Table 5 Classification function for discriminant analysis of temporal variation in Mekong River in Can Tho City, Mekong Delta, Vietnam
Table 6 Classification matrix for discriminant analysis of temporal variations in Mekong River, Can Tho City, Mekong Delta, Vietnam
Fig. 5 Temporal variations: a pH, b NO2-N (mg/l) in surface water quality of the Mekong River in Can Tho City, Mekong Delta, Vietnam

Conclusion

This study used various multivariate statistical techniques to evaluate temporal and spatial variations in surface water quality of the Mekong Delta River in Can Tho City, Vietnam. Cluster analysis grouped a large number of sampling sites (38 sites) into three clusters of similar water quality features, which divided the river basin into three zones comprising a mixed urban-rural area, an agricultural area and a mixed urban-industrial area. The results indicate the low quality of river surface water in the Mekong Delta in Vietnam, as shown by the fact that most of the quality parameters exceeded the national standard guidelines. The findings also suggest a possible revised sampling strategy that can reduce the number of sampling sites and the associated recurring costs. However, further study is needed to assess how sensitive the recommended sampling clusters are to changes in environmental and seasonal factors, as well as how they respond to climate change impacts such as increasing temperature, flooding and longer dry periods. Principal component analysis and factor analysis were performed to extract and recognize the factors or origins responsible for water quality variation over the entire location and in each cluster. The results indicated that nephelometric turbidity units (NTU), suspended solids (SS), Fe and nitrate are the most significant parameters contributing to water quality variations for the entire location and for each cluster, which reflects soil erosion from disturbed land or inflow of effluent from sewage plants and industry. The high loading on turbidity is also attributed to the presence of algae and plankton. DA revealed that three parameters (NTU, COD and NH3) discriminate between clusters spatially, and two parameters (pH and NO2) discriminate between seasons (dry and wet). The results of this study demonstrate the usefulness of multivariate statistical techniques for analysis and interpretation of complex data sets, identification of pollution sources/factors, and evaluation of temporal/spatial variations in water quality assessment and management.