1 Introduction

Coastal water quality has become a major environmental concern at scales ranging from regional to global, and is influenced by natural and anthropogenic disturbances such as wastewater, runoff effluents, land reclamation, atmospheric deposition and climate change (Bowen and Depledge 2006; Kuppusamy and Giridhar 2006). Regular monitoring programs are recognized as an essential step in characterizing and controlling coastal water pollution (Simeonov et al. 2003; Singh et al. 2004). However, many monitoring programs produce large and complicated data sets consisting of physical, chemical, biological and microbiological properties, which are difficult to analyze and interpret because of latent interrelationships among parameters and monitoring sites (Zhou et al. 2007a). In Macau, in order to evaluate the state of health of coastal water and its long-term change, the Drainage Services Department of Macau Civil Administration has operated inshore monitoring sites along Macau peninsula since 2000, producing large and highly complex data sets. It is therefore necessary to extract meaningful information from these data sets for effective coastal water quality management along Macau peninsula.

Multivariate techniques including cluster analysis (CA), discriminant analysis (DA) and principal component analysis (PCA) have been applied to a variety of environmental problems, including assessment of spatiotemporal change of rainwater quality (Wunderlin et al. 2001; Vazquez et al. 2003; Hamers et al. 2003; Ouyang et al. 2006), identification of rainfall runoff characteristics on different urban surfaces (Goonetillekea et al. 2005; Yamada et al. 1993; Huang et al. 2007), and evaluation of the spatiotemporal patterns of land-based pollution in coastal areas (Zhou et al. 2007a, b, c). These studies provided valuable insights into the application of CA, DA and PCA techniques to environmental management and protection. However, a comprehensive application of CA, DA and PCA to the coastal water quality along Macau peninsula, covering spatial–temporal variation and source identification, has not been conducted.

In this study, we analyze the coastal water quality from 22 inshore sampling sites during 2000–2005 by integrating CA, DA and PCA methods. The objectives of this study are: (1) to extract latent information about the similarities or dissimilarities among the monitoring periods or sites; and (2) to identify pollution sources leading to spatiotemporal variations in water quality along Macau peninsula.

2 Materials and methods

2.1 Study area and monitoring sites

Located on the western side of the Pearl River estuary, Macau lies between latitudes 22°06′39″N and 22°13′06″N, and between longitudes 113°31′33″E and 113°35′43″E. Macau consists of Macau peninsula, Taipa island and Coloane island, covering 9.3, 6.7 and 13.2 km², respectively. Macau has a total population of 538,100, nearly 95% of whom live on the Macau peninsula (DSEC 2007).

The Drainage Services Department of Macau Civil Administration has operated inshore sampling sites along the Macau peninsula since 2000. Figure 1 shows the monitoring sites and sewer systems. The sampling sites are set near the overflow manholes of the sewer pipes. The inshore surface seawater was sampled at a depth of about 1 m.

Fig. 1 Location of monitoring sites along Macau peninsula

2.2 Parameters and analytical methods

The data sets of coastal water quality from 22 monitoring sites consisted of 14 water-quality parameters monitored monthly over 6 years (Jan 2000–Dec 2005). The water quality parameters included Acoliform (E.coli), Fcoliform (F.coli), pH, color, turbidity, electrical conductivity (EC), dissolved oxygen (DO), chloride (Cl), total suspended solids (TSS), nitrite nitrogen (NO2−), nitrate nitrogen (NO3−), ammonia nitrogen (NH4+), total phosphorus (TP), and chemical oxygen demand (COD). Sampling and analysis for these parameters followed standard methods (APHA 1992). The 6-year data set consisted of 606 observations of coastal water quality along Macau peninsula.

2.3 Data treatment

Most multivariate statistical methods require variables to conform to the normal distribution, so the normality of the distribution of each variable was checked using kurtosis and skewness statistics before multivariate statistical analysis was conducted (Lattin et al. 2003). The original data showed kurtosis values ranging from 4.247 to 606.629 and skewness values ranging from −0.714 to 24.626, indicating that the distributions were far from normal with 95% confidence. Since most of the values of kurtosis or skewness were greater than 0, the original data were transformed in the form x′ = log10(x). After log-transformation, the kurtosis and skewness values ranged from −0.695 to 17.06 and −1.831 to 0.733, respectively. In the case of CA and PCA, all log-transformed variables were also z-scale standardized (the mean and variance were set to 0 and 1, respectively) to minimize the effects of different units and variances of variables and to render the data dimensionless (Zhou et al. 2007c; Singh et al. 2004).
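The data treatment described above can be illustrated with a minimal sketch. It assumes strictly positive measurements (log10 is undefined at zero), population moments, and kurtosis reported as excess kurtosis; the source does not state which conventions its software used, and the sample values are hypothetical.

```python
import math
from statistics import mean, pstdev


def skewness(xs):
    """Population skewness: the third standardized moment."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


def excess_kurtosis(xs):
    """Population excess kurtosis: fourth standardized moment minus 3."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3.0


def log_transform(xs):
    """x' = log10(x); assumes every value is strictly positive."""
    return [math.log10(x) for x in xs]


def z_scale(xs):
    """Standardize to mean 0 and (population) variance 1."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]


# Hypothetical right-skewed concentrations for one parameter.
raw = [0.1, 0.2, 0.2, 0.3, 0.5, 0.8, 1.5, 4.0, 12.0, 60.0]
logged = log_transform(raw)
scaled = z_scale(logged)

print("skewness raw/log:", skewness(raw), skewness(logged))
print("kurtosis raw/log:", excess_kurtosis(raw), excess_kurtosis(logged))
```

In this toy example the log transform pulls in the long right tail, and the z-scaled series is dimensionless, which is the property CA and PCA rely on when parameters have different units.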

2.4 The multivariate techniques

In this study, CA, DA, and PCA were combined to carry out multivariate analysis of the data sets collected at the 22 monitoring sites. The principles of CA, DA, and PCA are summarized below.

2.4.1 Cluster analysis (CA)

Cluster analysis (CA) is an unsupervised pattern recognition method that divides a large group of cases into small groups or clusters of relatively similar cases that are dissimilar to other groups. Hierarchical CA, the most common approach, starts with each case in a separate cluster and joins the clusters together step by step until only one cluster remains (Lattin et al. 2003). The Euclidean distance usually gives the similarity between two samples, and a distance can be represented by the difference between transformed values of the samples (Otto 1998). In this study, hierarchical CA was performed on the standardized data using Ward’s method with squared Euclidean distances as a measure of similarity. Ward’s method uses analysis of variance (ANOVA) to calculate the distance between clusters to minimize the sum of squares of any two possible clusters at each step. Both temporal and spatial variations in water quality were determined from hierarchical CA using the linkage distance. The Ward method, based on hierarchical CA, is effective and popular (Wunderlin et al. 2001; Vazquez et al. 2003).
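As an illustration of the agglomerative procedure, the sketch below implements Ward's criterion directly: at each step it merges the two clusters whose union gives the smallest increase in the within-cluster sum of squares, computed from squared Euclidean distances between centroids. The site profiles are hypothetical, and the merge cost is only one of several conventions for reporting linkage distance, so the heights need not match those of any particular statistics package.

```python
def ward_merge_cost(ca, na, cb, nb):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return na * nb / (na + nb) * sq_dist


def ward_clustering(points):
    """Return the merge history as (members_a, members_b, cost) tuples."""
    # Each cluster is stored as (member indices, centroid, size).
    clusters = [((i,), tuple(p), 1) for i, p in enumerate(points)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                _, ci, ni = clusters[i]
                _, cj, nj = clusters[j]
                cost = ward_merge_cost(ci, ni, cj, nj)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        mi, ci, ni = clusters[i]
        mj, cj, nj = clusters[j]
        # Size-weighted centroid of the merged cluster.
        centroid = tuple((ni * a + nj * b) / (ni + nj) for a, b in zip(ci, cj))
        merges.append((mi, mj, cost))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((mi + mj, centroid, ni + nj))
    return merges


# Hypothetical standardized profiles for four sites (two parameters each).
sites = [(-1.0, -1.1), (-0.9, -1.0), (1.0, 0.9), (1.1, 1.0)]
merges = ward_clustering(sites)
for members_a, members_b, cost in merges:
    print(members_a, "+", members_b, "cost =", round(cost, 4))
```

The last merge joins the two well-separated pairs at a much larger cost than the earlier merges, which is the jump a dendrogram makes visible.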

2.4.2 Discriminant analysis (DA)

Discriminant analysis (DA) is a method of analyzing dependence that is a special case of canonical correlation, and one of its objectives is to determine the significance of different variables, which can allow the separation of two or more naturally occurring groups. DA operates on original data, and the method constructs a discriminant function for each group as follows (Wunderlin et al. 2001; Zhou et al. 2007b):

$$ f\left( {G_{i} } \right) = k_{i} + \sum\limits_{j = 1}^{n} {w_{ij} p_{ij} } $$
(1)

where $i$ indexes the groups ($G$), $k_{i}$ is the constant inherent to each group, $n$ is the number of parameters used to classify a set of data into a given group, and $w_{ij}$ is the weight coefficient assigned by DA to a given parameter ($p_{ij}$).

Backward stepwise DA has proved to be the most effective mode for reducing the dimensionality of large data sets (Wunderlin et al. 2001; Singh et al. 2004; Zhou et al. 2007b).
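Equation (1) can be evaluated as in the following sketch. The constants, weights, and sample values are hypothetical, and assigning a sample to the group with the largest score is the usual convention for linear classification functions, assumed here rather than taken from the source.

```python
def discriminant_score(k_i, weights, params):
    """f(G_i) = k_i + sum_j w_ij * p_j for one group."""
    return k_i + sum(w * p for w, p in zip(weights, params))


def classify(sample, functions):
    """Score the sample against every group and pick the largest score."""
    scores = {
        group: discriminant_score(k, w, sample)
        for group, (k, w) in functions.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical constants k_i and weights w_ij for two groups, three parameters.
functions = {
    "first period": (-4.0, [1.5, 0.2, -0.3]),
    "second period": (-6.0, [0.5, 1.8, 0.4]),
}
group, scores = classify([1.0, 2.0, 0.5], functions)
print(group, scores)
```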

2.4.3 Principal component analysis (PCA)

Principal component analysis (PCA) is one of the most powerful and common techniques used for reducing the dimensionality of large data sets with minimal loss of information. PCA mathematically operates from the covariance matrix, which describes the dispersion of the multiple measured parameters, to obtain eigenvalues and eigenvectors. Linear combinations of the original variables and eigenvectors result in new variables, called principal components (PCs). Further rotation of the axes defined by PCA produces new groups of variables called varifactors (VFs). The basic features associated with PCA are data reduction and data grouping. Data reduction is obtained because only a few PCs are usually needed to describe most of the variability of the entire data set. The use of PCA for water quality assessment has increased in recent years, mainly due to the need to obtain appreciable data reduction for analysis and decision-making (Helena et al. 2000).

The steps of the PCA include: (1) standardize the original data; (2) compute the correlation coefficient matrix; (3) calculate the eigenvalues and eigenvectors; (4) calculate the variance contribution and cumulative contribution ratios, and select the principal components; (5) calculate the loading matrix of the selected principal components for the pollution load.
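The five steps listed above can be sketched with NumPy as follows. This is an illustrative implementation on synthetic data, not the software used in the study; the function name and the test matrix are assumptions. On standardized data the covariance and correlation matrices coincide, so either description of PCA leads to the same decomposition.

```python
import numpy as np


def pca_from_correlation(X):
    """X has shape (cases, variables). Returns eigenvalues, ratios, loadings."""
    # Step 1: standardize each variable to mean 0 and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 2: correlation coefficient matrix of the variables.
    R = np.corrcoef(Z, rowvar=False)
    # Step 3: eigenvalues and eigenvectors, sorted in descending order.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 4: contribution and cumulative contribution of each component.
    ratio = eigvals / eigvals.sum()
    cumulative = np.cumsum(ratio)
    # Step 5: loadings (eigenvectors scaled by sqrt of the eigenvalues).
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return eigvals, ratio, cumulative, loadings


# Synthetic data: two strongly correlated variables and one independent one.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack(
    [base[:, 0], base[:, 0] + 0.1 * rng.normal(size=200), base[:, 1]]
)
eigvals, ratio, cumulative, loadings = pca_from_correlation(X)
print("eigenvalues:", eigvals)
print("cumulative contribution:", cumulative)
```

A common rule of thumb, assumed here rather than stated in the source, is to retain components with eigenvalues greater than 1.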

3 Results and discussions

3.1 Temporal similarity and period grouping

An initial exploratory approach involved the use of hierarchical CA on standardized log-transformed data sets sorted by season. CA generated a dendrogram (Fig. 2), grouping the 12 months into two clusters at (Dlink/Dmax) × 100 < 23, and the difference between the clusters was significant. Cluster 1 (the first period) included June, July, August, and September, falling within the wet season (April–September). Cluster 2 (the second period) included the remaining months (January–May, October–December), approximately corresponding to the dry season in Macau. However, if the months had been empirically divided into spring (March–May), summer (June–September), autumn (October–December), and winter (January–February), or into dry/wet seasons, a mistake in grouping would have been made. In fact, Fig. 2 shows that the temporal patterns of water quality were not purely consistent with the four seasons or the dry/wet seasons.
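The rescaled cut-off (Dlink/Dmax) × 100 can be applied to a list of linkage distances as in the sketch below. The eleven distances are hypothetical values for a 12-month dendrogram, not those of Fig. 2, and the count assumes a monotone dendrogram, which Ward's method produces.

```python
def rescale_linkage(distances):
    """Express each linkage distance as a percentage of the largest one."""
    d_max = max(distances)
    return [100.0 * d / d_max for d in distances]


def clusters_at(distances, threshold):
    """Clusters remaining once every merge below the threshold is applied."""
    n_items = len(distances) + 1  # n items are joined by n - 1 merges
    merged = sum(1 for r in rescale_linkage(distances) if r < threshold)
    return n_items - merged


# Hypothetical linkage distances for the 11 merges of 12 months.
linkage = [0.5, 0.6, 0.8, 1.0, 1.2, 1.5, 1.9, 2.4, 3.0, 4.1, 20.0]
print(clusters_at(linkage, 23))
```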

Fig. 2 Temporal cluster analysis of monitoring periods based on Ward’s method

In Macau, 83% of annual precipitation occurs during the period from April to September, so the grouping by CA basically corresponds to the dry/wet seasons. However, some deviation would arise if water quality were monitored only according to the dry/wet seasons.

3.2 Spatial similarity and sites grouping

Considering the experience obtained from temporal CA, spatial CA was also used to identify similar monitoring sites, taking into account the influence of temporal differences. Spatial similarity analysis was carried out both for each temporal cluster and for the integrated clusters (the first and second periods combined), but the results were almost the same. Therefore, only the latter result is discussed. Spatial-CA produced a dendrogram, shown in Fig. 3, with two groups at (Dlink/Dmax) × 100 < 12.

Fig. 3 Spatial cluster analysis of monitoring sites based on Ward’s method

Group A consisted of a1, a2, a14, a15, a16, and a17, and Group B consisted of point a, point b, point c, a4–a13, and a18–a20. The group classifications varied with significance level, because the sites in these groups had similar features and natural backgrounds that were affected by similar sources. The six sites in Group A were located in the eastern and southern parts of Macau peninsula. The water quality of these sites is relatively good because these sites face the open ocean and the sewer system is separated (Fig. 1). In Group B, all sites were located in the western part of Macau peninsula. The water quality in Group B is poor because the western side is the old urban area with a dense combined sewer system, where the major pollution sources were untreated domestic wastewater and combined sewer overflow (CSO). In addition, coastal water quality in Group B is poorer because this region (especially YTH and KZJ, shown in Fig. 1) does not face open coastal water but a shallow watercourse located between Wanzai in Zhuhai district and Macau peninsula, which limits the dilution of pollutants.

Hierarchical CA provided a useful classification of the coastal water along the Macau peninsula that can be used to design an optimal future spatial monitoring network with lower costs (Simeonov et al. 2003; Singh et al. 2004; Zhou et al. 2007b). According to the above results, the monitoring frequency could be reduced by sampling only representative months from the first and second periods, and the number of monitoring sites could be reduced by choosing only representative sites from Group A and Group B.

3.3 Temporal variation in water quality

Temporal variation in water quality parameters (Table 1) was evaluated using a period-parameter correlation matrix, which showed that most analyzed parameters were significantly correlated (p < 0.05) with period, except F.coli, E.coli, pH, NO3−, and NO2−. Cl had the highest correlation coefficient (R = 0.52), followed by color (R = 0.28), DO (R = 0.26), EC (R = −0.19), TP (R = 0.13), NH4+ (R = 0.12), TSS (R = −0.12), turbidity (R = −0.11), and COD (R = −0.08). These parameters accounted for the major temporal variation in water quality. The absence of a significant correlation of NO3−, NO2−, F.coli, and E.coli with period indicates the contribution of anthropogenic pollution sources to the coastal water along Macau peninsula.
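A period-parameter correlation of this kind can be sketched as a Pearson correlation between a numeric period code and a parameter. The coding (1 = first period, 2 = second period) and the chloride values are hypothetical, since the source does not state how period was encoded. The helper `critical_r` uses a normal approximation to the t distribution to estimate the smallest |R| that is significant at a given level; for 606 observations this is about 0.08.

```python
from math import sqrt
from statistics import NormalDist


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def critical_r(n, alpha=0.05):
    """Approximate two-sided critical |R| for n observations.

    Inverts t = r * sqrt(n - 2) / sqrt(1 - r^2), with the t quantile
    replaced by a normal quantile (reasonable only for large n).
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z / sqrt(n - 2 + z * z)


# Hypothetical period codes and log-transformed chloride values.
period = [1, 1, 1, 1, 2, 2, 2, 2]
chloride = [3.1, 3.3, 3.0, 3.4, 3.9, 4.1, 3.8, 4.2]
r = pearson_r(period, chloride)
print("R =", round(r, 3), "critical |R| for n = 606:", round(critical_r(606), 4))
```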

Table 1 Classification matrix for backward DA of temporal variations

Temporal variation in water quality was further evaluated using backward stepwise DA. Before running the temporal DA, the number of clusters needed to be decided, so the clusters based on temporal-CA were applied. The objectives of DA in this study were (1) to test the significance of discriminant functions and (2) to determine the most significant variables associated with the differences between the clusters.

Discriminant functions (DFs) and classification matrices (CMs) obtained from the backward stepwise mode of DA are shown in Tables 1 and 2. DA produced a CM with close to 77% correct assignments using only six discriminant parameters. Thus, the temporal-DA results suggest that color, NH4+, Cl, TP, DO, and COD were the most significant parameters for discriminating between the first period and the second period, accounting for most of the expected temporal variation in water quality.
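The percentage of correct assignments reported from a classification matrix can be computed as in the sketch below. The group labels and assignments are hypothetical and do not reproduce Table 1.

```python
def classification_matrix(actual, predicted, groups):
    """Rows are the actual groups, columns are the groups assigned by DA."""
    cm = {a: {p: 0 for p in groups} for a in groups}
    for a, p in zip(actual, predicted):
        cm[a][p] += 1
    return cm


def percent_correct(cm):
    """Share of cases on the diagonal of the classification matrix."""
    total = sum(sum(row.values()) for row in cm.values())
    correct = sum(cm[g][g] for g in cm)
    return 100.0 * correct / total


# Hypothetical assignments for ten observations in two periods.
actual = ["first"] * 4 + ["second"] * 6
predicted = ["first", "first", "first", "second",
             "second", "second", "second", "second", "second", "first"]
cm = classification_matrix(actual, predicted, ["first", "second"])
print(cm, percent_correct(cm))
```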

Table 2 Classification functions coefficients for backward stepwise DA of temporal variation

Box and whisker plots of the discriminant parameters identified by temporal DA are given in Fig. 4. The values of all discriminant parameters except color (i.e., NH4+, Cl, TP, DO, and COD) are higher in the second period (January–May, October–December) than in the first period (June–September). The first period belongs to the wet season in Macau, when frequent rain generates stormwater runoff and soil loss, which makes the value of color relatively higher in the first period. By comparison, there is less precipitation in the second period and coastal water is mainly influenced by seawater, so the value of Cl is higher in this period. Additionally, due to the land-based pollution effects discussed above, the coastal water in the second period has higher concentrations of COD, TP, and NH4+.

Fig. 4 Temporal variations of discriminant parameters derived from backward DA

3.4 Spatial variation in water quality

The significance test in spatial-DA was carried out in the same way as for temporal-DA (Table 3). The values of Wilks’ lambda (0.61–0.63) were moderate, which suggests that the spatial-DA in this study was reasonably valid and effective.
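For reference, Wilks' lambda is the ratio of the determinant of the within-group sums-of-squares-and-cross-products matrix to that of the total matrix; values near 0 indicate well-separated groups and values near 1 indicate little separation. The sketch below computes it with NumPy on synthetic data. The group sizes and means are assumptions chosen for illustration and are unrelated to Table 3.

```python
import numpy as np


def wilks_lambda(X, labels):
    """Wilks' lambda = det(W) / det(T) for cases X and group labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centered = X - X.mean(axis=0)
    T = centered.T @ centered  # total SSCP matrix
    W = np.zeros_like(T)       # pooled within-group SSCP matrix
    for g in np.unique(labels):
        Xg = X[labels == g]
        dg = Xg - Xg.mean(axis=0)
        W += dg.T @ dg
    return np.linalg.det(W) / np.linalg.det(T)


# Synthetic data: two groups of 30 cases whose means differ on both variables.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=1.0, size=(30, 2))
group_b = rng.normal(loc=3.0, scale=1.0, size=(30, 2))
X = np.vstack([group_a, group_b])
labels = np.array(["A"] * 30 + ["B"] * 30)
lam = wilks_lambda(X, labels)
print("Wilks' lambda:", lam)
```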

Table 3 Classification matrix for backward DA of spatial variations

Discriminant functions (DFs) and classification matrices (CMs) obtained from the backward stepwise mode of DA are shown in Tables 3 and 4. DA produced a CM with close to 85% correct assignments using only five discriminant parameters. Thus, the spatial-DA results suggest that pH, Cl, TSS, color, and NH4+ were the most significant parameters for discriminating between Group A and Group B, accounting for most of the expected spatial variation in water quality.

Table 4 Classification functions coefficients for backward stepwise DA of spatial variation

Box and whisker plots of the discriminating parameters identified by spatial backward stepwise DA were constructed to evaluate different patterns associated with spatial variation in water quality (Fig. 5). The average pH, color, Cl and TSS were higher in Group A than in Group B, whereas the average NH4+ was lower in Group A than in Group B.

Fig. 5 Spatial variations of discriminant parameters derived from backward DA

As mentioned above, monitoring sites in Group A face the open coastal water and are strongly affected by seawater intrusion, which induces high levels of Cl. Meanwhile, wind and wave action in the estuary and the open coastal water conditions further lead to high levels of total suspended solids (TSS). In Group B, all sites are located in the western part of Macau peninsula (Fig. 1). The reason why coastal water quality in Group B is poorer than in Group A was discussed in Sect. 3.2. Domestic wastewater from untreated or overflowing sewer pipes constituted a major pollution source for the monitoring sites in Group B, which would cause a higher average NH4+ concentration and lower values of DO and pH. High levels of dissolved organic matter consume large amounts of oxygen, leading to anaerobic fermentation processes and the formation of ammonia and organic acids. Hydrolysis of these acidic materials causes a decrease in water pH (Vega et al. 1998; Xi et al. 1999; Singh et al. 2004). Therefore, the average pH value is lower in Group B than in Group A. The result of spatial-DA also supported the trends of discriminant parameters in water quality.

Based on the above results, backward stepwise DA proved to be a valuable tool for recognizing the discriminant parameters in temporal and spatial variations of coastal water. Additionally, it is essential to strengthen the monitoring accuracy of pH, color, Cl, TSS, COD, DO, NH4+, and TP to clearly identify variations in future. Furthermore, pollution at the Group B sites was relatively serious and should be controlled.

3.5 Identification of potential pollution sources in monitoring sites

PCA was further applied to the standardized log-transformed data set (14 parameters) to examine differences between Group A and Group B and to identify the latent factors behind the spatial variability, as shown in Table 5. The input data matrices (variables × cases) for PCA were [14 × 126] for Group A and [14 × 480] for Group B. PCA of the two data sets yielded five PCs each for Group A and Group B, explaining 71.44 and 61.18% of the total variance in the respective water quality data sets. The corresponding VF loadings and variance explained are presented in Table 5.

Table 5 Loadings of 14 measured variables on VARIMAX rotated factors of two spatial clusters by CA
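For readers unfamiliar with the rotation behind the varifactors, the sketch below shows a widely used iterative varimax algorithm based on singular value decomposition. It is an illustration only: the loading matrix is synthetic, Kaiser normalization is not applied, and the source does not describe the rotation settings actually used.

```python
import numpy as np


def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings."""
    return float(np.var(np.asarray(loadings) ** 2, axis=0).sum())


def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonally rotate a (variables x factors) loading matrix."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    R = np.eye(k)
    objective = 0.0
    for _ in range(max_iter):
        rotated = L @ R
        col_ss = (rotated ** 2).sum(axis=0)
        # Gradient-like matrix whose polar factor is the next rotation.
        target = L.T @ (rotated ** 3 - rotated @ np.diag(col_ss) / p)
        u, s, vt = np.linalg.svd(target)
        R = u @ vt
        new_objective = s.sum()
        if new_objective < objective * (1 + tol):
            break
        objective = new_objective
    return L @ R, R


# Synthetic example: a simple structure deliberately mixed by a 30 degree turn.
simple = np.array([[0.9, 0.0], [0.8, 0.0], [0.0, 0.7], [0.0, 0.9]])
theta = np.pi / 6
mix = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta), np.cos(theta)]])
unrotated = simple @ mix
rotated, R = varimax(unrotated)
print(np.round(rotated, 3))
```

Because the rotation is orthogonal, the communality of each variable (its row sum of squared loadings) is unchanged; only the distribution of loadings across factors becomes easier to interpret.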

As shown in Table 5, for Group A, among the five VFs, VF1, explaining 26.10% of the total variance, had strong positive loadings on F.coli and E.coli. Thus, VF1 represented domestic wastewater contaminated by fecal pollution (Jiang 2003). VF2 (15.08% of the total variance) had strong positive loadings on turbidity and TSS. Thus, VF2 represented soil weathering and subsequent runoff (Singh et al. 2005), which is closely related to the natural conditions of the estuary and open coastal water. VF3 (13.53% of the total variance) had strong positive loadings on Cl and COD. Thus, VF3 represented organic pollution. VF4 (8.98% of the total variance) and VF5 (7.76% of the total variance) had strong positive loadings on NH4+ and NO2−, and thus both represented nitrogenous nutrient pollution.

The pollution structure of Group B was similar to that of Group A. VF1, which explained 21.59% of the total variance, had strong positive loadings on F.coli and E.coli. Thus, VF1 represented domestic wastewater contaminated by fecal pollution (Jiang 2003). VF2 (13.67% of the total variance) had strong positive loadings on Cl and COD. Thus, VF2 represented organic pollution and salt. VF3 (10.79% of the total variance) had strong positive loadings on electrical conductivity (EC). Thus, VF3 represented the natural influence of seawater. VF4 (9.52% of the total variance) had medium positive loadings on color, indicating unidentified sources. VF5 (8.60% of the total variance) had strong positive loadings on NO3− and NO2−, and thus represented nitrogenous nutrient pollution.

According to the PCA results, domestic wastewater contaminated by fecal pollution was the most important potential pollution source for both Group A and Group B. Soil loss and subsequent runoff are also a major pollution source for Group A because of its location in the estuary and open coastal area, which is easily affected by wind and waves. By comparison, organic pollution is another important latent pollution source for Group B, since domestic wastewater discharges from the dense combined sewer system frequently occur in the old urban areas around this group’s sites. These results are consistent with the spatial variation of coastal water quality identified by CA and DA in the sections above.

4 Conclusions

Multivariate statistical analysis including CA, DA, and PCA was successfully applied to explore and identify temporal and spatial variation and potential pollution sources in coastal water quality along Macau peninsula, indicating that multivariate techniques are effective and useful for coastal water quality management.

Hierarchical CA grouped the 12 months into two periods, June–September and the remaining months, and divided the entire area into two clusters, one located on the western side and the other on the eastern and southern sides of Macau peninsula. Through backward stepwise DA, pH, Cl, TSS, color, and NH4+ were identified as discriminant variables of spatial variation, and TP, Cl, color, NH4+, DO, and COD as discriminant variables of temporal variation, with 84.82 and 76.54% correct assignments, respectively. Domestic wastewater contaminated by fecal pollution, organic pollution and soil losses are among the major sources of coastal water quality deterioration along Macau peninsula. Based on these findings, the spatio-temporal pattern of coastal water quality captured by in situ monitoring along Macau peninsula was recognized. As a result, an optimal future spatial–temporal monitoring network with lower cost should be designed in terms of dates, sites, water quality parameters and potential land-based pollution sources when conducting in situ monitoring of coastal water quality along Macau peninsula. This study illustrates that the application of multivariate statistical techniques helps to gain knowledge for further optimizing the monitoring network and controlling coastal water quality along Macau peninsula.