Introduction

In recent years, increasing attention has been given to surface water quality. The quality of surface water is an essential component of the natural environment and is considered the main factor controlling environmental health and potential hazards. Studies have shown that the quality of surface water is commonly determined by both natural and anthropogenic influences, including catchment geology, atmospheric inputs, anthropogenic inputs, and climatic conditions (Shrestha and Kazama 2007; Altın et al. 2009; Saim et al. 2009; Shokrzadeh and Saeedi Saravi 2009; Najar and Khan 2012). Owing to intensive human activities, anthropogenic inputs from urban, mining, industrial, and agricultural activities are the primary factors affecting surface water quality. Surface waters, including rivers, lakes, estuaries, and seas, are highly susceptible to pollution because they are easily accessible for wastewater disposal, especially those close to highly urbanized regions (Singh et al. 2004). Therefore, a monitoring program capable of providing a reliable estimate of surface water quality is necessary in order to evaluate spatial and temporal variations. Comparing the values of environmental variables with existing guidelines is the most common method of water quality assessment. However, this method does not readily give information regarding the status of the pollution sources (Debels et al. 2005).

Environmental datasets are usually complex and contain a large amount of information with internal relationships among variables, often in a partially hidden structure (Saim et al. 2009; Praveena et al. 2011). Indeed, the large and complicated data matrices that result from surface water monitoring programs make the observed quality data difficult to interpret and evaluate. Thus, the analysis of such complex data requires statistical techniques, especially chemometric (multivariate statistical) techniques, to assess the water quality with respect to sustainability (Simeonov et al. 2004; Singh et al. 2004). Multivariate statistical techniques are powerful tools for analyzing large datasets, classifying similarity, and assessing the impact of humans on water quality and ecosystem conditions (Shrestha and Kazama 2007; Praveena et al. 2011). These techniques have been widely applied in a variety of environmental applications, which suggests their versatility in handling various types of data (Shrestha and Kazama 2007; Praveena et al. 2008; Krishna et al. 2009). Comparable studies include the evaluation of temporal and spatial variations and the interpretation of water quality in surface water (Shrestha and Kazama 2007; Hussain et al. 2008; Kazi et al. 2009; Krishna et al. 2009; Najar and Khan 2012; Varol et al. 2012) and groundwater (Aris et al. 2007; Krishna et al. 2009; Aris et al. 2012). Hence, multivariate techniques can be considered a valuable means for the reliable management of water resources through the identification of the possible sources that influence the water system (Kazi et al. 2009).

Hierarchical cluster analysis (HCA) and principal component analysis (PCA) have been frequently applied to analyze the similarities among sampling sites and to apportion the sources of pollution parameters in surface water (Akbal et al. 2011; Aris et al. 2012). HCA coupled with PCA is a powerful pattern recognition technique. HCA can be used to explain the interrelations among variables or sampling sites (Reghunath et al. 2002; Singh et al. 2005) and to group the objects of interest into clusters based on the similarity within a class and the dissimilarities among different classes (Bu et al. 2010; Praveena et al. 2011). PCA, in turn, is frequently employed in hydrochemical, geological, and hydrogeological applications (Akbal et al. 2011; Najar and Khan 2012). It is a dimension-reduction technique that provides information on the most significant parameters with a simpler representation of the data, as well as a reduction in the memory required and faster classification (Shrestha and Kazama 2007). The potential factors or sources that affect water systems can be identified by reducing the dimensionality of the dataset (Davis 1986; Huang et al. 2011; Praveena et al. 2011). The objective of this study is to investigate the spatial variations in surface water quality and to identify the potential sources of pollution of the Langat River. Chemometric methods (HCA and PCA) were used to evaluate the similarities between the sampling stations and to ascertain the contribution of the potential factors or pollution sources among 29 parameters at 30 different sampling points of the Langat River. The information obtained supports a holistic interpretation of the results and the use of selected parameters as source tracers for contamination.

Materials and methods

Site descriptions

The Langat River Basin occupies the south and southeastern parts of the state of Selangor in Malaysia (Fig. 1). The basin lies between latitudes 2° 40′ 152″ N and 3° 16′ 15″ N and longitudes 101° 19′ 20″ E and 102° 1′ 10″ E, with a total catchment area of approximately 1,815 km². The basin can be divided into three areas: the mountainous area, the hilly area, and the lowland area (DOA 1995). The main river course is 141 km long. The river flows from the high hills in the north towards the plains and turns westward towards the coast of the state of Selangor (Mokhtar et al. 2009). The major tributaries are the Semenyih River, the Labu River, and the Mantin River. The Langat River is essential to the Selangor population and serves as one of the most important freshwater ecosystems in Selangor. Besides providing potable water, the Langat River also supplies water for manufacturing and agricultural production. Two major impoundments (Langat Dam and Semenyih Dam) supply water to the entire basin. The source of the Langat River is on the Pahang–Selangor border, where the hilly terrain reaches up to 1,500 m above mean sea level. The basin has two estuaries: one on the northeastern side, where the river water flows into the Lumut Strait, and the other on the southern side, which flows directly into the Strait of Malacca (Mokhtar et al. 2009). Water samples were collected from sampling sites extending from Dengkil to these two estuaries.

Fig. 1 Map of sampling stations in Langat River

The Langat River Basin receives an annual rainfall of 1,500 to 2,900 mm. The basin experiences an average temperature of 32 °C throughout the year with a relative humidity of approximately 80 %. The basin is underlain by schist, phyllite, and granite rock formations of Permian age. The bedrock in the mountainous area includes Permian igneous rocks, Pre-Devonian schist, and phyllite of the Hawthornden Formation (Gobbett and Hutchison 1973). The bedrock in the hilly area is predominantly Permo-Carboniferous meta-sandstone, consisting mainly of quartzite and slates of the Kajang Formation and Kenny Hill Formation (Gobbett and Hutchison 1973; Taha 2003). The lowland area consists of Quaternary deposits of the Beruas, Gula, and Simpang Formations. These formations overlie the sedimentary bedrock of the Kenny Hill Formation and Kajang Formation and grow progressively younger and thicker toward the coast (Gobbett and Hutchison 1973). The Quaternary deposits are made up of marine and continental deposits, which consist of gravel, sand, clay, and silt (JICA and MGDM 2002; Taha 2003).

Field sampling and preservation

The sampling was carried out in the rainy season (December 2010). The intense rainfall during the rainy season erodes the topsoil, and surface runoff carries the accumulated pollutants into the river. Consequently, elevated concentrations of certain pollutants are more likely to be evident in the rainy season. Triplicate water samples were collected from 30 sampling stations and were homogenized. During sampling, the pre-cleaned polyethylene bottles were rinsed thoroughly with the river water to be collected and then filled with running water while facing the direction of the flow. In order to prevent biochemical and surface reactions in the water samples during transportation and storage, each sample bottle was filled completely without entrapping air bubbles. Each bottle was labeled with its corresponding sampling station and time of sampling. The collected samples were kept at 4 °C to minimize microbial activity in the water (APHA 2005). Colloidal or suspended particulate material in water samples can interfere with metal analysis. The samples were therefore filtered through a 0.45 μm cellulose acetate membrane filter (Whatman Millipore, Clifton, NJ) immediately after being transported to the laboratory. This procedure is crucial to prevent clogging during analysis with spectrometry instruments and to obtain the dissolved ions for metal analysis (APHA 2005). The samples were then acidified with HNO3 to pH <2 in order to prevent precipitation of components such as metal oxides and hydroxides and to retard any biological activity (APHA 2005).

Water analyses

Multiparameter probes (SevenGo pro probe and SevenGo Duo pro probe, Mettler Toledo AG, Switzerland) were used to conduct in situ measurements of electrical conductivity (EC), total dissolved solids (TDS), salinity, redox potential (Eh), and pH. The temperature and dissolved oxygen (DO) were measured using a YSI 52 dissolved oxygen meter (YSI Inc., Yellow Springs, Ohio). All probes were calibrated prior to sampling. Bicarbonate (titration method using 0.02 N HCl) and chloride ions (argentometric method using 0.0141 N AgNO3) were analyzed on site using unfiltered samples (APHA 2005). Meanwhile, the filtered samples were separated into two polyethylene bottles. The first bottle was for subsequent analysis of sulfate (SulfaVer 4 HACH method) and nitrate (NitraVer 5 HACH method), and the second bottle was for the determination of cations (Ca, Na, Mg, and K) and metals (27Al, 75As, 138Ba, 9Be, 111Cd, 59Co, 63Cu, 52Cr, 57Fe, 55Mn, 60Ni, 208Pb, 80Se, and 66Zn). The cations (Ca, Na, Mg, and K) were analyzed by flame atomic absorption spectrometry (FAAS, Shimadzu AA6800), while the trace metals were analyzed by inductively coupled plasma mass spectrometry (ICP-MS, ELAN DRC-e, Perkin Elmer).

Quality control and quality assurance were applied to the samples and data collected in order to ensure the overall precision and accuracy of the data. Sampling, preservation, and transportation of the water samples to the laboratory followed the Standard Methods for the Examination of Water and Wastewater (APHA 2005). All the reagents used were of analytical grade or equivalent and free from contaminants. All the laboratory apparatus was pre-cleaned with 5 % (v/v) concentrated nitric acid (HNO3) and then rinsed with distilled water (APHA 2005). This procedure is crucial to ensure that any contaminants and traces of cleaning reagent are removed before the analysis (APHA 2005). Polyethylene bottles (free from material that may contain metals) were used for collecting the water samples in order to minimize interference in the heavy metal analysis (APHA 2005). The precision of the results was assessed by analyzing triplicate samples (n = 3) and calculating the relative standard deviation. Blanks and calibration standards were used throughout the FAAS and ICP-MS analyses. Standard solutions were prepared from stock standard solutions with Milli-Q water (water resistivity >18.2 MΩ·cm at 25 °C; Millipore, MA, USA). Blanks were determined for background correction. The concentrations of trace metals were expressed in micrograms per liter and those of cations in milligrams per liter. The accuracy of the ICP-MS performance was assessed with external standards, which were prepared by diluting the ICP Multi-Element Mixed Standard III (Perkin Elmer) into a series of concentrations with the same acid mixture used for sample dissolution. The recoveries of trace elements ranged from 95 to 105 % (that is, within ±5 % of the expected value), as shown in Table 1.
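The recovery check described above can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function names and all readings are hypothetical, and only the 95 to 105 % acceptance window is taken from the text.

```python
def percent_recovery(measured, expected):
    """Return the recovery (%) of a measured concentration against its expected value."""
    if expected <= 0:
        raise ValueError("expected concentration must be positive")
    return 100.0 * measured / expected


def within_acceptance(recovery, low=95.0, high=105.0):
    """Check a recovery against the 95 to 105 % window quoted in the text."""
    return low <= recovery <= high


# Hypothetical external-standard readings in micrograms per liter: (measured, expected).
standards = {"Cu": (9.8, 10.0), "Pb": (10.4, 10.0), "Zn": (9.2, 10.0)}

recoveries = {metal: percent_recovery(m, e) for metal, (m, e) in standards.items()}
# Metals whose recovery falls outside the window would need re-analysis.
flagged = [metal for metal, r in recoveries.items() if not within_acceptance(r)]
```

In this invented example only Zn falls outside the window, so only Zn would be flagged.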

Table 1 Percentage recoveries of trace metals by ICP-MS

Data analyses

All statistical analyses were performed using PASW Statistics 18 (formerly known as SPSS Statistics 18, or SPSS Base). ANOVA was applied to test for significant differences in all water quality variables among stations. A post hoc test was performed using the least significant difference test at a significance level of 0.05. The chemometric approach was performed through HCA and PCA (Singh et al. 2004; Praveena et al. 2011). HCA was first applied to examine the spatial variations among the stations, followed by PCA to extract and distinguish the potential factors or sources of pollution contributing to the variations in the water quality measures.
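The ANOVA step can be sketched in Python as follows. This is a swapped-in illustration using SciPy rather than the PASW workflow used in the study; the helper name and the EC readings are hypothetical, and the least significant difference post hoc test is not reproduced here.

```python
from scipy import stats


def varies_among_stations(replicates_by_station, alpha=0.05):
    """One-way ANOVA of a single variable across stations.

    replicates_by_station maps a station label to its replicate readings.
    Returns the F statistic, the p value, and whether p < alpha.
    """
    groups = list(replicates_by_station.values())
    f_stat, p_value = stats.f_oneway(*groups)
    return f_stat, p_value, bool(p_value < alpha)


# Hypothetical triplicate EC readings for three stations (values are invented).
ec = {
    "LY 1": [30100.0, 29800.0, 30000.0],
    "LY 15": [410.0, 395.0, 402.0],
    "LY 30": [120.0, 118.0, 125.0],
}
f_stat, p_value, significant = varies_among_stations(ec)
```

The same call would be repeated for each water quality variable, with one group of replicates per station.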

Hierarchical cluster analysis

In this study, HCA was used to investigate the groupings of the sampling points. This is the most common clustering approach, in which clusters are formed sequentially. It classifies variables or cases/observations into classes (clusters) on the basis of similarities within a class and dissimilarities between different classes with respect to predetermined characteristics (Boyacioglu and Boyacioglu 2008; Praveena et al. 2011). It is a useful technique for investigating spatial and temporal variations (Singh et al. 2005; Praveena et al. 2011). HCA was performed on the river water quality data to group similar sampling points within the Langat River. The squared Euclidean distance was used as the distance measure and Ward’s method as the linkage method (Singh et al. 2005). Ward’s clustering procedure has been reported to be the best-performing method (Reghunath et al. 2002) and was used here because it yields a larger proportion of correctly classified observations than other methods. The result of a hierarchical clustering procedure can be displayed graphically using a tree diagram, also known as a dendrogram. In a dendrogram, groups of high similarity are joined at small distances, while the dissimilarity between groups is represented by the maximum of all possible distances between clusters. A dendrogram provides a picture of the groups and their proximity with a dramatic reduction in the dimensionality of the original data (Shrestha and Kazama 2007; Alkarkhi et al. 2009a, b). Moreover, previous studies have shown the reliability of HCA in the classification of water quality and as a guide for future sampling strategies (Singh et al. 2004; Shrestha and Kazama 2007; Alkarkhi et al. 2009a, b; Praveena et al. 2011).
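A minimal sketch of Ward clustering of stations, written with SciPy instead of the software used in the study, is given below. Several points are assumptions: the variables are z-score standardized before clustering (the text does not state this), the function name and data are hypothetical, and SciPy's Ward linkage operates on Euclidean rather than squared Euclidean distances, so merge heights will differ from those reported by other packages.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import zscore


def cluster_stations(data, n_clusters=2):
    """Group stations (rows) described by water quality variables (columns)."""
    # Standardize each variable so that large-valued ones (e.g. EC) do not dominate.
    standardized = zscore(np.asarray(data, dtype=float), axis=0, ddof=1)
    # Ward's minimum-variance linkage; the result encodes the full merge tree.
    tree = linkage(standardized, method="ward")
    # Cut the tree into the requested number of clusters.
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return tree, labels


# Hypothetical stations: columns are EC, salinity, and pH (values are invented).
stations = np.array([
    [30000.0, 18.0, 7.9],
    [28000.0, 17.0, 7.8],
    [25000.0, 15.0, 7.7],
    [150.0, 0.10, 6.5],
    [120.0, 0.10, 6.4],
    [100.0, 0.05, 6.3],
])
tree, labels = cluster_stations(stations)
```

Passing `tree` to `scipy.cluster.hierarchy.dendrogram` would draw the corresponding tree diagram.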

Principal component analysis

Osman et al. (2012) stated that PCA is an exploratory, multivariate, statistical technique that can be used to examine data variability. It is a useful technique for finding the optimal ways of combining variables into a small number of subsets. PCA attempts to explain the variance of a large set of intercorrelated variables by transforming them into a smaller set of independent variables, thereby reducing the complexity of the data into principal components (Singh et al. 2004, 2005). In this study, PCA was applied to the datasets that had been pre-clustered by HCA. The eigenvalues and eigenvectors were extracted from the square matrix obtained by multiplying the data matrix by its transpose. The most significant components were extracted to reduce the contribution of variables with minimum significance. The components obtained were then subjected to varimax rotation to generate varimax factors and maximize the differences between the variables, thus facilitating interpretation of the data. A principal component provides information on the most meaningful parameters, which describe the whole dataset, affording data reduction with a minimum loss of the original information (Shrestha and Kazama 2007). The components are ordered in such a way that the first PC explains most of the variance in the data, and each subsequent one accounts for the largest proportion of the variability that has not been accounted for by its predecessors. This makes it possible to clearly differentiate the potential factors or pollution sources contributing to the variation in water quality.
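The extraction and rotation steps can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the correlation matrix is assumed as the square matrix, the varimax routine is the common SVD-based iteration, the number of retained components is passed in by hand, and all names and data are hypothetical.

```python
import numpy as np


def pca_loadings(data):
    """Eigenvalues (descending) and unrotated loadings from the correlation matrix."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scale each eigenvector by the square root of its eigenvalue.
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return eigvals, loadings


def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonally rotate loadings to maximize the varimax criterion."""
    n_vars, n_comp = loadings.shape
    rotation = np.eye(n_comp)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        target = rotated**3 - rotated @ np.diag((rotated**2).sum(axis=0)) / n_vars
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        new_criterion = s.sum()
        if new_criterion < criterion * (1.0 + tol):
            break
        criterion = new_criterion
    return loadings @ rotation


# Hypothetical data: 30 samples of 5 variables driven by 2 latent sources.
rng = np.random.default_rng(42)
latent = rng.normal(size=(30, 2))
mixing = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 1.0, 1.0]])
data = latent @ mixing + 0.3 * rng.normal(size=(30, 5))

eigvals, loadings = pca_loadings(data)
retained = loadings[:, :2]   # number of components chosen by hand here
rotated = varimax(retained)
```

Because the rotation is orthogonal, each variable's communality (the row sum of squared loadings) is unchanged; only the distribution of loadings across components changes, which is what eases interpretation.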

Results and discussion

Table 2 shows the descriptive statistics for the selected physicochemical parameters, major ions, and trace metal concentrations. The coefficients of variation (CV) for all variables were above 50 % except for pH and temperature. The CV of each variable was calculated as its standard deviation divided by its mean value, expressed as a percentage. The high CV values indicated high variation between sampling stations. In addition, one-way ANOVA also showed that the studied variables varied significantly among the stations (p < 0.05).
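As a worked illustration of the CV calculation, the following sketch computes the CV of each column of a small table. The function name and the readings are hypothetical, and the use of the sample standard deviation (ddof=1) is an assumption, since the text does not say which estimator was used.

```python
import numpy as np


def coefficient_of_variation(values):
    """CV (%) of each column: sample standard deviation divided by the mean."""
    values = np.asarray(values, dtype=float)
    return 100.0 * values.std(axis=0, ddof=1) / values.mean(axis=0)


# Hypothetical readings at four stations: columns are pH and EC (values are invented).
readings = np.array([
    [7.9, 30000.0],
    [7.5, 12000.0],
    [6.8, 400.0],
    [6.5, 120.0],
])
cv = coefficient_of_variation(readings)
# Variables exceeding the 50 % threshold mentioned in the text.
highly_variable = cv > 50.0
```

With these invented numbers, pH stays well below the 50 % threshold while EC exceeds it, mirroring the contrast described above.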

Table 2 Descriptive analysis for selected water matrices at Langat River (n = 90)

Hierarchical cluster analysis

In this study, HCA was applied to detect similarities between the sampling stations. A total of 29 variables, which included physicochemical parameters (temperature, EC, TDS, salinity, DO, pH, and Eh), major ions (HCO3, Cl, SO4, NO3, Ca, Na, K, and Mg), and trace metals (Al, As, Ba, Be, Cd, Co, Cu, Cr, Fe, Mn, Ni, Pb, Se, and Zn), were first subjected to HCA. The dendrogram of the sampling sites along the study area is presented in Fig. 2. It shows that the 30 sampling stations can be grouped into two clusters (namely group 1 and group 2). The results indicate the potential contributing sources, which are attributed to both natural and anthropogenic origins. Group 1 comprises sampling stations LY 1 to LY 14, which are mainly located in the vicinity of agricultural land and the Strait of Malacca (Fig. 1). The movement of seawater during tidal flow has contributed significantly to the high salinity, EC, and TDS, as well as to elevated ions, notably K, Mg, and Na, in the downstream reach of the Langat River. Thus, LY 1 to LY 14 were grouped in the same cluster, which can be described as being governed by seawater intrusion. Group 2 comprises sampling stations LY 15 to LY 30, which are mainly located in the eastern part of the study area. This area has experienced urbanization, and the land use pattern is predominantly that of urban activities and agricultural fields (DOA 1995; JICA and MGDM 2002; Juahir et al. 2011; Osman et al. 2012; Fig. 1). These sampling stations are mainly located further inland from the estuary and in close proximity to the major pollution sources, such as industrial and domestic discharge. As such, the sampling stations in group 2 receive minimal impact from seawater intrusion compared to group 1. In addition, the differences between group 1 and group 2 are substantiated by the change in water type from Na-Cl facies (LY 1 to LY 14) to Ca-HCO3 facies (LY 15 to LY 30) from downstream to upstream, as depicted in Fig. 3.

Fig. 2 Dendrogram showing hierarchical cluster analysis between stations

Fig. 3 Ternary plots for (a) cations and (b) anions of water samples in Langat River
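The Na-Cl and Ca-HCO3 water types referred to above can be illustrated with a simplified sketch that converts concentrations from mg/L to meq/L and names the dominant cation and anion. This is a simplification of a ternary-plot classification, not the authors' procedure; the function names and sample concentrations are hypothetical, while the equivalent weights are standard values (molar mass divided by ionic charge).

```python
# Equivalent weights in mg per meq (molar mass divided by ionic charge).
EQUIVALENT_WEIGHT = {
    "Ca": 20.04, "Mg": 12.15, "Na": 22.99, "K": 39.10,
    "HCO3": 61.02, "SO4": 48.03, "Cl": 35.45,
}
CATIONS = ("Ca", "Mg", "Na", "K")
ANIONS = ("HCO3", "SO4", "Cl")


def to_meq(sample_mg_per_l):
    """Convert ion concentrations from mg/L to meq/L."""
    return {ion: conc / EQUIVALENT_WEIGHT[ion] for ion, conc in sample_mg_per_l.items()}


def water_type(sample_mg_per_l):
    """Name the facies after the dominant cation and anion on a meq/L basis."""
    meq = to_meq(sample_mg_per_l)
    cation = max(CATIONS, key=lambda ion: meq.get(ion, 0.0))
    anion = max(ANIONS, key=lambda ion: meq.get(ion, 0.0))
    return f"{cation}-{anion}"


# Hypothetical samples in mg/L (values are invented for illustration).
estuary_sample = {"Ca": 400.0, "Mg": 1300.0, "Na": 10000.0, "K": 390.0,
                  "HCO3": 140.0, "SO4": 2700.0, "Cl": 19000.0}
upstream_sample = {"Ca": 20.0, "Mg": 3.0, "Na": 5.0, "K": 2.0,
                   "HCO3": 60.0, "SO4": 10.0, "Cl": 6.0}

estuary_type = water_type(estuary_sample)
upstream_type = water_type(upstream_sample)
```

Working in meq/L rather than mg/L matters because ions of different charge and mass are otherwise not comparable; for instance, Mg contributes more equivalents per milligram than Na.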

Principal component analysis

PCA was applied to the water quality dataset to identify the spatial sources of pollution within group 1 and group 2 of the Langat River (Tables 3 and 4). Based on the criterion of eigenvalues greater than 1, four components were extracted for group 1, explaining 85 % of the total variance (Table 3), whereas six components were extracted for group 2, explaining 88 % of the total variance (Table 4). Comparable loadings were observed in group 1 and group 2.
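The eigenvalue criterion used above can be sketched as follows. The function name and the eigenvalues are hypothetical; the sketch only shows how the number of retained components and their cumulative explained variance follow from a set of eigenvalues.

```python
import numpy as np


def select_by_eigenvalue(eigenvalues, threshold=1.0):
    """Count components with eigenvalue > threshold and their cumulative variance (%)."""
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    n_keep = int((eigenvalues > threshold).sum())
    # Each component's share of the total variance, in percent.
    explained = 100.0 * eigenvalues / eigenvalues.sum()
    return n_keep, float(explained[:n_keep].sum())


# Hypothetical eigenvalues of a 5-variable correlation matrix (they sum to 5).
n_keep, cumulative = select_by_eigenvalue([3.0, 1.5, 0.3, 0.15, 0.05])
```

For a correlation matrix the eigenvalues sum to the number of variables, so an eigenvalue above 1 marks a component that explains more variance than any single standardized variable.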

Table 3 Component loadings of river water quality variables on varimax rotated matrix for group 1
Table 4 Component loadings of river water quality variables on varimax rotated matrix for group 2

In Table 3 (group 1), PC 1 accounted for 35 % of the total variance. This component showed high loadings of EC, salinity, TDS, Ca, HCO3, SO4, Na, Mg, temperature, DO, Eh, pH, K, NO3, and As (Table 3). EC, salinity, and TDS are commonly regarded as indicators of the presence of dissolved ions, including inorganic salts and organic matter, in water (Reza and Singh 2010). Generally, EC and TDS increase as the concentration of dissolved ions increases. Similarly, as the dissolved salt concentration increases, the salinity also increases (Connell and Miller 1984; Elder 1988). This relationship is supported by the strong positive component loadings for EC, salinity, and TDS in PC 1 (Table 3). Meanwhile, the high loadings for the major ions (Ca, Na, Mg, K, HCO3, SO4, and NO3) in PC 1 may be explained by the mixing of freshwater (river) and seawater (Aris et al. 2007; Praveena et al. 2011; Aris et al. 2012). In addition, forest and agriculture are the primary land uses around the sampling locations included in group 1 (DOA 1995; Juahir et al. 2011; Fig. 1). Taking this into consideration, the high loadings of As and NO3 in PC 1 also imply a possible contribution from agricultural land. Furthermore, farm drainage during the rainy season leads to pollution of the river by the by-products of agricultural applications (Diagomanolin et al. 2004; Shokrzadeh and Saeedi Saravi 2009). Such relatively high loadings strongly indicate that the river water in group 1 is primarily controlled by seawater intrusion and agricultural discharges. PC 2, which accounted for 23.17 % of the total variance, consists of Co, Be, Al, Mn, Ba, Ni, Zn, and Fe (Table 3). This component was primarily contributed by trace metals. These elements, especially Al, Mn, Ni, Zn, and Fe, are constantly released into the environment through weathering processes (Alloway 1995). The high concentrations of Al, Fe, and Mn in the Langat River are due to the composition of the sediments, which is predominantly controlled by lithology. Ferralsols (oxisols and ultisols), which are rich in Al and Fe, are acidic and highly weathered (Alloway 1995). During the rainy season, intense rainfall erodes the topsoil and carries these elements into the river. This component can therefore be attributed to rock weathering. Meanwhile, Fe, Al, and Mn oxides have a profound effect in controlling the adsorption and flocculation of other elements in the sediment (Alloway 1995). PC 3, which accounted for 10 % of the total variance, consists of Pb, Cl, Cd, and Cr (Table 3). The negative loading of Pb and the positive loadings of Cl, Cd, and Cr suggest the occurrence of competition between these elements for binding sites (Campbell and Stokes 1985). Furthermore, Abdullah and Royle (1974) observed that an increase in salinity caused a decrease in concentrations of Zn, Cu, Fe and Cd, whereas a reverse trend was observed for Pb and Zn. PC 4, which accounted for 8 % of the total variance, consists of Se and Cu (Table 3). This component was attributed to pig farming activities (UPUM 2002; Lee et al. 2006). Copper sulfate is normally added to animal feed as an additive to control certain diseases (Sarmani et al. 1992). The by-products of pig feed contribute elevated Cu to the river via discharged effluent (Sarmani et al. 1992; UPUM 2002; Lee et al. 2006; Juahir et al. 2011).

Group 2 comprised sampling stations LY 15 to LY 30. In Table 4 (group 2), PC 1, which accounted for 35 % of the total variance, consists of salinity, EC, TDS, Mg, Na, Se, Cu, Cl, Ni, SO4, and Cr. The component loadings are comparable with those of PC 1 in group 1. The similar loadings of EC, salinity, TDS, and major ions indicate that seawater intrusion still influences the hydrochemistry of the study area. Meanwhile, the high loadings of Se and Cu in PC 1 may be attributed to the extraction of selenium dioxide from residues obtained during the purification of copper (Langner 2000; Hait et al. 2009). In addition, the fluctuation in flow between freshwater and seawater causes elevated salt concentrations, which, consequently, increase the competition between the cations and the trace metals for binding sites in the particulates (Connell and Miller 1984; Elder 1988). The cations, being more prominent, drive the trace metals into the overlying water column. As a result, metals may be desorbed from the sediment, thereby increasing their concentrations (Connell and Miller 1984; Elder 1988). PC 2, which accounted for 25 % of the total variance, consists of Pb, Be, Zn, Fe, Co, Al, As, Mn, and HCO3 (Table 4). The metals, including Pb, Be, Zn, Fe, Co, and Al, may be attributed to industrial activity, which is deemed to be closely related to the steelmaking industries (Shazili et al. 2006). In fact, the largest steelmaking industry in Malaysia is located in proximity to the upstream area (Sarmani 1989; Mokhtar et al. 2009), thus explaining the high loadings of these elements in PC 2. The high loading of Pb is related to heavy shipping traffic and the antifouling paints used (Goh and Chou 1997; Shazili et al. 2006; Berandah et al. 2010). Furthermore, intensive dredging, reclamation, construction, and shipping activities, which disturb the river currents, lead to re-suspension of sediment-bound trace elements in the environment, probably in soluble forms that are readily absorbed by aquatic organisms (Zulkifli et al. 2010). PC 3, which accounted for 12 % of the total variance, consists of pH, Eh, K, and temperature (Table 4). This component illustrates the influence of pH, Eh, and temperature on the quality of the river water. PC 4, PC 5, and PC 6, which explain about 6, 5, and 4 % of the total variance, respectively, have strong positive loadings on temperature as well as Ba, DO, Cd, Ca, and NO3. The presence of Ba and Cd may be attributed to industrial activity, which is deemed to be closely related to metal finishing processes such as electroplating, etching, and preparation of metal components (Shazili et al. 2006). The NO3 is possibly derived from geologic deposits, organic matter decomposition, untreated wastewater input, agricultural runoff, and atmospheric input (Alkarkhi et al. 2009a).

Conclusion

The present study applied the chemometric approach to investigate the spatial variation and identify the pollution sources in the Langat River, Malaysia. The HCA grouped the sampling stations into two clusters. Cluster 1 (LY 1 to LY 14) was heavily affected by seawater, while cluster 2 (LY 15 to LY 30) mainly corresponded to agricultural and industrial activities. The cluster results suggested that certain stations should be given high priority if remediation efforts are to be undertaken. PCA identified several intrinsic factors responsible for river pollution, from either natural or anthropogenic inputs. Four components explaining 85 % of the total variance were extracted for group 1, while six components explaining 88 % of the total variance were extracted for group 2. The results suggested that seawater intrusion, agricultural pollution, industrial pollution, and geological weathering were potential pollution sources for both groups. In conclusion, this study highlights the usefulness of the chemometric approach in delineating the factors that govern the spatial variability of hydrochemistry in a tropical river. It is evident that the chemometric approach is useful in providing a reliable classification on the basis of pollution status and identification of pollution sources. Such effort provides holistic information for effective river basin management and makes it possible to design a future spatial sampling strategy in an optimal manner.