Prediction of Water Quality Using Principal Component Analysis

Mahapatra, S. S.; Sahu, Mrutyunjaya; Patel, R. K.; Panda, Biranchi Narayan

doi:10.1007/s12403-012-0068-9


Published: 04 May 2012

Volume 4, pages 93–104, (2012)

Water Quality, Exposure and Health


Abstract

The groundwater is contaminated heavily with acidity, alkalinity, toxicity, heavy minerals, and microbes throughout the world due to population growth, urbanization and industrialization. Hence, evaluation of water quality of groundwater is extremely important to prepare for remedial measures. This paper presents application of an empirical approach for classification of water samples based on 10 quality parameters of water. In this research work, water samples from 10 sources in three different years and seasons have been collected to assess the quality of water. Q-mode principal component analysis has been applied to classify the water samples into four different categories considering parameters such as pH, DO, turbidity, TDS, hardness, calcium ion (Ca⁺⁺), chloride ion (Cl⁻), BOD, iron (Fe⁺⁺), sulfate ($\mathrm{SO}_{4}^{--}$). This classification will be useful for the planners and field engineers for taking ameliorative measures in advance for preventing the contamination of groundwater. The non-parametric method proposed here efficiently assesses water quality index for classification of water quality. The model can also be used for estimating water quality on-line but the accuracy of the model depends upon the judicious selection of parameters.


Introduction

Groundwater makes up about 20 % of the world’s fresh water supply, which is about 0.61 % of the entire world’s water including oceans and permanent ice. In India, almost 80 % of the rural population depends on untreated groundwater for domestic and agriculture purpose. In recent years, the rate of discharge of pollutants into the environment is continuously increasing due to rapid growth of population, urbanization and accelerated pace of industrialization. It causes contamination of both fresh water and groundwater. Groundwater pollution can occur where industrial waste is discharged into pits, ponds, lagoons, rivers enabling the waste to percolate to water table. The polluted water endangers not only the valuable human life but also causes considerable biological disorder in the organisms. In addition, they cause serious environmental pollution responsible for health hazards. Groundwater contamination and its management have become important because of far reaching impact on human health. Therefore, knowledge of the water quality and evaluation of water quality index (WQI) plays a significant role in water quality control and management. WQI is generally considered as a means of summarizing the various water quality parameters into a simple index. The index helps in interpreting the quality of water in a single numerical value (Horton, 1965; Brown et al., 1970; Dinius, 1972; Lohani and Todino, 1984).

Classification of WQI is useful for quantifying the heterogeneity existing in the system. Researchers such as Bhargava (1983), Swamee and Tyagi (2000), Sarkar and Abbasi (2006) and Tiwari and Mishra (1985) have worked in this direction and proposed mathematical and statistical models for classification of groundwater. Recently, Lumb et al. (2011) have reviewed various water quality classification models based on WQI. The various WQIs differ in the manner in which statistical integration and interpretation of parameter values are made. A totally different approach was adopted in the Canadian Water Quality Index, also known as the Canadian Council of Ministers of the Environment Water Quality Index (CCME WQI) (Lumb et al., 2011, 2006). The CCME WQI is used by many countries all over the world and was endorsed by the United Nations Environmental Program (UNEP) in 2007 as a model for the Global Drinking Water Quality Index (GDWQI). The most commonly used parameters in this model are dissolved oxygen, pH, turbidity, total dissolved solids, nitrates, phosphates, and metals, among others.

The present study proposes an empirical method of classification scheme for assessing water quality of groundwater. The correlation of water quality parameters sometimes results in unjustified classifications if water quality is expressed by an index. The proposed method is quite efficient and prediction quality is reasonably good. The method first calculates similarity coefficients for all the members in the data set using Euclidean distance as a similarity measure. The Euclidean distance matrix is used as input to the Q-mode Principal Component Analysis (PCA) to classify the data (Albadawi et al. 2005). The factor loadings of each member of data set on principal component are taken into account to cluster the water sample into appropriate group. The number of clusters is decided on the basis of percentage variation explained by the principal components.

Literature Review

The quality of groundwater has been studied earlier by various researchers. Among them, Karnchanawong and Ikeguchi (1993) have evaluated quality of well water near the Mae-Hia waste disposal site. Zhang et al. (1996) have predicted water quality index in 14 cities of China where the groundwater is polluted due to use of fertilizers for agricultural purposes. Lind et al. (1998) have studied the impact of mining activity on pH of groundwater and its effect on water quality. Maticie (1999) has observed the impact of agriculture on groundwater quality in Slovenia. Shamruck et al. (2001) have studied the impact of environmental parameters on quality of Nile Valley aquifer. Ammann et al. (2003) have evaluated groundwater pollution and its impact on water quality by run-off. Almasri and Kaluarachchi (2004) have reported on occurrence of nitrate in the groundwater in agricultural watersheds in Whatcom County, Washington. WQI is strongly dependent on various correlated parameters taken for the study.

However, identification of the suitability of the parameters is critical for accurate evaluation of WQI. Water quality is generally ascertained based on physical, chemical, and biological indicators such as pH, electrical conductivity (EC), total dissolved solids (TDS), total suspended solids (TSS), hardness, turbidity and contaminant concentrations, following guidelines provided by agencies such as the World Health Organization (WHO) (2006) and the Bureau of Indian Standards (BIS) (1991). Nagarajan and Priya (1999) have studied the groundwater quality deterioration in Tiruchirapalli, Tamil Nadu and found that TSS, iron and magnesium values are beyond the permissible limit. Singh and Parwana (1999) have studied the pollution load in the groundwater in Punjab state due to industrial waste water and found the presence of chromium and cyanide in groundwater beyond the permissible limit of drinking water standards. Jha and Verma (2000) have studied the physicochemical properties of drinking water in the town area of Godda district under Santal Pargana (Bihar). They have reported that most of the surface water quality parameters are within the limit of drinking water standards. However, well water is characterized by a very high concentration of chloride, chromium and selenium. Srinivas et al. (2000) have studied the groundwater quality of Hyderabad taking 32 tube well water samples and reported that electrical conductivity, TDS, total alkali, hardness, calcium, magnesium, sodium and chlorides are above the permissible limit according to WHO and Indian Standards. Chaudhari et al. (2004) have studied the quality of groundwater near an industrial area at Jalgaon (Maharashtra), and the WQI of the samples suggests that the water is not suitable for direct consumption. Shaji et al. (2009) have studied quality of water in mineral and industrial area of Kerala in India.
Waste materials near the factories are subjected to reaction with percolating rain water and reach the aquifer system and, hence, degrade the groundwater quality (Tyagi et al. 2003). Heavy metals constitute a very heterogeneous group of elements widely varied in their chemical properties and biological functions. They are persistent in nature; therefore, they get accumulated in soil and plants. Dietary intake of many heavy metals through consumption of plants and drinking water has long-term detrimental effect on human health (Sharma and Agarwal, 2005; Ubala et al., 2001; Sabal and Khan, 2008).

However, quality is a vague term that cannot easily be described using crisp data set e.g. good quality water cannot simply be described as having a pH value of 7.0 or above. Instead, water quality can best be described based on its degree of potability and potential usages rather than expressing its constituents in numerical terms. Fuzzy reasoning technique has also been applied in groundwater and surface water quality forecasting (Dahiya et al. 2007; Parinet et al. 2004; Singh et al. 2008). Some of the artificial neural network as well as multivariate analysis methods are also found to be more useful for determination of water quality based on fuzzy and principal component analysis (PCA) techniques (Iscen et al., 2008; Shrestha and Kazama, 2007).

A number of indices have been developed to summarize water quality data in an easily expressible and easily understood format (Couillard and Lefebvre 1985). Horton (1965) proposed the first water quality index (WQI), where a great deal of consideration has been given to the development of index methods. The basic differences among these indices are the way their sub-indices were developed. Walski and Parker (1974) used an exponential function to represent the sub-indices of various quality variables. Landwehr (1979) suggested the Pearson type 3-distribution function to represent the sub-indices of all the quality variables. Bhargava (1987) modified the exponential formula; Dinius (1987) used power function for the majority of sub-indices. Nives (1999), Swamee and Tyagi (2000) proposed aggregate index for water quality description. In addition, Harrison et al. (2000), Faisal et al. (2003), Ahmed et al. (2004) and Shiow-Mey et al. (2004), each have recently modified a water quality index. Some of the sub-indices have since been incorporated into water quality indices used by agencies such as the National Sanitation Foundation (NSF) (Ahmed et al. 2004). The most important WQIs belonging to environmental departments or agencies are the National Sanitation Foundation (NSF), British Columbia Water Act Quality Index, Oregon Water Quality Index, Stream Watch (Southern Indiana), Malaysian Water Quality Index, France Water Quality Index, French Creek Quality Index, Florida Stream Water Quality Index, British Columbia Water Quality Index, Canadian Water Quality Index, Taiwan Water Quality Index and Washington State Water Quality Index. Comparison among several WQI systems currently in use showed that none of them describes quality of water from mining effluent because most of sub-indices in current WQIs are not relevant to indicating changes in water quality brought about by mining activities.

Study Area

Odisha is a state in India located at an elevation of about 219 meters above mean sea level. In this study, groundwater quality of wells of urban area of Rourkela in Sundergarh district which is located at 84.54E longitude and 22.12N latitude is considered. Rourkela comes under tropical monsoon climate and is more like that of the Deccan Plateau. Being in the north-eastern corner of the Deccan Plateau, the climate is milder than the climate of the main Deccan region. The area of Rourkela is 200 square kilometers approximately. Red and laterite soils are found here which are quite rich in minerals. The area near Rourkela is rich in iron ore; hence a steel plant and other iron and steel industries are situated in the region. These industries are polluting the surrounding areas including groundwater resources. Large number of motor vehicles may also contribute in the release of heavy metals into surrounding environment. The climate is hot and dry during summer season. Normally, there is heavy rainfall due to south-west monsoon and light rainfall during the pre-monsoon seasons. The south-west monsoon usually onsets during second week of June and retreats by mid September. The humidity is generally high mostly in the monsoon and post-monsoon periods. The relative humidity is low during summer season. The mean values of the humidity, however, in a year range from 35 to 85 % and the annual average is 66 %. The Koel and Sankha Rivers meet at Vedvyas, Rourkela and flow as a single river called Brahmani. Hence, Rourkela is the confluence of three rivers: Koel, Sankha and Brahmani. The geographical location of study area is shown in Fig. 1.

Methodology for Sampling

In order to classify water quality into different clusters, a number of water samples were collected from 10 wells, whose locations are shown in Table 1. Water samples of three years have been taken into consideration for the study, with season-wise data for each year. Water samples of all the places and seasons were not included due to seasonal impacts such as heavy rainfall, encroaching heat, etc. Water samples from the different sampling stations were collected in standardized PET (polyethylene terephthalate) bottles, which are thermostated bottles. PET bottles of 1.5 liter capacity with stoppers were used for collecting samples. PET bottles can be used for collection of samples to analyze both organic and inorganic constituents in water. The bottles were washed thoroughly with 2 % nitric acid and subsequently rinsed with distilled water. The bottles were then preserved in a clean place. Before taking the water samples, all the bottles were rinsed with sample water 2–3 times. As all the physicochemical parameters were measured within 24 hours of sample collection, there is very little possibility of a change in the concentration of any parameter, including heavy metals. Each sample bottle was sealed airtight while submerged in water, and precautions were taken to remove any air bubbles present. Each container was clearly marked with the name and date of sampling. Various physicochemical parameters such as pH, turbidity, total dissolved solids (TDS), hardness, biochemical oxygen demand (BOD), dissolved oxygen (DO), chloride, sulfate, iron, and calcium have been taken for analysis. Physicochemical parameters such as pH, turbidity, dissolved oxygen (DO), and total dissolved solids (TDS) were measured using water analysis kit model 191 E. The methodologies adopted for determination of water quality parameters of the collected samples are shown in Table 2.

Table 1 The location of wells used for collection of water samples


Table 2 Test methods for various parameters of water samples


Water samples were collected from groundwater sources on monthly basis. However, study of physicochemical characteristics was made through seasonal observations. The seasons are broadly divided into three seasons such as summer (March to June), rainy (July to October) and winter (November to February). The summer is too hot (max temperature 48 °C) in this part of the country; hence, data have not been collected in summer season. The procedure for estimating 10 quality parameters such as pH, DO (dissolved oxygen), turbidity, TDS (total dissolved solids), hardness, calcium ions, chloride ions, BOD (biological oxygen demand), iron and sulfate is shown in Table 2. The samples were collected in three years: 2008, 2009, and 2010. The physicochemical characteristics of water samples are given in Tables 3a and 3b, respectively, for rainy and winter seasons. The notation used to denote samples is as follows: W stands for the well and the number following W denotes the well number. The letters r and w respectively denote the rainy and winter season. The numbers 01, 02, and 03 respectively denote the year of sampling, 2008, 2009, or 2010.

Table 3a Values of physicochemical parameters in rainy season


Table 3b Values of physicochemical parameters in winter season


The average values of water quality parameters in the two seasons are compared with the permissible limits prescribed in IS:10500, shown in Table 4. From Tables 3a and 3b it can be observed that most of the parameters are within the permissible limits. However, parameters such as TDS, hardness, calcium, chloride, and iron lie below the permissible limit, while DO lies above the limit. It may be noted that the value of turbidity decreases to almost half during the winter season as compared to the rainy season. Pearson correlation coefficients for the parameters in the rainy and winter seasons are shown in Tables 5a and 5b, respectively. A correlation coefficient of 0.5–0.75 is considered a moderate correlation between two variables (Montgomery and Runger 1999). In the rainy season, the highest correlation is observed between turbidity and pH (0.731), followed by calcium and hardness (0.699). In the winter season, the strongest correlation is observed between calcium and pH (0.529). The outliers in the data are removed through examination of scatter plots. The changes in average parameter values and in the correlation coefficients between seasons are attributed to human and industrial activities.
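The Pearson correlation coefficient used in Tables 5a and 5b can be sketched in a few lines of Python. This is only an illustration: the function names and the two short data series are hypothetical and are not taken from the study's tables, and the 0.5–0.75 band is the "moderate" range quoted above.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def is_moderate(r, low=0.5, high=0.75):
    """True when |r| falls inside the 0.5-0.75 'moderate' band."""
    return low <= abs(r) <= high

# Hypothetical readings for two parameters at five wells (not study data).
ph_values = [7.1, 7.4, 6.9, 7.8, 7.2]
turbidity_values = [3.0, 4.1, 2.5, 5.2, 3.9]
r = pearson(ph_values, turbidity_values)
print(round(r, 3), is_moderate(r))
```

Applying `pearson` to every pair of parameter columns would give a matrix of the kind reported in Tables 5a and 5b.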

Table 4 Permissible limits of parameters (IS:10500)


Table 5a Correlation matrix of parameters (rainy season)


Table 5b Correlation matrix of parameters (winter season)


Determination of Water Quality Index (WQI)

In the formulation of water quality index, the importance of various parameters depends on the intended use of water and water quality parameters are studied from the point of view of suitability for human consumption. The ‘standards’ (permissible values of various pollutants) for the drinking water are recommended by the Indian Council of Medical Research (ICMR). When the ICMR standards are not available, the standards of United States Public Health Services (USPHS), World Health Organization (WHO), Indian Standards Institution (ISI) and European Economic Community (EEC) are being quoted.

The quality rating $q_i$ for the ith water quality parameter is obtained from the relation

$$ q_{i} = 100(v_{i}/s_{i}) $$

(1)

where $v_i$ = value of the ith parameter at a given sampling station and $s_i$ = standard permissible value of the ith parameter. This equation ensures that $q_i = 0$ when a pollutant (the ith parameter) is absent in the water, while $q_i = 100$ if the value of this parameter is just equal to its permissible value for drinking water. Thus, the larger the value of $q_i$, the more polluted is the water with the ith pollutant. However, quality ratings for pH and DO require special handling. The permissible range of pH for drinking water is 7.0 to 8.5. Therefore, the quality rating for pH may be

$$ q_{\mathrm{pH}} = 100\bigl[\lvert v_{\mathrm{pH}} - 7.0\rvert/(8.5 - 7.0)\bigr] $$

(2)

where $v_{\mathrm{pH}}$ = value of pH, and $\lvert v_{\mathrm{pH}} - 7.0\rvert$ is the numerical difference between $v_{\mathrm{pH}}$ and 7.0, ignoring algebraic sign. Equation (2) ensures that $q_{\mathrm{pH}} = 0$ for pH = 7.0. In contrast to other pollutants, the case of DO is slightly complicated because the quality of water is enhanced if it contains more DO. Therefore, the quality rating $q_{\mathrm{DO}}$ has been calculated from the relation

$$ q_{\mathrm{DO}} = 100\bigl[(14.6 - v_{\mathrm{DO}})/(14.6 - 5) \bigr] $$

(3)

where $v_{\mathrm{DO}}$ = value of DO.
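A minimal Python sketch of Eqs. (1) through (3) is given here as an illustration. The function and argument names are hypothetical and not part of the original study; the pH rating uses the absolute difference from 7.0, as described for Eq. (2), and the default constants are the ones quoted in the equations.

```python
def quality_rating(value, standard):
    """Eq. (1): q_i = 100 * (v_i / s_i)."""
    return 100.0 * value / standard

def quality_rating_ph(ph, neutral=7.0, upper=8.5):
    """Eq. (2): based on the absolute difference between pH and 7.0."""
    return 100.0 * abs(ph - neutral) / (upper - neutral)

def quality_rating_do(do_mg_l, saturation=14.6, standard=5.0):
    """Eq. (3): more dissolved oxygen gives a lower (better) rating."""
    return 100.0 * (saturation - do_mg_l) / (saturation - standard)

# Hypothetical readings, used only to exercise the three formulas.
print(quality_rating(150.0, 300.0))   # parameter at half its permissible value
print(quality_rating_ph(7.0))         # neutral pH
print(quality_rating_do(14.6))        # oxygen-saturated water
```

The three calls correspond to the boundary cases discussed in the text: a pollutant at half its standard, neutral pH, and fully saturated dissolved oxygen.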

In Eq. (3), 14.6 is the solubility of oxygen (mg/l) in distilled water at 0 °C and 5.0 mg/l is the standard for drinking water. Equation (3) gives $q_{\mathrm{DO}} = 0$ when $v_{\mathrm{DO}} = 14.6$ mg/l and $q_{\mathrm{DO}} = 100$ when $v_{\mathrm{DO}} = 5.0$ mg/l. The more harmful a given pollutant is, the smaller is its permissible value for drinking water. So the ‘weights’ for various water quality parameters are assumed to be inversely proportional to the recommended standards for the corresponding parameters, i.e.

$$ W_{i} = \frac{k}{s_{i}} $$

(4)

where $W_i$ = unit weight for the ith parameter (i = 1, 2, 3, …, 10), $s_i$ = standard permissible value of the ith parameter, and k = constant of proportionality, which is determined from the following condition (k = 1 is assumed for the sake of simplicity):

$$ \sum_{i = 1}^{10} W_{i} = 1 $$

(5)

To calculate the WQI, first the sub-index $\mathrm{(SI)}_{i}$ corresponding to the ith parameter is calculated. It is given by the product of the quality rating $q_i$ and the unit weight $W_i$ of the ith parameter, i.e.

$$ \mathrm{(SI)}_{i} = q_{i}W_{i} $$

(6)

The overall WQI is then calculated by aggregating these sub-indices (SI) linearly. Thus, WQI can be written as

$$ \mathrm{WQI} = \Biggl[ \sum_{i = 1}^{10} q_{i}W_{i} \Biggl/ \sum_{i = 1}^{10} W_{i} \Biggr] $$

(7)

which, using Eq. (5), gives

$$ \mathrm{WQI} = \sum_{i = 1}^{10} q_{i}W_{i} $$

(8)

Water quality can be categorized into five classifications depending on WQI values. Water quality can be treated as excellent, good, poor, very poor, and unsuitable for drinking if WQI lies in the ranges of 0–25, 26–50, 51–75, 76–100, and above 100, respectively.
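The weighting and aggregation steps of Eqs. (4) through (8), together with the five classes above, can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, k is chosen here so that the weights satisfy Eq. (5), and the handling of non-integer WQI values that fall between the stated ranges (for example 25.4) is an assumption.

```python
def unit_weights(standards):
    """Eqs. (4)-(5): W_i = k / s_i, with k chosen so the weights sum to 1."""
    k = 1.0 / sum(1.0 / s for s in standards)
    return [k / s for s in standards]

def wqi(ratings, weights):
    """Eq. (7): weighted average of the quality ratings q_i."""
    return sum(q * w for q, w in zip(ratings, weights)) / sum(weights)

def classify(wqi_value):
    """Map a WQI value to one of the five classes (upper bounds assumed inclusive)."""
    if wqi_value <= 25:
        return "excellent"
    if wqi_value <= 50:
        return "good"
    if wqi_value <= 75:
        return "poor"
    if wqi_value <= 100:
        return "very poor"
    return "unsuitable for drinking"

# Hypothetical standards and ratings for three parameters (not study data).
standards = [1.0, 2.0, 4.0]
ratings = [20.0, 40.0, 60.0]
weights = unit_weights(standards)
index = wqi(ratings, weights)
print(round(index, 2), classify(index))
```

Because the weights are normalized to sum to one, dividing by `sum(weights)` in `wqi` leaves the result unchanged, which is the step from Eq. (7) to Eq. (8).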

The water quality index values for all the data shown in Tables 3a and 3b are obtained using Eqs. (1) through (8) and are shown in Table 6. It should be noted that WQI for all the data lies in the range from excellent to very poor for human consumption.

Table 6 The water quality index (WQI) values for all samples


Generation of Euclidean Distance Matrix

The parameters of water samples shown in Tables 3a and 3b possess different measuring scales. Therefore, they need to be normalized to reduce the scaling effect. A simple normalization procedure of dividing selected variables by their maximum value is adopted here. After normalization, all data vary from zero to one. Implementation of factor analysis requires the correlation matrix of the initial data set. The correlation matrix is obtained in the form of a Euclidean distance matrix (Hair et al. 2009). Euclidean distance is taken as the similarity measure and is defined as the square root of the sum of the squares of the differences between the values of the attributes of two water samples. Mathematically, it may be given as

$$ d(x,y)= \sqrt{\sum_{i=1}^{m}(x_i-y_i)^2} $$

(9)

where d(x,y) = Euclidean distance, and $x = (x_1, x_2, \ldots, x_m)$ and $y = (y_1, y_2, \ldots, y_m)$ represent the m attribute values of two samples. If the distance is zero, the two water samples are identical in their attribute values. If it is above zero, the Euclidean distance indicates the intensity of dissimilarity between the two water samples. The Euclidean distance matrix is generated considering all the water samples. The entry in the pth row and qth column of the matrix denotes the Euclidean distance between the pth and qth water samples. The Euclidean distance matrix is thus generated for use in Q-mode PCA.
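The normalization and distance steps described above can be sketched in Python as follows. The function names and the three sample rows are hypothetical (they are not values from Tables 3a and 3b), and the sketch assumes every parameter has a positive maximum.

```python
import math

def normalize_by_max(samples):
    """Divide each parameter (column) by its maximum over all samples."""
    maxima = [max(column) for column in zip(*samples)]
    return [[v / m for v, m in zip(row, maxima)] for row in samples]

def euclidean_distance(x, y):
    """Eq. (9): square root of the summed squared attribute differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_matrix(samples):
    """Symmetric matrix whose (p, q) entry is the distance between samples p and q."""
    return [[euclidean_distance(x, y) for y in samples] for x in samples]

# Hypothetical samples with two parameters each (e.g. pH and TDS).
raw = [[7.0, 100.0], [8.0, 200.0], [7.0, 100.0]]
scaled = normalize_by_max(raw)
distances = distance_matrix(scaled)
for row in distances:
    print([round(d, 3) for d in row])
```

The diagonal of the resulting matrix is zero, and identical samples (the first and third rows here) are also at distance zero from each other.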

Q-mode Principal Component Analysis

PCA is the most widely used, straightforward and quantitatively involved method for transforming a given set of interrelated variables into a new set of variables called the principal components (corresponding to factors in factor analysis). The set of principal components generated presents uncorrelated linear combinations of the original variables and accounts for the total variance of the original data. In this method, all the principal components are generated in such a way that they are orthogonal to each other; hence, correlation between them is zero. The principal components are generated in a sequentially ordered manner with decreasing contributions to the variance, i.e. the first principal component explains most of the variations present in the original data, and successive principal components account for decreasing proportions of the variance. This property means that the data points can be rigorously separated into distinct clusters when projected into a space spanned by the first few principal components, which are called factors. This achieves the dimensionality reduction objective of factor analysis. PCA can be broadly classified into two categories, viz., R-mode and Q-mode, based on application. If PCA is used to develop a structure among variables, it is referred to as an R-mode PCA. When PCA analysis is used to group cases, it is called a Q-mode PCA. It is customary to use rotation methods to transform the factors to simpler and more interpretable constructs. After rotation, each variable will be only related to one of the factors and each factor will have high correlation with only a small set of variables. In recent years, Q-mode PCA has been widely adopted by the researchers for classification of groundwater quality, coffee preference, gene regulatory process, and machines in cellular manufacturing (Albadawi et al., 2005; Dijksterhuis, 1998; Park et al., 2001; Singh et al., 2008; Woocay and Walton, 2008).

Assuming the sample parameters as the original set of variables, and the Euclidean distance matrix as an estimate of the correlation matrix explaining the correlations between each pair of samples, we proceed to use the PCA framework for grouping the samples into separate independent clusters. In the PCA method, the initial clusters are extracted out by the eigenvalue-eigenvector analysis of the similarity coefficient matrix as presented in Eq. (10):

$$ (S-I\lambda_{i}) Y_{i} = 0,\quad i = 1, 2, \ldots, P $$

(10)

where S is a P×P Euclidean distance matrix, I is the identity matrix, $\lambda_i$ are the characteristic roots (eigenvalues), and $Y_i$ are the corresponding eigenvectors.

Equation (10) is an eigenvalue-eigenvector equation, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_P$ are the real, nonnegative roots of the determinant polynomial of degree P given as

$$ \lvert S - I\lambda_{i} \rvert = 0 $$

(11)

This equation is solved for $\lambda_i$, and then $Y_i$ can be calculated using the values of $\lambda_i$ in Eq. (10). It is proven that the eigenvectors thus computed represent the unique set of P independent principal components (factors) of the data set, which maximize the variance (Basilevsky 1994). According to the PCA method, each of the P independent principal components (factors) can be written as a linear combination of the original variables (water samples), with the elements of the P eigenvectors as the coefficients of these linear combinations. Furthermore, the elements of these eigenvectors reflect the degree of association between each principal component (factor) and the sample, and are called the ‘factor loadings’ of the samples on the ith principal component in factor analysis terminology. Each of the P independent principal components represents a cluster. There should be low similarities among samples that are associated with different clusters and high similarities among samples strongly associated with the same cluster. With regard to sample size, Basilevsky (1994) assumes that the data set should be three to four times the number of variables.
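As a small worked illustration of Eqs. (10) and (11), consider a hypothetical 2×2 symmetric similarity matrix for two samples. The values are invented for illustration only and are not taken from the study's data.

$$ S = \begin{pmatrix} 1 & 0.6 \\ 0.6 & 1 \end{pmatrix}, \qquad \lvert S - I\lambda \rvert = (1-\lambda)^{2} - 0.36 = 0 \;\Rightarrow\; \lambda_{1} = 1.6,\ \lambda_{2} = 0.4 $$

Substituting each root into Eq. (10) gives the normalized eigenvectors

$$ Y_{1} = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad Y_{2} = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ -1 \end{pmatrix} $$

In this toy case the first component accounts for $\lambda_{1}/(\lambda_{1}+\lambda_{2}) = 1.6/2.0 = 80\,\%$ of the total variation, and both samples have the same loading (about 0.707) on it, so they would be associated with the same cluster.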

Results and Discussion

Considering the water sample parameters shown in Tables 3a and 3b as variables, and applying the above methodology, the corresponding eigenvalues and eigenvectors for the Euclidean matrix were calculated using SPSS version 14.0 software. It is customary to rotate the components to obtain an optimal distribution of variance among them, and varimax rotation was applied for this purpose. The number of factors (clusters) can then be selected based on the principal components with eigenvalues above one, the number of principal components forming the cliff in the scree plot, or Akaike’s information criterion (AIC) (Basilevsky, 1994; Kaiser, 1960; Valarmathie et al., 2009). In this work, the scree plot is used to select the number of clusters. It can be observed from the scree plots (Figs. 2a and 2b) for the two seasons that only four clusters are needed to group the water samples. These four groups explain 85.62 % of the variation in the rainy season and 88.22 % in the winter season. Therefore, the water sample data for the two seasons can be clustered into four groups. In order to make a fair comparison with WQI values, water quality is categorized into four classes depending on WQI values. Water quality is treated as excellent, good, poor, and very poor if WQI lies in the ranges 0–25, 26–50, 51–75, and 76–100, respectively.
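The percentage of variation explained by the retained components, and the eigenvalue-above-one rule mentioned above as an alternative to the scree plot, can be sketched as follows. The study itself relied on scree plots produced in SPSS; the function names and eigenvalues below are hypothetical and serve only to show the arithmetic.

```python
def explained_percentages(eigenvalues):
    """Percentage of total variation attributed to each component."""
    total = sum(eigenvalues)
    return [100.0 * ev / total for ev in eigenvalues]

def count_above_threshold(eigenvalues, threshold=1.0):
    """Number of components whose eigenvalue exceeds the threshold."""
    return sum(1 for ev in eigenvalues if ev > threshold)

# Hypothetical eigenvalues in decreasing order (not from the study).
eigenvalues = [6.0, 4.0, 3.0, 2.5, 0.9, 0.6]
percentages = explained_percentages(eigenvalues)
retained = count_above_threshold(eigenvalues)
cumulative = sum(percentages[:retained])
print(retained, round(cumulative, 2))
```

Here `cumulative` plays the role of the 85.62 % and 88.22 % figures quoted in the text: it is the share of total variation carried by the retained components.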

The next step is to assign the water samples to the various clusters. In this study, absolute values of the elements of the eigenvectors (the factor loadings) are used to identify the clusters for water samples. The rotated factor loadings are shown in Table 7. For example, for water sample W₁.r.01, the factor loading in cluster 1 is 0.93, which is higher than its loadings in the other clusters; hence it has a stronger relationship with cluster 1 than with clusters 2, 3, and 4. It is to be noted that all the samples of well number 1 are clustered into group 1 irrespective of season. From Table 6 it can be seen that the WQI of samples from well number 1 lies in the range 0–25, indicating that these water samples belong to the excellent category. Therefore, water samples belonging to cluster 1 (principal component 1 or PC1), cluster 2 (PC2), cluster 3 (PC3), and cluster 4 (PC4) are treated as of excellent, good, poor, and very poor quality, respectively. This procedure was repeated for all the samples to find their respective clusters. By the same argument, sample W₄.w.03 belongs to cluster 4 (very poor quality).
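The assignment rule described above (each sample goes to the cluster on which its absolute factor loading is largest) can be sketched as follows. Only the 0.93 loading of W₁.r.01 on cluster 1 comes from the text; all other loadings, the sample labels written with plain digits, and the function name are hypothetical.

```python
QUALITY_BY_CLUSTER = {1: "excellent", 2: "good", 3: "poor", 4: "very poor"}

def assign_clusters(loadings):
    """Map each sample to the 1-based component with the largest |loading|."""
    assignment = {}
    for sample, row in loadings.items():
        best = max(range(len(row)), key=lambda j: abs(row[j]))
        assignment[sample] = best + 1
    return assignment

# Rotated loadings on four components; values other than 0.93 are invented.
loadings = {
    "W1.r.01": [0.93, 0.10, 0.05, 0.02],
    "W4.w.03": [0.08, -0.15, 0.20, 0.88],
}
clusters = assign_clusters(loadings)
for sample, cluster in clusters.items():
    print(sample, cluster, QUALITY_BY_CLUSTER[cluster])
```

Using the absolute value means a strongly negative loading is treated the same as a strongly positive one, which matches the rule stated in the text.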

Table 7 Rotated factor loadings for water samples


If the classification of water samples by Q-mode PCA (four clusters) is compared with that of the WQI method, it is observed that both methods result in the same classification. It is found that nine samples belong to cluster 1, four samples to cluster 2, three samples to cluster 3, and three samples to cluster 4 for data from the rainy season. Similarly, nine samples belong to cluster 1, three samples to cluster 2, two samples to cluster 3, and three samples to cluster 4 for data from the winter season. The major advantage of clustering water quality data in an empirical manner lies in the fact that it avoids subjectivity in assigning weights to parameters for WQI calculation. Furthermore, it provides a computationally elegant method with less dependence on the choice of parameters. However, quality is a vague term that cannot be easily described using a crisp data set, e.g. good quality water cannot simply be described as having a pH value of 7.0 or above. Instead, water quality can best be described based on its degree of potability and potential usages rather than expressing its constituents in numerical terms. Therefore, the non-parametric empirical method proposed here is efficient for such application. Any new water sample can be placed in one of the above categories from the results of its physicochemical analysis, which is a routine analysis in the field and laboratory. The method is quite generic and can take care of any number of parameters. Although the method classifies water samples into proper groups, the performance of Q-mode clustering can be improved if the size of the data set is increased.

Conclusions

In this work, a PCA-based method has been proposed for the classification of water samples. It has been demonstrated that the methodology classifies the samples of the present data set into distinct clusters, and that a similar classification is obtained when the WQI is calculated for the same data set. The approach is efficient and computationally elegant. Importantly, it can be used in the field and in the laboratory because statistical packages are readily available and accessible. The present approach has several advantages over other approaches:

The physicochemical analysis of any water sample can be carried out conveniently in a laboratory, since no sophisticated or costly experimental setup is required for the purpose.
The classification matches closely with the classification based on calculation of the WQI.
It is supported by available commercial software programs such as SPSS, which facilitates industrial applications.
It gives the user the flexibility either to specify the required number of clusters in advance or to treat it as a dependent variable.

The method of classification of water samples proposed in this work is quite generic and works well for the present data set. Since this structured approach has already been applied in various fields of engineering owing to its strong foundation, it is expected to work efficiently irrespective of the data set. However, the efficiency of the method needs to be tested with water samples from other parts of the world.

References

Ahmed S, David KS, Gerald S (2004) Environmental assessment: an innovation index for evaluation water quality in streams. Environ Manag 34:406–414
Article Google Scholar
Almasri ML, Kaluarachchi JJ (2004) Assessment and management of long-term water pollution of groundwater in agriculture-dominated watersheds. J Hydrol 295(1–4):225–245
Article CAS Google Scholar
Albadawi Z, Bashir HA, Chen M (2005) A mathematical approach for the formation of manufacturing cells. Comput Ind Eng 48:3–21
Article Google Scholar
Ammann A, Eduard H, Sabine K (2003) Groundwater pollution by roof infiltration evidenced with multi-tracer experiments. Water Res 37(5):1143–1153
Article CAS Google Scholar
Brown RM, McClelland NI, Deininger RA, Ronald GT (1970) A water quality index Do we dar? Water Sew Works 117(10):339–343
Google Scholar
Basilevsky A (1994) Statistical factor analysis and related methods. Wiley, New York
Book Google Scholar
Bhargava DS (1983) Use of a water quality index for river classification and zoning of the Ganga River. Environ Pollut B 6:51–67
Article Google Scholar
Bhargava DS (1987) Nature and the Ganga. Environ Conserv 14:307–318
Article CAS Google Scholar
Chaudhari GR, Sohani D, Shrivastava VS (2004) Groundwater quality index near industrial area. Indian J Environ Prot 24(1):29–32
CAS Google Scholar
Couillard D, Lefebvre Y (1985) Analysis of water quality indices. J Environ Manag 21:161–179
Google Scholar
Dahiya S, Singh B, Gaur S, Garg VK, Kushwaha HS (2007) Analysis of groundwater quality using fuzzy synthetic evaluation. J Hazard Mater 147(3):938–946
Article CAS Google Scholar
Dijksterhuis G (1998) European dimensions of coffee: rapid inspection of a data set using Q-PCA. Food Qual Prefer 9(3):95–98
Article Google Scholar
Dinius SH (1972) Social accounting system for evaluating water resource. Water Resour Res 8(5):1159–1177
Article Google Scholar
Dinius SH (1987) Design of an index of water quality. Water Resou Bull 23(5):833–843
Article CAS Google Scholar
Faisal K, Tahir H, Ashok L (2003) Water quality evaluation and trend analysis in selected watersheds of the Atlantic region of Canada. Environ Monit Assess 88:221–248
Article Google Scholar
Hair JF Jr, Black WC, Babin BJ, Anderson RE (2009) Multivariate data analysis, 7th edn. Prentice Hall, New York
Google Scholar
Harrison TD, Cooper JAG, Ramm AEL (2000) Water quality and aesthetics of South African estuaries, Department of Environment Affairs and Tourism, South Africa. Available from www.environment.gov.za/soer/reports/ehi/ehi_ch4.pdf
Horton RK (1965) An index number system for rating water quality. J Water Pollut Control Fed 37(3):300–306
Google Scholar
Iscen CF, Emiroglu O, Ilhan S, Arslan N, Yilmaz V, Ahiska S (2008) Application of multivariate statistical techniques in the assessment of surface water quality in Uluabat Lake. Environ Monit Assess 144(1–3):269–276
Article CAS Google Scholar
Jha AN, Verma PK (2000) Physico-chemical properties of drinking water in town area of Godda district under Santal Pargana (Bihar), India. Pollut Res 19(2):245–247
CAS Google Scholar
Kaiser HF (1960) The application of electronic computers to factor analysis. Educ Psychol Meas 20:141–151
Article Google Scholar
Karnchanawong S, Ikeguchi SKT (1993) Evaluation of shallow well water quality near a waste disposal site. Environ Int 19(6):579–587
Article CAS Google Scholar
Landwehr JM (1979) A statistical view of a class of water quality indices. Water Resour Res 15(2):460–468
Article Google Scholar
Lind CJ, Carol L, Creasey CA (1998) In situ alteration of minerals by acidic groundwater resulting from mining activities: preliminary evaluation of method. J Geochem Explor 64(1–3):293–305
Article CAS Google Scholar
Lohani BN, Todino G (1984) Water quality index for Chao Phraya River. J Environ Eng 110(6):1163–1176
Article CAS Google Scholar
Lumb A, Sharma TC, Bibeault JF (2011) A review of genesis and evolution of water quality index (WQI) and some future directions. Water Qual Health Expos: 11–24
Lumb A, Halliwell D, Sharma T (2006) Application of CCME water quality index to monitor water quality: a case of the Mackenzie River Basin, Canada. Environ Monit Assess 113:411–429
Article CAS Google Scholar
Maticie B (1999) The impact of agriculture on groundwater quality in Slovenia: standards and strategy. Agric Water Manag 40(2–3):235–247
Article Google Scholar
Montgomery DC, Runger GC (1999) Applied statistics and probability for engineers. Wiley, New York
Google Scholar
Nagarajan P, Priya GK (1999) Groundwater quality deterioration in Tiruchirapalli, Tamilnadu. J Ecotoxicol Environ Monit 9(2):155–159
CAS Google Scholar
Nives SG (1999) Water quality evaluation by index in Dalmatia. Water Resour 33:3423–3440
Google Scholar
Park S, Choi D, Jun CH (2001) A clustering method for discovering patterns using gene regulatory processes. Genome Inform 12:249–251
CAS Google Scholar
Parinet B, Lhote A, Legube B (2004) Principal component analysis: an appropriate tool for water quality evaluation and management-application to a tropical lake system. Ecol Model 178:295–311
Article CAS Google Scholar
Sabal D, Khan TI (2008) Fluoride contamination status of groundwater in Phulera tehsil of Jaipur district, Rajasthan. J Environ Biol 29:871–876
CAS Google Scholar
Sarkar C, Abbasi SA (2006) Qualidex—a new software for generating water quality indices. Environ Monit Assess 119:201–231
Article CAS Google Scholar
Shaji C, Nimi H, Bindu L (2009) Water quality assessment of open wells in and around Chavara industrial area, Quilon, Kerala. J Environ Biol 30(5):701–704
CAS Google Scholar
Shamruck M, Corapcioglu MY, Fayek AA, Hassona (2001) Modeling the effect of chemical fertilizers on groundwater quality in the Nile Valley aquifer, Egypt. Groundwater 39(1):59–67
Article Google Scholar
Sharma RK, Agarwal M (2005) Biological effects of heavy metals. J Environ Biol 26:301–313
CAS Google Scholar
Shrestha S, Kazama F (2007) Assessment of surface water quality using multivariate statistical techniques: a case study of the Fuji River Basin, Japan. Environ Model Softw 22:464–475
Article Google Scholar
Singh KP, Parwana HK (1999) Groundwater pollution due to industrial wastewater in Punjab state and strategies for its control. Indian J Environ Prot 19(4):241–244
CAS Google Scholar
Singh B, Sudhir D, Sandeep J, Garg VK, Kushwaha HS (2008) Use of fuzzy synthetic evaluation for assessment of groundwater quality for drinking usage: a case study of southern Haryana. Indian Environ Geol 54:249–255
Article Google Scholar
Singh UK, Kumar M, Chauhan R, Jha PK, Ramanathan AL, Subramanian V (2008) Assessment of the impact of landfill on groundwater quality: a case study of the Pirana site in western India. Environ Monit Assess 141(1–3):309–321
Article CAS Google Scholar
Srinivas C, Shankar R, Venkateshwar C, Rao MSS, Reddy RR (2000) Studies on groundwater quality of Hyderabad. Pollut Res 19(2):285–289
CAS Google Scholar
Swamee PK, Tyagi A (2000) Describing water quality with aggregate index. J Environ Eng 126(5):451–455
Article CAS Google Scholar
Shiow-Mey L, Shang-Lien, Shan-Hsien W (2004) A generalized water quality index for Taiwan. Environ Monit Assess 96:35–52
Article Google Scholar
Tyagi P, Budhi D, Sawhney RL (2003) A correlation among physico-chemical parameters of groundwater in and around Pithampur industrial area. Indian J Environ Prot 23(11):1276–1282
CAS Google Scholar
Tiwari TN, Mishra M (1985) A preliminary assignment of water quality index of major Indian Rivers. Indian J Environ Prot 5(4):276–279
CAS Google Scholar
Ubala B, Farooqui M, Arif M, Zaheer A, Dhule DG (2001) Regression analysis of groundwater quality data of Chikalthana industrial area, Aurangabad (Maharashtra). Orient J Chem 17(2):347–348
Google Scholar
Valarmathie P, Srinath MV, Dinakaran K (2009) Increased performance of clustering high dimensional data through dimensionality reduction technique. J Theor Appl Inf Technol 5(6):731–733
Google Scholar
Walski TM, Parker FL (1974) Consumers water quality index. J Environ Eng Div 100(3):593–611
Google Scholar
WHO (2006) Guidelines for drinking water quality first addendum to 3rd edn (I) recommendations, Geneva, Switzerland
Woocay A, Walton J (2008) Multivariate analyses of water chemistry: surface and ground water interactions. Groundwater 46(3):437–449
Article CAS Google Scholar
Zhang WL, Tian X, Zhang N, Li XQ (1996) Water pollution of groundwater in northern China. Ecosyst Environ 59(3):223–231
Article CAS Google Scholar

Download references

Author information

Authors and Affiliations

Department of Mechanical Engineering, National Institute of Technology, Rourkela, India
S. S. Mahapatra
Department of Civil Engineering, National Institute of Technology, Rourkela, India
Mrutyunjaya Sahu
Department of Chemistry, National Institute of Technology, Rourkela, India
R. K. Patel
Department of Manufacturing Science, Veer Surendra Sai University, Burla, India
Biranchi Narayan Panda

Corresponding author

Correspondence to S. S. Mahapatra.

About this article

Cite this article

Mahapatra, S.S., Sahu, M., Patel, R.K. et al. Prediction of Water Quality Using Principal Component Analysis. Water Qual Expo Health 4, 93–104 (2012). https://doi.org/10.1007/s12403-012-0068-9

Received: 12 December 2011
Revised: 01 April 2012
Accepted: 05 April 2012
Published: 04 May 2012
Issue Date: June 2012
DOI: https://doi.org/10.1007/s12403-012-0068-9
