1 Introduction

Water pollution is a matter of social concern both in Vietnam and worldwide. Pollution from industrial factories increasingly degrades environmental quality and causes severe health problems for local inhabitants. Building water quality monitoring stations is essential but difficult, because installation is expensive and there is little reliable information about which locations would yield precise results. According to the Center for Monitoring and Analysis Environment (Department of Natural Resources and Environment, Binh Duong), the automatic water quality monitoring network of Binh Duong province has four stations: Tan Hiep, Vinh Nguyen, Thu Dau Mot and Tan Uyen. The system monitors daily and continuously, recording parameters such as TSS, pH, nitrate, temperature and salinity. Given its large area, the province needs to install more monitoring stations. Meanwhile, with the rapid development of industrial parks, environmental pollution, and water pollution in particular, has become a pressing issue: clean water is scarce, and polluted water carries the risk of many diseases. A mathematical model is therefore needed to predict whether the water in a given area will be safe to use in the future. Using continuous monitoring data for three consecutive months (February to April 2018, at Tan Hiep station), the author provides an appropriate mathematical model to predict water pollution in the following months.

Several models for predicting water pollution are currently available, such as QUAL2K and the IPC model, and many water quality models have been developed over the years for various types of water bodies. The QUAL2E water quality model, developed at an earlier stage, had many limitations. To overcome them, QUAL2K was developed by Park and Lee in 2002 [1]. QUAL2K is a version of the QUAL2E model [2]. It was developed through cooperation between Tufts University and the US Environmental Protection Agency (US EPA). The model is widely used to predict the evolution of river water quality and the waste load discharged into rivers. The IPC model was developed by the World Bank, the World Health Organization and the US health organization. It assesses river water quality, forecasts changes in river water quality, and calculates the amount by which each source must cut its discharge.

2 Materials and Methods

The Dong Nai River is the longest inland river in Vietnam and has the second largest basin in the South, after the Mekong River. It flows through Lam Dong, Dak Nong, Binh Phuoc, Dong Nai and Binh Duong provinces and Ho Chi Minh City, with a length of 586 km (364 miles) and a basin of 38,600 km² (14,910 mi²). Measured from the source of the Da Dang River, it is 586 km long; measured from the confluence with the Da Nhim River below Pongour waterfall, it is 487 km long. The Dong Nai River flows into the East Sea in Can Gio district. A total of 89 daily observations were collected from the Tan Hiep automatic water monitoring station on the Dong Nai River, which was monitored continuously from February 1, 2018 to April 30, 2018 (see Table 1).

Table 1. Pollution data of TSS water at Tan Hiep station.

The TSS parameter (turbidity and suspended solids) denotes total suspended solids. It is usually measured with a turbidity meter (turbidimeter). Turbidity arises from the interaction between light and solids suspended in the water, such as sand, clay, algae, microorganisms and organic matter. Suspended solids scatter light, or absorb and re-emit it, in a manner that depends on the size, shape and composition of the particles; turbidity measuring devices can therefore reflect changes in the type, size and concentration of the particles in a sample. The author uses a geostatistical method to predict the TSS concentration at future times.

The main tool in geostatistics is the variogram, which expresses the spatial dependence between neighbouring observations. The variogram \( \gamma \left( h \right) \) can be defined as one-half the variance of the difference between the attribute values at all points separated by h, as follows [3, 8]

$$ \gamma (h) = \frac{1}{2N(h)}\sum\limits_{i = 1}^{N(h)} {[Z(s_{i} ) - Z(s_{i} + h)]^{2} } $$
(1)

where Z(s) indicates the magnitude of the variable, and N(h) is the total number of pairs of attributes that are separated by a distance h.
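Equation (1) can be illustrated with a short sketch for a one-dimensional daily series. The function name `empirical_variogram`, the lag tolerance `tol` and the ten synthetic values are assumptions made for this illustration; they are not the Tan Hiep measurements and this is not the GS+ implementation.

```python
import numpy as np

def empirical_variogram(t, z, lags, tol=0.5):
    """Semivariance gamma(h) of Eq. (1) for each lag h in `lags`.

    A pair (i, j) is assigned to lag h when | |t_i - t_j| - h | <= tol.
    Returns the semivariances and the pair counts N(h).
    """
    t = np.asarray(t, dtype=float)
    z = np.asarray(z, dtype=float)
    i, j = np.triu_indices(len(t), k=1)        # every unordered pair once
    dist = np.abs(t[i] - t[j])                 # separation of each pair
    sqdiff = (z[i] - z[j]) ** 2                # squared difference of each pair
    gamma, counts = [], []
    for h in lags:
        mask = np.abs(dist - h) <= tol
        n_pairs = int(mask.sum())
        counts.append(n_pairs)
        gamma.append(sqdiff[mask].sum() / (2 * n_pairs) if n_pairs else np.nan)
    return np.array(gamma), np.array(counts)

# Synthetic stand-in for a daily series (hypothetical values).
days = np.arange(1, 11)
values = np.array([3.0, 4.0, 6.0, 5.0, 7.0, 8.0, 7.0, 9.0, 11.0, 10.0])
gamma, counts = empirical_variogram(days, values, lags=[1, 2, 3])
print(gamma, counts)
```

With evenly spaced daily data and a tolerance of half a day, each lag collects exactly the pairs separated by that number of days, so N(h) decreases by one for each additional day of lag.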

Under the second-order stationary conditions [4, 9] one obtains

$$ E[Z(s)] = \mu $$

and the covariance

$$ \operatorname{Cov}[Z(s),Z(s + h)] = E[(Z(s) - \mu )(Z(s + h) - \mu )] = E[Z(s)Z(s + h) - \mu^{2} ] = C(h) $$
(2)

Then \( \gamma (h) = \frac{1}{2}E[Z(s) - Z(s + h)]^{2} = C(0) - C(h) \)
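The last identity can be checked by centring both terms on \( \mu \) and expanding the square. The following short derivation is a sketch that assumes only the second-order stationarity conditions stated above, so that \( E[(Z(s) - \mu )^{2} ] = E[(Z(s + h) - \mu )^{2} ] = C(0) \):

$$ \begin{aligned} \gamma (h) &= \frac{1}{2}E\left\{ {[(Z(s) - \mu ) - (Z(s + h) - \mu )]^{2} } \right\} \\ &= \frac{1}{2}\left\{ {E[(Z(s) - \mu )^{2} ] + E[(Z(s + h) - \mu )^{2} ]} \right\} - E[(Z(s) - \mu )(Z(s + h) - \mu )] \\ &= \frac{1}{2}[C(0) + C(0)] - C(h) = C(0) - C(h) \end{aligned} $$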

The most commonly used models are the spherical, exponential, Gaussian, and pure nugget effect models [5, 8]. The adequacy and validity of the developed variogram model are tested by a technique called cross-validation.
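The four model families named above can be written down compactly. The sketch below uses one common parameterisation, in which `sill` is the total sill (nugget plus structured variance) and `a` is the range parameter; other texts and packages scale the exponential and Gaussian range differently, so the exact form used by any particular software is an assumption here. The function names and numerical values are illustrative only.

```python
import numpy as np

def spherical(h, nugget, sill, a):
    """Spherical model: reaches the sill exactly at h = a."""
    h = np.asarray(h, dtype=float)
    r = np.minimum(h / a, 1.0)                  # clamp so gamma = sill for h >= a
    g = nugget + (sill - nugget) * (1.5 * r - 0.5 * r ** 3)
    return np.where(h > 0, g, 0.0)              # gamma(0) = 0 by definition

def exponential(h, nugget, sill, a):
    """Exponential model: approaches the sill asymptotically."""
    h = np.asarray(h, dtype=float)
    g = nugget + (sill - nugget) * (1.0 - np.exp(-h / a))
    return np.where(h > 0, g, 0.0)

def gaussian(h, nugget, sill, a):
    """Gaussian model: parabolic near the origin, asymptotic sill."""
    h = np.asarray(h, dtype=float)
    g = nugget + (sill - nugget) * (1.0 - np.exp(-(h / a) ** 2))
    return np.where(h > 0, g, 0.0)

def pure_nugget(h, nugget):
    """Pure nugget effect: no dependence at any positive lag."""
    h = np.asarray(h, dtype=float)
    return np.where(h > 0, nugget, 0.0)

# Illustrative parameters (hypothetical, not fitted to any data).
lags = np.array([0.0, 5.0, 10.0, 20.0])
for model in (spherical, exponential, gaussian):
    print(model.__name__, model(lags, nugget=1.0, sill=5.0, a=10.0))
print("pure_nugget", pure_nugget(lags, nugget=1.0))
```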

A cross plot of the estimated and true values gives the correlation coefficient \( r^{2} \). The most appropriate variogram was chosen by trial and error as the one with the highest correlation coefficient.
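A common way to carry out such a cross-validation is leave-one-out: each observation is removed in turn and re-estimated from the remaining ones, and the estimates are then compared with the true values. The sketch below assumes this leave-one-out scheme, an exponential variogram with hypothetical parameters, and a synthetic series; whether it matches the software's internal procedure in every detail is not confirmed by the text.

```python
import numpy as np

def exp_variogram(h, nugget=1.0, sill=10.0, a=5.0):
    """Exponential variogram with hypothetical parameters."""
    h = np.asarray(h, dtype=float)
    return np.where(h > 0, nugget + (sill - nugget) * (1.0 - np.exp(-h / a)), 0.0)

def krige(t, z, t0, variogram):
    """Ordinary kriging estimate at t0 from observations (t, z)."""
    n = len(t)
    A = np.ones((n + 1, n + 1))                       # bordered variogram matrix
    A[:n, :n] = variogram(np.abs(t[:, None] - t[None, :]))
    A[n, n] = 0.0
    b = np.append(variogram(np.abs(t - t0)), 1.0)
    w = np.linalg.solve(A, b)[:n]                     # drop the multiplier
    return float(w @ z)

def leave_one_out(t, z, variogram):
    """Re-estimate every observation from all the others."""
    t = np.asarray(t, dtype=float)
    z = np.asarray(z, dtype=float)
    est = np.empty_like(z)
    for k in range(len(t)):
        keep = np.arange(len(t)) != k
        est[k] = krige(t[keep], z[keep], t[k], variogram)
    return est

# Synthetic daily series with a mild upward trend (hypothetical values).
days = np.arange(1.0, 13.0)
tss = 20.0 + 0.8 * days + np.sin(days)
estimates = leave_one_out(days, tss, exp_variogram)
r = np.corrcoef(tss, estimates)[0, 1]
print(round(r, 3))
```

Repeating the loop for each candidate variogram model and keeping the one with the highest agreement between `tss` and `estimates` mirrors the trial-and-error selection described above.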

Kriging is an exact interpolation technique used to find the best linear unbiased estimate, that is, the linear unbiased estimator with the minimum variance of the estimation error. We used ordinary kriging for spatial and temporal analysis. Ordinary kriging is mainly applied to datasets with or without a trend.

The general equation of linear kriging estimator is

$$ \hat{Z}(s_{0} ) = \sum\limits_{i = 1}^{n} {w_{i} Z(s_{i} )} $$
(3)

In order to achieve unbiased estimations in ordinary kriging the following set of equations should be solved simultaneously.

$$ \left\{ {\begin{array}{*{20}l} {\sum\limits_{i = 1}^{n} {w_{i} \gamma (s_{i} ,s_{j} )} + \lambda = \gamma (s_{0} ,s_{j} ),\quad j = 1,2, \ldots ,n} \hfill \\ {\sum\limits_{i = 1}^{n} {w_{i} } = 1} \hfill \\ \end{array} } \right. $$
(4)

where \( \hat{Z}(s_{0} ) \) is the kriged value at location \( s_{0} \), \( Z(s_{i} ) \) is the known value at location \( s_{i} \), \( w_{i} \) is the weight associated with the data, \( \lambda \) is the Lagrange multiplier, and \( \gamma \left( {s_{i} ,s_{j} } \right) \) is the value of the variogram corresponding to a vector with origin in \( s_{i} \) and extremity in \( s_{j} \).
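The ordinary kriging system in variogram form can be assembled and solved in a few lines. This is a minimal sketch under stated assumptions: a Gaussian variogram with hypothetical parameters, five synthetic observations, and the multiplier entering the system with a plus sign (its sign is a convention and does not change the weights). The function names are illustrative.

```python
import numpy as np

def gaussian_variogram(h, nugget=1.0, sill=10.0, a=6.0):
    """Gaussian variogram with hypothetical parameters; gamma(0) = 0."""
    h = np.asarray(h, dtype=float)
    g = nugget + (sill - nugget) * (1.0 - np.exp(-(h / a) ** 2))
    return np.where(h > 0, g, 0.0)

def ordinary_kriging(t, z, t0, variogram):
    """Return (estimate, kriging variance, weights) at location t0."""
    t = np.asarray(t, dtype=float)
    z = np.asarray(z, dtype=float)
    n = len(t)
    A = np.ones((n + 1, n + 1))                   # last row/column: sum(w) = 1
    A[:n, :n] = variogram(np.abs(t[:, None] - t[None, :]))
    A[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = variogram(np.abs(t - t0))             # gamma(s_0, s_j)
    sol = np.linalg.solve(A, b)
    w, lam = sol[:n], sol[n]                      # weights and multiplier
    estimate = float(w @ z)                       # Eq. (3)
    variance = float(w @ b[:n] + lam)             # kriging variance, variogram form
    return estimate, variance, w

# Synthetic observations (hypothetical values, days as the coordinate).
days = np.array([1.0, 2.0, 4.0, 7.0, 9.0])
tss = np.array([21.0, 22.5, 24.0, 27.5, 30.0])
est, var, w = ordinary_kriging(days, tss, 5.0, gaussian_variogram)
print(est, var, w.sum())
```

Calling the function at an observed day, for example `ordinary_kriging(days, tss, 4.0, gaussian_variogram)`, returns the observed value with zero kriging variance, which is what "exact interpolation" means in practice.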

Kriging minimizes the mean squared error of prediction

$$ \hbox{min} \sigma_{e}^{2} = {\mathbb{E}}[Z(s_{0} ) - \hat{Z}(s_{0} )]^{2} $$

For a second-order stationary process, the last equation can be written as

$$ \sigma_{e}^{2} = C(0) - 2\sum\limits_{{{\text{i}} = 1}}^{\text{n}} {w_{i} C(s_{0} ,s_{i} ) + \sum\limits_{{{\text{i}} = 1}}^{\text{n}} {\sum\limits_{{{\text{j}} = 1}}^{\text{n}} {w_{i} w_{j} C} } } (s_{i} ,s_{j} )\,{\text{subject}}\,{\text{to}}\,\sum\limits_{{{\text{i}} = 1}}^{\text{n}} {w_{i} } = 1 $$
(5)

Therefore the minimization problem can be written as

$$ { \hbox{min} }\left\{ {C(0) - 2\sum\limits_{i = 1}^{\text{n}} {w_{i} } C(s_{0} ,s_{i} ) + \sum\limits_{{{\text{i}} = 1}}^{\text{n}} {\sum\limits_{{{\text{j}} = 1}}^{\text{n}} {w_{i} } } w_{j} C(s_{i} ,s_{j} ) - 2\lambda (\sum\limits_{{{\text{i}} = 1}}^{\text{n}} {w_{i} } - 1)} \right\} $$
(6)

where \( \lambda \) is the Lagrange multiplier. After differentiating (6) with respect to \( w_{1} ,w_{2} , \ldots ,w_{n} \) and \( \lambda \), and setting the derivatives equal to zero, we find that

$$ \sum\limits_{{{\text{j}} = 1}}^{\text{n}} {w_{j} } C(s_{i} ,s_{j} ) - C(s_{0} ,s_{i} ) - \lambda = 0,\;i = 1,2, \ldots ,{\text{n}}\;{\text{and}}\,\sum\limits_{{{\text{i}} = 1}}^{\text{n}} {w_{i} } = 1 $$

Using matrix notation the previous system of equations can be written as

$$ \left( {\begin{array}{*{20}c} {C(s_{1} ,s_{1} )} & {C(s_{1} ,s_{2} )} & \ldots & {C(s_{1} ,s_{n} )} & 1 \\ {C(s_{2} ,s_{1} )} & {C(s_{2} ,s_{2} )} & \ldots & {C(s_{2} ,s_{n} )} & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ {C(s_{n} ,s_{1} )} & {C(s_{n} ,s_{2} )} & \ldots & {C(s_{n} ,s_{n} )} & 1 \\ 1 & 1 & \ldots & 1 & 0 \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {w_{1} } \\ {w_{2} } \\ \vdots \\ {w_{n} } \\ { - \lambda } \\ \end{array} } \right) = \left( {\begin{array}{*{20}c} {C(s_{0} ,s_{1} )} \\ {C(s_{0} ,s_{2} )} \\ \vdots \\ {C(s_{0} ,s_{n} )} \\ 1 \\ \end{array} } \right) $$

Therefore the weights w1, w2, …, wn and the Lagrange multiplier λ can be obtained by

$$ {\mathbf{W}} = {\mathbf{C}}^{ - 1} {\mathbf{c}} $$

where \( {\mathbf{W}} = (w_{1} ,w_{2} , \ldots ,w_{n} , - \lambda )^{\prime} \),

$$ {\mathbf{c}} = (C(s_{0} ,s_{1} ),C(s_{0} ,s_{2} ), \ldots ,C(s_{0} ,s_{n} ),1)^{\prime} $$

and the entries of \( {\mathbf{C}} \) are

$$ {\mathbf{C}}_{ij} = \left\{ {\begin{array}{*{20}l} {C(s_{i} ,s_{j} ),} \hfill & {i = 1,2, \ldots ,n,} \hfill & {j = 1,2, \ldots ,n,} \hfill \\ {1,} \hfill & {i = n + 1,} \hfill & {j = 1,2, \ldots ,n,} \hfill \\ {1,} \hfill & {i = 1,2, \ldots ,n,} \hfill & {j = n + 1,} \hfill \\ {0,} \hfill & {i = n + 1,} \hfill & {j = n + 1.} \hfill \\ \end{array} } \right. $$
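The matrix system above can be solved numerically as follows. The sketch assumes a covariance function derived from a Gaussian variogram through \( C(h) = C(0) - \gamma (h) \) with hypothetical parameters, uses synthetic observation days, and calls a linear solver rather than forming the inverse explicitly, which gives the same solution with better numerical behaviour. The estimation variance is then evaluated directly from Eq. (5).

```python
import numpy as np

def covariance(h, nugget=1.0, sill=10.0, a=6.0):
    """C(h) = C(0) - gamma(h) for a Gaussian variogram; C(0) equals the sill."""
    h = np.asarray(h, dtype=float)
    return np.where(h > 0, (sill - nugget) * np.exp(-(h / a) ** 2), sill)

def kriging_weights(t, t0, cov):
    """Solve the bordered covariance system for the vector (w_1..w_n, -lambda)."""
    t = np.asarray(t, dtype=float)
    n = len(t)
    C = np.ones((n + 1, n + 1))                    # border of ones
    C[:n, :n] = cov(np.abs(t[:, None] - t[None, :]))
    C[n, n] = 0.0
    c = np.ones(n + 1)
    c[:n] = cov(np.abs(t - t0))                    # C(s_0, s_i)
    W = np.linalg.solve(C, c)                      # same solution as C^{-1} c
    return W[:n], -W[n], C[:n, :n], c[:n]          # last entry of W is -lambda

# Synthetic observation days (hypothetical).
days = np.array([1.0, 2.0, 4.0, 7.0, 9.0])
w, lam, C, c = kriging_weights(days, 5.0, covariance)

# Estimation variance from Eq. (5).
c0 = float(covariance(0.0))
sigma2 = c0 - 2.0 * (w @ c) + w @ C @ w
print(w, lam, sigma2)
```

Substituting the stationarity equations back into Eq. (5) gives the shorter form \( \sigma_{e}^{2} = C(0) - \sum\nolimits_{i} {w_{i} C(s_{0} ,s_{i} )} + \lambda \), which the computed values reproduce.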

The GS+ software (version 5.1.1) was used for geostatistical analysis in this study [6].

3 Results and Discussions

In order to check the anisotropy of TSS, the conventional approach is to compare variograms in several directions [7]. In this study major angles of 0°, 45°, 90°, and 135° with an angle tolerance of \( \pm \)45° were used for detecting anisotropy.

Figure 1 shows the fitted variogram for the spatial analysis of TSS: a Gaussian model with nugget = 6.5 (mg/l)², sill = 64 (mg/l)², range = 95 days, r² = 0.969 and RSS = 101. It is the best-fitting omnidirectional variogram of water pollution, selected on the basis of cross-validation. The variogram map of the TSS parameter indicates that an isotropic model is suitable. The variogram values are presented in Table 2.

Fig. 1. Isotropic variogram model for the TSS parameter.

Table 2. Variogram values of TSS.

The residual sum of squares (RSS) provides an exact measure of how well the model fits the variogram data; the lower the RSS, the better the fit. When GS+ autofits a model, it uses RSS to choose the parameters of each variogram model by finding the combination of parameter values that minimizes RSS for that model. The Residual SS displayed in the This Fit box is calculated for the currently defined model.

The value of r² indicates how well the model fits the variogram data, but it is not as sensitive or robust as the RSS for best-fit calculations; RSS should be used to judge the effect of changes in model parameters.
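Both fit measures can be computed from the empirical semivariances and the values of a candidate model at the same lags. The sketch below assumes the usual definitions, RSS as the sum of squared residuals and r² as one minus RSS divided by the total sum of squares; the exact formula used by the software for r² is an assumption, and the numbers are made up for the example.

```python
import numpy as np

def fit_quality(gamma_obs, gamma_model):
    """Return (RSS, r2) for a model evaluated at the observed lags."""
    obs = np.asarray(gamma_obs, dtype=float)
    mod = np.asarray(gamma_model, dtype=float)
    rss = float(np.sum((obs - mod) ** 2))            # residual sum of squares
    total = float(np.sum((obs - obs.mean()) ** 2))   # total sum of squares
    return rss, 1.0 - rss / total

# Hypothetical empirical semivariances and model values at four lags.
rss, r2 = fit_quality([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(rss, r2)
```

Because RSS is expressed in squared semivariance units, it is comparable only between models fitted to the same empirical variogram, whereas r² is dimensionless.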

Model testing: The reliability of the model selected with the appropriate interpolation is summarized in Table 3 by the regression coefficient, the correlation coefficient and the interpolated values, together with the error measures, namely the standard error (SE) and the standard error of prediction (SE Prediction) [10, 11].

Table 3. Testing the model parameters.

Figure 2 shows the results of testing the error between the estimated and actual values for kriging interpolation with isotropic TSS. The regression coefficient is 1.005, the correlation coefficient is 0.859 (the ideal value is 1), the standard error is 0.044 (close to 0) and the prediction error is 2.258. These values indicate that the chosen kriging interpolation model agrees with the data set (Fig. 3).
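Statistics of this kind can be obtained by regressing the actual values on the cross-validation estimates. The sketch below assumes ordinary least squares with the estimates as the explanatory variable, and it takes the standard error of prediction to be the residual standard deviation; the software may define these quantities differently, and the function name and data are hypothetical.

```python
import numpy as np

def cross_validation_stats(actual, estimated):
    """Regress actual on estimated values and summarise the agreement."""
    y = np.asarray(actual, dtype=float)
    x = np.asarray(estimated, dtype=float)
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    syy = np.sum((y - y.mean()) ** 2)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))
    slope = sxy / sxx                               # regression coefficient
    intercept = y.mean() - slope * x.mean()
    resid = y - (intercept + slope * x)
    s2 = np.sum(resid ** 2) / (n - 2)               # residual variance
    return {
        "regression_coefficient": float(slope),
        "intercept": float(intercept),
        "correlation": float(sxy / np.sqrt(sxx * syy)),
        "se_coefficient": float(np.sqrt(s2 / sxx)),
        "se_prediction": float(np.sqrt(s2)),
    }

# Hypothetical estimated and actual values.
estimated = [20.0, 22.0, 25.0, 27.0, 30.0]
actual = [21.0, 21.5, 25.5, 26.5, 30.5]
stats = cross_validation_stats(actual, estimated)
print(stats)
```

A regression coefficient near 1 together with a small standard error indicates that the estimates are neither systematically stretched nor compressed relative to the observations.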

Fig. 2. Error testing result of TSS prediction.

Fig. 3. Cross-validation (kriging) (a), (b) and (c) of TSS.

Figures 4 and 5 show that, over the period from February to April (days 1 to 89), the TSS concentration is lowest in February and gradually increases towards April. This is consistent with reality: February falls in the dry season and April marks the beginning of the rainy season, so TSS increases. The X-axis and Y-axis represent the number of days (from February 1, 2018 to April 30, 2018, that is, 89 days). Based on Figs. 4 and 5, we can predict the TSS concentration for the next month (May) and propose remedial solutions.

Fig. 4. Kriging interpolation for the TSS parameter in 2 dimensions.

Fig. 5. Kriging interpolation for the TSS parameter in 3 dimensions.

Applying the geostatistical methods described above to predict the TSS concentration at Tan Hiep station yields forecasts with small errors, as shown in Fig. 2. This forecasting study relies on these methods and models. Based on interpolation, we can predict the level of TSS pollution for the following months, for which no monitoring data are available, and thus suggest measures to improve and protect the environment.

The forecast map shows that the forecast is most reliable within the 89-day period; outside this period the results may be inaccurate. The longer the pollutant is observed, the easier it is to select an interpolation model and the better the interpolation results, and vice versa. Different colors show different levels of pollution: blue marks the lowest level and white the highest, and areas of the same color have the same level of pollution. The model results still contain some error, which may be due to other factors affecting the TSS parameter, such as salinity, temperature, nitrate content and water flow. This is the first article in which the author uses the kriging interpolation method to predict water pollution over time.

4 Conclusions

The statistical approach used to predict TSS concentrations in the river at the Tan Hiep monitoring station produced small errors between estimated and actual values (a standard error of 0.044 and a prediction error of 2.258). The study thus demonstrates that geostatistics is an effective, sound and highly reliable basis for building predictive models. When building a model, the author should pay attention to the model's error values and to the characteristics of the data. The author should also compare the candidate models in order to select the one that best suits the actual data, because different models provide different accuracy. Experience in selecting models therefore plays a very important role in the interpolation results.

Finally, the proposed method can be compared with several other methods as follows. The polygon (nearest neighbor) method is easy to use and quick to calculate in 2D, but it gives discontinuous estimates, suffers from edge effects (it is sensitive to boundaries) and is difficult to implement in 3D. The triangulation method is easy to understand, fast to calculate in 2D and can be done manually, but the triangulation network is not unique (the use of Delaunay triangles is an effort to work with a “standard” set of triangles), it is not useful for extrapolation and it is difficult to implement in 3D. The local sample mean is easy to understand and fast and easy to calculate in both 2D and 3D, but the definition of the local neighborhood is not unique, the sample locations are used only to define that neighborhood, and the method is sensitive to clustering at the data locations. It does not always return a useful answer and is rarely used. Similarly, the inverse distance method is easy to understand and implement, and changing the exponent adds some flexibility in adapting the method to different estimation problems. It can handle anisotropy, but it runs into difficulty when the point to be estimated coincides with a data point (d = 0, so the weight is undefined) and it is susceptible to clustering.
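The d = 0 difficulty of the inverse distance method is easy to see in code. The sketch below is one possible implementation on a one-dimensional coordinate; returning the observed value (or the mean of coincident observations) when the target coincides with a data point is an assumed convention for handling the undefined weight, and the function name `idw` and the sample values are hypothetical.

```python
import numpy as np

def idw(t, z, t0, power=2.0):
    """Inverse distance weighted estimate at t0 with exponent `power`."""
    t = np.asarray(t, dtype=float)
    z = np.asarray(z, dtype=float)
    d = np.abs(t - t0)
    hit = d == 0
    if hit.any():
        # The weight 1 / d**power is undefined at d = 0, so fall back to
        # the observed value(s) at that location.
        return float(z[hit].mean())
    w = 1.0 / d ** power
    return float(np.sum(w * z) / np.sum(w))

# Hypothetical observations on days 0 and 3.
print(idw([0.0, 3.0], [0.0, 30.0], 1.0, power=1.0))   # weights 1 and 0.5
print(idw([0.0, 3.0], [0.0, 30.0], 3.0))              # coincides with a data point
```

Unlike kriging, the weights here depend only on distance and the chosen exponent, so clustered observations are not down-weighted and no estimate of the prediction error is produced.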

In this paper, QUAL2K is not suitable, because it has been used to predict pollution along a river section and has not been applied to forecasting pollution over time.