Introduction

Mine leachates and acid mine drainage from mine wastes and tailings can cause environmental contamination in surface waters and soils (Song and Choi 2015). When mine tailings dam fails from a severe rainstorm, the surrounding environments can be greatly influenced. In addition, various harmful metals enriched in these mine wastes will be a serious threat to human health when they move through the ecosystem in biogeochemical cycles (Park et al. 1995; Shamsi et al. 2016). To prevent and mitigate the damage from heavy metal contamination, comprehensive studies on the extents, pathways, distribution patterns, or other characteristics of heavy metals are required (Kim et al. 2012a). However, a field investigation on the effects of mining pollution is difficult to perform because of poor accessibility in steep terrain and extensive range of contamination (Lee and Choi 2016). Sufficient data for analyzing contamination trends in the area of interest are also difficult to acquire because of time and budgetary constraints. Therefore, to predict the pollution degree and extent precisely from the limited data is a significant challenge. Geostatistics, which is a branch of statistics focusing on spatial and spatiotemporal datasets, can be utilized to solve this problem.

Heavy metals in stream sediments are suitable variables to estimate the continuous environmental impacts because they are abundant in contaminated environments and barely change over time (Thornton 1983; Lee et al. 2016). To analyze the heavy metals in stream sediments, the path and flow direction of the stream are essential to be considered (Choi et al. 2011; Choi 2012; Kim et al. 2016). Geographic information systems (GISs) combined with geostatistics can be a useful tool to analyze the characteristics of streams and topography. Furthermore, GIS analysis has the advantage of obtaining data and analyzing extensive areas easily (Choi et al. 2008; Suh et al. 2013). With the growth of computer technologies, various studies on GIS modeling have estimated the pathway and distribution of contaminants (Heathwaite et al. 2005; Yenilmez et al. 2011; Kim et al. 2012b). However, most of these studies have limitations in indicating the real contamination degree and distribution because field survey data were not utilized or were only used to verify the predicted result. On the other hand, geostatistics studies predict the contamination degree using field survey data. Salgueiro et al. (2008) estimated the chemical contamination of stream sediments at the Vale das Gatas mine in Portugal using the geostatistical method. Khalil et al. (2013) assessed the extension and magnitude of soil contamination with toxic elements from abandoned mines in semi-arid areas using geochemical analysis and geostatistics. In addition, many studies predicted soil contamination in cultivated land (Steiger et al. 1996; White et al. 1997; Lin et al. 2001; Liu et al. 2004). Particularly, Kriging, which predicts an unknown value using the weighted linear summation, is widely used and studied in different fields. Although a variety of software platforms provide Kriging tools according to each objective, the aquatic variables could not be predicted reasonably because of their use of Euclidean distance when analyzing the correlationship between samples.

Several studies that predict aquatic variables are represented in Table 1. Smith et al. (1997) suggested a method using spatially referenced regressions of contaminant transport on watershed attributes in regional water-quality assessment. Yuan (2004) used a spatial interpolation to estimate stressor levels in unsampled streams by considering the geology and land use of each sample catchment basin. However, these studies predicted the overall water quality on the area including land because Euclidean distance was used in Kriging application. Dent and Grimm (1999) quantified patterns of nutrient concentration in surface waters of an arid land stream and compared spatial patterns using stream distance. Torgersen et al. (2004) sampled and analyzed multiscale, spatially continuous patterns of stream fishes and physical habitat using the distance between points along the stream path, or the network distance. However, Kriging predictions were not attempted. Curriero (1996), Little et al. (1997), and Rathbun (1998) predicted water quality using Kriging based on stream distance instead of Euclidean distance, but it was not applied to stream networks, and topography was not considered. Skøien et al. (2006) estimated the 100-year flood in stream networks considering the topology of stream networks and nested catchments. However, the method only predicted the variable in the unit of stream segment for the stream networks and was not able to predict the continuous change of the variable in a stream segment. VerHoef et al. (2006) developed spatial statistical models for stream networks that can make predictions at unsampled locations by incorporating flow and stream distance using spatial moving averages. Most of these studies analyzed aquatic variables in a stream or stream networks composed of vector objects. These vector-based methods have difficulties in considering topography, requiring the construction of objects or database to preserve information, such as flow direction and the distance between samples. Therefore, those studies have difficulty in analyzing the continuous change of variables and automating the analysis process. Software development has not been reported.

Table 1 Review of the pattern analysis and prediction for aquatic variables

This study aims to develop the grid-based Kriging algorithm to predict aquatic variables along the stream networks. To compensate for the drawbacks of Kriging based on Euclidean distance, this study utilizes the stream distance by analyzing the stream path and networks using digital elevation model (DEM). In addition, the area of each sample catchment basin is applied to correlation modeling and Kriging prediction to predict unknown values reasonably at stream junctions. Raster dataset, which is a matrix data structure representing a grid of pixels, is used in this study because of its advantage in analyzing the flow direction, stream networks, catchment basins, and other overlay analyses considering topography. Only the present study and Skøien et al. (2006) in Table 1 considered topography of the study area when predicting variables in the stream by using raster dataset. Furthermore, only the present study developed a software platform designed for automation of aquatic variable prediction. This study presents two case studies of application to a synthetic sample data and stream sediment data of an abandoned mining area in South Korea.

Grid-based Kriging considering the stream distance and catchment basin area

The Kriging method proposed in this study was designed to consider the stream distance and networks in contrast to general Kriging methods, which consider only spatial locations of samples. Figure 1 shows the difference between the stream distance and Euclidean distance in rasterized digital data. In Fig. 1a, the solid line is a stream, and the circle is a sample in streams. A grid of pixels denotes a raster dataset, and the colored pixel X indicates the unknown location to be predicted. Figure 1b shows the rasterized samples and streams. The distances from pixel X to samples A and B are both (\(\sqrt 2 \; \times \;{\text{pixel}}\;{\text{size}}\)) in raster format (Fig. 1b), and the two samples have same weights in general Kriging. However, it is rational that sample B is more associated with X because sample B is even closer to the pixel X than sample A when considering the stream path.

Fig. 1
figure 1

Rasterization of a unknown area, b samples and stream paths

Figure 2 shows the flowchart for calculating the stream distances between samples and predicting the unknown value using Kriging by considering stream distance (i.e., STD-Kriging in this paper). First, raster format data representing the sample locations and flow direction of the study area are required to calculate the distances between samples. If a sample is located on the stream path, the unit distance in a pixel is accumulated along the flow direction until the next downward pixel meets other samples. There are sample pairs for all samples in Kriging using Euclidean distance (i.e., EUC-Kriging in this paper). For example, ten distance pairs exist excluding duplicates for five samples in Fig. 3a. On the other hand, stream networks should be considered when calculating the stream distance. In Fig. 3b, it is reasonable to calculate the distances between samples 2, 4, and 5 in the Kriging prediction because sample 2 is connected with samples 4 and 5. However, no distance pair between samples 1 and 2 exists because they are unconnected and irrelevant to each other.

Fig. 2
figure 2

Flowchart for calculating the stream distances between samples and predicting the unknown value using STD-Kriging

Fig. 3
figure 3

Euclidean distance (a) and stream distance (b) between samples in the stream network

After calculating the distances between samples, the empirical variogram should be obtained using the distances and sample values. In this step, an adequate lag distance should be defined because the shape of the empirical variogram depends on the lag distance. The empirical variogram with the lag distance h is defined as follows for observations of z i at locations x i (i = 1, …) (Cressie, 1985):

$$\gamma \left( h \right) = \frac{1}{{2\left| {N\left( h \right)} \right|}}\mathop \sum \limits_{{\left( {i,j} \right) \in N\left( h \right)}} \left| {z_{i} - z_{j} } \right|^{2}$$
(1)

where N(h) denotes the set of pairs with |x i  − x j | = h, and |N(h)| is the number of pairs in the set. In this study, the approximate distance is used using a tolerance, which is the half of the lag distance, h, to include all data.

Kriging requires valid variogram at every lag distance to predict the unknown value. However, the empirical variogram cannot be computed at every lag distance, and it is not ensured to be valid because of variation in the estimation. Therefore, theoretical variogram models ensuring validity are applied to approximate the empirical variogram (Chiles and Delfiner 2012). Theoretical variogram modeling is an important step in Kriging because it affects the prediction directly. The final step is to predict the unknown value by calculating weights of samples, which is calculated based on the theoretical variogram according to the distance between the sample and the unknown location to be predicted.

This study proposes a new Kriging method that uses the catchment basin area instead of the distance. A stream sediment sample is a result of constant erosion and sedimentation of introduced materials, which are transported primarily by the flow of rainwater or stream. On the assumption that the surface water flows to the lower area, it is possible to analyze the catchment basin of the sample where the water and materials came from by using GIS-based spatial analysis. The area of the catchment basin of each pixel can also be calculated by using the flow accumulation algorithm (Jenson and Domingue, 1988). If a wide gap of catchment basin area exists between two pixels that are connected to each other, they tend to have different values because many materials are introduced from another area between the two pixels. On the other hand, if a small gap of catchment basin area exists between the two pixels, they tend to have similar values.

Figure 4 shows the advantage of Kriging using the catchment basin area (i.e., CAT-Kriging in this paper) when compared with STD-Kriging. The red line denotes a contaminated stream, the blue line represents an uncontaminated stream, and the arrows indicate the flow direction of the stream pixel. Points A, B, and C are samples with known value, and points X and Y are unknown locations to be predicted. When predicting the value of point X, which is ahead of the stream junction, the stream distances from point X to samples B and C are 2 and 5, respectively. When considering the stream distance, an exaggerated value may be predicted for point X due to the effect of polluted sample C. The differences of the catchment basin area between point X and samples B and C are 2 and 44 (pixels), respectively. CAT-Kriging can provide a more reasonable prediction for point X because the effect of uncontaminated sample B is much larger than the contaminated sample C. When predicting the value of point Y after the stream junction, the stream distances from point Y to samples A, B, and C are 5, 5, and 2, respectively. The closest sample C has the largest weight, and samples A and B have same weights in STD-Kriging. However, it is reasonable that sample A, which is the representative of a larger catchment basin area, has more effect on point Y than sample B because of the assumption that stream sediment is a homogeneous mixture of materials in the catchment basin. In the case of CAT-Kriging, the respective differences of the catchment basin area are 14, 44, and 2 for samples A, B, and C, respectively. Therefore, samples C, A, and B are arranged according to weights on point Y in CAT-Kriging.

Fig. 4
figure 4

An example showing the effect of the catchment basin area to the chemistry values in the stream network

Software development to implement the Kriging process

To model the variogram and predict the unknown value using the Kriging method, a new software was implemented. The software was written in Visual Basic 2013 and utilizes the ESRI ASCII grid file as a standard data format, exchangeable with ESRI ArcGIS software.

Figure 5a is a module developed for calculating the distance between samples, which provides Euclidean distance, stream distance, and catchment basin area difference as a table. Figure 5b is the variogram-modeling module constructed to calculate the empirical variogram and model the theoretical variogram. Using the raster format input data, such as the sample locations, sample concentrations, and flow direction of the study area, this module calculates the empirical variogram for each lag distance and displays it as points in the graph. To predict the unknown value using Kriging, theoretical variogram models that ensure validity are required to approximate the empirical variogram. In this module, the empirical variogram is approximated by a combination of five widely used theoretical variogram models, such as the linear model, the spherical model, the exponential model, the Gaussian model, and the nugget model. It is important to combine the models to represent the data and empirical variogram appropriately. This module supports modeling the theoretical variogram using weighted least squares regression. The theoretical variogram can be saved as its own format to be used for prediction. Figure 5c is the Kriging analysis module, which predicts the unknown values using GIS layers and the saved variogram file. The predicted result is exported to ESRI ASCII grid file. In addition, this module provides a leave-one-out cross-validation function that uses one observation as the validation set and the remaining observations as the training set. If the number of samples is N, N models are created, and this function is repeated N times. The advantage of this function is that it does not allow for randomness because all the data can be used for training. This function is commonly used and is known as a useful validation method for Kriging (Aelion et al. 2009; Menafoglio et al. 2014). To compare the predicting capabilities quantitatively, mean error (ME), mean absolute error (MAE), and root mean square error (RMSE) are also computed.

Fig. 5
figure 5

Graphical user interface of a distance calculator module, b variogram-modeling module, and c Kriging analysis module

Application to synthetic datasets

Synthetic datasets creation

To demonstrate the improvements of the proposed methods, synthetic datasets were used. Synthetic data should be created similar to the principle of real contaminant distribution. In this study, two factors, such as soil erosion and catchment basin derived from real DEM, were considered to reflect the contaminant distribution.

Figure 6a shows a DEM with 100 m resolution and the locations of synthetic stream sediment samples. Figure 6b shows the catchment basins of some samples that are derived from the DEM. Sample 8 is the lowest of the five samples, and the catchment basin is the sum of A, B, C, D, and E. Sample 65 is the highest of the five samples, and the catchment basin is E. The heavy metal concentration of the stream sediment is affected by contaminants in the eroded soil. This study estimated the soil erosion of each pixel using the Universal Soil erosion Equation (USLE) defined as follows:

Fig. 6
figure 6

Synthetic datasets: a DEM and samples, b catchments of some samples, c distribution of LS factor that represents soil erosion, d distribution of nonpoint pollution source, and e simulated heavy metal concentration of synthetic data

$$A = R \times K \times L \times S \times C \times P$$
(2)

The USLE has been widely used to estimate the average annual soil erosion per unit land area (Wischmeier and Smith 1978) and considers six major factors affecting soil erosion, represented as the rainfall erosivity factor (R), the soil erodibility factor (K), the slope length factor (L), the slope steepness factor (S), the cropping management factor (C), and the supporting conservation practices factor (P). To create synthetic data, this study calculated the L-factor and S-factor as relative soil erosion (Fig. 6c) using GIS-based analyses of slope length and slope gradient (McCool et al. 1987) supposing that other factors are the same for all pixels. Figure 6d shows the synthetic concentrations of soil distributed in the study area as a nonpoint source.

The synthetic concentrations of stream sediment samples were simulated using thematic maps with the following steps:

  1. 1.

    Estimate the amount of heavy metal eroded at each pixel by multiplying the soil erosion (\(A_{i}\)) and concentration (\(C_{i}\)) of each pixel i.

  2. 2.

    Analyze the catchment basin (\(W_{k}\)), and calculate the area (the number of pixels) for each sample \(k\).

  3. 3.

    Define the amount of heavy metal introduced to each sample as \(\sum\nolimits_{i}^{{W_{k} }} {(A_{i} \times C_{i} )}\) on the assumption that the eroded heavy metals are mixed homogeneously.

  4. 4.

    Calculate the concentration of stream sediment sample \(k\) by dividing \(\sum\nolimits_{i}^{{W_{k} }} {(A_{i} \times C_{i} )}\) by \(\sum\nolimits_{i}^{{W_{k} }} {A_{i} }\), which is the sum of eroded soil in the catchment basin \(W_{k}\).

Figure 6e shows the concentrations of stream sediment samples derived from these steps. Although this result cannot represent the contamination in real world precisely, synthetic datasets reflecting contamination and dilution could be created.

Prediction for synthetic datasets

The variogram should be modeled before predicting the unknown values using Kriging. Figure 7a–c shows the empirical variograms and the theoretical variograms for the three methods (i.e., Euclidean distance, stream distance, and catchment basin area). The linear model was applied in the same manner, and the ranges of the three models were defined as 25,000 m, 40,000 m, and 800 km2 (80,000 pixels), based on the shape of the empirical variogram. The range refers to the distance at which the variogram reaches the sill. There is no correlation between two samples.

Fig. 7
figure 7

Empirical variograms and theoretical variograms calculated from a Euclidean distance, b stream distance, and c catchment basin area

Figure 8 shows the prediction results using EUC-Kriging (Fig. 8a), STD-Kriging (Fig. 8b), and CAT-Kriging (Fig. 8c). Unlike other methods, EUC-Kriging predicts for all pixels of the study area, but only the predicted values on the stream network were extracted to be compared with other two results. A remarkable difference among the three methods exists at the stream junctions, such as area A and area B. As shown in Fig. 9a, in the case of EUC-Kriging, the predicted values continuously vary regardless of the shape of stream network. On the other hand, the other two methods (Fig. 9b, c) can distinguish both uncontaminated and contaminated streams. The prediction tendencies are similar for STD-Kriging (Fig. 9b) and CAT-Kriging (Fig. 9c). However, at the stream segment marked by a square, CAT-Kriging predicts larger values than STD-Kriging because the stream segment covering sample 57 has larger catchment basin area than the stream segment covering sample 56. Figure 9d–f shows the prediction for area B in Fig. 8a. The stream segment with no observation is predicted using downstream samples, such as sample 28. Predicting these stream segments is challenging task, but it may be generally not contaminated. In the case of STD-Kriging (Fig. 9e), larger values are predicted than CAT-Kriging (Fig. 9f) because of the effect of the adjacent sample 28.

Fig. 8
figure 8

Prediction of heavy metal concentrations for synthetic data using a EUC-Kriging, b STD-Kriging, and c CAT-Kriging

Fig. 9
figure 9

Prediction of heavy metal concentrations for area A using a EUC-Kriging, b STD-Kriging, and c CAT-Kriging; and for area B using d EUC-Kriging, e STD-Kriging, and f CAT-Kriging

Figure 10 illustrates the predicted concentrations as a graph along the path indicated by pink line. A to M represents junctions where other stream segments meet the path. In the case of EUC-Kriging, the predicted values do not depend on junctions. In other two cases, the predicted values change drastically at junctions. For example, a massive increase at junction C is observed where a contaminated stream joins the path. On the contrary, a huge decrease at junction H occurred where an uncontaminated stream with large catchment area joins the path. This trend is significant in CAT-Kriging.

Fig. 10
figure 10

Graph of the predicted heavy metal concentrations for the stream path represented by a pink line. A to M are stream junctions

As a result of cross-validation, the ME of EUC-Kriging, STD-Kriging, and CAT-Kriging is −0.2, 3.0, and 1.4, the MAE is 11.5, 6.2, and 5.2, and the RMSE is 15.2, 9.4, and 13.3, respectively. ME indicates bias in prediction, and MAE or RMSE indicates the capability of prediction from the point of overall errors. Compared to MAE, RMSE amplifies large errors because of the square term in the formula. Based on the ME results, the EUC-Kriging result is the most unbiased and the STD-Kriging result is the most biased. Based on the MAE and RMSE results, the prediction capabilities of STD-Kriging and CAT-Kriging are confirmed to have improved in terms of overall error reduction. Even though the MAE of CAT-Kriging is the lowest, the RMSE of CAT-Kriging is larger than that of STD-Kriging owing to the influence of samples 50 and 61, which have large errors. If samples 50 and 61 are excluded, the RMSE of EUC-Kriging, STD-Kriging, and CAT-Kriging is 15.4, 9.4, and 9.3, respectively. Figure 11 shows the relationships between the predicted and the original values. From the comparisons, the predicted values of CAT-Kriging tend to be positioned on the red line, indicating the position of one-to-one correspondence, compared to other methods.

Fig. 11
figure 11

Original values versus predicted values based on cross-validation for a EUC-Kriging, b STD-Kriging, and c CAT-Kriging. The red line indicates the position of one-to-one correspondence. The blue line with the equation represents the trend line of points

Application to real-world datasets

Study area

For a real-world application, the area of Yeonhwa II mine (37°07′N, 129°08′E), which is located in Samcheok-si, a city in South Korea, was selected as the study area. Two tailings dams (Fig. 12a) hold mine tailings, which include mainly lead (Pb) and zinc (Zn). The average annual rainfall of the study area is 1315.1 mm, and 54.8% of the rainfall is concentrated during summer (July to September), causing surface water contamination by erosion of mine tailings and mine water leaks (MIRECO 2007). Figure 12b shows a DEM (30 m grid spacing) and sampling data of the study area. To generate a DEM, topographical contours (contour interval: 5 m) were extracted from 1:5000 scale topographical maps published by the National Geographic Information Institute of Korea (http://www.ngii.go.kr). A triangulated irregular network surface was created from the topographical contours and converted to a DEM. As seen in a DEM, the western part of the study area is higher than the eastern part, where the main stream is flowing to the East Sea. The circles denote stream sediment samples, and the squares represent farmland soil samples. The heavy metal concentrations in the stream sediment samples are presented in Table 2. The Zn concentrations of 14 samples out of 22 samples exceed the US standard of sediment removal. The other eight samples were collected from the uncontaminated stream (i.e., not the main stream). The Zn concentrations become smaller on the whole as the distance from Yeonhwa II mine increases. Therefore, Zn is an indicator for pollution in this area. Table 3 shows the heavy metal concentrations in farmland soil samples near a stream. The concentrations of farmland soil are much smaller than those of the stream sediment, which implies that the major cause of contamination in this area is the stream flow. Although farmland soil samples are close to the stream, to consider the stream sediment samples and farmland soil samples as one dataset is irrational. Therefore, the Zn concentration of the stream sediment samples was selected as a variable in this study.

Fig. 12
figure 12

Location of the study area on the satellite image (Google Earth, left), and DEM and locations of samples (right). Sample numbers in brackets indicate stream sediment samples collected from the uncontaminated stream

Table 2 Heavy metal concentrations of stream sediment samples (italicized number exceeds the soil contamination warning standards of South Korea, and the bold number exceeds the US standard of sediment removal)
Table 3 Heavy metal concentrations of farmland soil samples (italicized number exceeds the soil contamination countermeasure standards of South Korea, and the bold number exceeds the soil contamination warning standards of South Korea)

Prediction for study area

To predict the heavy metal concentration, the distances between samples were first calculated (Table 4) based on Euclidean distance, stream distance, and differences of catchment basin area. To make the real-world and synthetic results comparable, the DEM with 30 m resolution was resampled to 100 m resolution, which is the same as that of the synthetic data. It can be observed from the results in Table 4 that the calculated Euclidean distance, stream distance, and differences in catchment basin area are similar for different DEMs. The stream distance is confirmed to be longer than the Euclidean distance for the same sample set. In cases of stream distance and catchment basin area, some samples are not connected with each other. If the relative distances between samples and the relative range are the same for different datasets, the prediction results are the same, according to the characteristic of Kriging. Therefore, the number of pixels can be used as a unit of catchment basin area, instead of km2 or m2, regardless of the resolution. The difference of the catchment basin area is noticeably large for some sample pairs despite the short stream distance (e.g., a pair of samples 11 and 12, or the pair of samples 15 and 16). This phenomenon is remarkable where a junction is placed between two samples.

Table 4 Euclidean distance, stream distance, and catchment basin area difference between samples

Figure 13a–c shows the empirical and theoretical variograms of the 30-m-resolution DEM for the three methods (Euclidean distance, stream distance, and catchment basin area), and Fig. 13d–f shows the variograms of the 100-m-resolution DEM. Even though there are a few differences between the two resolutions, the variogram shapes are similar. The exponential model combined with the nugget model was identically applied using weighted least squares regression. The ranges of the three models were defined as 14,000 m, 20,000 m, and 140 km2 for both resolutions. In comparison with the case of synthetic data, the nugget effect is prominent because there is noise in real data and the observed values of the near samples are significantly different in the real world. Kriging prediction was applied to the 100-m-resolution DEM to achieve correspondence with the synthetic data and computational efficiency in analysis.

Fig. 13
figure 13

Empirical and theoretical variograms calculated from a Euclidean distance, b stream distance, and c catchment basin area using 30-m-resolution DEM. Empirical and theoretical variograms calculated from d Euclidean distance, e stream distance, and f catchment basin area using 100-m-resolution DEM

Figure 14 shows the results of Zn prediction using EUC-Kriging (Fig. 14a), STD-Kriging (Fig. 14b), and CAT-Kriging (Fig. 14c). The predicted values of the western part of the study area are more contaminated than the values of the eastern part in all three results. However, in contrast to the other two methods, EUC-Kriging cannot distinguish the uncontaminated stream segments from the contaminated streams. The stream segment of the western part heading toward the southeast is predicted to have larger values than the other stream segments because the main stream samples connected with the stream segment have large values while the uncontaminated sample directly collected from the stream segment is only one. Therefore, these stream segments need additional sampling to prevent the exaggerated prediction. STD-Kriging tends to be more exaggerated than CAT-Kriging.

Fig. 14
figure 14

Prediction of Zn concentrations for the study area using a EUC-Kriging, b STD-Kriging, and c CAT-Kriging

Figure 15 illustrates the predicted concentrations as a graph along the main stream path, which is indicated by a pink line. A to M represent the junctions where other stream segments meet the path. In the case of EUC-Kriging and STD-Kriging, there is slight variation in the predicted values except in near samples due to a little variation and large nugget value in the theoretical variogram (Fig. 13a, b). However, the predicted values of STD-Kriging change drastically at junctions, unlike EUC-Kriging. The predicted values of CAT-Kriging change significantly at junctions and on the whole.

Fig. 15
figure 15

Graph of the predicted Zn concentrations for the study area stream path represented by a pink line. A to M are stream junctions

As a result of cross-validation (Table 5), the number of samples that each method predicted most accurately is 4 for EUC-Kriging, 4 for STD-Kriging, and 14 for CAT-Kriging, respectively. The ME of EUC-Kriging, STD-Kriging, and CAT-Kriging is 37.3, 177.3, and 274.3, MAE is 2954.3, 2337.6, and 2095.3, and RMSE is 3633.0, 3016.7, and 2844.0, respectively. Based on the ME results, the EUC-Kriging result is the most unbiased and the CAT-Kriging result is the most biased. Based on the MAE and RMSE results, the prediction capabilities of STD-Kriging and CAT-Kriging are confirmed to have improved in terms of overall error reduction. Figure 16 shows the relationships between the predicted and the observed values. From the comparisons, CAT-Kriging provides more accurate prediction than the others, particularly in the case of exaggerated predictions of uncontaminated stream segments in Fig. 16b.

Table 5 Cross-validation results of EUC-Kriging, STD-Kriging, and CAT-Kriging for the study area (the bold number is the most similar value to the observed value)
Fig. 16
figure 16

Observed values versus predicted values based on cross-validation for a EUC-Kriging, b STD-Kriging, and c CAT-Kriging. The red line indicates the position of one-to-one correspondence. The blue line with the equation represents the trend line of points

Conclusions

In this study, new prediction methods, namely the STD-Kriging and CAT-Kriging, were developed and applied to synthetic and real-world datasets. These methods predicted the concentrations of stream sediment samples rationally by considering stream networks. In particular, CAT-Kriging reduces the exaggeration problem in predicting the sample of uncontaminated stream segment. As a result of cross-validation, CAT-Kriging obtained the best result for each sample prediction and overall error reduction. This study has several important advantages over existing studies on aquatic variables. First, this study performed a prediction for aquatic variables along the stream network in contrast to studies that apply Kriging on an entire area/region or analyze only the patterns of aquatic variables without prediction. In addition, this study takes flow direction, stream networks, or catchment basin into account by analyzing the topographical conditions based on a DEM. Owing to the raster format basis of the study, the continuous change of variable in a stream segment can be predicted, and there is no necessity for constructing objects or database to preserve information, unlike the vector-based studies. The new software was developed to automate the process in raster format, and it can provide fast analysis by adjusting the variables easily.

A problem exists in predicting the upstream value in the presence of contaminated downstream samples. To reduce the problem, CAT-Kriging was proposed in this study. However, this issue cannot be solved thoroughly if there is no sufficient upstream data in that stream segment. In addition, the upstream value may have some problems in modeling a valid variogram using the catchment basin area instead of the distance. If sufficient samples are available in each stream segment, the application of each variogram and prediction for each stream segment can be a solution for this problem. This study can prioritize environmental hazards and provide useful information for the reclamation of stream networks. Furthermore, this study is universally applicable to abandoned mines, industrial areas, farming areas, and other various environments and can be applied to stream sediments and other aquatic variables. The effectiveness of the prediction is expected to be improved if the stream networks have a complex and sharp bend. Based on the assumption that contaminants disperse through permanent or transient flow, this study can be applied to the case of soil contamination in semi-arid or arid climates. However, if the effect of rainfall is negligible or the shape of the catchment area is not definite, CAT-Kriging may not provide a good prediction. Therefore, it is necessary to apply this methodology to various environments and conditions.