1 Introduction

All soils naturally contain trace levels of metals, so the presence of metals in soil is not in itself indicative of contamination. However, when heavy metal concentrations exceed certain thresholds, they may pose a risk to human health. Heavy metals in soil are relatively stable and are difficult to remove through natural processes [1]. Thus, monitoring and assessing the environmental quality of soils plays an important role in restoring damaged ecosystems and protecting soil quality. Many calculation methods have been proposed to assess the environmental quality of soil, such as the geo-accumulation index [2], principal component analysis [3], the integrated pollution index [4] and the maturity index [5]. In addition, several studies have reported the concentrations of heavy metals in soil samples from different parts of Iran [6–9]. Artificial neural networks (ANNs) [10–18] and support vector machines (SVMs) [19–22] have also been applied in different fields of environmental science. In soil science, some of the most important soil properties that are hard to measure in the field, such as cation exchange capacity (CEC), field capacity (FC) and permanent wilting point (PWP), have recently been predicted from more readily available soil properties, such as particle-size distribution (sand, silt and clay content), organic matter or organic C content, bulk density and porosity, by neural network [23] and SVM [24] methods via pedotransfer functions. Other applications of ANNs and SVMs in soil management include the prediction of soil hydraulic conductivity [25], soil moisture [26, 27] and soil organic carbon [28].

The support vector machine, based on the structural risk minimization (SRM) principle, is a promising method for data mining and has been used for both classification and regression problems. The goal of SVM is to produce a model (based on the training data) that predicts the target values of the test data given only the test data attributes.

The artificial neural network, in turn, is a type of artificial intelligence (computer system) that attempts to mimic the way the human brain processes and stores information. ANNs are considered standard nonlinear estimators, and their abilities have been verified in a variety of fields [29]. Details of neural networks, including different algorithms for network training, can be found in the extensive published literature in this field [30–32].

The main difference between these two artificial intelligence methods lies in the principle used to reduce the generalization error. In SVM, the empirical risk minimization (ERM) principle used in ANN is replaced by the SRM principle, which seeks to minimize an upper bound of the generalization error rather than the training error [33]. According to Peng and Wen [34], the advantages of artificial neural networks over traditional statistical techniques include: (1) neural networks are more accurate than statistical techniques, especially when dealing with incomplete data records; (2) because a neural network can develop its own weighting scheme, it is faster than other statistical techniques; and (3) no prior knowledge is needed during ANN modeling, which makes it a more flexible and powerful tool. In addition to sharing these main features of artificial neural networks, SVM has the advantage of eliminating the local minimum issue of ANN.

To the best of our knowledge, SVM has not been used for the prediction of soil quality in previous studies. This research has two objectives: (a) to study the extent of soil pollution in Shahrood and Damghan, located in Semnan Province, Iran, and (b) to predict the soil pollution index (SPI) by SVM and ANN and to compare the performance of these two methods.

2 Materials and methods

The study site covers an area of about 65,760 km² and has a population of about 325,000 inhabitants. It is located in Semnan Province, in the central part of Iran. Shahrood and Damghan are the main cities in this region. The area is characterized by a mountainous climate, with an average precipitation of 133.7 mm per year and temperatures ranging from about –10 °C in winter to about 40 °C in summer [35].

Mining and coal washing operations are among the dominant activities in this region (e.g., the Tazareh coal mine area is located 31 km from Damghan City) and have a high potential for polluting the soil with heavy metals. To study the soil pollution, a systematic random sampling approach [36] was followed, and a total of 229 soil samples were taken from the top 10 cm of the soil. Samples were collected in clean, dry plastic containers and were protected from contamination until preparation. After drying at 60 °C, the samples were sieved through a 2-mm stainless steel mesh. The soil samples were then digested with nitric acid (HNO3) and hydrochloric acid (HCl) in a ratio of 3:1 (HNO3:HCl). Finally, Ag, Co, Pb, Tl, Be, Ni, Cd, Ba, Cu, V, Zn and Cr were analyzed by inductively coupled plasma optical emission spectroscopy (ICP-OES).

To show the relative magnitude of soil pollution, Nemerow’s synthetical pollution index (\(P_{\text{n}}\)) was calculated for all the soil sampling points. A higher value of \(P_{\text{n}}\) indicates more serious pollution. The index has been used in previous studies [37]. In the present research, it was calculated using the Iranian soil standards of the Department of Environment (DOE) (Table 1) and applied as the soil quality assessment criterion. The index is calculated with the following equations:

$$P_{\text{n}} = \sqrt {\frac{{\left( {\max P_{i} } \right)^{2} + \left( {{\text{average}}\,P_{i} } \right)^{2} }}{2}}$$
(1)

where

$$P_{i} = \frac{{C_{i} }}{{S_{i} }}$$
(2)
Table 1 Descriptive statistics of heavy metals along with the associated standards and the calculated soil pollution index (SPI)

In Eqs. (1) and (2), \(P_{\text{n}}\) is Nemerow’s synthetical pollution index, \(P_{i}\) is the pollution index of the ith heavy metal, \(C_{i}\) and \(S_{i}\) are the measured concentration and the assigned standard of the ith heavy metal, and \(\max P_{i}\) and \({\text{average}}\,P_{i}\) are the maximum and average values of the pollution indices over all considered heavy metals, respectively [37].
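As a worked illustration of Eqs. (1) and (2), the index can be sketched in a few lines of Python. The sketch assumes the commonly used form of Nemerow's index, in which the maximum and the average of \(P_{i}\) are squared and the factor 1/2 lies under the square root; the function name and the example numbers are hypothetical and are not taken from Table 1.

```python
from math import sqrt


def nemerow_index(concentrations, standards):
    """Nemerow's synthetical pollution index for one sampling point.

    concentrations: measured values C_i (e.g., in mg/kg)
    standards:      assigned standard values S_i in the same units
    """
    if not concentrations or len(concentrations) != len(standards):
        raise ValueError("need one standard per measured concentration")
    # Single-factor pollution indices, P_i = C_i / S_i (Eq. 2)
    p = [c / s for c, s in zip(concentrations, standards)]
    p_max = max(p)
    p_avg = sum(p) / len(p)
    # Assumed form of Eq. 1: the 1/2 factor sits under the square root
    return sqrt((p_max ** 2 + p_avg ** 2) / 2)


# Hypothetical three-metal example (not data from this study)
example = nemerow_index([50.0, 20.0, 10.0], [100.0, 20.0, 40.0])
```

With this form, a sample in which every metal sits exactly at its standard value yields an index of 1, which is a convenient sanity check.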

Support vector regression (SVR) was one of the modeling procedures used in this study to predict the SPI, with the concentrations of the 12 heavy metals as features. A popular regression version of SVM, ɛ-SVR, finds a function that deviates by at most ɛ from the observed targets for all the training data and is, at the same time, as flat as possible.
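For reference, the textbook primal problem behind this description can be written as follows. The notation is an assumption introduced here for illustration and does not appear in the original text: \(w\) and \(b\) are the weight vector and bias of the regression function, \(\phi\) is the feature map induced by the kernel, \(\xi_{i}\) and \(\xi_{i}^{*}\) are slack variables, \(C\) is the regularization parameter and \(n\) is the number of training samples.

$$\mathop {\min }\limits_{{w,b,\xi ,\xi^{*} }} \;\frac{1}{2}\left\| w \right\|^{2} + C\mathop \sum \limits_{i = 1}^{n} \left( {\xi_{i} + \xi_{i}^{*} } \right)$$

subject to

$$y_{i} - w^{\text{T}} \phi \left( {x_{i} } \right) - b \le \varepsilon + \xi_{i} ,\quad w^{\text{T}} \phi \left( {x_{i} } \right) + b - y_{i} \le \varepsilon + \xi_{i}^{*} ,\quad \xi_{i} ,\xi_{i}^{*} \ge 0,\quad i = 1, \ldots ,n$$

In this form, flatness corresponds to a small \(\left\| w \right\|^{2}\), and the slack variables measure how far a training point lies outside the ɛ-tube.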

To build the SVR forecasting model, the LIBSVM package developed by Chang and Lin [38] was used. In some previous studies, the RBF, polynomial and linear kernels gave the best results, whereas other known kernels performed poorly; therefore, these three kernel functions were tried in this study.
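The three kernels can be sketched in plain Python as below. The forms follow the kernel definitions documented for LIBSVM (linear, polynomial with `gamma`, `coef0` and `degree`, and RBF with `gamma`); the function names and the default argument values are illustrative assumptions, not the settings used in this study.

```python
from math import exp


def linear_kernel(u, v):
    """K(u, v) = u . v"""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    """K(u, v) = (gamma * u . v + coef0) ** degree"""
    return (gamma * linear_kernel(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    """K(u, v) = exp(-gamma * ||u - v||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return exp(-gamma * squared_distance)


# Hypothetical two-feature vectors, for illustration only
k_lin = linear_kernel([1.0, 2.0], [3.0, 4.0])
k_rbf = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
```

A larger `gamma` makes the RBF kernel narrower, so each training point influences a smaller neighborhood of the input space.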

Besides the kernel function, the performance of SVMs depends directly on parameters such as the regularization parameter (C), the width of the ɛ-insensitive loss function (ɛ) and the gamma parameter (γ), and the results are sensitive to how precisely these parameters are optimized. After the parameters had been optimized for each kernel function, 80 % of the original data were selected randomly as the training set, and the model was trained using the optimized parameters. Finally, SPI was predicted for both the training data (i.e., the re-substitution error) and the test data (i.e., the generalization error).
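The random 80/20 division and the error measure can be sketched as follows. This is a minimal illustration rather than the procedure actually coded by the authors: the function names and the seed are hypothetical, and rounding the training size down is an assumption that happens to reproduce the 183 training samples reported in the results for 229 samples.

```python
import random


def train_test_split_indices(n_samples, train_fraction=0.8, seed=0):
    """Randomly split sample indices into training and test subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for repeatability
    n_train = int(train_fraction * n_samples)  # rounds down (assumption)
    return indices[:n_train], indices[n_train:]


def mean_squared_error(observed, predicted):
    """MSE between observed and predicted SPI values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


train_idx, test_idx = train_test_split_indices(229)
```

Evaluating `mean_squared_error` on the training subset gives the re-substitution error, and evaluating it on the test subset gives an estimate of the generalization error.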

As stated earlier, the results of SVMs hinge on the value of each parameter, so these parameters have to be optimized. There are three parameters for the radial basis function (RBF) kernel [39]: C (the penalty parameter), γ (a tuning parameter controlling the width of the kernel function) and ɛ (the epsilon value). For linear SVMs, the penalty parameter (C) was the only optimized parameter, and for the polynomial kernel, the degree of the polynomial was tuned. A good way of choosing the value of d (the degree of the polynomial kernel) is to start with 1 (a linear model) and increment it until the estimated error ceases to improve [40]. The cross-validation procedure helps to prevent the over-fitting problem. Over-fitting occurs when a forecasting model performs well on the training data but its generalization ability (i.e., its performance on the testing data set) is poor.

The parameters to optimize in each experiment were encoded in a vector, bounded by maximum and minimum values, and tuned with a program written in MATLAB (R2013b). To obtain an independent data set on which the out-of-sample generalization error of the method could be assessed, five-fold cross-validation was applied to the training data.
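The tuning step can be sketched in Python as a search over bounded candidate values scored by five-fold cross-validation. The source does not state the search strategy of the MATLAB program, so the exhaustive grid used here is only a stand-in; `cv_error` is a hypothetical user-supplied callable that trains a model on the training indices and returns the validation MSE.

```python
import itertools
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, folds[i]


def grid_search(param_grid, cv_error, n_samples, k=5):
    """Return the parameter set with the lowest mean cross-validation error.

    param_grid: dict mapping a parameter name to its candidate values
    cv_error:   hypothetical callable (params, training, validation) -> MSE
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        # The same folds are reused for every candidate (fixed default seed)
        errors = [cv_error(params, tr, va) for tr, va in k_fold_indices(n_samples, k)]
        score = sum(errors) / len(errors)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Reusing the same folds for every candidate keeps the comparison between parameter sets fair, since differences in the score then reflect the parameters rather than the data division.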

The other modeling procedure used for the prediction of SPI was the artificial neural network. To keep within the scope of this paper, we limited our survey of ANN models to the feed-forward neural network with one hidden layer. In general, too many hidden nodes may lead to over-fitting, whereas too few nodes in the hidden layer may cause under-fitting [41]. The linear transfer function (i.e., \(y_{i} = x_{i}\)) was used for the output layer, and the following transfer function was used for the hidden layer:

$$y_{j} = \tanh \left( {\mathop \sum \limits_{i = 1}^{d} w_{ij} x_{i} + b_{j} } \right)$$
(3)

where \(w_{ij}\) and \(b_{j}\) are the weight and bias parameters, and the subscripts i and j refer to the input and the neuron, respectively. In addition, the Levenberg–Marquardt algorithm was used to update the weights and biases of the network according to this formula:

$$x_{k + 1} = x_{k} - \left[ {J^{\text{T}} J + \mu I} \right]^{ - 1} J^{\text{T}} e$$
(4)

where J is the Jacobian matrix containing the first derivatives of the network errors with respect to the weights and biases, e is a vector of network errors, I is the identity matrix, x is a vector containing the weights and biases, and µ is a scalar damping parameter. Before the data were introduced to the neural network, they were standardized (i.e., transformed to zero mean and unit standard deviation) according to the following equation:

$$Z_{i} = \left( {x_{i} - \bar{x}_{i} } \right)/s_{i}$$
(5)

where \(\bar{x}_{i}\) and \(s_{i}\) are the mean and standard deviation of the observed variable, respectively, and \(Z_{i}\) is the standardized value. In this study, hidden layer sizes ranging from 5 to 40 nodes were tried, and the ANN with the optimum number of hidden nodes (based on the minimum MSE) was used for out-of-sample SPI prediction.
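The standardization of Eq. (5) and the forward pass of a one-hidden-layer network (Eq. 3 followed by a linear output) can be sketched as below. This is an illustrative sketch only: the function and argument names are hypothetical, the sample standard deviation (n − 1 denominator) is an assumption, and the training of the weights by the Levenberg–Marquardt algorithm is not shown.

```python
from math import sqrt, tanh


def standardize(column):
    """Z-score one feature column: zero mean, unit standard deviation (Eq. 5)."""
    n = len(column)
    mean = sum(column) / n
    # Sample standard deviation with an n - 1 denominator (assumption)
    std = sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / std for x in column]


def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: tanh hidden layer (Eq. 3), then a linear output node.

    hidden_weights[j][i] is the weight from input i to hidden neuron j.
    """
    hidden = [
        tanh(sum(w * xi for w, xi in zip(weights_j, x)) + b_j)
        for weights_j, b_j in zip(hidden_weights, hidden_biases)
    ]
    # Linear transfer function on the output: weighted sum plus bias
    return sum(v * h for v, h in zip(output_weights, hidden)) + output_bias


# Hypothetical one-input, one-neuron network, for illustration only
prediction = forward([1.0], [[1.0]], [0.0], [2.0], 0.0)
```

In practice, the mean and standard deviation would be estimated on the training data and then reused to transform the validation and test data.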

To reduce the risk of over-fitting, which is a common problem in ANN modeling, especially when the number of observations is small in comparison with the number of features, the early stopping algorithm [42] was used. To be comparable with the results of SVMs, an independent data set containing 80 % of the original data was trained 20 times with different divisions of the data into training, validation and testing sets. In the next step, the generalization ability of the ANN was assessed on the rest of the data set. Since different random initial weights may produce different training results, the training over subsamples was performed at a fixed seed value [43]. The mean squared error (MSE) was calculated for both the 80 % of the original data (training data) and the independent data (test data), and the average MSE was regarded as the out-of-sample generalization error in the early stopping method.
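The stopping rule itself can be sketched as follows. The sketch assumes a patience-based criterion, in which training stops once the validation error has failed to improve for a given number of consecutive epochs; the function name and the patience value are illustrative assumptions and are not settings reported in this study.

```python
def early_stopping_epoch(validation_errors, patience=6):
    """Return (best_epoch, best_error) under a patience-based stopping rule.

    validation_errors: validation MSE recorded after each training epoch
    patience:          epochs without improvement tolerated before stopping
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # validation error has stopped improving
    return best_epoch, best_error


# Hypothetical validation curve: improves, then deteriorates
best_epoch, best_error = early_stopping_epoch(
    [1.0, 0.8, 0.6, 0.7, 0.9, 1.1], patience=2
)
```

The weights saved at `best_epoch` would then be the ones used for prediction, since later epochs fit the training data more closely at the expense of the validation error.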

It should be noted that, by applying SVM and ANN to the prediction of SPI, we do not intend to undermine the direct calculation of this index, since most of the formulas proposed for soil pollution indices (e.g., enrichment factor, contamination factor, geoaccumulation index) are simple enough to apply directly. However, these indices have some shortcomings. For example, Nemerow’s index, applied in this study, overemphasizes the influence of the maximum value and does not take the weights of the factors into account. Therefore, modifications such as the introduction of entropy to calculate weights are sometimes necessary to overcome these disadvantages; they make the problem more complex and may introduce unintentional errors during sub-index calculations. In these cases, the application of modeling procedures such as SVM and ANN would be more beneficial. In this research, we used the basic Nemerow’s index formula to simplify the problem.

In addition, data analysis was done using geostatistical methods, as described by Isaaks and Srivastava [44] and Goovaerts [45]. In linear geostatistics, a normal distribution of the variable is desired in order to avoid distortions of the data and low levels of significance. In this study, the distribution of the data was tested for normality by the Kolmogorov–Smirnov (K–S) test. The logarithmic transformation was applied to SPI for further analysis since the raw data did not follow a normal distribution. Semivariogram model selection and model cross-validation were also done using the methods of Goovaerts [45]. The semivariogram, which relates the dissimilarity between paired data values to the distance between each sample pair, was used to quantify the spatial variability of the regionalized variable [44, 45]. The GS+ (v.5.1) software was used to perform ordinary kriging, and mapping was done using the ArcGIS 10.1 software.
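The experimental semivariogram mentioned above can be sketched in Python using the classical estimator, in which half the mean squared difference between paired values is computed for each distance class. This is an illustration of the general technique, not the GS+ implementation; the function name, the binning by equal lag widths and the example values are assumptions.

```python
from math import hypot, log


def empirical_semivariogram(coords, values, lag_width, n_lags):
    """Classical semivariogram estimate for each distance class.

    coords: (x, y) sample locations
    values: variable at each location (e.g., log-transformed SPI)
    Returns one semivariance per lag class, or None for an empty class.
    """
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            distance = hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
            lag = int(distance // lag_width)
            if lag < n_lags:
                sums[lag] += (values[i] - values[j]) ** 2
                counts[lag] += 1
    # gamma(h) = sum of squared differences / (2 * number of pairs)
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]


# Hypothetical SPI values at three locations, log-transformed before analysis
log_spi = [log(v) for v in (0.8, 1.2, 3.5)]
gamma_h = empirical_semivariogram([(0, 0), (1, 0), (2, 0)], log_spi, 1.0, 3)
```

A theoretical model (e.g., spherical or exponential) would then be fitted to these points before kriging; that step is not shown here.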

3 Results and discussion

The descriptive statistics of the analyzed heavy metals and the calculated SPI, along with the associated Iranian standards for heavy metals in soil published by the Iranian Department of Environment, are given in Table 1.

According to this table, the mean value of Ba (307.56 mg/kg) is higher than the assigned standard value. This element varied from 80 to 663.73 mg/kg, which is roughly in the same range as that reported by Eriksson [46] for the agricultural soils of Sweden (383–778 mg/kg) but higher than the values found in Brazilian soils (32.86–128.89 mg/kg) [47] and in the soils of Buffalo, USA (50.9–553 mg/kg) [48]. The mean values of Pb, Zn, Ni, Co and Cd were lower than those reported by Esmaeili et al. [49] for the industrial zone of Isfahan, Iran, which were 34.6, 111.5, 66.2, 14.7 and 0.43 mg/kg, respectively. However, the average value of Cr (91.49 mg/kg) is higher than the mean value (85.9 mg/kg) detected in the latter study. Moreover, the mean values of Zn, Cu, Ni and Cr in this research were higher than those in the soils of China (58.9, 18.9, 20.8 and 49.7 mg/kg, respectively) [50]. In addition, for other heavy metals such as Cr, Ni and V, the mean concentrations are close to their standard values, indicating a possible high risk associated with these heavy metals in the short term. The maximum values for most of these heavy metals exceed the standard level, showing gross pollution in some parts of the study area. For instance, a level as high as 739.35 mg/kg was detected for Cr, which is roughly more than six times the standard value. The calculated Nemerow’s synthetical pollution index (\(P_{\text{n}}\)) is given in the last column of Table 1. With respect to this table and the classification criteria for the soil pollution index (Table 2), it can be concluded that, on average, most of the study area lies in the precaution domain, whereas the maximum value of 9.23 shows that some parts are seriously polluted with heavy metals. The study area was classified according to this index, and the result is illustrated in Fig. 1. This figure shows that the right side of the study area is the most polluted part.

Table 2 Classification criterion for soil pollution index
Fig. 1 The results of interpolation of the soil pollution index in the study area using geostatistical methods

With regard to the geological formations of the study area, the main lithologic units are an ophiolitic complex accompanied by Eocene–Oligocene volcaniclastic and basic rock units. The ophiolitic complex is the main body of ultramafic rocks. High concentrations of some elements such as Cr and Ni have been shown to result from the presence of ultramafic rocks [51]. As mentioned earlier, the concentrations of these heavy metals are near their standard values, implying a possible geological source. On the other hand, mining activities and coal washing, which are prevalent in the area (Fig. 2), are other sources that may be responsible for the elevated levels of some heavy metals. In this regard, Ardejani et al. [52], in their study on the Alborz Sharghi coal washing plant located about 55 km from Shahrood City, reported elevated levels of Fe, Mn, Zn, Cr and Co at a depth of 2 m from the top soil in the vicinity of this plant.

Fig. 2 The observed versus predicted values of soil pollution index (SPI) for modeling with RBF kernel (a) and ANNs with early stopping (b)

Since the performance of learning machines is influenced by their parameters, the sensitivity of SVMs to the input parameters was considered in the current research. For ANNs, the only parameter considered was the number of hidden nodes, which has a great impact on the predictive ability of ANNs [43, 53, 54]. The results for the related parameters of the linear, RBF and polynomial kernels are presented in Tables 3, 4, 5, 6 and 7.

Table 3 The results of parameter optimization for linear kernel SVMs (–p stands for epsilon value)
Table 4 The results of optimization of epsilon value for RBF kernel SVMs (gamma, –g, and regularization parameter, –C, were set to their default values)
Table 5 The results of optimization of gamma parameter for RBF kernel
Table 6 The results of optimization of regularization parameter for RBF kernel
Table 7 The results of optimization of polynomial degree for polynomial kernel

According to Table 3, the best value of the regularization parameter (C) for the linear kernel was 3. Tuning this parameter resulted in MSE values of 0.014 and 0.017 and \(R^{2}\) values of 0.985 and 0.988 for the training and test data, respectively. In turn, the optimal values of ɛ, γ and C were 7, 0.2 and 0.00001, respectively (Tables 4, 5, 6, 7). Finally, the best polynomial degree was 4 for the polynomial kernel. The results of parameter optimization show that the most sensitive parameter for the generalization ability of SVMs is γ. For instance, according to Table 5, changing the value of γ from 0.00001 to 0.01 reduces the \(R^{2}\) of the test data set from 0.97 to 0.1. Since this parameter controls the width of the kernel function, the generalization ability of the kernel hinges on it [55]. According to previous studies, the regularization parameter (C) controls the trade-off between maximizing the margin and minimizing the training error. If C is too small, insufficient stress is placed on fitting the training data; if C is too large, the algorithm over-fits the training data [56, 57]. However, the results of this study showed that this parameter is not as important as γ for the out-of-sample generalization of SVMs. In turn, ɛ-insensitivity prevents the entire training set from meeting the boundary conditions and so allows for the possibility of sparsity in the solution of the dual formulation [55]. Although the performance of the three kernel functions did not differ much, the best results were obtained for the polynomial kernel. Using the optimized parameters, the SVM was therefore trained and tested with this kernel, which resulted in 15 support vectors (the points outside the ɛ-tube) out of 183 training samples.

The results of training ANNs with early stopping (Table 8) indicate that the best generalization ability belongs to a neural network with 15 hidden nodes. As the number of hidden nodes increased further, the generalization ability decreased: for a neural network with 40 hidden nodes, the average MSE of the testing set rose to 1.91, compared with 0.01 for 15 hidden nodes, implying obvious over-training of the model.

Table 8 The results of ANNs with different number of hidden nodes

Overall, the results of ANNs are comparable with those of SVMs; however, the generalization error of SVMs is somewhat lower than that of ANNs with early stopping. Similar results have been obtained by other researchers who compared the performance of SVMs with that of ANNs [29, 55, 58]. To show the performance of these two methods graphically, the predicted values for each method were plotted against the target SPI for the testing set; the results are shown in Fig. 2a and b for the RBF kernel and ANNs, respectively. The correlation coefficients for the RBF kernel and the ANN with early stopping were 0.997 and 0.995, respectively, indicating roughly the same performance of these two methods. Our results are in accordance with those of Dibike et al. [59].
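For readers who wish to reproduce this kind of comparison, the correlation coefficient between observed and predicted SPI can be computed as sketched below. The sketch assumes that the reported coefficient is Pearson's r, which the text does not state explicitly; the function name and the example values are hypothetical.

```python
from math import sqrt


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    covariance = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return covariance / sqrt(var_o * var_p)


# Hypothetical observed and predicted SPI values, for illustration only
r = pearson_r([0.9, 1.4, 2.8, 5.1], [1.0, 1.3, 2.9, 5.0])
```

A coefficient close to 1 indicates that the points lie near a straight line, but it does not by itself show that predictions match observations one to one, so it is best read together with the MSE.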

High dimensionality of the input space is often a serious problem for learning machines. A large training set that provides a good distribution of high-dimensional data is essential for successful learning [43]. As the number of samples was considerably higher than the number of features (about 15 times), over-fitting due to a small data record was not a serious problem. One of the appealing features of SVMs is that the minimum found in the parameter space is always the global one [60]. That is to say, the problem of local minima, which is common during training with ANNs, is avoided in SVMs. Despite the different algorithms available for training an ANN, none of them can guarantee that the global rather than a local minimum will be found in the training process [61]. Therefore, in cases where ANNs and SVMs give comparable results, the use of SVMs is preferable.

4 Conclusion

In this study, two learning machines (ANNs and SVMs) were evaluated and compared for predicting SPI from the concentrations of heavy metals in a study area in Semnan Province, Iran. Since the number of samples (229) was quite high in comparison with the number of features (12 heavy metals), the two models could avoid the risk of over-fitting, and their generalization abilities were high and nearly the same. Overall, the results of ANNs were comparable with those of SVMs; however, the generalization error of SVMs was somewhat lower than that of ANNs with early stopping. To the best of our knowledge, this is the first published work on the use of learning theory to model soil pollution, and the results suggest that, besides ANNs, SVMs can be an efficient modeling procedure in this field in future studies.