1 Introduction

Tight gas is the term commonly used to refer to low-permeability reservoirs that produce mainly dry natural gas. Tight gas reservoirs are often defined as formations with permeability below 0.1 millidarcy (md); many “ultratight” gas reservoirs may have in situ permeability as low as 0.001 md [1]. Tight gas reserves constitute a significant percentage of natural gas resources worldwide and offer tremendous potential for future reserve growth and production. Economical production from these resources depends on a comprehensive understanding of the reservoir and on determining petrophysical parameters such as water saturation.

Water saturation is one of the most challenging petrophysical properties of a hydrocarbon reservoir; it is mainly used to estimate the volume of hydrocarbon in place and to identify pay zones. Many researchers have investigated various methods to determine water saturation [2–4]. This property can be measured directly by routine core analysis (RCAL) or estimated by petrophysical methods. Relationships exist for predicting water saturation in specific formations, such as the Archie equation (shown below), which was formulated for clean sand formations [5, 6]. These models are non-universal, nonlinear empirical relations that must be fitted to real data, and they are applicable only to reservoirs that satisfy the model assumptions. These are the primary reasons for using artificial intelligence techniques such as decision tree models, artificial neural networks (ANN) and support vector machines (SVM) to predict water saturation. Employing these methods reduces the problems of cost and limited generality associated with the empirical water saturation models.
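
For reference, the Archie equation in its common form relates water saturation to resistivity and porosity measurements:

$$S_w = \left(\frac{a\,R_w}{\phi^{m}\,R_t}\right)^{1/n}$$

where $S_w$ is water saturation, $\phi$ is porosity, $R_w$ is the formation water resistivity, $R_t$ is the true formation resistivity, and $a$, $m$ and $n$ are the tortuosity factor, cementation exponent and saturation exponent, respectively, which must be fitted for each formation.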

A decision tree forest is an ensemble learning technique used for classification, regression and other tasks. It operates by building a large number of decision trees during training and outputting the class that is the mode of the individual trees’ classes (classification) or their mean prediction (regression). This method has rarely been used in reservoir characterization and petroleum engineering studies. Anifowose et al. [7] employed the random forest method for prediction of permeability and porosity in an oil reservoir; they also employed this method for petroleum reservoir characterization [8].

In the past few years, the Tree Boost method has emerged as a robust method for predictive data mining. It has been extensively used for regression and classification tasks with continuous and categorical predictors. A detailed technical explanation of Tree Boost can be found in the studies by Friedman et al. [9]. The authors could not find any application of the Tree Boost technique in previous petrophysical and reservoir studies.

An artificial neural network (ANN) is a parallel processor comprising neurons with the ability to perform mathematical calculations through a learning algorithm. The knowledge is encoded in the interconnection weights between the input, hidden and output layers [10]. A neural network is made of nonlinear activation functions; its capability to handle nonlinearity is especially important if the underlying distribution responsible for creating the input–output data is intrinsically nonlinear. The network learns by building an input–output mapping for the learning problem. This supervised learning involves correcting the interconnection weights by means of training samples, each comprising an input signal and a corresponding desired response. The goal is to minimize the difference between the desired response and the response estimated by the network according to a suitable criterion. The training of the network is repeated until the network reaches a predefined accuracy [10, 11].

Artificial neural networks have been used extensively in petroleum engineering studies. ANNs have proven useful in predicting petrophysical and reservoir properties [12–54], and they have also been employed by several researchers to predict water saturation [55–60].

Recently, support vector machines (SVMs) have gained attention in regression and classification tasks because of their excellent generalization performance. The SVM formulation is based on the structural risk minimization (SRM) inductive principle, in which the empirical risk and the Vapnik–Chervonenkis (VC) confidence interval are minimized simultaneously [61–63]. There are at least three reasons for the success of SVM: its ability to learn well with only a very small number of parameters, its robustness against data errors and its computational efficiency. By minimizing the structural risk, SVM works well not only in classification but also in regression [64, 65].

SVM has also gained popularity in petroleum engineering studies and has been used for prediction of reservoir and petrophysical properties [30, 37, 66–76]. Some studies have focused on prediction of water saturation using SVM [38, 77].

Using artificial intelligence and machine learning methods leads to an efficient and general solution for obtaining reservoir properties anywhere in the world. While empirical correlations are applicable only to the reservoirs for which they were developed and may have many limitations, these methods are universal: by updating and tuning their key parameters, they can be applied to any reservoir.

In this research, four techniques, namely Decision Tree Forest, Tree Boost, the multilayer perceptron neural network (MLP) and the support vector machine (SVM), were employed to predict the water saturation of the Mesaverde tight gas reservoir located in the Uinta Basin, USA. These four methods have previously demonstrated excellent performance in many fields of science and technology. Decision Tree Forest and Tree Boost have seldom been employed in petrophysical studies, whereas artificial neural networks and support vector machines have been used in many studies in recent years. The main reason for selecting these methods was to evaluate the Decision Tree Forest and Tree Boost methods in predicting water saturation and to compare their performance with that of the MLP neural network and SVM. The results obtained with the different techniques were compared, and the best predictor of water saturation was determined.

1.1 Decision Tree Forest

A Decision Tree Forest consists of an ensemble of decision trees whose predictions are combined to make the overall prediction for the forest. A decision tree forest is similar to a Tree Boost model in the sense that a large number of trees are grown. However, Tree Boost generates a series of trees, with the output of one tree going into the next tree in the series. In contrast, a decision tree forest grows a number of independent trees in parallel, and they do not interact until after all of them have been built. A schematic of a Decision Tree Forest model is presented in Fig. 1.

Fig. 1 Schematic of Decision Tree Forest

Both Tree Boost and decision tree forests produce high-accuracy models. Experiments have shown that Tree Boost works better with some applications and decision tree forests with others, so it is best to try both methods and compare the results. The Decision Tree Forest technique used here is an implementation of the “random forest” algorithm developed by Breiman [78].

Decision Tree Forest models are among the most accurate models yet devised. They can be applied to both regression and classification tasks, and they often achieve a degree of accuracy that cannot be obtained using a large single-tree model. The method can handle hundreds or thousands of potential predictor variables.

Decision tree forests use the “out-of-bag” data rows for validation of the model. This provides an independent test without requiring a separate data set or holding back rows from the tree construction. About one-third of data rows are excluded from each tree in the forest, and each tree will have a different set of out-of-bag rows.
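
This out-of-bag validation can be illustrated with a minimal sketch, using scikit-learn’s random forest as a stand-in for the DTREG implementation used in this study; the data here are synthetic:

```python
# Illustrative sketch of out-of-bag validation with scikit-learn's random
# forest (the study itself used DTREG; data and sizes here are hypothetical).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # stand-in for the five log inputs
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)

# With bootstrap sampling, roughly one-third of the rows are left out of
# each tree; oob_score=True scores the forest on those held-out rows.
forest = RandomForestRegressor(n_estimators=100, oob_score=True,
                               bootstrap=True, random_state=0)
forest.fit(X, y)
print("Out-of-bag R^2:", forest.oob_score_)
```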

In many cases, decision tree forests do not have a problem with over-fitting. Generally, the more trees in the forest, the better the fit. The randomization element in the decision tree forest algorithm makes it highly resistant to over-fitting.

The primary disadvantage of decision tree forests is that the model is complex and cannot be visualized like a single tree. It is more of a “black box” like a neural network.

1.2 Tree Boost method

Boosting is one of the most popular learning methods; it combines many weak learners to create a single strong learner [79, 80]. Boosting improves the accuracy of a predictive function by applying the function repeatedly in a series and combining the output of each function with weighting so that the total error of the prediction is minimized. In many cases, the predictive accuracy of such a series greatly exceeds the accuracy of the base function used alone. The Tree Boost algorithm is functionally similar to a decision tree forest in that it creates a tree ensemble, but a Tree Boost model consists of a series of trees, whereas a decision tree forest consists of a collection of trees grown in parallel. Tree Boost is also known as “stochastic gradient boosting” and “multiple additive regression trees” (MART).

The Tree Boost algorithm used here was developed by Friedman [81] and is optimized for improving the accuracy of models built on decision trees. Graphically, a Tree Boost model can be represented as shown in Fig. 2.

Fig. 2 Schematic of a Tree Boost model

The first tree is fitted to the data. The residuals (error values) from the first tree are then fed into the second tree, which attempts to reduce the error. This process is repeated through a series of successive trees. The final predicted value is formed by adding the weighted contribution of each tree, as sketched below.
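
The following minimal sketch illustrates this residual-fitting loop with shallow scikit-learn trees, a fixed shrinkage weight and synthetic data; it shows the mechanism only and is not the exact stochastic gradient boosting algorithm used in this study:

```python
# Minimal sketch of the boosting loop described above: each shallow tree is
# fitted to the residuals of the running prediction, and its weighted output
# is added to the ensemble. Synthetic data, for illustration only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

learning_rate = 0.1                     # weighting of each tree's contribution
prediction = np.full_like(y, y.mean())  # start from the mean response
trees = []
for _ in range(400):
    residuals = y - prediction          # errors of the current series
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

def predict(X_new):
    """Sum the weighted contributions of all trees in the series."""
    return y.mean() + learning_rate * sum(t.predict(X_new) for t in trees)
```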

Usually, the individual trees are fairly small (typically three levels deep with eight terminal nodes), but the full Tree Boost additive series may consist of hundreds of these small trees. Tree Boost models often achieve a degree of accuracy that cannot be obtained using a large single-tree model. They can handle hundreds or thousands of potential predictor variables; irrelevant predictors are identified automatically and do not affect the predictive model. Tree Boost uses the Huber M-regression loss function [82], which makes it highly resistant to outliers and misclassified cases. The randomization element in the Tree Boost algorithm makes it highly resistant to over-fitting. Tree Boost can be applied to regression models and k-class classification problems.

The primary disadvantage of Tree Boost is that the model is complex and cannot be visualized like a single tree. It is more of a “black box” like a neural network.

1.3 Multilayer perceptron neural network

Multilayer perceptron (MLP) networks are currently the most widely used neural networks. The MLP is a popular estimator for constructing nonlinear models of data. It consists of an input layer, one or more internal layers of hidden neurons and an output layer; such networks are also called multilayer feedforward (MLFF) networks. The hidden layers are also called internal layers because they receive internal inputs. The network is provided with a training set of patterns with known inputs and outputs. The learning algorithm for this type of network is the back-propagation (BP) algorithm [83, 84]. Learning occurs in the perceptron by changing the connection weights after each piece of data is processed, based on the amount of error in the output compared with the expected result. This is an example of supervised learning and is carried out through back propagation, a generalization of the least mean squares algorithm of the linear perceptron. Figure 3 shows the architecture of an MLP model.

Fig. 3 Multilayer perceptron neural network architecture

In Fig. 3, x_p (N) is the input vector, W_hi (N_h × N) is the matrix of input connection weights, net_p (N_h) is the net input function, O_p (N_h) is the activation function, W_oh (M × N_h) is the matrix of output connection weights and y_p (M) is the output vector. An MLP network generates a nonlinear relationship between inputs and outputs through the interconnection of nonlinear neurons, so the nonlinearity is distributed throughout the network. It requires no assumption about the underlying data distribution when designing the network; hence, the data statistics do not need to be estimated. For an MLP network, the topology, i.e., the number of hidden neurons and the size of the training data set, is important for the solution of a given problem. The network has a strong capability for function approximation, learning and generalization.
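
The forward pass implied by this notation can be sketched as follows; the dimensions are illustrative, and the sigmoid is used as the activation function, as in the models built later in this study:

```python
# Sketch of the forward pass implied by the notation above: net_p = W_hi x_p,
# O_p = sigmoid(net_p), y_p = sigmoid(W_oh O_p). Dimensions are hypothetical.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

N, N_h, M = 5, 12, 1                          # inputs, hidden neurons, outputs
rng = np.random.default_rng(0)
W_hi = rng.normal(scale=0.1, size=(N_h, N))   # input-to-hidden weights
W_oh = rng.normal(scale=0.1, size=(M, N_h))   # hidden-to-output weights

x_p = rng.normal(size=N)                      # one input pattern
net_p = W_hi @ x_p                            # net input to the hidden layer
O_p = sigmoid(net_p)                          # hidden-layer activations
y_p = sigmoid(W_oh @ O_p)                     # network output
```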

1.4 Support Vector Regression

Support vector machines (SVMs) are learning machines that implement the structural risk minimization inductive principle to obtain good generalization from a limited number of learning patterns [37, 63, 67, 85]. Structural risk minimization (SRM) involves the simultaneous minimization of the empirical risk and the VC (Vapnik–Chervonenkis) dimension [85]. The VC dimension of a set of functions is the size of the largest data set that the set of functions can shatter. VC theory has been developed over the last three decades by Vapnik and Chervonenkis [61] and Vapnik [62, 63]; it characterizes the properties of learning machines that enable them to generalize effectively to unseen data [85].

In its present form, the support vector machine was largely developed at AT&T Bell Laboratories by Vapnik and co-workers [86–91]. The support vector (SV) algorithm is a nonlinear generalization of the Generalized Portrait algorithm developed in Russia in the 1960s [92, 93].

SVM is a learning system that uses a high-dimensional feature space. It yields prediction functions that are expanded on a subset of support vectors, and it can generalize intricate gray-level structures using only a small number of support vectors. A version of the SVM for regression, called support vector regression (SVR), has been proposed [91]. The model produced by SVR depends only on a subset of the training data, because the cost function for building the model ignores any training data that are close (within a threshold ε) to the model prediction [85].
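
A minimal illustration of this ε-insensitive behavior, using scikit-learn’s SVR on synthetic data (not the data of this study):

```python
# Minimal illustration of epsilon-insensitive support vector regression:
# training points lying inside the epsilon tube around the fitted function
# do not become support vectors. Synthetic data, for illustration only.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 120).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.05, size=120)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("support vectors used:", model.support_.size, "of", len(y), "points")
```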

2 Geological background

The data set for this study was obtained from the Mesaverde tight gas sandstones located in the Uinta Basin, USA. Mesaverde Group sandstones represent the principal gas-productive sandstone unit in the largest Western US tight gas sandstone basins, including the Washakie, Uinta, Piceance, northern Greater Green River, Wind River and Powder River basins.

The Mesaverde Group is divided into the regressive deposits of the Iles Formation and the overlying, massively stacked, lenticular non-marine Williams Fork Formation. The Iles Formation comprises the lower part of the Mesaverde and contains three marine sandstone intervals: the Corcoran, Cozzette and Rollins. The Williams Fork Formation extends from the top of the Rollins to the top of the Mesaverde. The lower part of the Williams Fork contains coals and is commonly referred to as the Cameo coal interval. Most of the sandstones in the Williams Fork are discontinuous fluvial sands. The stratigraphy of the Mesaverde Group is shown in Fig. 4 [94].

Fig. 4 Cross section showing the stratigraphy of the Mesaverde group [94]

3 Data set

To measure the accuracy of the models, log and core information from four wells located in the Uinta Basin was used. The database is small in terms of the number of data points: well 1 has 190 data points, well 2 has 107, well 3 has 67 and well 4 has 41. It is worth mentioning that prediction methods generally perform better with a large number of training data points; the small number of data points here therefore challenges the methods to demonstrate their ability to predict water saturation from few samples. The data source is reliable, and the data were calibrated before being used in this research.

In this research, both the training and generalization capabilities of the methods were evaluated. To assess training capability, the models were tested with the same data employed in the training procedure. Table 1 presents the pattern of using data from the various wells to evaluate training ability.

Table 1 Well order for evaluating training capability

Furthermore, models trained with the training wells’ data were tested with data from other wells to evaluate their capability to generalize the relationships between the various parameters of the training data set to a new data set in the testing procedure. Table 2 describes the sets of well data used to evaluate the generalization capability of the methods.

Table 2 Well order for evaluating generalization capability

There was no geological preference in selecting the training wells. Wells 1 and 2 were selected as training wells because they had enough data points for the training procedure, whereas wells 3 and 4 could not be selected as training wells because of their small number of data points.

In the training process, the regression methods employed in this research took several parameters as input data and a scalar variable as output, and then tried to establish relations among the input parameters in order to estimate values as close as possible to the output values. In the testing process, the relations obtained in training were used to predict the output variable.

In this research, the input data consist of log data, including the gamma ray log (GR), neutron porosity log (NPHI), deep induction resistivity log (ILD), bulk density log (RHOB) and sonic travel-time log (DT), as input vectors. The scalar output is the core-based water saturation. Table 3 presents the range of values of each parameter, and Fig. 5 shows scatter plots of water saturation versus each well log. The measurement scales of the log data are also given in Table 3. The input measurement scales need not lie within the range of the output values, because the output, water saturation, is a dimensionless fraction.
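
A sketch of how such a data set might be assembled is given below; the file and column names are hypothetical:

```python
# Hypothetical sketch of assembling the input matrix from the five logs and
# the core-based water saturation target; file and column names are invented.
import pandas as pd

logs = pd.read_csv("well_1_logs.csv")            # hypothetical data file
feature_columns = ["GR", "NPHI", "ILD", "RHOB", "DT"]
X_train = logs[feature_columns].to_numpy()       # input vectors
y_train = logs["SW_CORE"].to_numpy()             # core-based water saturation
```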

Table 3 Range of parameter values
Fig. 5 Scatter plots of selected petrophysical logs versus water saturation. a DT versus water saturation. b NPHI versus water saturation. c GR versus water saturation. d RHOB versus water saturation. e ILD versus water saturation

4 Methods

The regression models (Decision Tree Forest, Tree Boost, MLP and SVM) were constructed using the DTREG software. The number of trees for constructing the Decision Tree Forest models was 400; generally, the larger a decision tree forest, the more accurate its predictions. The maximum number of tree levels in the Decision Tree Forest was set to 100, which specifies the maximum depth to which each tree in the forest may be grown. When a tree is constructed in a decision tree forest, a random subset of the predictor variables is selected as candidate splitters for each node; here, two predictors were chosen as candidates for each node split. The regression methods were verified and validated by testing models trained with one well’s data on data from other well(s); the predicted and actual saturation values were then compared using error indexes.
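
As a rough sketch, comparable settings can be expressed in scikit-learn terms; the study itself used DTREG, so the mapping of options is approximate:

```python
# Approximate Decision Tree Forest settings in scikit-learn terms; the study
# used DTREG, so this mapping of options is indicative rather than exact.
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(
    n_estimators=400,   # 400 trees in the forest
    max_depth=100,      # maximum number of levels per tree
    max_features=2,     # two candidate predictors per node split
    oob_score=True,     # out-of-bag validation
    random_state=0,
)
# forest.fit(X_train, y_train); sw_pred = forest.predict(X_test)
```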

For constructing the Tree Boost models, 400 trees were generated in the Tree Boost series, and each tree had 10 levels of splits. The Tree Boost algorithm uses the Huber M-regression loss function to evaluate error measurements for regression models [82]. This loss function is a hybrid of ordinary least squares (OLS) and least absolute deviation (LAD): for residuals smaller than a cutoff point, squared error values are used; for residuals greater than the cutoff point, absolute values are used. The Huber cutoff point was set to 0.1. A tenfold cross-validation resampling technique was used to strike the right trade-off between over-fitting and under-fitting.
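
In scikit-learn terms, a comparable configuration might look as follows; note that scikit-learn parameterizes the Huber transition point through a quantile (alpha) rather than a fixed cutoff value, so the correspondence to the DTREG setting of 0.1 is only approximate:

```python
# Approximate Tree Boost settings in scikit-learn terms. scikit-learn sets
# the Huber transition point via the quantile `alpha`, not a fixed cutoff,
# so the correspondence to the DTREG cutoff of 0.1 is only approximate.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

booster = GradientBoostingRegressor(
    loss="huber",       # hybrid OLS/LAD loss, resistant to outliers
    n_estimators=400,   # 400 trees in the boosting series
    max_depth=10,       # 10 levels of splits per tree
    random_state=0,
)
# Tenfold cross-validation, as used in the study:
# scores = cross_val_score(booster, X_train, y_train, cv=10)
```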

The MLP neural network models constructed here had four layers (one input, two hidden and one output). An algorithm was used to determine the number of neurons in the hidden layers automatically: it builds multiple networks with different numbers of hidden neurons and evaluates how well they fit using cross-validation. Twelve neurons were selected for hidden layer 1 and 4 neurons for hidden layer 2. A tenfold cross-validation method was used for validation. The sigmoid function was selected as the activation function of the hidden layers and the output layer. The conjugate gradient method was used to find the optimal network weights.
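
A rough scikit-learn counterpart of this architecture is sketched below; note that scikit-learn offers no conjugate-gradient solver (lbfgs is used here as a stand-in) and applies a linear rather than sigmoid output layer, so this only approximates the DTREG model:

```python
# Approximate counterpart of the MLP described above: two hidden layers of
# 12 and 4 neurons with sigmoid (logistic) activations. scikit-learn has no
# conjugate-gradient solver, so lbfgs stands in; its output layer is linear.
from sklearn.neural_network import MLPRegressor

mlp = MLPRegressor(
    hidden_layer_sizes=(12, 4),
    activation="logistic",   # sigmoid activation in the hidden layers
    solver="lbfgs",          # stand-in for the conjugate gradient method
    max_iter=2000,
    random_state=0,
)
# mlp.fit(X_train, y_train); sw_pred = mlp.predict(X_test)
```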

For the SVM models, the correct selection of the kernel function is very important. The RBF, sigmoid and linear kernels are common kernel functions that have been employed in many previous studies [37, 66, 73]. In this research, as a first step, a collection of 320 data points from all four wells was used to train SVM models built with the sigmoid and RBF kernels, and the models were then tested against 85 data points. The best results were obtained with the RBF kernel function, so the RBF kernel was used to construct the SVM models. A tenfold cross-validation method was used for validation. The accuracy of an SVM model depends strongly on the correct choice of the parameters C, ε and the kernel parameters; the problem of optimal parameter selection is further complicated by the fact that the SVM model complexity depends on all of these parameters. In constructing an SVM model, the user has to choose the proper kernel function and decide how to adjust its parameters. To find optimal parameter values, two methods were used: grid search and pattern search. A grid search tries values of each parameter across the specified search range using geometric steps. A pattern search starts at the center of the search range and makes trial steps in each direction for each parameter; if the fit of the model improves, the search center moves to the new point and the process is repeated, and if no improvement is found, the step size is reduced and the search is tried again. The pattern search stops when the search step size is reduced to a specified tolerance. When both searches are used, the grid search is performed first; once it finishes, a pattern search is performed over a narrow range surrounding the best point found by the grid search. Ideally, the grid search finds a region near the global optimum, and the pattern search then locates the global optimum by starting in the right region.
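
A sketch of the grid-search stage in scikit-learn terms is given below; the search ranges are invented, and the pattern-search refinement is specific to DTREG and is only described, not implemented, here:

```python
# Sketch of the grid-search stage over geometric steps in C, epsilon and the
# RBF gamma, with tenfold cross-validation; the search ranges are invented.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": np.logspace(-1, 3, 9),         # geometric steps across each range
    "epsilon": np.logspace(-3, 0, 7),
    "gamma": np.logspace(-3, 1, 9),
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=10)
# search.fit(X_train, y_train); best_params = search.best_params_
```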

Different error statistics, including the correlation coefficient (r), root-mean-square error (RMSE) and average absolute error (AAE), were used to evaluate the accuracy of the different models. The mathematical expressions of these error measures are given in Table 4 and reproduced below.

Table 4 Error statistics formulas
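
The standard definitions of these measures are (Table 4 gives the exact forms used in the study):

$$r = \frac{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)\left(\hat{y}_i-\bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^{2}}\,\sqrt{\sum_{i=1}^{n}\left(\hat{y}_i-\bar{\hat{y}}\right)^{2}}},\qquad \mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2}},\qquad \mathrm{AAE}=\frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|$$

where $y_i$ and $\hat{y}_i$ denote the core-measured and predicted water saturations, $\bar{y}$ and $\bar{\hat{y}}$ their means, and $n$ the number of data points.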

5 Results and discussion

The error analysis for the selection of the SVM kernel is presented in Table 5. As can be inferred from the table, the RBF kernel function has the lowest error values among the tested kernel functions, and it was therefore selected as the kernel function of the SVM models in this study.

Table 5 Error measures in different kernel functions of SVM

5.1 Training capability of models

In this section, the training capability of the methods is evaluated to determine which method was trained best. For this purpose, the models defined in Table 1 were tested with the same data used in their training. The correlation coefficients of the four methods (Decision Tree Forest, Tree Boost, MLP and SVM) across the five data sets of Table 1 are compared in Fig. 6.

Fig. 6 Correlation coefficient of Tree Boost, Decision Tree Forest, MLP and SVM in predicting water saturation from previously seen data

As can be seen in Fig. 6, all the employed methods were trained well. SVM achieved the best results in predicting water saturation, although MLP performed better on data set 4. MLP has better training ability than Tree Boost and Decision Tree Forest, and Decision Tree Forest in turn trains better than the Tree Boost technique. The training capability of these methods was also assessed using the AAE and RMSE error measures, whose values are presented in Table 6. The error measures in Table 6 confirm the comparison depicted in Fig. 6 and show that SVM and MLP have better training capability.

Table 6 Comparison of RMSE and AAE error measures between the SVM, MLP neural network, Decision Tree Forest and Tree Boost models for evaluating training capability

5.2 Generalization capability of models

In this section, the generalization capability of the methods is assessed to determine which technique best predicts water saturation from input data not introduced to the model earlier. For this purpose, the models defined in Table 2 were tested with data that had not been used in their training. The correlation coefficients of the four methods (Decision Tree Forest, Tree Boost, MLP and SVM) across the eight data sets of Table 2 are compared in Fig. 7.

Fig. 7 Correlation coefficient of Tree Boost, Decision Tree Forest, MLP and SVM in predicting water saturation from previously unseen data

Based on the results shown in Fig. 7, SVM performs efficiently and better than the other methods. It is also notable that on some data sets other methods achieved a better correlation coefficient than SVM: on data set 1, MLP performed better, and on data set 4, MLP and Decision Tree Forest did. Decision Tree Forest and MLP gave similar results, and both are applicable techniques for the prediction of water saturation, although MLP performed somewhat better than Decision Tree Forest. The weakest results were obtained by the Tree Boost models, which reveals that although Tree Boost has good training capability, it is not a reliable method for predicting water saturation from previously unseen input data.

As can be seen in Fig. 7, the predictions made by all methods on data sets 5 and 8 are more accurate than those made on the other data sets. In these two data sets, the training data were gathered from two wells and thus provide more training data points; as a result, the models for data sets 5 and 8 are well trained, which enables them to predict water saturation better than models trained with fewer data points.

The generalization capability of the methods was also analyzed using the AAE and RMSE error statistics; the values of these errors are presented in Table 7.

Table 7 Comparison of RMSE and AAE error measures between the SVM, MLP neural network, Decision Tree Forest and Tree Boost models for evaluating generalization capability

The error measures in Table 7 confirm the error analyses based on the correlation coefficient. Table 7 is interpreted in detail below:

  • Data set 1 Decision Tree Forest in comparison with SVM, Tree Boost and MLP models has the lowest average error. SVM also has acceptable error measures, but MLP and Tree Boost have large values of errors.

  • Data set 2 SVM in comparison with Decision Tree Forest, Tree Boost and MLP models has the lowest average error. MLP and Decision Tree Forest also have acceptable error measures, but Tree Boost has large values of errors.

  • Data set 3 SVM in comparison with Decision Tree Forest, Tree Boost and MLP models has the lowest average error. MLP and Decision Tree Forest also have acceptable error measures, but Tree Boost has large values of errors.

  • Data set 4 MLP in comparison with the Decision Tree Forest, Tree Boost and SVM models has the lowest average error. SVM and Decision Tree Forest also have acceptable error measures, but Tree Boost has large values of errors.

  • Data set 5 SVM in comparison with Decision Tree Forest, Tree Boost and MLP models has the lowest average error. MLP and Decision Tree Forest also have acceptable error measures, and Tree Boost has better performance in comparison with previous data sets.

  • Data set 6 SVM in comparison with Decision Tree Forest, Tree Boost and MLP models has the lowest average error. MLP and Decision Tree Forest also have acceptable error measures, but Tree Boost has large values of errors.

  • Data set 7 SVM in comparison with Decision Tree Forest, Tree Boost and MLP models has the lowest average error. MLP also has acceptable error measures, but Tree Boost and Decision Tree Forest have large values of errors.

  • Data set 8 SVM in comparison with Decision Tree Forest, Tree Boost and MLP models has the lowest average error. MLP performs better than Decision Tree Forest and Tree boost, but these two methods also have acceptable error measures.

It can be concluded that SVM has the best performance in predicting water saturation: as shown in Figs. 6 and 7 and Tables 6 and 7, SVM has the highest correlation coefficient and the lowest average absolute error and root-mean-square error. MLP and Decision Tree Forest are moderate predictors, but Tree Boost cannot be regarded as a powerful method for predicting water saturation.

Scatter plots of core-based values and values of water saturation predicted by each method on data set 6 are presented in Fig. 8. Figures 9, 10, 11 and 12 present the results of estimating water saturation with all four methods (Tree Boost, Decision Tree Forest, MLP and SVM) on data set 8.

Fig. 8 Scatter plots of predictions made by different methods in data set 6. a Prediction made by Tree Boost. b Prediction made by Decision Tree Forest. c Prediction made by MLP. d Prediction made by SVM

Fig. 9 Comparison of water saturation measured from core and predicted by Tree Boost in data set 8

Fig. 10 Comparison of water saturation measured from core and predicted by Decision Tree Forest in data set 8

Fig. 11 Comparison of water saturation measured from core and predicted by MLP in data set 8

Fig. 12 Comparison of water saturation measured from core and predicted by SVM in data set 8

6 Conclusions

In this study, the support vector machine, multilayer perceptron neural network, Decision Tree Forest and Tree Boost methods were used to predict water saturation in the Mesaverde tight gas sandstones located in the Uinta Basin, USA, and the performances of these methods were compared. The capabilities of the methods in predicting water saturation were evaluated in two categories: training and generalization. The main conclusions of this study are as follows:

  • Support Vector Machine, Multilayer Perceptron Neural Network and Decision Tree Forest are reliable methods for predicting water saturation in tight gas reservoirs.

  • Support Vector Machine has better efficiency in both training and generalization than the other methods.

  • Decision Tree Forest performs better than Tree Boost in the prediction of water saturation, and it gives acceptable results in both training and generalization tasks.

  • RBF is the best kernel function for SVM in the prediction of water saturation.

  • Tree Boost cannot be considered an accurate predictor because of its poor generalization capability.