Introduction

Digital modernization of the oil and gas industry, particularly in wireline logging, has made reliable and accurate subsurface information accessible more rapidly and cost-effectively. Extracting vital information through the systematic investigation of large datasets supports better decision-making and improves the efficiency, reliability, and productivity of any project, since cost reduction and quality assurance are primary concerns for the industry. Recent developments in advanced statistical, artificial intelligence (AI), and machine learning (ML) methods have significantly improved data interpretation and visualization capabilities, helping to uncover discrete or concealed information and making it easier to observe trends hidden inside these massive datasets (Guan et al. 2019; Hong Li et al. 2020; Doveton and Prensky 1992). These modern analytical techniques are rapid, cost-effective, and capable of retrieving more accurate information from the enormous datasets available within an organization. Recent advancements in AI/ML techniques help data scientists and managers derive meaningful insights from raw data and are exceptionally beneficial for the oil and gas industry (Al-Bulushi et al. 2012; Alkinani et al. 2019; Zanjani et al. 2020). Petrophysical analysis based on well log interpretation, correlated with other techniques such as geological field data, core analysis, and production data, can assist in reservoir rock characterization. Petrophysical properties such as porosity and permeability result from a complex blend of chemical and mechanical sedimentary processes (lithification, diagenesis); their evaluation therefore demands precise analysis of the geological and geophysical character of the rocks (Otoo and Hodgetts 2020; Skalinski and Kenter 2015; Zhong et al. 2020; Yang et al. 2020). The estimation of shale volume reveals the amount of clay in the reservoir formation; an increasing shale content indicates a reduction in both the porosity and the permeability of the reservoir. Porosity is estimated as the fraction of void space over the total rock volume, while permeability is a measure of gas/fluid flow through a porous medium (Singh 2019). Permeability is a function of porosity, shale volume, water saturation, and other reservoir properties (Adeniran et al. 2019). However, accurate prediction of this internal engineering property of the reservoir rock is difficult and requires varied datasets to analyze the effects of reservoir facies heterogeneity under subsurface geological conditions. Generally, Darcy’s law is used to measure the permeability of a drilled core (Wadsworth et al. 2020). This method gives a fairly accurate estimate of core permeability, but it is economically infeasible to extract a core from every drilled well in a field.

Previously published studies demonstrate the calculation of water saturation from well log interpretation for predicting formation permeability using conventional techniques and multi-linear regression (MLR) only (Alger et al. 1963; Das and Chatterjee 2018; Morteza et al. 2014; Wendt et al. 1986). A range of statistical and computer-based algorithms, such as least squares support vector machine (LSSVM), imperialist competitive algorithm (ICA), and artificial neural network (ANN), have been used to investigate and predict permeability from well log data, but these were mostly associated with errors during sensitivity analysis and low R2 scores (Mohaghegh et al. 1994). Additionally, most previous studies either concentrate on a single variable (water saturation) for making correlations or employ ambiguous and unclean datasets (Ahmadi and Chen 2019; Wood 2020). The approach can be improved by employing simple yet powerful algorithms on complete, thorough, and clean datasets that have been treated for outlier removal and that return relevant correlations between dependent and independent variables. Reliable and straightforward relationships between porosity and permeability exist, but their application is limited to homogeneous reservoir rocks with consistent petrophysical properties. Complex reservoir conditions are therefore always associated with uncertain predictions, which lead to poor subsurface correlation (Tixier 1949; Timur 1968; Coats and Dumanoir 1974; Donovan 1984). As a result, empirical and regression modeling techniques have become standard practice in the oil and gas industry for permeability estimation from well logs, especially for wells without reservoir cores. This also reduces human error and the extended time required for laboratory measurements. Modern machine learning algorithms are cost-effective, have better predictive capabilities, and provide any required range of probabilities for data analytics.

The present study focuses on the precise computation of permeability using different regression modeling techniques, especially for scenarios where core data are not available. It demonstrates the application of four machine learning techniques, simple linear regression (SLR), lasso regression (LR), MLR, and support vector regression (SVR), to analyze and forecast the permeability of the Volve oil field petrophysical dataset, with shale volume and porosity as independent parameters. The open-sourced datasets of Equinor’s Volve field (https://data.equinor.com), an oil field on the Norwegian Continental Shelf, have been used in the present study. The field is located 200 km west of Stavanger at the southern end of the Norwegian sector (Fig. 1). The operator Equinor and the Volve license partners, ExxonMobil and Bayerngas, have made the repository of all subsurface and operating data from this oil field available for research and analysis purposes. The petrophysical and coring data used in the current study are taken from two wells, F-15/9-19A and F-15/9-B&BT2, owing to the availability of the selected parameters: initial Klinkenberg-corrected core permeability, porosity, and shale volume. Python programs were developed to apply the SLR, LR, MLR, and SVR methods and analyze the core data of the abovementioned wells. Finally, the predicted outputs are compared with the available test datasets to calculate the accuracy of the designed regression models. These rapid and cost-effective soft-computing regression techniques show outstanding permeability forecasting results, with an accuracy of over 0.99 R2 score (in the case of well F-15/9-B&BT2) using porosity and shale volume as input variables. A comparative analysis is also made between the four regression algorithms mentioned above to identify the best-fit method for assessing large petrophysical datasets. The results indicate that SVR had the best peak performance compared to MLR, SLR, and LR in decreasing bias. However, this does not mean that SVR is inherently superior to the other three methods in every situation.

Fig. 1
figure 1

Location map of the Volve field at the southern end of the Norwegian Continental Shelf (Ravasi et al. 2015). Wells F-15/9-19A and F-15/9-B&BT2, used in the present research work, are situated in the Volve oil field, shown in green on the map within block 15/9

The remainder of this paper is organized as follows. The “Background of the study area and adopted algorithms” section presents the background of the study area and brief details of the adopted algorithms. The “Methodologies adopted” section describes the adopted methodologies and the criteria for machine learning model selection, along with a process flow chart. The “Results and discussion” section deals with the interpretation, description, and comparison of the outcomes achieved using the various algorithms. Finally, the “Conclusions” section summarizes the findings.

Background of the study area and adopted algorithms

Block 15/9 of the Volve field contains proven commercial quantities of hydrocarbons. It lies 200 km west of Stavanger at the southern end of the Norwegian sector, with an average water depth of 80 m, and is situated 5 km north of the Sleipner Vest field. The Jurassic-age Hugin formation acts as the main reservoir unit of this field (Lervik 2006; Otoo and Hodgetts 2020). The thickness of the Hugin formation is estimated to range between 5 and 200 m, with an approximate reservoir depth of 2700–3000 m, which can vary between regions owing to post-depositional erosional processes (Folkestad and Satur 2008). The first successful discovery well in Volve was drilled in 1993, and the plan for development and operation (PDO) was approved in 2005. Initially, field development was planned around a jack-up processing and drilling facility, with the vessel “Navion Saga” used for storing stabilized oil. According to the published literature and the reports of the Norwegian Petroleum Directorate, hydrocarbon production from this field started in 2008, and in 2016 the decommissioning decision was taken after 8.5 years of successful operating life (www.equinor.com), more than twice as long as initially planned. Volve produced at a peak rate of about 56,000 barrels per day and delivered a total of 63 million barrels of oil, recovering 54% of the reserve estimates (Sen and Ganguli 2019). As mentioned in the Introduction, different machine learning techniques are used to analyze the well logs of the Volve oil field; brief details of the algorithms used in the current research work are given below.

Simple linear regression (SLR)

SLR estimates the relationship between a single independent variable and a dependent variable by minimizing the sum of squared differences between the observed and predicted values. A linear equation is fitted that models the relationship between the dependent variable and the independent variable. Herein, a single scalar predictor variable X is used to predict a scalar response variable Y (Table 1). The method fits the linear model that minimizes the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation (Uyanık and Güler 2013).
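As an illustration, a minimal SLR sketch with scikit-learn is shown below; the file name and the column names (HPOR for horizontal porosity, KLOGH for Klinkenberg-corrected permeability) are hypothetical placeholders, not the exact identifiers of the Volve export.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical cleaned core dataset; file and column names are placeholders
df = pd.read_csv("volve_core_cleaned.csv")

X = df[["HPOR"]].values               # single predictor: horizontal porosity
y = np.log10(df["KLOGH"].values)      # response: log of Klinkenberg-corrected permeability

slr = LinearRegression().fit(X, y)
print("intercept:", slr.intercept_, "slope:", slr.coef_[0])
print("R2 on training data:", slr.score(X, y))
```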

Table 1 Regression algorithms with equations and formulas

Lasso regression (LR)

LR, or the least absolute shrinkage and selection operator (lasso), is a regression analysis method that employs both regularization and variable selection to attain the maximum possible prediction accuracy and interpretability from the original data. It selects a reduced set of covariates for use in the model to achieve higher accuracy (Liu et al. 2020).
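A comparable sketch using scikit-learn’s Lasso, reusing X and y from the SLR example above, is shown below; the regularization strength alpha (the λ hyperparameter revisited in the “Results and discussion” section) is given an arbitrary illustrative value.

```python
from sklearn.linear_model import Lasso

# alpha is the lasso regularization strength (lambda); the value here is illustrative only
lasso = Lasso(alpha=0.0001).fit(X, y)
print("intercept:", lasso.intercept_, "coefficient:", lasso.coef_)
print("R2 on training data:", lasso.score(X, y))
```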

Multi-linear regression (MLR)

MLR is an extension of SLR. It estimates the linear relationship between multiple explanatory (independent) variables and a single response variable; the only difference between SLR and MLR is the number of independent variables. MLR rests on the assumption of a linear relationship between the dependent variable and each independent variable (Pereira 2004; Uyanık and Güler 2013), and no major correlation between the independent variables is assumed. Essentially, it is an extension of ordinary least squares that uses more than one independent variable.
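The sketch below extends the previous examples to two independent variables, porosity and shale volume; the VSH column name is again a hypothetical placeholder.

```python
from sklearn.linear_model import LinearRegression

X2 = df[["HPOR", "VSH"]].values       # two predictors: porosity and shale volume
mlr = LinearRegression().fit(X2, y)
print("intercept:", mlr.intercept_)
print("coefficients (HPOR, VSH):", mlr.coef_)
print("R2 on training data:", mlr.score(X2, y))
```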

Support vector regression (SVR)

The objective of SVR is to find a function f(x) that deviates from the obtained targets by at most ε for all the training data while remaining as flat as possible. SVR gives the flexibility to define how much error is acceptable in the model and finds an appropriate line (or hyperplane in higher dimensions) to fit the data (Jap et al. 2015). In contrast to ordinary least squares (OLS), which minimizes the squared error, SVR minimizes the coefficients (the norm of the coefficient vector) subject to the error tolerance ε. Three different kernels are used to represent the trend in SVR: RBF, linear, and polynomial.
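A brief SVR sketch over the same two predictors, trying the three kernels named above, could look as follows; the C and epsilon values are library defaults used for illustration, not tuned settings from this study.

```python
from sklearn.svm import SVR

# Fit SVR with each of the three kernels; hyperparameters are illustrative defaults
for kernel in ("rbf", "linear", "poly"):
    svr = SVR(kernel=kernel, C=1.0, epsilon=0.1).fit(X2, y)
    print(kernel, "R2 on training data:", round(svr.score(X2, y), 3))
```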

Methodologies adopted

The Volve field data were evaluated in several respects to make them suitable for the comparative analysis. A flow chart illustrating the methodology adopted in the current research work is shown in Fig. 2. It highlights the stage-wise flow of data collection from the open-sourced Equinor data repository, the cleaning of null values and outliers, the preparation of the regression models, and the comparison of the evaluated R2 scores of the prepared models.

Fig. 2
figure 2

Flow chart illustrating various stages of data collection, conditioning, and working methodology

The null values and data outliers were removed during preprocessing using Python’s data-handling and scientific libraries (Arnold et al. 2011). Data outliers are unexpected data spikes that differ markedly from the other elements of the dataset. A z-score filter was used to remove data outliers, while null values were eliminated using the pandas .dropna() function, as shown in Appendices 1, 2, and 3.
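A minimal cleaning sketch along these lines is shown below, assuming the logs have already been exported to a flat table; the file name and column list are hypothetical, and the z-score cut-off of 3 is a common convention rather than a value taken from the appendices.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical flat export of the well data; column names are placeholders
df = pd.read_csv("volve_well_logs.csv")

# Eliminate rows containing null values
df = df.dropna()

# Remove outliers: keep only rows whose absolute z-score is below 3 in every selected column
cols = ["HPOR", "VSH", "KLOGH"]
z = np.abs(stats.zscore(df[cols]))
df = df[(z < 3).all(axis=1)]
```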

The comparative analysis between the SLR, LR, MLR, and SVR methods is used to identify the technique that yields results with the highest accuracy. The goal was to establish a data-oriented, accurate estimation of petrophysical parameters from the raw datasets. Essentially, this boils down to establishing a statistical relationship between a response variable (Y) and an explanatory variable (X) (Wiener et al. 1991). This can be done by employing regression analysis to model the distribution of the variables with respect to one another and derive a relationship. The type of model to be used depends on the distribution of Y for a given X: a continuous, normal distribution warrants the use of linear regression; a binary distribution, logistic regression; a Poisson or multinomial distribution, log-linear analysis; and so on. With the help of modeling, we try to estimate the effect of the predictor variable(s) on the magnitude of the response variable.

The methods show a significant dependence on the porosity and shale volume of the system. Theoretically, permeability increases with an increase in porosity and decreases with a higher shale volume, because the increase in shale volume blocks the hydrocarbon flow paths, subsequently reducing the permeability of the system (Yao and Holditch 1993). Permeability is defined as the measure of the ease with which a fluid flows through a porous medium (Fossen and Bale 2007; Jia et al. 2019) and is a critical aspect to be accounted for in any reservoir analysis. Permeability data can be obtained in laboratories (core analysis), in reservoirs (pressure transient tests), and through well logs (Yao and Holditch 1993). However, the conventional methods of estimating permeability are time-consuming (they add non-productive time) and are therefore considered economically unfavorable. Rapid and economically viable AI/ML methods have consequently been used to predict permeability from the available datasets. In the oil and gas sector, this regression approach to permeability estimation can be used to build high-accuracy fluid flow models. Ultimately, the reservoir properties predicted through conventional and ML approaches can be applied in the planning of newly proposed wells to enhance the geological chance of success (Pereira 2004; Wendt et al. 1986).

ML model selection

Every model has its advantages and disadvantages; therefore, it is crucial to discern the type of regression algorithm to be used by first plotting the data on a scatter plot. For simple linear data, a linear regression algorithm can be used, but if the data are non-linear, data transformation can increase the model’s accuracy. In rare cases, if data transformation fails for a non-linear selection model, more complex models can be utilized. The workflow identifies the type of distribution from the scatter plot and recognizes whether it resembles a known mathematical function: if the data resemble a linear function, a linear model is utilized; an exponential model is used for exponential curves; and so on. Forward selection consists of fitting the data to a rudimentary mathematical model, evaluating the data fitness (commonly referred to as the goodness of fit), and eventually moving on to more complicated models to obtain comparatively better correlations. Backward selection also aims at a model with the desired goodness of fit, but the difference between the two approaches is that backward selection begins with the most complicated model that fits the data and then simplifies it as needed. This study employed forward selection, as sketched below. We started with relatively simple, two-dimensional models, including those formed by SLR and LR, and then moved on to more complex, three-dimensional models employing two and, in some cases, three independent variables. Problems in the oil and gas industry can be categorized as statistical or regression problems. Advanced mathematical approaches empower computers with decision-making capability using AI/ML-based mathematical and statistical ways of approaching solutions (Y Liu and Chen 1999; Yang Liu et al. 2019; Zanjani et al. 2020). Understanding the permeability of a system is a critical aspect of hydrocarbon exploitation and production. This involves the regression analysis of data from a drilled core, which is then extrapolated across the whole formation to develop a conceptual model of permeability (Letham and Bustin 2016; J. Li and Sultan 2017).
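A minimal sketch of this forward-selection workflow, reusing the hypothetical column names and the cleaned DataFrame df from earlier sketches, could look as follows: fit a single-predictor model first, add a second predictor, and retain the added complexity only if the held-out R2 improves.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

y = np.log10(df["KLOGH"].values)
X_train, X_test, y_train, y_test = train_test_split(
    df[["HPOR", "VSH"]], y, test_size=0.3, random_state=0)

# Step 1: simple two-dimensional model (porosity only)
m1 = LinearRegression().fit(X_train[["HPOR"]], y_train)
r2_simple = r2_score(y_test, m1.predict(X_test[["HPOR"]]))

# Step 2: more complex three-dimensional model (porosity + shale volume)
m2 = LinearRegression().fit(X_train, y_train)
r2_complex = r2_score(y_test, m2.predict(X_test))

print(f"R2 porosity only: {r2_simple:.3f}; porosity + shale volume: {r2_complex:.3f}")
```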

In the current research work, the four methods discussed above (SLR, LR, MLR, SVR) are used to estimate Klinkenberg-corrected core permeability from porosity and shale volume. Data from wells F-15/9-B&BT2 and F-15/9-19A were used to develop four regression models, and their R2 scores were computed to compare the goodness of fit. In addition, a correlation plot for the two wells was produced using the available petrophysical data (Figs. 3 and 4). This correlation plot lends valuable insight into (i) how the different petrophysical variables are related to one another, on a scale of −1 to 1, in which −1 depicts a strong negative correlation between two parameters and 1 depicts a strong positive correlation, and (ii) the probability distribution of values for each individual variable (seen on the diagonal elements of the correlation-plot matrix). Figure 3 contains a correlation matrix of the data from well F-15/9-19A between the variables logKL, PHIF, density-porosity (PORD), SW, and VSH. The off-diagonal elements of the matrix show the correlation between these parameters, while the diagonal elements show the individual distributions of the data. The lines shown in red represent an elementary trend between the two variables of each plot, and the number at the top left is Pearson’s correlation coefficient for the two datasets: a value tending toward 1.0 represents a strong, direct relationship, while a value tending toward −1.0 represents an inverse relationship.
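A correlation matrix of this kind can be reproduced with pandas, for example as sketched below; the column names mirror the variables listed above but remain placeholders for the actual log mnemonics in the exported dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Pairwise Pearson correlation coefficients between the petrophysical variables
cols = ["logKL", "PHIF", "PORD", "SW", "VSH"]
print(df[cols].corr(method="pearson"))

# Scatter matrix with histograms on the diagonal, similar in spirit to Figs. 3 and 4
pd.plotting.scatter_matrix(df[cols], diagonal="hist", figsize=(8, 8))
plt.show()
```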

Fig. 3
figure 3

Correlation matrix of various parameters of well F-15/9-19A

Fig. 4
figure 4

Correlation matrix of various parameters of well F-15/9 B&BT2

It can be interpreted from the figures that logKL shows a relatively strong dependence on porosity, while inverse relationships of varying strength also exist with water saturation and shale volume. The data in Fig. 4, however, indicate the absence of a relationship with porosity but show a strong negative correlation with shale volume (Tembely et al. 2021). Along the same line of analysis, the correlation matrix of the second well (F-15/9-B&BT2) was prepared and is shown in Fig. 4. Wendt et al. (1986) describe the prediction of permeability as dependent on porosity-related variables, shale volume, and water saturation, in decreasing order of relevance to the trained dataset. It can also be discerned from the correlation matrices in Figs. 3 and 4 that the permeability data show a positive correlation with petrophysical variables such as porosity and water saturation, whereas a negative correlation with shale volume is evident. The criterion for selecting the dataset is the involvement of variables that show a clear correlation with permeability (Figs. 3 and 4). Cleaning the dataset to remove null values and the large number of data outliers is recommended before making any interpretation.

The box plot shown in Fig. 5 represents the dataset and its point distribution over different depth intervals through well F-15/9-19A. For a single interval, the box represents the distribution of values, the horizontal line dividing the box in two represents the median of that subset of data, and the top and bottom whiskers depict the upper and lower limits of the data. Despite the spread of the distribution through depths 3870–3880 m, the plot shows consistency in the value distribution with respect to depth. For optimal results, it is necessary to have a dataset with a narrow point distribution, with the values for a single interval converging toward a single value. While such datasets might not necessarily result from core permeability readings, they can be used as reference datasets when selecting data for modeling.

Fig. 5
figure 5

The box plot describing the distribution of the logarithmic permeability [log(md)] points against depth for well F-15/9-19A. The central line in each box represents the median of the data bin, and the whiskers represent the upper and lower limits
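A depth-binned box plot of this type can be produced roughly as follows; the DEPTH column name and the number of bins are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Bin measured depth into equal-width intervals (bin count is illustrative)
df["depth_bin"] = pd.cut(df["DEPTH"], bins=8)
df["logKL"] = np.log10(df["KLOGH"])

# One box of log-permeability per depth bin, similar in spirit to Fig. 5
df.boxplot(column="logKL", by="depth_bin", rot=45)
plt.xlabel("Depth interval (m)")
plt.ylabel("Permeability [log(md)]")
plt.show()
```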

Results and discussion

A linear trend was observed between horizontal porosity and Klinkenberg-corrected core permeability for SLR and LR, whereas a planar trend was established for SVR and MLR. The outputs of all four methods are compared on the basis of their R2 values. This statistical measure represents the proportion of the variance of the dependent variable that is explained by the independent variable(s) in a regression model. The independent variables in our study are horizontal porosity and shale volume, while the response variable is the Klinkenberg-corrected core permeability.

In the scatter plot establishing a correlation between horizontal porosity and Klinkenberg-corrected core permeability, the permeability (in md) is represented as a logarithmic function due to the non-linearity of the data (Fig. 6). A point of interest is the high negative correlation of shale volume with permeability, which signifies the decrease in permeability seen with an increase in shale volume (Fig. 6). Low porosity could also be attributed to shale volume, which increases the clay content in pores and negatively impacts both porosity and permeability. It is also important to note that the majority of the plotted points lie on a straight line, reflecting the increase in permeability with increasing porosity (Singh 2019). The regression models created during this project were programmed using Python and its modules (scikit-learn for regression, Matplotlib for plotting, and pandas/NumPy/dlisio for data handling). A summary of Equinor’s published mathematical relationships for selected formations using MLR is shown in Table 2. The two-dimensional relations are portrayed in Fig. 7, where the original points are shown as a scatter plot and the formulated correlation is displayed as a regression trend line (in red). Similarly, the three-dimensional relations are displayed in Fig. 8.
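A Fig. 7-style plot can be generated along the following lines, reusing the slr model fitted in the earlier SLR sketch; the axis labels and column names remain hypothetical placeholders rather than the exact plotting code used by the authors.

```python
import matplotlib.pyplot as plt
import numpy as np

# Scatter of the original points and the fitted trend line (cf. Fig. 7a)
x = df["HPOR"].values
y = np.log10(df["KLOGH"].values)
plt.scatter(x, y, s=10, label="core data")

x_line = np.linspace(x.min(), x.max(), 100)
plt.plot(x_line, slr.predict(x_line.reshape(-1, 1)), color="red", label="SLR trend")

plt.xlabel("Horizontal porosity, HPOR")
plt.ylabel("log(K) [log(md)]")
plt.legend()
plt.show()
```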

Fig. 6
figure 6

Scatter plots. a Two-dimensional plot between the logarithm of Klinkenberg core corrected permeability and horizontal porosity. b Three-dimensional plot between the logarithm of Klinkenberg core corrected permeability and horizontal porosity-shale volume

Table 2 Equinor’s published relationships for selected formations using multivariate regression
Fig. 7
figure 7

Two-dimensional plot with trend line shown in red color. a Plotting with the SLR for horizontal porosity and logarithm of Klinkenberg core corrected permeability with the equation log(k)=−1.395 + [0.1671]* HPOR. b Plotting with the LR for horizontal porosity and logarithm of Klinkenberg core corrected permeability with the equation log(k)= −0.6574 + [0.1230]* HPOR

Fig. 8
figure 8

Three-dimensional plot with a trend plane shown as a translucent blue plane. a Plotting with the MLR for horizontal porosity-shale volume and logarithm of Klinkenberg core corrected permeability with the equation log(k)=[1.6728]+ [8.3827]*HPOR-[7.1119]*VSH. b Plotting with the SVR for horizontal porosity-shale volume and logarithm of Klinkenberg core corrected permeability with the equation log(k)= [2.4710] + [3.6827]*HPOR − [6.3984]*VSH

Figure 9 depicts a simple overlay plot of the permeability obtained from core analysis (with overburden correction and correction for the Klinkenberg effect), plotted on top of the values predicted from the two-variable correlation [KLOGH vs PHIF and VSH] used to predict the permeability of the entire Volve field (Table 2), against increasing depth. It can be observed in Fig. 9a that sections A, C, D, and E show varying levels of deviation from the actual permeability data, a consequence of the limitations of using a simple three-dimensional model for prediction, while sections B and F depict a high degree of similarity between the core data and the predicted data. Figure 9b depicts a three-dimensional view of the points predicted by the abovementioned correlation against the actual data from core analysis. In unconventional reservoirs, the permeability and gas-accumulation ability of shale gas reservoirs depend on the capillary pressure difference between sweet spots and the surrounding rocks (Zheng et al. 2020); therefore, given appropriate depth-indexed log curves, accurate machine learning models can be created to predict permeability for similar shale formations throughout the field (Wen et al. 2020).

Fig. 9
figure 9

a Depth vs permeability curve showcasing the variation between predicted permeability and core permeability for well F-15/9-19A. Sections A–F are described in the text. b A three-dimensional scatter plot showcasing the variation between points predicted via MLR and points from core data analysis for permeability, shale volume, and porosity

The results of the comparison between the regression algorithms mentioned above are tabulated in Fig. 10. It was also observed that the goodness of fit could be further boosted by the inclusion of a third independent parameter, water saturation. In an independently formed correlation between permeability, porosity, and shale volume using MLR on a minimal dataset of fewer than 700 data points, a relatively strong correlation was established between the response and the predictor variables, with an R2 score of 0.79. However, the correlation matrices in Figs. 3 and 4 pointed toward the presence of a relationship between the response variable and water saturation. Including water saturation data in this minimal dataset from well F-15/9-19A boosted the R2 score from 0.79 to 0.81, an effect that cannot be neglected considering the limitations of the data. The authors firmly believe that the inclusion of water saturation in regression analyses performed to predict permeability can lead to a substantial accuracy gain over predictions made using only two independent variables.
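The effect of adding water saturation can be checked with a comparison along the following lines; the SW column name and the train/test split are assumptions made for illustration, and the 0.79 and 0.81 scores quoted above come from the authors’ dataset, not from this sketch.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

features_two = ["HPOR", "VSH"]          # two-variable model
features_three = ["HPOR", "VSH", "SW"]  # three-variable model including water saturation

X_tr, X_te, y_tr, y_te = train_test_split(df[features_three], y,
                                          test_size=0.3, random_state=0)

for cols in (features_two, features_three):
    model = LinearRegression().fit(X_tr[cols], y_tr)
    print(cols, "test R2:", round(r2_score(y_te, model.predict(X_te[cols])), 3))
```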

Fig. 10
figure 10

Comparison of the outputs of various regression models at different stages

The results effectively support the claim made by Wendt et al. (1986) that introducing relevant independent variables in the regression greatly increases the goodness of fit. This is made evident by the gap of roughly 0.10 that both MLR and SVR hold over LR and SLR in the “mean after 1000 iterations” category. For LR, a considerable increase from the previous score of 0.659 to 0.665 was noticed once the value of the hyperparameter λ was decreased from 1.0 to 0.0001. According to Valenzuela et al. (2017), SVR results improve with data scaling; however, standard-scaling this dataset with scikit-learn’s StandardScaler resulted in a slight decrease in the model’s accuracy. It is also important to note that although including larger datasets and additional relevant variables will generally improve the model’s goodness of fit, at a certain point it will lead to overfitting, resulting in a model with very low bias that predicts values within an incredibly narrow margin and hence with high variance (a large margin of error on new data). This is why this study focuses on a maximum of two reservoir parameters.
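The two observations above (the λ sensitivity of LR and the effect of scaling on SVR) can be reproduced in spirit with the sketch below, reusing the train/test split from the previous example; the alpha values and pipeline choices are illustrative rather than the exact settings used in this study.

```python
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Lasso sensitivity to the regularization strength lambda (alpha in scikit-learn)
for alpha in (1.0, 0.0001):
    lasso = Lasso(alpha=alpha).fit(X_tr, y_tr)
    print(f"Lasso alpha={alpha}: test R2 = {lasso.score(X_te, y_te):.3f}")

# SVR with and without standard scaling of the inputs
svr_raw = SVR(kernel="rbf").fit(X_tr, y_tr)
svr_scaled = make_pipeline(StandardScaler(), SVR(kernel="rbf")).fit(X_tr, y_tr)
print("SVR (unscaled) test R2:", round(svr_raw.score(X_te, y_te), 3))
print("SVR (scaled)   test R2:", round(svr_scaled.score(X_te, y_te), 3))
```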

Conclusions

All the models prepared during the current research work show great potential for forecasting permeability using well log and core data, but SVR showed better results among the four chosen algorithms (SLR, LR, SVR, and MLR). The performance of machine learning methods is greatly dependent on the quality of the input dataset; therefore, the performance of an algorithm must be examined for each specific dataset and problem. The following conclusions can be drawn from the outputs of the current research work.

  1. The inclusion of water saturation in conventional permeability prediction models can substantially boost prediction accuracy. The use of relevant petrological independent variables in the regression algorithm can significantly enhance the correctness of the models and is recommended for industrial applications.

  2. Despite being explicitly conceived for geophysical studies, LR fails to attain accuracies even remotely close to those of multivariable techniques such as MLR and SVR.

  3. A correlation plot depicting Pearson’s correlation coefficient between well log variables, including horizontal permeability, porosity, and water saturation, was created; it can be used to describe the relationships between these variables and to gain an idea of their distributions.

  4. Considering all the models included in the scope of this paper, MLR remains the best contender for general use in the industry; however, under specific train-test split sizes, the accuracy of SVR can exceed that of MLR, implying the use of SVR in fringe cases that require a high goodness of fit.