1 Introduction

As an irreplaceable strategic resource, petroleum plays a critical role in a country's power and economic development, and maintaining and increasing oil production has long been an important energy goal for nations. Driven by rapid economic growth over the past few decades, China's petroleum consumption has risen steadily. At the same time, most of China's major oil fields have entered the middle and late stages of development, where problems such as rising water content and reservoir damage have reduced production capacity to a level that can no longer meet current energy demand [1]. This has led to a severe dependence on foreign oil and gas resources, with China's external oil dependency exceeding 70% in 2020. Large-scale oil imports expose China to geopolitical risks and significantly threaten the country's energy security [2]. Implementing reasonable measures to increase oil production is therefore imperative for China to close its energy gap, stabilize domestic economic development, and alleviate its energy crisis. However, because the available measures are diverse and vary in effectiveness, accurately predicting their effect is crucial for oil field production.

In the field of measure effectiveness prediction, conventional methods such as the water flooding characteristic curve method and the Weng cycle method are limited in applicability by their underlying assumptions and complex formulas [3]. Numerical simulation has been used to predict the effectiveness of measures, but its applicability is restricted by the complex mechanisms of production enhancement measures and the expensive computations involved [4]. Research on machine learning-based measure effectiveness prediction is still at an exploratory stage and focuses mainly on production forecasting; process parameters are rarely included in the evaluation of measures, and the limited number of measure wells has restricted research in this direction primarily to fracturing [5].

In recent years, the revolutionary development of artificial intelligence (AI) technology has attracted widespread attention from various industries owing to its powerful generalization ability and fast response speed [6]. The petroleum industry has also accumulated a large amount of historical production data, and machine learning has shown great potential in petroleum engineering [7]. As a data-driven approach, machine learning can extract information from large volumes of historical data and construct regression or classification prediction models [8]. Supervised machine learning methods such as linear regression, support vector machines, and neural networks have been used to predict production decline, optimize water injection schemes, characterize reservoir permeability, and generate complex geological facies [9].

Based on the above, in order to predict the effect of oil recovery measures accurately and rapidly, this paper proposes a data-driven approach for predicting the effect of oil recovery measures. Advanced machine learning algorithms, including Random Forest (RF), Support Vector Regression (SVR), and Extreme Gradient Boosting (XGBoost), are used to explore the influence of three types of data on the effect of oil recovery measures: geological static parameters, production dynamic parameters, and process parameters. A prediction model is built, and data augmentation is employed to address the problem of insufficient samples, improving the quality of the sample dataset. In the hyperparameter tuning stage, a Bayesian optimization algorithm is introduced to overcome the difficulty of manual parameter tuning and further improve model accuracy. Comparative experiments show that the XGBoost-based prediction model for the effect of oil recovery measures performs best, with an accuracy (R2) on the test set exceeding 90%.

2 Methodology

2.1 Feature Engineering

Feature engineering is the process of taking raw input data and creating new features. To make the raw data more informative, it selects, extracts, and transforms meaningful features from the raw data. Feature engineering involves various techniques, including data cleaning, data normalization, data scaling, data augmentation, data encoding, dimensionality reduction, and feature selection. The source data for this study are actual recorded data from the oil field, whose quality is poor, so feature engineering is crucial when processing the data. In addition to common data cleaning, normalization, and correlation analysis, this paper also employs the SMOTE oversampling technique as a data augmentation method.
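To make these steps concrete, the following minimal sketch illustrates cleaning, normalization, and correlation-based screening with pandas and scikit-learn. The file name, column names, and correlation threshold are illustrative assumptions, not values from this study.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical file of measure-well records; column names are placeholders.
df = pd.read_csv("measure_wells.csv")
df = df.dropna().drop_duplicates()                                   # basic data cleaning

feature_cols = [c for c in df.columns if c != "post_measure_oil"]
df[feature_cols] = MinMaxScaler().fit_transform(df[feature_cols])    # normalize to [0, 1]

# Correlation analysis: keep features whose absolute correlation with the target
# exceeds an assumed threshold of 0.1.
corr = df.corr()["post_measure_oil"].abs()
selected = corr[corr > 0.1].index.drop("post_measure_oil")
```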

Synthetic Minority Over-Sampling Technique (SMOTE)

SMOTE [10] is an approach for constructing classifiers from imbalanced datasets. It addresses imbalanced class distributions by synthesizing new minority-class samples, combining over-sampling of the minority class with under-sampling of the majority class [11]. The specific steps are as follows:

(1) For each minority-class sample \(x_l\), compute its k nearest neighbors among the other minority-class samples \(x_k\), using the Euclidean distance over the m features as the metric:

$$\begin{array}{c}d\left({x}_{l},{x}_{k}\right)=\sqrt{{\sum }_{j=1}^{m}{\left({x}_{lj}-{x}_{kj}\right)}^{2}}\end{array}$$
(1)

(2) Determine a sampling rate based on the imbalance ratio and set a sampling multiplier N. For each minority-class sample x, randomly select several samples from its k nearest neighbors, denoted as \(x_n\).

(3) For each randomly selected neighbor \(x_n\), construct a new sample from the original sample according to the following formula.

$$\begin{array}{c}{x}_{\text{new}}=x+{\text{rand}}\left(0,1\right)\times \left({x}_{n}-x\right)\end{array}$$
(2)
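A small NumPy sketch of this synthesis step (a simplified illustration, not the exact implementation used in this study) is given below; `minority` is assumed to be a matrix whose rows are the minority-class samples.

```python
import numpy as np

def smote_synthesize(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from the minority-class matrix (Eqs. 1-2)."""
    rng = np.random.default_rng(seed)
    new_samples = []
    for _ in range(n_new):
        x = minority[rng.integers(len(minority))]
        # Eq. (1): Euclidean distance from x to every minority-class sample
        d = np.sqrt(((minority - x) ** 2).sum(axis=1))
        neighbors = np.argsort(d)[1:k + 1]              # k nearest neighbors, excluding x itself
        x_n = minority[rng.choice(neighbors)]           # randomly chosen neighbor
        # Eq. (2): interpolate between x and the chosen neighbor
        new_samples.append(x + rng.random() * (x_n - x))
    return np.array(new_samples)
```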

2.2 Regression Prediction Algorithm

Support Vector Regression (SVR)

SVR [12] is a machine learning algorithm used for regression analysis. It is based on the Support Vector Machine (SVM) algorithm and builds models that predict continuous output variables. The basic idea of SVR is to find a hyperplane in a high-dimensional feature space such that as many samples as possible lie within an ϵ-insensitive band around it; this hyperplane is then used to predict the value of the outcome variable from the input features. The SVR problem can therefore be formalized as:

$$\begin{array}{c}\underset{w,b}{min} \frac{1}{2}\parallel w{\parallel }^{2}+C\sum_{i=1}^{m} {l}_{\epsilon }\left(f\left({x}_{i}\right),{y}_{i}\right)\end{array}$$
(3)

Here, C is the regularization constant and \({l}_{\epsilon }\) is the ϵ-insensitive loss function. After introducing slack variables and Lagrange multipliers and setting the partial derivatives to zero, the SVR prediction function can be expressed as:

$$\begin{array}{c}f\left(x\right)=\sum_{i=1}^{m} \left({\widehat{\alpha }}_{i}-{\alpha }_{i}\right)\kappa \left({x}_{i},x\right)+b\end{array}$$
(4)

where \(\kappa \left({x}_{i},x\right)={\phi \left({x}_{i}\right)}^{T}\phi \left(x\right)\) is the kernel function.
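As a minimal usage sketch (assuming scikit-learn and placeholder arrays X_train, y_train, X_test that hold well features and post-measure oil rates), the C, epsilon, and kernel arguments correspond to the regularization constant, the ϵ-insensitive loss in Eq. (3), and the kernel in Eq. (4); the values shown are illustrative only.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Scale features before fitting; C and epsilon values are illustrative.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
svr.fit(X_train, y_train)          # X_train, y_train: placeholder feature matrix and targets
y_pred = svr.predict(X_test)       # predicted post-measure oil production
```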

Random Forest (RF)

RF [13] is a popular ensemble learning algorithm used for classification, regression, and other machine learning tasks. The algorithm combines multiple decision trees to create a “forest” of trees that work together to make predictions. In regression problems, the output of each decision tree is averaged to obtain the final regression result [14]. The specific idea is as follows:

(1) Assuming the training dataset contains N data objects, a training set is constructed by randomly sampling M samples with replacement using the bootstrap method, so that the bootstrap samples are not completely identical to one another.

(2) Assuming each sample has X features, a subset of x (x ≤ X) features is randomly selected from all the features, and the best splitting attribute among them is chosen at each node when growing a CART decision tree; x remains constant throughout the tree-growing process.

(3) Repeat the above steps to build B CART trees, and obtain the final prediction by averaging the outputs of these decision trees:

$$\begin{array}{c}\widehat{f}=\frac{1}{B}\sum_{b=1}^{B} {f}_{b}\left({x}^{\mathrm{^{\prime}}}\right)\end{array}$$
(5)
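A minimal scikit-learn sketch of this procedure follows; n_estimators plays the role of B and max_features the role of the random feature subset size x, with illustrative values rather than the hyperparameters used in this study.

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=200,        # B bootstrap trees (illustrative)
    max_features="sqrt",     # size x of the random feature subset at each split
    random_state=0,
)
rf.fit(X_train, y_train)     # placeholder training data
y_pred = rf.predict(X_test)  # average of the B tree outputs, as in Eq. (5)
```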

eXtreme Gradient Boosting (XGBoost)

XGBoost [15] is a highly efficient gradient boosting decision tree algorithm that follows the boosting ensemble idea: multiple weak learners are combined into a strong learner. Its algorithmic process is as follows:

(1) Initialize the model with a constant value:

    $$\begin{array}{c}{\widehat{f}}_{\left(0\right)}\left(x\right)=\underset{\theta }{{\text{arg}}min}\sum_{i=1}^{N} L\left({y}_{i},\theta \right)\end{array}$$
    (6)
(2) Compute the gradients and Hessians:

    $$\begin{array}{c}{\widehat{g}}_{m}\left({x}_{i}\right)={\left[\frac{\partial L\left({y}_{i},f\left({x}_{i}\right)\right)}{\partial f\left({x}_{i}\right)}\right]}_{f\left(x\right)={\widehat{f}}_{\left(m-1\right)}\left(x\right)}\end{array}$$
    (7)
    $$\begin{array}{c}{\widehat{h}}_{m}\left({x}_{i}\right)={\left[\frac{{\partial }^{2}L\left({y}_{i},f\left({x}_{i}\right)\right)}{\partial f{\left({x}_{i}\right)}^{2}}\right]}_{f\left(x\right)={\widehat{f}}_{\left(m-1\right)}\left(x\right)}\end{array}$$
    (8)
(3) Train a base learner on the training set by solving the following optimization problem:

    $$\begin{array}{c}{\widehat{\phi }}_{m}=\underset{\phi \in\Phi }{{\text{arg}}min}\sum_{i=1}^{N} \frac{1}{2}{\widehat{h}}_{m}\left({x}_{i}\right){\left[-\frac{{\widehat{g}}_{m}\left({x}_{i}\right)}{{\widehat{h}}_{m}\left({x}_{i}\right)}-\phi \left({x}_{i}\right)\right]}^{2}\end{array}$$
    (9)
    $$\begin{array}{c}{\widehat{f}}_{m}\left(x\right)=\alpha {\widehat{\phi }}_{m}\left(x\right)\end{array}$$
    (10)
(4) Update the model:

    $$\begin{array}{c}{\widehat{f}}_{\left(m\right)}\left(x\right)={\widehat{f}}_{\left(m-1\right)}\left(x\right)+{\widehat{f}}_{m}\left(x\right)\end{array}$$
    (11)
(5) Output the final model:

    $$\begin{array}{c}\widehat{f}\left(x\right)={\widehat{f}}_{\left(M\right)}\left(x\right)=\sum_{m=0}^{M} {\widehat{f}}_{m}\left(x\right)\end{array}$$
    (12)
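A minimal sketch using the xgboost library is shown below; the hyperparameter values are illustrative placeholders (the tuned values are discussed in Sect. 3.3), with n_estimators corresponding to the number of boosting rounds M and learning_rate to the shrinkage factor α in Eq. (10).

```python
from xgboost import XGBRegressor

xgb = XGBRegressor(
    n_estimators=300,               # number of boosting rounds M (illustrative)
    learning_rate=0.1,              # shrinkage factor alpha in Eq. (10)
    max_depth=5,
    objective="reg:squarederror",   # squared-error objective solved with gradients and Hessians
)
xgb.fit(X_train, y_train)           # placeholder training data
y_pred = xgb.predict(X_test)
```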

2.3 Optimization Algorithm

An optimization algorithm minimizes or maximizes an objective function, subject to given constraints, by finding one or more optimal or near-optimal solutions. This paper introduces a tree-structured Bayesian optimization algorithm to tune the hyperparameters of the production enhancement effect prediction model, which avoids the difficulty of finding the optimal prediction model through manual tuning and provides a more efficient and effective approach.

Tree-structured Parzen Estimator (TPE)

TPE [16] uses two density functions to define \(p(x|y)\):

$$\begin{array}{c}p(x|y)=\left\{\begin{array}{c}l(x)\hspace{0.25em}\hspace{0.25em}\hspace{0.25em}\hspace{0.25em} \, {\text{i}}{\text{f}} \, y<{y}^{*}\\ g(x)\hspace{0.25em}\hspace{0.25em}\hspace{0.25em}\hspace{0.25em} \, {\text{i}}{\text{f}} \, y\ge {y}^{*}\end{array}\right.\end{array}$$
(13)

In the above equation, \(l(x)\) is estimated from the observations \(\{x^{(i)}\}\) whose loss \(f(x^{(i)})\) is less than \(y^{*}\), while \(g(x)\) is estimated from the remaining observations. The TPE-based method relies on a value of \(y^{*}\) greater than the best observed value of \(f(x)\), so that some points can be used to build \(l(x)\). TPE adopts expected improvement (EI) as the acquisition function. Since the posterior probability \(p(y|x)\) is not modeled directly, the Bayesian formula is employed to transform the acquisition function:

$$\begin{array}{c}{{\text{EI}}}_{{y}^{*}}\left(x\right)={\int }_{-\infty }^{{y}^{*}} \left({y}^{*}-y\right)p(y|x)dy={\int }_{-\infty }^{{y}^{*}} \left({y}^{*}-y\right)\frac{p(x|y)p\left(y\right)}{p\left(x\right)}dy\end{array}$$
(14)

In this equation, \(y^{*}\) represents a threshold value. Let γ = p(y < y*) denote the quantile used in the TPE algorithm to partition the observations into \(l(x)\) and \(g(x)\); the value of γ lies in the range (0, 1). The final simplified formula is:

$$\begin{array}{c}E{I}_{{y}^{*}}\left(x\right)=\frac{\gamma {y}^{*}\mathcal{l}\left(x\right)-\mathcal{l}\left(x\right){\int }_{-\infty }^{{y}^{*}} p\left(y\right)dy}{\gamma \mathcal{l}\left(x\right)+\left(1-\gamma \right)g\left(x\right)}\propto {\left(\gamma +\frac{g\left(x\right)}{\mathcal{l}\left(x\right)}\left(1-\gamma \right)\right)}^{-1}\end{array} $$
(15)
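To illustrate how Eq. (15) turns the two densities into a search rule, the following sketch (a simplified one-dimensional illustration using Gaussian kernel density estimates, not the implementation used in this study) scores candidate hyperparameter values by the EI proxy and returns them best-first.

```python
import numpy as np
from scipy.stats import gaussian_kde

def tpe_rank(observed_x, observed_y, candidates, gamma=0.25):
    """Rank 1-D candidate hyperparameters by the EI proxy of Eq. (15)."""
    y_star = np.quantile(observed_y, gamma)        # threshold so that gamma = p(y < y*)
    good = observed_x[observed_y < y_star]         # observations used to build l(x)
    bad = observed_x[observed_y >= y_star]         # observations used to build g(x)
    l, g = gaussian_kde(good), gaussian_kde(bad)   # kernel density estimates of l(x), g(x)
    score = 1.0 / (gamma + (g(candidates) / l(candidates)) * (1.0 - gamma))
    return candidates[np.argsort(-score)]          # candidates with the highest EI proxy first
```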

3 Experiment and Result

3.1 Introduction to the Dataset

This study collected data on all oil production enhancement measures implemented in a certain block of an oilfield from 2017 to the present, including acidification, unclogging, and water flooding. After selecting wells where the measures were effective and performing data cleaning and correlation analysis, the samples of acidification and unclogging measures proved too small to support machine learning analysis. Therefore, this study ultimately chose water flooding measures as an example for oil production enhancement prediction. The sample database contains 147 wells that achieved increased oil production after water flooding measures were implemented. The input variables include eight geological static features, four production dynamic features, and two measure process features, as shown in Table 1.

Table 1. Feature presentation table.

According to the statistics, three different chemical agents (types A, B, and C) were used in the 147 water flooding (profile control) samples, and the sample sizes of the chemical types differ significantly: 96 samples are of type A, 33 of type B, and 18 of type C. Such an imbalanced sample distribution strongly affects the learning and prediction of a machine learning model. Therefore, this study adopted SMOTE oversampling as a data augmentation method to expand the data so that the three chemical types are represented by the same number of samples. After SMOTE oversampling and screening, 410 samples were obtained and used as the dataset for the prediction model.
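A minimal sketch of this balancing step using the imbalanced-learn implementation of SMOTE is shown below. How the regression target was handled during oversampling is not detailed here; as an assumption, the target column is stacked with the features so that synthetic targets are interpolated along with them.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# X: the 14 input features, y: post-measure oil production, chem_type: agent class A/B/C.
data = np.column_stack([X, y])        # append the target so it is interpolated too (assumption)
data_res, chem_res = SMOTE(random_state=0).fit_resample(data, chem_type)
X_res, y_res = data_res[:, :-1], data_res[:, -1]
```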

3.2 Experimental Setting

To select the optimal predictive model for effects of oil-increasing measures, this study compared three commonly used machine learning algorithms for regression problems in petroleum engineering: Support Vector Regression (SVR), Random Forest (RF), and XGBoost. Predictive models were constructed for each algorithm and their performance was evaluated under multiple loss functions. The best-performing predictive model was determined, and an optimization algorithm was introduced to fine-tune the model's hyperparameters, further enhancing its predictive accuracy.

3.3 Result

Based on the three algorithms, data-driven predictive models for oil-increasing measures were developed using geological static parameters, production dynamic parameters, and process parameters as inputs, and post-measure oil production as output. The training and testing sets were divided in a 9:1 ratio. The prediction results of the different algorithms are shown in Fig. 1. Comparing the models on the same testing set shows that the predictive accuracy of the XGBoost and RF algorithms is significantly higher than that of the SVR algorithm.
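The comparison can be sketched as follows (assuming the resampled dataset X_res, y_res from Sect. 3.1 and default or lightly set hyperparameters; the exact settings used in the experiments are not reproduced here).

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# 9:1 split of the resampled dataset
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.1, random_state=0)

models = {
    "SVR": SVR(kernel="rbf"),
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
    "XGBoost": XGBRegressor(objective="reg:squarederror", random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: MAE={mean_absolute_error(y_test, pred):.3f}, "
          f"MSE={mean_squared_error(y_test, pred):.3f}, "
          f"R2={r2_score(y_test, pred):.3f}")
```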

Fig. 1. The effect comparison of different algorithms

To analyze the prediction performance of the RF and XGBoost algorithms more precisely, this study evaluated them using three metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), and the coefficient of determination (R2), as shown in Table 2. The numerical results show that the XGBoost algorithm outperforms the other algorithms, so XGBoost was selected for further research.

Table 2. Evaluation of three algorithms

The above model obtained its results by manually adjusting the hyperparameters after determining their approximate range with a grid search. However, manual tuning can hardly find the best combination of hyperparameters, and there is still room for improvement. Therefore, this study introduced a Bayesian optimization algorithm based on the Tree-structured Parzen Estimator (TPE) to optimize the hyperparameters of the XGBoost prediction model, with R2 as the objective to be maximized. The optimized hyperparameters are shown in Table 3.
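A minimal sketch of this tuning step with the hyperopt library (which implements TPE) is shown below; the search space, the number of evaluations, and the use of 5-fold cross-validated R2 as the objective are assumptions for illustration, not the exact setup of this study.

```python
from hyperopt import fmin, tpe, hp, Trials
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

# Hypothetical search space; the ranges used in the study may differ.
space = {
    "n_estimators": hp.quniform("n_estimators", 100, 600, 50),
    "max_depth": hp.quniform("max_depth", 3, 10, 1),
    "learning_rate": hp.loguniform("learning_rate", -4, 0),
    "subsample": hp.uniform("subsample", 0.6, 1.0),
}

def objective(params):
    model = XGBRegressor(
        n_estimators=int(params["n_estimators"]),
        max_depth=int(params["max_depth"]),
        learning_rate=params["learning_rate"],
        subsample=params["subsample"],
        objective="reg:squarederror",
    )
    r2 = cross_val_score(model, X_train, y_train, scoring="r2", cv=5).mean()
    return -r2                                  # hyperopt minimizes, so negate R2

best = fmin(objective, space, algo=tpe.suggest, max_evals=100, trials=Trials())
```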

Table 3. Hyperparameter optimization results

The performance of the TPE-XGBoost model for predicting the effect of enhanced oil recovery measures, incorporating the optimization algorithm, is shown in Fig. 2 and Table 4. It can be observed that the introduction of the optimization algorithm improves the performance of the predictive model, with the optimized model outperforming the non-optimized model under all three loss functions. The final predictive accuracy (R2) can exceed 90%.

Table 4. The effect after hyperparameter optimization
Fig. 2. The effect after hyperparameter optimization

4 Conclusion

This article proposes a data-driven method for predicting the effects of oil-increasing measures based on the TPE-XGBoost algorithm. The method first augments the data samples, which alleviates the problem of insufficient sample size. The model comprehensively considers three types of features: geological static parameters, production dynamic parameters, and measure process parameters. It mines their relationship with the oil-increasing effect and automatically optimizes the model hyperparameters to predict daily oil production after the measures; compared with the other algorithms, the prediction accuracy is significantly improved and exceeds 90%. However, the current research is still limited by the insufficient quality of on-site data. In future research, besides obtaining high-quality data at the source, higher-level feature engineering will be the next research focus. Incorporating economic indicators into the machine learning model is also a direction for future work.