1 Introduction

Although interest in sustainable energy is growing, oil and gas will remain the basis of global energy consumption for the next decade. With the depletion of shallow oil and gas resources, the exploration and development of deep oil and gas resources have become increasingly common [1,2,3,4,5]. Increasing well depth greatly raises drilling costs, which makes it necessary to improve drilling efficiency and the rate of penetration (ROP). Accurate prediction of ROP is the first step in optimizing drilling parameters and improving drilling efficiency [6,7,8]. ROP prediction has been studied in drilling engineering for more than half a century, and the existing models fall into two categories: physics-based models and data-driven models.

The traditional physics-based ROP models were widely used because of their simplicity and clear physical background. Through extensive experimental data analysis, Bingham [9] found that weight on bit (WOB), rotary speed in revolutions per minute (RPM), and wellbore diameter had substantial impacts on ROP and put forward a basic ROP equation. Later researchers refined this equation and proposed a number of modified models [10,11,12,13,14,15,16,17] that consider more factors, including formation strength, formation compaction, bit wear, hydraulic parameters, bottom hole pressure, inclination angle, dogleg severity, cuttings bed thickness and concentration, etc. Physics-based ROP models aim to establish an explicit mathematical relationship between drilling parameters and ROP through mechanism analysis and laboratory experiments. However, they suffer from certain limitations. Because the relevant parameters are strongly coupled and nonlinearly related, it is difficult to predict ROP accurately with traditional polynomial equations. In addition, the models contain undetermined coefficients related to bit wear, wellbore cleaning, and rock mechanics. In ultra-deep wells, these deficiencies become more severe and make the physics-based ROP models poorly applicable.

In recent years, machine learning technology has advanced rapidly. Data-driven models do not rely on theoretical analysis and offer advantages such as high flexibility in the choice of input parameters and a strong ability to fit complex nonlinear relationships. ROP prediction using machine learning is receiving increasing attention [18,19,20,21,22]. Since Bilgesu et al. [23] published the first paper on ROP prediction with an artificial neural network (ANN) in 1997, researchers have done much work on ROP prediction with ANN [24]. ANN is an information processing system that imitates the structure and function of neural networks in the human brain. For a given data set, ANN can learn and capture the relationship between input and target parameters, and it is fault tolerant, efficient, and highly adaptable [25]. The powerful nonlinear mapping ability of ANN provides a good solution for ROP prediction. Multilayer perceptron (MLP) and extreme learning machine (ELM) neural networks have been widely adopted [26, 27]. To boost ROP prediction accuracy, researchers started to optimize the ANN structure (the number of hidden layers, the number of neurons in each layer, etc.) [28,29,30]. In addition, other popular machine learning methods, such as random forest (RF) and support vector machine (SVM), have also been used to build ROP prediction models [31,32,33]. Christine et al. [31] compared five machine learning methods (RF, ANN, SVM, ridge regression, and gradient boosting) in ROP prediction.

From the perspective of machine learning, ROP prediction is a regression task. Although neural networks are competitive for regression tasks, building high-performance ROP prediction models remains challenging [19, 34,35,36]: (1) Due to complex downhole conditions and instrument errors, the original data set contains deviations and noise. The quality of the original field data is poor, so it is difficult to train an accurate and reliable model on it. In previous publications, R2 > 0.8 was regarded as good accuracy [24, 37, 38]. (2) As a result of local uplift and subsidence, the depth of the same formation varies between wells. To predict ROP for the whole well, the traditional solution needs to build multiple models according to the geological formations, which is extremely time-consuming [30]. It is necessary to build a single model that both considers geological heterogeneity and can predict ROP for all well sections. (3) With the continuous automation of drilling engineering, the big data generated by drilling operations are growing in diversity and quantity. As the data size greatly increases, there is more hidden information to be extracted. Traditional neural networks have simple structures and cannot fully capture the latent relationships in the big data.

To deal with the above challenges, this paper constructs a new intelligent ROP prediction model. The Savitzky–Golay (SG) algorithm is adopted to filter the data, which improves the data quality. The well logging data are clustered, so that geological formations with similar rock characteristics are classified into the same cluster. The cluster results replace the well logging data and act as a single feature parameter for formation characteristics. To fully exploit the hidden information in drilling big data and achieve high-precision prediction, a 23-layer deep residual neural network (ResNet) is built for the final ROP prediction.

The remainder of this paper is organized as follows. Section 2 introduces the framework of the proposed model. The sub-processes of noise reduction, formation clustering, and ResNet are presented in detail, and the model evaluation indexes are described. Field drilling data from an ultra-deep reservoir in Xinjiang are collected to train and test the model. Section 3 presents the results and discussion, including the test results, ablation experiment, performance comparison with other machine learning methods, and a case study of real-time application. Finally, a conclusion is drawn.

2 Material and Methods

2.1 Input Parameters for ROP Prediction

Many mechanical, hydraulic, and chemical parameters affect ROP. Generally, field parameters can be divided into mud logging parameters and well logging parameters according to the data source. Based on previous research [24, 37, 38], a total of 20 parameters that are closely related to ROP are selected, as shown in Table 1. Since all the ultra-deep wells in the target reservoir are vertical, deviation angle and azimuth are not included in Table 1.

Table 1 Input parameters for ROP prediction

2.2 Model Framework

Figure 1 summarizes the structure and workflow of the model, which is composed of data noise reduction, formation clustering (K-means), and ResNet, and is abbreviated as the DKR model. First, the input drilling data undergo noise reduction to handle outliers and missing values. Then, the well logging parameters are clustered using the K-means algorithm, so that formations with similar rock characteristics are classified into the same cluster. The cluster results replace the well logging parameters and provide a single feature parameter for formation characteristics, which helps to enhance the correlation between formation characteristics and ROP. Last, a ResNet with optimized structure is trained to correlate the feature parameters with ROP.

Fig. 1 Framework of DKR model

2.3 Data Noise Reduction

Data noise reduction aims to improve data quality and ensure the reliability of the model. Drilling data are collected from multiple sources (mud logging and well logging in this work). Due to equipment or human factors, field drilling data are inevitably contaminated by noise, which reduces the model accuracy and increases the training time [39]. Appropriate noise reduction should be performed to improve the consistency and integrity of the data.

In this paper, the SG filter is adopted. It is a filtering algorithm that fits a polynomial to a local interval through least-squares convolution [40], and it has been widely used for data smoothing and noise reduction [41, 42]. It can maintain the shape and width of the original signal while removing the noise. The process of the SG filter is shown in Fig. 2. Within a sliding window of 2m + 1 consecutive data points (the window width), a least-squares fit of order k is conducted, and the value of the fitted curve at the center of the sliding window is taken as the filtered value. The window then moves forward, and the process is repeated until all data points are filtered.

Fig. 2 The process of SG filter

The basic principle of SG filter is as follows:

$$ s_{j}^{*} = \frac{{\sum\nolimits_{i = - m}^{m} {C_{i} S_{j + i} } }}{N} $$
(1)

where S is the original signal; s* is the signal after noise reduction; Ci is the i-th smoothing (convolution) coefficient within the window; N is the sliding window width, N = 2m + 1; j is the index of the j-th point in the data set. Two parameters need to be determined when applying the SG filter: the sliding window width N and the local polynomial fitting order k. Reasonable selection of N and k can reduce signal distortion and ensure filtering quality [43].
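As an illustration of the sliding-window convolution described above, the following minimal Python sketch applies a 5-point SG filter. The window width (m = 2), the quadratic fitting order, and the function name `sg_smooth` are assumptions chosen for illustration and are not the values used in this work. The weights are the classical tabulated 5-point quadratic coefficients (−3, 12, 17, 12, −3), normalized here so that they sum to 1, and the first and last m points are simply left unfiltered.

```python
# Classical 5-point quadratic Savitzky-Golay smoothing coefficients (m = 2).
# Assumption: window width and fitting order are illustrative only.
SG_COEFFS = (-3, 12, 17, 12, -3)
SG_NORM = 35  # sum of the coefficients, so the normalized weights sum to 1


def sg_smooth(signal):
    """Return a smoothed copy of `signal`; the m edge points on each side are kept as-is."""
    m = len(SG_COEFFS) // 2
    smoothed = list(signal)
    for j in range(m, len(signal) - m):
        window = signal[j - m : j + m + 1]
        # Weighted sum over the window, evaluated at the window center j.
        smoothed[j] = sum(c * s for c, s in zip(SG_COEFFS, window)) / SG_NORM
    return smoothed


# A noise-free quadratic trend passes through the filter unchanged ...
trend = [0.1 * x**2 for x in range(10)]
filtered_trend = sg_smooth(trend)

# ... while an isolated spike is damped.
spiky = [1.0, 1.0, 1.0, 6.0, 1.0, 1.0, 1.0]
filtered_spiky = sg_smooth(spiky)
```

A wider window or a lower fitting order smooths more aggressively at the cost of flattening genuine peaks, which is why N and k have to be tuned to the data.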

2.4 Formation Clustering

The purpose of formation clustering is to group geological formations that share similar rock properties. Due to local geological movement, the same formation may appear at different vertical depths in different wells, or may even be absent. To guarantee accuracy, previous machine learning methods have to establish a separate ROP model for each formation (Fig. 3). In this work, well logging data are used to cluster the formations. After formations with similar rock characteristics are assigned to the same cluster, the well logging parameters are replaced by the cluster results as input parameters for ROP prediction. This helps to strengthen the correlation between formation features and ROP and facilitates model training. Most importantly, the ROP prediction of the whole well can be realized with a single model.

Fig. 3 Traditional ROP prediction strategy for different formations in the same well

The process of formation clustering (unsupervised learning) is shown in Fig. 4. Specifically, the K-means algorithm is applied. The algorithm requires the number of clusters k and the initial cluster centers to be specified in advance. Euclidean distance is used to evaluate the similarity between data points: the smaller the distance, the greater the similarity. The location of each cluster center is updated according to the data points currently assigned to it.

Fig. 4 Formation clustering with K-means algorithm

The essence of K-means algorithm is to divide the unlabeled data set X (Eq. 2) into k (k < m) clusters \(C = c^{{({1})}} ,c^{{({2})}} , \ldots ,c^{(k)}\).

$$ X = \begin{bmatrix} x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(m)} \end{bmatrix},\quad x^{(i)} \in R^{n} $$
(2)

where x(i) is the i-th data point in the data set.

Its framework is as follows:

1) Randomly select k cluster center points \(\mu_{1} ,\mu_{2} , \ldots ,\mu_{k} ;\)

2) For each data point x(i), decide the cluster it belongs to:

$$ c^{(i)} := \arg \mathop {\min }\limits_{j} \left\| {x^{(i)} - \mu_{j} } \right\|^{2} $$
(3)

where c(i) is the index of the cluster center closest to x(i), with i = 1, 2, …, m and j = 1, 2, …, k;

3) For each cluster j, recalculate the center μj;

$$ \mu_{j} : = \frac{{\sum\nolimits_{i = 1}^{m} {1\left\{ {c^{(i)} = j} \right\}x^{(i)} } }}{{\sum\nolimits_{i = 1}^{m} {1\left\{ {c^{(i)} = j} \right\}} }} $$
(4)

4) Repeat steps 2) and 3) until convergence. The distortion function, Eq. (5), is the sum of the squared distances from each data point to its cluster center. When J stops decreasing, the clustering has converged. Because Eq. (5) is non-convex, the algorithm may converge to a local minimum, so the clustering process should be repeated several times to ensure consistency.

$$ J(c,\mu ) = \sum\limits_{i = 1}^{m} {\left\| {x^{(i)} - \mu_{{c^{(i)} }} } \right\|^{2} } $$
(5)

In this paper, the elbow method [44] is used to determine a proper k value.
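The four steps above, together with the distortion function and an elbow curve, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the toy two-dimensional data, the function names `kmeans` and `best_kmeans`, the number of restarts, and the choice of initial centers (random data points) are hypothetical and do not reproduce the clustering configuration used in this work.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means; returns (labels, centers, distortion J)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Step 3: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, c in zip(points, labels) if c == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    # Distortion J: sum of squared distances to the assigned centers.
    distortion = sum(sq_dist(p, centers[c]) for p, c in zip(points, labels))
    return labels, centers, distortion


def best_kmeans(points, k, restarts=10):
    """Repeat K-means from several random starts and keep the lowest J."""
    runs = (kmeans(points, k, seed=s) for s in range(restarts))
    return min(runs, key=lambda run: run[2])


# Toy stand-in for logging data: two well-separated groups in a 2-D feature space.
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, J = best_kmeans(toy, k=2)

# Elbow curve: J for k = 1..4; J drops sharply up to the natural number of
# clusters and only slowly afterwards.
elbow = [best_kmeans(toy, k)[2] for k in range(1, 5)]
```

Repeating the run from several starts and keeping the lowest J is one simple way to reduce the effect of local minima; the elbow is then read off the curve of J against k.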

2.5 Deep Residual Neural Network

For drilling big data, simple neural networks are inadequate in learning and generalization ability, and more complex networks have been proposed. A deeper and larger network is expected to better capture the complex nonlinear relationship between the feature parameters and ROP, because additional layers can extract more features. However, increased network complexity does not always improve prediction accuracy: the risks of over-fitting, vanishing gradients, and exploding gradients grow with depth. Traditional deep neural networks therefore struggle to exploit the big data in ROP prediction.

In view of the successful applications of ResNet in solving complicated regression problems with big data [45,46,47], this work uses ResNet [48] to meet the challenges. ResNet adds shortcut connections in the network, which directly transfer the output of the previous layer to a subsequent nonadjacent layer through identity mapping. By modifying the network structure, ResNet can assure model stability and accuracy with the increased network layers.

The residual learning in ResNet is shown in Fig. 5. Suppose that the input of a neural network is x and the expected output is H(x). It is difficult to train the model to find the relationship between x and H(x) directly. In the residual learning scenario, the input x is transferred directly to the output as the initial result through a shortcut connection, and the output becomes H(x) = F(x) + x. If F(x) = 0, then H(x) = x, which is an identity mapping. Thus, the learning objective of ResNet is no longer the complete output, but the residual, i.e., the difference between the target value H(x) and x, F(x) = H(x) − x. When the desired mapping is close to an identity, the stacked layers only need to drive F(x) toward 0, which is easier than fitting the full mapping. This skip structure breaks the convention that the output of one layer can only be used as the input of the next adjacent layer: the output of a previous layer can cross several layers and directly serve as the input of a subsequent layer. Therefore, the accuracy and stability of the model do not decline as the number of network layers increases. The workflow of ResNet is demonstrated in Fig. 6. The input includes the mud logging parameters and the formation cluster label.
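The relation H(x) = F(x) + x can be made concrete with a small pure-Python sketch. It stacks seven residual blocks of three weight layers each and shows that, when all weights are zero so that F(x) = 0, the whole stack reduces to an identity mapping. The layer width, the ReLU activation between weight layers, the absence of an activation after the addition, and the zero initialization are all assumptions made for illustration; this is not the trained network used in this work.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def linear(weights, bias, v):
    """Dense layer: `weights` is a list of rows, one row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def residual_block(x, layers):
    """Compute H(x) = F(x) + x, where F is a stack of dense layers.

    Assumptions: ReLU between the weight layers, no activation after the
    addition, and equal input/output width so the shortcut is an identity.
    """
    out = x
    for idx, (weights, bias) in enumerate(layers):
        out = linear(weights, bias, out)
        if idx < len(layers) - 1:
            out = relu(out)
    return [f + xi for f, xi in zip(out, x)]  # shortcut connection


def zero_layer(width):
    """A dense layer whose weights and biases are all zero."""
    return ([[0.0] * width for _ in range(width)], [0.0] * width)


width = 4  # hypothetical feature width
# Seven blocks of three weight layers each, all zero so that F(x) = 0.
blocks = [[zero_layer(width) for _ in range(3)] for _ in range(7)]

x = [0.5, -1.0, 2.0, 0.0]
h = x
for block in blocks:
    h = residual_block(h, block)

# A single-layer block with identity weights gives F(x) = x, hence H(x) = 2x.
identity_layer = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
doubled = residual_block([1.0, 2.0], [identity_layer])
```

Because an all-zero F leaves the signal untouched, adding further blocks cannot by itself make the representable mapping worse, which is the intuition behind the stability of deep residual networks.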

Fig. 5 ResNet residual learning

Fig. 6 Workflow of ResNet ROP prediction

2.6 Evaluation Index

The model performance is evaluated by mean absolute error (MAE), mean absolute percentage error (MAPE), and coefficient of determination (R2). MAE is the average absolute error between the predicted ROP ypre and the actual ROP y:

$$ {\text{MAE}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} |y_{i}^{{{\text{pre}}}} - y_{i} | $$
(6)

where n is the number of data points; \({y}_{i}^{\mathrm{pre}}\) is the i-th predicted ROP, m/h; \({y}_{i}\) is the i-th real ROP, m/h.

MAPE measures the relative error between ypre and y:

$$ {\text{MAPE}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} \frac{{\left| {y_{i}^{{{\text{pre}}}} - y_{i} } \right|}}{{y_{i} }} $$
(7)

For a high-performance model, MAE and MAPE should be small.

R2 evaluates the fitting performance of the regression model. The closer R2 is to 1, the higher the degree of model fitting. R2 is calculated by:

$$ R^{2} = 1 - \frac{{\mathop \sum \nolimits_{i = 1}^{n} (y_{i}^{{{\text{pre}}}} - y_{i} )^{2} }}{{\mathop \sum \nolimits_{i = 1}^{n} \left( {y_{i} - \frac{1}{n}\mathop \sum \nolimits_{i = 1}^{n} y_{i} } \right)^{2} }} $$
(8)
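The three indexes can be computed with a few lines of Python. The sketch below is illustrative only: the function names and the three sample ROP values are hypothetical, MAPE is returned as a fraction (multiply by 100 for percent), and R2 uses the standard definition in which the total sum of squares is taken about the mean of the measured values.

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the units of y (m/h for ROP)."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction; assumes every y_true > 0."""
    return sum(abs(p - t) / t for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical measured and predicted ROP values in m/h.
measured = [2.0, 4.0, 6.0]
predicted = [2.5, 3.5, 6.0]

mae_value = mae(measured, predicted)    # 1/3 m/h
mape_value = mape(measured, predicted)  # 0.125, i.e. 12.5%
r2_value = r2(measured, predicted)      # 0.9375
```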

2.7 Field Data

This work collects 243,000 sets of data from 40 ultra-deep wells in the NY reservoir, Xinjiang, China as the original data set, which covers all the main geological formations of the reservoir. The typical geological stratifications and well structure of the target reservoir are illustrated in Fig. 7. Part of the data set is presented in Tables 2 and 3. To avoid data leakage, the data set is divided into training and test sets by well rather than by data point. The data from 32 wells are used as the training data set, while the data from the remaining 8 wells comprise the test data set.
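A well-wise split of this kind can be sketched as follows. The record layout (a `well` key per sample), the function name `split_by_well`, the random seed, and the synthetic records are assumptions for illustration; only the 40-well total and the 32/8 division come from the text.

```python
import random


def split_by_well(records, n_test_wells, seed=0):
    """Split samples so that every well is entirely in train or entirely in test."""
    wells = sorted({r["well"] for r in records})
    rng = random.Random(seed)
    test_wells = set(rng.sample(wells, n_test_wells))
    train = [r for r in records if r["well"] not in test_wells]
    test = [r for r in records if r["well"] in test_wells]
    return train, test, test_wells


# Synthetic stand-in: 40 wells with 5 samples each (not the field data).
records = [
    {"well": f"W{w:02d}", "depth": 5000 + 10 * i, "rop": 2.0 + 0.1 * i}
    for w in range(40)
    for i in range(5)
]

train, test, test_wells = split_by_well(records, n_test_wells=8)
train_wells = {r["well"] for r in train}
```

Splitting by data point instead would place neighbouring depths of the same well in both sets, so the test error would overstate how well the model transfers to an undrilled well.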

Fig. 7 Typical geological stratifications and well structure in the target reservoir

Table 2 Part of the drilling data collected from the target reservoir
Table 3 Part of well logging data collected from the target reservoir

3 Results and Discussion

3.1 Model Performance and Analysis

The structure and main parameters of the DKR model are listed in Table 4. N and k (SG filter) are decided by trial and error. The optimal K (formation clustering) is determined to be 23 by the elbow method; this value is obtained after several clustering runs. In ResNet, a total of 7 residual blocks are constructed, and each residual block contains three weight layers. This is the optimal network structure obtained through multiple tests.

Table 4 Key parameters and network structure for DKR model

Figure 8 shows an example of noise reduction using SG filter. It can be seen that the general trends of the processed curves are consistent with the original data. The abnormal values are significantly reduced, and the curves are smoothed. This is beneficial for model training.

Fig. 8 Result of noise reduction

To evaluate the effect of formation clustering, the standard deviation and coefficient of variation of rock compressive strength are calculated for 4 consecutive major formations (K1l, K1h, K1q, and J3k) in the reservoir. The standard deviation and coefficient of variation indicate the degree of dispersion of the data: a greater value means a more dispersed data distribution. The results are presented in Table 5. The 4 formations are classified into 6 different clusters (Clusters 1, 3, 4, 6, 8, and 9). Before clustering, the coefficients of variation of the 4 formations are all larger than 0.1 (weighted average 0.1253), which indicates moderate variation. After clustering, the coefficients of variation are all less than 0.1 (weighted average 0.079), which indicates weak variation. The weighted average of the coefficient of variation is lowered by 36.95%. The data in the same cluster are more homogeneous in terms of rock properties. The results for the remaining formations are similar. In this case, the influence of rock characteristics on ROP can be regarded as identical within the same formation cluster. The cluster label replaces the well logging parameters as the input for ResNet, which effectively reduces the number of input parameters for ROP prediction and facilitates model training.

Table 5 Standard deviation and variation coefficient of compressive strength before and after clustering
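The dispersion statistics behind this comparison can be sketched in Python. The compressive-strength samples below are hypothetical, the use of the population standard deviation is an assumption, and weighting by sample count is likewise an assumption, since the weighting scheme is not specified in the text. Only the two weighted averages (0.1253 and 0.079) are taken from the reported results.

```python
import statistics


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (assumed definition)."""
    return statistics.pstdev(values) / statistics.mean(values)


def weighted_average(values, weights):
    """Weighted mean; the weights are assumed to be sample counts."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def percent_reduction(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100


# Reported weighted-average coefficients of variation before and after clustering.
reduction = percent_reduction(0.1253, 0.079)  # about 36.95

# Hypothetical compressive-strength samples (MPa), for illustration only.
mixed_formation = [80.0, 95.0, 110.0, 120.0, 100.0, 85.0]
one_cluster = [95.0, 100.0, 105.0, 98.0, 102.0]
cv_mixed = coefficient_of_variation(mixed_formation)
cv_cluster = coefficient_of_variation(one_cluster)

# Hypothetical example of combining two groups of 100 and 300 samples.
cv_combined = weighted_average([0.12, 0.08], [100, 300])
```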

To validate the generalization ability of the DKR model, 10 repetitions of training and testing are conducted. In each repetition, the training wells and test wells are re-divided. The prediction deviations vary between repetitions, and the 10 test results are averaged for the wells in the test set, as shown in Table 6. Note that well1 to well8 only denote the position in the test set, not a specific well. The maximum MAE of the DKR model is 0.66 m/h, the minimum MAE is 0.48 m/h, and the average MAE is 0.55 m/h. The maximum MAPE is 12.77%, the minimum is 9.82%, and the average is 11.34%. The absolute deviation and relative deviation are both tolerable. The maximum R2 is 0.94, the minimum R2 is 0.89, and the average R2 is 0.92. The overall fitting accuracy is good. The repeated experiments demonstrate that the DKR model can achieve accurate and stable ROP predictions. Despite small fluctuations, MAE, MAPE, and R2 are maintained at satisfactory values.

Table 6 Average test results of DKR model

Violin plots (Figs. 9 and 10) are used to analyze the error distribution of the DKR model. A violin plot combines the characteristics of a box plot and a density plot, displaying both the data distribution and its probability density. The absolute errors (AEs) generated by the DKR model during the tests are shown in Fig. 9. The box plots indicate the 25% and 75% quantiles. For the test wells, most of the AEs are less than 1 m/h and concentrate around the medians (0.50–0.71 m/h). The distribution density decreases away from its peak near the median, and large AEs are rare. The maximum AE is between 3.01 and 3.68 m/h. The absolute percentage errors (APEs) of the DKR model are shown in Fig. 10. The medians of the relative error range from 9.8 to 13.1%, indicating fair model accuracy. Most of the APEs are distributed between 5 and 15%. The maximum APE ranges from 34 to 41%. There are no extreme outliers in the plots, indicating that the model has good generalization ability and stability.
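The summary statistics that a violin plot displays (quartiles, median, and maximum of the AE and APE) can be computed with the Python standard library. The sketch below is illustrative: the function name `error_summary`, the sample values, and the inclusive quantile method are assumptions, not the procedure used to produce the figures.

```python
import statistics


def summarize(values):
    """Quartiles, median, and maximum of a list of errors."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3, "max": max(values)}


def error_summary(y_true, y_pred):
    """Return summaries of the absolute error and absolute percentage error (%)."""
    ae = [abs(p - t) for t, p in zip(y_true, y_pred)]
    ape = [abs(p - t) * 100 / t for t, p in zip(y_true, y_pred)]  # assumes y_true > 0
    return summarize(ae), summarize(ape)


# Hypothetical measured and predicted ROP values in m/h.
measured = [5.0, 5.0, 5.0, 5.0, 5.0]
predicted = [6.0, 7.0, 8.0, 9.0, 10.0]
ae_stats, ape_stats = error_summary(measured, predicted)
```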

Fig. 9 AE distribution of DKR model

Fig. 10 APE distribution of DKR model

Figure 11 depicts the relationship between predicted and real ROP values. For the DKR model, the determination coefficients are in the range of 0.89–0.94, indicating that the proposed model closely fits the nonlinear relationship between ROP and the feature parameters. Overall, the repeated test results show that the DKR model has satisfactory performance for ROP prediction in ultra-deep wells, with high prediction accuracy, good robustness, and strong generalization ability.

Fig. 11 Comparison of predicted ROP and real ROP

3.2 Ablation Experiment

An ablation experiment in machine learning involves removing certain elements from the data set or model in order to better understand its behavior. In this section, ablation experiments on the DKR model are performed to evaluate the contribution of each sub-process. Specifically, the following combinations are tested: (1) K-means + ResNet, where the data noise reduction is removed; (2) data noise reduction + ResNet, where the formation clustering is removed; (3) data noise reduction + K-means + ANN, where ResNet is replaced by a traditional 23-layer ANN. As before, 10 repetition tests are conducted for each combination, and the average test results are shown in Table 7. The removal of any part of the DKR model leads to some deterioration, and the substitution of ResNet with ANN has the largest impact.

Table 7 Results of ablation experiment

Figure 12 shows the AE and APE of ROP prediction with and without noise reduction. Without noise reduction, the median, interquartile range, and 95% confidence interval of the AE and APE clearly grow. MAE and MAPE increase by 12.73% and 9.0%, respectively, and R2 is reduced from 0.92 to 0.90. This shows that data noise reduction contributes to building a high-precision ROP model and improves the accuracy of ROP prediction to a certain extent.

Fig. 12 Distributions of AE and APE of ROP prediction model with and without data noise reduction

Figure 13 shows the AE and APE of ROP prediction with and without formation clustering. Both models use the SG filter for noise reduction and a ResNet with the same structure. Similar to Fig. 12, the median, interquartile range, and 95% confidence interval of the errors increase significantly when formation clustering is removed. R2 falls to 0.87. The MAE and MAPE of the model without formation clustering (DR model) are 0.69 m/h and 13.47%, compared with 0.55 m/h and 11.34% for the DKR model. This indicates that formation clustering is a crucial contributor to the accuracy of the DKR model.

Fig. 13 Distributions of AE and APE of ROP prediction model with and without formation clustering

The effect of replacing ResNet with an ANN of equal depth (DKA model) is shown in Fig. 14. The AE and APE distributions of the DKA model are significantly wider than those of the DKR model. The MAE and MAPE of the DKA model are 0.81 m/h and 15.55%, which are 47.27% and 37.13% higher than those of the DKR model, respectively. R2 declines to 0.84. This demonstrates that the conventional ANN cannot avoid the loss of accuracy that accompanies an increasing number of network layers. In contrast, ResNet successfully overcomes this challenge.
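As a quick check, the relative increases follow directly from the average errors of the two models, taking the DKR averages of 0.55 m/h (MAE) and 11.34% (MAPE) as the baseline:

$$ \frac{0.81 - 0.55}{0.55} \times 100\% \approx 47.27\% ,\qquad \frac{15.55 - 11.34}{11.34} \times 100\% \approx 37.13\% $$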

Fig. 14 Distributions of AE and APE of ROP prediction model with ResNet and ANN

3.3 Comparison with Other Machine Learning Methods

The performance of the proposed ROP prediction model (DKR) is compared with three benchmark machine learning models (back-propagation neural network (BP), SVM, and RF) [24, 49,50,51,52,53,54]. To ensure consistency, noise reduction and formation clustering are also performed for the three models, and the model parameters are the optimized values obtained by trial and error. Each model is trained and tested 10 times, and the results are averaged. It can be seen from Fig. 15 that the DKR model performs better than BP, SVM, and RF. The AE and APE of DKR are the lowest, while its R2 is the highest. BP shows decent accuracy, but SVM and RF perform poorly on ROP prediction in ultra-deep wells.

Fig. 15 Comparison of DKR and other machine learning models

3.4 Real-time Field Application

Recently, the trained DKR model was applied to real-time ROP prediction for a new ultra-deep well (HT-X) in the reservoir. The prediction started at a measured depth of 5700 m and ended at 7600 m. The test results are shown in Fig. 16. Figure 16a displays the measured and predicted ROP along the measured depth. The predicted ROP curve closely follows the measured ROP curve, although the peaks and troughs are less prominent in the predicted curve. The smoother predicted ROP curve is the result of SG filtering. Figure 16b shows the AE and APE distributions of the ROP prediction. The AE is mostly distributed in the range of 0–0.2 m/h, while the APE generally lies in the range of 5–15%. The maximum and average AEs of the DKR model are 0.77 and 0.17 m/h, respectively, corresponding to APEs of 30.3 and 11.2%. There are only a few unsatisfactorily large errors in the whole ROP prediction process. Figure 16c depicts the relationship between predicted and measured ROPs. The R2 of 0.91 indicates high prediction accuracy for the DKR model. The field test validates the feasibility of applying the DKR model to real-time ROP prediction in drilling ultra-deep wells.

Fig. 16 Results of real-time ROP prediction in HT-X well using DKR model

4 Conclusion

This paper proposes a new intelligent ROP prediction model for ultra-deep wells. The model uses mud logging and well logging parameters that are closely related to ROP as inputs and combines three techniques: data noise reduction using the SG filter, formation clustering using the K-means algorithm, and a ResNet-based neural network for prediction. The model is tested on field data from an ultra-deep reservoir in northwest China and is found to have high prediction accuracy, good robustness, and strong generalization ability. The average MAE, MAPE, and R2 are 0.55 m/h, 11.34%, and 0.92, respectively. Ablation experiments demonstrate the importance of each sub-process, while comparison with other mainstream machine learning models confirms the superiority of the proposed model. Notably, the proposed model with ResNet outperforms the conventional ANN model by overcoming the loss of accuracy that accompanies an increasing number of network layers. Furthermore, a field test validates the model’s feasibility for real-time ROP prediction in drilling ultra-deep wells (MAPE = 11.2%, R2 = 0.91). Overall, the proposed ROP prediction model based on formation clustering and ResNet shows great potential for further field application in ultra-deep wells and offers a competitive approach to ROP prediction.