1 Introduction

Tool condition is considered to be one of the most important factors in determining production quality, productivity and energy consumption [1]. As a widely used tool wear prediction technique, machine learning algorithms can establish a nonlinear mapping relationship between signal features and tool wear. Guan et al. [2] performed empirical mode decomposition of acoustic emission signals and extracted feature vectors composed of the autoregressive model coefficients of each modal function. Suo et al. [3] used the multi-resolution wavelet method to analyze milling force signals and fed the energy and covariance of each component into a BP neural network to achieve tool wear prediction.

Dang et al. [4] collected vibration signals during machining, automatically extracted features with a 1DCNN (one-dimensional convolutional neural network), and used an extreme learning machine to predict tool wear. Zhou et al. [5] used the Hilbert-Huang transform to extract tool wear features and predicted tool wear and remaining tool life with an LSTM (long short-term memory) network. Zhang et al. [6] applied the wavelet packet transform to milling force signals and used the energy in different frequency bands as feature vectors to predict the tool wear state with a sparse auto-encoding network model. Although these single-type deep learning models can effectively extract spatial features and achieve good prediction results, they cannot mine information from both the spatial and temporal dimensions.

Yan et al. [7] proposed a long short-term memory convolutional neural network (LSTM-CNN) model that uses LSTM and CNN to extract features of vibration and milling force signals from both the sequential and multi-dimensional aspects; establishing the mapping relationship between these features and tool wear improves the prediction accuracy. An et al. [8] proposed a hybrid CNN-SBULSTM model (CNN with stacked bi-directional and uni-directional LSTM), using the CNN to extract features from internal controller and external sensor signals and to perform feature dimensionality reduction. Li et al. [9] proposed a 1DCNN-LSTM hybrid model that makes full use of the learning ability of 1DCNN and the time series analysis ability of LSTM to fully mine tool-wear-related information in vibration and acoustic emission signals, achieving good tool wear recognition results.

In this paper, the process signals are collected with a multi-sensor fusion system, and time domain, frequency domain and time-frequency domain features are extracted from the original signals to reduce the influence of noise. The adaptive moment estimation (Adam) algorithm is used to optimize the 1DCNN-LSTM model, which mines tool-wear-related information from both the temporal and spatial dimensions of the tool wear feature data set to improve the accuracy of tool wear prediction.

2 Theoretical Basis of 1DCNN-LSTM Model

The 1DCNN-LSTM network model structure is shown in Fig. 1. In the prediction model, the useful information in the feature sequence of the input layer is fully mined by two convolutional layers and two pooling layers arranged alternately; the time series information in the output of pooling layer 2 is then extracted by the LSTM, and finally the tool wear prediction value is obtained from the fully connected layer.

Fig. 1 1DCNN-LSTM network structure (input → convolution 1 → pooling 1 → convolution 2 → pooling 2 → LSTM → fully connected → output)

In order to improve the prediction accuracy of the model and the convergence speed of the iterative solution, each feature column and the actual wear value in the optimal feature combination are normalized:

$$x^{\prime}=\frac{x-\min(x)}{\max\left(x\right)-\min(x)}$$
(1)
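
As an illustration, the min-max scaling of Eq. (1) can be applied column-wise; a minimal Python sketch (NumPy assumed, function name hypothetical):

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each column of x to [0, 1], per Eq. (1)."""
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
```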

The normalized features are input into the convolutional layer of the 1DCNN for deep feature extraction. The convolution process is shown in Fig. 2. Taking a one-dimensional cutting feature sequence of length N as an example, with a convolution kernel of size T × 1 and a step size of S, the resulting length G after convolution is calculated as:

$$G=\frac{N-T}{S}+1$$
(2)
Fig. 2 Convolution process
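
Equation (2) can be checked with a small helper (a hypothetical function, shown for illustration):

```python
def conv_output_length(n: int, t: int, s: int) -> int:
    """Eq. (2): output length of a valid 1D convolution, G = (N - T) / S + 1."""
    return (n - t) // s + 1

# A sequence of length 10 with a 3 x 1 kernel and step size 1 gives length 8
assert conv_output_length(10, 3, 1) == 8
```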

As the convolution kernel slides, it is convolved with the input data to obtain the feature results, and the calculation formula is shown below:

$${X}_{j}^{l}=f\left(\sum_{i=1}^{{M}^{l-1}}\left({X}_{i}^{l-1}*{K}_{ij}^{l}\right)+{b}_{j}^{l}\right),\quad j=1,2,\dots ,{M}^{l}$$
(3)

where \(X_{j}^{l}\) is the \(j\)th feature map of the \(l\)th layer, \(M^{l}\) is the number of feature maps output by the \(l\)th layer, \(X_{i}^{l - 1}\) is the \(i\)th input from the \((l-1)\)th layer, \(K_{ij}^{l}\) is the convolution kernel of the \(l\)th layer, \(b_{j}^{l}\) is the bias of the \(j\)th feature map of the \(l\)th layer, and \(f\left( \cdot \right)\) is the activation function.

To avoid overfitting caused by an overabundance of neurons, max pooling is added after convolution to retain important feature information and improve training efficiency. The specific calculation formula is as follows:

$$P(j)=\underset{t\in {K}_{j}}{\max}(q(t))$$
(4)

where \(P\left( j \right)\) is the jth feature value after pooling, \(K_{j}\) is the jth pooling domain, and \(q(t)\) is the element value of the convolutional feature in the jth pooling domain before pooling.
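
A direct NumPy rendering of Eq. (4), with the pooling-domain size and stride as hypothetical parameters:

```python
import numpy as np

def max_pool_1d(q: np.ndarray, pool: int = 2, stride: int = 2) -> np.ndarray:
    """Eq. (4): the maximum of the convolutional feature over each pooling domain K_j."""
    n_out = (len(q) - pool) // stride + 1
    return np.array([q[j * stride : j * stride + pool].max() for j in range(n_out)])
```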

The output of the 1DCNN is fed into the LSTM neural network to model and extract temporal information. The LSTM introduces a series of gate mechanisms into the RNN (recurrent neural network) to obtain long-term memory and alleviate the gradient vanishing and explosion problems; its cell structure is shown in Fig. 3.

Fig. 3 Structure of LSTM memory cell unit (cell states \(C_{t-1}\), \(C_{t}\) and outputs \(h_{t-1}\), \(h_{t}\))

The LSTM selectively processes information by combining current information with the historical cell state through input, forget and output gates. At each moment t, the input and forget gates combine the output value \(h_{t-1}\) of the previous moment with the input \(x_{t}\) of the current moment to obtain, after the activation functions, the input coefficient \(i_{t}\), the forget coefficient \(f_{t}\), and the candidate cell state \(\tilde{C}_{t}\). The cell state \(C_{t}\) at the current moment is obtained by combining the information retained from the previous cell state \(C_{t-1}\) after filtering by \(f_{t}\) with the information selected by \(i_{t}\) from the candidate cell state \(\tilde{C}_{t}\). After the cell state update, the output coefficient \(o_{t}\) is calculated from \(h_{t-1}\) and \(x_{t}\) through the activation function. The updated cell state \(C_{t}\) is passed through the activation function and multiplied by \(o_{t}\) to obtain the output of the current moment, \(h_{t}\). The formulae for each gate, the internal memory cell, the memory and the candidate state are shown below:

Input gate (threshold):

$${i}_{t}=\sigma ({w}_{i}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{i})$$
(5)

Forget gate (threshold):

$${f}_{t}=\sigma ({w}_{f}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{f})$$
(6)

Output gate (threshold):

$${o}_{t}=\sigma ({w}_{o}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{o})$$
(7)

Internal memory unit (long-term memory):

$${C}_{t}={f}_{t}*{C}_{t-1}+{i}_{t}*{\tilde{C }}_{t}$$
(8)

Predicted values (short-term memory):

$${h}_{t}={o}_{t}*\tanh({C}_{t})$$
(9)

Candidate state (new knowledge inducted):

$${\tilde{C}}_{t}=\tanh({w}_{c}\cdot \left[{h}_{t-1},{x}_{t}\right]+{b}_{c})$$
(10)

where \(w_{i}\), \(w_{f}\), \(w_{o}\) and \(w_{c}\) are the parameter matrices to be trained; \(b_{i}\), \(b_{f}\), \(b_{o}\) and \(b_{c}\) are the bias terms to be trained; σ is the sigmoid activation function with output interval [0, 1]; and tanh is the activation function with output interval [−1, 1].
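
A minimal NumPy sketch of one LSTM cell step following Eqs. (5)-(10); the weight shapes and names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w_i, w_f, w_o, w_c, b_i, b_f, b_o, b_c):
    """One cell update; each w_* has shape (hidden, hidden + input)."""
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    i_t = sigmoid(w_i @ z + b_i)           # Eq. (5): input gate
    f_t = sigmoid(w_f @ z + b_f)           # Eq. (6): forget gate
    o_t = sigmoid(w_o @ z + b_o)           # Eq. (7): output gate
    c_tilde = np.tanh(w_c @ z + b_c)       # Eq. (10): candidate state
    c_t = f_t * c_prev + i_t * c_tilde     # Eq. (8): cell state (long-term memory)
    h_t = o_t * np.tanh(c_t)               # Eq. (9): output (short-term memory)
    return h_t, c_t
```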

The output of the LSTM is used as the input of the fully connected layer, from which the predicted value is obtained, realizing the mapping from features to tool wear values. The calculation formula is as follows:

$${x}^{l}=\sigma \left({w}^{l}{x}^{l-1}+{b}^{l}\right)$$
(11)

where \(w^{l}\) is the weight of the \(l\)th layer and \(b^{l}\) is the bias of the \(l\)th layer.

Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) are used as the two evaluation indicators of model prediction accuracy, which are calculated as follows.

RMSE:

$${P}_{rmse}=\sqrt{\frac{1}{n}\sum_{k=1}^{n}{\left({y}_{k}^{pre}-{y}_{k}\right)}^{2}}$$
(12)

MAE:

$${P}_{mae}=\frac{1}{n}\sum_{k=1}^{n}\left|{y}_{k}^{pre}-{y}_{k}\right|$$
(13)

where \(y_{k}^{pre}\) is the predicted value of milling cutter wear, \(y_{k}\) is the true value of milling cutter wear, and \(n\) is the number of samples.
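
For reference, Eqs. (12) and (13) in Python (NumPy assumed):

```python
import numpy as np

def rmse(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """Eq. (12): root mean square error."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def mae(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """Eq. (13): mean absolute error."""
    return float(np.mean(np.abs(y_pred - y_true)))
```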

3 Milling Tool Wear Prediction Based on 1DCNN-LSTM

The workpiece is fixed on the dynamometer through a fixture, and the vibration sensor is installed on the side of the workpiece to facilitate accurate acquisition of vibration and force signals. Over the whole tool wear process there are 300 cuts, which yield 300 sets of milling force and vibration signal data. The time domain, frequency domain and other feature data extracted from the sensor signals during cutting are usually time series with obvious local spatial and temporal dependence characteristics [10]. CNN has strong data mining capability and can achieve good prediction results even with little preprocessing. However, it assumes that all inputs and outputs are independent, ignoring correlations between features at different moments, which degrades its performance on time series data. LSTM can effectively deal with the long-term dependence problem, which exactly makes up for this deficiency of CNN in processing time series data [11].

1DCNN is mainly used to process one-dimensional time series data, so it is chosen to process the input feature sequence in this paper. Based on the local feature extraction ability of 1DCNN and the time dependence modeling of LSTM, this paper designs a tool wear prediction model based on 1DCNN-LSTM. Taking the selected optimal feature combination sequence and the measured actual tool wear values as the model's training data, the nonlinear relationship between the features and tool wear is established by training, and tool wear prediction is finally realized.

The network has 8 layers in total, mainly comprising 2 convolutional layers, 2 pooling layers, 1 LSTM layer and 1 fully connected layer. The input layer is the optimal feature combination sequence \(A_{k}\) = {\(A_{k1}\), \(A_{k2}\), …, \(A_{kd}\)}, k = 1, 2, …, 300, where k denotes the cutting pass number and d is the number of feature dimensions in the optimal feature combination. So that no feature information is missed during convolution, the step size is set to 1. The size of the convolution kernel determines the weight distribution in the convolution process; a 3 × 1 kernel is chosen, and the number of filters is 16. The pooling domain size is the commonly used 2 × 1, with a step size of 2. The number of hidden neurons in the LSTM is the same as the dimension of the feature vectors input to the model.
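
A minimal Keras sketch of this architecture, assuming d = 12 feature dimensions (the activation functions and exact input shaping are assumptions, as the paper does not specify them):

```python
import tensorflow as tf

d = 12  # assumed dimensionality of the optimal feature combination

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(d, 1)),
    tf.keras.layers.Conv1D(16, 3, strides=1, activation="relu"),  # convolution 1
    tf.keras.layers.MaxPooling1D(pool_size=2, strides=2),         # pooling 1
    tf.keras.layers.Conv1D(16, 3, strides=1, activation="relu"),  # convolution 2
    tf.keras.layers.MaxPooling1D(pool_size=2, strides=2),         # pooling 2
    tf.keras.layers.LSTM(d),       # hidden size equals the input feature dimension
    tf.keras.layers.Dense(1),      # fully connected output: predicted wear value
])
```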

The entire model is trained with the categorical cross-entropy loss, calculated as follows:

$$Loss(\theta )=-\sum_{i=1}^{n}{y}_{i}\log {y}_{i}^{pre}$$
(14)

where \(y_{i}^{pre}\) is the predicted value of milling cutter wear, \(y_{i}\) is the true value of milling cutter wear, \(\theta\) is the network parameter, and \(n\) is the number of samples.

The algorithm minimizes along the direction in which Loss(θ) decreases fastest, achieving a fit between the predicted and true values after several iterations. Since the traditional gradient descent algorithm is prone to falling into local optima and updates slowly when solving the model, the Adam algorithm is used to optimize the model.

Adam uses the first-order and second-order moment estimates of the gradient to perform adaptive learning rate calculation and parameter updates, respectively, ensuring that the learning rate of each parameter is dynamically adjusted within a determined range and that the parameter changes are relatively stable. The model calculates the error of each network layer by back propagation to accurately find the optimal solution for each layer's parameters θ. The step size ε is set to 0.001; the moment estimation exponential decay rates ρ1 and ρ2 are set to their usual values of 0.9 and 0.999, respectively; and the constant δ = \(10^{-8}\) is introduced for numerical stability, with the error limit e = \(10^{-8}\). The computational flow of the Adam algorithm is shown in Table 1.

Table 1 Flowchart of Adam's algorithm
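
As a sketch of the flow summarized in Table 1, one Adam parameter update following the standard algorithm (the table itself is not reproduced here):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update: moment estimates, bias correction, parameter step."""
    m = rho1 * m + (1 - rho1) * grad        # first-order moment estimate
    v = rho2 * v + (1 - rho2) * grad ** 2   # second-order moment estimate
    m_hat = m / (1 - rho1 ** t)             # bias correction (t is the step count, t >= 1)
    v_hat = v / (1 - rho2 ** t)
    theta = theta - eps * m_hat / (np.sqrt(v_hat) + delta)
    return theta, m, v
```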

4 Analysis of Different Tool Wear Models

In order to verify and compare the prediction effect of the 1DCNN-LSTM tool wear prediction model proposed in this paper, two other methods are used for comparison:

(a) an LSTM model with the optimal feature combination sequence as input, denoted as LSTM;

(b) a 1DCNN-LSTM model with the three-way force and vibration raw signals as input (raw-data 1DCNN-LSTM, denoted as RD-1DCNN-LSTM).

4.1 Comparison of Model Stability

The LSTM and 1DCNN-LSTM models were established in Python 3.8.2 with the TensorFlow 2.0.1 framework, and the same training parameters were set for all three models: the optimizer was Adam, the base learning rate was 0.001, the batch size was 30, and the number of iterations was 1000. The models were fully trained with the feature data, and the stability of the training process was then verified with the feature data sets of Experiments 1 and 2; the variation of the loss function values of the three models RD-1DCNN-LSTM, LSTM, and 1DCNN-LSTM is shown in Figs. 4, 5 and 6.
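
With these settings, the training call might look as follows; `model` refers to the sketch in Sect. 3, and `X_train`, `y_train`, `X_test`, `y_test` are hypothetical array names:

```python
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="mse",  # assumption: a regression loss stands in for the paper's stated loss
    metrics=[tf.keras.metrics.RootMeanSquaredError(),
             tf.keras.metrics.MeanAbsoluteError()],
)
history = model.fit(X_train, y_train, batch_size=30, epochs=1000,
                    validation_data=(X_test, y_test))
```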

Fig. 4 Variation of loss function values (loss vs. epoch, training and testing sets) for the RD-1DCNN-LSTM model: a ap = 0.1 mm, b ap = 0.2 mm

Fig. 5 Variation of loss function values (loss vs. epoch, training and testing sets) for the LSTM model: a ap = 0.1 mm, b ap = 0.2 mm

Fig. 6 Variation of loss function values (loss vs. epoch, training and testing sets) for the 1DCNN-LSTM model: a ap = 0.1 mm, b ap = 0.2 mm

As can be seen from the three figures, all models reach a small loss function value within 1000 iteration cycles. The loss on the training set differs very little from the loss on the test set, both floating around 0.58; the training process of the models is therefore normal and the training parameters are reasonable. Among them, the RD-1DCNN-LSTM model becomes stable after about 600 iteration cycles and the LSTM model after about 300, whereas the model proposed in this paper reaches a stable state after only 20 iterations, indicating that the proposed model is easy to train and efficient. The loss function trends in the two experiments are similar, indicating the universality of the model.

4.2 Comparison of Model Prediction Effects

The tool wear prediction effects of the three models, RD-1DCNN-LSTM, LSTM and 1DCNN-LSTM, are shown in Figs. 7, 8 and 9.

Fig. 7 Prediction effect of the RD-1DCNN-LSTM model (tool wear vs. cutting number: true value, predicted value, error): a ap = 0.1 mm, b ap = 0.2 mm

Fig. 8 Prediction effect of the LSTM model (tool wear vs. cutting number: true value, predicted value, error): a ap = 0.1 mm, b ap = 0.2 mm

Fig. 9 Prediction effect of the 1DCNN-LSTM model (tool wear vs. cutting number: true value, predicted value, error): a ap = 0.1 mm, b ap = 0.2 mm

From Figs. 7, 8 and 9, it can be seen that the RD-1DCNN-LSTM model has the worst prediction effect among the three models. This is because interference from various factors, such as the environment, leaves a large amount of invalid redundant information in the raw data set, whereas the multi-domain features contain fewer interference factors. The prediction effect of the 1DCNN-LSTM model is better than that of the single LSTM model, with smaller error values, indicating that the proposed model has stronger learning ability. To further demonstrate the effectiveness of the proposed method, the evaluation criteria MAE and RMSE and the time cost of model operation are calculated, as shown in Table 2.

Table 2 RMSE and MAE of different models on the data set

As shown in Table 2, compared with the LSTM and RD-1DCNN-LSTM models, the MAE of the 1DCNN-LSTM model is reduced by 7.0 and 11.4, a decrease of 52.9% and 65.0% (mean of Experiments 1 and 2); the RMSE is reduced by 10.1 and 15.3, a decrease of 54.9% and 65.3% (mean of Experiments 1 and 2); and the runtime is reduced by 2839.4 s and 1565.9 s, a decrease of 75.4% and 62.8% (mean of Experiments 1 and 2), indicating that the proposed model can predict tool wear more effectively. The validity and feasibility of the model are thus demonstrated in terms of both time cost and the evaluation criteria.