1 Introduction

Nowadays, manufacturing industries are undergoing tremendous evolution and ongoing technological growth, with the primary goal of ensuring the reliability of manufacturing systems. In this regard, monitoring tool conditions has become an important task in the field of metal cutting. Although cutting tools account for less than 4% of the total machining cost [1], their failure can account for 10 to 40% of machine breakdowns [2], which increases the overall cost of downtime. Under high temperatures and stress, cutting tools are subjected to severe frictional processes with the workpiece. This leads to a critical condition called tool wear, which gradually degrades the surface quality of the machined piece until either the milling machine or the workpiece is damaged. An effective prognostic and health management (PHM) system for tool health monitoring is therefore essential to enable informed actions aimed at enhancing cutting tool utilization and preventing catastrophic situations resulting from tool failure. Elattar et al. [3] classified prognostic methods into three main categories: physics-based, statistical reliability-based, and data-driven techniques. Physical model-based methods rely on modeling the system dynamics in a way that reflects its behavior and incorporates the degradation process. However, even though these methods are the best known in terms of precision and accuracy, the complexity of real-world industrial systems makes the conversion of the physical behavior into an analytical representation highly challenging [4]. Statistical reliability-based approaches, as the name implies, rely on statistical models, such as Weibull and Poisson distributions. They are solely based on historical life-cycle data, which includes the failure rate of a list of components belonging to the same category and specifications as the monitored part.
Even though these techniques are simple to implement, they have a major restriction in that they require a large amount of historical data [5], making them inaccurate for newly manufactured components. Data-driven methods are the most widely used and developed prognostic category, owing to their ease of deployment, which does not require an analytical representation of the system’s behavior. The data-driven PHM prognostic process is realized in three major steps [4]:

  • Data acquisition

  • Data processing

  • Model learning and RUL prediction

In the data acquisition step, the acquisition system records the numerical data provided by sensors or transducers. Different types of sensors can be used depending on the critical component to be monitored, where, in the milling process, sensor selection is driven by the condition that can be related to tool wear, such as vibrations, cutting forces, acoustic emissions, and power consumption of the spindle [6]. The obtained data is processed, in the next step, to extract critical features with which the degradation level of the cutting tools is estimated and the wear state is evaluated.

Undoubtedly, this is the most challenging step regardless of the type of monitoring sensor. For instance, the vibration and acoustic emission data may contain structural and bearing vibration information. As a result, the data becomes complex, non-linear, and usually overwhelmed by noise [5].

It is reported in related works [7, 8] that features can be extracted from time-, frequency-, and time-frequency-domain analysis. Features derived independently from the time and/or frequency domains capture specific signal content while potentially losing information about the status of cutting tool degradation, due to the non-stationarity of real signals and the sensitivity to the experimental conditions affecting milling cutters [9].

Time-frequency analysis is becoming a research hotspot since its techniques, such as the Hilbert-Huang transform (HHT), wavelet transform (WT), wavelet packet transform (WPT), and empirical wavelet transform (EWT), can extract the dynamic health condition of milling cutters [10]. Numerous works have been carried out for this purpose. Benkedjouh et al. [9] combined the continuous wavelet transform (CWT) with blind source separation (BSS) to estimate the remaining useful life (RUL) of cutting tools. The CWT was used to decompose the signals into coefficients, from which a certain scale of wavelet coefficients was selected for BSS. The energy of the separated signals was then computed to produce the health indicator. Liao et al. [11] worked on monitoring tool wear conditions, proposing a hybrid hidden Markov model (HHMM) based on the CWT. First, an internal dynamic hidden Markov model (HMM) is created to capture the WT’s dynamic dependency at various frequencies with instantaneous time resolution. Then, an external HMM for continuous monitoring is built, which aggregates the WT dependencies and portrays the dynamic worsening of the tool wear state in the long term. Shen et al. [12] implemented a strategy for spindle-power-signal-based online tool condition monitoring, in which a system dedicated to signal acquisition is constructed, and then the Hilbert-Huang transform (HHT) is applied to extract suitable features that indicate the tool condition. Segregeto et al. [13] employed the WPT and machine learning paradigms to estimate tool wear in the turning of Inconel 718. The WPT was applied to vibration, cutting force, and acoustic emission signals to extract appropriate features. The correlation coefficient method was then used to select the most relevant features, and finally, a machine learning model based on an artificial neural network was employed to estimate the flank wear.

Despite the high signal-decomposition capabilities of the aforementioned methods, each has its limitations. The HHT lacks a mathematical basis. The WPT and CWT restrict the extraction of HIs to a set of frequency bands constructed from a prescribed subdivision scheme of the Fourier axis. By contrast, the EWT provides a self-adaptive decomposition of this axis with respect to the frequency content of the analyzed signals. However, the contact of the cutting tool with the workpiece will certainly excite the machine structure, creating structural resonant frequencies that add to those of the electric motor components. These frequencies are known to have high energy. In this context, as the EWT first detects the local maxima in the spectrum to construct filters, the aforementioned frequencies and the noise will greatly influence the results. Thus, isolating a frequency band in which the cutting tool wear manifests itself is tedious.

With the aim of developing a richer filter bank, a new segmentation model of the Fourier axis is proposed in this paper, denoted as the empirical wavelet packet decomposition (EWPD), to obtain comprehensive frequency band information, overcoming the traditional segmentation structure issues [14, 15]. Besides, a new health indicator is constructed to represent the degradation state of a collection of CNC milling cutters that rely on a clever selection of the time-domain features from the proposed filter bank. Ultimately, RUL estimation is performed based on the constructed HIs.

For prognostics and model learning, extensive surveys have been devoted to developing an accurate predictive model capable of indicating the RUL of the monitored component. Prognostics uses a variety of machine learning (ML) algorithms, namely artificial neural networks (ANN), support vector machines (SVM), support vector regression (SVR), fuzzy logic (FL), and deep learning (DL) techniques [4, 5, 16]. Due to their outstanding performance, DL methods, including the gated recurrent unit (GRU), convolutional neural network (CNN), and long short-term memory (LSTM) network, have lately become the most advanced techniques. Zhao et al. [16] reviewed recent research on machine health monitoring using deep learning models, employing a variety of ML and DL techniques for monitoring cutting tool wear conditions. They acknowledged that, even though DL approaches can automatically learn from raw data, the lack of data samples, the complexity of the operational environment, and the presence of noise in the data necessitate feature extraction prior to a deep learning model. Aghazadeh et al. [17] introduced a methodology for tool condition monitoring that uses the WPT for the processing stage. Spectral subtraction is then carried out to remove the noise, after which several health indicators are generated. Finally, a CNN is used to predict the tool wear state. Zhou et al.’s [18] research focused on predicting a tool’s remaining useful life under variant working conditions. First, relevant wear characteristics were extracted from the data using the HHT. Next, the working conditions and the features were combined to create an input matrix, which captured the spatiotemporal relationship under various working conditions. Finally, the input matrix was loaded into an LSTM predictive model to predict the tool’s remaining useful life. Wu et al.
[19] introduced a tool wear prediction model that employs a two-step approach: singular value decomposition (SVD) for feature extraction and bidirectional long short-term memory (Bi-LSTM) for time series analysis. SVD was used initially to reduce the noise in the raw signal, thereby shortening the input signal length for the Bi-LSTM network and simplifying its complexity. The processed data was then fed into the Bi-LSTM network to capture time series variations across current and previous sampling periods in order to predict tool wear. Shah et al. [20] worked on tool wear prediction in face milling using a singular generative adversarial network (SinGAN) and an LSTM network. Firstly, the Morlets wavelets were used to construct the scalogram from the acoustic emissions and vibration signals. Next, the appropriate wavelet functions were defined using the relative wavelet energy (RWE). To extract the feature vector, SinGAN was employed to generate more scalograms, and then image quality parameters were extracted to construct the vector. Finally, the extracted feature vector was used to train the LSTM model to predict tool wear. Different LSTM models were used for the sake of comparison: vanilla, stacked, and Bi-LSTM.

Long short-term memory (LSTM) networks have gained popularity as a powerful deep learning architecture for handling sequential data in several industrial engineering applications [21,22,23]. This algorithm is adopted in this work as a prediction model to estimate the RUL of each cutter, where the extracted HIs are provided to the LSTM network as inputs. Experimental results show unequivocally the efficacy of the proposed approach.

The remainder of this paper is organized as follows: Sect. 2 provides an overview of the EWT theory along with a description of the proposed EWPD, Sect. 3 summarizes the long short-term memory network principle, Sect. 4 introduces the 2010 PHM dataset, Sect. 5 describes the proposed procedure and discusses the acquired results, and finally, Sect. 6 brings this work to a close.

2 Empirical wavelet packet decomposition (EWPD)

This section presents the developed EWPD method. Here, the raw signal spectrum is segmented based on a new segmentation model to acquire a series of demodulation frequency bands using the theory of EWT.

Fig. 1
figure 1

Empirical wavelet filter bank [25]

2.1 Empirical wavelet transform

In 2013, a fully adaptive signal-analysis approach called the empirical wavelet transform (EWT) was introduced by Jerome Gilles [24] in the context of non-stationary signal decomposition. The main objective of this method is to decompose a signal into several modes to extract the most useful information that the signal contains. These modes represent the amplitude modulation-frequency modulation (AM-FM) components traditionally obtained by filtering the signal with an adaptive wavelet filter bank. The filter bank is built by dividing the signal’s Fourier spectrum \( \left[ 0,\,\,\pi \right] \) into N contiguous segments, identifying the local maxima and then placing support boundaries \(\omega _{n}\) at the intermediate points between subsequent maxima. An example of the filters is shown in Fig. 1 [25].

Each segment is denoted \( {{\Lambda }_{n}}=\left[ {{\omega }_{n-1}},\,\,{{\omega }_{n}} \right] \), so that \(\bigcup \nolimits _{n=1}^{N}{{\Lambda }_{n}}=\left[ 0,\,\,\pi \right] \). A transition phase \( T_{n} \) with a width of \( 2\tau _{n} \) is centered around each \( \omega _{n} \) in the Fourier axis (Fig. 1, blue dotted areas).
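The boundary-placement rule just described (local-maxima detection, then midpoints between consecutive maxima) can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function name `detect_boundaries`, the choice of keeping the N largest maxima, and the toy two-tone signal are our assumptions.

```python
import numpy as np

def detect_boundaries(signal, n_segments):
    """Sketch of EWT boundary detection: locate the n_segments largest
    local maxima in the magnitude spectrum, then place each boundary at
    the midpoint between consecutive maxima (normalized to [0, pi])."""
    spectrum = np.abs(np.fft.rfft(signal))
    # indices of interior local maxima
    peaks = [i for i in range(1, len(spectrum) - 1)
             if spectrum[i] > spectrum[i - 1] and spectrum[i] > spectrum[i + 1]]
    # keep the n_segments largest maxima, then restore frequency order
    peaks = sorted(sorted(peaks, key=lambda i: spectrum[i])[-n_segments:])
    # boundaries at midpoints between consecutive retained maxima
    mids = [(a + b) / 2 for a, b in zip(peaks[:-1], peaks[1:])]
    # map FFT bin indices to the normalized Fourier axis [0, pi]
    return [np.pi * m / (len(spectrum) - 1) for m in mids]

# toy two-tone signal: the single boundary should fall between the tones
fs = 1000
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 200 * t)
bounds = detect_boundaries(x, n_segments=2)
```

For this two-tone input, the detected boundary lies between the normalized frequencies of the 50-Hz and 200-Hz components.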

The empirical wavelet scaling function and the empirical wavelets are defined as low-pass and band-pass filters by Eqs. 1 and 2 [26], respectively.

$$\begin{aligned} {{\hat{\phi }}_{n}}\left( \omega \right) = \begin{cases} 1 &{} \text {if } \left| \omega \right| \le {{\omega }_{n}}-{{\tau }_{n}} \\ \cos \left[ \frac{\pi }{2}\beta \left( \frac{1}{2{{\tau }_{n}}}\left( \omega -{{\omega }_{n}}+{{\tau }_{n}} \right) \right) \right] &{} \text {if } {{\omega }_{n}}-{{\tau }_{n}}\le \left| \omega \right| \le {{\omega }_{n}}+{{\tau }_{n}} \\ 0 &{} \text {otherwise} \end{cases} \end{aligned}$$
(1)
$$\begin{aligned} {{\hat{\psi }}_{n}}\left( \omega \right) = \begin{cases} 1 &{} \text {if } {{\omega }_{n}}+{{\tau }_{n}}\le \left| \omega \right| \le {{\omega }_{n+1}}-{{\tau }_{n+1}} \\ \cos \left[ \frac{\pi }{2}\beta \left( \frac{1}{2{{\tau }_{n+1}}}\left( \omega -{{\omega }_{n+1}}+{{\tau }_{n+1}} \right) \right) \right] &{} \text {if } {{\omega }_{n+1}}-{{\tau }_{n+1}}\le \left| \omega \right| \le {{\omega }_{n+1}}+{{\tau }_{n+1}} \\ \sin \left[ \frac{\pi }{2}\beta \left( \frac{1}{2{{\tau }_{n}}}\left( \omega -{{\omega }_{n}}+{{\tau }_{n}} \right) \right) \right] &{} \text {if } {{\omega }_{n}}-{{\tau }_{n}}\le \left| \omega \right| \le {{\omega }_{n}}+{{\tau }_{n}} \\ 0 &{} \text {otherwise} \end{cases} \end{aligned}$$
(2)

To ensure that the aforementioned functions form a tight frame, the right choice of \( \tau _{n} \) is important. The simplest way to choose this parameter is to make it proportional to \( \omega _{n}: \tau _{n}= \gamma \omega _{n} \) where \( 0<\gamma <1 \).

\( \beta (x) \) is an arbitrary function defined as follows [26]:

$$\begin{aligned} \beta \left( x \right) = \begin{cases} 0 &{} \text {if } x\le 0 \\ 1 &{} \text {if } x\ge 1 \end{cases} \qquad \text {with } \beta \left( x \right) +\beta \left( 1-x \right) =1,\ \forall x\in \left[ 0,1 \right] \end{aligned}$$
(3)

The most commonly used function that satisfies these properties is given by the following:

$$\begin{aligned} \beta \left( x \right) ={{x}^{4}}\left( 35-84x+70{{x}^{2}}-20{{x}^{3}} \right) \end{aligned}$$
(4)
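As a quick sanity check, Eq. 4 can be verified against the boundary values and the symmetry property required by Eq. 3 (the function name `beta` and the clamping are ours):

```python
import numpy as np

def beta(x):
    """Transition polynomial of Eq. 4, clamped so that the boundary
    conditions of Eq. 3 hold (beta = 0 for x <= 0, beta = 1 for x >= 1)."""
    x = np.clip(x, 0.0, 1.0)
    return x**4 * (35 - 84 * x + 70 * x**2 - 20 * x**3)

# symmetry property of Eq. 3: beta(x) + beta(1 - x) = 1 on [0, 1]
xs = np.linspace(0, 1, 101)
assert np.allclose(beta(xs) + beta(1 - xs), 1.0)
```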

Equations 5 and 6 define the approximation and detail coefficients that are obtained from the inner product of the processed signal with the scaling and empirical wavelet functions.

$$\begin{aligned} \mathcal {W}_{f}^{\varepsilon }\left( 0,t \right) =\langle f,{{\phi }_{1}}\rangle =\int f\left( \tau \right) \overline{{{\phi }_{1}}\left( \tau -t \right) }\,d\tau \end{aligned}$$
(5)
$$\begin{aligned} \mathcal {W}_{f}^{\varepsilon }\left( n,t \right) =\langle f,{{\psi }_{n}}\rangle =\int f\left( \tau \right) \overline{{{\psi }_{n}}\left( \tau -t \right) }\,d\tau ={{\left( \hat{f}\left( \omega \right) \overline{{{{\hat{\psi }}}_{n}}\left( \omega \right) } \right) }^{\vee }} \end{aligned}$$
(6)

By inverting the EWT, the signal is reconstructed as follows:

$$\begin{aligned} {{f}}\left( t \right) =\mathcal {W}_{f}^{\varepsilon }\left( 0,t \right) *{{\phi }_{1}}\left( t \right) +\sum _{n=1}^{N} \mathcal {W}_{f}^{\,\varepsilon }\left( n,t\right) *{{\psi }_{n}}\left( t \right) \end{aligned}$$
(7)

Equation 7 shows that the input signal is decomposed into empirical modes, \( f_{k} \), which are given by the following:

$$\begin{aligned} {{f}_{0}}\left( t \right) =\mathcal {W}_{f}^{\varepsilon }\left( 0,t \right) *{{\phi }_{1}}\left( t \right) \end{aligned}$$
(8)
$$\begin{aligned} {{f}_{k}}\left( t \right) =\mathcal {W}_{f}^{\,\varepsilon }\left( k,t \right) *{{\psi }_{k}}\left( t \right) \end{aligned}$$
(9)
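The filtering and reconstruction of Eqs. 5-9 can be sketched in the frequency domain. As a simplification, ideal (rectangular) band-pass filters stand in for the Meyer-type filters of Eqs. 1 and 2 (i.e., the limit \( \gamma \rightarrow 0 \)); the function `ewt_modes` and the two-tone test signal are our assumptions, not the authors' implementation.

```python
import numpy as np

def ewt_modes(signal, boundaries):
    """Sketch of Eqs. 5-9 with ideal (rectangular) band-pass filters
    replacing the Meyer-type filters of Eqs. 1-2. `boundaries` are
    normalized frequencies in [0, pi]."""
    spec = np.fft.rfft(signal)
    bins = np.arange(len(spec)) * np.pi / (len(spec) - 1)
    edges = [0.0] + list(boundaries) + [np.pi + 1e-9]
    modes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = np.where((bins >= lo) & (bins < hi), spec, 0)
        modes.append(np.fft.irfft(band, n=len(signal)))
    return modes

fs = 1000
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 200 * t)
modes = ewt_modes(x, boundaries=[2 * np.pi * 120 / fs])  # split at 120 Hz
# Eq. 7: the empirical modes sum back to the input signal
assert np.allclose(sum(modes), x)
```

Because the bands partition the Fourier axis, summing the modes reproduces the input exactly, mirroring Eq. 7.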
Fig. 2
figure 2

Binary tree structure (a), 1/3 binary tree structure (b), and EWPD structure (c)

Fig. 3
figure 3

EWPD segmentation for level 1 and level 2

2.2 Empirical wavelet packet decomposition

Conventional segmentation models like the binary tree [14] and the 1/3-binary tree proposed by Antoni [15], shown in Fig. 2a and b respectively, have insufficient segmentation accuracy. Many frequency bands, particularly those around the red lines in these figures, cannot be explored due to subdivision constraints. Consequently, the signal characteristics are not fully considered. To overcome this shortcoming, we propose the EWPD, which provides a new segmentation model of the Fourier axis, represented in Fig. 2c.

The EWPD can be explained through the following steps:

  1.

    Introduce the two parameters, the scale h, and the initial boundary values IBV. \(h \in \mathbb {N}^+\) controls the number of wavelet filters constructed by Eqs. 1 and 2 using the IBV, given in Eq. 10, to fix the limit of each segment \(\omega _n\).

    $$\begin{aligned} {IBV}_{h}^{j}= \left( ~\frac{j }{h+1}\right) \pi \left( \frac{{L}_{s}}{2{\pi }}\right) \end{aligned}$$
    (10)

    where j is the boundary index, given as \(j = [1,2,3,...,h]\), and \( L_{s} \) is the signal length. It is worth noting that for a given scale value, say \(h = 10\), a set of IBVs is calculated for every \({j} \in (1,2,.., 10)\).

  2.

    Adapt the initial set of boundaries to the analyzed signal by computing some neighborhood, as \([{IBV}_{h}^{j}-\epsilon , {IBV}_{h}^{j}+\epsilon ] \). The objective behind this action is to preserve the information contained in the signal, i.e., by avoiding amplitude attenuation of harmonics in cases where the computed IBV matches with their frequencies. The neighborhood is chosen to be 20 times the frequency step, defined as \(20 * \frac{L_{s}}{F_s}\), where \(F_s\) is the sampling frequency.

  3.

    Detect the global minima in each neighborhood as \(min(FFT(f({IBV}_{h}^{j}-\epsilon , {IBV}_{h}^{j}+\epsilon )))\), where \({\ f}\) is the processed signal. Then, redefine the boundary \(\omega _n\) at the corresponding frequency. The segmentation prescribed in the above steps is illustrated in Fig. 3.

  4.

    Build the wavelet filter bank based on the new Fourier segments. This operation is capable of extracting a predefined number of modes \({f}_{h}^{k}\left( t \right) \) from the signal \({f}\left( t \right) \) in each level, and thus,

    $$\begin{aligned} {f}\left( t \right) = \sum _{k=1}^{{N}_{m}} {f}_{h}^{k}\left( t \right) \end{aligned}$$
    (11)

    k represents the mode reference, and \({N}_{m}\) is the total number of modes in each level, where \({N}_{m}=h+1\).
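Steps 1-3 above can be sketched as follows. The helper name `ewpd_boundaries`, the rounding of Eq. 10 to integer FFT bins, and the neighborhood fixed at 20 spectral bins are our assumptions for illustration.

```python
import numpy as np

def ewpd_boundaries(signal, h, eps_bins=20):
    """Sketch of EWPD steps 1-3: uniform initial boundary values (IBV,
    Eq. 10), then each boundary snaps to the spectral minimum within a
    +/- eps_bins neighborhood, so that harmonics falling on an IBV are
    not attenuated. Returns h boundaries, i.e., h + 1 frequency bands."""
    Ls = len(signal)
    spectrum = np.abs(np.fft.rfft(signal))
    boundaries = []
    for j in range(1, h + 1):
        ibv = int(round(j * Ls / (2 * (h + 1))))  # Eq. 10, in FFT bins
        lo = max(ibv - eps_bins, 1)
        hi = min(ibv + eps_bins, len(spectrum) - 1)
        # step 3: move the boundary to the local spectral minimum
        boundaries.append(lo + int(np.argmin(spectrum[lo:hi + 1])))
    return boundaries

rng = np.random.default_rng(0)
x = rng.standard_normal(2048)
b = ewpd_boundaries(x, h=3)   # h + 1 = 4 frequency bands at this level
```

For \(L_s = 2048\) and \(h = 3\), the initial boundaries sit at bins 256, 512, and 768, and each final boundary stays within the 20-bin neighborhood around its IBV.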

Fig. 4
figure 4

Schematic of the LSTM network [28]

3 Long short-term memory network

Researchers have recently been exploring deep learning (DL) techniques, owing to their superior performance and efficiency in a variety of fields. The recurrent neural network (RNN) is a deep learning technique specifically designed for addressing sequential problems, where the primary objective is to memorize long-term dependencies. However, over long time series, RNNs fail to achieve this goal due to the vanishing or exploding gradient problem. The RNN-LSTM network [27] was developed to overcome this problem. Unlike classic RNNs, the LSTM unit is fitted with a cell and a gating system, as illustrated in Fig. 4 [28], which control the information flow (cell state, output), enabling the network to learn from large sequences of data. The gating unit has three gates: the forget gate, the input gate, and the output gate.

Based on the input \( x_{t} \) and the previous cell’s hidden state \( h_{t-1} \), the forget gate refreshes the memory cell by altering the weight of the self-loop cell state in the following way:

$$\begin{aligned} {{f}_{t}}={\sigma } (\mathcal {W}_{f}{{x}_{t}}+ \mathcal {U}_{f}{{h}_{t-1}}+{{b}_{f}}) \end{aligned}$$
(12)

As depicted in the equation below, the input gate is in charge of commanding the information intended for the memory cell:

$$\begin{aligned} {{i}_{t}}={\sigma } (\mathcal {W}_{i}{{x}_{t}}+ \mathcal {U}_{i}{{h}_{t-1}}+{{b}_{i}}) \end{aligned}$$
(13)

The output gate determines the weight of the LSTM cell output:

$$\begin{aligned} {{o}_{t}}={\sigma } (\mathcal {W}_{o}{{x}_{t}}+ \mathcal {U}_{o}{{h}_{t-1}}+{{b}_{o}}) \end{aligned}$$
(14)

The input activation vector updates the weight of the memory cell as follows:

$$\begin{aligned} {{\tilde{c}}_{t}}={tanh} (\mathcal {W}_{c}{{x}_{t}}+ \mathcal {U}_{c}{{h}_{t-1}}+{{b}_{c}}) \end{aligned}$$
(15)

Finally, the cell state of the LSTM unit is updated as established in Eq. 16, then the unit output is given by Eq. 17:

$$\begin{aligned} {{c}_{t}}= {{f}_{t}}{{c}_{t-1}}+ {{i}_{t}}{{\tilde{c}}_{t}} \end{aligned}$$
(16)
$$\begin{aligned} {{h}_{t}}= {{o}_{t}}{{tanh}}({{c}_{t}}) \end{aligned}$$
(17)

where \(\mathcal {W}\) and \(\mathcal {U}\) are the input and recurrent weight matrices for the gating unit and the cell, respectively. b denotes the bias vectors of each gate and of the cell. \({f}_{t}\), \({i}_{t}\), and \({o}_{t}\) represent the forget, input, and output gate activation vectors. \({c}_{t}\), \({\tilde{c}}_{t}\), and \({h}_{t}\) indicate the internal cell state, which is the long-term state [29], the cell update, and the current output, which is the short-term state [29], respectively. \(\sigma \) and tanh are the sigmoid and hyperbolic tangent activation functions.

The LSTM network’s previously described gating architecture provided it with strong learning potential for sequential problems, making the model adaptable for processing, forecasting, predicting, and classifying time series data.
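A minimal NumPy sketch of the cell update in Eqs. 12-17 follows; the toy dimensions (three input features, four hidden units) and the random weights are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM cell update implementing Eqs. 12-17. W, U, and b are
    dicts holding the input weights, recurrent weights, and biases for
    the forget (f), input (i), output (o) gates and the cell update (c)."""
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])        # Eq. 12
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])        # Eq. 13
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])        # Eq. 14
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # Eq. 15
    c = f * c_prev + i * c_tilde                                # Eq. 16
    h = o * np.tanh(c)                                          # Eq. 17
    return h, c

# toy dimensions: 3 input features (e.g., three HIs), 4 hidden units
rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
W = {k: rng.standard_normal((n_hid, n_in)) * 0.1 for k in 'fioc'}
U = {k: rng.standard_normal((n_hid, n_hid)) * 0.1 for k in 'fioc'}
b = {k: np.zeros(n_hid) for k in 'fioc'}
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.standard_normal((5, n_in)):   # unroll over 5 time steps
    h, c = lstm_step(x_t, h, c, W, U, b)
```

Note that the output \(h_t\) is bounded by the product of the sigmoid gate and the tanh of the cell state, which is what keeps gradients tractable over long sequences.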

Fig. 5
figure 5

Dataset collection process for the 2010 PHM data challenge [30]

Table 1 Operating conditions

4 Experimental setup

This section includes a presentation of the dataset used in this study as well as the experimental configuration for the proposed method.

4.1 Dataset details

The high-speed CNC milling machine platform used to evaluate the effectiveness of the proposed method is illustrated in Fig. 5 [30]. The data is available in the “prognostic data challenge 2010” database [31]. In this experiment, a three-flute, 3-mm ball-nose tungsten carbide end mill was tested on a stainless steel workpiece (HRC52) [32]. Table 1 lists the cutting parameters.

The database is constructed using three sensor types. A Kistler quartz three-axis dynamometer was mounted on the machining table to measure the forces induced by the cutting tool in the X, Y, and Z directions. Three Kistler piezo accelerometers were installed on the workpiece to capture the three-axis vibrations generated by the milling process. Lastly, a Kistler acoustic emission (AE) sensor was also fixed to the workpiece to capture the high-frequency stress wave, which represents the surface movement of the workpiece during the machining process [32]. In addition, the actual tool flank wear was measured offline using a microscope after each cut cycle. This information serves as a benchmark to evaluate the predicted RUL. For more details, please refer to [31].

4.2 Experimental configuration

Six identical cutters, whose characteristics are mentioned previously, subjected to invariant milling conditions, were employed in the experiment. The collected data set is divided into six records (denoted C1 to C6), each of which contains 315 files indicating the cut cycles for each cutting tool. Each of these data acquisition files is in (.csv) format and contains seven channels. Each channel represents a time series related to a single sensor acquisition of over 200,000 acquisition points (with a different length for each cut). Table 2 defines the channels.

As regards the real cutting tool wear measured by the microscope, it is available on the experiment platform [31] for only three of the six cutters used throughout the milling process, namely C1, C4, and C6. Consequently, only their corresponding recordings were considered for analysis. The wear value is given for each flute after finishing each surface. As the quality of the machined surface is important, the maximum wear across the flutes is taken as the wear limit that can be safely reached by any flute. Figure 6 displays the wear after each cut of C1’s flutes, with the maximum wear highlighted.

Table 2 Organization of the data acquisition file
Fig. 6
figure 6

Wear across cutter 1’s flutes (a). Maximum wear across cutter 1’s flutes (b)

5 Methodology, results, and discussion

This section outlines the strategy employed for determining the cutting tool’s degradation behavior and remaining useful life. It also discusses the obtained results.

Figure 7 illustrates the flowchart of the proposed method, serving as a visual guide for discussing the subsequent steps: feature extraction, feature evaluation, selection, and RUL estimation.

Fig. 7
figure 7

Flowchart of the proposed method

5.1 Feature extraction

Our method’s primary purpose is to monitor the cutting tool’s degradation in terms of flank wear. To do that properly, the data acquisition stage is the first focus, as it is crucial to the execution of the monitoring. Therefore, the data measured by the three sensor types was first considered and processed by the EWPD. For each cutter, i.e., C1, C4, and C6, the measured signals are decomposed into several modes for a total of 29 levels. An example of the decomposition is presented in Fig. 8. Although the level number is user-defined, it is recommended to set it proportionally to the sampling frequency: the higher the sampling frequency, the higher the level number that can be chosen. In this way, incomplete information coming from very narrow-band filters is avoided.

Fig. 8
figure 8

Empirical wavelet packet decomposition process

Following the decomposition, the focus shifts to feature extraction, aimed at acquiring valuable and meaningful information about the cutter’s degradation. In this paper, a new feature extraction scheme is developed and detailed in Algorithm 1.

Algorithm 1
figure a

Feature extraction.

The foundational aspects of the algorithm rely on statistical measures, commonly called time-domain features (TDFs), calculated for each mode along the 29 levels. Certainly, using all TDFs would be impractical. Therefore, only the root mean square, standard deviation, and variance are chosen, whose equations are given in Table 3. The selection starts from the premise that monotonic HIs better reflect the cutting tool degradation and, especially, help the deep learning model learn quickly and efficiently. The features computed from the empirical modes for each cutting cycle are stacked in a vector, defined as \(\textit{TF}\!~^{i}\), from which a clever selection of features of interest is performed. This is achieved by simply tracking the minimum value in the vector \(\textit{TF}\!~^{i}\) that is just slightly greater than that in the vector \(\textit{TF}\!~^{i-1}\). Figure 9 presents the processing results of the acoustic emission, vibration, and force signals. Figure 9a–c depicts the RMS-EWPD-derived features. The STD-EWPD-derived features are represented in Fig. 9d–f, while the VAR-EWPD-derived features are drawn in Fig. 9g–i. It can be seen that the resulting HIs show no random fluctuations or misleading trends. They exhibit a monotonically increasing behavior throughout the cutting tools’ lifetime that is easy to model by LSTM for RUL estimation.
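The selection rule described above can be sketched as follows. This is one plausible reading of the rule (per cycle, keep the smallest candidate value that still exceeds the previous HI value); the function names and the toy candidate vectors are ours, not the paper's Algorithm 1.

```python
import numpy as np

def time_domain_features(mode):
    """RMS, standard deviation, and variance of one empirical mode."""
    return {'rms': np.sqrt(np.mean(mode**2)),
            'std': np.std(mode),
            'var': np.var(mode)}

def select_monotonic_hi(tf_vectors):
    """For each cutting cycle i, pick from the candidate vector TF^i the
    minimum value that is still greater than the value kept at cycle
    i-1, yielding a monotonically non-decreasing health indicator."""
    hi = [min(tf_vectors[0])]
    for candidates in tf_vectors[1:]:
        larger = [v for v in candidates if v > hi[-1]]
        hi.append(min(larger) if larger else hi[-1])
    return hi

# toy example: 4 cutting cycles, 3 candidate feature values per cycle
tf = [[0.2, 0.5, 0.9], [0.1, 0.3, 0.8], [0.4, 0.25, 0.7], [0.6, 0.5, 0.95]]
hi = select_monotonic_hi(tf)
```

On the toy candidates, the rule discards the values that would break monotonicity (0.1 and 0.25) and returns a strictly increasing indicator.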

5.2 Feature evaluation and selection

When creating a predictive model, it is necessary to reduce the number of input variables, which helps lower the modeling’s computing cost and boosts the model’s performance. Thus, this step is intended to evaluate the obtained features and quantify their performance, in order to select the sensor whose results best represent the cutter’s degradation. To this end, Coble [33] proposed fundamental metrics defining an ideal prognostic parameter, such as monotonicity M and trendability T. The following equations [34] formalize these metrics:

$$\begin{aligned} M={\text {mean}}\left( \left| \frac{{\text {positive}}\left( {\text {diff}}\left( y_{i}\right) \right) -{\text {negative}}\left( {\text {diff}}\left( y_{i}\right) \right) }{n-1}\right| \right) \end{aligned}$$
(18)

where \(i=1,2,3,...,n\), and n is the length of the indicator y.

$$\begin{aligned} T=\min \left( \left|{\text {corrcoef}} \left( y_{i}, y_{j}\right) \right|\right) \end{aligned}$$
(19)

where \(i,j=1,2,3,...,m\), and m is the total number of monitored systems.

The term “monotonicity” refers to the indicator’s primary positive or negative trend [34]. This metric is considered one of the most precise measures for assessing the efficacy of the health indicator since the cutting tool’s degradation is related to wear progression, an irreversible phenomenon. A suitable condition indicator typically has a monotonic trend as a system gets closer to failure [35].

Table 3 Time-domain features

Trendability is defined as the degree to which a population of systems’ parameters have the same underlying shape and can be characterized by the same functional form [35].
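Under the definitions above, Eqs. 18 and 19 can be computed as follows; `positive`/`negative` are read as counts of the signs of the first differences, and the resampling to a common length in `trendability` is our assumption for populations with different record lengths.

```python
import numpy as np

def monotonicity(indicators):
    """Eq. 18: per indicator, |#positive diffs - #negative diffs|/(n-1),
    averaged over the population of monitored systems."""
    scores = []
    for y in indicators:
        d = np.diff(y)
        scores.append(abs(np.sum(d > 0) - np.sum(d < 0)) / (len(y) - 1))
    return np.mean(scores)

def trendability(indicators):
    """Eq. 19: the weakest absolute pairwise correlation within the
    population, after resampling all indicators to a common length."""
    n = min(len(y) for y in indicators)
    ys = [np.interp(np.linspace(0, 1, n),
                    np.linspace(0, 1, len(y)), y) for y in indicators]
    return np.abs(np.corrcoef(ys)).min()

# two strictly increasing toy HIs: M = 1 and T close to 1
his = [np.linspace(0, 1, 50) ** 2, np.linspace(0, 1, 50) ** 1.5]
```

Both metrics land in [0, 1], with 1 the ideal score, matching the interpretation given below.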

Fig. 9
figure 9

RMS-EWPD features derived from acoustic emission, vibration, and force sensors (ac), SD-EWPD features (df), and VAR-EWPD features (gi)

Each of these measures would have a value ranging from 0 to 1, with 1 representing an optimal score on that statistic and 0 denoting that the condition indicator is inappropriate for the depiction of the component degradation [35].

The results in Tables 4, 5, and 6 reveal that all the extracted degradation indicators exhibit a fully monotonic trend \((M=1)\), making them appropriate for representing the degradation phenomenon. Besides that, trendability results clearly illustrate that acoustic-emissions-sensor-derived features are more trendable in all cases, indicating that they are better correlated with the progression of the flank wear. Consequently, acoustic emissions sensor-based results are selected to be employed for the remainder of this work.

Next, for comparison purposes, the same data was processed by the WPT method, and naturally, the resulting modes are analyzed with the same statistical measures. The WPT-derived features, represented in Fig. 10, exhibit spurious fluctuations. The trends are very difficult to model, so poor RUL estimates are expected. The average values of monotonicity and trendability for both methods’ health indicators are shown in Table 7. Pearson’s correlation coefficient (r) between the actual wear and HIs is also applied to each method to indicate the most correlated features with tool wear that are not inflated by other cutting circumstances. The average value of Pearson’s correlation is presented in Table 7. The results show that, regardless of cutting parameters, the recommended features have the strongest correlation with wear conditions.

Table 4 RMS-EWPD features fitness
Table 5 SD-EWPD features fitness
Table 6 VAR-EWPD features fitness
Fig. 10 RMS-WPT-derived features (a), SD-WPT-derived features (b), and VAR-WPT-derived features (c)

Table 7 Performance comparison metrics for WPT-derived features and EWPD-derived features

5.3 Remaining useful life estimation

As RUL estimation is the main objective of this paper, this section explains the strategy adopted to determine the cutting tools’ RULs using the LSTM network.

Many studies perform prognostics by feeding raw data, usually of high dimension, directly into the deep learning model. Working with such data is difficult because it can be computationally expensive. In contrast, feature extraction helps reduce the complexity of the data, eliminate irrelevant and redundant information, and make the model focus on the most important information, which contributes to a better RUL prediction.

As shown in Fig. 11, RUL estimation can be summarized in two main steps:

  • Preparing input data for regression using LSTM

  • Building the LSTM network

Fig. 11 Diagram of RUL prediction by LSTM

Concerning the input data, six combinations, described in Table 8, were explored to validate the model’s performance and ensure its ability to generalize to different data distributions. Before training, the three HIs of each cutter, i.e., the RMS-EWPD, SD-EWPD, and VAR-EWPD-derived indicators, are stacked into an input matrix of shape (315, 3). These data are normalized with the min-max scaler, i.e., transformed so that they lie in the range [0, 1]. Normalization is a pre-processing step commonly used in deep learning: it improves the model’s convergence during training, reduces the risk of overfitting, and improves the model’s generalization ability. For large datasets, it also ensures that the model is not biased towards any particular feature and can learn the relationships between the features and the target more efficiently [36]. The min-max scaler is given in Eq. 20:

$$\begin{aligned} HI_{scaled}=\frac{HI-\min \left( HI\right) }{\max \left( HI\right) -\min \left( HI\right) } \end{aligned}$$
(20)

The LSTM architecture is designed to receive input data in a three-dimensional shape, i.e., (number of samples, time steps, features) [37], so that the network can analyze the input data in the proper sequence and identify patterns in it.

The normalized data are therefore reshaped to three dimensions, specifically (samples, time steps, features), to match the algorithm’s requirements. As a result, the input matrix is reshaped to (315, 3, 1): the number of rows of the original input matrix is treated as the number of samples, the number of columns as the number of time steps, and the values are stacked into a single feature.
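These two preparation steps can be sketched in a few lines of numpy; the random arrays below merely stand in for the real EWPD-derived indicators of one cutter:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the RMS-, SD-, and VAR-EWPD indicators (315 cuts)
rms_hi, sd_hi, var_hi = rng.random((3, 315))

X = np.column_stack([rms_hi, sd_hi, var_hi])             # shape (315, 3)

# Min-max scaling (Eq. 20), applied per indicator (column-wise)
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Reshape to (samples, time steps, features) = (315, 3, 1) for the LSTM
X_lstm = X_scaled.reshape(315, 3, 1)
print(X_lstm.shape)  # (315, 3, 1)
```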

In the second step, the model is built, with comprehensive architectural specifications provided in Table 9. The weights of the cell update and the gating units are adjusted during training using the Adam optimizer [38]. This algorithm is employed as a substitute for classical stochastic gradient descent, which maintains a single learning rate for all weight updates; Adam combines the advantages of adaptive methods such as AdaGrad and RMSProp [39]. It is also known to be robust to the choice of its hyperparameters [40].
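A rough Keras sketch of such a model follows; the layer sizes and learning rate are hypothetical placeholders (the actual specifications are those of Table 9), and RMSE is tracked as a metric while MSE is optimized, since the two share the same minimizer:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(3, 1)),   # (time steps, features)
    tf.keras.layers.LSTM(100),             # hypothetical number of units
    tf.keras.layers.Dense(1),              # scalar RUL output
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="mse",
    metrics=[tf.keras.metrics.RootMeanSquaredError()],
)
# Training would then be: model.fit(X_lstm, rul_targets, ...)
```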

A loss evaluation is performed after each training epoch using the root mean square error (RMSE). This error function is typically used for regression prediction models [37] and is defined as follows:

$$\begin{aligned} R M S E=\sqrt{\frac{1}{K} \sum _{i=1}^{K}\left( {R_{i}-P_{i}}\right) ^{2}} \end{aligned}$$
(21)

where \(R_{i}\) and \(P_{i}\) are the real and predicted values, respectively.

The RMSE decreases over training until the best possible fit is obtained.

The remaining useful lives of cutters C1, C4, and C6 predicted by the LSTM network trained on the proposed HIs are shown, together with the real RULs, in Fig. 12. The predictions follow a pattern similar to the ground truth and correspond closely to it. These results are confirmed by comparison with the RULs estimated by an LSTM trained on WPT-derived features. The accuracy (Acc) (Eq. 23), the mean absolute error (MAE) (Eq. 22), and the root mean square error (RMSE) metrics were used for error quantification.

Table 8 Training and testing sets configuration

As shown in Table 10, for each cutter, both the RMSE and MAE values are lower with the proposed methodology. Since lower values indicate a better fit to the ground truth, we can infer that the proposed methodology delivers the best results. The accuracy results, close to 1, confirm the method’s suitability for RUL prediction.

$$\begin{aligned} M A E=\frac{1}{K} \sum _{i=1}^{K}\left|{R_{i}-P_{i}}\right|\end{aligned}$$
(22)

and

$$\begin{aligned} {{A}_{cc}}=\frac{1}{K} \sum _{i=1}^{K}\exp \left( -\frac{|{R}_{i}-P_{i}|}{{R}_{i}}\right) \end{aligned}$$
(23)

where \(R_{i}\) and \(P_{i}\) are the real RUL and the predicted RUL, respectively.
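For reference, Eqs. 21–23 translate directly into numpy; the RUL arrays below are illustrative values, not results from the paper:

```python
import numpy as np

def rmse(R, P):
    return float(np.sqrt(np.mean((R - P) ** 2)))       # Eq. 21

def mae(R, P):
    return float(np.mean(np.abs(R - P)))               # Eq. 22

def acc(R, P):
    return float(np.mean(np.exp(-np.abs(R - P) / R)))  # Eq. 23

R = np.array([100.0, 80.0, 60.0])   # real RULs (illustrative)
P = np.array([ 98.0, 83.0, 59.0])   # predicted RULs (illustrative)
print(rmse(R, P), mae(R, P), acc(R, P))
```

A perfect prediction gives RMSE = MAE = 0 and Acc = 1; Acc decays towards 0 as the relative prediction error grows.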

Table 9 Model architecture details
Fig. 12 RUL prediction of cutting tools

6 Conclusion

In this paper, a signal processing approach, complemented by artificial intelligence, was introduced for tracking cutting tools’ degradation and estimating their remaining useful lives. The approach, termed empirical wavelet packet decomposition, provides a novel segmentation technique for the signal’s Fourier spectrum, enabling a more comprehensive exploration of the frequency bands. Leveraging these segmented sub-bands, a novel and robust health indicator is constructed. Furthermore, the widely used reference method WPT was employed for comparison. The observations of this study can be summarized as follows:

  • The proposed EWPD exhibits superior segmentation performance when compared to traditional binary trees and 1/3 binary tree structures, allowing for a more precise highlighting of the signal’s unique characteristics and properties.

  • The proposed feature extraction scheme, designed to exploit richer filter banks, yields features that consistently exhibit a monotonically increasing trend throughout the entire lifespan of the cutting tools.

  • The features derived from the acoustic emission sensor were found to be the best correlated with the progression of the cutting tool’s flank wear; thus, they were selected as the HIs.

  • The LSTM network was used to learn the HIs and estimate the RULs. The predicted RULs accurately follow the ground truth patterns, proving the high performance of the proposed method.

The developments in this paper show that a thorough focus on feature extraction and selection is essential to construct an effective health indicator that is easy to model. While the proposed method exhibits promising results, some limitations remain: the decomposition level in EWPD needs to be determined adaptively, and additional experiments across diverse datasets are needed to validate its generalization.

Table 10 Comparison of prediction performance using two different inputs for the LSTM network