1 Introduction

Machines are integral to daily life and to industrial production, where rotating machinery relies on rolling bearings to support axial and radial loads. Bearing faults account for 40%–45% of machine failures [1, 2]. To prevent sudden breakdowns, a condition-based machine health monitoring system is essential, particularly for the early detection of faults in rotating elements. Such a system uses either the vibration or the acoustic signals generated by the operation of rolling elements. Extensive research has been conducted in fault diagnosis and real-time machine health prognostics, initially focusing on statistical characteristics of vibration signals in the time and frequency domains [3]. However, the difficulty of handling nonlinear and nonstationary signals led researchers to employ advanced techniques such as Ensemble Empirical Mode Decomposition (EEMD). Li Hua et al. [3, 4] addressed the problems of optimal IMF band selection and enhanced denoising by introducing an improved EEMD method that incorporates improved adaptive resonance technology (IART). Zair et al. [5] presented a new method for multi-fault diagnosis in rolling bearings, combining fuzzy entropy of empirical mode decomposition, principal component analysis, and a self-organizing map neural network.

Wavelets [6,7,8,9,10], known for their adaptability to signal shapes, have become a vital tool for analyzing nonlinear and nonstationary signals, providing multiresolution decomposition and noise suppression capabilities. Many researchers [10,11,12,13,14,15] have explored analyzing vibration signals for fault detection and classification by converting them from 1D to 2D representations, which allows image processing and machine learning techniques to be applied. Vibration signals are converted to grayscale images, spectral images, or other 2D formats.
Features are then extracted and used for classification with techniques such as SVM, K-nearest neighbors, and artificial neural networks. Nowadays, intelligent fault detection using deep learning approaches is widely accepted in machinery health monitoring systems [16,17,18]. Sharma et al. [16] introduced an automated seizure classification technique using nonlinear higher-order statistics and a deep neural network with a sparse autoencoder. Sun et al. [17] presented a deep neural network approach for induction motor fault diagnosis using a sparse autoencoder for feature learning and fault classification. Zair et al. [18] presented a novel unsupervised deep learning methodology for fault diagnosis that overcomes the limitations of traditional feature extraction methods. Their approach, which combines an autoencoder, t-SNE, and a multi-kernel convolutional neural network, achieves higher accuracy in diagnosing bearing defects than conventional techniques. Gao et al. [19] proposed an optimized softmax classifier for accurate classification of ultrasonic signals, reporting high classification accuracy and strong robustness.

This paper proposes a data-driven approach to fault classification that determines whether a fault occurs in the inner race, the outer race, or the ball of the rolling bearing element. The classification is achieved using the shearlet transform, an autoencoder, and a softmax classifier.

The remainder of the paper is organized as follows. Section 2 presents the theoretical background. Section 3 outlines the proposed methodology. Section 4 discusses the experimental results, and Section 5 concludes the paper.

2 Theoretical background

2.1 Autoencoder

The fundamental idea behind an autoencoder is to encode input data into a lower-dimensional representation and then decode it back to its original form.

The encoder takes the input data and maps it to a lower-dimensional latent space representation. This mapping is achieved through a series of hidden layers with decreasing dimensions, ultimately compressing the input into a condensed representation. The decoder then takes this compressed representation and attempts to reconstruct the original input data with some loss. Figures 1 and 2 illustrate the block diagram and architecture of an autoencoder.

Fig. 1 Block diagram of autoencoder

Fig. 2 Architecture of autoencoder

During training, the goal is to minimize the reconstruction error, which is typically measured using a loss function such as mean squared error (MSE) or binary cross-entropy.

The loss function quantifies the difference between the original input and the reconstructed output [16,17,18].

The input \({\text{X}} \in {\text{R}}^{{\text{D}}}\) lies in a D-dimensional vector space. The encoder [18] maps the input to the latent space as

$${\text{h}} = \upphi \left( {{\text{WX}} + {\text{b}}} \right)$$
(1)

where \({\text{h}} \in {\text{R}}^{{\text{d}}}\) is the latent representation in a d-dimensional vector space with \({\text{d}} < {\text{D}}\), \(\upphi\) is the activation function of the encoder, \({\text{W}} \in {\text{R}}^{{{\text{d}} \times {\text{D}}}}\) is the weight matrix, and b is the bias vector.

The decoder maps the latent representation \({\text{h}}\) back to the input space as \(\widehat{{\text{X}}}\), incurring some reconstruction loss:

$$\widehat{{\text{X}}} = \uptheta \left( {{\text{W}}^{\prime } \cdot {\text{h}} + {\text{b}}^{\prime } } \right)$$
(2)

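Here, \(\uptheta\) is the activation function of the decoder, and \({\text{W}}^{\prime }\) and \({\text{b}}^{\prime }\) are the decoder weight matrix and bias.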

The choice of activation functions for both the encoder and decoder is deliberate, aiming to prevent the reconstructed signal from being an exact copy of the original input. Instead, the aim is to generate a reconstructed signal that serves as an approximation, effectively addressing the overfitting problem common in Artificial Neural Networks (ANNs).

The formulation of the cost error function incorporates the loss function and regularization, as depicted in Eq. (3) [16, 17].

$$\upvarphi \left( {\text{e}} \right) = \frac{1}{2}\parallel{\text{X}} - \widehat{{\text{X}}}\parallel^{2} + \Omega \left( {{\text{h}}, \,{\text{X}}} \right)$$
(3)

In this context, the first term \(\frac{1}{2}\parallel{\text{X}} - \widehat{{\text{X}}}\parallel^{2}\) represents the loss function, while the second term \(\Omega \left( {{\text{h}}, {\text{X}}} \right)\) serves as regularization, aiming to deter memorization or overfitting. This regularization can be expressed as follows [16]

$$\Omega \left( {{\text{h}},{\text{X}}} \right) = \lambda \sum\limits_{{{\text{i}} = 1}}^{{\text{p}}} \parallel{\nabla _{{{\text{x}}_{{\text{i}}} }} } \cdot {\text{h}}_{{\text{i}}}\parallel^{2}$$
(4)

Here \(\uplambda\) is the weight decay parameter, and \(\nabla_{{\text{x}}}\) is the gradient operator taken with respect to the input \({\text{X}}\) associated with the \({\text{i}}^{{{\text{th}}}}\) node.
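Eqs. (1) to (4) can be illustrated with a minimal sketch in plain Python. Several choices here are assumptions rather than details given in the text: the encoder activation is taken to be the logistic sigmoid, the decoder activation the identity, and Eq. (4) is read as a contractive-style penalty on the gradient of each hidden unit with respect to the input. All function names and the toy sizes are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vec):
    """Return weights @ vec + bias for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def encode(W, b, x):
    # Eq. (1): h = phi(W x + b); phi is ASSUMED to be the logistic sigmoid
    return [sigmoid(z) for z in affine(W, b, x)]


def decode(W_dec, b_dec, h):
    # Eq. (2): x_hat = theta(W' h + b'); theta is ASSUMED to be the identity
    return affine(W_dec, b_dec, h)


def contractive_penalty(W, h, lam):
    # Eq. (4) under the sigmoid assumption: dh_i/dx_j = h_i (1 - h_i) W_ij,
    # so each squared gradient norm is (h_i (1 - h_i))^2 * sum_j W_ij^2
    return lam * sum((hi * (1.0 - hi)) ** 2 * sum(w * w for w in row)
                     for hi, row in zip(h, W))


def cost(W, b, W_dec, b_dec, x, lam):
    # Eq. (3): half squared reconstruction error plus the regularization term
    h = encode(W, b, x)
    x_hat = decode(W_dec, b_dec, h)
    recon = 0.5 * sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    return recon + contractive_penalty(W, h, lam)


# Toy sizes (hypothetical): D = 6 input dimensions, d = 2 latent dimensions
rng = random.Random(0)
D, d = 6, 2
W = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(d)]
b = [0.0] * d
W_dec = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(D)]
b_dec = [0.0] * D
x = [rng.random() for _ in range(D)]

h = encode(W, b, x)
print(len(h), len(decode(W_dec, b_dec, h)))  # prints: 2 6
print(cost(W, b, W_dec, b_dec, x, lam=1e-3))
```

Training would adjust the weights and biases to lower this cost; the sketch only evaluates it for fixed random weights.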

2.2 Softmax classifier

The softmax classifier, considered a linear classifier, generates output as a probability distribution across all potential classes. It utilizes the cross-entropy function to adjust the layer weights, where cross-entropy serves as a loss function quantifying the deviation between predicted and actual outputs. In the context of probability distributions, the formal definition of cross-entropy [19] between two distributions is expressed by Eq. (5).

$$\mathcal{L} \left( { {\text{X}},\,\widehat{{\text{X}}} } \right) = - \mathop \sum \limits_{{{\text{i}} = 1}}^{{\text{n}}} {\text{x}}^{\left( {\text{i}} \right)}\log( \widehat{\text{x}}^{\left( {\text{i}} \right)})$$
(5)

Here, \({\text{X}}\) and \(\widehat{{\text{X}}}\) denote the original input signal and its approximate output signal, respectively, and n is the total number of sample points in each. Equation (6) gives the probability that the softmax classifier assigns input X to class K, given the class weights.

$${\text{P}}\left( {{\text{Y}} = {\text{K}}|{\text{X}}} \right) = \frac{{\exp \left( {{\text{W}}_{{\text{K}}}^{{\text{T}}} \cdot {\text{X}}} \right)}}{{\sum\nolimits_{{{\text{i}} = 1}}^{{\text{N}}} {\exp } \left( {{\text{W}}_{{\text{i}}}^{{\text{T}}} \cdot {\text{X}}} \right)}}$$
(6)

Here, (·) denotes the dot product and N is the total number of classes. The softmax classifier, implemented in the final stage of the system, categorizes the input signal using the dimensionally reduced code generated by the autoencoder.
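As a small illustration of Eqs. (5) and (6), the following sketch computes softmax class probabilities from one weight vector per class and evaluates the cross-entropy against a one-hot target. The weight values, the three-element input code, and the function names are hypothetical; subtracting the maximum score before exponentiation is a standard numerical safeguard and does not change the result.

```python
import math


def softmax_probs(class_weights, x):
    """Eq. (6): P(Y = k | X) for every class k, one weight vector per class."""
    scores = [sum(w * v for w, v in zip(w_k, x)) for w_k in class_weights]
    top = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted, eps=1e-12):
    """Eq. (5): -sum_i target_i * log(predicted_i); eps guards against log(0)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


# Hypothetical weights for N = 4 classes acting on a 3-element code
weights = [
    [2.0, 0.0, -1.0],
    [0.0, 1.0, 0.0],
    [-1.0, 0.5, 1.0],
    [0.5, -0.5, 0.5],
]
code = [1.0, 0.2, -0.3]

probs = softmax_probs(weights, code)
one_hot = [1.0, 0.0, 0.0, 0.0]  # true class is the first one

print(probs.index(max(probs)))        # index of the most probable class
print(cross_entropy(one_hot, probs))  # loss for this single sample
```

The loss shrinks toward zero as the probability assigned to the true class approaches one, which is what drives the weight updates during training.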

2.3 Shearlet transform

The shearlet transform extends the concept of wavelet transforms by incorporating directional sensitivity, making it particularly useful for tasks such as edge detection, image denoising, and texture analysis.

Shearlet transform employs shearing operations to capture directional information effectively. These operations involve stretching and compressing data along different directions, allowing the transform to adapt to the local structure of the data. By applying shearing transformations at various scales and positions, the shearlet transform can provide a multiscale representation of the input signal or image.

Shearlets are formed through a combination of parabolic scaling, shearing, and translation applied to a small set of generating functions. At finer scales, they primarily occupy narrow and directionally oriented ridges, conforming to the parabolic scaling principle, where the squared length approximately equals the width.

The continuous shearlet system is generated [20,21,22] by a function \(\Psi \in \mathcal{L}^{2} \left( {{\mathbb{R}}^{2} } \right)\)

$${\text{SH}}_{{\text{CONT }}} \left( \Psi \right) = \Psi_{{\text{a,s,t }}} = {\text{a}}^{3/4} \Psi \left( {{\text{S}}_{{\text{s }}} {\text{A}}_{{\text{a}}} \left( {. - {\text{t}}} \right)} \right)$$
(7)

where \(S_{{\text{s}}}\) and \(A_{a}\) are the shear and dilation matrices, respectively, which generate the shearlet system. The parameters \({\text{a}}\), \({\text{s}}\) and \({\text{t}}\) are the dilation, shear and translation factors, with \(\left( {{\text{a}},{\text{s}},{\text{t}}} \right) \in \left( {{\mathbb{R}} \times {\mathbb{R}} \times {\mathbb{R}}^{2} } \right)\).

The dilation and shear matrices are defined by

$${\text{A}}_{{\text{a}}} = \left[ {\begin{array}{cc} {{\text{a}}^{1/2}} & 0 \\ 0 & {{\text{a}}^{-1/2}} \\ \end{array}} \right] \quad {\text{and}} \quad {\text{S}}_{{\text{s}}} = \left[ {\begin{array}{cc} 1 & {\text{s}} \\ 0 & 1 \\ \end{array}} \right]$$
(8)
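To make the roles of the two matrices concrete, the sketch below builds them as in Eq. (8) and evaluates the argument of the generating function in Eq. (7) for one point. It is only an illustration with hypothetical function names and parameter values, not part of any shearlet toolbox.

```python
def dilation_matrix(a):
    """A_a from Eq. (8): diag(a^(1/2), a^(-1/2)), defined here for a > 0."""
    if a <= 0:
        raise ValueError("dilation factor a must be positive")
    return [[a ** 0.5, 0.0], [0.0, a ** -0.5]]


def shear_matrix(s):
    """S_s from Eq. (8): unit upper-triangular matrix with shear s."""
    return [[1.0, s], [0.0, 1.0]]


def matmul2(p, q):
    """Product of two 2x2 matrices given as lists of rows."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def shearlet_argument(a, s, t, x):
    """Compute S_s A_a (x - t), the argument of Psi in Eq. (7)."""
    m = matmul2(shear_matrix(s), dilation_matrix(a))
    shifted = [x[0] - t[0], x[1] - t[1]]
    return [m[i][0] * shifted[0] + m[i][1] * shifted[1] for i in range(2)]


# Hypothetical parameters: dilation a = 4, shear s = 1, no translation
print(matmul2(shear_matrix(1.0), dilation_matrix(4.0)))
# prints: [[2.0, 0.5], [0.0, 0.5]]
print(shearlet_argument(4.0, 1.0, (0.0, 0.0), (1.0, 2.0)))
# prints: [3.0, 1.0]
```

Both matrices have determinant one, so the combined map stretches one axis, compresses the other, and slants the result without changing area.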

The discrete shearlet system is derived by discretizing the parameter set of the continuous shearlet system [24, 25] and is characterized by

$${\text{SH}}\left( \Psi \right) = \Psi_{{\text{j,k,c}}} = 2^{{3{\text{j}}/4}} \Psi \left( {{\text{S}}_{{\text{k }}} {\text{A}}_{{{2}^{{\text{j}}} }} \left( {{.}{ - }{\text{c}}} \right)} \right)$$
(9)

where \(\left( {{\text{j}},{\text{k}},{\text{c}}} \right) \in \left( {\mathbb{Z}} \times {\mathbb{Z}} \times {\mathbb{Z}}^{2} \right)\) are the scale parameter, shearing parameter and cone parameter, respectively.

A three-scale shearlet system is presented in Fig. 3. Its fan-like pattern gives the system directional sensitivity, allowing it to capture directional details efficiently. The number of shearing factors within the shearlet system increases as the frequency support increases, so the system becomes more capable of capturing higher-frequency details. However, as the frequency support increases, the support in the spatial domain decreases. This trade-off between frequency and spatial support is characteristic of the shearlet system.

Fig. 3 Frequency tiling in shearlet system

In the shearlet transform, the images at different scales are represented by shearlet coefficients. These coefficients capture the multi-directional and multiscale information of the input image. The directional information associated with an image such as Barbara can be visualized using shearlet transform. Figure 4 shows the average information of the image using shearlet parameters (j,k,c) as (0,0,0) while Fig. 5 shows the shearlet transform of this image at two scales with four shearlet filters at different orientations at each scale.

Fig. 4 Shearlet coefficients of Barbara with shearlet parameters (j,k,c) as (0,0,0)

Fig. 5 Shearlet coefficients of Barbara at two scales and four orientations at each scale, with shearlet parameters (j,k,c) as (a) (1,-1,1), (1,0,1), (1,1,1), (1,0,2) and (b) (2,-1,1), (2,0,1), (2,1,1), (2,0,2)

Shearlets form a multiscale framework that allows efficient encoding of anisotropic features in multivariate problem classes. The shearlet toolbox [23] was downloaded from https://shearlab.math.lmu.de/software.

3 Proposed methodology

The proposed methodology is depicted in Fig. 6. The 1D vibration data obtained from a machinery fault simulator is transformed into a 2D grayscale image by slicing the 1D vibration signal. Each slice contains the samples of one shaft rotation, a number determined by the sampling rate (fs) and the shaft rotation speed (fr, expressed in revolutions per second): n = fs / fr. If the length of the 1D signal is l, then the total number of slices is m = l/n.

Fig. 6 Proposed methodology

This means that the 1D vibration signal is converted into a 2D vibration data matrix, denoted as D(m,n). In the second step, the 2D matrix D(m,n) is further transformed into a gray-level image, represented as I(m,n).
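The slicing step can be sketched as follows. This is a minimal illustration under stated assumptions: the shaft speed is converted from rpm to revolutions per second before computing n = fs / fr, leftover samples that do not fill a complete slice are discarded, and the gray-level mapping is a simple min-max scaling to 0..255. None of these details, nor any resizing of the resulting image, is specified in the text, and the function names are hypothetical.

```python
import math


def samples_per_rotation(fs_hz, shaft_rpm):
    """n = fs / fr, with fr the shaft rotation frequency in Hz (rpm / 60)."""
    return int(round(fs_hz / (shaft_rpm / 60.0)))


def slice_signal(signal, n):
    """Cut a 1D signal into m = len(signal) // n rows of n samples each."""
    m = len(signal) // n  # ASSUMPTION: an incomplete final slice is dropped
    return [signal[r * n:(r + 1) * n] for r in range(m)]


def to_gray(matrix):
    """Min-max scale D(m, n) to integer gray levels 0..255 (ASSUMED mapping)."""
    lo = min(min(row) for row in matrix)
    hi = max(max(row) for row in matrix)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant signal
    return [[int(round(255 * (v - lo) / span)) for v in row] for row in matrix]


# Samples per rotation for a 25.6 kHz sampling rate
print(samples_per_rotation(25600, 600))   # prints: 2560
print(samples_per_rotation(25600, 1200))  # prints: 1280

# Synthetic signal: 4 full "rotations" of 8 samples plus 3 leftover samples
n = 8
signal = [math.sin(2 * math.pi * k / n) + 0.1 * (k // n) for k in range(4 * n + 3)]
D = slice_signal(signal, n)   # vibration data matrix D(m, n)
I = to_gray(D)                # gray-level image I(m, n)
print(len(D), len(D[0]))      # prints: 4 8
```

Aligning each row with one shaft rotation places periodic fault impacts at similar column positions in successive rows, which is what gives the resulting image its texture.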

The 2D vibration images undergo a shearlet transform to remove noise and enhance the textures present in them. These enhanced vibration images are then fed into an autoencoder.

In the autoencoder, the encoder section compresses the images, treating them as features to train the neural network. The purpose of this compression is to extract important information from the images.

Following this, a softmax classifier performs the fault diagnosis task. It analyzes the condensed features extracted by the autoencoder and distinguishes between healthy and faulty bearing signals based on learned patterns.

4 Experimental results and discussions

Vibration data was generated using a machinery fault simulator to simulate both healthy and various faulty conditions. These vibrations were captured using a highly sensitive triaxial accelerometer, specifically the model 356A16. The sensitivity of this device is 10.2 mV/(m/s²). To record the vibration signals, a 4-channel DAQ (Data Acquisition) system was used, and the signals were stored in a personal computer. The experimental setup, depicted in Fig. 7, illustrates how the machinery fault simulator was configured to capture the vibration signals from different bearings, i.e. faulty and non-faulty.

Fig. 7 Experimental setup of machinery fault simulator (MFS)

The specifications of the various parts of the MFS are furnished in Table 1.

Table 1 Specifications of experimental setup

The vibration data was sampled at 25.6 kHz, with the shaft speed set at 600 rpm and 1200 rpm. Three load conditions were considered: no load, medium load, and heavy load. Vibration signals were thus captured from various bearings under different load conditions and rotating speeds. The 1D vibration signals depicted in Fig. 8 are converted into 2D gray-level images of size [256, 256].

Fig. 8 1D vibration signal for healthy and faulty bearings at no load

Following this, a shearlet transform is applied to these raw images. It acts as a filter that suppresses noise and enhances the underlying details, so that the textures become visible and distinct. Figure 9 illustrates both the original and denoised vibration images of healthy bearings and those with various faults, including inner race (IR), outer race (OR), and ball faults. Four images with dimensions of [256, 256] are combined into a matrix of size [1024, 256]. This matrix is then passed through the autoencoder, which includes a single hidden layer of 25 nodes. The encoder transforms the input matrix into a corresponding code of dimensions [1024, 25], as illustrated in Fig. 10.
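The dimension bookkeeping in this step can be checked with a short sketch. It assumes that the four images are stacked row-wise so that each image row becomes one 256-element sample, and that the encoder uses a sigmoid activation; neither detail is stated explicitly in the text. The random pixel values, weights, and function names are placeholders.

```python
import math
import random


def stack_images(images):
    """Concatenate image rows vertically: four [256, 256] images -> [1024, 256]."""
    return [row for image in images for row in image]


def encode_rows(rows, W, b):
    """Map every row to its hidden code with a sigmoid encoder (ASSUMED activation)."""
    return [[1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(w_row, row)) + b_i)))
             for w_row, b_i in zip(W, b)]
            for row in rows]


rng = random.Random(0)
size, hidden = 256, 25

# Placeholder data: four random images standing in for the vibration images
images = [[[rng.random() for _ in range(size)] for _ in range(size)]
          for _ in range(4)]

# Placeholder encoder parameters: one weight row per hidden node
W = [[rng.uniform(-0.05, 0.05) for _ in range(size)] for _ in range(hidden)]
b = [0.0] * hidden

X = stack_images(images)
codes = encode_rows(X, W, b)
print(len(X), len(X[0]))          # prints: 1024 256
print(len(codes), len(codes[0]))  # prints: 1024 25
```

Under this reading, each of the 1024 rows is a separate sample described by 25 features, which is the input the softmax classifier would see.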

Fig. 9 Fault classification using softmax classifier

Fig. 10 2D raw and denoised vibration images for healthy and faulty bearings (IR fault, OR fault and ball fault)

Figure 11 displays a confusion matrix demonstrating a classification accuracy of 75.5% under the condition of 600 rpm and heavy load when using raw vibration images.

Fig. 11 Confusion matrix for raw vibration images at 600 rpm and heavy load

However, employing denoised vibration images leads to a substantial enhancement in classifier performance, resulting in an impressive overall accuracy rate of 99.4% as illustrated in Fig. 12.

Fig. 12 Confusion matrix for denoised vibration images at 600 rpm and heavy load

In the analysis, it was observed that all healthy and OR Fault samples were detected accurately. For IR Fault and Ball Fault, out of the 256 samples, 252 samples of IR Fault and 254 samples of Ball Fault were diagnosed successfully. The diagnostic accuracy rates were 100% for healthy and OR Fault, while for IR Fault and Ball Fault, the rates were 98.4% and 99.2% respectively.

Similarly, in Fig. 13, under the condition of 1200 rpm and heavy load, it was observed that 256 out of 256 samples were successfully recognized for both IR Faults and OR Faults. Furthermore, out of 256 healthy bearings samples, 230 were diagnosed successfully, while 26 samples were misclassified as ball faults. In the case of ball faults, 237 out of 256 samples were accurately recognized, but 19 samples were misclassified as healthy bearings.

Fig. 13 Confusion matrix for denoised vibration images at 1200 rpm and heavy load

Thus the classification accuracies for healthy bearings and ball faults were 89.8% and 92.6%, respectively, while 100% accuracy was achieved for IR faults and OR faults. The classifier exhibits an overall classification accuracy of 91.3% for raw images at 1200 rpm under heavy load, which rises to 95.6% when the images are denoised.
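The per-class and overall figures for the denoised images at 1200 rpm and heavy load can be reproduced from the sample counts reported above. The sketch below assumes a confusion matrix whose rows are true classes and whose columns are predicted classes; the class ordering and function names are illustrative choices.

```python
LABELS = ["Healthy", "IR", "OR", "Ball"]

# Rows: true class, columns: predicted class (ASSUMED orientation).
# Counts follow the text: 26 healthy samples predicted as ball faults,
# 19 ball-fault samples predicted as healthy, IR and OR fully correct.
confusion = [
    [230, 0, 0, 26],
    [0, 256, 0, 0],
    [0, 0, 256, 0],
    [19, 0, 0, 237],
]


def per_class_accuracy(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Correct predictions (diagonal) divided by all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


for label, acc in zip(LABELS, per_class_accuracy(confusion)):
    print(f"{label}: {100 * acc:.1f}%")
print(f"Overall: {100 * overall_accuracy(confusion):.1f}%")
# prints:
# Healthy: 89.8%
# IR: 100.0%
# OR: 100.0%
# Ball: 92.6%
# Overall: 95.6%
```

The overall value corresponds to 979 correct predictions out of 1024 samples.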

This enhancement suggests that the denoising process effectively enhances the quality of the input data, resulting in more accurate and reliable classification outcomes. The drastic improvement in classifier performance highlights the importance of preprocessing using the shearlet transform in optimizing the performance of machine learning models applied to vibration data analysis.

Table 2 and Table 3 showcase the performance of the classifier at two different rotating speeds 600 rpm and 1200 rpm and three load conditions. The classifier exhibits relatively low accuracies, ranging from 55.5% to 76.2%, when utilizing raw vibration images at 600 rpm across various load conditions. Likewise, at 1200 rpm, its performance varies between 58.4% and 91.3% for raw vibration images.

Table 2 Fault classification accuracies of raw and denoised vibration images of bearings at 600 rpm
Table 3 Fault classification accuracies of raw and denoised vibration images of bearings at 1200 rpm

However, when denoised vibration images are employed, there is a significant improvement in performance. Specifically, accuracy reaches 100% for both no load and medium load conditions at both 600 rpm and 1200 rpm. Under heavy load conditions, the accuracies remain high, reaching 99.4% at 600 rpm and 95.6% at 1200 rpm for denoised vibration images.

5 Conclusion

This manuscript presents a novel methodology for effective fault diagnosis of bearings in mechanical systems. The proposed approach combines the use of shearlet transform, autoencoder, and softmax classifier to handle the dynamic nature of the machinery and to enhance the diagnostic accuracy.

By transforming vibration signals into 2D images and enhancing the image texture using the shearlet transform, the intricate details of the underlying mechanical conditions are captured. The enhanced images undergo compression using an autoencoder to extract important information and create a condensed feature space. These features are then fed to a softmax classifier, which acts as the fault diagnosis classifier.

Experimental results demonstrate the robustness and efficacy of the proposed methodology. When utilizing raw vibration images, the classifier achieves relatively low accuracies. However, when denoised vibration images are employed, there is a significant improvement in performance. Accuracy reaches 100% for both no load and medium load conditions at both 600 rpm and 1200 rpm, while under heavy load conditions the accuracies remain high, reaching 99.4% at 600 rpm and 95.6% at 1200 rpm.

The proposed methodology proves effective in handling varying speed and load conditions, making it suitable for real-world operating scenarios. The high classification accuracy achieved across diverse operating conditions demonstrates the potential for practical implementation and maintenance of mechanical systems.