Keywords

1 Introduction

Vibration analysis is widely employed in industrial applications. The fault vibration signal generated by the interaction between a damaged area and a rolling surface occurred regardless of the defect type. Therefore, vibration analysis can be employed to detect all types of faults, either localized or distributed. General signal based intelligent diagnosis methodology includes three steps as follows: signal acquisition; feature extraction; and fault recognition [1]. To extract representative features from the complex and non-stationary noisy signal, numerous vibration signal processing approaches have been developed such as statistical analysis, Fourier transform, wavelet transform, empirical mode decomposition (EMD). The feature set obtained from the extraction step often have a high dimension. To enhance the performance of the diagnosis system, dimension reduction methods are used such as principle component analysis (PCA), sequential feature selection (SFS). In the last step, machine learning algorithms such as artificial neural network (ANN), support vector machine (SVM), k-nearest neighbor (k-NN) are exploited to classify the faults.

Performance of machine learning based classification mainly depends on the feature extraction step. But there is no standard for extracting features because of the requirement of expert knowledge. Thus, for every specific fault diagnosis task, feature extractor must be redesigned manually.

Deep learning is a branch of machine learning based on algorithms that attempt to model high level abstractions of data. CNN is a deep learning algorithm with hierarchical neural networks whose convolutional layers alternate with sub-sampling layers, following with a full connection layer. A CNN primarily mimics the human visual system, which can efficiently recognize the parterns and structures in a visual scenery [2]. As a result, nowadays CNNs are successfully applied in many areas relating to image processing such as face recognition, object recognition, hand written recognition, video analysis, etc.

Recently, 2-D representations of signals have been exploited in various studies of fault diagnosis [3], where time-domain signals are converted to 2-D gray level images. In order to extract the texture information from the converted images, 2-D feature extraction methods are used such as gray-level co-ocurrence matrix (GLCM) [4], global neighborhood structure (GNS) map [5], etc. By exploiting the texture information, the 2-D image based fault diagnosis can achieve high accuracy, but the performance still depends mainly on handcrafted feature extraction.

Motivated by the efficient performance of CNN in image classification and the ability of representing signals in 2-D data, this paper proposes a bearing fault diagnosis algorithm using CNN which do not require any feature extractor. Gray-level images converted from the 1-D vibration signal are fed into CNN for classification. Experiments are carried out with the bearing data from the Case Western Reserve University Bearing Fault Database [6]. Comparison with other machine learning based fault diagnosis approaches (ANN, SVM, k-NN) is carried out to show the effectiveness of the proposed method.

This paper is organized as follows. The proposed fault diagnosis method is explained in detail in Sect. 2. Section 3 describes the implementations and performances. Finally, conclusions are drawn in Sect. 4.

2 The Proposed Bearing Fault Diagnosis Method

To exploit the efficient of CNN in image classification, the vibration signals are converted to gray images. Then vibration images are feed into CNN for classification. The block diagram of the proposed method is shown in Fig. 1.

Fig. 1.
figure 1

Block diagram of CNN based bearing fault diagnosis

2.1 Vibration 2-D Gray Image Construction

In the data conversion process, the amplitude of each sample in the vibration signal is normalized to range from 0 to 1. And the normalized amplitude of each sample becomes intensity of the corresponding pixel of the M × N corresponding image. The conversion between normalized amplitude of sample and corresponding pixel can be described as the following equation [3].

$$ P\left[ {i,j} \right] = A\left[ {\left( {i - 1} \right)*M + j} \right] $$
(1)

where \( i = 1:N; j = 1:M; \) \( P\left[ {i,j} \right] \) is intension of corresponding pixel \( \left( {i,j} \right) \) in the \( M \times N \) vibration gray image. A[.] is the normalized amplitude of the sample in the vibration signal. Number of pixel in the vibration image equals to number of sample in the vibration signal.

2.2 Image Classification with CNN

Typical CNN consists of four types of layers: convolution layer, sub-sampling layer, full connection layer and output layer. The network layers are arranged in a feed-forward structure: each convolution layer is followed by a sub-sampling layer. The last sub-sampling layer is followed by a full connection layer, which finally followed by the output layer.

At convolutional layer, the previous layer feature maps are convolved with learnable kernels and put through the activation function to form the output feature map. Each kernel is used at every position of the input. CNN exploits sparse connectivity by making the kernel smaller than the input and enforcing a local connectivity pattern among neurons of adjacent layers. Each output map may combine convolution with multiple inputs maps. However, for each output map, the input maps are convolved with distinct kernels. Each kernel is used at every position of the input. The parameter sharing used by the convolution operation means that rather than learning a separate set of parameters for every location, we learn only one set.

Each convolution layer is followed by a sub-sampling layer. A sub-sampling layer produces down-sampled versions of the input maps, progressively reduces the spatial size of the representation. That helps to decrease the number of parameters and computation in the network. Moreover, sub-sampling layer makes the representation become invariant with a small translation of the input. If there are N input maps, there will be exactly N output maps, although the output maps will be smaller.

The full connection layer is a traditional feed-forward neural network, neurons in this layer have full connections to all activations in the previous layer. The purpose of the full connection layer is to use the features from previous layer for classifying the input image into various classes. The final layer in a CNN is output layer, using the softmax as the activate function.

With three architectural ideals: local receptive fields, weight sharing and sub-sampling, CNN has many strength [7, 8]. First, feature extraction and classification are integrated into one structure and fully adaptive. Second, the network extracts 2-D image features at increasing dyadic scales. Third, it is relatively invariant to geometric, local distortions in the image.

3 Experimental Implementation

3.1 Test-Bed Specification

The bearing data were obtained from the Case Western Reserve University Bearing Fault Database [6]. Motivation of this choice is the fact that a public database which accessible to the research community allows a fair comparison of the performance of the proposed algorithms.

Vibration signals of six operating conditions as following: normal condition, inner race fault, ball fault, outer race fault at position 6, 3 and 12 o’clock position.

3.2 Experiment 1: Fault Diagnosis with CNN

At first, the vibration signals are split into non-overlapping segments. The length of the segments is chosen based on two criterions: (i) long enough to capture localized features of the signal and (ii) as short as possible to reduce the computation time. In this experiment, the length of each segment is selected as 441 samples. Signals from fan end are considered. 270 samples/condition are generated, thus for six conditions we have 270 × 6 = 1620 segments.

In the next step, vibration images are constructed by the method described in Sect. 2.1. Since the length of each vibration signal segment is 441, we select the size of vibration image \( M = 21, N = 21 \) pixel, \( \left( {M \times N = 21 \times 21 = 441} \right) \). By that way, we obtained an image set includes 1620 gray images with size \( 21 \times 21 \) pixel. Then the image set is split into two sets: training set (1080 images) and test set (540 images). The vibration images for six bearing conditions are shown in Fig. 2.

Fig. 2.
figure 2

Vibration gray image of bearing under six conditions: (a)-Normal, (b)-Inner race, (c)-Ball, (d)-Outer Race 6 o’clock, (e)-Outer Race 3 o’clock, (f)-Outer Race 12 o’clock

The configuration of CNN classification is as follow:

  • first layer: convolution layer, 30 kernels, each kernel has size \( 6 \times 6 \), stride step 1, ReLU activate function;

  • second layer: sub-sampling layer with filter size \( 2 \times 2 \), stride step 1;

  • third layer: convolution layer, 60 kernels, each kernel has size \( 3 \times 3 \), stride step 1, ReLU activate function;

  • fourth layer: sub-sampling layer with filter size \( 2 \times 2 \), stride step 1;

  • full connection layer;

  • output layer: six output (corresponding to 6 conditions need to be classified).

3.3 Experiment 2: Fault Diagnosis with Conventional Machine Learning

To make the comparison, we consider a machine learning based fault diagnosis approach. In first step, the vibration signals are split into non-overlapping segments as the same way used by Experiment 1 in Sect. 3.2.

In feature extraction step, we consider statistical features and Wavelet Packet analysis. Statistical features are obtained from both time domain and frequency domain. Ten statistical features in the time domain: root mean square (RMS), square root the amplitude (SRA), kurtosis value (KV), skewness value (SV), peak-to-peak value (PPV), crest factor (CF), impulse factor (IF), margin factor (MF), shape factor (SF), and kurtosis factor (KF). Three statistical features in frequency domain: frequency center (FC), RMS frequency (RMSF) and root variance frequency (RVF). Signal from both end (drive end and fan end) are considered. The total number of statistical features is \( \left( {10 + 3} \right) \times 2 = 26 \) features, i.e., ten statistical features from time domain, three statistical features from frequency domain, taken at both the drive end fan end of the motor of the test-bed.

In this experiment, wavelet packet analysis is used to extract features from the time-frequency domain. The analysis procedure in [9] is used. The mother wavelet is Daubechies 4, and refining is done down to the fourth decomposition level. With a tree depth of 4, 16 final leaves were obtained and consequently, \( 16 \times 2 = 32 \) features were taken for both drive end and fan end.

After the feature extraction step, we obtained a feature set with size \( 1620 \times 58 \), i.e., 1620 samples, each sample has \( 26 + 32 = 58 \) features.

In next step, to reduce the dimension of the feature set, we use the sequential forward selection (SFS) [10]. SFS starts with an empty set and then test each candidate together with the already-selected features. After applying SFS algorithm, we obtain a reduced feature set with size \( 1620 \times 4 \). In the classification step, artificial neural network (ANN), support vector machine (SVM) and k-nearest neighbor (k-NN) are used.

3.4 Comparison and Analysis

In this part, we make a comparison between the classification results of two above experiments. The first experiment was our proposed fault diagnosis method using CNN. The second experiment used conventional machine learning based fault diagnosis includes following step: feature extraction, dimensional reduction, machine learning classification.

The first classifier, CNN used vibration images to dignose the bearing faults. For the next three classifiers, the reduced feature set with only 4 features was used to train ANN, SVM and k-NN. The last three classifiers are ANN, SVM, and k-NN which trained by the original feature set with 58 features. In all case, after being trained with the same training set (1080 samples, 180 samples for each class), all classifiers were fed the same test set (540 samples, 90 samples for each class) to evaluate the performance.

In this comparison, four standard criteria are used to evaluation the experiment results [11]: accuracy (acc), specificity (tnr), fallout (fpr) and miss rate (fnr).

From the Table 1, we can see some important points are:

Table 1. Classification results
  • the classifiers ANN, SVM, k-NN wihout using SFS, which used full feature set (58 features) don’t have good performance;

  • using SFS, the performace of those classifiers are all enhanced by using the reduced feature set (4 features);

  • the CNN classifier using 2-D gray level image has very good performance (100% accuracy), does not use feature extractor and feature selection.

4 Conclusion

In this study, we proposed a novel approach based on CNN to classify the faults of rolling element. By converting 1-D vibration signal to 2-D images and exploiting CNN image classification technique, the proposed method has two main strong points:

  • firstly, do not require feature extraction and feature selection process which have big effect on the classification accuracy and usually require the expert knowledge about the system.

  • secondly, gives high accurate classification of bearing faults, validated through the experiments with real data.