1 Introduction

In recent years, health monitoring in the mechanical industry has become an urgent issue. Some scholars have begun to monitor the health status of structures such as beams and trusses [1, 2], and with the development of automated machinery and equipment, interest in fault diagnosis of key components such as bearings has grown steadily [3,4,5]. At present, bearing fault diagnosis methods fall into three categories: signal processing and analysis, traditional fault diagnosis based on feature extraction, and deep learning methods that extract features automatically [6]. In practice, the three types of methods play to their respective strengths and complement each other. From the perspective of signal processing, Li et al. [7] proposed a new time–frequency analysis (TFA) post-processing algorithm called local maximum high-order time iterative synchrosqueezing (LHTIS) and demonstrated its effectiveness by analyzing and processing fault signals. Pan et al. [8] proposed a multi-class fuzzy support matrix machine and successfully applied it to roller bearing fault diagnosis. In one study, the VMD parameters and kernel fuzzy c-means (KFCM) were optimized separately, and the bearing fault types of small samples were then identified [9]. Another study proposed a bearing fault diagnosis method based on the wavelet packet transform and a convolutional neural network optimized by a simulated annealing algorithm [10]. Ensemble self-taught learning convolutional auto-encoders (STL-CAEs) were proposed in [11] to address the scarcity of labeled data. Combining Dempster–Shafer (DS) evidence theory with support vector machines (SVM) has also appeared in bearing fault diagnosis research [12], and a new fault diagnosis method called RSG was proposed in [13]. Yang et al. [14] combined a two-dimensional convolutional neural network (2DCNN) feature extractor with a random forest (RF) classifier to diagnose faults in the high-speed bearings of offshore wind turbines, reaching 99.5% accuracy with 700 training samples and 300 test samples. In [15], 30 training samples and 30 test samples were intercepted, the time-domain and frequency-domain features of the samples were calculated and mixed, and a deep neural network identified the fault type with an accuracy of 99.1%. In [16], the weighted signal difference average (WSDA) was proposed as a new fitness function to optimize VMD, and a one-dimensional neural network was used for rolling bearing fault diagnosis; with 5000 training samples and 1000 test samples, the diagnostic accuracy was 99.6%. In [17], a few-shot learning method was successfully applied to the fault diagnosis of rolling bearings and verified under mixed working conditions; with 60, 200, 900, and 19,800 training samples and 75 test samples, the accuracy rates were 82.8%, 94.32%, 98.55%, and 99.77%, respectively.

It can be seen from previous research that deep learning is widely used in the field of bearing fault diagnosis, owing to its advantages in feature extraction and classification. However, the accuracy of deep learning often comes at the cost of a large number of training samples. In addition, some studies are carried out only for a single working condition, and the effect of applying the proposed model to other working conditions remains to be verified. Based on the above analysis, the WHO-VMD-CCWT-EFF is proposed in this paper. To verify the usability and universality of the model, the CWRU dataset [18] and the Paderborn dataset [19] are used for various single and multiple working condition experiments. The experimental results indicate that WHO-VMD-CCWT-EFF can achieve good results with only 10 training samples and 90 test samples. The main contributions of this paper are summarized as follows:

  1. A correlation coefficient weight threshold denoising method is proposed to denoise the fault signal decomposed by VMD.

  2. To extract better classification features, an entropy feature fusion method is proposed, and the new bearing fault diagnosis method named WHO-VMD-CCWT-EFF is verified in the experiments.

  3. A new deviation metric is used to measure the stability of the model and is validated in various experiments.

  4. In addition to the single working condition experiments, experiments mixing multiple working conditions are carried out with good results, and the WHO-VMD-CCWT-EFF remains applicable and stable with a small amount of data.

The rest of this paper is organized as follows: Section 2 introduces the theoretical basis related to the model. Section 3 introduces the framework of the model. The CWRU dataset and the Paderborn dataset are used for the experiments and analyses in Sect. 4. Finally, Sect. 5 gives the conclusion.

2 Theoretical backgrounds

2.1 WHO-VMD

VMD [20] is a non-recursive, adaptive signal processing algorithm developed from algorithms such as empirical mode decomposition (EMD). It decomposes the original signal in the frequency domain into intrinsic mode functions (IMFs) with limited bandwidth and specific center frequencies. WHO is a meta-heuristic optimization algorithm proposed by Iraj Naruei [21]. Similar optimization algorithms, such as the improved grey wolf optimization (IGWO), ant lion optimizer (ALO), and marine predator algorithm (MPA), have also been applied successfully to various structural detection tasks. This paper selects WHO to optimize the VMD parameters [22,23,24].

The algorithm is mainly inspired by a behavior of wild horses that distinguishes them from other animals: foals leave their natal groups before puberty and join other groups, which avoids mating between relatives. In addition to mating behavior, wild horses update their positions through social behaviors such as grazing, group leadership, and the exchange and selection of leaders.

In the process of VMD decomposition of the bearing fault signal, it is found that the number of decomposition layers \(k\) (i.e., the number of IMFs) and the penalty factor α directly affect the decomposition result. To select relatively optimal parameters, the VMD parameters \( \left( {k,\alpha } \right)\) are optimized by WHO, as shown in Fig. 1.

  • Step 1: Set the parameters of WHO. The total number of wild horses N = 30, the maximum number of iterations Max_iter = 30, the crossover ratio PC = 0.13, the percentage of stallions in the group population PS = 0.2, the number of stallions Nstallion = N*PS, and the number of foals in each group Nfoal = (N-Nstallion)/Nstallion. The parameters to be optimized are \(\alpha \in \left[ {100,{ }2000} \right]\), \(k \in \left[ {4,{ }8} \right]\), \(\alpha ,k \in Z\).

  • Step 2: Create populations, select leaders, and calculate the fitness function values.

  • Step 3: Search and update according to grazing behavior if Rand > PC, otherwise update by mating behavior. Here Rand is a random number drawn from a uniform distribution on [0, 1].

  • Step 4: Update the group leaders and the stallions, respectively.

  • Step 5: Determine whether the maximum number of iterations has been reached; if so, output \(\left( {k,\alpha } \right)\), otherwise return to Step 3.

Fig. 1 Flow chart of WHO-VMD algorithm
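
To make the loop in Fig. 1 concrete, the sketch below shows how an optimizer can drive the selection of \((k,\alpha)\). It is not an implementation of WHO itself: the horse-behavior updates (grazing, mating, leader exchange) are replaced by uniform random sampling within the Step 1 bounds, and the `fitness(k, alpha)` callable is assumed to be supplied externally, for example the power-spectral-entropy fitness sketched at the end of Sect. 2.2.

```python
import numpy as np

def vmd_parameter_search(fitness, n_agents=30, max_iter=30,
                         k_bounds=(4, 8), alpha_bounds=(100, 2000), seed=0):
    """Stand-in for the WHO search over (k, alpha) shown in Fig. 1.

    `fitness(k, alpha)` is assumed to run VMD and return the power
    spectral entropy of the resulting IMFs (Sect. 2.2). The wild-horse
    position updates are replaced here by random sampling, so this only
    illustrates the evaluate-and-keep-the-best loop, not WHO itself.
    """
    rng = np.random.default_rng(seed)
    best_params, best_fit = None, np.inf
    for _ in range(max_iter):                      # Steps 3-5: iterate
        for _ in range(n_agents):                  # one "population" per iteration
            k = int(rng.integers(k_bounds[0], k_bounds[1] + 1))
            alpha = int(rng.integers(alpha_bounds[0], alpha_bounds[1] + 1))
            fit = fitness(k, alpha)
            if fit < best_fit:                     # keep the best (k, alpha) so far
                best_params, best_fit = (k, alpha), fit
    return best_params, best_fit
```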

2.2 Power spectrum entropy

The power spectral entropy [25] represents the uncertainty of signal energy under power spectral partitioning and quantitatively describes the complexity of the signal energy distribution in the frequency domain. In actual industrial environments, bearing fault signals are collected in the presence of different noise sources, so their frequency components are complex. To effectively reflect the fault characteristics contained in each frequency component, the power spectrum entropy is used as the fitness function of the optimization algorithm. When the power spectral entropy is small, the frequency components of the signal are simple and the power spectrum is concentrated on a few frequency components, which reflects the characteristics of the fault signal. In addition, the power spectral entropy values of the IMF components in the same fault state after VMD decomposition are relatively stable, whereas they vary between different fault states. This further indicates that the power spectral entropy is suitable as the fitness function of the optimization algorithm.

Step 1: Define the original fault signal sequence as \(x\left( t \right) = \left\{ {x\left( 1 \right),x\left( 2 \right),x\left( 3 \right), \ldots x\left( L \right)} \right\}\) and compute its power spectrum:

$$ P\left( i \right) = \frac{{\left| {x\left( w \right)} \right|^{2} }}{2\pi L} $$
(1)

where \(L\) is the length of the signal, \(P\left( i \right)\) is the power spectrum of the signal. \(x\left( w \right)\) is the Fourier transform of the signal.

Step 2: Obtain the power spectral density distribution function by normalization:

$$ p\left( i \right) = \frac{P\left( i \right)}{{\mathop \sum \nolimits_{i = 1}^{N} P\left( i \right)}}\quad i = 1,2, \ldots ,N $$
(2)

where \(N\) is the number of frequency components in the Fourier transform.

Step 3: Define the power spectrum entropy through the power spectral density distribution function as:

$$ H = - \mathop \sum \limits_{i = 1}^{N} p\left( i \right){\text{log}}p\left( i \right) $$
(3)
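
A minimal implementation of Eqs. (1)–(3) is given below, together with a wrapper that turns it into the fitness used by the parameter search sketched in Sect. 2.1. The wrapper assumes the third-party `vmdpy` package (with the signature `VMD(f, alpha, tau, K, DC, init, tol)`), and averaging the entropy over the IMFs is our assumption, since the exact aggregation is not restated here.

```python
import numpy as np
from vmdpy import VMD   # assumed third-party VMD implementation


def power_spectral_entropy(x):
    """Power spectral entropy of a 1-D signal, following Eqs. (1)-(3)."""
    x = np.asarray(x, dtype=float)
    L = len(x)
    P = np.abs(np.fft.rfft(x)) ** 2 / (2 * np.pi * L)   # power spectrum, Eq. (1)
    p = P / P.sum()                                      # normalised density, Eq. (2)
    p = p[p > 0]                                         # drop empty bins before the log
    return float(-np.sum(p * np.log(p)))                 # entropy, Eq. (3)


def make_vmd_fitness(signal):
    """fitness(k, alpha): mean power spectral entropy of the k IMFs."""
    def fitness(k, alpha):
        imfs, _, _ = VMD(signal, alpha, 0.0, int(k), 0, 1, 1e-7)
        return float(np.mean([power_spectral_entropy(imf) for imf in imfs]))
    return fitness
```

With these two pieces, `vmd_parameter_search(make_vmd_fitness(x))` from Sect. 2.1 returns the \((k,\alpha)\) pair with the lowest mean power spectral entropy.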

2.3 Correlation coefficient

The correlation coefficient is a description of the similarity between two random signals or deterministic signals. After the VMD decomposition of the bearing fault signal, the correlation between each IMF and the original bearing fault signal can be judged by calculating the correlation coefficient value. Then, it can be inferred from the correlation coefficient whether the IMF contains the main features of the original signal. Generally speaking, the closer the absolute value of the correlation coefficient is to 1, the higher the degree of correlation between the two, and the more obvious the features of the original signal contained in the IMF. The correlation coefficient \(R_{k}\) between the k-th IMF and the original signal is defined as:

$$ R_{k} = \frac{{E\left( {u_{k} \left( t \right)f\left( t \right)} \right) - E\left( {u_{k} \left( t \right)} \right)E\left( {f\left( t \right)} \right)}}{{\sqrt {D\left( {u_{k} \left( t \right)} \right)} \sqrt {D\left( {f\left( t \right)} \right)} }} $$
(4)

where \(f\left( t \right)\) is the original signal and \(u_{k} \left( t \right)\) is the k-th IMF. E and D denote the expectation and the variance, respectively.

2.4 Entropy features

2.4.1 RCMDE

RCMDE [26, 27] was first proposed and applied to biomedical signals in 2017. It extends dispersion entropy through refined composite multiscale coarse-graining. The specific calculation steps are as follows:

Step 1: For the sequence X of length \(N\), divide it into segments of length τ. The average value of each segment is calculated and arranged to obtain a coarse-grained sequence.

$$ x_{k,j}^{\tau } = \frac{1}{\tau }\mathop \sum \limits_{{b = k + \tau \left( {j - 1} \right)}}^{k + \tau j - 1} X_{b} ,1 \le j \le \left\lfloor\frac{N}{\tau }\right\rfloor,1 \le k \le \tau $$
(5)

where \(x_{k}^{\tau } = \left\{ {x_{k,1}^{\tau } ,x_{k,2}^{\tau } ,...} \right\}\) is the k-th coarse-grained sequence at the τ scale.

Step 2: Map the time series \(x_{k,j}^{\tau }\) to \(y_{k,j}^{\tau }\) by Eq. (6)

$$ y_{k}^{\tau } = \frac{1}{{\sigma \sqrt {2\pi } }}\mathop \int \limits_{ - \infty }^{{x_{k}^{\tau } }} e^{{\frac{{ - \left( {t - u} \right)^{2} }}{{2\sigma^{2} }}}} dt $$
(6)

where \(u\) and \(\sigma \) are the mean and standard deviation of the sequence \(x_{k}^{\tau }\), respectively.

Step 3: Map the time series \(y_{k}^{\tau }\) to \(Z_{j}^{c}\) by Eq. (7)

$$ Z_{j}^{c} = {\text{Round}}\left( {c \cdot y_{k}^{\tau } + 0.5} \right) $$
(7)

where \({\text{Round}}()\) represents the rounding function, and \(c\) represents the number of categories.

Step 4: Calculate the embedding vector by Eq. (8).

$$ z_{i}^{m,c} = \left\{ {z_{i}^{c} ,z_{i + d}^{c} , \cdots ,z_{{i + \left( {m - 1} \right)d}}^{c} } \right\} $$
$$ i = 1,2, \cdots ,N - \left( {m - 1} \right)d $$
(8)

where \(m\) is the embedding dimension and \(d\) is the time delay.

Step 5: Calculate the dispersion patterns and its corresponding probability. Assuming that \(z_{i}^{c} = v_{0}\), \(z_{i + d}^{c} = v_{1}\), and \(z_{{i + \left( {m - 1} \right)d}}^{c} = v_{m - 1}\), the dispersion pattern corresponding to \(z_{i}^{m,c}\) is \(\pi_{{v_{0} v_{1} \cdots v_{m - 1} }}\). Calculate the probability corresponding to the dispersion pattern according to Formula (9).

$$ p\left( {\pi_{{v_{0} v_{1} \cdots v_{m - 1} }} } \right) = \frac{{{\text{Number}}\left( {\pi_{{v_{0} v_{1} \cdots v_{m - 1} }} } \right)}}{{N - \left( {m - 1} \right)d}} $$
(9)

Step 6: Calculate the average value \(\overline{p}\left( {\pi_{{v_{0} v_{1} \cdots v_{m - 1} }} } \right)\) of the probability of the dispersion pattern, and obtain the RCMDE value through \(\overline{p}\left( {\pi_{{v_{0} v_{1} \cdots v_{m - 1} }} } \right)\).

$$ {\text{RCMDE }}(x_{k}^{\tau } ,m,c,d,\tau ) = - \mathop \sum \limits_{\pi = 1}^{{c^{m} }} \overline{p}\left( {\pi_{{v_{0} v_{1} \cdots v_{m - 1} }} } \right)\ln \left( {\overline{p}\left( {\pi_{{v_{0} v_{1} \cdots v_{m - 1} }} } \right)} \right) $$
(10)
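
A compact sketch of the RCMDE computation follows, assuming SciPy's `norm.cdf` for the mapping of Eq. (6). The default parameters are illustrative rather than the paper's settings, and the normal CDF here uses the mean and standard deviation of each coarse-grained series itself.

```python
import numpy as np
from itertools import product
from scipy.stats import norm


def _dispersion_probs(x, m, c, d):
    """Dispersion-pattern probabilities of one series (Eqs. 6-9)."""
    y = norm.cdf(x, loc=x.mean(), scale=x.std())            # Eq. (6)
    z = np.clip(np.round(c * y + 0.5).astype(int), 1, c)    # Eq. (7)
    n_vec = len(x) - (m - 1) * d
    counts = {}
    for i in range(n_vec):
        pattern = tuple(int(z[i + j * d]) for j in range(m))   # Eq. (8)
        counts[pattern] = counts.get(pattern, 0) + 1
    patterns = product(range(1, c + 1), repeat=m)               # all c**m patterns
    return np.array([counts.get(p, 0) / n_vec for p in patterns])  # Eq. (9)


def rcmde(x, m=2, c=6, d=1, tau=3):
    """Refined composite multiscale dispersion entropy at scale tau (Eq. 10)."""
    x = np.asarray(x, dtype=float)
    probs = []
    for k in range(tau):                                     # tau shifted coarse-grainings
        n_seg = (len(x) - k) // tau
        cg = x[k:k + n_seg * tau].reshape(n_seg, tau).mean(axis=1)  # Eq. (5)
        probs.append(_dispersion_probs(cg, m, c, d))
    p_bar = np.mean(probs, axis=0)                           # averaged pattern probabilities
    p_bar = p_bar[p_bar > 0]
    return float(-np.sum(p_bar * np.log(p_bar)))
```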

2.4.2 RCMFDE

RCMFDE is based on the work of Azami et al. [28, 29] on dispersion entropy. Fluctuation dispersion entropy improves on dispersion entropy by taking the volatility of the time series into account while maintaining stable performance and low computational cost. Like RCMDE, RCMFDE obtains the dispersion patterns through Eqs. (5)–(8). The probability corresponding to each dispersion pattern is calculated according to Eq. (11).

$$ p\left( {\pi_{{v_{0} v_{1} \cdots v_{m - 1} }} } \right) = \frac{{{\text{count}}\left\{ {i \mid i \le N - \left( {m - 1} \right)d,\;z_{i}^{m,c} \;{\text{has pattern}}\;\pi_{{v_{0} v_{1} \cdots v_{m - 1} }} } \right\}}}{{N - \left( {m - 1} \right)d}} $$
(11)

Among them, \({\text{count ()}}\) is the number of maps from \(z_{i}^{m,c}\) to \(\pi_{{v_{0} v_{1} \cdots v_{m - 1} }}\).

Calculate the average value of the dispersion pattern probabilities at scale τ, and the RCMFDE is obtained through \(\overline{p}\left( {\pi_{{v_{0} v_{1} \cdots v_{m - 1} }} } \right)\).

$$ \overline{p}\left( {\pi_{{v_{0} v_{1} \cdots v_{m - 1} }} } \right) = \frac{1}{\tau }\mathop \sum \limits_{k = 1}^{\tau } p_{k} $$
(12)
$$ E_{{{\text{RCMFD}}}} \left( {x_{k}^{\tau } ,m,c,d,\tau } \right) = - \mathop \sum \limits_{\pi = 1}^{{\left( {2c - 1} \right)^{m - 1} }} \overline{p}\left( {\pi_{{v_{0} v_{1} \cdots v_{m - 1} }} } \right) \cdot \ln \left[ {\overline{p}\left( {\pi_{{v_{0} v_{1} \cdots v_{m - 1} }} } \right)} \right] $$
(13)
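
RCMFDE only changes the patterns that are counted, so a sketch of the fluctuation-pattern probabilities of Eq. (11) is shown below; the composite averaging of Eqs. (12)–(13) follows the same scheme as the RCMDE sketch above. Parameter defaults are again illustrative.

```python
import numpy as np
from scipy.stats import norm


def fluctuation_dispersion_probs(x, m=3, c=6, d=1):
    """Probabilities of fluctuation dispersion patterns (Eq. 11).

    The patterns are the differences between successive dispersion
    classes, so there are (2c-1)**(m-1) possible patterns; only the
    observed ones are returned here.
    """
    x = np.asarray(x, dtype=float)
    y = norm.cdf(x, loc=x.mean(), scale=x.std())            # Eq. (6)
    z = np.clip(np.round(c * y + 0.5).astype(int), 1, c)    # Eq. (7)
    n_vec = len(x) - (m - 1) * d
    counts = {}
    for i in range(n_vec):
        pattern = tuple(int(z[i + (j + 1) * d] - z[i + j * d]) for j in range(m - 1))
        counts[pattern] = counts.get(pattern, 0) + 1
    return {pattern: cnt / n_vec for pattern, cnt in counts.items()}
```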

2.4.3 RCmvMFE

RCmvMFE [30] is a tool proposed in 2017 to analyze the complexity of multi-channel signals. The detailed description of RCmvMFE is as follows:

  • Step 1: For a multivariate signal \(Y = \left\{ {y_{k,b} } \right\}_{k = 1,\,b = 1}^{p,\,C}\) containing \(p\) channels of length \(C\), the coarse-grained operations are performed to obtain a time series, represented as \(z_{\alpha }^{\left( \beta \right)} = \left\{ {x_{\alpha ,k,i}^{\left( \beta \right)} } \right\}\), where \( \beta\) is the time series scale.

    $$ x_{\alpha ,k,i}^{\left( \beta \right)} = \frac{1}{\beta }\mathop \sum \limits_{{b = \left( {i - 1} \right)\beta + \alpha }}^{i\beta + \alpha - 1} y_{k,b} \quad 1 \le i \le \left \lfloor \frac{C}{\beta } \right \rfloor = N,1 \le k \le p,1 \le \alpha \le \beta $$
    (14)
  • Step 2: The multivariate embedded reconstruction is used.

    $$ X_{m} \left( i \right) = \left[ {x_{1,i} ,x_{{1,i + \tau_{1} }} , \ldots ,x_{{1,i + \left( {m_{1} - 1} \right)\tau_{1} }} ,x_{2,i} ,x_{{2,i + \tau_{2} }} , \ldots ,x_{{2,i + \left( {m_{2} - 1} \right)\tau_{2} }} , \ldots ,x_{P,i} ,x_{{P,i + \tau_{P} }} , \ldots ,x_{{P,i + \left( {m_{P} - 1} \right)\tau_{P} }} } \right] $$
    (15)

    where \(M = \left[ {m_{1} ,m_{2} ,. . .m_{p} } \right]\), \(\tau = \left[ {\tau_{1} ,\tau_{2} ,. . .\tau_{P} } \right]\) are the embedding dimension and delay time, respectively, \(n = {\text{max}}\left\{ M \right\} \times {\text{max}}\left\{ \tau \right\}\), \(i = 1,2, . . .N - n\).

  • Step 3: Calculate the distance between \(X_{m} \left( i \right)\) and \(X_{m} \left( j \right)\), where \(i \ne j\).

    $$ d\left[ {X_{m} \left( i \right),X_{m} \left( j \right)} \right] = \mathop {{\text{max}}}\limits_{l = 1,2, \ldots ,m} \left\{ {\left| {x\left( {i + l - 1} \right) - x\left( {j + l - 1} \right)} \right|} \right\} $$
    (16)
  • Step 4: According to the given threshold r and fuzzy membership function \(\theta \left( {d,r} \right)\), \(\phi^{m} \left( r \right)\) with the embedding dimension \(m\) can be obtained:

    $$ \theta \left( {d,r} \right) = \exp \left( {\frac{{ - (d)^{fp} }}{r}} \right) $$
    $$ \phi^{m} \left( r \right) = \frac{1}{{\left( {N - n} \right)}}\mathop \sum \limits_{i = 1}^{N - n} \frac{{\mathop \sum \nolimits_{j = 1,i \ne j}^{N - n} \exp \left( {\frac{{ - (d\left[ {X_{m} \left( i \right),X_{m} \left( j \right)} \right])^{fp} }}{r}} \right)}}{N - n - 1} $$
    (17)
  • Step 5: Let m = m + 1 and repeat steps 2–4. Calculate the average values \(\overline{\phi }_{\beta ,\alpha }^{m}\) and \(\overline{\phi }_{\beta ,\alpha }^{m + 1}\) of Eq. (17). Then RCmvMFE can be calculated by Eq. (18)

    $$ {\text{RCmvMFE}}\left( {Y, \beta ,M,n,r} \right) = - {\text{ln}}\left( {\frac{{\overline{\phi }_{\beta ,\alpha }^{m + 1} }}{{\overline{\phi }_{\beta ,\alpha }^{m} }}} \right) $$
    (18)

2.4.4 RCmvMSE

RCmvMSE [31] differs from RCmvMFE only in the way the probability is calculated when the embedding dimension is \(m\).

$$ B_{i}^{m} \left( r \right) = (N - n - 1)^{ - 1} P_{i} $$
(19)
$$ B^{m} \left( r \right) = (N - n)^{ - 1} \mathop \sum \limits_{i = 1}^{N - n} B_{i}^{m} \left( r \right) $$
(20)

In Eq. (19), \(P_{i}\) denotes the number of vectors \(X_{m} \left( j \right)\) (\(j \ne i\)) whose distance from \(X_{m} \left( i \right)\) does not exceed r. Let m = m + 1 and repeat the above steps to obtain \(B^{m + 1} \left( r \right)\). Calculate the mean values \(\overline{B}_{\beta ,\alpha }^{m}\) and \(\overline{B}_{\beta ,\alpha }^{m + 1}\) in the \(m\) and \(m + 1\) dimensions. RCmvMSE can then be calculated by Eq. (21)

$$ {\text{RCmvMSE}}\left( {Y, \beta ,M,n,r} \right) = - {\text{ln}}\left( {\frac{{\overline{B}_{\beta ,\alpha }^{m + 1} }}{{\overline{B}_{\beta ,\alpha }^{m} }}} \right) $$
(21)

2.4.5 MPE

To better study and analyze the dynamic characteristics of EEG signals, Ouyang et al. [32] proposed multiscale permutation entropy (MPE) based on permutation entropy.

Step 1: A new time series is obtained by coarse-graining an original sequence Y of length N, where \(\tau\) is the scale factor.

$$ y_{j}^{\left( \tau \right)} = \frac{1}{\tau }\mathop \sum \limits_{{i = \left( {j - 1} \right)\tau + 1}}^{j\tau } x_{i} ,1 \le j \le \left \lfloor\frac{N}{\tau } \right \rfloor$$
(22)

Step 2: The phase space reconstruction is applied with \( y^{\left( \tau \right)}\) to obtain the time series \(X_{i}\).

$$ X_{i} = \left( {y_{i} ,y_{i + \lambda } , \ldots ,y_{{i + \left( {m - 1} \right)\lambda }} } \right) $$
(23)

where \(m\) is the embedding dimension and \(\lambda\) is the delay time.

Step 3: Each \(X_{i}\) is sorted in ascending order to generate a sequence of position indices. There are \(m!\) possible permutations, and the probability of each permutation is calculated according to Eq. (24).

$$ P\left( \omega \right) = \frac{T\left( \omega \right)}{{N - \left( {m - 1} \right)\lambda }} $$
(24)

where \(T\left( \omega \right)\) is the number of occurrences of permutation \(\omega\), \(1 \le \omega \le m!\)

Step 4: Define the multiscale permutation entropy by Eq. (25).

$$ H_{PE} = - \sum P\left( \omega \right)\ln P\left( \omega \right) $$
$$ H_{MPE} = [H_{P1} ,H_{P2} ...H_{P\tau } ] $$
(25)
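
A short implementation of MPE under these definitions is given below; `m`, `lam`, and the number of scales are illustrative defaults, not the paper's settings.

```python
import numpy as np


def permutation_entropy(x, m=3, lam=1):
    """Permutation entropy of one series (Eqs. 23-25, single scale)."""
    x = np.asarray(x, dtype=float)
    n_vec = len(x) - (m - 1) * lam
    counts = {}
    for i in range(n_vec):
        pattern = tuple(np.argsort(x[i:i + (m - 1) * lam + 1:lam]))  # ordinal pattern of X_i
        counts[pattern] = counts.get(pattern, 0) + 1
    p = np.array(list(counts.values())) / n_vec                      # Eq. (24)
    return float(-np.sum(p * np.log(p)))                             # Eq. (25)


def mpe(x, m=3, lam=1, max_scale=5):
    """Multiscale permutation entropy: one PE value per scale factor."""
    x = np.asarray(x, dtype=float)
    values = []
    for tau in range(1, max_scale + 1):
        n_seg = len(x) // tau
        cg = x[:n_seg * tau].reshape(n_seg, tau).mean(axis=1)        # Eq. (22)
        values.append(permutation_entropy(cg, m, lam))
    return np.array(values)
```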

2.5 Deviation

In order to measure the stability of the model, a new deviation indicator is defined. Suppose that for experiment A, the result of the i-th repeated experiment is \(A_{i}\), i = 1, 2, 3, …, N.

$$ {\text{Deviation}}_{A} = {\text{max}}\left[ {A_{i} } \right] - {\text{min}}\left[ {A_{i} } \right] $$
(26)

where \({\text{max}}\left[ {A_{i} } \right]\) is to find the maximum value of \( A_{i}\), \({\text{min}}\left[ {A_{i} } \right]\) is to find the minimum value of \(A_{i}\).
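
A one-line realization of Eq. (26), applied to a hypothetical set of repeated-trial accuracies:

```python
import numpy as np

def deviation(results):
    """Deviation of Eq. (26): spread of the repeated-experiment results."""
    results = np.asarray(results, dtype=float)
    return float(results.max() - results.min())

# ten hypothetical repeated accuracies (%) of one experiment
print(deviation([99.5, 99.7, 99.6, 99.4, 99.8, 99.5, 99.6, 99.7, 99.5, 99.6]))  # approx. 0.4
```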

3 The WHO–VMD–CCWT–EFF

The framework of the WHO–VMD–CCWT–EFF model is shown in Fig. 2. The method comprises three parts: fault signal denoising, feature extraction and fusion, and feature classification.

Fig. 2 Flow chart of the WHO-VMD-CCWT-EFF algorithm

Denoising: First, the VMD optimized by the WHO algorithm decomposes the bearing fault signals into IMFs. Secondly, the correlation coefficients between each IMF and the original bearing fault signal are calculated. Then, the IMFs whose correlation coefficients with the original bearing signal are greater than 0.2 are selected. Finally, the correlation coefficients are used as weights to multiply the corresponding IMFs and reconstruct the fault signal.
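
The denoising step can be sketched as follows, assuming the IMFs have already been obtained from the WHO-optimized VMD; any renormalization of the weights is not specified, so the raw correlation coefficients are used directly here.

```python
import numpy as np


def ccwt_denoise(imfs, signal, threshold=0.2):
    """Correlation coefficient weight threshold (CCWT) reconstruction.

    IMFs whose correlation coefficient with the raw signal is below the
    threshold are dropped; the remaining IMFs are multiplied by their
    coefficients (Eq. 4) and summed to rebuild the denoised signal.
    """
    imfs = np.asarray(imfs, dtype=float)                 # shape (k, signal_length)
    r = np.array([np.corrcoef(imf, signal)[0, 1] for imf in imfs])
    keep = np.abs(r) > threshold
    return np.sum(r[keep, None] * imfs[keep], axis=0)
```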

Feature extraction: RCMDE, RCMFDE, RCmvMFE, RCmvMSE, and MPE are extracted from the denoised fault signal, and the five entropy features are then fused.

Classification: The feature samples of the fault signals are divided into training and test sets according to the experimental requirements, and the faults are then classified by the Fisher classifier.
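
A minimal end-of-pipeline sketch is shown below. Scikit-learn's linear discriminant analysis stands in for the Fisher classifier, the fused entropy features are replaced by random placeholders, and the 1:9 split mirrors the small-sample setting used in Sect. 4; none of these stand-ins come from the paper itself.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder fused-entropy feature matrix: 10 fault classes, 100 samples each.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 25))
y = np.repeat(np.arange(10), 100)

# 10 training and 90 test samples per class (a 1:9 ratio).
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.1, stratify=y, random_state=0)

clf = LinearDiscriminantAnalysis()      # stand-in for the Fisher classifier
clf.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```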

4 Experimental analyses

In order to verify the effectiveness of the method proposed in this paper, two classical public datasets are used in the experiments.

4.1 Analysis of bearing fault signal

4.1.1 WHO–VMD

To address the issue that the VMD decomposition is greatly affected by the parameters \(\left( {k,\alpha } \right)\), WHO is used to optimize them. Figure 3 shows the convergence curves of some artificial damage and real damage signals from the Paderborn dataset. It can be seen that the number of iterations required to achieve convergence differs among the fault signals. Therefore, to accommodate as many fault signals as possible, 30 is chosen as the number of iterations. Similarly, the number of search agents is set to 30. From Fig. 4, it can be seen that the convergence of the fault signal is better when the number of search agents is 30.

Fig. 3 Convergence curve of partial fault signal under the Paderborn dataset

Fig. 4 Convergence curves of IR with Label = 2 in the Paderborn University dataset under different search agents

Taking the 0.3556 mm outer race bearing fault at 0HP as an example, the fitness curve of the WHO-VMD is shown in Fig. 5. As can be seen from Fig. 5, the value of the fitness function at the first iteration is \(7.2145 \times 10^{ - 4}\). The fitness value then decreases slightly to \(7.0273 \times 10^{ - 4}\) and levels off after two iterations. It is reduced to \(5.66 \times 10^{ - 4}\) after the fifth iteration and reaches its minimum after the 14th iteration. The results show that WHO has a fast convergence rate in the process of VMD parameter optimization, which proves that WHO is suitable for optimizing the parameters of VMD. The parameters obtained by WHO-VMD in this experiment are \(k = 5,\,\alpha = 1106\).

Fig. 5 The convergence curve of the WHO

To verify the superiority of the WHO algorithm in optimizing the VMD parameters, the particle swarm optimization algorithm (PSO) [33], the whale optimization algorithm (WOA) [34], and the moth-flame optimization algorithm (MFO) [35] are also used to optimize the VMD parameters. The population size of each optimization algorithm is set to 30 and the maximum number of iterations to 30, yielding the fitness function convergence curves shown in Fig. 6. As can be seen from Fig. 6, the convergence curve of PSO is unstable and fluctuates abruptly. The WOA and MFO converge after the 2nd and 6th iterations, respectively, with a power spectrum entropy value of \(7.0273 \times 10^{ - 4}\); they find their optima quickly and converge within a few iterations. In contrast, WHO reaches the termination fitness value of WOA and MFO after the 2nd iteration and then continues to iterate down to \(5.646 \times 10^{ - 4}\). This proves the superiority of WHO in optimizing the VMD parameters.

Fig. 6 VMD optimized by four different optimization algorithms

4.1.2 CCWT

After the VMD parameters are determined, the fault signal is decomposed into IMFs and denoised through CCWT. The correlation coefficients of the IR and OR signals in the Paderborn real damage D2 dataset are shown in Fig. 7. When 0.3 is chosen as the denoising threshold for CCWT, half of the IMFs are filtered out, which may lead to excessive denoising and loss of otherwise useful information. When 0.1 is chosen as the denoising threshold, no IMFs are filtered out and the expected denoising effect cannot be achieved. In this paper, 0.2 is chosen as the denoising threshold of CCWT. The IMFs with correlation coefficients less than 0.2 are filtered out first, and the correlation coefficients of the remaining IMFs are then used as weighting coefficients to reconstruct the original signal.

Fig. 7 The correlation coefficient about IR and OR in the Paderborn real damage D2 dataset. a OR (Label = 3), b IR (Label = 6)

Take the OR (Label = 3) of the Paderborn real damage D4 dataset as an example. For easy observation, 500 sample points are selected to compare the differences before and after denoising, as shown in Fig. 8. When the correlation coefficient threshold (CCT) is used, the IMFs whose correlation coefficient value is less than 0.2 are removed and the remaining components are reconstructed. Building on this, the CCWT applies the correlation coefficient values greater than 0.2 to the corresponding IMF components as weights. As can be seen from Fig. 8, compared with the signal denoised by CCT, the signal denoised by CCWT is smoother and has fewer burrs. Therefore, we consider the denoising effect of CCWT to be better.
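
For reference, the plain CCT reconstruction differs from the CCWT sketch in Sect. 3 only in dropping the weighting, which is what the comparison in Fig. 8 isolates:

```python
import numpy as np


def cct_denoise(imfs, signal, threshold=0.2):
    """Correlation coefficient threshold (CCT): unweighted sum of the kept IMFs."""
    imfs = np.asarray(imfs, dtype=float)
    r = np.array([np.corrcoef(imf, signal)[0, 1] for imf in imfs])
    return imfs[np.abs(r) > threshold].sum(axis=0)
```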

Fig. 8 Comparison of CCT and CCWT a original signal, b CCT denoising, c CCWT denoising

In addition, although the threshold is set to 0.2 considering the characteristics of most fault signals, the two datasets contain a large number of fault signals, and some signals have correlation coefficient values that are all greater than 0.2. Taking the IR (Label = 8) of the Paderborn real damage D1 dataset as an example, the denoising effects of CCT and CCWT in such situations are explored. As can be seen from Fig. 9, the signal after CCT denoising is almost indistinguishable from the original signal because all the correlation coefficient values are greater than 0.2. CCWT not only effectively avoids this defect but also achieves a good denoising effect, because the weighting operation on the IMFs enhances the useful signal and weakens the noisy signal. The larger the correlation coefficient between an IMF and the original signal during CCWT denoising, the more useful information the IMF contains; conversely, the smaller the correlation coefficient, the more the IMF is considered to contain noise. Using the correlation coefficient values as weights is therefore equivalent to amplifying the IMFs that are considered to contain useful information and shrinking those that contain noise, so the CCWT method is considered to both enhance the useful signal and weaken the noise.

Fig. 9 Comparison of CCT and CCWT, a original signal, b CCT denoising, c CCWT denoising

4.2 The CWRU dataset

4.2.1 The CWRU dataset description

The experimental data in this section come from the rolling bearing test stand shown in Fig. 10. The 6205-2RS JEM SKF deep groove ball bearing is used as the test bearing, the data are collected under four loads of 0HP, 1HP, 2HP, and 3HP, and the sampling frequency is 12 kHz. Three damage faults made by electro-discharge machining (EDM), namely the inner race fault, outer race fault, and ball fault, are included in the experiment. Each fault includes three degrees of damage with diameters of 0.1778 mm, 0.3556 mm, and 0.5334 mm, as shown in Table 1. In the experiment, 100 samples are intercepted for each fault signal without overlap.
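
The non-overlapping interception can be done as below; the segment length is an assumption for illustration, since the paper states only the number of samples per fault signal.

```python
import numpy as np


def segment_signal(record, n_segments=100, seg_len=1024):
    """Cut non-overlapping segments from one fault record (seg_len is illustrative)."""
    record = np.asarray(record, dtype=float)
    assert len(record) >= n_segments * seg_len, "record too short for this split"
    return record[:n_segments * seg_len].reshape(n_segments, seg_len)
```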

Fig. 10 The CWRU bearing data center bearing test stand

Table 1 The CWRU dataset description

4.2.2 Experimental analysis of single working condition bearing fault diagnosis

After feature extraction of the bearing fault signal, the Fisher classifier is used to classify the fused features. The number of samples for each class of fault signal is set to 100, and the training and test sets are then divided according to different proportions. Each result is the average of ten repetitions. In order to verify the performance of the selected classifier, three classifiers are compared with different ratios of training samples to test samples. As shown in Fig. 11, the accuracy of all three classifiers under the four working conditions shows an increasing trend as the ratio of training samples to test samples grows. The best decision tree performance is achieved with the 2HP data, while the SVM has higher accuracy under 1HP. For the Fisher classifier, except for the slightly lower performance of 0HP when the ratio of training samples to test samples is 1:9, the accuracy under the other working conditions exceeds 99%. When the ratio of training samples to test samples is 1:9, the Fisher classifier improves accuracy by 4.5–5.4% over the decision tree classifier and by 3.2–4.63% over the SVM classifier. It is clear that the Fisher classifier still performs well when the number of training samples is small, which is the desired behavior.

Fig. 11 Diagnostic accuracy of three classifiers under four working conditions

For a bearing fault diagnosis method, superior performance is the primary requirement, but the stability of the model is also crucial. To further illustrate the stability of the WHO-VMD-CCWT-EFF, the results of 10 experiments for the four working conditions are recorded as shown in Fig. 12. The difference between the maximum and minimum values in the 10 experiments is used as the deviation to measure the stability of the WHO-VMD-CCWT-EFF. It can be seen from Fig. 12 that when the ratio of training samples to test samples is 1:9, the deviation under 0HP is 2.23%, which is the largest among the four working conditions. At a training sample to test sample ratio of 2:8, the deviations of 0HP and 1HP are 0.25%, and the deviations of 2HP and 3HP are 0. This indicates that the 0HP data with few training samples are slightly less stable than the data under the other working conditions, but when the proportion of training samples is slightly larger, while still small, the 0HP data also show very good performance.

Fig. 12 The 10-time fault diagnosis accuracy of the Fisher classifier under different training and testing ratios. a 0HP, b 1HP, c 2HP, d 3HP

Table 2 presents the experimental results of the four working conditions under different ratios of training samples to test samples. It can be seen that under all four working conditions, the Fisher classifier not only has higher accuracy but also maintains the smallest deviation, which further verifies the effectiveness and stability of the WHO-VMD-CCWT-EFF. At the same time, the accuracy of the other two classifiers reaches over 99% when the ratio of training samples to test samples is 9:1 and remains around 95% when the ratio is 1:9. This indicates that the denoising and feature extraction are effective; the feature extraction will be analyzed in detail later.

Table 2 Experimental results of four working conditions under different ratios of training samples and test samples (%)

4.2.3 Experimental analysis of bearing fault diagnosis under multiple working conditions

Since the actual industrial environment is complex and changeable, it is impossible to ensure that the collected data always come from the same working condition. Therefore, a variety of bearing fault diagnosis experiments under multiple working conditions are carried out and analyzed. The CWRU dataset includes four working conditions, which means that six mixed experiments with two working conditions and four mixed experiments with three working conditions are included. As in the single working condition experiments, the training set and the test set are divided according to different proportions. In addition, each experiment is repeated ten times and the average value is taken as the final result. The experimental data for the two working conditions and the three working conditions are described in Tables 3 and 4, respectively (with 0HP + 1HP and 0HP + 1HP + 2HP as examples).

Table 3 Dataset descriptions with different ratios of training data to test data under two working conditions
Table 4 Dataset descriptions with different ratios of training data to test data under three working conditions

Figure 13 shows the experimental results under multiple working conditions. Compared with Fig. 11, it can be seen that the WHO-VMD-CCWT-EFF performs better in the multiple working condition experiments. To explain this phenomenon, the experiments shown in Fig. 14 are performed. It is not difficult to see that the accuracy rates of the three classifiers under multiple working conditions are almost always higher than those under a single working condition. Therefore, we conclude that the phenomenon is not related to the classifier; it may arise because the extracted features of the same fault under different working conditions are relatively similar, or because the number of experimental samples increases under multiple working conditions.

Fig. 13 Experimental results of different ratios of training samples to test samples under multiple working conditions

Fig. 14 Experiment comparison between a single working condition and multiple working conditions under three classifiers

To verify whether this phenomenon is related to the increase in the number of samples, experiments are carried out with the combination 0HP + 1HP as an example. Of course, the number of training samples and test samples for the selected combinations is the same as for the experiments under a single working condition. Table 5 presents the experimental data in detail.

Table 5 Description of the number of samples under two working conditions that are consistent with the number of samples under a single operating condition (0HP + 1HP as an example)

Figure 15 shows the experimental results: the difference in the accuracy of the Fisher classifier between the two cases is 0–0.41%, and the differences for the decision tree and SVM are 0–2.9%. It is not difficult to conclude that, for some classifiers, an increase in sample size has a certain impact on accuracy, but the proposed model almost overcomes this shortcoming. For further validation, the experiments shown in Fig. 16 are performed. The numbers of samples for the two working conditions and three working conditions in the figure are given in Tables 3 and 4. In Fig. 16, the ratio of training samples to test samples for each experiment is 1:9. From the figure, the accuracy of some three working condition experiments is higher than that of the two working condition experiments, while the rest are lower, and the accuracy of the four working condition experiment is the lowest. This indicates that, for the proposed model, an increase in the number of samples does not necessarily lead to an improvement in the fault diagnosis rate.

Fig. 15 Effect of using the same number of data samples and the same proportion of data samples on bearing fault diagnosis a the same sample size as the single working condition experiment b the same sample proportion as the single working condition experiment

Fig. 16 Effect of the phenomenon of increasing sample size due to increased working conditions on bearing fault diagnosis a 0H + 1H and its extended experiments b 0H + 2H and its extended experiments c 0H + 3H and its extended experiments d 1H + 2H and its extended experiments e 1H + 3H and its extended experiments f 2H + 3H and its extended experiments

4.3 Paderborn university dataset

4.3.1 Dataset description

The Paderborn dataset was published by Christian Lessmeier et al. of the KAt-DataCenter, and the 6205 deep groove ball bearing is used as the test bearing for the collection of the artificial damage dataset and the real damage dataset. The rolling bearing test stand is shown in Fig. 17. As with the CWRU data, the Paderborn dataset is collected under four working conditions, as shown in Table 6. As shown in Tables 7 and 8, the artificial damage dataset contains a total of 8 faults, while the real damage dataset contains 9 faults. The bearing damage locations in both datasets are the inner ring (IR) and the outer ring (OR). For each type of fault signal, 100 samples are intercepted without overlap for the experiment.

Fig. 17 Rolling bearing test stand

Table 6 Four working conditions of bearing experimental data
Table 7 The artificial damage dataset description
Table 8 The real damage dataset description

4.3.2 Experimental analysis of bearing fault diagnosis in single working condition

Whether bearing fault types can be diagnosed efficiently and accurately depends largely on the feature extraction, and the classification effect is better when the extracted fault features are more distinct. To confirm the advantages of the feature extraction method in this paper, the features of the artificial damage and the real damage under the four working conditions are visualized.

Figure 18a–d and e–h show the feature visualization plots for the four single working conditions under the artificial damage and real damage data, respectively. As shown in Fig. 18, D3 performs best in the feature visualization of the artificial damage, which means that the feature extraction for D3 is more successful. In contrast, D1 performs relatively poorly, which explains the lower accuracy of D1 compared with the other data in Table 9. The same situation occurs in the real damage experiment. This suggests that the relatively low accuracy of D1 compared with the other data may be due to the acquisition conditions.

Fig. 18 Feature visualization a artificial damage D1, b artificial damage D2, c artificial damage D3, d artificial damage D4, e real damage D1, f real damage D2, g real damage D3, h real damage D4

Table 9 Experimental results of artificial damage

The experimental results of the artificial damage and real damage are given in Tables 9 and 10, respectively. The accuracy is relatively low and the deviation is large when the ratio of training samples to test samples is low. As the ratio increases, all the data reach 100% except the real damage D1, which reaches 99.77%. This shows that increasing the ratio of training samples to test samples improves the experimental results. In addition, the real damage D2 and D3 reach 100% when the ratio of training samples to test samples is 2:8, and D4 reaches 100% at 3:7. This indicates that even with only a small number of samples, the WHO-VMD-CCWT-EFF can identify the type of bearing fault.

Table 10 Real damage experimental results

4.3.3 Experimental analysis of bearing fault diagnosis under multiple working conditions

As with the CWRU data, the experiments on artificial damage and real damage under multiple working conditions are also carried out after completing the experiments of single working conditions. Here the experimental data for the artificial damage dataset are analyzed specifically, and the same is true for the real damage dataset. The experimental data for the two working conditions and the three working conditions are described in detail in Tables 11 and 12, respectively.

Table 11 Dataset descriptions with different ratios of training data to test data under two working conditions (D1 + D2 under artificial damage dataset as an example)
Table 12 Dataset descriptions with different ratios of training data to test data under three working conditions (D1 + D2 + D3 under artificial damage dataset as an example)

Tables 13 and 14 present the results of the experiments on artificial damage and real damage under multiple working conditions, respectively. For both artificial damage and real damage, the accuracy reaches more than 99% when the ratio of training samples to test samples is 1:9. Compared with a single working condition, the accuracy under multiple working conditions is higher, which is consistent with the conclusions obtained from the CWRU data. In addition, when the ratio of training samples to test samples is 1:9, the deviation is relatively large, but it improves at 2:8 and then fluctuates as the ratio increases. This reminds us that the selection of the training to test sample ratio is very important when experimenting with multiple working conditions.

Table 13 Experimental results of artificial damage under multiple working conditions
Table 14 Real damage experimental results under multiple working conditions

To verify the effectiveness of the entropy fusion method proposed in this paper, experiments with single entropies and fused entropies are carried out. Nine entropies are selected for the experiments in Tables 15 and 16: RCMDE, RCMFDE, RCmvMFE, RCmvMSE, MPE, multiscale dispersion entropy (MDE), multiscale weighted permutation entropy (MWPE), multivariate fuzzy entropy (MVFE), and multivariate sample entropy (MVSE). Among them, MVFE and MVSE are single-scale entropies, and the remaining seven are multiscale entropies. Taking the multiple working condition experiments as an example, the performance of the multiscale entropies on both artificial damage and real damage far exceeds that of the single-scale entropies. In order to improve the accuracy and reduce the deviation of the model, an entropy fusion method is proposed.

Table 15 Accuracy of single entropy of artificial damage
Table 16 Accuracy of single entropy of real damage

For the fusion of entropy features, the main goals are to improve the accuracy and minimize the deviation. According to Tables 15 and 16, the better-performing entropy features are fused in turn to obtain the experimental results of the artificial damage fusion entropy features in Table 17 and the real damage fusion entropy features in Table 18. Comparing Table 17 with Table 18, we can see that the WHO-VMD-CCWT-EFF is more applicable to the real damage dataset, which is the direction of our efforts. In the entropy fusion, the experimental performance of D1, D2, and D4 is not the best, but the difference from the best-performing result is slight.
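
The incremental fusion can be evaluated with a loop of the following form; the feature blocks are random placeholders standing in for the extracted entropy features, and cross-validated LDA accuracy replaces the repeated train/test splits used in Tables 17 and 18.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
y = np.repeat(np.arange(9), 100)                        # 9 real-damage fault classes
feature_blocks = {name: rng.standard_normal((900, 5))   # placeholder entropy features
                  for name in ["RCMDE", "RCMFDE", "RCmvMFE", "RCmvMSE", "MPE"]}

names = list(feature_blocks)
for k in range(1, len(names) + 1):                      # fuse the entropies one by one
    X = np.hstack([feature_blocks[n] for n in names[:k]])
    acc = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean()
    print(f"{k} entropies fused ({', '.join(names[:k])}): {acc:.3f}")
```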

Table 17 Accuracy of artificial damage fusion entropy
Table 18 Accuracy of real damage fusion entropy

To make the model more convincing, the results of the three working condition experiments of D1, D2, and D3 with different ratios of training samples to test samples are analyzed. The trend plots of the entropy fusion experimental results for artificial and real damage under D1, D2, and D3 are given in Figs. 19 and 20, respectively. As the number of fused entropy features increases, the artificial damage accuracy increases and the deviation decreases. For the real damage, the performance of the five-entropy fusion is not the best when the ratio of training samples to test samples is high. This shows that for the fusion of entropy features, more is not always better; the category and number of entropies should be selected reasonably according to the experiment. In this paper, five entropy features are fused in the model.

Fig. 19 Experimental results of entropy fusion of artificial damage D1 + D2 + D3

Fig. 20 Experimental results of entropy fusion of real damage D1 + D2 + D3

4.4 Model comparison

As shown in Table 19, in order to verify the performance of the proposed method, existing models from recent years are selected for comparison. The CWRU dataset, the most widely used benchmark in bearing fault diagnosis, is used for the model comparison. The comparison results indicate that the proposed model identifies bearing faults with higher accuracy when the training and testing data are the same as in the other literature, and that it uses less data while maintaining the same level of accuracy.

Table 19 Comparison of the proposed model with other models in the literature

5 Conclusions

A bearing fault diagnosis method, WHO-VMD-CCWT-EFF, based on signal denoising and feature fusion is proposed to address the low accuracy of traditional methods and the need for large numbers of data samples in deep learning methods. This paper focuses on two aspects: bearing fault signal denoising and feature extraction. In order to verify the effectiveness and stability of the model, the CWRU dataset and Paderborn dataset are used for various single and multiple working condition experiments. The experimental results show that WHO-VMD-CCWT-EFF exhibits superior performance under both single and multiple working conditions when the training data of both datasets are small. The following experimental results confirm this conclusion.

  1. The WHO-VMD-CCWT-EFF model can accurately identify the fault status of bearings. This is proved by the fact that the experiments on the CWRU dataset and the Paderborn dataset (12 single working conditions and 30 multiple working conditions) achieve over 99% accuracy.

  2. In the Paderborn dataset, when the ratio of training samples to test samples is 1:9, the difference between real and artificial damage is 0.02%–1.44%. This indicates that even under small sample experimental conditions, the model has good stability and generalization ability.

  3. The fused entropy feature vector is an effective method for extracting bearing fault features. In the experiments on the CWRU dataset, in addition to the Fisher classifier achieving an accuracy of over 98% with small samples, the accuracies of the decision tree and SVM classifiers with small samples also reach over 93.5%, which proves this point.

  4. Compared with the Paderborn dataset, the CWRU dataset performs better in the experiments. This indicates that differences in data from different devices can affect the performance of the model. Therefore, in the future, we will focus on researching cross-equipment bearing fault diagnosis.