1 Introduction

At present, the gearbox is an important component of much mechanical equipment and is widely used in automotive transmissions, heavy machinery, shipbuilding and other equipment. As a key component of mechanical equipment, if a fault is not found in time, the consequences are not only mechanical damage but possibly serious casualties and major economic losses, so timely monitoring and fault diagnosis of the gearbox are of great significance.

The original signal of the gearbox is non-stationary, because the collected signal mixes the steady running of the gearbox with components that change suddenly over time. Effective feature information is the key to fault diagnosis. Gearbox fault diagnosis has evolved over a long period from early manual detection through computer-aided detection to today's intelligent detection. Many artificial intelligence detection methods now exist; the most widely used are the support vector machine (SVM) and neural networks. Researchers at home and abroad have proposed artificial neural networks, support vector machines and other effective identification methods for gearbox fault diagnosis. Recently, Bruna and Mallat (2013) used wavelets to carry out a quantitative analysis of deep network structures, an important exploration in this direction. Wang et al. (2016) discussed the characteristics of convolutional neural networks for image processing, outlined two typical saliency detection algorithms that use convolutional neural networks, and analyzed the future development of deep learning. Kankar et al. (2011) used a combination of multiple logistic regression and the wavelet packet transform (WPT) to diagnose faults. Huang et al. (2011) proposed a multi-fault diagnosis method based on an improved SVM, which used a radial basis kernel function and the empirical mode decomposition method. Wang et al. (2015) used a neural network to achieve quantitative, accurate identification of rotor crack faults. Jiang and Yuan (2014) improved the traditional wavelet neural network model so that it could effectively diagnose gearbox faults. Deep belief networks (DBN) were proposed by Hinton et al. (2006) and are constructed on the basis of the restricted Boltzmann machine (RBM) (Roux and Bengio 2008).
It has been proved theoretically that an RBM can fit any discretely distributed signal as long as there are enough network units. In this algorithm, a number of RBMs are stacked to form a DBN. In the DBN, each upper-level network extracts higher-order features from the lower-level network, which makes it an unsupervised deep learning algorithm. The training algorithm for the RBM is the contrastive divergence (CD) algorithm (Zhang et al. 2015). Deep models obtained by training on large-scale data (de Mendívil 2018; Yang et al. 2015) have also been studied for better fault diagnosis. The above methods can diagnose faults effectively, but each has its own disadvantages in multi-classification or precision.

Signals collected by different means, or affected by various other factors, may be mixed with noise. Feature value decomposition can extract the active components of the signals well and can effectively filter out part of the noise in the original signals. Therefore, this paper proposes a feature extraction method by feature value decomposition (FVD) based on the characteristics of the signals, which also plays a role in dimension reduction. The features extracted by FVD are then input into a deep learning network (DLN) as the deep learning signals. Finally, fault classification and identification are achieved by this intelligent diagnostic method, which is called the FVD-DLN method. Experiments show that this method is effective, accurate and fast for fault diagnosis of the gearbox of an automobile transmission. To display the idea of this paper more clearly, the flow frame of the study is shown in Fig. 1.

Fig. 1

The flow frame of the idea

The novelty of the presented work includes three main aspects. First, a feature extraction method based on the singular value decomposition of the signal is proposed. Second, a learning method of grading, layering and classification based on the characteristics of the signal is given. Third, the neural network structure for the above deep learning method is designed, namely the connection, depth, calculators, learning training of the network and other elements.

2 FVD-DLN method proposed

2.1 Feature extraction algorithm

The first novelty of this paper is as follows. In previous literature (Jiang and Zhang 2018; Wu et al. 2015, 2016; Li et al. 2018; Cho and Yu 2018), feature extraction was achieved based on the relationship between signals or on other features of the signal, such as variance, error and amplitude. In contrast, this paper proposes a feature extraction method based on the singular value decomposition of the signal itself, in which the decomposition of each parameter or of the coefficient of each parameter is implemented. This feature extraction method achieves signal denoising as well as dimensional reduction of large data.

1. Feature value decomposition

In this section, the Hankel matrix is used, so it is briefly introduced as follows:

A Hankel matrix is an \( m \times n \) matrix that is constant along its anti-diagonals, so the total number of distinct elements of the matrix is \( m + n - 1 \). The two elements on the line connecting the second element of the first row with the second element of the first column are equal, and the elements on any line parallel to this one are likewise equal; this is shown intuitively in Fig. 2, where squares of the same color represent the same number. For example, a 4 × 5 Hankel matrix is given as follows:

$$ D = \left( {\begin{array}{*{20}l} 1 \hfill & 2 \hfill & 3 \hfill & 4 \hfill & 5 \hfill \\ 2 \hfill & 3 \hfill & 4 \hfill & 5 \hfill & 6 \hfill \\ 3 \hfill & 4 \hfill & 5 \hfill & 6 \hfill & 7 \hfill \\ 4 \hfill & 5 \hfill & 6 \hfill & 7 \hfill & 8 \hfill \\ \end{array} } \right) $$
(1)

The general form of the Hankel matrix is shown in formula (2).

Fig. 2

A 5 × 5 Hankel matrix
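As a minimal sketch (the function name is ours, not from the paper), the Hankel construction and the 4 × 5 example of formula (1) can be reproduced in NumPy:

```python
import numpy as np

def hankel_matrix(y, m, n):
    """Build the m x n Hankel matrix of a signal y: element (i, j) = y[i + j].

    The signal must supply k = m + n - 1 samples.
    """
    y = np.asarray(y)
    assert len(y) == m + n - 1, "a Hankel matrix needs k = m + n - 1 samples"
    # Index trick: row offsets plus column offsets give constant anti-diagonals
    idx = np.arange(m)[:, None] + np.arange(n)[None, :]
    return y[idx]

# Reproduce the 4 x 5 example of formula (1), built from the samples 1..8
D = hankel_matrix(np.arange(1, 9), m=4, n=5)
```

Each anti-diagonal of `D` carries a single sample value, which is exactly the structure sketched in Fig. 2.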

Assume the sampling sequence of the sample data for the running gearbox is \( y = \left\{ {x_{1} ,x_{2} , \ldots ,x_{k} } \right\} \), where the length of the sequence y is k and \( x_{i} (i = 1,2, \ldots ,k) \) is the sampled signal data. An \( m \times n \) matrix D can be constructed from y by arranging the discrete signal y in the special Hankel structure. The constructed matrix D is given as follows:

$$ D = \left( {\begin{array}{*{20}c} {x_{1} } & {x_{2} } & \cdots & {x_{n} } \\ {x_{2} } & {x_{3} } & \cdots & {x_{n + 1} } \\ \vdots & \vdots & {} & \vdots \\ {x_{m} } & {x_{m + 1} } & \cdots & {x_{k} } \\ \end{array} } \right) $$
(2)

where D is a constructed matrix of dimension \( m \times n \) and \( x_{i} \left( {i = 1,2, \ldots ,k} \right) \) is the input signal. According to the characteristics of the Hankel matrix, \( k = m + n - 1 \). The feature value decomposition of D is implemented as follows:

$$ D = USQ^{T} $$
(3)

where \( U,Q \) are orthogonal matrices and S is a diagonal matrix. For any real matrix D there always exist two orthogonal matrices \( U,Q \) that turn D into the diagonal form of formula (3). Their expressions are as follows:

$$ \begin{aligned} U & = \left[ {u_{1} ,u_{2} , \ldots ,u_{m} } \right] \\ Q & = \left[ {q_{1} ,q_{2} , \ldots ,q_{n} } \right] \\ S & = \left[ {{\text{diag}}\left[ {\lambda_{1} ,\lambda_{2} , \ldots ,\lambda_{r} } \right],0} \right],\quad r = \hbox{min} \left\{ {m,n} \right\} \\ \end{aligned} $$
(4)

where U is an \( m \times m \) matrix; Q is an \( n \times n \) matrix; and \( \lambda_{1} ,\lambda_{2} , \ldots ,\lambda_{r} \) are the feature values obtained by decomposing the matrix D, with \( \lambda_{1} \ge \lambda_{2} \ge \cdots \ge \lambda_{r} \). These feature values can be used as the extracted features.
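A hedged NumPy sketch of the decomposition of formulas (2)-(4) (the function name and test signal are ours): the singular values of the Hankel matrix are returned already sorted as \( \lambda_{1} \ge \lambda_{2} \ge \cdots \ge \lambda_{r} \).

```python
import numpy as np

def fvd_features(y, m, n):
    """Extract features as the singular values of the Hankel matrix of y,
    following formulas (2)-(4); values are returned in non-increasing order."""
    y = np.asarray(y, dtype=float)
    idx = np.arange(m)[:, None] + np.arange(n)[None, :]
    D = y[idx]                       # Hankel matrix, needs k = m + n - 1 samples
    # numpy returns the singular values sorted descending, matching
    # lambda_1 >= lambda_2 >= ... >= lambda_r with r = min(m, n)
    return np.linalg.svd(D, compute_uv=False)

rng = np.random.default_rng(0)
z = fvd_features(rng.standard_normal(199), m=100, n=100)
```

Keeping only the leading values of `z` then implements the denoising and dimension reduction described below.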

2. Fault feature extraction of gearbox

For the operation signal \( y = f\left( x \right) \) of the mechanical gearbox, the sequence consisting of signals \( x_{1} ,x_{2} , \ldots ,x_{k} \) of length k is obtained by sampling. A special Hankel matrix is constructed for this sequence, and the feature value vector \( z = \left( {\lambda_{1} ,\lambda_{2} , \ldots ,\lambda_{r} } \right) \) obtained by feature value decomposition is taken as the extracted features of the operation signal. According to the characteristics of feature value decomposition, the larger feature values correspond to the effective part of the signal, which reflects the fault, while the small feature values embody the noise of the signal. So, the feature values are arranged from large to small and the larger feature values are taken as the fault features.

2.2 Deep learning method

The second novelty of this paper is the learning method of grading, layering and classification based on the characteristics of the signal. That is, the pre-stage arranges the input according to the characteristics of the signal, and normalization processing such as unifying units and grouping the input by dimension or semantic field is implemented. The intermediate processing stage consists of the hidden layers; there are several hidden layers, and the learners of each layer differ. For example, the first layer performs redundancy removal on the data, while the second layer performs feature detection, extraction and labeling, and so on. The output stage is the design of the detector and recognizer of the output layer and of the output result. For example, the detector and recognizer perform calculations such as maximum membership, recognition rate, variance and error comparison. The output results are the effective values of the target or of the target parameters, and the output form is data, tables, images, and so on.

The deep learning algorithm is composed of multiple superimposed non-linear single-layer networks and has strong mathematical expressiveness. Deep belief networks (DBN) were proposed by Hinton et al. (2006) and are constructed on the basis of the restricted Boltzmann machine (RBM) (Roux and Bengio 2008). It has been proved theoretically that an RBM can fit any discretely distributed signal as long as there are enough network units. In this algorithm, a number of RBMs are stacked to form a DBN, in which each upper-level network extracts higher-order features from the lower-level network; this makes the DBN an unsupervised deep learning algorithm. An RBM is composed of a visual layer V and a hidden layer H; its structure is characterized by interconnection between the layers and independence within each layer. The structure of the RBM is shown in Fig. 3. The bottom neurons \( V = \left( {v_{i} } \right)_{M \times 1} \) in the graph form the visual layer, \( W = \left( {w_{ji} } \right)_{N \times M} \) are the connection weights between the two layers, and the upper neurons \( H = \left( {h_{j} } \right)_{N \times 1} \) form the hidden layer. Here, the number of neurons in the hidden layer of the RBM model is assumed to be N, and the number of neurons in the visual layer is M.

Fig. 3

RBM model

The RBM is a probability model. The two layers are interconnected, while the neurons within each layer are completely independent. Therefore, when the state of the visual layer is known, the state of each neuron in the hidden layer can be obtained from the conditional probability \( p\left( {h\left| v \right.} \right) = \prod\nolimits_{j = 1}^{N} {p\left( {h_{j} \left| v \right.} \right)} \). Similarly, when the state of the hidden layer is known, the state of the visual layer can be obtained from the conditional probability \( p\left( {v\left| h \right.} \right) = \prod\nolimits_{i = 1}^{M} {p\left( {v_{i} \left| h \right.} \right)} \), where \( h,h_{j} \in H \) and \( v,v_{i} \in V \). In the RBM, the number of neurons in the visual layer is set to the feature dimension of the training data, while the number of neurons in the hidden layer needs to be set in advance. The connection weight W that needs to be trained is an \( N \times M \) matrix. The bias vectors are a, an M-dimensional column vector for the visual layer, and b, an N-dimensional column vector for the hidden layer. The training algorithm for the RBM is the contrastive divergence (CD) algorithm (Zhang et al. 2015); the detailed steps are as follows:

P1:

Initialize the connection weight W. According to the training rules for the network, define the excitation function \( h = W * x + b \) to determine the input value of the hidden layer, where \( x = \left( {x_{1} ,x_{2} , \ldots ,x_{M} } \right) \) is the network input signal;

P2:

The excitation value h obtained in the first step is the input to the hidden layer. The open-state function of a hidden neuron, i.e., the probability of the open state, is defined by \( p\left( {h_{j} } \right) = \frac{1}{{1 + e^{{ - h_{j} }} }} \) based on the network stability requirement, and a logistic function is used for the output \( ho_{j} \) of the hidden layer. The probability of each hidden neuron being in the open state is calculated by the above formula, where 0 indicates the closed state and 1 indicates the open state. Because each neuron is binary, the specific switch state of each neuron can be determined by comparing its open probability with a value \( c = UI\left( {0,1} \right) \) drawn randomly from the uniform distribution on (0, 1). The comparison formula is given as follows

$$ ho_{j} = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {p\left( {h_{j} = 1} \right) \ge c} \hfill \\ {0,} \hfill & {p\left( {h_{j} = 1} \right) < c} \hfill \\ \end{array} } \right. $$
(5)
P3:

Determine the open state of the neurons in the visual layer. The calculation method is similar to the second step.
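Steps P1-P3 can be sketched as one CD-1 Gibbs step in NumPy (a minimal illustration under our own naming; the real RBM sizes and data are not those of the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, rng):
    """One CD-1 Gibbs step for an RBM (steps P1-P3).

    v0 : (M,) binary visible state; W : (N, M) weights;
    a  : (M,) visible bias;         b : (N,) hidden bias.
    """
    # P1-P2: hidden open probability (formula (5)), sampled by
    # comparing with a uniform draw c ~ U(0, 1)
    p_h = sigmoid(W @ v0 + b)
    h0 = (p_h >= rng.uniform(size=p_h.shape)).astype(float)
    # P3: reconstruct the visible layer in the same way
    p_v = sigmoid(W.T @ h0 + a)
    v1 = (p_v >= rng.uniform(size=p_v.shape)).astype(float)
    return h0, v1

rng = np.random.default_rng(1)
M, N = 6, 4
W = rng.normal(0.0, 0.01, size=(N, M))
h0, v1 = cd1_step(rng.integers(0, 2, M).astype(float), W,
                  np.zeros(M), np.zeros(N), rng)
```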

The probability distribution of the training samples obtained by the CD algorithm is the optimal distribution. The greedy algorithm is used in the training process of the DBN: each RBM is trained layer by layer, so the best result of each layer is obtained for each RBM. However, the global result obtained this way is suboptimal, and a subsequent global fine-tuning is required, after which the globally optimal result can be obtained. The specific training process (Ma et al. 2015) of the DBN is as follows:

P1:

Train the first RBM model;

P2:

Fix the weights and bias vectors obtained in the previous step, and use the output as the input of the next RBM;

P3:

Repeat the first two steps according to the designed number of hidden layers;

P4:

Carry out global fine-tuning.
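The greedy layer-by-layer procedure P1-P4 (without the final supervised fine-tuning) can be sketched as follows. This is a hedged toy implementation under our own naming and tiny random data, not the authors' code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(X, n_hidden, lr=0.1, epochs=10, seed=0):
    """Train one RBM with CD-1 on rows of X; return (W, a, b)."""
    rng = np.random.default_rng(seed)
    n_visible = X.shape[1]
    W = rng.normal(0.0, 0.01, size=(n_hidden, n_visible))
    a = np.zeros(n_visible)                  # visible bias
    b = np.zeros(n_hidden)                   # hidden bias
    for _ in range(epochs):
        for v0 in X:
            p_h0 = sigmoid(W @ v0 + b)
            h0 = (p_h0 >= rng.uniform(size=n_hidden)).astype(float)
            v1 = sigmoid(W.T @ h0 + a)       # reconstruction
            p_h1 = sigmoid(W @ v1 + b)
            W += lr * (np.outer(p_h0, v0) - np.outer(p_h1, v1))
            a += lr * (v0 - v1)
            b += lr * (p_h0 - p_h1)
    return W, a, b

def train_dbn(X, layer_sizes):
    """P1-P3: train RBMs layer by layer, feeding each output upward."""
    params = []
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(X, n_hidden)
        params.append((W, a, b))
        X = sigmoid(X @ W.T + b)             # P2: output becomes next input
    return params, X                         # P4 (fine-tuning) would follow

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(20, 15)).astype(float)
params, features = train_dbn(X, [8, 4])
```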

3 Design of network for the FVD-DLN method

The third novelty of this paper is the design of the neural network structure for the above deep learning method, namely the connection, depth, calculators, learning training of the network and other elements. This includes the design of the connection structure between the layers, for example single connection, multiple interconnection and sparse connection; the design of each stage and the number of layers at each stage, such as the input layer, several hidden layers and the output layer; the design of the calculators in each layer, for example preprocessing, feature extraction, maximum membership, recognition rate, variance and error comparison; and the design of learning training according to feedback, feedforward, and so on.

3.1 Network input

The input of the network is the original signal to be processed; here, the input signal is the fault data of the gearbox, collected as discrete signals. The gearbox fault data used for experimental verification were provided by a car repair shop and come from a test drive system.

We carried out some fault experiments. The faults were single-point faults introduced manually by electrical discharge machining. Statistics on the frequency of gearbox faults show that 60% of gearbox faults are caused by gear faults, so only the fault diagnosis of gears is discussed here. For gear faults, several features are selected in the frequency domain. The most obvious gear faults in the frequency domain appear in the sidebands around the meshing frequency. Therefore, in selecting the characteristic signal in the frequency domain, the amplitudes \( A_{i,j1} \), \( A_{i,j2} \) and \( A_{i,j3} \) at the sideband frequencies \( f_{s} \pm nf_{z} \) of axes 1, 2 and 3 at gears 2, 4 and 6 are extracted, where \( f_{s} \) is the meshing frequency of the gear; \( f_{z} \) is the rotational frequency of the axis; n = 1, 2, 3; i = 2, 4, 6 indicates the gear position; and j = 1, 2, 3 denotes the axis number. Because there are two pairs of meshing gears at the 2nd and 3rd axes, 1 and 2 represent the two meshing frequencies, respectively. In this way, the input of the network is a 15-dimensional vector. These data have different units and orders of magnitude, so they should be normalized before being input into the FVD-DLN network. In this section, nine sets of sample data are provided, each a 15-dimensional vector. The nine sets of data are taken at the rotational speeds of three different gears, which are 6120.09 rpm, 8670 rpm and 1705.5 rpm, with 17, 12 and 61 teeth, respectively. Table 1 shows the nine sets of input vectors, which are the normalized sample data.

Table 1 Sample data of state of gearbox
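The paper does not state which normalization scheme is applied to the mixed-unit 15-dimensional vectors; as one plausible sketch (an assumption of ours), column-wise min-max scaling maps each feature to [0, 1]:

```python
import numpy as np

def minmax_normalize(X):
    """Column-wise min-max scaling to [0, 1]; one possible normalization
    for features with different units and orders of magnitude."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant columns
    return (X - lo) / span

# Illustrative rows only (not the actual Table 1 values)
X = np.array([[6120.09, 0.12, 3.4],
              [8670.00, 0.07, 2.1],
              [1705.50, 0.09, 5.8]])
Xn = minmax_normalize(X)
```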

The faults are given single-point faults produced manually by electrical discharge machining. The sampling times are 5 s, 1.4 s and 1 s, respectively. The fault data are, in turn, no fault, tooth root crack and gear fracture, as shown in Fig. 4.

Fig. 4

Vibration signal of gearbox with fault source

In Fig. 4, \( f_{1} \) denotes the signal with no fault at the above three rotational speeds. \( f_{2} \) and \( f_{3} \) denote the signals with a tooth root crack at rotational speeds of 6120.09 rpm and 8670 rpm, respectively. \( f_{4} \) denotes the signal with gear fracture at the above three rotational speeds. For signals \( f_{2} \), \( f_{3} \) and \( f_{4} \), the points of much higher amplitude in the waveform are fault points.

The 1000 points appearing in Fig. 4 are explained as follows: according to the collected signal data, the vibration signal of the gearbox has several periods, and the 1000 time points in Fig. 4 cover one period of the signal.

In addition, the relationship between the rotational frequency and the teeth number of each wheel is discussed as follows:

The meshing frequency of a gear is equal to its rotational frequency (rotations per second, or Hz) multiplied by its number of teeth, and the meshing frequencies of two gears that mesh with each other are equal. For example, if a gear with 17 teeth has a rotational speed of 6120.09 rpm (rotations per minute) and meshes with a gear of 61 teeth, the latter has a rotational speed of (17/61) × 6120.09 = 1705.5 rpm. Their rotational frequencies are 102.0 Hz and 28.4 Hz, respectively, and it can be verified that their common meshing frequency is 102.0 × 17 = 28.4 × 61 = 1734.0 Hz.
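The worked example above can be checked with a few lines of arithmetic (values rounded as in the text):

```python
# Meshing-frequency check: a 17-tooth gear at 6120.09 rpm
# driving a 61-tooth gear.
rpm1, z1, z2 = 6120.09, 17, 61

rpm2 = rpm1 * z1 / z2            # speed of the 61-tooth gear, ~1705.5 rpm
f1, f2 = rpm1 / 60, rpm2 / 60    # rotational frequencies in Hz
mesh1, mesh2 = f1 * z1, f2 * z2  # meshing frequencies of the two gears

# The two meshing frequencies agree exactly: rpm1*z1/60 == (rpm1*z1/z2)/60*z2
```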

When the gear is working, its kinematic functions are as follows: first, to transfer the movement; second, to change the speed of the movement; third, to change the direction of the movement, for example from parallel to vertical or to any other angle.

The spectral characteristics of the meshing frequency of the gear and its harmonics when a gear fault occurs are as follows:

1. In the initial healthy state, the amplitude of the meshing frequency is the highest, and the amplitudes of the harmonics decrease in sequence.

2. As gear wear increases, the involute tooth profile is gradually damaged and gear vibration intensifies. At this time, the amplitudes of the meshing frequency and of each harmonic gradually increase, and the amplitudes of the harmonics increase much faster than that of the meshing frequency.

3. When the wear is severe, the amplitude of the second harmonic exceeds the amplitude of the meshing frequency, as shown in Fig. 5.

    Fig. 5

    Meshing frequency and its second harmonic in severe wear

From Fig. 5, the degree of wear of the gear can be judged from the increase in the amplitudes of the meshing frequency and its harmonics.

3.2 Feature value decomposition for feature extraction

After the processed data are obtained, they are decomposed into feature values. The input signal sequence is \( y = \left\{ {x_{1} ,x_{2} , \ldots ,x_{10000} } \right\} \), where \( x_{i} (i = 1,2, \ldots ,10000) \) is the ith data point. Moreover, through a large number of experiments, the range of the signal length is chosen to be [0, 2500] for quick implementation and clear results. The Hankel matrix of y is used to construct a matrix A. According to the characteristics of an \( m \times n \) Hankel matrix, since the length of the input signal sequence y is 10,000 and the number of columns n of the constructed matrix A is chosen to be 2500, the number of rows m must be 7501. The dimension of the Hankel matrix A is therefore \( 7501 \times 2500 \). The matrix A is given as follows:

$$ A = \left( {\begin{array}{*{20}c} {x_{1} } & \cdots & {x_{2500} } \\ \vdots & \ddots & \vdots \\ {x_{7501} } & \cdots & {x_{10000} } \\ \end{array} } \right) $$
(6)

Because of the differences between the different signals, the feature values obtained by feature value decomposition of the matrix in formula (6) also differ. Figure 6 shows the feature values obtained by decomposing different signals; the feature values of different signals \( x_{i} \) differ within the same range of length. From Fig. 6, the feature values in the normal state are small, almost zero, whereas the feature values in the fault states are larger. It can be seen from the figure that the feature values show obvious differences within the first 2000 dimensions, so the first 2000 dimensions are chosen as the feature vectors.

Fig. 6

Feature value distribution curve

3.3 Model training

After the feature value decomposition, the obtained feature vector is normalized and used as the input of the DLN. The DLN is a probabilistic model that cannot be used directly as a fault diagnosis model and needs a corresponding design. In order to perform multi-classification, ordinary neural network classifiers \( yo_{i} (i = 1,2, \ldots ,O) \) need to be added to the output layer of the model. Here, assume the value of \( yo_{i} \) is \( v_{i} \). If \( v_{k} = \mathop {\hbox{max} }\limits_{i = 1,2, \ldots ,O} \left\{ {v_{i} } \right\} \), i.e., \( v_{k} \) is the maximum, then the corresponding classifier \( yo_{k} \) is chosen and marked as max, acting as a maximum-membership classifier. The DLN model is shown in Fig. 7. When the model carries out identification and classification of the target, the feature vector enters at the input layer, is passed along layer by layer through the hidden layers, and finally reaches the classifier layer, which produces the classification results.
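The maximum-membership rule at the classifier layer reduces to an argmax over the output units; a minimal sketch (our naming, illustrative values):

```python
import numpy as np

def max_membership(v):
    """Pick the classifier yo_k whose output v_k is maximal;
    return its index k and the membership value v_k."""
    v = np.asarray(v)
    k = int(np.argmax(v))
    return k, float(v[k])

# Illustrative outputs of O = 3 classifier units
k, vk = max_membership([0.12, 0.81, 0.07])
# class index 1 wins with membership 0.81
```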

Fig. 7

DLN classifier structure

3.3.1 Depth of FVD-DLN network

Four structures for the number of hidden layers and the number of neurons in each layer are considered, i.e., (a), (b), (c) and (d), as shown in Fig. 8.

Fig. 8

Four different depth classifiers of DLN

The numbers in the boxes in Fig. 8 are the numbers of neurons in the layers. The given input data are classified by the four DLN models with different depths. The simulation results show that the classification results differ for the same sample data. The classification accuracy on the test set of the four DLN classifiers with different depths is shown in Table 2.

Table 2 Classification accuracy of DLN at different depths

It can be seen from Table 2 that the accuracy of the designed DLN classifier improves as its number of hidden layers increases from one to three, and the accuracy is highest with three hidden layers. With more than three hidden layers, the accuracy gradually decreases, so the number of hidden layers here is chosen as three.

3.3.2 Number of units in hidden layer of FVD-DLN network

In Hinton (2012), Hinton pointed out that the number of bits required to characterize the category of a sample is usually equal to the number of constraints applied to a parameter by a training sample during model training, which can be calculated by the following expression:

$$ {\text{bits}} = \log_{2} n $$
(7)

Here n represents the number of sample categories. After determining the number of bits, multiply it by the capacity of the training set and select a value one order of magnitude smaller as the number of hidden neurons (Zhang et al. 2015). If the training set is highly redundant, fewer hidden neurons can be used. The number of hidden neurons can be calculated by the following expression:

$$ N_{p} = N_{s} * {\text{bits}}/10 $$
(8)

where \( N_{p} \) denotes the number of hidden neurons and \( N_{s} \) indicates the number of training samples. The numbers of neurons in the hidden layers are accordingly set to 200, 200 and 300, respectively.
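Formulas (7)-(8) can be combined into a small helper. This is a hedged sketch: the bit count is taken as the positive \( \log_{2} \) of the number of categories, and the sample count 1500 below is an illustrative assumption, not the paper's actual training-set size:

```python
import math

def hidden_units(n_classes, n_samples):
    """Formulas (7)-(8): bits = log2(n_classes),
    N_p = N_s * bits / 10 (one order of magnitude smaller)."""
    bits = math.log2(n_classes)
    return n_samples * bits / 10

# e.g. 3 fault classes: bits = log2(3) ~ 1.585
n_p = hidden_units(3, 1500)
```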

3.3.3 Weight initialization, batch selection and global optimization

The parameters of the designed DLN structure are trained as follows. The training is divided into three steps:

1. Weight initialization. The training is carried out layer by layer, which is a greedy training. First initialize each parameter, including the weights and biases of each layer. In general, the weights W between the layers are assigned values generated by the normal distribution \( N\left( {0,0.01} \right) \) (Hinton 2012), and all biases are set to 0. The learning rate should lie between 0 and 1; according to training experience with the samples, a learning rate of 0.1 proved most appropriate in the experiments, so the learning rate here is selected as 0.1.

2. Batch selection of samples. When updating the parameters, small batches of training data are selected. Although a single training sample can also be used for each update, the amount of computation is then too large. With small-batch training, the choice of batch capacity is critical: if the batch is too small, training is insufficient; if the batch capacity is too large, the optimum can easily be missed. Hinton (2012) suggested that if the categories in the data set are equiprobable, the batch size should equal the number of sample categories, with each category contributing one sample as far as possible; for non-equiprobable data sets, a batch size of 10–100 is chosen by experience. The experimental data set here has equiprobable categories, and the batch size is selected as 7; that is, every 7 samples form a small training batch, and training is carried out in batches. When small-batch data are used, the parameter update also needs to change appropriately (Fischer and Igel 2013). The averaged gradient is usually used, with the following expression:

    $$ \theta^{{\left( {t + 1} \right)}} = \theta^{\left( t \right)} + \varepsilon \left( {\frac{1}{B}\sum\limits_{{t^{\prime} = B * t + 1}}^{{B\left( {t + 1} \right)}} {\frac{{\partial \log p\left( {v_{{t^{\prime}}} \left| \theta \right.} \right)}}{\partial \theta }} } \right) $$
    (9)

    where B is the capacity of the small batch; \( p\left( {v_{{t^{\prime}}} \left| \theta \right.} \right) \) is the conditional probability; and \( \theta^{\left( t \right)} \) is the parameter value at time t.

3. Global optimization algorithm. A supervised algorithm is used to carry out the overall fine-tuning and classification. The parameters of the whole network are adjusted using the training methods of ordinary neural networks. For the choice of method, the stochastic gradient method (SG), the momentum learning rate (MLR) (Hinton 2012) and the Adagrad variable learning rate (He and Li 2016; Tuerxun and Dai 2015; Lu and Zhang 2016; Su et al. 2017; Su and Zhang 2017; Zhang et al. 2017) are used, respectively. The training results are shown in Fig. 9.

    Fig. 9

    Accuracy of global training

The abscissa in the graph is the number of iterations and the ordinate is the classification accuracy. The simulation shows that the classification accuracy is highest when the Adagrad variable learning rate is used for global training, so the Adagrad variable learning rate is chosen for classification.
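The averaged-gradient update of formula (9) amounts to stepping the parameter vector by the mean of the B per-sample gradients; a minimal sketch under our own naming (the gradients here are made-up illustrative values):

```python
import numpy as np

def minibatch_update(theta, grads, eps=0.1):
    """Formula (9): move theta along the gradient averaged over a
    small batch of size B (grads holds the B per-sample gradients)."""
    grads = np.asarray(grads, dtype=float)
    return theta + eps * grads.mean(axis=0)

theta = np.zeros(3)
batch_grads = [[1.0, 0.0, 2.0],
               [3.0, 0.0, 0.0]]
theta1 = minibatch_update(theta, batch_grads, eps=0.1)
# mean gradient is [2, 0, 1], so theta moves to [0.2, 0.0, 0.1]
```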

4 Application of FVD-DLN Method in Gearbox Fault Diagnosis

With the deep learning network constructed in the above section, the basic process of fault diagnosis consists of three steps: the first step is the training and testing of the FVD-DLN network for classification and feature extraction of the parameters; the second step is the matching and recognition of the fault; the third step is the analysis of the advantages and disadvantages of the FVD-DLN method for fault diagnosis, as shown in Fig. 10.

Fig. 10

The process of fault diagnosis based on the constructed FVD-DLN network

4.1 Training and test of the constructed network for actual data

In each test, all nine sets of data given above, i.e., the data in Table 1 of Sect. 3.1, are input into the FVD-DLN network. The above gearbox is diagnosed using the FVD-DLN method. On the basis of the above input samples and test samples, the code is run and simulated, and the test is repeated three times. The output results for the fault type form the 3 × 3 matrix \( {\text{output}} = \left( {\begin{array}{*{20}c} 1 & 2 & 3 \\ 1 & 2 & 3 \\ 1 & 2 & 3 \\ \end{array} } \right) \).

The numbers 1, 2 and 3 in the above matrix represent three different conditions: 1 indicates no fault, 2 indicates a tooth root crack, and 3 indicates gear fracture. The rows of the matrix correspond to the tests and the columns to the fault categories: the 3 rows indicate that the test is repeated three times, and the 3 columns indicate that there are three fault types. It can be seen from this output matrix that each test correctly detects all three types of fault.

The simulation results of the FVD-DLN method show that the network detects the faults correctly. That is, the network successfully diagnoses these three kinds of faults, so the network is valid for gearbox fault diagnosis. The effect of the training process for the network is shown in Fig. 11; training stops when the maximum number of training iterations is reached.

Fig. 11

Training process of FVD-DLN network

Figure 11 shows an input layer, two hidden layers and an output layer. The first hidden layer performs preliminary feature extraction, i.e., data classification; the second hidden layer performs further feature extraction; and the output layer performs fault matching and recognition.

In the hidden layers, the input data are first classified, i.e., preliminary feature extraction is implemented; further feature extraction is then performed with the feature extraction algorithm described above. In the output layer, fault matching and recognition are carried out. The matching and recognition method is discussed next.

4.2 Matching and recognition method

In this section, fault recognition denotes fault diagnosis. The neurons in the output matching layer of the FVD-DLN network act as matching filters (matching operators, matching rules, etc.). Using these filters, the degree of similarity, in a given parameter, between the fault to be identified and a known fault is computed. To identify faults better, the weights \( 0 \le w_{pq} \le 1 \) in the output layer must be adjusted, where \( p,q = 1,2, \ldots ,M \). The adjustment of \( w_{pq} \) is as follows: from the data displayed by the neurons of this layer, if the degree of similarity is greater than the threshold value, the weight \( w_{pq} \) is increased; otherwise it is decreased. The input \( F_{q}^{t} \) of the output matching layer at time t is \( F_{q}^{t} = \sum\nolimits_{p} {w_{pq} E_{p}^{t} } \), where \( E_{p}^{t} \) is the output of the hidden layer.
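The matching-layer input and the threshold-driven weight adjustment can be sketched as follows. This is a minimal sketch: the step size, threshold value and function names are assumptions, since the text only states that \( w_{pq} \) grows when the similarity exceeds the threshold, shrinks otherwise, and stays in \( [0, 1] \).

```python
import numpy as np

M = 4
rng = np.random.default_rng(0)
w = rng.uniform(0.0, 1.0, size=(M, M))   # weights w_pq, constrained to [0, 1]
E = rng.uniform(0.0, 1.0, size=M)        # hidden-layer outputs E_p at time t

# Input of the matching layer: F_q = sum_p w_pq * E_p
F = E @ w

def adjust_weights(w, similarity, threshold=0.5, step=0.05):
    """Raise every w_pq when the similarity exceeds the threshold, else
    lower it, keeping all weights inside [0, 1] as required."""
    delta = step if similarity > threshold else -step
    return np.clip(w + delta, 0.0, 1.0)

w = adjust_weights(w, similarity=0.8)    # similarity above threshold: increase
```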

Fault recognition is performed on the extracted fault characteristic vector. Since some faults are complicated or uncertain, the characteristic parameters that make up their characteristic vectors are uncertain to some extent. The known characteristic parameters can therefore be treated as uncertain numbers, so both the known characteristic vectors and the vectors to be diagnosed are vectors of uncertain numbers, and this uncertain fault recognition is handled well by the constructed FVD-DLN network. In the output matching layer, the characteristic vector of an unknown fault is compared with those of the known fault categories. By the maximum membership principle, the unknown fault is judged to belong to the \( i_{0} {\text{th}} \) category if and only if the degree of similarity between its characteristic vector and that of the \( i_{0} {\text{th}} \) category is maximal. The recognition method is given as follows:

  1. Calculation of the degree of similarity

Assume there are n categories of fault and that the characteristic vector of every fault consists of k characteristic parameters. Assume the \( i{\text{th}} \) fault category has \( n_{ij} \) values of the \( j{\text{th}} \) characteristic parameter, \( X_{ij}^{m} \) denotes the \( m{\text{th}} \) value of the \( i{\text{th}} \) category on the \( j{\text{th}} \) parameter with mean \( \theta_{ij}^{m} \), and \( \tilde{X}_{j} \) is the observation of the unknown fault on the \( j{\text{th}} \) parameter with mean \( x_{j} \), where \( i = 1,2, \ldots ,n \), \( j = 1,2, \ldots ,k \), \( m = 1, \ldots ,n_{ij} \).

Here, fault recognition means assigning the observation vector formed by the observed values \( \tilde{X}_{j} \) of an unknown fault to the characteristic vector of a known fault category, i.e., finding the known characteristic vector most similar to the observation vector of the unknown fault.

Assume that \( \mu_{{X_{ij}^{m} }} \left( u \right) \) and \( \mu_{{\tilde{X}_{j} }} \left( u \right) \) are the membership functions of \( X_{ij}^{m} \) and \( \tilde{X}_{j} \), respectively. The Cauchy membership function is chosen based on experience:

$$ \mu_{{X_{ij}^{m} }} \left( u \right) = \frac{{\sigma_{ij}^{2} }}{{\sigma_{ij}^{2} + \left( {u - \theta_{ij}^{m} } \right)^{2} }} $$

and

$$ \mu_{{\tilde{X}_{j} }} \left( u \right) = \frac{{\sigma_{j}^{2} }}{{\sigma_{j}^{2} + \left( {u - x_{j} } \right)^{2} }} $$

where u is a variable corresponding to \( X_{ij}^{m} \) or \( \tilde{X}_{j} \), and \( \sigma_{ij} \) and \( \sigma_{j} \) are the ductility (spread) degrees of \( \mu_{{X_{ij}^{m} }} \left( u \right) \) and \( \mu_{{\tilde{X}_{j} }} \left( u \right) \), respectively.

To determine the type of the unknown fault, the degree of similarity \( d_{ij}^{m} \) between \( \tilde{X}_{j} \) and \( X_{ij}^{m} \) is calculated, i.e.,

$$ d_{ij}^{m} = \tilde{X}_{j} \circ X_{ij}^{m} $$

where \( \circ \) denotes a synthetic operation: the supremum of the intersection of \( \mu_{{\tilde{X}_{j} }} \left( u \right) \) and \( \mu_{{X_{ij}^{m} }} \left( u \right) \), i.e., the height of the intersection of the two distribution curves of \( \tilde{X}_{j} \) and \( X_{ij}^{m} \). Setting the two membership functions equal gives

$$ \frac{{\sigma_{ij}^{2} }}{{\sigma_{ij}^{2} + \left( {u - \theta_{ij}^{m} } \right)^{2} }} = \frac{{\sigma_{j}^{2} }}{{\sigma_{j}^{2} + \left( {u - x_{j} } \right)^{2} }} $$

whose solution is

$$ u = \frac{{\sigma_{ij} x_{j} + \sigma_{j} \theta_{ij}^{m} }}{{\sigma_{j} + \sigma_{ij} }} $$

Therefore,

$$ d_{ij}^{m} = \frac{{\sigma_{ij}^{2} }}{{\sigma_{ij}^{2} + \left( {u - \theta_{ij}^{m} } \right)^{2} }}\left| {_{{u = \frac{{\sigma_{ij} x_{j} + \sigma_{j} \theta_{ij}^{m} }}{{\sigma_{j} + \sigma_{ij} }}}} } \right. = \frac{{\left( {\sigma_{j} + \sigma_{ij} } \right)^{2} }}{{\left( {\sigma_{j} + \sigma_{ij} } \right)^{2} + \left( {x_{j} - \theta_{ij}^{m} } \right)^{2} }} $$

Taking the supremum of \( d_{ij}^{m} \) over m gives

$$ d_{ij} = \mathop \vee \limits_{m} d_{ij}^{m} $$

Thus, for each known \( i{\text{th}} \) category there is a degree of similarity between it and the unknown fault on the \( j{\text{th}} \) characteristic parameter; collecting these over all j gives the similarity matrix between the unknown fault vector and the \( i{\text{th}} \) known category fault vector, that is,

$$ D_{i} = \left[ {d_{i1} ,d_{i2} , \ldots ,d_{ik} } \right]^{T} $$
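The closed-form similarity above can be computed with a minimal sketch; the function names and toy numbers are illustrative assumptions, while the formula itself is the one derived above.

```python
import numpy as np

def similarity(x_j, sigma_j, theta, sigma_ij):
    """Height of the intersection of two Cauchy membership curves:
    d = (s_j + s_ij)^2 / ((s_j + s_ij)^2 + (x_j - theta)^2)."""
    s2 = (sigma_j + sigma_ij) ** 2
    return s2 / (s2 + (x_j - theta) ** 2)

def d_ij(x_j, sigma_j, thetas, sigma_ij):
    """d_ij: supremum of d_ij^m over the stored means theta_ij^m."""
    return max(similarity(x_j, sigma_j, t, sigma_ij) for t in thetas)

# Identical means give similarity 1; the farther apart, the smaller the value.
print(similarity(2.0, 1.0, 2.0, 1.0))                  # 1.0
D_i = np.array([d_ij(2.0, 1.0, [1.5, 2.5], 1.0),
                d_ij(2.0, 1.0, [0.0, 5.0], 1.0)])      # entries of D_i
```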
  2. Matching and recognition principle

The matching procedure is as follows. First, using the above method for the degree of similarity, calculate the similarity between each element of the feature-parameter matrix of the unknown fault and the corresponding element of the feature-parameter matrix in the known standard template library, obtaining the similarity matrix. Second, compute the quadratic norm of this matrix. Finally, judge the fault type from the norm values and the maximum membership principle: the fault belongs to the parameter type of the known standard template with the maximum norm.

The maximum membership principle is: there exists \( i_{0} \in B \) satisfying

$$ i_{0} = \arg \mathop {\hbox{max} }\limits_{i} \left\{ {\left\| {D_{i} } \right\|} \right\} $$

where \( \left\| \cdot \right\| \) is the matrix norm. By the maximum membership principle, the unknown fault is judged to belong to the \( i_{0} {\text{th}} \) fault category, and the fault is thereby diagnosed.
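The maximum membership decision can be sketched as follows; the Euclidean norm is one concrete (assumed) choice for \( \left\| \cdot \right\| \), and the similarity values are toy numbers.

```python
import numpy as np

def classify(similarity_vectors):
    """Return i0 = argmax_i ||D_i||: the maximum membership decision.
    The Euclidean norm is an assumed instance of the matrix norm."""
    norms = [np.linalg.norm(D) for D in similarity_vectors]
    return int(np.argmax(norms))

D = [np.array([0.2, 0.3]),      # category 0
     np.array([0.9, 0.8]),      # category 1: largest norm
     np.array([0.5, 0.1])]      # category 2
print(classify(D))              # 1
```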

4.3 Analysis of advantages and disadvantages for FVD-DLN diagnosis method

  1. Establishing a standard template database

The fault database and the sample database used in the experimental test are established as follows:

The raw gear data in the mechanical equipment are analog signals, so the signals need to be sampled discretely. The gear fault data used in the experimental verification were collected by the staff of our mechanical laboratory from a test drive system whose main components are a motor, a torque sensor and a dynamometer. The fault points discussed are the crack of the tooth root and the fracture of the gear.

The rotation speed of the gear is 1705.5 rpm, and about 1000 data points are collected per revolution according to the rotation frequency, so every 1000 sampling points form one sampling result. To reflect the data more comprehensively, the window is shifted backwards by 10 sampling points after the first sampling to form the second sampling, and so on, each window containing 1000 points over one revolution. In this way, 100 sets of sampled data are obtained.
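The sliding-window sampling described above can be sketched directly; the signal here is a synthetic stand-in for the measured gear data.

```python
import numpy as np

# Sliding-window sampling as described: 1000-point windows shifted by
# 10 points, producing 100 overlapping sets.
signal = np.arange(1990, dtype=float)     # needs 1000 + 99 * 10 = 1990 points
windows = np.stack([signal[10 * i : 10 * i + 1000] for i in range(100)])
print(windows.shape)                      # (100, 1000)
```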

The historical gear data are input into the constructed deep network and used as a standard template library containing multiple data types, such as the healthy state and the minor, medium and severe fault states. For the data at each state, the same parameters are considered, such as variance, error, amplitude, meshing frequency and its amplitude, the amplitudes of its primary and secondary harmonics, the increment of amplitude, and frequency and its increment. In this experiment, 10 parameters are chosen, with 1000 data points collected for each parameter. Each parameter is then decomposed according to the feature value decomposition formula (2) to constitute a feature matrix.

The specific establishment process for sample library is given as follows:

  1. 100 different samples of healthy and fault data are chosen randomly from the healthy and fault database, and 6 records are chosen randomly from the 10 records of each healthy or fault sample, giving 600 records in total.

  2. Five records are chosen randomly from the 6 records of each healthy or fault sample as the training set, so 500 training records over the 100 samples compose the experimental healthy and fault database. Since there are 600 records in total, the remaining 100 records compose the test sample database.

  3. Feature extraction and fault recognition are carried out on the 500 training samples, and 10 kinds of features are acquired for each fault, so the training vectors contain 5000 characteristic parameters in total.
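The split above can be sketched in a few lines; the record identifiers are assumed placeholders, and only the counts (100 samples, 6 records each, 5 to training and 1 to testing) come from the text.

```python
import random

# Sample-library split as described: 100 samples, 6 records each,
# 5 random records to training and 1 to testing per sample.
random.seed(0)
train, test = [], []
for sample_id in range(100):
    records = [(sample_id, r) for r in range(6)]
    chosen = random.sample(records, 5)
    train.extend(chosen)
    test.extend(rec for rec in records if rec not in chosen)
print(len(train), len(test))              # 500 100
```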

  2. Classification and extraction of data by the feature extraction algorithm

After the fault data to be identified are input into the network, feature value decomposition is performed. The discrete sequence \( x_{i} (i = 1,2, \ldots ,1000) \) is the \( i{\text{th}} \) sampling point, and 10 parameters are chosen as a feature sequence for matching. According to formula (2), the feature matrix A of the signal \( y = \left\{ {x_{1} ,x_{2} , \ldots ,x_{1000} } \right\} \) is therefore constructed with dimension \( 991 \times 10 \), i.e.,

$$ A = \left( {\begin{array}{*{20}c} {x_{1} } & \cdots & {x_{10} } \\ \vdots & \ddots & \vdots \\ {x_{991} } & \cdots & {x_{1000} } \\ \end{array} } \right) $$
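The construction of A can be sketched as follows: row i holds \( x_{i}, \ldots ,x_{i+9} \), so a 1000-point signal yields a \( 991 \times 10 \) matrix (the signal values here are synthetic stand-ins).

```python
import numpy as np

# Feature matrix A of formula (2): row i is the window x_i ... x_{i+9}.
y = np.arange(1, 1001, dtype=float)       # stand-in for x_1 ... x_1000
A = np.stack([y[i : i + 10] for i in range(991)])
print(A.shape)                            # (991, 10)
print(A[0, 0], A[-1, -1])                 # 1.0 1000.0, the corners x_1, x_1000
```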

Because different signals differ, the features calculated by formula (2) also differ. In the first hidden layer of the network, the different amplitudes are extracted by formula (2) to obtain the fault classification result; in the second hidden layer, the increments of amplitude are extracted by formula (2) to obtain detailed features.

  3. Matching and recognition

The basic steps of fault matching and recognition are as follows:

P1:

Feature extraction is carried out on the faults in the test database in the hidden layers of the network

P2:

The characteristic vector of the unknown fault is matched against all known characteristic vectors in the fault archive in the output layer. Random selections and matching recognitions are implemented 100 times based on the recognition principle, and the most similar feature matrix, i.e., the one with the maximum degree of similarity according to Sect. 4.2, is chosen as the recognition result

P3:

For each fault in the test sample database, after the 100 recognitions of steps P1–P2 are done, the numbers of correct and erroneous recognitions are recorded, and the correct recognition rates are obtained by the calculation formula (10) below

Next, the FVD-DLN network is tested with healthy and fault data other than the training samples: a set of 100 healthy and fault samples of two different types with 10 kinds of characteristic parameters. The network is tested on this set to detect whether it can successfully identify the different fault parameters.

According to the characteristic parameter types of the gearbox fault, the number of states of the last layer of the FVD-DLN network is set to 10 here. The 100 healthy and fault samples serving as the signal source are identified by the FVD-DLN network. The amplitudes of the various input signal sources are shown in Fig. 12a. In the first hidden layer of the network shown in Fig. 11, the FVD-DLN classifies the fault signals; the classification effect is shown in Fig. 12b. In Fig. 12a, b, a negative amplitude indicates a magnitude in the opposite direction. In the second hidden layer, the FVD-DLN then carries out feature extraction on the classified signals; the extraction effect is shown in Fig. 12c, d, respectively. In the output layer, matching and recognition of the various fault signals are performed, and the responses of the 10 states are identified as the ten characteristic parameters of the fault. The final recognition results for the gearbox fault are the ten types of characteristic parameters: the amplitudes of the two fault kinds, the crack of the tooth root and the fracture of the gear, together with the increments of their amplitudes at the minor, middle and severe fault states and at different rotational speeds. It can be seen that the FVD-DLN network distinguishes these parameters of the gearbox well, which shows that the hidden layers of the designed FVD-DLN network have a good ability to extract distinguishing features from signals.

Fig. 12

Recognition effects to fault data of gearbox by the FVD-DLN method

In Fig. 12, the horizontal axis is the time axis (frame number k), and the vertical axis denotes different measures: the amplitude of the signal in Fig. 12a, b, and the increment of amplitude of the fault signal in Fig. 12c, d. \( f_{0} \) indicates the increment of amplitude of the tooth-root crack at the minor fault state, and its value is small; similarly, \( f_{1} \) and \( f_{2} \) indicate the increments at the middle and severe fault states, respectively, and their values are larger. \( g_{0} \), \( g_{1} \), \( g_{2} \), \( g_{3} \) and \( g_{4} \) indicate the increments of amplitude of the gear fracture at the minor and middle fault states, and at the severe fault state at rotational speeds of 1705.5 rpm, 6120.09 rpm and 8670 rpm, respectively; among these five parameters, \( g_{0} \) is the minimum and \( g_{4} \) is the maximum.

Figure 12a shows the input signal source, which includes healthy and various fault data to be classified and identified by the FVD-DLN network. Figure 12b shows that the first hidden layer of the network classifies the 100 data samples according to the nature of the fault parameters into two types: the first type is the crack of the tooth root and the second is the fracture of the gear. The second hidden layer then further subdivides the two types. The first type is divided into three parameters, the increments of amplitude of the tooth-root crack at the minor, middle and severe fault states, as shown in Fig. 12c. The second type is divided into five parameters, the increments of amplitude of the gear fracture at the minor and middle fault states, and at the severe fault state at three different rotational speeds, as shown in Fig. 12d. At time k, the responses of the 10 states in the final output layer of the FVD-DLN network are the values of the ten categories of parameters: two states are the amplitudes of the tooth-root crack and the gear fracture; three states are the increments of amplitude of the tooth-root crack at the minor, middle and severe fault states; and the remaining five states are the increments of amplitude of the gear fracture at the minor and middle fault states, and at the severe fault state at three different rotational speeds.

Assume that the total number of samples per experiment is \( {\mathbb{N}} \) and the number of experiments is \( {\mathbb{Z}} \). The fault samples are identified by the FVD-DLN method, the number of correctly identified samples is recorded as \( n_{i} (i = 1,2, \ldots ,{\mathbb{Z}}) \), and the recognition rate is defined as follows:

$$ R_{i} = \frac{{n_{i} }}{\mathbb{N}},\quad i = 1,2, \ldots ,{\mathbb{Z}} $$
(10)

where \( i = 1,2, \ldots ,{\mathbb{Z}} \) indexes the experiments, \( R_{i} \) is the correct recognition rate of the \( i{\text{th}} \) experiment, \( n_{i} \) is the number of correctly recognized samples in the \( i{\text{th}} \) experiment, and \( {\mathbb{N}} \) is the total number of samples in each experiment.
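Formula (10) and the summary statistics reported below can be sketched as follows; the correct-recognition counts here are assumed toy values, not experimental data.

```python
import numpy as np

def recognition_rate(n_correct, n_total):
    """Formula (10): R_i = n_i / N for one experiment."""
    return n_correct / n_total

# Toy run (counts assumed): Z = 100 experiments, N = 100 samples each.
rng = np.random.default_rng(1)
n = rng.integers(90, 101, size=100)       # correctly recognized per experiment
R = np.array([recognition_rate(k, 100) for k in n])
print(round(R.mean(), 4), round(R.var(), 6))   # mean and variance of the rates
```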

In this experiment, cross validation is used on the gearbox fault data set: the training and test samples are randomly selected from the data set each time, with a 5:1 ratio of training to test samples, and 100 experiments are carried out. The recognition rate of each of the 100 experiments is shown in Fig. 13.

Fig. 13

Recognition rates of fault in each of the 100 experiments

Over the 100 experiments, the recognition accuracy of the FVD-DLN method on the gearbox fault data differs from run to run. The mean accuracy of the 100 runs is 96.65% and the variance is 0.36, which shows that the FVD-DLN method given in this paper is effective and feasible.

  4. Analysis of advantages and disadvantages

From the experimental results, the fault diagnosis method based on the FVD-DLN has the following advantages:

  1. The FVD-DLN method can deal not only with simple faults with known causes, but also with uncertain faults caused by uncertain reasons.

  2. The FVD-DLN method can continuously reduce the complexity of the data set by relying on the attribute features. Moreover, by calculating the degree of similarity between unknown and known faults with the matching principle, it can deduce decision-making results with higher accuracy through attribute reduction.

  3. The FVD-DLN method can perform feature extraction for multi-dimensional parameters more accurately by extracting important attributes, and it decreases the computing time.

  4. The FVD-DLN method does not need any prior information apart from data processing.

The main disadvantages of the FVD-DLN method are:

  1. Since the calculations of the degrees of similarity are defined or given by experts or users, the subjectivity in processing a problem is correspondingly stronger.

  2. In a longer fault diagnosis chain, the system needs more parameter setup, so the discussion of the problem and its attributes grows accordingly.

4.4 Comparison of FVD-DLN and existing recognition methods

To verify the superiority of the FVD-DLN method, this paper also compares it for fault diagnosis with the classical neural network (CNN) (Wang et al. 2015, 2016), promoted wavelet (PW) (Bruna and Mallat 2013; Jiang and Yuan 2014), support vector machine (SVM) (Huang et al. 2011) and logical regression (LR) (Kankar et al. 2011).

\( {\mathbb{Z}} \) experiments were performed, with the number of samples j ranging from the minimum \( S_{\hbox{min} } \) to the maximum \( S_{\hbox{max} } \), i.e., \( S_{\hbox{min} } \le j \le S_{\hbox{max} } \). According to formula (10), each experiment has a correct recognition rate \( R_{i} \). Then, for each fixed number j of samples, the recognition rate is calculated as follows:

$$ C_{j} = \frac{1}{\mathbb{Z}}\sum\limits_{i = 1}^{\mathbb{Z}} {R_{i} } ,\quad S_{\hbox{min} } \le j \le S_{\hbox{max} } $$
(11)

where \( C_{j} \) is the correct recognition rate for the fixed number j of samples, and \( R_{i} \) is calculated by formula (10).

In the experiment in this section, for each fixed sample number j, 100 experiments are performed with each of the above methods, i.e., \( {\mathbb{Z}} = 100 \); all sample numbers \( 10 \le j \le 1000 \) are used. The experimental result is the fault recognition rate for each fixed number of samples, as shown in Fig. 14.

Fig. 14

Comparison of FVD-DLN and four existing methods in gearbox fault identification

Further, the mean correct recognition rate is obtained by averaging \( C_{j} \) from the minimum sample number \( S_{\hbox{min} } \) to the maximum \( S_{\hbox{max} } \):

$$ \bar{C} = \frac{1}{{S_{\hbox{max} } - S_{\hbox{min} } + 1}}\sum\limits_{{j = S_{\hbox{min} } }}^{{S_{\hbox{max} } }} {C_{j} } $$
(12)

where \( \bar{C} \) is the average correct recognition rate and \( C_{j} \) is calculated by formula (11).
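Formulas (11) and (12) chain together as two averages, which can be sketched as follows; the per-experiment rates are assumed toy values.

```python
import numpy as np

def mean_rate_at_size(rates):
    """Formula (11): C_j = (1/Z) * sum_i R_i for a fixed sample size j."""
    return float(np.mean(rates))

def overall_mean(c_values):
    """Formula (12): average of C_j over j = S_min ... S_max."""
    return float(np.mean(c_values))

# Toy data (assumed): Z = 100 rates R_i at each of 10 sample sizes j.
rng = np.random.default_rng(2)
C = [mean_rate_at_size(rng.uniform(0.9, 1.0, size=100)) for _ in range(10)]
print(0.9 <= overall_mean(C) <= 1.0)      # True
```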

In this experiment, the number of samples ranged from a minimum of 10 to a maximum of 1000, and each experiment was repeated 100 times. The results are shown in Fig. 14.

In Fig. 14, the horizontal axis is the number of samples and the vertical axis is the correct recognition rate. The correct recognition rate increases with the number of samples. When the number of samples reaches 400 or more, the correct recognition rates of the various methods stabilize, and the rate of the FVD-DLN method proposed in this paper is the highest. When the number of samples is less than 400, however, the advantages of the FVD-DLN method are not exhibited, so the method is suited to cases where the number of samples is relatively large.

To verify the accuracy of the various methods for fault diagnosis visually, a further simulation is carried out based on formula (12). The simulation results give the average accuracy of these methods for fault diagnosis; the average correct recognition rates are also shown in Fig. 15.

Fig. 15

Average correct recognition rate of several recognition methods

From Fig. 15, with the number of experimental samples ranging from 10 to 1000 and after 100 experiments, the average correct recognition accuracy of the FVD-DLN method for gearbox fault diagnosis is 96.65% with a variance of 0.36. This shows that the FVD-DLN method presented in this paper is effective and feasible.

From Figs. 14 and 15, in large-sample fault diagnosis, once the number of samples reaches a certain value, the recognition rate of the FVD-DLN method for gearbox fault diagnosis is higher than that of the several existing diagnosis methods. The FVD-DLN method therefore has good diagnostic capability for gearbox faults.

The recognition effect of the FVD-DLN method on gearbox faults is superior to that of the several existing methods, because the FVD-DLN method has both higher recognition accuracy and faster recognition speed. The comprehensive comparison results are shown in Table 3.

Table 3 Comprehensive comparison of FVD-DLN and several existing recognition methods

Comparing the five methods shows that the FVD-DLN method achieves the highest accuracy, with an average accuracy rate of 96.65% and an average recognition time of 0.612 s. The average accuracies of the existing methods, promoted wavelet, support vector machine, classical neural network and logical regression, are 93.52%, 91.87%, 87.36% and 79.98%, with average recognition times of 0.826 s, 1.087 s, 0.938 s and 0.896 s, respectively.

The further comparison between the FVD-DLN method and the existing methods is summarized as follows:

  1. In dealing with uncertainty, the existing methods express uncertainty with probability or evidence, whereas the FVD-DLN method can use not only probability but also the degree of similarity to express uncertainty.

  2. In computational complexity, the existing methods have exponential information and time complexity, whereas the FVD-DLN method decreases computing time and reduces storage through attribute reduction.

  3. The existing methods cannot correctly distinguish whether a fault cause stems from uncertainty or from ignorance, whereas the FVD-DLN method distinguishes the two more intuitively by using the feature extraction algorithm in the hidden layers.

  4. The existing methods require assuming prior probabilities, conditional probabilities or independent evidence, whereas the FVD-DLN method needs no prior information besides data processing.

Besides the advantages of the FVD-DLN method mentioned above, its greatest disadvantage is that it requires setting the system parameters, which the existing methods do not need.

5 Conclusion

This paper proposes the FVD-DLN fault diagnosis method to classify and identify gearbox faults. The method first performs spatial reconstruction and singular value decomposition on the input signal, extracts features and gives the feature extraction method; second, the singular values are input as feature signals into the DLN network for learning and training; third, the FVD-DLN network is constructed based on signal feature extraction and the deep learning method. Finally, the trained FVD-DLN network is applied to the fault diagnosis of an automotive transmission gearbox. The application covers four aspects: training and testing the constructed network on actual data, matching and recognition, analysis of the advantages and disadvantages of the FVD-DLN diagnosis method, and comparison with existing recognition methods. Compared with the referenced fault diagnosis methods, the greatest advantage of the FVD-DLN method is that it not only has a faster processing speed but is also applicable to environments with dense signal categories and more conversion and transmission errors. Its biggest disadvantage is that the system parameter setup is complicated; for example, some parameters in the weight adjustment need a great deal of simulation to determine, and these parameters are related to the choice of the threshold value. These studies show that the method proposed in this paper has good accuracy, stability and rapidity.

Although deep learning has achieved great success in practice (Zeng et al. 2016), much theoretical analysis of the deep models trained on large-scale data (Ren et al. 2017) remains to be done in the future. Moreover, joint deep learning and multi-stage deep learning are two examples for which there is more work to do in this area.