1 Introduction

Batch processes account for an increasing proportion of complex industrial fields such as food, medicine, and chemical production. As the safety and stability of batch processes gain more and more attention, intelligent fault diagnosis, as an important technical means of ensuring reliability, has become a popular research topic in academia and industry [1]. Owing to discontinuous operation and complex production environments, the process data of incipient faults are multivariate, nonlinear, and high-dimensional. When an incipient fault occurs, its low amplitude makes it easily masked by noise, so diagnosing the fault effectively becomes extremely difficult [2,3,4]. How to effectively diagnose and identify incipient faults in batch processes has therefore become an increasingly pressing problem.

With the increasing complexity of batch processes, it is difficult to establish an accurate analytical mathematical model or an efficient expert knowledge system. Over the past decades, data-driven fault diagnosis methods have been widely used in batch processes; they can typically be divided into two categories: multivariate statistical models and machine learning models. Many multivariate statistical methods have been applied to obtain incipient fault features, such as the cumulative sum, principal component analysis, kernel principal component analysis, and kernel entropy component analysis [5,6,7,8,9]. The cumulative sum, an effective data analysis tool, is used to detect small changes in a process. However, because the variables of a complex production process are closely correlated, traditional per-variable cumulative sum control charts may generate false alarms and missed detections [5]. Principal component analysis and its improved variants have been used to study batch production process monitoring [6, 7], but these approaches are not suitable for nonlinear problems. Kernel principal component analysis and kernel entropy component analysis have been used to address nonlinear fault problems [8, 9]. Statistical methods have achieved a great deal in fault detection and diagnosis of batch processes. However, with the wide application of distributed control systems and the continuous development of data storage, transmission, and processing technologies, intelligent fault diagnosis based on machine learning has become more suitable, as it can effectively analyze and judge multi-source, high-dimensional data [10].

Machine learning methods overcome a shortcoming of traditional detection methods, which cannot judge the exact fault type: they learn fault mode features from process data and judge the fault class directly to realize fault identification [11]. The support vector machine (SVM), a typical machine learning method, has attracted increasing attention in batch process fault diagnosis because of its remarkable results [12]. Ji et al. [13] put forward an algorithm based on ensemble empirical mode decomposition and support vector machines for sensor fault detection and classification. However, the SVM is structurally a shallow network, which limits the classifier's ability to learn the complex nonlinear and non-stationary relationships contained in the signals. With the development of computing power and algorithms, many deep learning technologies have advanced rapidly [14]. Convolutional neural networks (CNN) [15], recurrent neural networks (RNN) [16], long short-term memory (LSTM) [16], and other deep neural network methods are growing fast, and deep learning methods have already been applied to batch process fault diagnosis. Wang and Wang [17] introduced LSTM to perform feature learning, obtaining a deep feature expression of the data and realizing fault diagnosis of a batch process. Zhang et al. [18] established a weighted autoregressive LSTM network to extract in-depth features and realize effective fault diagnosis on the TE process. Despite the progress made by state-of-the-art methods in intelligent fault diagnosis, the focus of incipient fault diagnosis in batch processes is how to obtain effective characteristics of the data, and deep neural networks still suffer from the following three limitations:

(1) Incipient faults have low amplitude and are susceptible to noise interference, so they are easily concealed by noise when processed with machine learning methods.

(2) Batch process data are nonlinear, high-dimensional, and mutually coupled across variables, and can be regarded as a multivariate time series. Traditional machine learning methods cannot effectively obtain in-depth characteristics from such data.

(3) Traditional deep neural network models, such as the original LSTM, do not sufficiently handle the incipient fault information inherent in batch process data.

In this paper, we propose an intelligent incipient fault diagnosis model based on a deep time series feature extraction network. Firstly, a denoising autoencoder (DAE), an LSTM with layer normalization and dropout layers added between its layers (LD-LSTM), and a stacked autoencoder (SAE) are integrated to obtain efficient features. Then, the softmax function is applied to achieve incipient fault diagnosis based on the deep time series features. Finally, a fed-batch penicillin fermentation process is used to compare and analyze the accuracy of different diagnosis models according to five evaluation metrics commonly used in the fault diagnosis field. The main contributions of this paper are summarized as follows:

(1) An interpretable data-driven incipient fault diagnosis model is proposed based on a deep time series feature extraction network, which integrates DAE, LD-LSTM, and SAE to obtain efficient features from batch process time series data.

(2) To address slow model convergence and over-fitting, layer normalization and dropout layers are added between the layers of the LSTM (LD-LSTM) to improve performance.

(3) Compared with several other models, the proposed incipient fault diagnosis model obtains better fault diagnosis accuracy and more stable results on the batch process.

The remainder of this paper is organized as follows. Section 2 describes the main concepts used in this paper. In Sect. 3, the detailed procedures of the incipient fault diagnosis model based on the deep time series feature extraction network are introduced. The experimental evaluation is given in Sect. 4. Section 5 presents conclusions and suggestions for future work.

2 Preliminaries

2.1 Autoencoder Network and its Variants

2.1.1 Basic Autoencoder

Autoencoder (AE) is a type of artificial neural network employed in unsupervised or semi-supervised learning. It is primarily used for dimensionality reduction and feature extraction, yielding an efficient representation of the input data [19]. An AE network usually consists of an encoder and a decoder. The encoder produces the representation y of the input vector x, and the decoder uses this representation to reconstruct the original input by minimizing the residual between the input vector x and the reconstruction \(\hat{x}\). When the reconstruction loss is small enough, the representation y retains most of the information of the input vector x [20]. The number of neurons in the hidden layer of the AE network is smaller than that of the input and output layers, so the hidden code can be regarded as an efficient expression of the data. The encoder receives the input data and passes it to the hidden layer to obtain a compressed representation; the decoder passes the low-dimensional code to the output layer to reconstruct the original data. Figure 1 shows the classic AE network structure.

Fig. 1

The classic AE network structure

In the AE network training process, the input data x is mapped to the hidden layer to obtain the code y, and reverse decoding is then performed on the code to obtain the input reconstruction \(\hat{x}\). The dimension of the input vector x is m, the number of input samples is n, and the dimension of the code y is m'.

Encoder: The encoder is formulated as:

$$ y = f_{{{\text{encoder}}}} (W_{e} x + b_{e} ) $$
(1)

where \(f_{{{\text{encoder}}}}\) is an activation function such as sigmoid, tanh, or identity, \(W_{e}\) is an \(m' \times m\) weight matrix, and \(b_{e}\) is an offset vector of dimension m'.

Decoder: The decoder is formulated as:

$$ \hat{x} = f_{{{\text{decoder}}}} (W_{d} y + b_{d} ) $$
(2)

where \(f_{{{\text{decoder}}}}\) is also an activation function, \(W_{d}\) is an \(m \times m'\) weight matrix, and \(b_{d}\) is an offset vector of dimension m.

Loss function: The autoencoder obtains suitable parameters \(\theta = (W_{e} ,W_{d} ,b_{e} ,b_{d} )\) by minimizing the loss function, which is formulated as:

$$ J(\theta ) = L(x,\hat{x}) + \lambda \left\| W \right\|^{2} = \sum\limits_{i = 1}^{n} {\left\| {x_{i} - \hat{x}_{i} } \right\|^{2} } + \lambda (\left\| {W_{e} } \right\|^{2} + \left\| {W_{d} } \right\|^{2} ) $$
(3)

where \(\lambda \left\| W \right\|^{2}\) is a regularization term that avoids over-fitting by penalizing the L2 norm of the weights, and \(L(x,\hat{x})\) is the reconstruction loss.

The goal of AE network training is to minimize the error between the input and its reconstruction. The intermediate code y is the quantity of interest, since it can be regarded as an efficient expression of the input data.
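As an illustration, Eqs. (1)–(3) translate into a few lines of PyTorch. The following is a minimal sketch, assuming arbitrary layer sizes and sigmoid activations; the \(\lambda \left\| W \right\|^{2}\) term of Eq. (3) is approximated by the optimizer's weight_decay, which penalizes all parameters:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Basic AE: encode x (dim m) to y (dim m'), then reconstruct x."""
    def __init__(self, m: int, m_prime: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(m, m_prime), nn.Sigmoid())  # Eq. (1)
        self.decoder = nn.Sequential(nn.Linear(m_prime, m), nn.Sigmoid())  # Eq. (2)

    def forward(self, x):
        y = self.encoder(x)
        return self.decoder(y), y

model = Autoencoder(m=17, m_prime=8)
# weight_decay stands in for the lambda*||W||^2 regularizer of Eq. (3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.MSELoss()                 # mean-squared form of L(x, x_hat)

x = torch.rand(32, 17)                   # a dummy mini-batch
x_hat, y = model(x)
optimizer.zero_grad()
criterion(x_hat, x).backward()
optimizer.step()
```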

2.1.2 Denoising Autoencoder (DAE)

The denoising autoencoder (DAE) is an extension of the autoencoder. It adds noise to the input data, and training enables the autoencoder to recover the uncorrupted data from the noisy input [21]. The training process forces the encoder to learn the most significant features of the input data, so its generalization ability is better than that of a plain autoencoder. The DAE also avoids two shortcomings of the general AE: that it may learn the identity function, and that the number of samples must exceed the sample dimension. By minimizing the denoising reconstruction error, the true original input is reconstructed from data containing random noise. Adding noise to the input is equivalent to neurons dropping data with a certain probability, analogous to the common dropout operation in neural networks. The model structure is shown in Fig. 2.

Fig. 2

The schematic diagram of the DAE structure
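In code, the only change from the basic AE is the training pair: the encoder receives a corrupted copy of x, while the loss compares the reconstruction against the clean x. A minimal sketch of one training step, reusing the Autoencoder above (additive Gaussian noise is an assumption here; masking noise, as the dropout analogy suggests, works equally well):

```python
def dae_step(model, x, criterion, optimizer, noise_std=0.1):
    """One DAE training step: corrupt the input, reconstruct the clean target."""
    x_noisy = x + noise_std * torch.randn_like(x)  # additive Gaussian corruption
    # Masking-noise alternative: x_noisy = nn.functional.dropout(x, p=0.2)
    x_hat, _ = model(x_noisy)
    loss = criterion(x_hat, x)                     # target is the uncorrupted input
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```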

2.1.3 Stacked Autoencoder (SAE)

When handling complex process datasets, the representation extracted by a single autoencoder may not meet the requirements. A stacked autoencoder (SAE) can generate a more appropriate representation; it is constructed by stacking several individual autoencoders, and each layer of the network is trained in turn using the layer-by-layer greedy training method [22].

Given a stacked autoencoder containing l autoencoders, the process of encoding is formulated as:

$$ y = f_{{{\text{encoder}}_{{\text{l}}} }} (f_{{{\text{encoder}}_{l - 1} }} ( \cdots f_{{{\text{encoder}}_{1} }} (x))) $$
(4)

where \(f_{{{\text{encoder}}_{i} }}\) is the activation function of the ith encoder.

The process of decoding is formulated as:

$$ \hat{x} = f_{{{\text{decoder}}_{l} }} (f_{{{\text{decoder}}_{l - 1} }} ( \cdots f_{{{\text{decoder}}_{1} }} (y))) $$
(5)

where \(f_{{{\text{decoder}}_{i} }}\) is the activation function of the ith decoder. The autoencoders in the stacked autoencoder can be trained sequentially.

To train an SAE, the original input data are used to train the first AE network; the hidden layer output of the first AE then serves as the input of the second AE, which is trained in the same way. Repeating these steps several times, the output of the last hidden layer is obtained as an efficient expression, which can be regarded as the deep characteristics of the original data (see Fig. 3).

Fig. 3

The schematic diagram of SAE structure
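The layer-by-layer greedy procedure can be sketched as follows, again assuming the Autoencoder class and imports from Sect. 2.1.1; each AE is trained on the hidden output of its predecessor, and the trained encoders are stacked to realize Eq. (4):

```python
def train_sae(x, layer_sizes, epochs=50, lr=1e-3):
    """Greedy layer-wise SAE training; returns the stacked encoder."""
    encoders, data = [], x
    for in_dim, hid_dim in zip(layer_sizes[:-1], layer_sizes[1:]):
        ae = Autoencoder(in_dim, hid_dim)
        opt = torch.optim.Adam(ae.parameters(), lr=lr)
        for _ in range(epochs):
            x_hat, _ = ae(data)
            loss = nn.MSELoss()(x_hat, data)
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            _, data = ae(data)       # hidden output feeds the next AE
        encoders.append(ae.encoder)
    return nn.Sequential(*encoders)  # stacked encoder, as in Eq. (4)
```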

2.2 Long Short-Term Memory

The long short-term memory network is essentially an improved recurrent neural network (RNN). The neurons in the same layer of a fully connected network are independent of each other, which imposes strong constraints when dealing with sequence problems [23]. An RNN is suited to sequence problems: it adds a loop structure to the hidden layer of a traditional neural network, so that the neuron at the current moment is connected with the neuron at the previous moment and the information of the previous time step can be stored. Specifically, the output of the RNN at the current moment depends not only on the input at the current moment but also on the intermediate state at the previous moment. Through this dependency, the RNN establishes correlation between neurons in the same layer; the structure is shown in Fig. 4.

Fig. 4

Schematic diagram of the RNN structure

where \(x_{t}\) denotes the input at time \(t\); \(s_{t}\) denotes the intermediate state at time \(t\), which can be understood as the 'memory' of the neuron at the current time; \(o_{t}\) represents the output at time \(t\); and \(U,W,V,b_{1} ,b_{2}\) are the weight parameters to be learned, shared within the same layer. From the schematic diagram of the RNN structure, we obtain the following equations:

$$ s_{t} = f_{1} (Ux_{t} + Ws_{t - 1} + b_{1} ) $$
(6)
$$ o_{t} = f_{2} (Vs_{t} + b_{2} ) $$
(7)
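Written out as code, the recurrence of Eqs. (6) and (7) is a simple loop over time steps. A minimal sketch, with tanh and identity standing in for \(f_1\) and \(f_2\) and purely illustrative dimensions:

```python
import torch

d_in, d_hid, d_out, T = 17, 32, 4, 100
U = torch.randn(d_hid, d_in); W = torch.randn(d_hid, d_hid); b1 = torch.zeros(d_hid)
V = torch.randn(d_out, d_hid); b2 = torch.zeros(d_out)

s = torch.zeros(d_hid)                      # initial state s_0
outputs = []
for x_t in torch.randn(T, d_in):            # iterate over the input sequence
    s = torch.tanh(U @ x_t + W @ s + b1)    # Eq. (6): state from input and old state
    outputs.append(V @ s + b2)              # Eq. (7): output from the current state
```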

In principle the input sequence of an RNN can be arbitrarily long, but in practice an RNN can only establish strong connections with neurons in a relatively close range [24]. When only relatively recent information is needed, this is sufficient. However, if the time interval becomes very long, the RNN may lose its effect and suffer from vanishing or exploding gradients: the connection weight within the RNN decreases as the interval between the prediction and the related information increases. The LSTM network was proposed to solve this long-term dependence problem.

On the basis of the RNN, the LSTM network introduces a gating mechanism inside the cell, which effectively mitigates the vanishing and exploding gradient problems [25]. The transmission of state information resembles movement along a conveyor belt: the information passes through the entire cell with only a few linear operations. This structure allows information to pass through the cell almost unchanged; the structure is shown in Fig. 5.

Fig. 5

LSTM cell structure

In Fig. 5, the upper horizontal line alone cannot add or delete cell-state information, so gate structures are introduced to process information. A gate changes the degree of information retention through a sigmoid neural layer and point-wise multiplication. The sigmoid function maps its input into the interval [0,1], which represents the probability of the corresponding information passing: 0 means no information passes, and 1 means all of it passes. Each LSTM cell has three gates that control the flow of information: the forget gate, the input gate, and the output gate [26].
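For reference, the three gates and the state update take the standard form below (the common LSTM formulation, which the original text does not spell out; \(\sigma\) is the sigmoid function, \([h_{t-1}, x_t]\) the concatenation of the previous hidden state and the current input, and \(\odot\) element-wise multiplication):

$$ f_{t} = \sigma (W_{f} [h_{t - 1} ,x_{t} ] + b_{f} ),\quad i_{t} = \sigma (W_{i} [h_{t - 1} ,x_{t} ] + b_{i} ),\quad o_{t} = \sigma (W_{o} [h_{t - 1} ,x_{t} ] + b_{o} ) $$

$$ c_{t} = f_{t} \odot c_{t - 1} + i_{t} \odot \tanh (W_{c} [h_{t - 1} ,x_{t} ] + b_{c} ),\quad h_{t} = o_{t} \odot \tanh (c_{t} ) $$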

3 Proposed Deep Time Series Feature Extraction Model

In actual industrial production, the equipment operating status data collected by sensors are often disturbed, which increases the difficulty of feature extraction and fault diagnosis. This paper proposes a deep time series feature extraction model to realize intelligent fault diagnosis. The deep time sequence features of the fault data are obtained by a three-part network that integrates DAE, LD-LSTM, and SAE. The structure of the model is shown in Fig. 6.

Fig. 6

Deep time series feature extraction model

The first part of the model is the DAE layer, which provides denoising and data restoration. The original data are standardized and fed into the DAE network. The DAE consists of 7 layers: the number of neurons in the input layer equals the dimension n of the original data, the second layer has 900 neurons, the third layer 300, and the hidden (bottleneck) layer 64. Owing to the symmetry of the AE structure, the subsequent layers mirror the first three, with 300, 900, and n neurons, respectively. A dropout layer after the input layer randomly deactivates some neurons, so that after training the network restores the true data. The output of this part is used as the input of the next one.
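This first stage can be sketched as follows (the dropout probability and the ReLU activations between layers are assumptions, as the text does not report them):

```python
import torch.nn as nn

def build_dae(n: int, p_drop: float = 0.2):
    """7-layer DAE: n-900-300-64-300-900-n, with input dropout as the corruption."""
    return nn.Sequential(
        nn.Dropout(p_drop),               # randomly deactivates some input neurons
        nn.Linear(n, 900), nn.ReLU(),
        nn.Linear(900, 300), nn.ReLU(),
        nn.Linear(300, 64), nn.ReLU(),    # 64-unit hidden (bottleneck) layer
        nn.Linear(64, 300), nn.ReLU(),
        nn.Linear(300, 900), nn.ReLU(),
        nn.Linear(900, n),                # restores the original dimension n
    )
```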

In terms of time series processing, the LSTM can obtain essential features from complex time series. Moreover, an LSTM with layer normalization and dropout layers (LD-LSTM) mitigates slow model convergence and over-fitting. Therefore, the second part of the model uses the LD-LSTM network for feature extraction. The LSTM adopts a two-layer recurrent structure with 128 hidden units, and the value at the last time step is taken as the output time series feature, as in the sketch below.
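A minimal LD-LSTM sketch, with layer normalization and dropout inserted between the two LSTM layers (the dropout rate is an assumption):

```python
import torch
import torch.nn as nn

class LDLSTM(nn.Module):
    """Two LSTM layers (128 hidden units) with LayerNorm and Dropout between them."""
    def __init__(self, n_in: int, n_hid: int = 128, p_drop: float = 0.3):
        super().__init__()
        self.lstm1 = nn.LSTM(n_in, n_hid, batch_first=True)
        self.ln1 = nn.LayerNorm(n_hid)
        self.drop = nn.Dropout(p_drop)
        self.lstm2 = nn.LSTM(n_hid, n_hid, batch_first=True)
        self.ln2 = nn.LayerNorm(n_hid)

    def forward(self, x):                  # x: (batch, time, features)
        h, _ = self.lstm1(x)
        h, _ = self.lstm2(self.drop(self.ln1(h)))
        return self.ln2(h)[:, -1, :]       # value at the last time step
```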

However, the time series features extracted by the LD-LSTM network still contain irrelevant and redundant components, which hinder incipient fault diagnosis. Therefore, the third part of the model uses an SAE layer for feature compression and dimensionality reduction, extracting the most critical features from the time series features. An SAE with three layers is used, whose numbers of neurons are 128, 64, and 128, respectively; the input layer of the AE matches the number of hidden units of the LD-LSTM. Through the SAE layer, the time sequence features from the LD-LSTM are extracted layer by layer; the deep time sequence features are then flattened through the output layer and used as the input of the softmax classifier to achieve fault diagnosis. The flowchart of the proposed deep time series feature extraction model is shown in Fig. 7, and a sketch assembling the three stages follows the figure.

Fig. 7

The flowchart of the proposed deep time series feature extraction model
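Assembled end to end, the three stages take roughly the following shape (a sketch reusing build_dae and LDLSTM from above; whether the DAE acts per time step or on flattened batches is an implementation assumption here, and the DAE/SAE can be pre-trained as in Sect. 2.1 before fine-tuning). PyTorch's CrossEntropyLoss applies the softmax internally, so the classifier head ends in a linear layer:

```python
import torch.nn as nn

class DeepTSFeatureNet(nn.Module):
    """DAE denoising -> LD-LSTM temporal features -> SAE compression -> classifier."""
    def __init__(self, n_vars: int, n_classes: int = 4):
        super().__init__()
        self.dae = build_dae(n_vars)                 # stage 1: denoising/restoration
        self.ldlstm = LDLSTM(n_vars, n_hid=128)      # stage 2: time series features
        self.sae = nn.Sequential(                    # stage 3: 128-64-128 compression
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
        )
        self.classifier = nn.Linear(128, n_classes)  # softmax applied by the loss

    def forward(self, x):                            # x: (batch, time, n_vars)
        x = self.dae(x)                              # applied per time step (assumption)
        feats = self.sae(self.ldlstm(x))
        return self.classifier(feats)
```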

4 Experimental Research

4.1 Data Source

The penicillin fermentation process is the metabolic activity in which penicillin-producing bacteria synthesize antibiotics under appropriate conditions of medium, temperature, pH, and stirring. The schematic diagram of the fermentation process is shown in Fig. 8. The penicillin fermentation process is a typical multi-stage biochemical reaction process. From the perspective of operating technology, it is mainly divided into two stages: the batch operation stage and the semi-batch (fed-batch) feeding stage. In the initial batch stage, most of the bacteria are produced, and the initially added substrate (glucose) is consumed by the bacteria. When the reaction enters the second, fed-batch stage, the growth rate of the bacteria must stay below a certain value to ensure a higher yield, so glucose must be continuously added to the fermentation tank at this stage. This stage includes the synthesis period of penicillin and the death period of the bacteria. The penicillin fermentation process is therefore a typical batch industrial process.

Fig. 8

Schematic diagram of penicillin simulation experiment platform

Pensim 2.0 is utilized to generate the experimental data in this paper; it can simulate the process variables under various conditions and provides a standard platform for fault diagnosis of batch processes. The core of the software is the improved Birol model based on the Bajpai mechanism, in which the changes of numerous variables can be output as data to simulate the changes of various parameters during fermentation [27]. In the penicillin fermentation model, there are 5 input variables controlling the process parameters, 9 process variables describing bacterial synthesis and growth, and 5 quality variables affecting the penicillin yield; the parameters are shown in Fig. 9. The temperature is regulated through the flow rates of cold and hot water under PID control, and the pH value is regulated through the flow rate of the acid–base solution, also under PID control [28]. The initial software conditions in the experiment are set as in Tables 1 and 2; Table 1 gives the initial reaction condition setting ranges and Table 2 the reaction process parameter setting ranges.

Fig. 9

Input and output of penicillin fermentation process

Table 1 Pensim2.0 initial condition setting
Table 2 Pensim2.0 parameter settings

In the penicillin fermentation process, abnormal situations such as valve leaks or pump failures often occur. To simulate actual production, Pensim 2.0 generates three common faults: air flow (Fault 1), agitation power (Fault 2), and bottom flow rate (Fault 3), and each fault type can take two signal forms (step and ramp) [29]. The three faults are set as small-amplitude step signals with an amplitude of -2% of the initial set point, introduced at 50 h. To ensure data diversity, the initial values and parameters are randomly generated, and the normal state and the three fault types are each repeated 1000 times, producing 4000 batches of data in total (1000 per class). Random noise with mean 0 and variance 1 is added to imitate the disturbances of actual production data.

4.2 Model Simulation Experiments

This experiment uses Python 3.7 as the programming language and PyTorch 1.0 as the deep learning framework. The experimental data are encapsulated into a data loading function by PyTorch; the first 800 samples of each class are used as training data and the last 200 as test data. The batch_size parameter is set to 32, so the network trains on 32 samples at a time.
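This packaging can be sketched as follows (TensorDataset/DataLoader are assumed as the loading functions; X and y are dummy stand-ins with a placeholder sequence length):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Dummy stand-ins: 4000 batches (1000 per class), placeholder sequence length 400
X = torch.rand(4000, 400, 17)
y = torch.repeat_interleave(torch.arange(4), 1000)

# First 800 of each class for training, last 200 for testing
train_idx = torch.cat([torch.arange(c * 1000, c * 1000 + 800) for c in range(4)])
test_idx = torch.cat([torch.arange(c * 1000 + 800, (c + 1) * 1000) for c in range(4)])

train_loader = DataLoader(TensorDataset(X[train_idx], y[train_idx]),
                          batch_size=32, shuffle=True)
test_loader = DataLoader(TensorDataset(X[test_idx], y[test_idx]), batch_size=32)
```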

In the obtained experimental data, each batch has 17 variables, which usually have different dimensions, scales, and orders of magnitude. If the original data are used directly, variables with large magnitudes dominate the comprehensive analysis while small-valued variables are weakened, and the differing orders of magnitude slow network training and degrade its effect. Therefore, the data must be standardized before training. In this paper, min-max normalization is used to perform a linear transformation of the original data. This transformation does not change the distribution of the data; the original data are mapped to the interval [0,1]. The calculation is as follows.

Transform the sequence \(x_{1} ,x_{2} ,...,x_{n}\):

$$ y_{i} = \frac{{x_{i} - \mathop {\min }\limits_{1 \le j \le n} \{ x_{j} \} }}{{\mathop {\max }\limits_{1 \le j \le n} \{ x_{j} \} - \mathop {\min }\limits_{1 \le j \le n} \{ x_{j} \} }} $$
(8)

The new sequence \(y_{1} ,y_{2} ,...,y_{n} \in [0,1]\) is dimensionless.
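Applied per variable over all samples and time steps, Eq. (8) is a one-liner; a sketch:

```python
def minmax_normalize(x):
    """Eq. (8): scale each of the 17 variables to [0, 1] across samples and time."""
    flat = x.reshape(-1, x.shape[-1])    # collapse batch and time dimensions
    lo, hi = flat.min(dim=0).values, flat.max(dim=0).values
    return (x - lo) / (hi - lo)          # broadcasts back over (batch, time, vars)
```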

The standardized training data are fed into the established deep time series feature extraction model. The number of iterations (Epoch) is set to 200 and the learning rate lr to 0.01. Adam is chosen as the optimizer, CrossEntropyLoss as the multi-class loss function, and ReLU as the activation function. The loss and the accuracy on the training set are recorded during training; the results are shown in Fig. 10.
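This configuration corresponds to a standard training loop of roughly the following shape (a sketch reusing DeepTSFeatureNet and train_loader from the earlier snippets; only the hyperparameters stated above come from the text):

```python
model = DeepTSFeatureNet(n_vars=17)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()      # multi-class loss, softmax inside

for epoch in range(200):
    model.train()
    correct = total = 0
    for xb, yb in train_loader:              # data already min-max normalized
        logits = model(xb)
        loss = criterion(logits, yb)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        correct += (logits.argmax(dim=1) == yb).sum().item()
        total += yb.numel()
    print(f"epoch {epoch}: loss {loss.item():.4f}, train acc {correct / total:.4f}")
```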

Fig. 10

The performance of training set

It can be seen from Fig. 10 that the deep time series extraction model obtained by this preliminary training does not perform well: the loss fluctuates around a large value, and the training accuracy is unstable, oscillating around 80%. The final accuracy is only 81.73%, so the model needs to be optimized.

4.3 Model Optimization

(1) Deep LSTM

From the perspective of the basic model, training gradually converges, but the result is poor: the convergence curve oscillates violently, and the training accuracy is only about 80%. A likely reason is insufficient fitting ability of the network on this data set, i.e., under-fitting. The first thing to consider is the network structure, and the most critical part is the LSTM in the deep time series feature extraction model, which extracts the time series features of the data. The quality of the LSTM determines whether the model can extract useful temporal features and thus affects the diagnostic performance. In the basic model, the LSTM adopts a one-way, single-layer structure with 64 hidden units. Given the insufficient fitting ability, we deepen the LSTM to a bidirectional four-layer structure with 128 hidden units and observe the training performance. The performance of the training set in the deep LSTM model is shown in Fig. 11.

Fig. 11

The performance of training set in deep LSTM model

It can be seen from Fig. 11 that the deep LSTM variant still does not perform well: the accuracy and loss fluctuate greatly and converge slowly. Moreover, although the training accuracy approaches 100%, the test accuracy is only 85.65%, so the model over-fits.

(2) LSTM with layer normalization (LN) and dropout layers (LD-LSTM)

To address slow model convergence and over-fitting, the deep model should be optimized in terms of both structure and hyperparameters. In this paper, layer normalization (LN) and dropout layers are added between the layers of the LSTM (LD-LSTM) to improve performance. LN normalizes the activation values of neurons, reduces network training time, speeds up convergence, and enhances network stability. Dropout randomly deactivates neurons to limit excess network capacity and effectively combats over-fitting. The training performance of the LD-LSTM model and the fault diagnosis results are shown in Figs. 12 and 13.

Fig. 12

The performance of training set of LD-LSTM model

Fig. 13

Fault diagnosis performance of penicillin fermentation process

Figure 12 shows the accuracy and loss on the training set. The improved model with LN and dropout layers converges by about 150 iterations, a marked improvement in convergence speed. Moreover, the prediction results on the test samples, shown in Fig. 13, give a diagnosis accuracy of 98.87%, demonstrating that the improved model is more effective.

To visualize the effect of the optimization, Fig. 14 shows the accuracy and loss of the different variants. The deep LSTM and LD-LSTM fit the data better than the unoptimized model; both eventually reach a higher accuracy and a lower loss, and the LD-LSTM converges faster. Together with the over-fitting of the deep LSTM discussed above, these results preliminarily confirm the effectiveness of the proposed model.

Fig. 14

Performance before and after model optimization

4.4 Model Stability

As is well known, deep learning models exhibit a certain randomness during training, so the model obtained from each run differs. To characterize the general performance of the proposed model, we conduct 20 repeated experiments and record the results to assess stability; the average over the 20 runs summarizes it. The results of the 20 experiments are shown in Fig. 15.

Fig. 15

Accuracy performance of repeated experiment model

In Fig. 15, the red points represent the test set accuracy obtained from each trained model, and the blue line represents the average over the 20 runs. The best accuracy the model achieves is 98.87%, the lowest is 92.05%, and the average reaches 96.84%. These figures demonstrate that the model is effective in terms of fault diagnosis accuracy. Some instability remains, however, and improving the stability of fault diagnosis is left for future work.

4.5 Comparative Verification

An incipient fault diagnosis model for batch processes is proposed based on deep time series feature extraction, with the LN and dropout layers used to optimize the network structure; its effectiveness is verified on the penicillin fermentation process. Firstly, the original data are preprocessed with min-max normalization, which eliminates the influence of dimensions, speeds up model computation, and increases model accuracy. Then, the preprocessed data are fed into the models for training. To investigate the effectiveness of the proposed model, we compare against five other models. The support vector machine (SVM) uses constants fixed from experience and prior studies (C is 100 and γ is 0.5); the kernel extreme learning machine (KELM) parameter is taken from [0.5,100], with the data flattened; and the BP network is a 5-layer fully connected network whose layer sizes are 800*17, 256, 128, 64, and 4, respectively. In addition, LSTM and deep LSTM models are used for comparison. The structural parameters of the different machine learning and deep learning models are shown in Table 3. The average training and prediction accuracies of the models are presented in Table 4, where Epoch, lr, and the other hyperparameters are kept consistent under the same data preprocessing.

Table 3 The structure parameters of different machine learning and deep learning methods
Table 4 Diagnostic accuracy of different models
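For reference, the SVM baseline with the stated constants can be reproduced along the following lines (a sketch using scikit-learn; the flattening mirrors the KELM preprocessing, and X, y, and the index splits reuse the dummy packaging from Sect. 4.2):

```python
from sklearn.svm import SVC

# Flatten each (seq_len, 17) sample into a single vector, as for KELM
X_train = X[train_idx].reshape(len(train_idx), -1).numpy()
X_test = X[test_idx].reshape(len(test_idx), -1).numpy()

svm = SVC(C=100, gamma=0.5, kernel="rbf")
svm.fit(X_train, y[train_idx].numpy())
print("test accuracy:", svm.score(X_test, y[test_idx].numpy()))
```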

It can be seen from Table 4 that traditional machine learning models such as SVM and KELM perform poorly, with accuracy of only about 76%, while the classic BP network reaches only 72.56%. Typical shallow networks thus cannot effectively diagnose incipient faults of batch processes. When the LSTM network is used for feature extraction without data denoising and deep feature extraction, the model is still unsatisfactory: both the training and test fault recognition rates are about 85%. Nevertheless, this indicates that the LSTM network handles time series problems better than traditional shallow networks.

When the bidirectional four-layer deep LSTM is used for training, the training accuracy reaches 99.33%. However, owing to the complexity of the model, over-fitting occurs and generalization suffers: the test accuracy is only 85.65%. In this paper, the LN and dropout layers are added to the LSTM (LD-LSTM), introducing normalization and a defense against over-fitting. The over-fitting is effectively resolved, and the generalization ability and robustness are improved. The DAE and SAE are added to the model to purify the time series data and extract the key features. The final accuracy reaches 98.87%, which is 13.22 percentage points higher than before the improvement. The comparative experiments fully demonstrate the effectiveness and superiority of the proposed deep time series feature extraction network.

5 Conclusions and Future Work

In this paper, we propose an incipient fault diagnosis model for batch processes based on deep time series feature extraction. The LSTM network serves as the base model, with DAE and SAE introduced to give the network denoising and deeper feature extraction abilities. In addition, LN and dropout layers are introduced to improve the network and accelerate convergence, which alleviates over-fitting and improves diagnosis accuracy. The penicillin fermentation process serves as the experimental example, on which we conduct 20 repeated experiments and record the results to assess stability. Finally, to investigate the effectiveness of the proposed model, five other models (SVM, KELM, BP, LSTM, and deep LSTM) are designed and evaluated. The proposed model with deep time series feature extraction performs best among the models considered, reflecting the benefits of the proposed incipient fault diagnosis model.

Currently, the proposed model is quite effective in diagnosing incipient faults of batch processes. However, obtaining field data to validate the approach is difficult, so we generate simulated data from a fed-batch penicillin fermentation process to analyze the incipient fault diagnosis problem of batch processes. Studying real data is undoubtedly our next research task. In addition, future work will aim to improve model stability, analyze specific failure causes, and predict faults several time steps in advance. Further efforts will be devoted to promoting the widespread use of the proposed algorithm in actual batch production processes.