1 Introduction

The manufacturing industry uses specific resources (e.g., materials, equipment, tools) to create products for human use through a manufacturing process [1]. It is an important pillar of the national economy and social development, with process safety and product quality being two critical issues in modern industries [2]. In general, the manufacturing industry is divided into two categories: discrete industry and process industry. Typical discrete manufacturing mainly includes mechanical processing and assembly, electronic and electrical appliances, automobile manufacturing, etc. [1]. The process industry mainly includes the chemical, metallurgical, and pharmaceutical industries, etc. [2]. The manufacturing process refers to the entire process from raw material input to finished product output. In the process industry, actual industrial processes are mostly complex: they involve complex physical and chemical reactions, and their subsystems are interconnected. The main data characteristics of complex process industries include high dimensionality, non-Gaussian distributions, nonlinear relationships, time-varying and multimode behaviors, and data autocorrelations [2]. With the expansion of production scale and the rapid development of technology, fault detection and diagnosis (FDD) in modern industry is becoming increasingly complex and important. A small fault in an industrial process may propagate through the system, eventually leading to equipment damage or product quality degradation. Effective FDD models can detect process faults at an early stage of production and classify faults accurately for manufacturing process improvement [3].
A process fault refers to abnormal operation of production, meaning that at least one feature or variable of the process industrial system exhibits an unexpected deviation; fault detection methods monitor process data and determine whether a fault has occurred [4]. Fault diagnosis determines which kind of fault has occurred, specifically the fault type, magnitude, location, and time, so that potential risks can be mitigated. Multivariate statistical process control (MSPC) has long been recognized as one of the most essential tools for FDD in process industries [5,6,7,8].

Currently, prevalent data-driven techniques include projection-based methods, traditional machine learning methods, and deep learning methods. Over the past few decades, these methods have been intensively investigated for solving various process FDD issues. The most widely used data-driven methods, e.g., principal component analysis (PCA) [9], independent component analysis (ICA) [10, 11], partial least squares (PLS) [12], Fisher discriminant analysis (FDA) [13], and the subspace-aided approach (SAP) [14], as well as their variants, have shown great significance in process FDD because of their simplicity. Among these techniques, PCA and ICA are two typical unsupervised methods. PCA is a projection-based feature extraction method for process data with a Gaussian distribution. ICA extracts independent components from process measurements. PLS is a supervised method that models the correlation between process inputs and outputs for prediction. FDA is a typical dimensionality reduction technique and has been widely applied to process FDD. SAP offers a unique way to design observer-based process monitoring systems from data without identifying the complete process model.

The nonlinear extensions of the aforementioned models (e.g., kernel PCA (KPCA), kernel PLS (KPLS), kernel FDA (KFDA)) were developed about two decades ago and have been widely applied in process monitoring systems [15,16,17,18]. These methods use a kernel function to map the original data into a higher-dimensional space, where the nonlinear structure of the input data is more likely to become linear [16]. The latent variables can then be extracted from the higher-dimensional space. Although kernel-based methods are attractive for nonlinear, nonstationary process monitoring, several critical issues remain. For instance, conventional KPCA only considers normal operation data for statistical modeling and ignores prior fault data in the historical database. The surveys in [19,20,21,22] provide more comprehensive information on popular data-driven methods and their applications in process FDD. These methods rely heavily on statistical models to determine the existence of process faults. However, the explosion of data volume and dimension limits the ability of these traditional data-driven methods to extract process features.

Modern industrial processes are usually complex and distributed. Manifold learning is widely acknowledged as an effective technology for enhancing the performance of regular classifiers because it can find the distribution of the input data and preserve their local and global manifold information [23]. Typical manifold learning algorithms, including isometric feature mapping (IsoMap) [24], local linear embedding (LLE) [25], Laplacian eigenmaps (LE) [26], local tangent space alignment (LTSA) [27], and locality preserving projections (LPP) [28, 29], have achieved remarkable success in diverse applications. More recently, global/local variance-preserving algorithms have been embedded in conventional projection-based methods to discover the intrinsic manifold in the data for more precise feature extraction [30,31,32]. Manifold learning is popular for dimensionality reduction and for learning the geometric distribution of data, and it has been intensively studied in different fields. For other manifold learning approaches, see [33,34,35,36].

The ever-increasing amount of data produced in modern process systems, coupled with the complexity of correlating process variables, can create barriers that practitioners in the manufacturing process did not anticipate. This eventually results in a high level of uncertainty during process fault diagnosis [37]. The advent of artificial intelligence (AI) has accelerated the development of process FDD techniques. AI in a broad sense refers to the realization of human thinking through computers (machines), so that machines can make decisions like humans. Machine learning is one way to implement AI, and deep learning is a subarea of machine learning. The concept of deep learning stems from the study of artificial neural networks (ANNs); a multi-layer perceptron with multiple hidden layers is a deep learning structure. Deep learning generates abstract high-level representations of the given data. In contrast, most traditional machine learning models are simple in structure and have only a few shallow layers. Their features are usually predetermined and selected manually according to specific scenarios, which creates many difficulties for practitioners. Moreover, these methods mainly focus on classification tasks.

In this context, it is essential to develop effective process FDD models that can automatically extract highly representative features from complicated process data. In recent years, deep learning has attracted much attention in process FDD due to its strong feature learning ability. Unlike conventional machine learning-based feature learners, a deep neural network (DNN) usually consists of multiple layers that hierarchically convert the input data into hidden abstractions and use these extracted features for FDD tasks. The most popular DNNs, e.g., the autoencoder (AE) [38] and its variants, the convolutional neural network (CNN) [39], the deep belief network (DBN) [40], the recurrent neural network (RNN) [41], and the residual network (ResNet) [42], have been widely employed as representation learners to extract comprehensive features for different pattern recognition tasks in computer vision, natural language processing, and speech recognition. These DNNs have now become promising techniques for handling complicated process signals with high nonlinearity and correlations. A wealth of literature has investigated the effectiveness of DNN-based models for solving process FDD issues. Their excellent performance in learning high-level features remarkably increases the process FDD accuracy of the classifiers.

Unlike model-based approaches, which require prior knowledge of the process, data-driven methods only need a large amount of historical process data. Feature extraction transforms these data and presents them to the FDD system as prior knowledge. This extraction process can be either qualitative or quantitative. The three main symbolic AI-based methods for extracting qualitative historical information are expert systems, fault trees, and signed digraphs. Machine learning-based quantitative extraction methods can be divided into four categories: unsupervised learning, supervised learning, semi-supervised learning, and reinforcement learning. Supervised learning and unsupervised learning have been widely adopted in process control, while semi-supervised learning and reinforcement learning methods have rarely been used in this field. The data-driven machine learning methods for process FDD are shown in Fig. 1.

Fig. 1 Applications of machine learning methods in process FDD

Anupam et al. [43] presented a general framework common to all process monitoring fault detection (PMFD) methods and discussed the future challenges of this research. Ge et al. [2] provided a review of data-based process monitoring. Yin et al. [22] reviewed the wide applications of data-driven methods in process monitoring and fault diagnosis, which provides a reference for monitoring large-scale industrial processes. Qin et al. [44] reviewed data-driven FDD methods and their applications in industrial processes from the perspective of multivariate statistical methods and summarized the challenges, opportunities, and extensions. From the perspective of machine learning, Ge et al. [45] reviewed the application of data mining and analysis in the process industry in recent decades. Nor et al. [46] presented a comprehensive literature review on applications of data-driven methods in FDD systems for chemical process systems. Taqvi et al. [47] reviewed chemical process FDD methods based on supervised and unsupervised learning and presented the challenges in this field. Lei et al. [48] presented a review and roadmap of deep learning-based machinery fault diagnosis and offered a future perspective on intelligent fault diagnosis (IFD). Wang et al. [1] presented a comprehensive survey of commonly used deep learning algorithms and discussed their applications in smart manufacturing. Although numerous studies and reviews have been dedicated to process detection and diagnosis, the coverage of data-driven methods using deep learning is still rather limited. For example, Refs. [2, 43,44,45,46,47] only reviewed applications of traditional machine learning to process FDD, and Refs. [1, 22, 48] mainly focused on machine learning models in specific application areas. A systematic review of the development of process FDD from the past to the future is therefore still lacking. Furthermore, these reviews do not give a detailed development route of process FDD for forecasting future trends, which would be valuable for readers. In recent years, a large number of new deep learning methods have emerged that have not been explained in detail in other review papers. A new review is therefore needed to summarize the current research progress of process FDD.

This paper provides a systematic review of the development of process FDD (especially for complex industrial processes) using machine learning (especially deep learning) methods and presents the future development of this field. The developments of deep networks, their applications, and their future prospects in process FDD covered here differ substantially from those in previous review articles. The applications of recent deep learning methods (e.g., deep residual shrinkage network (DRSN), transformer, large-scale neural network, edge computing) in process FDD are discussed in this review. The aims of this review are as follows: (1) The developments of data-driven process FDD are summarized into three periods, from traditional machine learning to deep learning. In the future, new deep learning techniques, e.g., transfer learning, graph neural network (GNN), and deep Gaussian process (DGP), are expected to promote the further development of process FDD; (2) A development route of process FDD is presented. The route includes potential research trends and can provide direction and valuable guidance for researchers.

The rest of the study is structured as follows. Section 2 presents a comprehensive review of the development of process FDD in the past. Section 3 reviews the applications of deep learning, which is considered the present period in the development of process FDD. Section 4 discusses applications of transfer learning to process FDD. Section 5 presents a development route for deep learning-based process FDD. Conclusions are drawn in Sect. 6.

2 Traditional methods of process fault detection and diagnosis in the past

This section reviews the traditional machine learning-based methods for process FDD. Process FDD mainly includes three steps: data acquisition and preprocessing, feature extraction and selection, and model training and feature classification. Different models have their own characteristics and are applicable to different industrial scenarios. In general, industrial processes fall into several representative types, e.g., continuous processes, batch processes, and multimode processes [2]. (1) A continuous process operates in a continuous way [49]. After the process has been started up, it operates near its optimal state most of the time and produces constant output. The continuous process is a traditional industrial process that is widely found in the chemical, petrochemical, and metallurgical industries. (2) A batch process is a discontinuous process with a limited operation duration [50]. Compared with a continuous process, the set point of a batch process always changes, which means that the process usually operates under different process conditions. Thus, a batch process can produce various grades of products in a single batch. It is inherently nonlinear and time varying and often has strong dynamic data behavior. Batch processes are found in the plastic engineering, food engineering, and biochemical industries. (3) A multimode process is an industrial process with multiple modes whose operating conditions are switched from one operating mode to another [51]. There are many multimode methods for process monitoring [8, 52,53,54]. In these methods, the predefined model matches the corresponding operation mode of the process.

Machine learning-based process FDD methods mainly rely on supervised learning or unsupervised learning [45]. In supervised learning, the data consist of input samples and their corresponding labels. By learning the relationship between samples and labels, the model predicts the labels of unknown samples for process fault classification. Common supervised learning models include PLS, k-nearest neighbor (kNN), ANN, etc. In unsupervised learning, by contrast, the data consist of inputs only, without any corresponding labels. The goal may be to find groups of similar samples in the data (i.e., clustering), to determine the distribution of the data in the input space (i.e., density estimation), or to project the data from a high-dimensional space to a low-dimensional space (i.e., dimensionality reduction and data visualization). Common unsupervised learning models include PCA, KPCA, ICA, etc.

The effectiveness of these machine learning methods needs to be verified in real industrial processes. Most of the proposed FDD methods are applied to chemical process benchmarks, e.g., the Tennessee Eastman process (TEP) [49], the fed-batch fermentation penicillin process (FBFP) [55], and the continuous stirred tank reactor (CSTR) [56]. Other industrial processes include the semiconductor manufacturing process [57], air separation process [58], grinding process [59], boring process [60], laminar cooling process [61], assembly process [62], and aluminum smelting process [63]. In these modern industrial processes, sensors are used to monitor and evaluate process variables and to collect a large amount of historical process data. Industrial process data usually have the following main characteristics: high dimensionality, non-Gaussian distributions, nonlinear relationships, time-varying behavior, autocorrelation, and multivariate structure [2]. FDD models should be selected according to these data characteristics. For example, PLS is a linear estimation method and is therefore not suitable for monitoring nonlinear processes. In addition, although the Gaussian mixture model (GMM) can handle process nonlinearity, it may not be able to model all types of non-Gaussian data. The advantages and disadvantages of these traditional machine learning methods and their applications in process FDD are summarized in Table 1.

Table 1 Summary of applications of the traditional machine learning methods in process FDD

2.1 Overview

Process monitoring refers to the continuous monitoring of an industrial process to detect abnormal conditions or abnormal behavior. Once a fault is detected in the industrial process, fault diagnosis is needed to determine its root cause. Eliminating the fault causes identified by fault detection and diagnosis helps maintain the smooth operation of the process industry.

Several traditional machine learning methods (e.g., ANN, support vector machine (SVM), PCA) have been applied to process FDD. The procedure includes three steps, i.e., data acquisition and preprocessing; feature extraction and feature selection; and model selection, training, and validation, as shown in Fig. 2. Each step is detailed in the following subsections.

Fig. 2 Detection and diagnosis procedure of process FDD using traditional machine learning methods

2.2 Step 1: data acquisition and preprocessing

Process industry data include historical data and real-time data. Data acquisition is the first step of process monitoring. Several main types of data are collected through sensors: vibration signals, speed, pressure, temperature, force, current signals, audio signals, images, etc. In this step, the process data structure is checked, different data features are analyzed, the operating region of the current process is identified, and the modeling and evaluation datasets are determined. After data collection, data preprocessing is needed to improve data quality. Preprocessing methods include data cleaning, data integration, data transformation, data reduction, etc. Data cleaning handles missing data through sample deletion, missing value estimation, Bayesian inference, and other methods, and eliminates data inconsistencies. Data integration removes outliers and gross errors in datasets through data combination and unified storage. Data transformation converts the data into a form suitable for data mining by means of smoothing, aggregation, data generalization, and normalization.
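
As an illustration of two of the preprocessing operations above, the following minimal sketch fills missing values with the column mean (one simple form of missing value estimation) and normalizes each variable to zero mean and unit variance. The function names and the toy data are assumptions made for this example and are not taken from a specific FDD toolbox.

```python
import numpy as np

def impute_column_mean(X):
    """Replace NaN entries with the mean of the observed values in each column."""
    X = np.asarray(X, dtype=float).copy()
    col_mean = np.nanmean(X, axis=0)          # mean over non-missing entries
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_mean[cols]
    return X

def fit_scaler(X_train):
    """Estimate per-variable mean and standard deviation from training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)
    std[std == 0.0] = 1.0                     # guard against constant variables
    return mean, std

def standardize(X, mean, std):
    """Apply z-score normalization with previously estimated statistics."""
    return (X - mean) / std

# Toy data (hypothetical): 4 samples of 2 process variables, one missing reading.
X_raw = np.array([[1.0, 200.0],
                  [2.0, np.nan],
                  [3.0, 220.0],
                  [4.0, 210.0]])
X_clean = impute_column_mean(X_raw)
mean, std = fit_scaler(X_clean)
X_scaled = standardize(X_clean, mean, std)
```

In a monitoring setting, the scaling statistics would typically be estimated once from normal operation data and then reused for new samples, which is why fitting and applying the scaler are kept as separate functions here.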

2.3 Step 2: feature extraction/feature selection

This step consists of two parts. First, common features are extracted from the collected data. Second, feature selection methods, such as filter, wrapper, and embedded methods, are used to select the features that are sensitive to the process status from the extracted features.

2.3.1 Feature extraction

Feature extraction generates a set of new features by combining existing features. Its purpose is to obtain the essential features of the process data, remove useless noise, and enable data visualization. Common feature extraction methods include PCA, ICA, linear discriminant analysis (LDA), and manifold learning. In general, LDA is used to reduce the dimension if the training data have class labels; otherwise, PCA is used.
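
For the unlabeled case, PCA-based feature extraction can be sketched as follows. This is a minimal illustration computed with the singular value decomposition, not a complete monitoring scheme; the function names and the synthetic data (two latent factors observed through five noisy variables) are assumptions made for the example. The squared reconstruction error returned by the last function is the quantity that is often monitored as the SPE (or Q) statistic.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return the training mean, the loading matrix P (m x a) and the explained variance ratio."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                       # columns are principal directions
    eigvals = s ** 2 / (X.shape[0] - 1)           # variances along each direction
    explained = eigvals[:n_components].sum() / eigvals.sum()
    return mean, P, explained

def pca_scores(X, mean, P):
    """Project samples onto the retained principal components."""
    return (X - mean) @ P

def reconstruction_error(X, mean, P):
    """Squared residual of each sample after projection onto the PCA subspace."""
    Xc = X - mean
    residual = Xc - Xc @ P @ P.T
    return np.sum(residual ** 2, axis=1)

# Synthetic example (hypothetical data).
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))                     # latent factors
A = rng.normal(size=(2, 5))                       # mixing matrix
X = Z @ A + 0.05 * rng.normal(size=(200, 5))
mean, P, explained = fit_pca(X, n_components=2)
scores = pca_scores(X, mean, P)
spe = reconstruction_error(X, mean, P)
```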

2.3.2 Feature selection

Feature selection methods, e.g., filters, wrappers, and embedded methods, are used to select the features that are sensitive to the process state from the extracted features. This removes redundant information and further improves process FDD performance.

Filter-based methods. The filter screens the collected features directly, independently of classifier training [129]. Feature selection is carried out first and the learner is trained afterwards on the resulting feature subset, so the selection process does not depend on the learner.

Wrapper-based methods. The final classifier is directly used as the evaluation function of feature selection, and the feature subset is selected for a specific classifier. Different from the filter-based method, wrapper focuses on the interaction between feature selection and training classifier [129].

Embedded methods. The embedded method combines feature selection with classifier learning and selects features during the learning process. The most common embedded approaches are based on L1 or L2 regularization [129]; L1 regularization drives some coefficients exactly to zero and thereby discards features, whereas L2 regularization only shrinks them.
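
To make the filter idea concrete, the sketch below ranks features by a Fisher score (between-class scatter divided by within-class scatter, computed per feature) and keeps the top k, without involving any classifier. The Fisher score is used here as one common filter criterion chosen for illustration; it is an assumption of this example rather than a criterion prescribed by [129], and the function names and synthetic data are hypothetical.

```python
import numpy as np

def fisher_score(X, y):
    """Per-feature ratio of between-class scatter to within-class scatter."""
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        between += Xc.shape[0] * (Xc.mean(axis=0) - overall_mean) ** 2
        within += Xc.shape[0] * Xc.var(axis=0)
    return between / (within + 1e-12)      # small constant avoids division by zero

def select_top_k(X, y, k):
    """Indices of the k features with the highest Fisher score."""
    scores = fisher_score(X, y)
    return np.argsort(scores)[::-1][:k]

# Synthetic example: only feature 1 separates normal (0) from faulty (1) samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.repeat([0, 1], 100)
X[y == 1, 1] += 5.0
selected = select_top_k(X, y, k=1)
```

A wrapper method would instead retrain the final classifier for each candidate subset and use its validation accuracy as the score, which is more expensive but accounts for the interaction between features and classifier.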

2.4 Step 3: model selection, training, and validation

Different models can be selected according to the characteristics of the data relationships and process variables. For example, if the data relationships are linear and most process variables are Gaussian distributed, PCA or PLS can be selected for process FDD. If some process variables are non-Gaussian, ICA and other non-Gaussian modeling methods can be selected. If the relationships between process variables are nonlinear, nonlinear modeling methods, e.g., ANN and SVM, can be selected for classification.

The next step is to train the model and evaluate its effectiveness. Once the model has been trained on labeled samples, it can classify unlabeled input samples. Several typical process FDD methods using traditional machine learning are briefly introduced in the following subsections.

2.4.1 Supervised learning methods

The data in supervised learning methods must be classified and labeled with tags that indicate the system condition, such as healthy, faulty, and fault type. In supervised learning-based process FDD, the labeled data are used to train the machine learning model, and the trained model can then classify unlabeled data [22].

2.4.1.1 PLS-based approaches

PLS is a basic multivariate statistical method that is extensively used for process FDD. It establishes a linear regression model by simultaneously projecting the predicted variables and observable variables into a latent variable space. PLS uses label information in the modeling procedure and is therefore a supervised learning method. Data collected under normal operation form an input matrix \(X = [x_{1}, x_{2}, \ldots, x_{n}]^{T} \in R^{n \times m}\) with n samples of m process variables and an output matrix \(Y = [y_{1}, y_{2}, \ldots, y_{n}]^{T} \in R^{n \times p}\) with p output variables. PLS projects X and Y to a low-dimensional space defined by l latent variables as follows:

$$\left\{ \begin{array}{l} X = \sum\limits_{i=1}^{l} t_{i} p_{i}^{T} + E = TP^{T} + E \\ Y = \sum\limits_{i=1}^{l} t_{i} q_{i}^{T} + F = TQ^{T} + F \end{array} \right.$$
(1)

where \(T = [t_{1}, \ldots, t_{l}]\) denotes the latent score matrix, \(P = [p_{1}, \ldots, p_{l}]\) and \(Q = [q_{1}, \ldots, q_{l}]\) denote the loading matrices for X and Y, respectively, E and F denote the PLS residuals corresponding to X and Y, respectively, and l is generally determined by cross-validation. Details of the PLS algorithm can be found in [130, 131].

The latent vectors \(t_{i}\) are computed sequentially from the data such that the covariance between the deflated input data, \(X_{i} = X_{i-1} - t_{i-1} p_{i-1}^{T}\) with \(X_{1} = X\), and the output data Y is maximized for each factor; \(w_{i}\) is the weight vector used to compute the score \(t_{i} = X_{i} w_{i}\). With \(W = [w_{1}, \ldots, w_{l}]\), the scores can be expressed directly in terms of X as:

$${T} = {XR}$$
(2)

where \({R} = {W(P}^{ {T}} {W)}^{{ { - 1}}}\) with the following relation [132]:

$${P}^{ {T}} {R = R}^{ {T}} {P = I}_{l}$$
(3)
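
Relation (3) can be checked directly from the definition of \(R\), assuming only that \(P^{T}W\) is invertible:

$$P^{T}R = P^{T}W(P^{T}W)^{-1} = I_{l},\qquad R^{T}P = (P^{T}R)^{T} = I_{l}$$

As a consequence, \((PR^{T})^{2} = P(R^{T}P)R^{T} = PR^{T}\), so \(PR^{T}\) is a projection matrix; it is in general not symmetric, which is why the resulting projection is oblique rather than orthogonal.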

PLS uses an oblique projection toward the input data space, such that the model estimate and residual on the new sample x can be obtained:

$$\hat{x} = PR^{ {T}} x$$
(4)
$$\hat{y} = QR^{ {T}} x$$
(5)
$$\tilde{x} = (I - PR^{ {T}} {)}x$$
(6)

where \(\hat{x}\) and \(\tilde{x}\) denote the oblique projections of \(x\) [44].
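
Equations (1) to (6) can be illustrated with a short numerical sketch. It is an assumption-laden illustration rather than the algorithm of [130, 131]: each weight vector is taken as the dominant left singular vector of \(X_{i}^{T}Y\), which is the direction the usual NIPALS iteration converges to, only X is deflated, and the data are synthetic. The function names are hypothetical.

```python
import numpy as np

def fit_pls(X, Y, n_latent):
    """PLS on mean-centered X (n x m) and Y (n x p). Returns T, P, Q, W, R."""
    n, m = X.shape
    n_out = Y.shape[1]
    T = np.zeros((n, n_latent))
    P = np.zeros((m, n_latent))
    Q = np.zeros((n_out, n_latent))
    W = np.zeros((m, n_latent))
    X_i = X.copy()
    for i in range(n_latent):
        # Weight vector: unit direction in X_i with maximal covariance with Y.
        U, _, _ = np.linalg.svd(X_i.T @ Y, full_matrices=False)
        w = U[:, 0]
        t = X_i @ w                        # score t_i = X_i w_i
        tt = t @ t
        p = X_i.T @ t / tt                 # X loading p_i
        q = Y.T @ t / tt                   # Y loading q_i
        T[:, i], P[:, i], Q[:, i], W[:, i] = t, p, q, w
        X_i = X_i - np.outer(t, p)         # deflation: X_{i+1} = X_i - t_i p_i^T
    R = W @ np.linalg.inv(P.T @ W)         # Eq. (2): T = X R
    return T, P, Q, W, R

def decompose_sample(x, P, Q, R):
    """Oblique projection of a new (centered) sample, Eqs. (4)-(6)."""
    x_hat = P @ (R.T @ x)
    y_hat = Q @ (R.T @ x)
    x_tilde = x - x_hat
    return x_hat, y_hat, x_tilde

# Synthetic example (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
Y = X @ rng.normal(size=(6, 2)) + 0.1 * rng.normal(size=(100, 2))
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)
T, P, Q, W, R = fit_pls(X, Y, n_latent=3)
x_new = X[0]
x_hat, y_hat, x_tilde = decompose_sample(x_new, P, Q, R)
```

On such data, the scores obtained sequentially through deflation coincide with \(XR\), and \(\hat{x} + \tilde{x}\) recovers \(x\), which is a convenient sanity check for any PLS implementation used for monitoring.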

Although PLS-based FDD has been widely used in industrial processes, two problems remain: (1) PLS needs many principal components to describe process-related variations, which makes the model difficult to interpret; (2) PLS does not extract principal components in order of the variance they explain in the process variable matrix, so it is not appropriate to monitor the residual subspace with the Q statistic. To solve these problems, the following extended models have been proposed based on the basic PLS model.

Zhou et al. [133] proposed a PLS-based scheme, total projection to latent structure (TPLS), to deal with the skew decomposition in standard PLS. However, TPLS does not clearly explain why, in practical applications, the principal component space of PLS contains variations that are independent of the fault. Moreover, the principal component space does not need to be decomposed into four subspaces; it can be completely decomposed into a subspace related to the process fault and a subspace related to the input. Qin et al. [134] proposed concurrent projection of latent structure (CPLS), which simplified the structure of TPLS. Based on the consistent projection of the input and output data spaces, CPLS provides complete monitoring of process faults occurring in the predictable output subspace and the unpredictable output residual subspace. To address the dynamics of industrial processes, dynamic principal component analysis (DPCA) and dynamic PLS (DPLS) methods have been proposed [64,65,66,67,68]. Similar to TPLS, CPLS does not change the prediction ability of PLS for fault variables, but further decomposes the measurement variable space according to the fault variable space. Ding et al. [69] and Yin et al. [69, 70] constructed the modified PLS (MPLS), which uses SVD to decompose the process variable space into two subspaces but requires that the principal component subspace contain no components orthogonal to the fault variables. MPLS avoids the complex iterative calculation of CPLS in practical applications, which is conducive to the prediction of process faults, but the residual space makes no contribution to its prediction. To ensure the completeness of the spatial decomposition, Peng et al. [71] proposed an efficient latent structure projection (EPLS) based on MPLS. Das et al. [135] combined cluster analysis with multiple multi-block PLS (MBPLS) for process monitoring.

2.4.1.2 kNN-based approaches

kNN is one of the simplest machine learning algorithms and is often used for classification tasks [136]. In this method, a distance metric is used to search for the k samples nearest to a given unlabeled sample. As shown in Fig. 3, kNN determines the category of the sample to be classified solely from the categories of its nearest one or several samples.

Fig. 3 Illustration of the kNN algorithm
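
The decision rule illustrated in Fig. 3 can be sketched in a few lines. The example below assumes the Euclidean distance and a simple majority vote among the k nearest training samples; the function name and the toy data are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Label x_query by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                      # indices of k neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data: class 0 (e.g., normal) near the origin, class 1 (e.g., faulty) near (5, 5).
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

label_a = knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3)
label_b = knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3)
```

Because every query requires distances to all training samples, the cost of this rule grows with both the number of stored samples and the number of features.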

The kNN-based method is widely used in semiconductor manufacturing processes. He et al. [57] proposed a fault detection method based on the k-nearest neighbor rule (FD-kNN) for semiconductor manufacturing processes. To enable FD-kNN to monitor processes online, He et al. [72] proposed a principal component-based kNN (PC-kNN) for fault detection. For nonlinear, multimodal, and non-Gaussian batch processes, Guo et al. [73] proposed an improved fault detection method based on kNN, and their results show that it achieves better fault detection performance than MPCA, FD-kNN, and PC-kNN. Li et al. [74] proposed a diffusion mapping-based kNN rule (DM-kNN), which can reduce the cost of data storage and improve fault detection performance. Zhang et al. [75] proposed a fault detection model based on the weighted distance of kNNs (FD-wkNN), which is more suitable for multimodal process monitoring than kNN. kNN and its extensions are simple and can be applied effectively to process FDD. However, the amount of computation increases significantly as the number of features grows, and kNN has difficulty handling sample imbalance effectively.

2.4.1.3 ANN-based approaches

ANN generally consists of a complex network structure formed by the interconnection of a large number of processing units (neurons). It is a kind of abstraction, simplification, and simulation of human brain organization structure and operation mechanism.

The multi-layer perceptron (MLP) trained by the back propagation (BP) algorithm is among the most successful neural network models. An ANN consists of three components: the input layer, the hidden layers, and the output layer. The signal propagates forward, and the error propagates backward. An ANN is built from a large number of simple processing units (neurons), as shown in Fig. 4. Given the training dataset \(\{\left({x}_{1},{y}_{1}\right),\left({x}_{2},{y}_{2}\right),\dots ,\left({x}_{m},{y}_{m}\right)\}\), where each \({x}_{i}\in {R}^{d}\) includes \(d\) features and each \({y}_{i}\in {R}^{l}\) encodes \(l\) health states, the output of the \(h\)th hidden layer is expressed as:

$${({x}_{i}^{h})}_{j}={f}^{h}\left(\sum_{k=1}^{{n}_{h-1}}{\omega }_{jk}^{h}{({x}_{i}^{h-1})}_{k}+{b}_{j}^{h}\right),\quad j=1,2,\dots ,{n}_{h},\quad h=1,2,\dots ,H$$
(7)

where \({({x}_{i}^{h})}_{j}\) is the output of the \(j\)th neuron in the \(h\)th hidden layer for the sample \({x}_{i}\), with \({x}_{i}^{0}={x}_{i}\) and \({n}_{0}=d\); \({n}_{h}\) is the number of neurons in the \(h\)th hidden layer; \({f}^{h}\) is the activation function of the \(h\)th hidden layer, often chosen to be the sigmoid function; \({\omega }_{jk}^{h}\) is the weight between the \(k\)th neuron in the \((h-1)\)th layer and the \(j\)th neuron in the \(h\)th hidden layer; and \({b}_{j}^{h}\) is the bias of the \(j\)th neuron in the \(h\)th hidden layer. The output of the BP neural network (BPNN) can be obtained by:

$${y}_{k}={f}^{out}\left(\sum_{j=1}^{{n}_{H}}{\omega }_{kj}^{out}{({x}_{i}^{H})}_{j}+{b}_{k}^{out}\right),\quad k=1,2,\dots ,l$$
(8)

where \({y}_{k}\) is the predicted output of the \(k\)th neuron in the output layer, \({f}^{out}\) is the activation function of the output layer, \({\omega }_{kj}^{out}\) is the weight between the \(j\)th neuron in the last hidden layer and the \(k\)th output neuron, and \({b}_{k}^{out}\) is the bias of the \(k\)th output neuron.

Fig. 4 An artificial neural network with two hidden layers
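
The forward pass in Eqs. (7) and (8) amounts to a short loop over the layers. The sketch below uses two hidden layers, as in Fig. 4, with sigmoid hidden activations. Two points are assumptions of this example: the output activation \({f}^{out}\), which the text leaves unspecified, is taken to be a softmax so that the outputs can be read as class probabilities, and the weights are random placeholders rather than values trained by BP. The function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())               # shift for numerical stability
    return e / e.sum()

def forward(x, hidden_layers, output_layer):
    """Forward pass of an MLP.

    hidden_layers: list of (W, b), W of shape (n_h, n_{h-1}), b of shape (n_h,)
    output_layer:  (W_out, b_out), W_out of shape (l, n_H)
    """
    a = x                                  # x_i^0 = x_i
    for W, b in hidden_layers:
        a = sigmoid(W @ a + b)             # Eq. (7), all neurons of a layer at once
    W_out, b_out = output_layer
    return softmax(W_out @ a + b_out)      # Eq. (8) with an assumed softmax output

# Example dimensions (hypothetical): d = 4 features, hidden sizes 5 and 3, l = 2 states.
rng = np.random.default_rng(0)
hidden_layers = [(rng.normal(size=(5, 4)), np.zeros(5)),
                 (rng.normal(size=(3, 5)), np.zeros(3))]
output_layer = (rng.normal(size=(2, 3)), np.zeros(2))
x = rng.normal(size=4)
y_pred = forward(x, hidden_layers, output_layer)
```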

Numerous studies have shown that ANN has powerful pattern classification and recognition capabilities. As a result, ANN is one of the classifiers commonly used in intelligent fault diagnosis [137, 138]. MLP is an ANN made of units arranged in layers with only forward connections to units in subsequent layers [139]. This model has been successfully applied to fault detection and identification in turning processes [76], air separation processes [58], grinding processes [59], boring processes [60], and multivariate attribute processes [78].

The radial basis function (RBF) network has a feedforward structure, consisting of a single hidden layer whose input connections are unweighted and which is fully connected to the output layer. Compared with MLP, RBF is faster to train [140]. RBF has been used in several applications [61, 77, 81]. Probabilistic neural network (PNN) [141] is similar to MLP in structure, but due to the smaller number of connections, PNN is normally easier to train than MLP. PNN and its variants have been used for fault diagnosis of nonlinear processes [82, 83] and batch processes [84]. In addition, Yu et al. [78, 79] developed a selective neural network ensemble method based on discrete particle swarm optimization (DPSOEN) to accurately locate the source of runaway signals in multivariate manufacturing processes. Du et al. [62] explored a selective neural network ensemble algorithm for detecting and isolating process faults in assembly processes.

2.4.1.4 SVM-based approaches

SVM is a computational learning method for small-sample classification [142]. It seeks a hyperplane \(f\left(x\right)=0\) that separates the given dataset into two classes, and the hyperplane is defined as:

$$f\left(x\right)={\omega }^{T}x+b=\sum_{i=1}^{m}{\omega }_{i}{x}_{i}+b=0$$
(9)

where \(\omega\) is an m-dimensional vector and b is a scalar. As shown in Fig. 5, SVM constructs two parallel hyperplanes as the margin boundaries and maximizes the distance between them, that is, the margin, to separate the two classes of samples.

Fig. 5 Classification by the linear SVM
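
As a small illustration of Eq. (9), the sketch below evaluates the decision function of a linear SVM and the width of its margin. The values of `w` and `b` are hypothetical and assumed to come from an already trained model; the margin formula \(2/\Vert \omega \Vert\) relies on the usual canonical scaling \({\omega }^{T}x+b=\pm 1\) for the two boundary hyperplanes, which is an assumption not stated in the text.

```python
import numpy as np


def svm_decision(X, w, b):
    """Evaluate f(x) = w^T x + b for every row of X, as in Eq. (9)."""
    return X @ w + b


def svm_predict(X, w, b):
    """Assign class +1 or -1 according to the side of the hyperplane."""
    return np.where(svm_decision(X, w, b) >= 0, 1, -1)


def margin_width(w):
    """Distance between the hyperplanes w^T x + b = +1 and w^T x + b = -1."""
    return 2.0 / np.linalg.norm(w)


# Hypothetical trained parameters and a few test samples.
w = np.array([1.0, 1.0])
b = -3.0
X = np.array([[1.0, 1.0], [2.0, 2.0], [0.0, 0.0], [3.0, 1.0]])

print(svm_decision(X, w, b))  # [-1.  1. -3.  1.]
print(svm_predict(X, w, b))   # [-1  1 -1  1]
```

Samples with \(f(x)=\pm 1\), such as the first two rows here, lie exactly on the margin boundaries.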

Due to the data classification ability of SVM, it has been used for process FDD in the past years. Peng et al. [85] proposed a novel process fault detection and classification approach via non-negative matrix factorization with sparseness constraints (NMFSC) and structural SVMs. Yin et al. [86] reviewed the research and development of FDD based on SVM in complicated industrial processes. Onel et al. [49] proposed a new feature selection algorithm based on nonlinear (kernel correlation) SVM and applied it to continuous process monitoring and fault detection. Yang et al. [55] combined PCA and recursive feature elimination (RFE) with SVM for process FDD. Onel et al. [87] presented a novel data-driven framework for process monitoring in batch processes, where a feature selection algorithm based on nonlinear SVM is used to exploit high-dimensional process data.

2.4.1.5 FDA-based approaches

FDA is an optimal dimension reduction method that has been intensively studied in process fault diagnosis. It can separate process faults from normal samples by searching for the directions with the maximized discrimination between faults and normal data instead of the largest variability only within normal samples [88]. The objective of FDA is to seek a mapping to maximize the scatter between the classes and minimize the scatter within each class.
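
For the two-class case (normal versus one fault), the FDA objective has a closed-form solution, and the following sketch shows it. It is a simplified illustration rather than the multi-class procedure used in the cited works: the function name and the synthetic data are hypothetical, and the direction is taken as \(S_{w}^{-1}(m_{1}-m_{0})\), where \(S_{w}\) is the within-class scatter matrix and \(m_{0}\), \(m_{1}\) are the class means.

```python
import numpy as np


def fisher_direction(X_normal, X_fault):
    """Two-class Fisher discriminant direction, normalized to unit length.

    Rows of X_normal and X_fault are samples; columns are process variables.
    """
    m0 = X_normal.mean(axis=0)
    m1 = X_fault.mean(axis=0)
    # Within-class scatter: sum of the scatter matrices of both classes.
    S_w = (X_normal - m0).T @ (X_normal - m0) + (X_fault - m1).T @ (X_fault - m1)
    w = np.linalg.solve(S_w, m1 - m0)
    return w / np.linalg.norm(w)


# Synthetic data: the fault shifts variable 1, while variable 2 is noisy
# but carries no information about the fault.
rng = np.random.default_rng(0)
scale = np.array([1.0, 5.0])
X_normal = rng.normal(size=(500, 2)) * scale
X_fault = rng.normal(size=(500, 2)) * scale + np.array([3.0, 0.0])

w = fisher_direction(X_normal, X_fault)
print(np.round(w, 2))
```

Unlike PCA, which would favor the high-variance second variable, the Fisher direction concentrates on the variable that actually separates the two classes.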

In the past few years, FDA has mainly been used in process monitoring, fault classification, and fault diagnosis in the process industry. Xi et al. [17] proposed a novel nonlinear biological batch process monitoring and fault identification approach based on kernel FDA. In order to deal with highly correlated, complex, and noisy high-dimensional databases effectively, Nor et al. [91] proposed a process FDD method based on wavelet analysis, kernel FDA (KFDA), and SVM. Nor et al. [94] proposed a multi-scale KFDA adaptive neuro-fuzzy inference system (KFDA-ANFIS) framework. Yu [88] proposed a novel localized FDA (LFDA)-based process monitoring approach to monitor processes containing multiple types of steady-state or dynamic faults. Yu [89] proposed a multiway kernel localized FDA (MKLFDA) for batch bioprocess monitoring. Zhao et al. [90] proposed an FDD method based on extreme learning machine (ELM) and multiway FDA (MFDA). Ren et al. [95] integrated local consistency Gaussian mixture model (DLCGMM) with modified local FDA (MLFDA) for multimode process monitoring. Tang et al. [92] proposed a novel data-driven process monitoring method named Fisher discriminant global–local preserving projection (FDGLPP) and applied it to diagnose faults in industrial processes. Yang et al. [93] presented a class-incremental scheme of FDA to improve the performance of process fault diagnosis.

2.4.2 Unsupervised learning methods

The difference between supervised learning and unsupervised learning is whether the training data of the model need labels. Unsupervised learning models can be constructed on training data without class labels [22].

2.4.2.1 PCA-based approaches

PCA is a typical statistical approach that has been widely used in the process monitoring field. PCA does not use label information in the modeling procedure and is an unsupervised learning method for dimension reduction [45]. It is capable of projecting high-dimensional data onto a lower-dimensional space that retains most of the variance of the original data and accounts for correlation among variables.

Denote the dataset \(X = [x_{1}, x_{2}, \ldots, x_{n}]\). PCA seeks a mapping axis \(a\) such that the variance of the projected samples \(Y = [y_{1}, y_{2}, \ldots, y_{n}]\), i.e., \(y_{i} = a^{T} x_{i}\) (\(i = 1, \ldots, n\)), or equivalently the mean square of the Euclidean distance between all pairs of projected samples, is maximized as follows:

$$J_{g}(a) = \frac{1}{n}\sum\limits_{i = 1}^{n} (y_{i} - \overline{y})^{2} = \frac{1}{n}\sum\limits_{i = 1}^{n} a^{T} (x_{i} - \overline{x})(x_{i} - \overline{x})^{T} a = a^{T} Ca$$
(10)

where \(\overline{y} = \frac{1}{n}\sum {y_{i} }\), \(\overline{x} = \frac{1}{n}\sum {x_{i} }\) and \(C = \frac{1}{n}\sum\nolimits_{i = 1}^{n} {(x_{i} - \overline{x})(x_{i} - \overline{x})^{T} }\) is the covariance matrix. The eigenvectors of the data covariance matrix associated with the largest eigenvalues are the basis vectors of PCA. The output principal vectors \(a_{1}, \ldots, a_{l}\) form an orthonormal set of eigenvectors of the sample covariance matrix associated with its \(l\) largest eigenvalues, where \(l < m\) (\(l\) is the selected number of principal components (PCs) and \(m\) is the number of all PCs).
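
The eigendecomposition described above can be sketched in a few lines of NumPy. This is a minimal example, assuming that samples are stored as rows (the transpose of the column layout of \(X\) in the text) and using synthetic data; the function name `pca` is only illustrative.

```python
import numpy as np


def pca(X, l):
    """PCA by eigendecomposition of the sample covariance matrix.

    X: array of shape (n, m), one sample per row.
    Returns the m x l matrix of principal vectors and the l largest eigenvalues.
    """
    Xc = X - X.mean(axis=0)               # subtract the mean x-bar
    C = Xc.T @ Xc / X.shape[0]            # covariance matrix C with the 1/n factor
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:l]
    return eigvecs[:, order], eigvals[order]


# Synthetic data with very different variances along the three variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])

A, lam = pca(X, l=2)
scores = (X - X.mean(axis=0)) @ A  # y_i = a^T (x_i - x-bar) for each retained axis
print(A.shape, scores.shape)       # (3, 2) (200, 2)
```

The variance of each score column equals the corresponding eigenvalue, which is the quantity \(a^{T}Ca\) maximized in Eq. (10).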

PCA and PLS are two typical multivariate statistical approaches in FDD [143]. Several studies have reviewed the applications of PCA and PLS in process analysis and control and in FDD [96,97,98, 144]. The first attempt at applying PCA in FDD can be found in [99]. This method was extended to batch processes by using multiway PCA [97]. To deal with nonlinearity, a nonlinear PCA [97] and a neural net PLS [145] were proposed. In [100], an integral statistical methodology combining PCA and discrimination analysis techniques was proposed. Yu [31] presented a local and global principal component analysis (LGPCA), which is a linear dimensionality reduction technique that preserves both local and global information in the observation data. Chaouch et al. [146] developed an off-line monitoring method based on multilayer neural PCA and nonlinear gain scheduling and applied it to a photovoltaic system.

For high-dimensional correlated process variables, the dimensionality reduction in PCA greatly compresses the data while retaining the main information. After the dimension of the process variables is reduced, further data analysis becomes easier. In the past few years, PCA has found many applications in dimensionality reduction and data visualization [101,102,103].

2.4.2.2 KPCA-based approaches

Kernel methods, e.g., kernel PCA (KPCA) and KPLS, are utilized to avoid nonlinear optimization and have been widely applied in nonlinear process monitoring. The basic idea of kernel methods is to first project the process data into a high-dimensional feature space and then perform linear PCA or PLS in the feature space [105]. Taking KPCA as an example, the specific implementation is presented as follows:

Denote a dataset \(X = [x_{1}, x_{2}, \ldots, x_{n}]\), \(x_{i} = [x_{i1}, x_{i2}, \ldots, x_{ip}]^{T}\), \(i = 1,2,\ldots, n\), with p measured process variables. The covariance matrix of the dataset is calculated as follows:

$$\frac{1}{n}\sum\limits_{j = 1}^{n} x_{j} x_{j}^{T}$$
(11)

where n is the number of samples. Suppose the nonlinear mapping:

$$\Phi : x \to F$$
(12)

Thus, \(F\) is spanned by \(\Phi(x_{1}), \Phi(x_{2}), \ldots, \Phi(x_{n})\). Denote the kernel matrix as:

$$K = \Phi(X)^{T} \Phi(X) = [k(x_{i}, x_{j})]_{n \times n}$$
(13)

The kernel function must satisfy the Mercer theorem. In other words, the kernel is an inner product in the feature space. Through the kernel function, calculations in the feature space can be carried out in the original input space, without knowing the high-dimensional transformation explicitly:

$$k(x_{i}, x_{j}) = \left\langle \Phi(x_{i}), \Phi(x_{j}) \right\rangle = \Phi(x_{i})^{T} \Phi(x_{j})$$
(14)

Then, the covariance matrix in the feature space is calculated:

$$C = \frac{1}{n}\sum\limits_{j = 1}^{n} \Phi(x_{j}) \Phi(x_{j})^{T}$$
(15)

Thus, the operation of PCA in the feature space is:

$$\lambda {V = CV}$$
(16)

where \(\lambda\) is the eigenvalue and \(V\) is the eigenvector, \(V \in F\setminus\{0\}\). Since \(V\) belongs to the span of \(\left\{ \Phi(x_{1}), \Phi(x_{2}), \ldots, \Phi(x_{n}) \right\}\), the following formula can be obtained:

$$\lambda (\Phi(x_{k}) \cdot V) = \Phi(x_{k}) \cdot CV, \quad k = 1,2,\ldots, n$$
(17)

There exists a coefficient vector \(\alpha = \{\alpha_{1}, \alpha_{2}, \ldots, \alpha_{n}\}\), such that \(V\) can be expressed linearly by \(\Phi(x_{i})\) (\(i = 1,2,\ldots, n\)), namely:

$$V = \sum\limits_{i = 1}^{n} \alpha_{i} \Phi(x_{i})$$
(18)

By combining Eqs. (15), (17), and (18), the following expression is obtained:

$$\lambda \sum\limits_{i = 1}^{n} \alpha_{i} \left[ \Phi(x_{k})^{T} \Phi(x_{i}) \right] = \frac{1}{n}\sum\limits_{i = 1}^{n} \alpha_{i} \sum\limits_{j = 1}^{n} \left[ \Phi(x_{k})^{T} \Phi(x_{j}) \right] \left[ \Phi(x_{j})^{T} \Phi(x_{i}) \right], \quad k = 1,2,\ldots, n$$
(19)

Then, the kernel function is introduced:

$$n\lambda \alpha = {K}\alpha$$
(20)

Finally, the eigenvalues and eigenvectors of \({K}\) are obtained, and the principal components are extracted by performing PCA.
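
The eigenvalue problem \(n\lambda \alpha = K\alpha\) can be sketched as follows. The example is illustrative only: the Gaussian (RBF) kernel and its width `gamma` are assumed choices, and the kernel matrix is centered before the decomposition, a step that the derivation above leaves implicit by treating the mapped data as mean-centered.

```python
import numpy as np


def rbf_kernel(X, gamma):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))


def kpca(X, n_components, gamma=1.0):
    """Kernel PCA on the rows of X; returns the scores and eigenvalues lambda."""
    n = X.shape[0]
    K = rbf_kernel(X, gamma)
    # Center the mapped data in feature space (assumption, see lead-in).
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ Kc_placeholder(K, one)
    mu, alpha = np.linalg.eigh(Kc)           # mu = n * lambda in Eq. (20)
    order = np.argsort(mu)[::-1][:n_components]
    mu, alpha = mu[order], alpha[:, order]
    alpha = alpha / np.sqrt(mu)              # scale so that each V has unit norm
    scores = Kc @ alpha                      # projections of the training samples
    return scores, mu / n


def Kc_placeholder(K, one):
    """Helper computing the last centering term, K averaged over rows and columns."""
    return K @ one


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
scores, lam = kpca(X, n_components=2, gamma=0.5)
print(scores.shape)  # (30, 2)
```

The coefficient vectors are rescaled by \(1/\sqrt{n\lambda }\) so that the corresponding eigenvectors \(V\) have unit length in the feature space; the score columns are then uncorrelated with variances equal to the eigenvalues \(\lambda\).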

KPCA has been used to monitor nonlinear processes, but it is not very suitable for process fault diagnosis. Thus, it is often combined with other methods to solve FDD problems. Li et al. [106] presented a new method of FDD for nonlinear processes based on KPCA and least squares support vector machine (LSSVM). A new detection and diagnosis system combining the decision directed acyclic graph (DDAG) with KPCA was proposed in [147]. Xu et al. [107] used an improved multi-scale KPCA to analyze the multi-scale and nonlinear properties of chemical data.

2.4.2.3 ICA-based approaches

ICA is a multivariate statistical approach that can extract statistically independent components (ICs) from the given data. Because it is suitable for process measurement with non-Gaussian distribution, it has been intensively applied for solving process fault detection issues during the past few decades [148].

Denote a given input vector \(x(i) = [x_{1}(i), x_{2}(i), \ldots, x_{m}(i)]^{T}\) with m observed measurements at sample i, which can be described as a linear combination of k unknown independent components (ICs) \(s_{1}, s_{2}, \ldots, s_{k}\), then [11]:

$$x {(}i {) = }\sum\limits_{{j { = 1}}}^{k} {a_{j} s_{j} {(}i {)}} = {A}s {(}i {)}$$
(21)

The relationship between the original data and ICs is expressed as:

$${X} = {AS + E}$$
(22)

where \({X} = {[}x {(1),}x {(2),}\ldots , x {(}n {)]} \in {R}^{m \times n}\) denotes the input matrix with n samples, \({A} = {[}a_{ {1}} , a_{ {2}} , \ldots , a_{k} {]} \in {R}^{m \times k}\) denotes the mixing matrix, \({S} = {[}s {(1),}s {(2),}\ldots , s {(}n {)]} \in {R}^{k \times n}\) denotes the IC matrix. \({E} \in {R}^{m \times n}\) denotes the residual matrix. ICA aims to calculate a separating matrix W to enable the components of the reconstructed data matrix \({\hat{S}}\) to be as independent of each other as possible. \({\hat{S}}\) is expressed as follows:

$${\hat{S}} = {WX}$$
(23)

There are many types of ICA algorithms. The fixed-point algorithm (also known as FastICA) converges quickly and has achieved remarkable performance. FastICA is widely used in signal processing because it converges easily and has good separation performance. The algorithm can estimate, from the observed signals, the original signals that are statistically independent and have been mixed by unknown factors.
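
A compact sketch of the fixed-point iteration is given below for illustration. It makes several assumptions that are not taken from the text: the data are whitened first, the symmetric variant of the update is used, \(\tanh\) is chosen as the nonlinearity, the number of ICs equals the number of measurements, and the residual \(E\) in Eq. (22) is neglected. The two synthetic sources and the mixing matrix are hypothetical.

```python
import numpy as np


def whiten(X):
    """Center and whiten X of shape (m, n): variables in rows, samples in columns."""
    Xc = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(Xc))
    Q = (E / np.sqrt(d)).T  # whitening matrix D^{-1/2} E^T
    return Q @ Xc, Q


def sym_decorrelate(W):
    """Symmetric orthogonalization W <- (W W^T)^{-1/2} W."""
    s, U = np.linalg.eigh(W @ W.T)
    return (U / np.sqrt(s)) @ U.T @ W


def fast_ica(X, n_iter=200, tol=1e-8, seed=0):
    """Return the estimated ICs and a separating matrix acting on centered X."""
    Z, Q = whiten(X)
    k, n = Z.shape
    rng = np.random.default_rng(seed)
    W = sym_decorrelate(rng.normal(size=(k, k)))
    for _ in range(n_iter):
        G = np.tanh(W @ Z)
        # Fixed-point update: E[z g(w^T z)] - E[g'(w^T z)] w for every row w of W.
        W_new = G @ Z.T / n - np.diag((1.0 - G ** 2).mean(axis=1)) @ W
        W_new = sym_decorrelate(W_new)
        converged = np.max(np.abs(np.abs(np.diag(W_new @ W.T)) - 1.0)) < tol
        W = W_new
        if converged:
            break
    return W @ Z, W @ Q


# Two non-Gaussian sources mixed by a hypothetical mixing matrix A.
rng = np.random.default_rng(1)
S = np.vstack([rng.uniform(-1, 1, 5000), rng.laplace(size=5000)])
A = np.array([[1.0, 0.5], [0.3, 1.0]])
S_hat, W_sep = fast_ica(A @ S)
print(S_hat.shape)  # (2, 5000)
```

The recovered components match the true sources only up to permutation, sign, and scale, which is an inherent ambiguity of ICA.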

ICA has drawn considerable attention in FDD. Lee et al. [108] analyzed the defects of the original ICA algorithm and presented a novel multivariate statistical process monitoring (MSPM) method based on modified ICA. Zhang et al. [109] proposed a modified ICA algorithm based on particle swarm optimization (PSO-ICA) for the purpose of MSPM. Hsu et al. [110] integrated ICA and SVM (ICA-SVM) for monitoring multivariate processes, where ICA was used to extract the hidden information of a non-Gaussian process. In order to solve the problem of complex industrial process monitoring, Zhang et al. [111] combined kernel ICA (KICA) and LSSVM to establish the industrial process monitoring model. Li et al. [112] proposed a correlated and weakly correlated process fault detection approach based on variable division and ICA. Wang et al. [113] proposed a totally data-driven ICA model to improve the ability of monitoring non-Gaussian processes.

2.4.2.4 GMM-based approaches

A Gaussian model uses the Gaussian probability density function (normal distribution curve) to quantify data, and a mixture model decomposes a distribution into several components, each based on a Gaussian probability density function. Gaussian mixture model (GMM) is a clustering algorithm, which belongs to unsupervised learning. It adapts to input data with complex distributions (e.g., multimodal or nonlinear distributions) and can monitor the health status of an industrial process without prior knowledge of abnormal patterns.

GMM can be used in general process applications, e.g., data clustering analysis, process monitoring, dimension reduction, and data visualization. Yu et al. [56] proposed a multi-modal process monitoring method based on finite Gaussian mixture models and a Bayesian inference strategy for complex multi-condition industrial processes. To address the particularity of semiconductor processes, the nonlinearity of most batch processes, and the multipeak intermittent trajectories under multiple working conditions, Yu [114] proposed a principal component-based GMM (PCGMM). After that, in order to find meaningful low-dimensional information hidden in high-dimensional observations, Yu [115] proposed a manifold learning feature extraction algorithm based on local and nonlocal preserving projection (LNPP) and then used GMM to process data with nonlinear or multimodal features. Yu [32] proposed local/nonlocal manifold regularization-based GMM (LNGMM) to estimate process data distributions with nonlinear and multimodal characteristics. Yu [116] presented a GMM-based process patterns recognition (PPR) model, which employs a collection of several GMMs trained for process pattern recognition. Jie [117] proposed a nonlinear kernel Gaussian mixture model-based inferential monitoring approach for FDD of chemical processes.

2.4.2.5 HMM-based approaches

Hidden Markov model (HMM) is a statistical model used to describe a Markov process with hidden unknown parameters. The difficulty is to determine the hidden parameters of the process from the observable parameters and then use these parameters for further analysis, e.g., pattern recognition. HMM contains a finite number of states, in which each state generates an observation at a certain time point. HMM involves two random processes: one is a random transition from one state to another, and the other is the generation of a random output symbol in each state. Thus, the states can only be observed through another set of random processes. The actual state sequence is hidden and cannot be observed directly.

For an HMM with N states and M observation symbols, the observation sequence up to time \(T\) is denoted as \(O=\{{O}_{1},{O}_{2},\dots ,{O}_{T}\}\), where \({O}_{t}\in \{{v}_{1},{v}_{2},\dots ,{v}_{M}\}\), and the corresponding state sequence is denoted as \(Q=\{{q}_{1},{q}_{2},\dots ,{q}_{T}\}\), where \({q}_{t}\in \{{S}_{1},{S}_{2},\dots ,{S}_{N}\}\). Assuming that the underlying state sequence evolves as a Markov process, the transition probability matrix is expressed as \(A=\left\{{a}_{ij}\right\},1\le i,j\le N\), where \({a}_{ij}=P({q}_{t+1}={S}_{j}|{q}_{t}={S}_{i})\), and \({q}_{t}\) represents the hidden state at time \(t\). For an HMM, these states are not observable. Instead, the observed value depends on the current state and is described by the observation probability matrix \(B=\{{b}_{j}\left(k\right)\}(1\le j\le N, 1\le k\le M)\), where \({b}_{j}(k)=P({O}_{t}={v}_{k}|{q}_{t}={S}_{j})\) is the conditional probability of observing \({v}_{k}\) in state \({S}_{j}\). In addition, the initial state probability distribution is \(\pi =\{{\pi }_{i}\}\), where \({\pi }_{i}=P({q}_{1}={S}_{i})\). An HMM can thus be represented by \(\uplambda =\{A,B,\pi \}\). HMM adopts three basic algorithms, namely the forward-backward algorithm, the Baum-Welch algorithm, and the Viterbi algorithm, which are used with \(\uplambda\) for model parameter learning and recognition.
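
As an illustration of how a model \(\uplambda =\{A,B,\pi \}\) is used for recognition, the sketch below implements the forward pass, which computes the likelihood of an observation sequence given the model. The two-state, two-symbol parameters are hypothetical, with `A[i, j]` the transition probability from state \(i\) to state \(j\), `B[j, k]` the probability of observing symbol \(k\) in state \(j\), and `pi[i]` the initial state probability; no scaling is applied, so this plain version would underflow on long sequences.

```python
import numpy as np


def forward_likelihood(A, B, pi, obs):
    """Likelihood P(O | lambda) of a sequence of symbol indices obs."""
    alpha = pi * B[:, obs[0]]          # initialization at t = 1
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate one step, then weight by emission
    return alpha.sum()


# Hypothetical model with N = 2 hidden states and M = 2 observation symbols.
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.8, 0.2],
              [0.3, 0.7]])
pi = np.array([0.6, 0.4])

print(forward_likelihood(A, B, pi, [0, 1]))
```

In process monitoring, a low likelihood of recent observations under a model trained on normal data can be taken as an indication of abnormal behavior.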

In the past years, HMMs have been applied to the process industry. In order to effectively monitor multimode processes, Wang et al. [52] proposed a process monitoring scheme based on orthogonal nonnegative matrix factorization (ONMF) and HMM. Afzal et al. [53] proposed a monitoring method based on HMM for multimodal processes with mode reachability constraints. Lou et al. [54] combined hidden semi-Markov model (HSMM) with PCA (HSMM-PCA) and introduced mode duration probability into HMM. Regarding the complex nonlinear industrial process, Peng et al. [118] combined HMM and KPCA for fault detection of nonlinear multimodal process.

2.4.2.6 SVDD-based approaches

Support vector data description (SVDD) is a one-class classification algorithm, which can distinguish target samples from non-target samples, and is usually applied in anomaly detection and fault detection. This method does not require the process data to be Gaussian. Thus, SVDD is considered a promising method for non-Gaussian process monitoring. Cho [149] dealt with the data description and noise filtering issues based on SVDD and applied it to process fault detection. Ge et al. [119] further proposed an SVDD-based reconstruction algorithm for sensor fault identification and isolation of multivariate processes. Based on FDD, Yao et al. [120] extended the one-class classification method of SVDD to functional SVDD (FSVDD) and applied it to batch process monitoring. To further improve the competence of ICA, Chen et al. [121] integrated ICA, the Durbin–Watson (DW) criterion, and SVDD to monitor non-Gaussian processes for fault detection.

2.4.2.7 K-means-based approaches

K-means is an iterative clustering analysis algorithm, and the calculation steps are as follows: (1) Specify the number of groups k into which the data are to be divided; (2) Randomly select k objects as the initial cluster centers; (3) Calculate the distance between each object and each cluster center; (4) Assign each object to the nearest cluster center; (5) Recompute each cluster center as the mean of the objects assigned to it, and repeat steps (3) to (5) until the assignments no longer change. K-means is a widely used unsupervised machine learning clustering algorithm. The main application of this method in process industry is to divide the process data into different operation modes, different fault types, or different grades of products. For example, Lv et al. [122] employed K-means to derive the segmentation rule of variable subspace for batch process monitoring. Majid et al. [63] used K-means as a data mining tool to isolate different types of faults in aluminum smelting process. Zhou et al. [123] combined K-means and PCA for fault detection and identification in multiple processes. Tong et al. [51] proposed an adaptive multimode process monitoring strategy based on K-means.
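
The iterative procedure can be sketched as follows. The example assumes Euclidean distance, includes the center update and repetition that complete the iteration, and uses two synthetic, well-separated groups standing in for two operation modes; the function name and data are hypothetical.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means on the rows of X; returns the labels and cluster centers."""
    rng = np.random.default_rng(seed)
    # Randomly select k objects as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance between each object and each cluster center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # Assign each object to the nearest cluster center.
        labels = dist.argmin(axis=1)
        # Recompute each center as the mean of its members (keep it if empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Two synthetic operation modes centered at (0, 0) and (10, 10).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])
labels, centers = kmeans(X, k=2)
print(np.round(centers, 1))
```

The result depends on the random initialization, so in practice the algorithm is often run several times and the solution with the smallest within-cluster distance is kept.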

2.4.2.8 SOM-based approaches

Self-organizing map (SOM) is an unsupervised learning algorithm derived from competitive learning. It uses a neighborhood function to preserve the topological properties of the input space. It usually represents the input space of the training samples by a low-dimensional discretized map. SOM can perform dimension reduction and modeling for highly discrete and nonlinear data.

When SOM is used to monitor the production process, a certain amount of controlled data is first collected to create and train the SOM model and generate the data feature space model of the controlled process. Then, the observation vector is collected online from the manufacturing process, and the best matching unit (BMU) is obtained by comparing it with the weight vectors of all units in the SOM. The distance between the input vector and the BMU is defined as the minimum quantization error (MQE) [124] as follows:

$$\mathrm{MQE}=\left|\left|D-{W}_{\mathrm{BMU}}\right|\right|$$
(24)

where \(D\) is the input vector and \(W_{{{\text{BMU}}}}\) is the weight vector of BMU. The size of MQE represents the distance between the input vector and the normal state space. MQE can be used as a process monitoring indicator.
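
The MQE computation can be illustrated with a short sketch. It assumes that an SOM has already been trained on normal data and that its unit weight vectors are available as the rows of a matrix; the four weight vectors and the two test inputs below are hypothetical, and SOM training itself is not shown.

```python
import numpy as np


def mqe(d, weights):
    """Return (MQE, BMU index) for input vector d and SOM weights of shape (units, p)."""
    dist = np.linalg.norm(weights - d, axis=1)  # distance to every unit
    bmu = int(dist.argmin())                    # best matching unit
    return float(dist[bmu]), bmu


# Hypothetical weight vectors of a trained 2 x 2 map describing the normal state.
weights = np.array([[0.0, 0.0],
                    [1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]])

print(mqe(np.array([0.9, 0.1]), weights))  # small MQE, close to the normal state
print(mqe(np.array([3.0, 3.0]), weights))  # large MQE, far from the normal state
```

A control limit on the MQE, for example estimated from its distribution on normal data, would then turn this distance into a monitoring decision.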

In the past few years, SOM has been widely used in process industry. Yu et al. [124] proposed a process monitoring method based on SOM and developed a novel minimum quantization error (MQE) chart for monitoring process changes. Corona et al. [125] introduced the advantages of SOM in combination with industrial data and its application in a series of process measurements in industrial gas processing plants. In order to clearly show the occurrence of complex chemical process fault, Chen et al. [128, 150] proposed a novel fault diagnosis method that combines SOM with correlative component analysis (CCA). Yu et al. [126] proposed a SOM-based methodology for process FDD with nonlinear and non-Gaussian features. In order to solve the problems of highly dimensional input variables, high correlation among some input variables, overlap among the input variable spaces of different fault classes, and invisible distribution of fault classes, Song et al. [127] integrated canonical variate analysis (CVA) with multiple SOM (multi-SOM) for process monitoring and fault diagnosis.

3 Process fault detection and diagnosis based on deep learning at present

The generalization performance and self-learning performance of traditional machine learning-based process FDD models are insufficient because it is difficult for them to extract effective features from the process data. Thus, it is necessary to extract features from the original data, and then achieve effective process FDD. Process control in industry refers to the automatic control using the real-time data collected by computer and taking the process parameters (e.g., temperature, pressure, flow, liquid level and composition) as the controlled variables. It usually includes control of the whole industrial process. This section reviews the typical deep learning methods and their applications in process FDD in recent years. First, this section presents detection and diagnosis procedure of industrial process by using deep learning. The process FDD procedures are mainly divided into data collection, deep learning model construction, and process monitoring. Second, the application of deep learning in industrial processes is reviewed.

3.1 Overview

With the rapid development of internet technologies and the Internet of Things (IoT), the amount of data collected is unprecedented. Due to the constraints of the number of model parameters and operation speed, traditional machine learning-based process FDD methods are not suitable for industrial big data scenarios. To address this, deep learning-based process FDD methods have been proposed in recent years. They use a deep hierarchical structure to automatically learn abstract features from process data and then directly establish the relationship between the learned features and the target output, which can effectively cope with the data explosion. As shown in Fig. 6, the deep learning-based process FDD procedure mainly includes three steps, namely big data collection, deep learning-based model construction, and process monitoring. Each step is presented in the following subsections.

Fig. 6 Fault detection and diagnosis procedure of deep learning-based methods

3.2 Step 1: big data collection and preprocessing

Big data refer to massive amounts of information. In modern industrial processes, this information comes from the data generated by equipment operation. With the development of sensor and computing technology, the amount of data increases almost exponentially. Different sensors are usually employed, e.g., vibration, acoustic emission, temperature, current transformer. Four main types of big data are collected from these sensors, i.e., vibration signals, current signals, audio signals, and images [151]. However, the big data are not fully utilized in knowledge mining or intelligent decision-making. Big data have the following four characteristics, i.e., volume, variety, velocity, and value [152, 153]: (1) Volume. Due to the development of data storage technology, the amount of information collected during long-term machine operation continues to grow, forming a huge historical database. (2) Variety. Multisource data collected by different types of sensors and from a wide range of data sources determine the diversity of the forms of big data. (3) Velocity. Data are generated very fast; they are mainly transmitted through the Internet, and high-speed transmission channels allow data to be collected from the machine immediately. (4) Value. The collected big data contain incomplete information, and a large amount of the data is mixed with poor-quality data.

With the advent of the big data era, machine learning has attracted increasing interest for various applications. Lavasani et al. [154] discussed the latest applications of big data in the chemical industry and stressed the necessity of big data analysis in various fields of process engineering. Shu et al. [155] proposed a new abnormal situation management (ASM) framework to solve the big data problem in the cloud computing environment of a big chemical corporation. Onel et al. [87] presented a new data-driven batch process monitoring model, which uses nonlinear SVM-based feature selection for big data fault detection and diagnosis. Aiming at plant-wide processes with big data, Yao et al. [156] proposed a distributed parallel modeling and monitoring framework for process fault detection and diagnosis. Jiang et al. [157] proposed a local–global modeling and distributed computing method to achieve efficient fault detection and isolation for nonlinear plant-wide processes.

After obtaining big data, data preprocessing is needed to improve data quality, because the quality of data determines the effectiveness of the deep learning model. If the quality of process data cannot be well guaranteed, the deep learning model may produce misleading results. If the data are too noisy, a data processing step is required. For example, Guo et al. [158] used the improved local entropy method to eliminate the multimodal and non-Gaussian characteristics of the raw data and improve the fault detection performance of locality preserving projection (LPP) in multimodal industrial processes. Geng et al. [159] used the parameters of the variable modulus decomposition for data preprocessing to suppress the data noise of nonlinear complex chemical processes and fed the data into a sparse principal component analysis (SPCA). Alrifaey et al. [160] used wavelet packet transform (WPT) to process the collected photovoltaic voltage signal and fed the preprocessed signal into a long short-term memory (LSTM) model. Thus, data cleaning and quality assessment should be carried out before the data are fed to deep learning models [45]. Xu et al. [161] reviewed data cleaning in process industry and introduced different data cleaning methods (i.e., missing data interpolation [162], anomaly detection [163], noise elimination [159], and time alignment [164]). Ding et al. [163] proposed a system for industrial time series cleaning, which can discover knowledge from high-quality time series and be used for industrial production optimization and anomaly detection. To sum up, data cleaning and quality assessment mainly include the following aspects: (1) Appropriate methods should be used to deal with missing data (e.g., sample deletion, missing value estimation, Bayesian inference). (2) The outliers in the data should be detected, and the problem of different sampling rates between process variables should also be considered. (3) The noise from the process itself and measuring equipment should be fully considered. The noise can be removed by filtering methods. (4) The scale difference between process variables needs to be considered. In the face of data imbalance, the performance of the learners can be improved by data scaling and data conversion.
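
The four aspects above can be sketched for a single process variable as follows. The concrete choices are assumptions made for illustration only and are not taken from the cited works: linear interpolation for missing values, clipping at three scaled median absolute deviations for outliers, a moving-average filter for noise, and z-score scaling. All function names and the sample signal are hypothetical.

```python
import numpy as np


def fill_missing(x):
    """Replace NaN entries of a 1-D signal by linear interpolation."""
    x = np.asarray(x, dtype=float).copy()
    nan = np.isnan(x)
    idx = np.arange(len(x))
    x[nan] = np.interp(idx[nan], idx[~nan], x[~nan])
    return x


def clip_outliers(x, n_mad=3.0):
    """Clip values lying more than n_mad scaled MADs away from the median."""
    med = np.median(x)
    mad = 1.4826 * np.median(np.abs(x - med))  # robust estimate of the spread
    return np.clip(x, med - n_mad * mad, med + n_mad * mad)


def moving_average(x, window=3):
    """Simple moving-average filter; values near the edges are attenuated."""
    return np.convolve(x, np.ones(window) / window, mode="same")


def zscore(x):
    """Scale a variable to zero mean and unit standard deviation."""
    return (x - x.mean()) / x.std()


# Hypothetical raw signal with one missing value and one spike.
raw = np.array([1.0, 1.1, np.nan, 0.9, 1.0, 50.0, 1.05, 0.95])
clean = zscore(moving_average(clip_outliers(fill_missing(raw))))
print(np.round(clean, 2))
```

The order of the steps matters: scaling is applied last so that the mean and standard deviation are not distorted by missing values or outliers.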

3.3 Step 2: deep learning model construction

The process control method based on deep learning can effectively deal with big data scenarios, and the development of deep learning is shown in Fig. 7. Overall, the development of deep learning can be divided into four stages: embryonic stage, the first climax, the second climax, and the third climax [150].

Fig. 7 Development of deep learning

In the embryonic stage, the McCulloch-Pitts (MP) model was proposed in 1943 by Warren McCulloch and Walter Pitts. As the origin of ANN, the MP model not only opened a new era of ANN but also laid the foundation of the neural network model. In the late 1950s, based on research on the MP model and Hebb learning rules, a learning algorithm similar to the human learning process, perceptron learning, was proposed. In 1958, a neural network composed of two layers of neurons was formally proposed, which is called the "perceptron." The perceptron attracted the interest of a large number of scientists in ANN research and was a milestone in the development of neural networks.

In the second climax, the physicist John Hopfield invented the Hopfield neural network in 1982. However, because it easily falls into local minima, this algorithm did not cause a great sensation at that time. In 1986, a back-propagation algorithm for the multilayer perceptron, the BP algorithm, was proposed. The BP algorithm effectively solved the nonlinear classification problem, so that ANN attracted extensive attention again. However, due to the limited hardware level of computers in the 1980s, the problem of “vanishing gradient” appeared when the scale of the neural network increased. This greatly limited the development of the BP algorithm, and the development of ANN entered a bottleneck period again.

In the third climax, the concept of deep learning was formally proposed in 2006. This method describes the solution of the problem of vanishing gradient in detail. The proposal of deep learning immediately aroused great repercussions in the academic circle and then quickly spread to the industry.

In recent years, many novel deep learning methods have been proposed for complex real production and manufacturing systems, e.g., GAN, GNN, DGPs, which solve the problems of unbalanced data types and model migration under different datasets.

3.3.1 AE-based approaches

3.3.1.1 A brief introduction to AE

AE is an unsupervised deep learning model that consists of an encoder and a decoder [165]. As depicted in Fig. 8, a basic AE is a neural network with three layers: an input layer, a hidden layer, and an output layer. Unlike a conventional feedforward network, the input and output layers of an AE have the same number of neurons. Given the dataset \(\{\left({x}_{1},{y}_{1}\right),\left({x}_{2},{y}_{2}\right),\dots ,\left({x}_{m},{y}_{m}\right)\}\) with \(m\) samples, the represented features \({h}_{i}\) are defined as:

Fig. 8 Structure of AE

$${h}_{i}={f}_{e}({\omega }^{T}\bullet {x}_{i}+b)$$
(25)

The decoding step tries to reconstruct input values from hidden values; the reconstructed sample \(\widehat{{x}_{i}}\) is expressed as follows:

$$\widehat{{x}_{i}}={f}_{d}({\omega \mathrm{^{\prime}}}^{T}\bullet {h}_{i}+b\mathrm{^{\prime}})$$
(26)

where \({f}_{e},{f}_{d}\) are the activation functions of the encoder and decoder networks, respectively, and \(\left\{\omega ,b\right\},\left\{\omega {^{\prime}},b{^{\prime}}\right\}\) represent the training parameters of the encoder and decoder networks, respectively. The loss function is defined with the squared error as follows:

$$J\left(\omega ,{\omega }^{\mathrm{^{\prime}}},b,{b}^{\mathrm{^{\prime}}}\right)=\frac{1}{2m}\sum_{i=1}^{m}{(\widehat{{x}_{i}}-{x}_{i})}^{2}$$
(27)
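
As a minimal sketch of Eqs. (25) to (27), the forward pass and reconstruction loss of a single-hidden-layer AE can be written in NumPy as follows. The sigmoid activation, the layer sizes, and the random data are illustrative assumptions and are not taken from the cited works.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ae_forward(X, W, b, W_dec, b_dec):
    """X: (m, d) samples; W: (d, k) encoder weights; W_dec: (k, d) decoder weights."""
    H = sigmoid(X @ W + b)               # Eq. (25): hidden features h_i
    X_hat = sigmoid(H @ W_dec + b_dec)   # Eq. (26): reconstructed samples
    return H, X_hat

def ae_loss(X, X_hat):
    """Eq. (27): squared reconstruction error averaged over m samples, times 1/2."""
    m = X.shape[0]
    return np.sum((X_hat - X) ** 2) / (2 * m)

# Hypothetical toy data: 5 samples, 4 variables, 2 hidden neurons.
rng = np.random.default_rng(0)
X = rng.random((5, 4))
W, b = rng.normal(scale=0.1, size=(4, 2)), np.zeros(2)
W_dec, b_dec = rng.normal(scale=0.1, size=(2, 4)), np.zeros(4)

H, X_hat = ae_forward(X, W, b, W_dec, b_dec)
loss = ae_loss(X, X_hat)
```

Training would then minimize this loss with respect to the four parameter sets, typically by gradient descent.
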
3.3.1.2 Application of AE and its variations-based approaches to fault detection and diagnosis

AE and its common varieties have been applied to process FDD. Gavneet et al. [166] compared deep stacking networks and sparse stacked AE for process fault detection. Yan et al. [167] separated the feature extraction and model construction by designing teacher and supervise dual stacked auto-encoder (TSSAE) for quality-relevant fault detection in industrial process. Yu et al. [168] presented an effective deep learning method known as stacked denoising AE (SDAE) for process pattern recognition (PPR) in manufacturing processes. Yu et al. [169] concentrated on developing a SDAE model for multivariate process pattern recognition to learn effective discriminative features from the process signals through deep network architecture. Zhang et al. [170] integrated one-dimensional CNN (1-DCNN) and SDAE to extract high level features from complex process signals. Li et al. [171] proposed a distributed ensemble stacked AE (DE-SAE) model based on deep learning technology for monitoring non-linear, large-scale, multi-unit processes. Liu et al. [172] proposed a new DNN, residual attention convolutional AE (RACAE) for complex nonlinear process monitoring. Yu et al. [173] suggested a one-dimension residual convolutional AE (1DRCAE) model, which used unsupervised learning to extract representative features from complex industrial processes. Li et al. [174] proposed a slow feature analysis-aided autoencoder (SFA-AE) for interpretable process monitoring. It enables the learning of deep slow variation patterns from the high-level features extracted by the AE.

Zhang et al. [175] presented a new DNN, manifold regularized stacked AE (MRSAE) for fault detection in complex industrial processes. In this study, the fault detection process based on MRSAE is divided into two stages: offline modeling and online monitoring, as shown in Fig. 9. Their testing results reveal that MRSAE is effective in learning representative features from the complicated data for process fault detection.

Fig. 9 Procedure of MRSAE-based process fault detection

The off-line modeling phase includes six steps. Step 1 is to collect the training dataset under normal operation. Step 2 is to normalize the samples in the dataset to the range [0, 1]. Step 3 is to train an MRSAE model in an unsupervised way with the greedy layer-wise training algorithm. Step 4 is to generate the feature and residual spaces from the well-trained model. Step 5 is to calculate the monitoring statistics, i.e., T-squared (\({T}^{2}\)) and squared prediction error (SPE). Step 6 is to set up the thresholds by using the kernel density estimation (KDE) method.

The on-line monitoring phase includes five steps. Step 1 is to collect a new observation x from the process online. Step 2 is to normalize the input to the range [0, 1]. Step 3 is to input x to the well-trained MRSAE model and project it into the feature space and the residual space. Step 4 is to compute the \({T}^{2}\) and SPE statistics from the feature and residual spaces, respectively. Step 5 is to trigger an alarm once either of the two statistics exceeds its corresponding threshold.
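
The two phases can be sketched as follows. This is only an illustration of the workflow: a linear PCA projection is used as a hypothetical stand-in for the trained MRSAE, the data are synthetic, and the KDE bandwidth rule and the 99% confidence level are assumptions rather than settings reported in [175].

```python
import math
import numpy as np

def kde_threshold(stats, alpha=0.99, grid_size=1000):
    """Control limit: smallest grid value where the Gaussian-KDE CDF reaches alpha."""
    h = 1.06 * stats.std() * stats.size ** (-0.2)   # normal-reference bandwidth (assumed)
    grid = np.linspace(stats.min(), stats.max() + 5 * h, grid_size)
    z = (grid[:, None] - stats[None, :]) / (h * math.sqrt(2.0))
    cdf = (0.5 * (1.0 + np.vectorize(math.erf)(z))).mean(axis=1)
    return grid[np.searchsorted(cdf, alpha)]

class LinearStandIn:
    """PCA projection used here in place of the trained MRSAE (assumption)."""
    def fit(self, X, k):
        self.mean = X.mean(axis=0)
        _, S, Vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.P = Vt[:k].T                              # (d, k) projection
        self.var = S[:k] ** 2 / (X.shape[0] - 1)       # variance of each feature
        return self

    def statistics(self, X):
        Xc = X - self.mean
        F = Xc @ self.P                                # feature space
        R = Xc - F @ self.P.T                          # residual space
        return np.sum(F ** 2 / self.var, axis=1), np.sum(R ** 2, axis=1)

# Off-line modeling: Steps 1 to 6 on synthetic normal-operation data.
rng = np.random.default_rng(0)
X_normal = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 6)) \
    + 0.05 * rng.normal(size=(500, 6))
lo, hi = X_normal.min(axis=0), X_normal.max(axis=0)

def scale(X):
    return (X - lo) / (hi - lo)                        # normalize to [0, 1]

model = LinearStandIn().fit(scale(X_normal), k=2)
t2_train, spe_train = model.statistics(scale(X_normal))
t2_lim, spe_lim = kde_threshold(t2_train), kde_threshold(spe_train)

# On-line monitoring: Steps 1 to 5 for one new observation.
def monitor(x_new):
    t2, spe = model.statistics(scale(np.atleast_2d(x_new)))
    return bool(t2[0] > t2_lim or spe[0] > spe_lim)    # True means alarm

x_fault = X_normal[0].copy()
x_fault[0] += 5 * (hi[0] - lo[0])                      # simulated sensor bias
alarm = monitor(x_fault)
```

Replacing `LinearStandIn` with a nonlinear encoder and decoder leaves the rest of the procedure unchanged, since the thresholds depend only on the statistics computed from normal data.
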

3.3.2 CNN-based approaches

3.3.2.1 A brief introduction to CNN

CNN, as a deep feed-forward neural network, has been commonly used in supervised learning problems such as image recognition, computer vision, and target tracking [176]. A CNN model includes convolutional layers, pooling layers, and fully connected layers [39]. The fully connected layer has the same structure and operation mode as the conventional feedforward neural network. The structure of a CNN for processing two-dimensional data is given in Fig. 10.

Fig. 10 Network structure of CNN

A convolutional layer consists of multiple learnable kernels. Each kernel has trainable weights and a bias. The operation in the \(l\)th layer between the \(j\)th kernel and the input data \({x}^{l-1}\) can be described by:

$${x}_{j}^{l}=f\left(\sum_{i\in {M}_{j}}{x}_{i}^{l-1}*{k}_{j}^{l}+{b}_{j}^{l}\right)$$
(28)

where (*) represents the convolution operation and \(f\) is the rectified linear unit (ReLU) activation function. Assume that the input data \({x}^{l-1}\) include \(m\) 2-D matrices. Every input matrix \({x}_{i}^{l-1}(i\in m)\) is convolved with the kernel \({k}_{j}\), and the bias is added to the sum of all convolution results. Finally, the result is fed into the activation function \(f\) to produce the final output of kernel \(j\).
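
The computation described above can be sketched in NumPy as follows. The "valid" border handling, the shared kernel across input matrices, and the toy inputs are assumptions made for illustration only.

```python
import numpy as np

def conv2d_valid(x, k):
    """'Valid' 2-D convolution of one input matrix x with kernel k (kernel flipped)."""
    k = k[::-1, ::-1]
    kh, kw = k.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * k)
    return out

def conv_kernel_output(x_prev, k_j, b_j):
    """Output of kernel j in Eq. (28).

    x_prev: (m, H, W) stack of input matrices; k_j: (kh, kw) kernel; b_j: scalar bias.
    """
    total = sum(conv2d_valid(x_i, k_j) for x_i in x_prev)  # sum over input matrices
    return np.maximum(total + b_j, 0.0)                    # ReLU activation f

# Hypothetical toy input: two 3x3 matrices of ones and a 2x2 kernel of ones.
x_prev = np.ones((2, 3, 3))
k_j = np.ones((2, 2))
y = conv_kernel_output(x_prev, k_j, b_j=1.0)
```

Practical implementations vectorize these loops and usually compute cross-correlation (no kernel flip), which is equivalent up to a relabeling of the learned kernel.
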

3.3.2.2 Application of CNN

CNN can map low-dimensional shallow features to high-dimensional high-level features and has been applied to many fields, e.g., image identification, text generation, machine translation, and video classification. Zhang et al. [177] proposed an amplitude-frequency images-based CNN (ConvNet) for FDD in chemical processes. Each ConvNet works on a specific fault, so a flexible FDD framework can be trained and extended. Kim et al. [178] proposed a self-attentive CNN to detect and diagnose faults directly from variable-length status variable identification (SVID) data. In order to extract effective features of complex multivariable processes and improve fault diagnosis performance, Chen et al. [179] proposed a multivariable process fault diagnosis model based on CNN feature learning. Lee et al. [180] proposed a CNN model in which the receptive field slides along the time axis of the multivariable sensor signals to extract fault features of the semiconductor manufacturing process. Wu et al. [181] developed a deep CNN (DCNN) model for chemical process fault diagnosis and verified the superiority of this method in TEP. Chen et al. [182] proposed a one-dimensional convolutional auto-encoder (1D-CAE) model for FDD of multivariable processes. Zheng et al. [183] proposed a hybrid system integrating SVM and CNN for pattern recognition of multivariable processes. Zheng et al. [184] presented a fault detection method based on a convolutional gated recurrent unit auto-encoder for TEP. Hsu et al. [185] presented a multiple time-series CNN (MTS-CNN) model for FDD in the semiconductor manufacturing process. In order to update the diagnosis model effectively to include newly arriving abnormal samples, Yu et al. [186] proposed a broad convolutional neural network (BCNN) with incremental learning capability for industrial processes.

Zhang et al. [187] proposed a new DNN, multichannel one-dimensional CNN (MC1-DCNN) to investigate feature learning from high-dimensional process signals. The application procedure of MC1-DCNN-based fault diagnosis comprises an off-line modeling phase and an on-line testing phase, which is presented in Fig. 11. The recognition rate of MC1-DCNN for 21 faults of TEP is shown in Fig. 12. It can be seen that MC1-DCNN has significant performance in feature extraction and process fault diagnosis.

Fig. 11 Application flowchart of the MC1-DCNN-based fault diagnosis model

Fig. 12 Recognition rates (%) of MC1-DCNN in a confusion matrix

3.3.3 RNN-based approaches

3.3.3.1 A brief introduction to RNN

RNN [41] is a state-of-the-art neural network with recurrent hidden layers to perform sequence data processing and prediction. However, RNN has encountered the challenge of vanishing gradient and exploding gradient in the training procedure [188]. Thus, long short-term memory network (LSTM) is proposed to solve this problem [189]. The unit structure of RNN and LSTM is shown in Fig. 13.

Fig. 13 Unit structure of RNN and LSTM: (a) RNN, (b) LSTM

The hidden layer information of RNN only comes from the current input and the previous hidden layer information, and the calculation formula is as follows:

$${h}_{t}=tanh({W}_{h}\bullet \left[{h}_{t-1},{x}_{t}\right]+{b}_{h})$$
(29)

where \({h}_{t}\) represents the output of the current cell, \({h}_{t-1}\) represents the output of the previous cell, \({x}_{t}\) represents the input of the current cell, and \({W}_{h}\) and \({b}_{h}\) represent the weight and bias of the hidden layer, respectively.

An LSTM layer is composed of recurrently connected memory blocks, each of which contains one or more memory cells, along with three multiplicative “gate” units: the input, output, and forget gate. LSTM overcomes the problems of vanishing gradient and exploding gradient.

Forget gate The forget gate of LSTM can selectively discard part of the information from the memory unit, and the calculation is as follows:

$${f}_{t}=\sigma ({W}_{f}\bullet \left[{h}_{t-1},{x}_{t}\right]+{b}_{f})$$
(30)

where \({h}_{t-1}\) represents the output of the previous cell, \({x}_{t}\) represents the input of the current cell, \(\sigma\) represents the sigmoid function, and \({W}_{f}\) and \({b}_{f}\) represent the weight and bias of forget gate, respectively.

Input gate The input gate decides to add new information to the memory unit. This process includes two steps: one is to determine the information to be updated through sigmoid layer; the other is to generate alternative information by the tanh function.

$${i}_{t}=\sigma ({W}_{i}\bullet \left[{h}_{t-1},{x}_{t}\right]+{b}_{i})$$
(31)
$${\widetilde{C}}_{t}=tanh({W}_{c}\bullet \left[{h}_{t-1},{x}_{t}\right]+{b}_{c})$$
(32)

where \({i}_{t}\) and \({\widetilde{C}}_{t}\) are the output of the input gate and the alternative information of the memory unit, respectively, and tanh is the hyperbolic tangent function.

Memory unit The memory unit updates the memory information as follows:

$${C}_{t}={f}_{t}*{C}_{t-1}+{i}_{t}*{\widetilde{C}}_{t}$$
(33)

where \({C}_{t}\) is the information saved by the memory unit.

Output gate Finally, the output gate determines the final output based on the previous calculation. First, a sigmoid layer is used to determine which information is output, and then the tanh function is used to calculate the final result.

$${o}_{t}=\sigma ({W}_{o}\bullet \left[{h}_{t-1},{x}_{t}\right]+{b}_{o})$$
(34)
$${h}_{t}={o}_{t}*\mathrm{tanh}({C}_{t})$$
(35)

where \({o}_{t}\) and \({h}_{t}\) are the output results of the output gate and LSTM unit at the current time, respectively.
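
For illustration, one time step of an LSTM unit following Eqs. (30) to (35) can be sketched in NumPy. The parameter dictionary, the layer sizes, and the random initialization are hypothetical choices made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; every W_* in p has shape (n_hidden, n_hidden + n_input)."""
    z = np.concatenate([h_prev, x_t])               # [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])          # Eq. (30): forget gate
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])          # Eq. (31): input gate
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])      # Eq. (32): candidate memory
    c_t = f_t * c_prev + i_t * c_tilde              # Eq. (33): memory update
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])          # Eq. (34): output gate
    h_t = o_t * np.tanh(c_t)                        # Eq. (35): unit output
    return h_t, c_t

# Hypothetical sizes: 3 process variables, 4 hidden neurons, 10 time steps.
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
params = {}
for gate in "fico":
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in))
    params[f"b_{gate}"] = np.zeros(n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(10, n_in)):
    h, c = lstm_step(x_t, h, c, params)
```

The additive form of Eq. (33) is what lets information in \({C}_{t}\) persist over many steps when the forget gate stays close to one.
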

3.3.3.2 Application of RNN

LSTM is a specific kind of RNN that is mainly used to mitigate the vanishing and exploding gradient problems that arise when training on long sequences. Chadha et al. [190] presented a bidirectional RNN-based process condition monitoring and fault diagnosis method. Cheng et al. [191] proposed a new process monitoring method based on variational recurrent AE (VRAE). Ouyang et al. [192] proposed a fault detection and identification method based on the multidimensional gated recurrent unit (GRU) network for monitoring the blast furnace ironmaking process. Wang et al. [193] presented a high-level spatiotemporal feature extraction based on deep convolutional bidirectional encoder–decoder representation network with GRU cell for dynamic process fault diagnosis. Chen et al. [194] developed a deep RNN model of process variables with different time delays and established a residual graph to detect the mean shift of autocorrelated processes. Zhang et al. [195] employed a bidirectional RNN (BiRNN) to construct process FDD models with sophisticated RNN cells. Liu et al. [196] proposed a method based on the combination of LSTM and DBN. DBN-LSTM was used for feature extraction, time correlation analysis, and fault diagnosis, and the method was applied to a class of semiconductor etching process. Yu et al. [197] proposed a convolutional LSTM-AE (CLSTM-AE) for feature learning from complex process signals.

3.3.4 DBN-based approaches

3.3.4.1 A brief introduction to DBN

Restricted Boltzmann machine (RBM) is a generative stochastic neural network that can learn a probability distribution from the input dataset. DBN is a deep model constructed by stacking multiple RBMs, where the input of a layer is the output of the preceding layer.

RBM includes visible units \(v=\{{v}_{1},{v}_{2},\dots ,{v}_{m}\}\) and hidden units \(h=\{{h}_{1},{h}_{2},\dots ,{h}_{n}\}\) [198]. It is noted that all the units are binary, i.e., \({v}_{i},{h}_{j}\in \{0,1\}\). The relationship between the visible layer and the hidden layer is defined by the energy function as follows:

$$E\left(v,h\right)=-\sum_{i=1}^{m}\sum_{j=1}^{n}{\omega }_{ij}{v}_{i}{h}_{j}-\sum_{i=1}^{m}{b}_{i}{v}_{i}-\sum_{j=1}^{n}{a}_{j}{h}_{j}$$
(36)

where \(\theta =\{\omega ,a,b\}\) represents the parameters of RBM. With the energy function, the joint probability distribution of the visible and hidden units is given by:

$$p\left(v,h\right)=\frac{1}{Z}\mathrm{exp}[-E\left(v,h\right)]$$
(37)

where \(Z\) is the partition function, calculated by summing over all possible configurations of the visible and hidden units:

$$Z=\sum_{v,h}\mathrm{exp}[-E\left(v,h\right)]$$
(38)

The conditional probability distributions of each unit are defined as follows:

$$p\left({v}_{i}=1|h\right)=\frac{1}{1+\mathrm{exp}(-{b}_{i}-{\sum }_{j}{\omega }_{ij}{h}_{j})}$$
(39)
$$p\left({h}_{j}=1|v\right)=\frac{1}{1+\mathrm{exp}(-{a}_{j}-{\sum }_{i}{\omega }_{ij}{v}_{i})}$$
(40)
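
A minimal NumPy sketch of these quantities is given below, taking \(b\) as the visible bias and \(a\) as the hidden bias, as in the energy function of Eq. (36). The layer sizes and random parameters are hypothetical, and the single Gibbs sampling step is shown only to illustrate how the two conditional distributions are used in practice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def energy(v, h, W, a, b):
    """Eq. (36); W: (m, n) weights, b: (m,) visible bias, a: (n,) hidden bias."""
    return -(v @ W @ h) - b @ v - a @ h

def p_h_given_v(v, W, a):
    """p(h_j = 1 | v) for all hidden units j."""
    return sigmoid(a + v @ W)

def p_v_given_h(h, W, b):
    """p(v_i = 1 | h) for all visible units i."""
    return sigmoid(b + W @ h)

def gibbs_step(v, W, a, b, rng):
    """Sample h from p(h | v), then a new v from p(v | h)."""
    h = (rng.random(a.size) < p_h_given_v(v, W, a)).astype(float)
    v_new = (rng.random(b.size) < p_v_given_h(h, W, b)).astype(float)
    return v_new, h

# Hypothetical RBM with 6 visible and 3 hidden binary units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 3))
a, b = np.zeros(3), np.zeros(6)
v0 = rng.integers(0, 2, size=6).astype(float)
v1, h0 = gibbs_step(v0, W, a, b, rng)
```

Because there are no connections within a layer, each conditional factorizes over units, which is why a whole layer can be sampled in one vectorized operation.
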

Deep Boltzmann machine (DBM) is a deep model with many hidden layers stacked into a hierarchical structure. To obtain the DBN model, a Bayesian belief network is used in the part closer to the visible layer, while an RBM is used in the part farthest from the visible layer, as shown in Fig. 14.

Fig. 14 Network structure of RBM, DBM and DBN: (a) RBM, (b) DBM, (c) DBN

3.3.4.2 Application of DBN

Compared with traditional fault diagnosis, DBN requires little signal-processing expertise or diagnostic experience and has relatively strong adaptability, versatility, and ability to deal with high-dimensional and nonlinear data. The comparison between DBN-based process FDD and other process FDD methods is shown in Fig. 15. The raw data in Fig. 15 include the training data and testing data, which are used for DBN model training and testing, respectively. Data preprocessing is a critical step for data-based process monitoring. It transforms the raw data into an appropriate form (i.e., features) that can be used effectively for process modeling.

Fig. 15 Comparison of DBN-based process FDD with other process FDD methods

DBN is an effective method to solve the problem of FDD in industrial process. Zhang et al. [199] presented an extensible DBN-based fault diagnosis model for complex chemical processes. Tang et al. [200] proposed a fault detection model based on DBN for nonlinear processes and used the feature variables and residual variables generated by DBN to establish test statistics for abnormal monitoring of industrial processes. Kim et al. [201] proposed a DBN-based multi-classifier for fault detection prediction in the semiconductor manufacturing process. In order to select the features that are beneficial to process monitoring, Yu et al. [202] presented the concept of active feature, and applied the active features extracted by DBN (AF-DBN) to process monitoring. Based on the theoretical analysis and experimental study of unstable neurons, Yu et al. [203] proposed a novel method (UN-DBN) based on the unstable neurons in hidden layers for process monitoring. Wang et al. [204] proposed an extended deep belief network (EDBN) to make full use of the useful information in the original process data. Yu et al. [205] proposed multiple DBNs (M-DBN) to extract abstract and high-order information from each pattern in large-scale industrial processes. The training efficiency, accuracy of feature extraction, and monitoring performance of the model system are better than traditional DBN.

3.4 Step 3: process monitoring

After the deep learning-based models are constructed, they are used for process FDD. A control chart is a functional chart that can detect out-of-control signals in processes. It is an important statistical tool in on-site quality management. In process monitoring, control charts are often used to analyze and judge whether the process is in a stable state. For example, two monitoring statistics, \({T}^{2}\) and SPE, are typically constructed in process monitoring based on multivariate statistical approaches. When a possible fault is detected in the process, the next step is to diagnose it to find the root cause of the fault, reconstruct its direction and size, and identify the type of fault. Based on the above analysis, the operation process can be restored to normal conditions through process control strategies. Finally, process recovery can be completed through fault isolation [2].

3.4.1 Fault detection

Fault detection refers to the real-time monitoring of process data collected by various sensors and equipment to determine whether a fault occurs in the process. The process state is usually judged according to the selected monitoring statistics (e.g., \({T}^{2}\), SPE). If a monitoring statistic exceeds its corresponding threshold, a fault alarm is triggered. The steps of fault detection based on deep learning are shown in Fig. 16, which includes two stages: offline modeling and online monitoring.

Fig. 16 Procedure of deep learning-based process fault detection

3.4.2 Fault diagnosis

Fault diagnosis is to determine what kind of fault occurs in the process, specifically the type, magnitude, location, and time of the fault. The basic framework of process fault diagnosis based on deep learning is shown in Fig. 17, which includes five steps: state definition, data preprocessing, training, testing, and diagnosis result evaluation. The fault diagnosis method based on deep learning has the following advantages: (1) unlike shallow networks, deep learning does not require manual fault feature extraction but integrates fault feature learning and the classification model; (2) deep learning uses a network with multiple hidden layers, which can avoid the curse of dimensionality and the insufficient diagnostic ability of shallow networks.

Fig. 17 A basic framework of process fault diagnosis based on deep learning

3.4.3 Fault reconstruction

Fault reconstruction is to estimate the nominal normal measurement values from the measured values of the process variables that have been affected by the fault. The purpose is to explore the direction and size of the fault and minimize the influence of fault factors on the normal part of the data. Through fault reconstruction, the severity of the fault can be estimated, and corresponding measures can be taken to eliminate the impact of the fault on the process.

3.4.4 Fault identification

Fault identification is to find the observation variables most relevant to a fault after it occurs. Information on the fault type helps the operator quickly understand the fault and find an appropriate maintenance strategy so that the process returns to normal as soon as possible. Fault identification can be regarded as a pattern matching and recognition problem. Different data-based pattern analysis methods can be used for fault identification.

3.4.5 Fault isolation and process recovery

The process needs to be adjusted according to the analysis results. Fault isolation is to determine the location of the fault in the process and isolate the fault from other parts of the process once detailed information about the fault has been obtained. After fault isolation, the rest of the industrial process is not significantly affected. Then, based on the process control strategies, the operation process can be restored to normal conditions through process maintenance and repair (e.g., parameter selection, process optimization, equipment maintenance). Control strategies are set according to different production objectives.

4 Process fault detection and diagnosis using transfer learning and deep learning-based approaches in the future

Provided that sufficient, balanced labeled data are available, the traditional deep learning methods presented in Sect. 3 can effectively mine and extract the relationship between the input process variables and industrial process faults. However, in actual industrial processes, the fault data and normal data are often unbalanced, and the data are often incomplete and noisy. Given these characteristics, it is necessary to train an FDD model that remains reliable under such data conditions.

This section presents the most promising research prospects of process FDD in four parts. The first part presents the novel deep learning models evolved from traditional deep learning models, i.e., deep residual shrinkage network (DRSN), independent RNN (IndRNN), and transformer. The second part introduces the applications of transfer learning and GAN in process FDD. The third part presents the advantages of capsule network (CapsNet), GNN, DGP, and deep Gaussian mixture model (DGMM) in solving process FDD problems in the future. Finally, the application prospects of large-scale neural networks in process monitoring are presented.

Generally, training a deep learning model requires a large amount of process data. Possible data problems (e.g., inaccurate data, missing data, unbalanced data) have a strong negative impact on process monitoring. Moreover, it is very difficult for traditional machine learning methods to solve these problems. In recent years, DNNs have been widely applied to process FDD. For example, CNN can process high-dimensional data effectively by sharing convolution kernels. However, it needs to be trained with sufficient labeled samples. On the basis of CNN, CapsNet considers the spatial relationship of data and can achieve good performance on small-sample data. RNN can mine the sequence information in the data to handle time-varying processes, but training the model is time-consuming. As an alternative to RNN, transformer processes the sequence in parallel and can effectively reduce the high time cost of RNN training.

The deep learning-based process FDD model automatically learns features from the input data and identifies the state of the industrial process [206]. The DNNs mainly include a feature extraction layer and a classification layer. The feature extraction layer usually includes multiple hidden layers, which can be built by stacking networks (e.g., AE, CNN, RNN, DBN) to learn abstract features layer by layer. The last layer of the DNN is the classification layer, which is used to classify the extracted features. In the training process, the back propagation (BP) algorithm is often used to update the training parameters of the DNNs to minimize the error between the actual output and the target. In recent years, various novel deep learning models and techniques have been proposed (e.g., DRSN, transfer learning, GAN, CapsNets, GNN). Transfer learning and GAN can address the small-sample problem in process FDD through knowledge transfer. However, the performance of the model is affected by the source domain and network parameters. CapsNets considers the spatial relationship of data, which enables higher classification performance. However, the time cost of model training is high, which limits its further application in process FDD. In addition, some emerging deep models, e.g., GNN, DGP, DGMM, differ from traditional DNNs in network architecture. It will be interesting to apply these specific deep learning models in process FDD in the future. According to the characteristics of these models, their advantages, disadvantages, and applications in process FDD are summarized in Table 2.

Table 2 Summary of applications of deep learning in process FDD

4.1 Deep learning

Traditional DNNs, e.g., CNN, RNN, AE, DBN, have been widely applied to industrial process FDD. However, these DNNs still have the following defects: (1) the feature resolution of CNN decreases as the number of layers grows, resulting in a loss of information. In addition, owing to the fixed size of the convolution kernel, CNN is not suitable for fault diagnosis under non-stationary conditions; (2) RNN suffers from vanishing and exploding gradients and cannot easily be stacked into deeper networks; (3) AE does not capture the correlation of information. In recent years, some novel deep learning models have been developed on the basis of traditional DNNs to improve model performance. This subsection presents several new network models derived from CNN and RNN that are designed to address problems in industrial process control such as data imbalance, multiple fault categories, and time correlation, in part through attention mechanisms.

4.1.1 Deep residual shrinkage network

Deep residual shrinkage network (DRSN) is an improved variant of ResNet. Its characteristic is "shrinkage," which refers to soft thresholding, a key step in many signal denoising algorithms [215]. As shown in Fig. 18, the module with a cross-layer identity path has the following improvements compared with CNN: (1) soft thresholding is used to eliminate redundant information; (2) an SENet-type subnetwork is adopted to set the threshold automatically. Owing to soft thresholding, DRSN is more suitable than an ordinary CNN for classifying noisy data.
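
The soft thresholding operation itself is simple and can be sketched as below. In DRSN the threshold is produced by a learned subnetwork; here a hypothetical fixed scaling factor `alpha` applied to the mean absolute value of the features stands in for that subnetwork, so the second function is only an illustrative assumption.

```python
import numpy as np

def soft_threshold(x, tau):
    """Shrink values toward zero: |x| <= tau becomes 0, otherwise |x| is reduced by tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def adaptive_soft_threshold(x, alpha):
    """Threshold set as alpha times the mean absolute feature value, with 0 < alpha < 1.

    alpha is a fixed placeholder here; in DRSN it would come from a trained subnetwork.
    """
    tau = alpha * np.mean(np.abs(x))
    return soft_threshold(x, tau)

features = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
denoised = soft_threshold(features, tau=1.0)
denoised_adaptive = adaptive_soft_threshold(features, alpha=0.5)
```

Small-magnitude features, which are more likely to be noise, are set to zero, while large-magnitude features are kept with a reduced amplitude.
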

Fig. 18 Deep residual shrinkage network: (a) basic residual shrinkage module, (b) overall architecture of DRSN [215]

4.1.2 Independent RNN

RNNs are usually difficult to train and struggle to learn long-term patterns because of the gradient vanishing and exploding problems. The gated recurrent unit (GRU) was developed to address these problems, but its use of hyperbolic tangent and sigmoid activation functions leads to gradient decay across layers. Thus, it is challenging to construct an effective and trainable deep network. To solve these problems, Li et al. [216] proposed a new neural network called IndRNN. The neurons in the same layer are independent of each other and are connected across layers. IndRNN is easy to regulate, which prevents the gradient exploding and vanishing problems and allows the network to learn long-term dependencies.

The basic architecture of IndRNN is shown in Fig. 19, where "weight" and "Recursion + ReLU" represent the input processing and the recurrent processing of each step with ReLU as the activation function. In LSTM, the sigmoid and hyperbolic tangent functions attenuate the gradient across layers, whereas a non-saturated activation function such as ReLU reduces the gradient vanishing problem across layers. In addition, batch normalization, represented as "BN," can be applied before or after the activation function shown in Fig. 19.
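
A single IndRNN step can be sketched as follows, under the usual formulation in which the recurrent weight is a vector applied element-wise, so that each neuron receives only its own previous state. The layer sizes and random inputs are hypothetical, and batch normalization is omitted.

```python
import numpy as np

def indrnn_step(x_t, h_prev, W, u, b):
    """h_t = ReLU(W x_t + u * h_{t-1} + b).

    W: (n_hidden, n_in) input weights; u: (n_hidden,) recurrent weights, one per neuron.
    """
    return np.maximum(W @ x_t + u * h_prev + b, 0.0)

# Hypothetical layer with 3 inputs and 4 independent recurrent neurons.
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W = rng.normal(scale=0.5, size=(n_hidden, n_in))
u = rng.uniform(-1.0, 1.0, size=n_hidden)
b = np.zeros(n_hidden)

h = np.zeros(n_hidden)
for x_t in rng.normal(size=(10, n_in)):
    h = indrnn_step(x_t, h, W, u, b)
```

Because the recurrence of each neuron is governed by a single scalar in `u`, constraining the magnitude of that scalar gives direct control over how gradients grow or decay through time. Interaction between neurons is obtained by stacking layers.
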

Fig. 19 Basic network structure of IndRNN

4.1.3 Transformer

Transformer addresses the most criticized shortcoming of RNN, namely slow training, and uses the self-attention mechanism to achieve fast parallel computation. The structure of transformer is composed of an encoder and a decoder [217]. The overall architecture of transformer is shown in Fig. 20a, which uses stacked self-attention and point-wise, fully connected layers for both the encoder and the decoder. Transformer uses multi-head attention to linearly project the queries, keys, and values to \({d}_{k}\), \({d}_{k}\), and \({d}_{v}\) dimensions, respectively, \(h\) times with different linear projections. On each projected version of the queries, keys, and values, it performs the attention function in parallel to generate \({d}_{v}\)-dimensional output values. These values are concatenated and projected again to output the final value, as depicted in Fig. 20b.
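
The multi-head attention computation can be sketched in NumPy as follows. The scaled dot-product form of the attention function is the standard one for transformer; the dimensions, the random projection matrices, and the use of the same input for queries, keys, and values (self-attention) are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """Self-attention with h heads.

    X: (T, d_model); W_q, W_k: (h, d_model, d_k); W_v: (h, d_model, d_v);
    W_o: (h * d_v, d_model).
    """
    heads = [attention(X @ wq, X @ wk, X @ wv)          # each head: (T, d_v)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o         # concatenate, then project

# Hypothetical sizes: 4 time steps, model width 8, 2 heads.
rng = np.random.default_rng(0)
T, d_model, n_heads, d_k, d_v = 4, 8, 2, 3, 5
X = rng.normal(size=(T, d_model))
W_q = rng.normal(size=(n_heads, d_model, d_k))
W_k = rng.normal(size=(n_heads, d_model, d_k))
W_v = rng.normal(size=(n_heads, d_model, d_v))
W_o = rng.normal(size=(n_heads * d_v, d_model))
Y = multi_head_attention(X, W_q, W_k, W_v, W_o)
```

All time steps are processed by matrix products rather than by a step-by-step recurrence, which is the source of the training speed advantage over RNN.
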

Fig. 20 Transformer: (a) network architecture, (b) multi-head attention, which consists of several attention layers running in parallel

4.2 Applications of transfer learning

4.2.1 Transfer learning

In the typical intelligent fault diagnosis model, it is generally assumed that the training dataset and the testing dataset follow the same distribution. Thus, these models directly use the pretrained fault diagnosis model on the training set to diagnose the testing set. However, in the real industry, due to the influence of various factors, there are inevitably some differences between the source domain and the target domain. The intelligent fault diagnosis based on transfer learning considers that there are differences between the target domain and the source domain. The transfer learning algorithm learns the difference between the source domain and the target domain on the basis of learning the source domain, so as to further improve the generalization ability of the model. The comparison between the intelligent algorithm based on transfer learning and the traditional algorithm is shown in Fig. 21.

Fig. 21 Fault diagnosis model based on traditional methods and transfer learning-based methods: (a) traditional methods, (b) transfer learning-based methods

Transfer learning addresses a limitation of traditional machine learning, which assumes identically distributed data and needs a large amount of labeled data. It handles differences in data distribution and outdated labeled data in practical work and maintains model accuracy in new tasks by making full use of existing labeled data (existing knowledge). Xu et al. [218] proposed an online fault diagnosis method based on deep transfer CNN. By transferring the shallow layers of a trained offline CNN to an online CNN, the real-time performance of process FDD can be significantly improved. Zhu et al. [207] developed a transfer learning framework to improve the effectiveness of process monitoring in similar batch scenarios and verified the feasibility and effectiveness of the framework. Wu et al. [208] proposed an FDD method based on transfer learning for multimode chemical processes. Chai et al. [209] proposed a multisource-refined transfer network (MRTN) to solve the problem of fault diagnosis when the source and target fault category sets are inconsistent, and the superiority of this method was demonstrated in a case study of an industrial multiphase flow process.

4.2.2 GAN

The idea of adversarial learning in GAN is inspired by the Nash equilibrium in game theory, and the structure of GAN is shown in Fig. 22. In the model, the two sides of the game are realized by a generator (G) and a discriminator (D). The generator G is mainly responsible for generating samples that follow the real data distribution from random latent variables z, and the discriminator D is responsible for identifying whether the input samples are fake or real. Through iterative adversarial learning between the generator and the discriminator, the underlying distribution of the real samples is learned and new samples are generated.
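
The two competing objectives can be illustrated with the standard GAN losses below. The discriminator outputs are hypothetical probabilities standing in for the outputs of a real network, and the non-saturating form of the generator loss is one common choice, not necessarily the one used in the works cited in this section.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Negative of E[log D(x)] + E[log(1 - D(G(z)))], which D tries to maximize."""
    return -(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

def generator_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss: -E[log D(G(z))]."""
    return -np.mean(np.log(d_fake + eps))

# Hypothetical discriminator outputs (probability that a sample is real).
d_real = np.array([0.9, 0.8])     # on real fault samples
d_fake = np.array([0.2, 0.1])     # on generated samples G(z)
loss_d = discriminator_loss(d_real, d_fake)
loss_g = generator_loss(d_fake)
```

In training, the two losses are minimized alternately. When the discriminator can no longer separate real from generated samples, its outputs approach 0.5 for both.
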

Fig. 22 Architecture of GAN

In real-world scenarios, it is usually very difficult to collect enough faulty training samples to generate a balanced training dataset. Thus, some data generation methods are proposed to identify those process faults with unbalanced data distribution in industrial process. Gao et al. [210] proposed a data augmentation method based on Wasserstein GAN with gradient penalty (WGAN-GP) to improve the fault diagnosis accuracies. Yang et al. [211] proposed a bidirectional GAN (BiGAN) and applied it to process fault detection. Inspired by the GAN, Huang et al. [212] proposed an end-to-end attentive DesmokeGAN, which realized the visual attention in the generation network and effectively applied to the smoke detection process. Yan [213] proposed a new GAN to generate high-quality synthetic fault samples for data-driven fault detection and diagnosis of chiller.

4.3 Other deep learning-based methods

4.3.1 CapsNet-based methods

Traditional DNNs, e.g., CNN, ignore the spatial relationships among features when classifying fault time–frequency graphs. To overcome this limitation, CapsNet takes into account the size and location of features in the image. The architecture of CapsNet is shown in Fig. 23. CapsNet can therefore capture the internal relationships among features that traditional DNNs cannot reflect. Although CapsNet has not yet been applied to industrial processes, there are some successful applications in machinery fault diagnosis. For example, Wang et al. [219] proposed a capsule network combined with the Xception module (XCN) to improve the classification accuracy of intelligent fault diagnosis. Chen et al. [220] proposed a novel method called deep CapsNet with stochastic delta rule (DCN-SDR) for rolling bearing fault diagnosis. Chen et al. [221] used CapsNet to recognize and classify seven working conditions of a high-speed train (HST) bogie. These applications demonstrate the advantages of CapsNet over traditional DNNs in machinery fault diagnosis. Extending CapsNet to industrial process FDD will be an interesting topic for future work.

Fig. 23 Architecture of CapsNet

4.3.2 GNN-based approaches

A graph is a data structure that models a group of objects (nodes) and their relationships (edges). GNN [222] is a deep learning method that operates on the graph domain. Although deep learning has achieved great success on Euclidean data, the data in many practical application scenarios are generated from non-Euclidean spaces and still need to be analyzed effectively. The complexity of graph data poses a great challenge to existing machine learning algorithms. Graph analysis can be used for node classification, link prediction, and clustering.

A core assumption of existing deep learning algorithms is that data samples are independent of each other. However, this is not the case for a graph. Each data sample (node) in a graph is connected by edges to other data samples (nodes), and these edges can be used to capture the interdependence between instances. GNN can model the influence relationships among the key components in process control problems and provides a new approach to process FDD from the perspective of node association.

To apply GNN to process control, it is necessary to convert one-dimensional process data into graph-structured data. There are two common methods of graph construction. One is unsupervised construction by measuring joint feature statistics. The other is supervised construction by using an initial network as a proxy for the estimation [223]. Feng et al. [214] proposed a multichannel diffusion graph convolutional network (MCDGCN) to solve the problem of composition prediction in a realistic converter steelmaking process. At present, there are still few studies on GNN-based process FDD, and applying GNN to process FDD is a promising direction for future work.
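The unsupervised route can be sketched as follows, assuming that the "joint feature statistic" is the Pearson correlation between process variables and that two variables are linked when its absolute value exceeds a threshold. The threshold of 0.8, the variable names, and the mean-aggregation step are illustrative assumptions. The propagation rule is a deliberately simplified stand-in for a graph convolution and is not the MCDGCN of Feng et al. [214].

```python
import math

def pearson(a, b):
    """Pearson correlation between two equally long measurement series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def build_adjacency(series, threshold=0.8):
    """Link two process variables (nodes) when |correlation| >= threshold."""
    n = len(series)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if abs(pearson(series[i], series[j])) >= threshold:
                adj[i][j] = adj[j][i] = 1.0
    return adj

def propagate(adj, features):
    """One mean-aggregation step over each node and its neighbours."""
    n = len(adj)
    out = []
    for i in range(n):
        neigh = [i] + [j for j in range(n) if adj[i][j] > 0]  # self-loop included
        dim = len(features[i])
        out.append([sum(features[j][d] for j in neigh) / len(neigh) for d in range(dim)])
    return out

# Hypothetical measurements of three process variables.
temperature = [1.0, 2.0, 3.0, 4.0, 5.0]
pressure = [2.0, 4.0, 6.0, 8.0, 10.5]  # strongly correlated with temperature
flow = [3.0, 1.0, 4.0, 1.0, 5.0]       # weakly correlated with both

adj = build_adjacency([temperature, pressure, flow])
smoothed = propagate(adj, [[1.0], [3.0], [10.0]])
```

In this toy example only temperature and pressure are linked, so their node features are averaged with each other while the isolated flow node keeps its own value.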

4.3.3 DGP-based approaches

DGP is a multilayer predictive model that is highly flexible and can accurately model uncertainty [224]. As depicted in Fig. 24, the input to the hidden layer is the input data \(x\), and the output of the hidden layer \({f}_{1}\) serves as the input to the output layer, which is itself formed by GPs. The output of each layer is probabilistic rather than exact, so uncertainty propagates through the network [225]. Santiago et al. [226] presented a hybrid deep learning-Gaussian process method for diabetic retinopathy diagnosis and uncertainty quantification. Tagade et al. [227] proposed a deep Gaussian process algorithm for lithium-ion battery health monitoring. DGP has not yet been applied to industrial processes.
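For the single-hidden-layer case, the model can be summarized as a composition of two GPs, under the common assumptions of zero-mean GP priors and additive Gaussian observation noise. The kernels \({k}_{1}\) and \({k}_{2}\), the output-layer function \({f}_{2}\), the output \(y\) and the noise variance \({\sigma }^{2}\) are notation introduced here for illustration:

$${f}_{1}\sim \mathcal{GP}\left(0,{k}_{1}\right),\quad {f}_{2}\sim \mathcal{GP}\left(0,{k}_{2}\right),\quad y={f}_{2}\left({f}_{1}\left(x\right)\right)+\varepsilon ,\quad \varepsilon \sim N\left(0,{\sigma }^{2}\right)$$

Because \({f}_{1}\left(x\right)\) is itself a random quantity, the uncertainty at the hidden layer is carried into the distribution of \(y\).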

Fig. 24 Deep Gaussian process with a single hidden layer. a Deep Gaussian process illustration. b Histograms of a random selection of inducing outputs. The best-fit Gaussian distribution is denoted with a dashed line. Some of them exhibit a clear multimodal behavior. c P-values for 100 randomly selected inducing outputs per dataset. The null hypotheses are that their distributions are Gaussian [225]

4.3.4 DGMM-based approaches

DGMM is a network of multiple layers of latent variables [228]. In each layer, the variables follow a mixture of Gaussian distributions. The deep mixture model consists of a set of nested mixtures of linear models, which together provide a nonlinear model that can describe the data in a very flexible way.

Fig. 25 depicts the structure of a DGMM with \(h=3\) layers, where the numbers of components in the layers are \({k}_{1}=3\), \({k}_{2}=3\) and \({k}_{3}=2\). In the first layer, the conditional distribution of the observed data given \({z}^{\left(1\right)}\) is a mixture of three components. If \(y={z}^{\left(0\right)}\) is regarded as the zeroth layer, all conditional distributions follow the first-order Markov property, that is, \(f({z}^{\left(l\right)}|{z}^{\left(l+1\right)},{z}^{\left(l+2\right)},\dots ,{z}^{\left(h\right)};\theta )=f({z}^{\left(l\right)}|{z}^{\left(l+1\right)};\theta )\) for \(l=0,\dots ,h-1\). At each layer:

$$f({z}^{\left(l\right)}|{z}^{\left(l+1\right)};\theta )=\sum_{i=1}^{{k}_{l+1}}{\pi }_{i}^{\left(l+1\right)}N({\varphi }_{i}^{\left(l+1\right)}+{\gamma }_{i}^{\left(l+1\right)}{z}^{\left(l+1\right)},{\rho }_{i}^{\left(l+1\right)})$$
(41)

where \({z}_{i}^{\left(h\right)}\sim N\left(0,{{\varvec{I}}}_{p}\right)\left(i=1,\dots ,n\right)\) and \({u}_{i}^{\left(1\right)},\dots ,{u}_{i}^{\left(h\right)}\) are specific random errors that follow a Gaussian distribution with zero expectation and covariance matrices \({\rho }_{s1}^{\left(1\right)},\dots ,{\rho }_{sh}^{\left(h\right)}\). \({\varphi }_{s1}^{\left(1\right)},\dots ,{\varphi }_{sh}^{\left(h\right)}\) are vectors of length \(p\), and \({\gamma }_{s1}^{\left(1\right)},\dots ,{\gamma }_{sh}^{\left(h\right)}\) are square matrices of dimension \(p\). Purohit et al. [229] suggested a deep autoencoding GMM with hyper-parameter optimization (DAGMM-HO) to solve the problem of unsupervised anomaly detection.
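The layer-wise conditional distribution in Eq. (41) implies a top-down sampling procedure: draw the top-layer latent variable from a standard Gaussian, then at each layer pick a component according to its mixing weight and apply that component's linear map plus Gaussian noise. The sketch below illustrates this for the scalar case \(p=1\). The function name `sample_dgmm` and all parameter values are hypothetical, chosen only to match the \(h=3\), \({k}_{1}=3\), \({k}_{2}=3\), \({k}_{3}=2\) example.

```python
import random

def sample_dgmm(layers, rng):
    """Draw one observation y = z^(0) from a scalar (p = 1) DGMM.

    layers[l] holds the components of layer l + 1 as (pi, phi, gamma, rho)
    tuples: mixing weight, intercept, loading and noise variance.
    Returns the sample and the component index chosen at each layer.
    """
    z = rng.gauss(0.0, 1.0)  # top-layer latent variable z^(h) ~ N(0, 1)
    path = []
    for components in reversed(layers):  # layers h, h - 1, ..., 1
        weights = [c[0] for c in components]
        idx = rng.choices(range(len(components)), weights=weights)[0]
        _, phi, gamma, rho = components[idx]
        # z^(l) | z^(l+1) ~ N(phi + gamma * z^(l+1), rho)
        z = phi + gamma * z + rng.gauss(0.0, rho ** 0.5)
        path.insert(0, idx)  # path[l] is the component used in layer l + 1
    return z, path

# Hypothetical parameters for h = 3 layers with k1 = 3, k2 = 3, k3 = 2.
layers = [
    [(0.5, -2.0, 1.0, 0.1), (0.3, 0.0, 0.5, 0.2), (0.2, 3.0, 1.5, 0.1)],
    [(0.4, -1.0, 0.8, 0.3), (0.4, 1.0, 1.2, 0.3), (0.2, 0.0, 1.0, 0.5)],
    [(0.6, -0.5, 1.0, 0.2), (0.4, 0.5, 0.7, 0.2)],
]

y, path = sample_dgmm(layers, random.Random(0))
```

Each distinct path through the layers corresponds to one Gaussian component of the marginal distribution of \(y\), so this example has at most 3 × 3 × 2 = 18 components even though only 8 sets of component parameters are specified.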

Fig. 25 Structure of a DGMM model with \(h=3\) and the number of layer components \({k}_{1}=3\), \({k}_{2}=3\) and \({k}_{3}=2\)

4.4 Large-scale neural network

DNNs are usually over-parameterized, which wastes a large amount of computation and storage. Because large-scale DNNs have a large number of parameters and rely on hardware platforms with large storage and strong computing performance, structure compression and acceleration of large-scale neural networks are necessary for process FDD. The goal is to shorten the training time of DNNs and broaden their application scope.

With the development of artificial intelligence technology, neural networks are becoming deeper and deeper, which brings the disadvantages of high storage requirements and high power consumption. This restricts the application of DNNs in resource-limited environments. A large amount of redundant information is stored in a DNN model with millions of parameters. Thus, the compression and acceleration of DNNs is a feasible and effective solution.

Process FDD in modern industry involves a large number of devices and a large amount of data. Small-scale DNNs cannot describe and model such process control systems well. As a result, large-scale neural networks will gradually replace small-scale neural networks in modern industrial process control. Although large-scale neural networks can effectively improve the accuracy of process FDD, "large-scale" also brings many problems, such as complex network structures, an explosion in the number of network parameters, and difficulty in running the model. There are five main methods to compress and accelerate large-scale neural networks: parameter pruning, parameter sharing, low-rank decomposition, designing compact convolutional filters, and knowledge distillation [230]. Parameter pruning removes redundant parameters according to a criterion that judges whether a parameter is important. Low-rank decomposition uses matrix or tensor decomposition to approximate the original convolution kernels in the deep model. Compact convolutional filters reduce the storage and computational complexity of the model through specially structured convolution kernels or compact convolution computing units. Knowledge distillation transfers the knowledge learned by a large-scale network to a compact distilled model.
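As a minimal illustration of parameter pruning, the sketch below uses weight magnitude as the importance criterion. The text above does not prescribe a criterion, so this choice is an assumption, although magnitude is one commonly used option. The function name `magnitude_prune` and the weight values are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from least to most important under the |w| criterion.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

# Hypothetical weights of one layer; half of them are removed.
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = magnitude_prune(weights, sparsity=0.5)
```

In practice pruning is usually followed by fine-tuning so that the remaining weights can compensate for the removed ones, and the zeroed weights only save storage or computation if the model is stored or executed in a sparse format.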

5 Discussion about future challenges and opportunities in process fault detection and diagnosis

With the continuous development of technology and production, the processes of modern industrial systems have become extremely complex, so effective guidance for future industrial process FDD is needed. Relevant technologies for process FDD are essential to improve the safety and efficiency of the production process. A development route for process FDD is shown in Fig. 26. Figure 26 presents three main stages of machine learning-based process monitoring, i.e., data acquisition and preprocessing; machine learning model training and validation; and results visualization and process recovery. The problems, characteristics, and corresponding methods of each stage are listed in the development route. Future challenges and research directions related to process monitoring are discussed in detail below.

Fig. 26 Development route of applications of deep learning to process FDD

5.1 Collection of high-quality data in complex industrial processes

In modern industrial production, process data are more numerous and complex than ever before. However, due to interference from the working environment or abnormalities in the data acquisition equipment, the collected data are often uncertain and incomplete. Deep learning generally requires a large amount of process data for training, and such inaccurate data have a large negative impact on process monitoring research. Continuous processes, batch processes, and multimode processes are three common types of industrial processes. Modern industrial processes consist of many parts, and each of these parts may have a significant impact on the process variables. As a result, industrial process data often exhibit the following characteristics: high dimensionality, non-Gaussian distribution, nonlinear relationships, autocorrelation, time-varying behavior, and multimode behavior. These complex characteristics also pose a great challenge to process monitoring. To ensure the quality of process data and to build models with the most appropriate data, the inspection and selection of data is a necessary step. Data preprocessing is also a critical step for data-based process monitoring. It transforms raw data into a more appropriate form that can be used effectively for system modeling. Finally, a complete database should be constructed to manage these data and improve data quality for future process monitoring based on deep learning.
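A minimal preprocessing sketch is shown below, assuming that missing measurements are marked as `None`, that they are imputed with the mean of the training data, and that each variable is then scaled to zero mean and unit variance. Mean imputation and z-score scaling are common choices rather than steps prescribed by this review, and the function names and data are hypothetical.

```python
import math

def fit_scaler(rows):
    """Per-variable mean and standard deviation, ignoring missing values (None)."""
    stats = []
    for col in zip(*rows):
        vals = [v for v in col if v is not None]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats.append((mean, math.sqrt(var) or 1.0))  # guard against zero variance
    return stats

def transform(rows, stats):
    """Impute missing values with the training mean, then z-score each variable."""
    return [
        [((mean if v is None else v) - mean) / std for v, (mean, std) in zip(row, stats)]
        for row in rows
    ]

# Hypothetical training data: two process variables, one missing measurement.
train = [[10.0, 1.0], [12.0, None], [14.0, 3.0]]
stats = fit_scaler(train)
scaled = transform(train, stats)
```

The statistics are estimated on the training data only and then reused for new samples, so that online data are scaled in the same way as the data the model was trained on.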

In many application scenarios, a reliable estimate of uncertainty matters more than a marginal gain in accuracy. DGP is a hierarchical generalization of the Gaussian process. It combines the uncertainty estimates of a multi-layer model and is highly scalable. It can model uncertainty and deal with missing and abnormal data. Besides, GAN can be used to address missing data and data imbalance in industrial processes.

5.2 Traditional methods-based process detection and diagnosis in complex environments

Over the past decades, data projection-based and machine learning-based methods have been widely used in process monitoring. (1) Data projection-based methods, e.g., PCA, PLS, ICA, FDA, solve the problem of process monitoring (especially fault detection) to some extent by dimension reduction or data projection. However, in complex industrial environments, e.g., with nonlinearity, autocorrelation, or multiple modes, the fault detection results of these methods are often unsatisfactory. A common remedy is to modify or extend these projection-based methods to adapt to different process environments. In addition, these methods can be combined with deep learning, taking advantage of its feature extraction capability, and applied to fault detection in complex environments. This strategy can also yield better fault detection results. (2) Traditional machine learning methods, e.g., kNN, ANN, SVM, adopt simpler model structures and are suitable for various classification tasks, and they have been widely used for process fault diagnosis. To further improve the performance of machine learning for process fault diagnosis in complex environments, these methods can be combined with MSPM methods. First, dimension reduction and data projection are used to reduce the complexity of the datasets. Second, machine learning techniques are used to build models and perform process FDD tasks.
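The projection-based detection idea can be illustrated with a small PCA example that monitors the squared prediction error (SPE), i.e., the part of a sample that the retained principal component cannot explain. The sketch makes several simplifying assumptions: only one principal component is kept, it is found by power iteration, and the control limit is simply the largest SPE seen on normal training data. The function names and the data are hypothetical.

```python
import math

def fit_pca_1d(rows, n_iter=100):
    """Mean and first principal direction of the data via power iteration."""
    n, m = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - mean[j] for j in range(m)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(m)]
           for a in range(m)]
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def spe(sample, mean, direction):
    """Squared prediction error: squared residual after projecting onto the PC."""
    c = [s - mu for s, mu in zip(sample, mean)]
    score = sum(ci * di for ci, di in zip(c, direction))
    return sum((ci - score * di) ** 2 for ci, di in zip(c, direction))

# Hypothetical normal operating data in which x2 is roughly 2 * x1.
normal = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.0], [4.0, 8.1], [5.0, 9.9]]
mean, pc = fit_pca_1d(normal)
limit = max(spe(r, mean, pc) for r in normal)  # crude empirical control limit

spe_ok = spe([3.5, 7.0], mean, pc)     # respects the x2 = 2 * x1 relationship
spe_fault = spe([3.5, 4.0], mean, pc)  # breaks the relationship
is_fault = spe_fault > limit
```

A sample that follows the correlation structure of the normal data has a small SPE, whereas a sample that breaks this structure has a large one. In practice the control limit is usually derived from a statistical distribution or a confidence level rather than from the training maximum.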

5.3 Deep learning-based process monitoring in complex environments

The traditional MSPM methods (e.g., PCA, PLS, LDA, ICA) directly use the raw data as model input. They have no effective way to deal with high-dimensional and noisy data, resulting in inefficient identification of process faults. In addition, it is difficult for traditional methods to model autocorrelated processes accurately, resulting in a large number of false alarms triggered by control charts. Due to their limited parameters and computational units, regular ANNs have limited representation learning ability. Thus, they are unable to extract effective features from process data.

Compared with MSPM methods, deep learning simplifies the preprocessing of raw data and feeds the raw data directly into the model. During pre-training, the model filters the input data through a deep network structure and multi-layer nonlinear transformations, reduces the dimension, and uses the extracted features to build regression or classification models. Compared with shallow models, it has better convergence and optimization mechanisms and has broad application prospects in the FDD of complex processes. For example, CNN uses convolution layers to extract features automatically and processes high-dimensional data effectively by sharing convolution kernels. DBN does not require much signal processing expertise or diagnosis experience and has relatively strong adaptability, versatility, and ability to deal with high-dimensional and nonlinear data. RNN is a neural network with recurrent hidden layers, which can effectively mine the sequential information in the data.

5.3.1 Different process control tasks

Owing to its powerful feature extraction ability, deep learning has been widely applied in process FDD. (1) The typical deep learning methods, e.g., AE, DBN, CNN, have achieved great success in process monitoring, as discussed in Sect. 3. A great advantage of deep learning is the flexibility of its networks: users can change the network structure according to their needs to adapt to different process environments. For example, combining CNN with RNN can effectively process autocorrelated data, and the attention mechanism can improve the adaptive learning ability of the network. (2) In recent years, transfer learning, GAN, and CapsNet have been applied to FDD. Transfer learning constrains the model parameters by minimizing a distance metric that measures the distribution discrepancy, which can correct serious cross-domain discrepancy. GAN uses an adversarial game between a generator and a discriminator to generate better samples. By considering the size and location of features in images, CapsNet can effectively locate and identify faults. These three deep learning models can be further developed for future research in process FDD.
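The text does not name a specific distance metric for the distribution discrepancy. One commonly used choice is the maximum mean discrepancy (MMD), which is assumed in the sketch below together with an RBF kernel. The function names, the kernel width `gamma`, and the sample values are hypothetical.

```python
import math

def rbf(x, y, gamma):
    """RBF kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(source, target, gamma=1.0):
    """Biased estimate of the squared MMD between two samples."""
    def mean_kernel(a, b):
        return sum(rbf(x, y, gamma) for x in a for y in b) / (len(a) * len(b))
    return (mean_kernel(source, source) + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))

# Hypothetical feature vectors from a source domain and two target domains.
source = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]
target_similar = [[0.1, 0.0], [-0.2, 0.1], [0.0, -0.1]]
target_shifted = [[2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]

d_similar = mmd2(source, target_similar)
d_shifted = mmd2(source, target_shifted)
```

The discrepancy is close to zero when the two domains have similar feature distributions and grows as the target domain shifts away from the source. In a transfer learning model, a term of this kind would typically be added to the training loss so that the learned features of the two domains are pulled together.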

5.3.2 Optimization of deep network structure

Fault diagnosis methods based on deep learning require a large amount of training data and computation, which is one of the bottlenecks for deep learning in online process FDD. At present, the main approaches to optimizing network models are as follows: (1) Hybrid models. Combining deep learning models with traditional shallow models to construct a hybrid model shows good performance in feature learning and classification [231, 232]. By combining methods so that the strengths of one offset the weaknesses of another, fault diagnosis and prediction of complex industrial systems can be realized with high efficiency and high accuracy; (2) Parameter adjustment. During and after training, different calculation strategies are adopted to adjust the network parameters [233,234,235]; (3) Network structure optimization. The computational complexity of various deep learning architectures is reduced by imposing structural constraints. While maintaining accuracy, the computing speed of the model is improved through parallel operation, GPU acceleration, data parallelism, and model parallelism.

5.3.3 Edge computing

Deep learning models have a multi-layer structure, and the number of parameters grows rapidly as the number of network layers increases. This is most evident in the convolutional layers and fully connected layers of a DNN. To improve the training speed of DNNs and realize online process monitoring, edge computing has emerged. Edge computing is an open platform that integrates network, computing, storage, and other core capabilities on the side close to the object or data source. IoT devices alone are not well suited to running deep learning models, but edge computing-based DNN models can effectively solve this problem. In terms of timeliness, edge computing is faster than the traditional centralized cloud computing approach, and it can also reduce the burden on the cloud. Deep learning for edge computing can automatically extract high-level features of data and can also mine accurate information from sensor data. To sum up, edge computing and deep learning provide solutions for mining the big data of IoT devices and can effectively improve process FDD.
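To make the parameter counts of convolutional and fully connected layers concrete, the sketch below applies the standard counting formulas to a small network. The layer sizes are hypothetical and are not taken from any model discussed in this review.

```python
def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    """Parameters of a 2-D convolution: one k x k kernel per channel pair."""
    return out_channels * (in_channels * kernel_size * kernel_size + (1 if bias else 0))

def dense_params(in_features, out_features, bias=True):
    """Parameters of a fully connected layer."""
    return out_features * (in_features + (1 if bias else 0))

# Hypothetical small network; layer sizes are illustrative only.
layers = [
    ("conv1", conv2d_params(1, 16, 3)),
    ("conv2", conv2d_params(16, 32, 3)),
    ("fc1", dense_params(32 * 7 * 7, 128)),
    ("fc2", dense_params(128, 10)),
]
total = sum(n for _, n in layers)
largest = max(layers, key=lambda item: item[1])[0]
```

Even in this small example the first fully connected layer holds most of the parameters, which is why memory-constrained devices benefit from offloading or compressing such layers.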

5.3.4 The prospect of transfer learning and GAN in FDD

Transfer learning uses the large amount of existing data to address the problem of insufficient data in industrial process monitoring. Because a transferred model does not need to be trained from scratch, transfer learning can effectively reduce the training time. In addition, transfer learning needs only a small amount of target domain data, which can effectively alleviate the shortage of data in the process industry.

In process monitoring model training, it is usually assumed that the data are labeled and balanced. However, data are often unlabeled and unbalanced in the real process industry. To solve these problems, GAN can generate fault data according to a specific probability distribution, which balances the process dataset. Moreover, a GAN trained in an unsupervised manner can learn from unlabeled data and improve the accuracy of the FDD model.

5.3.5 Application prospect of GNN in FDD

Faced with massive monitoring data and large distributed systems, maintenance personnel need to locate faults and make rapid and accurate maintenance decisions, which requires intelligent FDD technology to improve the availability of the system. GNN can effectively describe and store many kinds of information about the process control system, including the attributes of each sensor node, the connections between different fault nodes, and the importance of each node. An intelligent process FDD system based on GNN can realize fault location and fault correlation analysis and improve the reliability and availability of the process control system. At present, GNN has been widely used in community networks, social networks, recommendation systems, and so on by virtue of its powerful information description and relationship expression. GNN is also suitable for describing the interaction between faults in process control systems, which will be a valuable direction to explore in the future.

5.4 Interpretability and visualization of deep learning models

In general, DNN models are considered black boxes, and researchers cannot understand their internal operations. Although deep learning has achieved tremendous success in process FDD, it is not known how it learns about FDD from process data. (1) Statistical learning theory can better explain the operation mechanism of a network through rigorous theory. For example, the composition of the network, the related parameters, and the monitoring results can be described in a statistical way. In addition, adaptive filter theory is useful for analyzing the physical significance of deep learning-based models. (2) At present, the most commonly used methods for interpreting deep learning models are visualization techniques. For the features extracted from the network, t-SNE and other techniques can be used to project the features onto a two-dimensional or three-dimensional space, which helps readers understand the distribution of the features. In addition, the deep network model itself can be visualized to understand its internal structure and operation mechanism. (3) Knowledge extraction from neural networks. By combining symbolic rules with deep neuron models, knowledge is extracted from the deep network, and the dynamic operation mechanism of the deep network is described [236, 237]. Yu et al. [236] proposed the knowledge-based artificial neural network (KBANN) model and a genetic algorithm-based rule extraction approach (GARule) to discover the causal relationship between manufacturing parameters and product quality. These rules are applied to the diagnosis of the manufacturing process, to provide guidance for improving product quality, and to construct knowledge-based neural networks. Based on this study, Liu et al. [238, 239] explained the principle of neural network fault diagnosis in symbolic language, realized accurate prediction, and effectively extracted the key knowledge of the manufacturing process.

6 Conclusions

In this study, the technology of process FDD in recent decades is reviewed. The development of process FDD is divided into three stages. In the past, the prevalent FDD techniques were data projection-based methods and machine learning methods. Although these methods have achieved some success in process FDD over the past few decades, they struggle to extract effective features from complex processes. In recent years, deep learning has attracted extensive attention and brought many constructive ideas and methods to process FDD. Deep learning methods learn and extract deep features of the process signals through multi-layer networks, which helps to detect and diagnose faults in the process. The sample imbalance of training datasets in industrial processes poses a challenge to the development of deep learning in process FDD. At present, transfer learning and GAN help to overcome these problems and have gradually been applied to process FDD. Transfer learning and GAN will attract extensive attention and be widely used in process FDD in the near future. In addition, CapsNet, GNN, DGP, and DGMM also provide new ideas for solving process control problems. Finally, a development route is provided and the future development and research directions of process FDD are discussed, which offers some valuable ideas for future research in process FDD.