Introduction

Chemical mechanical planarization (CMP) is a critical process widely adopted in the semiconductor industry, since surface flatness largely determines manufacturing quality. The CMP process can be used to planarize numerous materials, such as dielectrics, semiconductors, metals, and composites. The contact area and pressure of the wafer play an essential role in the polishing speed. Meanwhile, the synergistic mechanism between chemical reaction and mechanical abrasion has an extensive effect on the contact area, which in turn affects the wafer surface removal rate (Ludwig & Kuna, 2012). An excessive material removal rate (MRR) leads to defects and depressions in the wafer material, which increases the fault rate of CMP (Hong et al., 2020). Conversely, a low MRR indicates that the wafer is not polished sufficiently, which affects its final quality. Therefore, the MRR serves as one of the important indicators of the final quality of the polished surface.

Despite its significance, the wafer is normally enclosed in the CMP tool between the pad and the wafer carrier, making it difficult to estimate the MRR until the whole process is finished. Therefore, it is necessary to predict the MRR during the CMP process for prognostics and health management. Conventionally, research studies focus on investigating the components (Evans et al., 2003) and manufacturing environment (Xu et al., 2020) of CMP that affect the MRR. Meanwhile, various physics-based mathematical models have been established to fit a curve that predicts the MRR (Lee et al., 2013) or to simulate the manufacturing process (Lee & Jeong, 2011). Furthermore, empowered by the capability to collect multimodal CMP data and by high computational power, machine learning and deep learning approaches have been increasingly implemented to predict the MRR.

Most CMP equipment has a pre-defined and clear operating mechanism that indicates the connections among its inner components and parts (Jia et al., 2018). Nevertheless, the structural knowledge contained in the equipment, which can play a significant role, is often neglected in existing MRR prediction models. On one hand, it can reflect the dependencies between various components/parts, which serve as the fundamental basis for determining the sources of data to be considered. On the other hand, although recent work has started to establish knowledge graph-based models, it only considers interrelationships such as 'is part of', 'leads to', and 'has a function' (Yan et al., 2020), while ignoring the impact propagation among components/parts.

To address this issue, a proper industrial graph representing the structural knowledge of CMP equipment and its interrelationship mechanisms should first be established. Meanwhile, advanced graph convolutional network (GCN) approaches (Wu et al., 2021), as a potential solution for recommendation and prediction, can be further leveraged and enhanced to support the MRR prediction process. Motivated by this, this paper proposes a novel temporal hypergraph convolutional network-based approach for MRR prediction in CMP. The rest of this paper is organized as follows. The "Related work" section reviews the related work on MRR prediction, industrial graph applications, and state-of-the-art methods of graph-based reasoning. The "CMP hypergraph construction" section introduces the proposed methodology for constructing an equipment hypergraph model of CMP. The "HGCN-based model" section presents the proposed combined HGCN and GRU model for MRR prediction. To validate its effectiveness, the "Case study" section undertakes a comparative study based on an open-source MRR dataset, and the experimental results are further discussed in the "Discussion" section. Finally, the "Conclusion" section outlines the contributions of this work and highlights future directions.

Related work

This section summarizes the related work on MRR prediction and provides a comprehensive review of the development and categories of industrial graphs and graph-based reasoning approaches.

MRR prediction

The existing MRR prediction approaches can be divided into physics-based and data-driven ones. One of the most popular physics-based approaches is the Preston equation (Evans et al., 2003), \(MRR = K_{p} P^{\alpha } V^{\beta }\), where P represents the downward pressure applied to the wafer, V represents the relative rotating speed, and \(K_{p}\) is the Preston coefficient. Following this model, many efforts have been made to incorporate contact stress, relative velocity, and chemical reaction rate into the Preston coefficient (Lee & Jeong, 2011). Other research also takes the size, concentration, and distribution of particles, the slurry flow rate, and the polishing pad surface topography into consideration (Lee et al., 2013). However, the major limitation of physics-based approaches lies in the prior assumptions of the model, which often may not hold in practice.
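The Preston model can be evaluated directly once its parameters are known. The sketch below merely illustrates the formula; the coefficient and exponents are placeholder values, not fitted constants from any reference.

```python
# Illustrative sketch of the Preston equation: MRR = K_p * P^alpha * V^beta.
# K_p, alpha, and beta below are placeholder values, not fitted constants.
def preston_mrr(pressure, velocity, k_p=1.0e-13, alpha=1.0, beta=1.0):
    """Estimate the material removal rate from downforce pressure and relative velocity."""
    return k_p * pressure ** alpha * velocity ** beta

# Higher downforce or faster relative motion both raise the predicted MRR.
print(preston_mrr(pressure=2.0e4, velocity=1.2))
```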

For data-driven approaches, machine learning and statistical methods have been widely adopted. For instance, a nonlinear Bayesian model (Kong et al., 2010) and a decision tree-based model (Li et al., 2019) were introduced for MRR prediction. Recently, with the rapid development of deep learning, a deep belief network was proposed (Jia et al., 2018; Wang et al., 2017). Furthermore, derived deep learning approaches have been adopted for MRR prediction, such as a feature-incorporated approach combining a recurrent neural network with a convolutional neural network (Lee & Kim, 2020) and a least squares generative adversarial network (Kim et al., 2020). For similar industrial prediction problems (e.g., RUL estimation), mature solutions based on deep learning and machine learning already exist (Ushakov & Zhang, 2019). Nevertheless, they often neglect the structural knowledge and underlying interaction mechanisms of the equipment itself.

Industrial graph

Recent work on industrial graphs can be mainly categorized into two types: knowledge management and operation simulation.

The objective of the former is to organize data and knowledge from various resources systematically in graph form. It normally includes four steps: (1) schema design, (2) knowledge extraction, (3) knowledge fusion, and (4) reasoning. Firstly, a schema design is performed to define the nodes and edges in the domain-specific knowledge graph, since the node/edge types vary considerably (Wang et al., 2019). Next, knowledge extraction aims to collect triples (i.e., head entity, edge, and tail entity) from semi-structured and unstructured data by leveraging natural language processing (Yan et al., 2020) and disassembly analysis (Weise et al., 2019). Then, it is essential to fuse similar entities of the extracted knowledge by creating ontology links and building a concept graph (Li et al., 2020). Finally, after constructing the industrial knowledge graph, querying can be conducted by navigating potential key entities for making intelligent decisions (Wang et al., 2019) and recommendations (Li et al., 2021).

The latter aims to digitize parts of the equipment information, the system working process, or even the entire production process, and to connect the data from different vertical fields to construct a corresponding industrial graph. The most straightforward manner is to transform the working process into a graph (Alsafi & Vyatkin, 2010) or to decompose the components as nodes in the graph (Hedberg et al., 2020). Furthermore, an event graph can be generated to simulate and understand the manufacturing process and to represent the event logic by setting events as entities in graph form (Tiacci, 2020).

However, both categories of methods fail to represent the synergistic impact relationships among the components and parts of the equipment.

Graph-based reasoning

The graph neural network (GNN) is a prevailing methodology utilized to reflect the impact of interactions in graph-structured data (Wu et al., 2021). A GNN propagates the node attributes until convergence and generates embedding vectors for each node. Encouraged by the success of CNNs in computer vision, the graph convolutional network (GCN) was proposed, which applies convolution in the spectral domain of the graph. Since then, numerous researchers have developed improved and extended versions of GCN by re-defining the convolution on the graph, for example by lightening (He et al., 2020) or localizing (Wang et al., 2018) it. Besides, some approaches are based on spatial methods that perform convolution on the graph directly (Yan et al., 2018). Among the spatial methods, GraphSAGE (Hamilton et al., 2017) has achieved impressive performance by inductively generating node embedding vectors. Furthermore, the attention mechanism has been used to adjust the weight of each node based on its neighboring nodes (Velicković et al., 2017). In industrial applications, GCN has been utilized in manufacturing optimization (Hu et al., 2020), and in modeling the equipment structure by determining the dependencies of sensed data (Narwariya et al., 2018) or based on the Pearson correlation coefficient among their features (Zhang et al., 2020).

Some previous efforts attempt to establish connections between pairwise sensed data to form a graph. However, one interaction or synergistic mechanism in complex equipment may involve more than two components and parts in a 'one-to-many' or 'many-to-many' relationship, which is beyond the expressive capability of a conventional graph.

To address the abovementioned research gaps, this paper proposes a novel hypergraph convolutional network (HGCN) based approach for MRR prediction in CMP, considering both the impact relationships among inherent components/parts and the temporal features of the collected data.

CMP hypergraph construction

To represent the complex impact relationships among multiple nodes in the CMP tool, this paper adopts the concept of a hypergraph (Feng et al., 2019), in which an edge (hyperedge) can join any number of nodes. This paper further introduces a CMP hypergraph model built in three steps: (1) CMP graph data modelling; (2) hypergraph construction; and (3) heterogeneous data correlation by the proposed HGCN-based model.

CMP graph data model

Different from existing industrial graphs, the CMP graph data model aims to reflect the impacts among various components or parts, and to manage and represent the impact relationships and store their features in graph form.

In the initial stage, it is essential to determine the components or parts involved as nodes in the graph, based on the physical structure and the operating mechanism. However, they are normally organized in a hierarchical structural relationship. Hence, it is necessary to classify the hierarchical affiliation of all nodes of the CMP graph data model into the following three levels, as shown in Fig. 1: (1) product-level node, the top-level node in the hierarchical structural relationship, representing the product itself; (2) part-level node, denoting an individual product module in the second layer; and (3) component-level node, referring to the ones decomposed from the product modules in the third to nth layers, of which the nodes in the nth layer contain their corresponding features.

Fig. 1
figure 1

CMP schematic diagram and corresponding hierarchical structure

In the CMP hierarchical structure (see Fig. 1), the top-level node is a product-level node representing the CMP tool entity. Meanwhile, the CMP equipment modules are regarded as part-level nodes (i.e., the wafer, slurry, wafer carrier, pad, and dresser), and the components of each module (e.g., the conduit of the slurry) are depicted at the component levels, of which the nodes in the nth layer are linked to the features.

Apart from the hierarchical structure, impact relationships also exist horizontally among nodes of the same layer based on the equipment mechanism, such as the downward force of the wafer carrier on the wafer. To better describe and summarize these impacts, edges between nodes can be utilized to represent their relationships in graph-based form. Due to the complex relationships arising from physical or chemical reactions between nodes, the types of edges should also be categorized as: (1) undirected edge, representing two nodes that have a hidden or fuzzy interaction; (2) directed edge, denoting that one node has a certain effect/action on the other, but not the other way around; and (3) bi-directed edge, referring to nodes that have a certain effect/action on each other. Based on the hierarchical and horizontal structure, the CMP graph data model can be established, as shown in Fig. 2.

Fig. 2
figure 2

CMP graph data modelling

According to the mechanism of CMP (Evans et al., 2003), for the part-level nodes (hollow circles in Fig. 2), a downward physical force is, for instance, applied to the wafer carrier to push the wafer toward the pad; therefore, a directed edge connects the wafer carrier node to the wafer node. Meanwhile, the wafer material is passivated and etched by the slurry chemicals, which means the slurry node has an impact on the wafer node; this chemical interaction leads to an undirected edge connecting the slurry node and the wafer node. Moreover, a downward force is applied to the wafer against the pad, and therefore a directed edge connects the wafer node to the pad node. Furthermore, the dresser is used to roughen the pad surface while the pad has no reverse effect/action on it, leading to a directed edge from the dresser to the pad.

For the component-level nodes (filled circles in Fig. 2), first, within the dresser node, the arm is used to fix the position of the head, so an undirected edge connects the arm and the head. Besides, within the pad node, the pad cooling device and the pad heating device conduct heat to the platen, so there are two directed edges from the pad cooling device and the pad heating device to the platen, respectively. Meanwhile, within the wafer carrier node, as shown in Fig. 3, the backing film lies at the bottom, and due to the downward physical force on the wafer carrier, directed edges connect the rest of the component-level nodes to the backing film. Moreover, the retaining ring and the gimbal point lean on the carrier housing without applying force, so they are linked by undirected edges. Furthermore, the nodes in the last component layer are connected with their corresponding data features (dashed lines in Fig. 2).

Fig. 3
figure 3

Assembly of the wafer carrier
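To make the graph data model above concrete, the following minimal sketch encodes the part-level nodes and their typed edges as plain Python data. The node and edge choices follow the textual description of Fig. 2; the exact model used in the paper may differ.

```python
# Minimal sketch of the part-level CMP graph data model described above.
# Node names and edge types follow the textual description of Fig. 2.
part_nodes = ["wafer_carrier", "wafer", "slurry", "pad", "dresser"]

# Each edge is (source, target, type): "directed" edges encode a one-way
# effect/action, "undirected" edges encode a hidden or fuzzy interaction.
edges = [
    ("wafer_carrier", "wafer", "directed"),   # downward force pushes the wafer
    ("slurry", "wafer", "undirected"),        # chemical passivation/etching
    ("wafer", "pad", "directed"),             # wafer pressed against the pad
    ("dresser", "pad", "directed"),           # dresser roughens the pad surface
]
```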

Hypergraph construction

The CMP graph data model has clarified the relationships among nodes of different levels, while it is still difficult to determine the exact mathematical expression or weight of each edge due to the limited data and prior knowledge available.

To fill this gap, this paper proposes a hypergraph to represent these complex relationships in the CMP equipment. The main characteristic of the hypergraph is the use of hyperedges that connect multiple nodes to indicate the impact interaction among the connected nodes. There are three types of hyperedges, summarized in Table 1.

Table 1 Different types of hyperedges and their vectors

After constructing the CMP graph data model in Fig. 2, it is necessary to consider which edges can be merged into a hyperedge based on the operating mechanism. For the part-level nodes, firstly, the wafer node is influenced by both the wafer carrier node and the slurry node: the downward force applied to the wafer changes its contact area in the chemical reaction with the slurry, while simultaneously the wafer material removed by the chemical reaction with the slurry also influences the effect of the original downward pressure on the wafer node. Therefore, it is difficult to distinguish the respective influences of the slurry and the wafer carrier on the wafer node, and these two edges are merged into a hyperedge representing the associated impact relationship. Secondly, the wafer and the dresser are set up on the pad vertically, and both apply a downward force pushing on the pad indirectly. Since both the wafer node and the dresser node act on the same pad node, it is difficult to separate their effects. Accordingly, a hyperedge connecting the wafer node and the dresser node to the pad node is used to represent this associated impact relationship. After analyzing the relationships between the different part-level nodes, a hypergraph is generated in which each part-level node contains one or more associated component-level nodes, as shown in Fig. 4.

Fig. 4
figure 4

CMP hypergraph

Furthermore, the hypergraph construction for the component-level nodes follows the same analysis logic. In the pad module, heat is conducted from the pad cooling device and the pad heating device to the platen; because the heat conduction is discrete and hard to calculate separately, a directed hyperedge connects the pad cooling and pad heating nodes to the platen. Meanwhile, in the wafer carrier node, the retaining ring, the carrier housing, and the gimbal point rest on the backing film with a downward physical force, hence a hyperedge connects these three component-level nodes to the backing film node. Additionally, the retaining ring and the gimbal point are placed next to the carrier housing horizontally, so there are undirected edges connecting the retaining ring and the gimbal point to the carrier housing, respectively. After this analysis of the CMP mechanism, the resulting hypergraph is shown in Fig. 4, which contains directed and undirected hyperedges, and its corresponding hypergraph matrix can be constructed according to Table 1.
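As an illustration, the part-level hypergraph can be encoded as an incidence matrix. The sketch below assumes a simple binary encoding (1 if a node belongs to a hyperedge); the directed/undirected vector forms of Table 1 are not reproduced here, so this is only an approximation of the matrix actually used.

```python
import numpy as np

# Part-level CMP hypergraph incidence matrix sketch (5 nodes x 2 hyperedges).
# A binary encoding is assumed (1 = node belongs to the hyperedge); the
# directed-hyperedge vector form from Table 1 is not reproduced here.
nodes = ["wafer_carrier", "wafer", "slurry", "pad", "dresser"]
#              e1: carrier & slurry -> wafer | e2: wafer & dresser -> pad
A = np.array([
    [1, 0],   # wafer_carrier
    [1, 1],   # wafer
    [1, 0],   # slurry
    [0, 1],   # pad
    [0, 1],   # dresser
])
print(A.shape)  # (5, 2), matching the A used later in Eq. (3)
```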

HGCN-based model

This paper introduces the HGCN-based model to predict the wafer removal rate in CMP. The input data of the proposed model consist of samples across different time dimensions, and each feature in a sample belongs to a corresponding part-level or component-level node. The schematic diagram of the HGCN-based model is shown in Fig. 5 and the main notations are listed in Table 2. This paper focuses on modeling the interrelationships among the part-level nodes; the component-level nodes follow the same modeling process.

Fig. 5
figure 5

The schematic diagram of the HGCN-based model

Table 2 The main notations and definitions in this paper

Embedding layer

Different part-level nodes contain different numbers of features, which are uneven and difficult for the subsequent modules to use. Therefore, this paper introduces embedding layers to transform vectors of different dimensions into the same fixed dimension (128). The embedding equation is as follows:

$$ z_{j} = z_{j}^{^{\prime}} w_{z} + b_{z} , $$
(1)

where \(z_{j}^{^{\prime}}\) denotes the part-level node with its original features, \(z_{j}\) denotes the embedding vector of the part-level node, \(w_{z} \in {\mathbb{R}}^{od \times ed}\) denotes the embedding matrix, od is the original dimension, and ed is the embedding dimension. For instance, if a part-level node has 3 original features, then for each timestamp t its vector is \(z_{j,t}^{^{\prime}} \in {\mathbb{R}}^{1 \times 3}\). After the embedding layer, \(z_{j,t}^{^{\prime}}\) is transformed into \(z_{j,t} \in {\mathbb{R}}^{1 \times 128}\), which provides a larger representation space.
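A minimal sketch of such per-node embedding layers is shown below, assuming PyTorch. The per-node feature counts are illustrative placeholders, not the actual split of the dataset features.

```python
import torch
import torch.nn as nn

# Sketch of the per-node embedding layers of Eq. (1), assuming PyTorch.
# The feature counts per part-level node are illustrative placeholders.
feature_dims = {"wafer_carrier": 4, "wafer": 3, "slurry": 2, "pad": 3, "dresser": 2}
embed_dim = 128

# One linear map per node type: z_j = z_j' w_z + b_z
embedding_layers = nn.ModuleDict(
    {name: nn.Linear(in_dim, embed_dim) for name, in_dim in feature_dims.items()}
)

raw = {name: torch.randn(1, d) for name, d in feature_dims.items()}   # one timestamp
z = {name: embedding_layers[name](x) for name, x in raw.items()}      # each is 1 x 128
```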

Piecewise aggregate approximation

The number of timestamps differs across wafer samples. Therefore, it is necessary to reduce them to the same length for efficient training. This paper introduces piecewise aggregate approximation (PAA) to convert the different wafer samples to the same length, where the target length is set as the minimal time length among all wafer samples. The PAA can be written as:

$$ x_{i}^{^{\prime}} = \frac{n}{m}\mathop \sum \limits_{{j = \frac{m}{n}\left( {i - 1} \right) + 1}}^{{\frac{m}{n}i}} z_{j} , $$
(2)

where \(z_{1} , \ldots ,z_{m}\) denote a wafer sample with m timestamps, and n is the minimal time length. \(x_{1}^{^{\prime}} , \ldots ,x_{n}^{^{\prime}}\) are the resulting n 128-dimensional vectors, where each \(x_{t}^{^{\prime}} \in {\mathbb{R}}^{1 \times 128}\) represents the embedding vector of a specific node at timestamp t. Because the CMP tool has five part-level nodes, they are concatenated into \(x_{t} \in {\mathbb{R}}^{5 \times 128}\) to represent the equipment structure. Hence, \( x_{1} , \ldots ,x_{n}\) denote the embedding representations of the five part-level nodes over n timestamps.
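The following sketch illustrates PAA in NumPy; the sample length and target length are example values (the minimal length of 199 comes from the dataset described later).

```python
import numpy as np

# Sketch of piecewise aggregate approximation (PAA), Eq. (2): a sequence of
# m timestamps is averaged down to n segments of (roughly) equal length.
def paa(sequence: np.ndarray, n: int) -> np.ndarray:
    """sequence: (m, d) array of embedding vectors; returns an (n, d) array."""
    # np.array_split handles the case where m is not evenly divisible by n.
    segments = np.array_split(sequence, n, axis=0)
    return np.stack([seg.mean(axis=0) for seg in segments])

sample = np.random.randn(937, 128)   # one node's embeddings over 937 timestamps (example)
reduced = paa(sample, n=199)         # compressed to the minimal length among samples
print(reduced.shape)                 # (199, 128)
```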

Hypergraph convolution network

The hypergraph convolutional network (HGCN) (Feng et al., 2019) is introduced to learn the data correlations and output refined embedding vectors of the same dimensions. Based on spectral convolution on the hypergraph (applying the graph Fourier transform and its inverse), the HGCN layer can be iterated as:

$$ x_{t}^{l} = \sigma \left( {D_{v}^{{ - \frac{1}{2}}} AWD_{e}^{ - 1} A^{T} D_{v }^{{ - \frac{1}{2}}} x_{t}^{l - 1} \Theta^{l - 1} } \right), $$
(3)

where \(x_{t}^{0} = x_{t}\), \(\sigma\) denotes the sigmoid function, W denotes the trainable diagonal matrix of hyperedge weights, \(D_{v}\) and \(D_{e}\) denote the diagonal matrices of node degrees and hyperedge degrees, respectively, A denotes the hypergraph incidence matrix obtained from Table 1 and the CMP hypergraph (Fig. 4), and \(\Theta \in {\mathbb{R}}^{{C_{1} \times C_{2} }}\) denotes the convolution filter that transforms back to the spatial domain, where \(C_{1}\) and \(C_{2}\) are the feature dimensions before and after convolution. This hypergraph iteration equation follows the core idea of graph convolutional networks. As shown in Fig. 6, the HGCN achieves a node-edge-node transformation so that it can extract high-order features based on the hypergraph structure. Initially, multiplying the node embeddings \(x_{t}^{l-1}\) by \(A^{T}\) transforms the node-level embedding vectors into hyperedge embedding vectors, i.e., gathering information into the hyperedges. Subsequently, multiplying by A generates the refined node embedding vectors, which aggregates the related hyperedge embedding vectors back to each node (the lower part of Fig. 6). Therefore, by utilizing this node-hyperedge-node mechanism, the HGCN can extract high-order features efficiently.

Fig. 6
figure 6

The illustration of HGCN

For the hypergraph of part-level nodes in Fig. 4, there are two hyperedges and five part-level nodes, hence \({\text{A}} \in {\mathbb{R}}^{5 \times 2}\). The \(x_{t}^{l} \in {\mathbb{R}}^{5 \times 128}\) in Eq. (3) represents one timestamp of the whole temporal sequence, and its dimensions remain the same through the HGCN layer.
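A minimal NumPy sketch of one such hypergraph convolution step is given below, assuming an unweighted hypergraph (W = identity) and a random placeholder filter Θ; the binary incidence matrix mirrors the part-level hypergraph sketched earlier.

```python
import numpy as np

# Minimal sketch of one hypergraph convolution step, Eq. (3), assuming an
# unweighted hypergraph (W = identity) and a random placeholder filter Theta.
def hgcn_layer(X, A, Theta):
    """X: (n_nodes, C1) node features; A: (n_nodes, n_edges) incidence matrix."""
    W = np.eye(A.shape[1])                               # hyperedge weights (identity)
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))  # node degrees, D_v^{-1/2}
    De_inv = np.diag(1.0 / A.sum(axis=0))                # hyperedge degrees, D_e^{-1}
    H = Dv_inv_sqrt @ A @ W @ De_inv @ A.T @ Dv_inv_sqrt @ X @ Theta
    return 1.0 / (1.0 + np.exp(-H))                      # sigmoid activation

X = np.random.randn(5, 128)                                    # five part-level nodes
A = np.array([[1, 0], [1, 1], [1, 0], [0, 1], [0, 1]], float)  # incidence matrix (cf. Fig. 8)
out = hgcn_layer(X, A, Theta=np.random.randn(128, 128))
print(out.shape)                                               # (5, 128)
```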

Gated recurrent unit

After applying the HGCN at each timestamp, sequence data \(x_{1}^{l} , \ldots ,x_{n}^{l}\) are generated. A gated recurrent unit (GRU) model is then established on this sequence data to obtain the prediction. The main idea of the GRU is its gating mechanism (i.e., an update gate and a reset gate). The GRU is formulated as follows:

$$ z_{t} = \sigma \left( {W_{z} x_{t}^{l} + U_{z} h_{t - 1} } \right), $$
(4)
$$ r_{t} = \sigma \left( {W_{r} x_{t}^{l} + U_{r} h_{t - 1} } \right), $$
(5)
$$ \widehat{{h_{t} }} = \tanh \left( {W_{h} x_{t}^{l} + U_{h} \left( {r_{t} \odot h_{t - 1} } \right) + b_{h} } \right), $$
(6)
$$ h_{t} = \left( {1 - z_{t} } \right) \odot h_{t - 1} + z_{t} \odot \widehat{{h_{t} }}, $$
(7)

where \(h_{t} { } \) is the output vector, and \(W_{z}\), \({ }W_{r}\), \({ }W_{h}\), \(U_{z}\), \(U_{r}\), \(U_{h}\), \(b_{h}\) are trainable parameters. The hidden dimension is set equal to the input dimension, hence \(h_{t} \in {\mathbb{R}}^{5 \times 128}\).
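A minimal sketch of this temporal module is shown below, assuming PyTorch and a single GRU whose weights are shared across the five node-embedding sequences; the hidden size equals the input size so that the 5 x 128 shape is preserved.

```python
import torch
import torch.nn as nn

# Sketch of the temporal module (Eqs. (4)-(7)), assuming PyTorch and a GRU
# shared across the five part-level node sequences. Hidden size = input size,
# so the final hidden state keeps the 5 x 128 shape.
gru = nn.GRU(input_size=128, hidden_size=128, batch_first=True)

x_seq = torch.randn(5, 199, 128)   # 5 part-level nodes x n timestamps x 128 features
out, h_n = gru(x_seq)              # out: (5, 199, 128); h_n: (1, 5, 128)
h_t = h_n.squeeze(0)               # final per-node hidden states, 5 x 128
```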

Hypergraph attention mechanism

The GRU module's output \(h_{t}\) is the vertical concatenation of the nodes' embedding vectors; let \(h_{tk} \in {\mathbb{R}}^{1 \times 128}\) denote the embedding of the kth node in the graph. \(h_{tk}\) can be further refined by applying a graph attention mechanism. This hypergraph attention mechanism considers the first-order neighbors to calculate the attention coefficient \(a_{ij}\), where nodes are treated as neighbors if they are connected by a hyperedge in the hypergraph. The updated \(h_{ti}^{^{\prime}} \in {\mathbb{R}}^{1 \times 128}\) is computed by:

$$ h_{ti}^{^{\prime}} = \sigma \left( {\mathop \sum \limits_{{j \in N_{i} }} a_{ij} W_{a} h_{tj} } \right), $$
(8)

where \(W_{a}\) is the trainable weight matrix, and \(a_{ij}\) is the impact factor, which can be calculated as follows:

$$ a_{ij} = \frac{{\exp \left( {LeakyReLU\left( {\lambda^{T} \left[ {W_{a} h_{ti} ||W_{a} h_{tj} } \right]} \right)} \right)}}{{\mathop \sum \nolimits_{{k \in N_{i} }} \exp \left( {LeakyReLU\left( {\lambda^{T} \left[ {W_{a} h_{ti} ||W_{a} h_{tk} } \right]} \right)} \right)}} $$
(9)

where \(N_{i}\) denotes the set of neighbors of the ith node, λ is the trainable weight vector applied before the LeakyReLU function, and || denotes concatenation. The refined outputs from the hypergraph attention mechanism are read out as a graph embedding vector by concatenating them horizontally, denoted as \(H^{\prime} = \left[ {h_{t1}^{^{\prime}} , \ldots ,h_{t5}^{^{\prime}} } \right]\) with \(H^{\prime} \in {\mathbb{R}}^{1 \times 640}\).
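The sketch below illustrates this attention step, assuming PyTorch; the neighbor mask is derived from a binary incidence matrix (so each node is also its own neighbor), which is one possible reading of the first-order neighborhood used here.

```python
import torch
import torch.nn.functional as F

# Sketch of the hypergraph attention step, Eqs. (8)-(9), assuming PyTorch.
# Nodes are neighbors if they share a hyperedge; the mask is derived from the
# incidence matrix A (each node is also its own neighbor in this sketch).
def hypergraph_attention(h, A, W_a, lam):
    """h: (n, d) node embeddings; A: (n, e) incidence matrix;
    W_a: (d, d) weight matrix; lam: (2d,) attention vector."""
    neighbors = (A @ A.T) > 0                        # (n, n) boolean neighbor mask
    Wh = h @ W_a                                     # (n, d)
    n = h.shape[0]
    # Pairwise attention logits lambda^T [W_a h_i || W_a h_j]
    logits = torch.stack([
        torch.stack([F.leaky_relu(torch.cat([Wh[i], Wh[j]]) @ lam) for j in range(n)])
        for i in range(n)
    ])
    logits = logits.masked_fill(~neighbors, float("-inf"))
    alpha = torch.softmax(logits, dim=1)             # attention coefficients a_ij
    return torch.sigmoid(alpha @ Wh)                 # refined embeddings h'_i

h = torch.randn(5, 128)
A = torch.tensor([[1., 0.], [1., 1.], [1., 0.], [0., 1.], [0., 1.]])
h_refined = hypergraph_attention(h, A, W_a=torch.randn(128, 128), lam=torch.randn(256))
```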

Comprehensive representation

Overall, the architecture of the HGCN-based model is shown in Fig. 7. Although it can handle the heterogeneous vectors of the equipment structure, statistical features also benefit the prediction result. Therefore, the proposed algorithm concatenates three statistical metrics of each feature (standard deviation, skewness, and kurtosis) with the graph embedding vector to form the comprehensive representation, denoted as \(H^{\prime\prime} = \left[ {H^{\prime},X_{extra} } \right]\), where \(X_{extra}\) is the set of statistical features. Hence, the final estimated value can be calculated through fully connected layers as:

$$ x_{hidden} = ReLU\left( {W_{d} *H^{\prime\prime} + b_{h1} } \right), $$
(10)
$$ y_{output} = W_{o} *x_{hidden} + b_{h2} , $$
(11)
Fig. 7
figure 7

The detailed structure of HGCN-based model for part-level nodes

where ReLU is the non-linear activation function, and \(W_{d}\), \(W_{o}\), \(b_{h1}\), \( b_{h2}\) are trainable parameters. Finally, the model is trained through backpropagation with the mean squared error as the loss function:

$$ L = \frac{1}{n}\mathop \sum \limits_{i}^{n} \left( {y_{output} - y_{true} } \right)^{2} $$
(12)
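A minimal sketch of the readout and prediction head is given below, assuming PyTorch; the hidden width of 64, the 42-dimensional statistical feature vector (three metrics for each of the 14 raw features), and the dummy target value are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the readout and prediction head, Eqs. (10)-(12), assuming PyTorch.
# The hidden width (64), the statistical-feature size (3 metrics x 14 features
# = 42), and the dummy target value are illustrative assumptions.
h_refined = torch.randn(5, 128)                  # attention-refined node embeddings
H_graph = h_refined.reshape(1, -1)               # horizontal readout H', 1 x 640
X_extra = torch.randn(1, 42)                     # std, skewness, kurtosis per raw feature
H_full = torch.cat([H_graph, X_extra], dim=1)    # comprehensive representation H''

head = nn.Sequential(nn.Linear(H_full.shape[1], 64), nn.ReLU(), nn.Linear(64, 1))
y_pred = head(H_full)                            # Eqs. (10)-(11)

loss = nn.MSELoss()(y_pred, torch.tensor([[85.0]]))  # Eq. (12) with a dummy target
loss.backward()
```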

Case study

To demonstrate the effectiveness of the proposed approach in a generic manner, an open dataset from the PHM Society 2016 data challenge on wafer CMP (Wang et al., 2017) is adopted to predict the average material removal rate.

Data description

The dataset contains multiple sensory signals collected from a CMP tool that removes material from wafers. This paper selects 14 out of the 25 total features, which are relevant to the parts and components of the CMP tool. They mainly include the usage of the polishing-pad backing film, dresser, polishing table, dresser table, and wafer carrier sheet, the flow rate of the slurry, and the pressure of different components. Besides, the time length of the wafer samples ranges from 199 to 5492 timestamps, with each sample corresponding to one MRR value (target). The dataset includes two stages, A and B. Stage A contains 376,859 records corresponding to 1166 wafers (i.e., a distinct wafer id has many timestamps but one corresponding MRR), and Stage B contains 295,885 records corresponding to 815 wafers. This experiment uses 80% of the dataset for training and the rest for testing. Table 3 provides numerical details of the training and test datasets.

Table 3 Training and test dataset numerical statistics
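The sketch below shows one possible way to assemble the per-wafer sequences and split them at the wafer level, assuming pandas; the file and column names (e.g., "WAFER_ID") are hypothetical and may differ from the actual PHM 2016 release.

```python
import pandas as pd

# Sketch of per-wafer sequence assembly and an 80/20 wafer-level split,
# assuming pandas. File and column names (e.g., "WAFER_ID") are hypothetical.
signals = pd.read_csv("cmp_stage_a_signals.csv")       # per-timestamp sensor records
targets = pd.read_csv("cmp_stage_a_removal_rate.csv")  # one MRR value per wafer id

sequences = {wid: grp.drop(columns=["WAFER_ID"]).to_numpy()
             for wid, grp in signals.groupby("WAFER_ID")}

# Split at the wafer level so all timestamps of a wafer stay on one side.
wafer_ids = list(sequences)
split = int(0.8 * len(wafer_ids))
train_ids, test_ids = wafer_ids[:split], wafer_ids[split:]
```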

Average removal rate prediction

Hypergraph matrix

Due to the limited features, the complete CMP graph data model shown in Fig. 2 cannot be fully constructed. Nevertheless, since all the features in the open dataset are related to the part-level nodes, this paper only considers those nodes holistically. Following the same analysis described in the "Hypergraph construction" section, the hypergraph data model and its corresponding hypergraph matrix can be represented as shown in Fig. 8.

Fig. 8
figure 8

The matrix of the CMP hypergraph structure

Performance metrics

To evaluate the performance, the error will be measured by the following metrics:

$$ MSE = \frac{1}{m}\mathop \sum \limits_{i = 1}^{m} \left( {y_{i} - \widehat{{y_{i} }}} \right)^{2} $$
(13)

where MSE is the mean squared error, which measures the average squared difference between the estimated values and the actual values.

$$ MAE = \frac{1}{m}\mathop \sum \limits_{i = 1}^{m} \left| {y_{i} - \widehat{{y_{i} }}} \right|, $$
(14)

where MAE is the mean absolute error, which measures the average absolute difference between the estimated values and the actual values.
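Both metrics are straightforward to compute; a minimal NumPy version is shown below.

```python
import numpy as np

# Simple NumPy implementations of the evaluation metrics in Eqs. (13)-(14).
def mse(y_true, y_pred):
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
```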

Hyperparameters

This experiment uses the Adam optimizer with an initial learning rate of 0.01, an embedding dimension of 128 for each vector, a dropout rate of 0.1 for all feedforward layers, MSE as the loss function, one head in the graph attention mechanism, two HGCN layers, 100 training epochs, and a batch size of 128.
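Collected into a single configuration object, these settings read as follows (a plain sketch of the values listed above).

```python
# The hyperparameters listed above, collected into a single configuration sketch.
config = {
    "optimizer": "Adam",
    "learning_rate": 0.01,
    "embedding_dim": 128,
    "dropout": 0.1,
    "loss": "MSE",
    "attention_heads": 1,
    "hgcn_layers": 2,
    "epochs": 100,
    "batch_size": 128,
}
```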

Comparable cutting-edge models

To validate the advantages of the proposed model, it is compared with cutting-edge models adopted in the prognostic and health management field with the same hyperparameters, as listed below:

CNN-MR: Deep convolutional neural network-based regression approach (Babu et al., 2016).

LSTM-MR: Long Short-Term Memory approach for prediction (Zheng et al., 2017).

GRU-MR: Gated Recurrent Unit model for prediction (Yan et al., 2019).

Auto-Encoder + DNN: Using Auto-Encoder to generate additional features and feed them with the original features into DNN (Ren et al., 2018).

The experiment is conducted with five-fold cross-validation, data normalization, and early stopping to generate stable and better results. The comparison results with the cutting-edge approaches are shown in the upper part of Table 4 for each metric.

Table 4 Performance comparison

Meanwhile, the effectiveness of the HGCN-based model is validated by comparing the proposed model with variants in which different submodules are removed; the results are shown in the lower part of Table 4 for each metric. Hereby, (1) the proposed model without the HGCN layer means the node embedding vectors remain unchanged between the hypergraph construction and the hypergraph attention mechanism; (2) the proposed model without the hypergraph means all graph-related operations are removed, such as the hypergraph convolution layer, the graph attention mechanism, and the graph readout process; (3) the proposed model without statistical features means the statistical features are not concatenated before the DNN; and (4) the proposed model without temporal features means the DNN is trained only with the statistical features.

Furthermore, to validate the correctness of the hypergraph matrix, the experiment compares the proposed hypergraph matrix with other matrices and a random matrix; the results are shown in Table 5.

Table 5 HGCN-based model performance with different matrices

Discussion

Based on the experiment results obtained from Tables 4 and 5, some further analysis can be conducted as follows.

Comparison with baselines

According to Table 4, the proposed HGCN-based model outperforms the other cutting-edge models (CNN-MR, LSTM-MR, GRU-MR, and Auto-Encoder + DNN) in both Stage A and Stage B of MRR prediction. The results show that by embedding the equipment structure in hypergraph form into a deep learning approach, this structure provides meaningful and beneficial knowledge for the prediction task, and hence the proposed hypergraph construction method is effective. Theoretically, a hyperedge links more than two nodes, representing a synergistic mechanism involving more than two components in the complex equipment. The convolution layer exploits the complex, high-order relationships in the hypergraph for representation learning. Therefore, the proposed model outperforms the other cutting-edge models, which neglect the structural knowledge.

Effectiveness of the HGCN-based structure

One unique characteristic of the proposed HGCN-based model is that it contains a hypergraph structure and uses hypergraph convolution layers to learn the hidden data correlations. To validate its effectiveness, four ablation scenarios are considered, as shown in Table 4, where the proposed model achieves the lowest MSE and MAE compared with the variants without different submodules. This indicates that the HGCN, the hypergraph, and the statistical features all contribute positively to the prediction accuracy.

Correctness of the hypergraph matrix

The experiment also compares the performance obtained with the mechanism-based hypergraph matrix against other matrices. As shown in Table 5, the proposed hypergraph matrix achieves better performance than the undirected hypergraph matrix (all hyperedges undirected), the identity matrix, and the random matrix. This verifies the correctness of the proposed hypergraph matrix and further shows that the proposed hypergraph construction method can express the impact relationships effectively.

Limitations

Despite the above advantages, some parts of the model in this research work are simplified, for instance: (1) Weighting. The proposed model only reflects the different impacts by training the node weight matrix, while assuming all hyperedges have the same weight. However, the impact relationships vary across components and are influenced by both nodes and hyperedges. (2) Hyperedge. The hypergraph attention mechanism treats the nodes connected by the same hyperedge as first-order neighbors, which may not be precise enough since it is equivalent to a fully connected edge.

In summary, the proposed model can effectively predict the MRR of the CMP tool by learning the complex, high-order correlations among the heterogeneous data in the representative hypergraph. As a generic methodology, it can also be implemented in similar manufacturing scenarios with complex impact relationships.

Conclusion

MRR prediction plays a critical role in the CMP process. However, existing methodologies normally neglect the structural knowledge of the CMP tool, which contains a large amount of hidden information that can improve MRR prediction. To tackle this challenge, this paper first provided a novel framework to construct a CMP hypergraph data model, which represents the impact relationships of different components and parts in the CMP tool. Secondly, this paper proposed a novel HGCN-based model to learn the data correlations and to aggregate the node information in the hypergraph for MRR prediction with temporal data. A case study was conducted, revealing that the proposed HGCN-based model is capable of combining the hypergraph structure and node features effectively and that it outperforms the cutting-edge models in MRR prediction. The key contributions of this research can be summarized as follows:

  1. Proposed a systematic manner to transform the complex equipment structure into a representative hypergraph data model, which can reflect the complex impact relationships among components and parts effectively.

  2. Introduced a novel approach to embedding the nodes with various features and different time lengths into fixed dimensions and a fixed time length, which makes subsequent model training effective and rapid.

  3. Proposed the HGCN-based model for MRR prediction. This model integrates the HGCN, the hypergraph attention mechanism, and the GRU, which learn the heterogeneous data correlations more efficiently. As the experimental results show, it outperformed previous cutting-edge models on several metrics.

Apart from the case study of MRR prediction in the CMP tool, it is envisioned that this research can also bring insightful ideas and guidance to relevant tasks in other complex manufacturing processes. However, this research work still has some limitations, as pointed out in the "Discussion" section. Taking these factors into consideration, it is recommended that future work: (1) involve the environmental effects on the complex equipment (e.g., the chamber pressure), which may also affect the equipment performance; (2) consider the weighting of hyperedges; and (3) describe the neighbor relationships of different orders in the hypergraph.