
1 Introduction

Time series data refers to data arranged in chronological order that reflects how the state of things changes over time [25]. Typical time series data include factory sensor data, stock trading data, regional climate data, power data, transportation data, and so on. In practical applications, multivariate time series data is more common. However, due to the complexity of real-world conditions, multivariate time series data often contains missing values for various reasons [14, 26], such as sensor malfunctions, communication errors, and accidents [13]. Moreover, because sampling equipment often lacks a fixed sampling rate, even a nominally complete dataset may have irregular intervals between data points, which can also be regarded as a missing-data problem. These missing values undermine the interpretability of the data and can seriously degrade the performance of downstream tasks such as classification, prediction, and anomaly detection [12]. Imputing the missing values is therefore an essential step before such incomplete time series can be studied further.

To address the problem of missing data, many time series imputation methods have been proposed to infer missing values from the observed values. Traditional imputation methods are based on mathematical and statistical theory and fill in missing values by analyzing the data, but they ignore the inherent temporal correlation in time series data. In contrast, deep-learning-based time series imputation methods exploit the temporal information in multivariate time series data and achieve better imputation accuracy. Recurrent neural networks capture temporal dependence by computing hidden states at each unit, but they still emphasize the outputs of adjacent time steps and have difficulty capturing and utilizing long-term dependencies in time series [9].

In recent years, Generative Adversarial Networks (GANs) have shown outstanding performance in data generation. However, general GANs do not consider the temporal correlation of time series data and ignore the potential correlations between the different features of a multivariate time series. Therefore, in this paper, we propose a new multivariate time series imputation model called DAGAN, which consists of a generator and a discriminator. The generator uses a gated recurrent neural network to learn the temporal information in the multivariate time series data and applies an attention mechanism to compute a weighted sum of the hidden states of the recurrent network, which improves the model's ability to learn long-term dependencies while ensuring that it focuses on important information. Furthermore, we use a self-attention mechanism to link the different variables of the multivariate time series, allowing all time steps to participate in each layer and thus maximizing imputation accuracy. The task of the discriminator is to distinguish between real and generated values. We improve the discriminator with a temporal cueing matrix, which encodes part of the missing-value information of the original time series and further forces the generator to learn the true data distribution of the input dataset more accurately.

In summary, compared with existing time series imputation models, this paper makes the following contributions:

  1) We propose a time series imputation method that combines attention mechanisms and gated recurrent neural networks to enhance the ability of recurrent neural networks to utilize the long-term dependencies of time series data, while ensuring that the model focuses on important information and improves the quality of imputation.

  2) We use a masked self-attention mechanism to capture relationships between different features in the multivariate time series. By incorporating self-attention at each layer, all time steps participate, maximizing the accuracy of imputation for multivariate time series data.

  3) We conduct experiments on two different real datasets, using the root mean square error (RMSE) as the performance indicator. The results show that, in most cases, our model outperforms the baseline methods.

2 Related Work

There are two main ways of dealing with missing values in time series analysis: direct deletion and imputation [11]. The direct deletion method simply removes samples or features containing missing values. Although this method is simple and easy to implement, it reduces the sample size and can discard important information [10]. Imputation methods can be divided into statistics-based, machine-learning-based, and neural-network-based methods, depending on the imputation technique used.

Statistics-based imputation methods use simple statistical strategies, such as imputing missing values with the mean or median [11, 14]. For example, SimpleMean [7] imputes missing values with the mean. Although these methods can fill in missing values quickly, they distort the variance of the original data and have poor imputation accuracy.

Common machine-learning-based imputation methods include the k-nearest neighbor (KNN) method, the expectation-maximization method, and random forests. The basic idea of the KNN algorithm is to find the k nearest neighbors of a missing data point among points with known values and then compute a weighted average of these neighbors' values to obtain the imputed value. Random forest is a decision-tree-based imputation method [24]. Although these methods achieve better imputation performance than deletion and simple statistical imputation, they cannot exploit the temporal correlation of the data or the complex correlations between different variables.

In time series data, variables change over time and are interrelated, so exploiting the temporal correlation of the data is crucial for improving imputation accuracy. With the rapid development of neural network technology, researchers have started to apply it to time series imputation [6, 10, 18]. To properly handle time series data with missing values, some researchers have proposed RNN-based methods, which can capture the temporal dependencies of time series data [19, 22]. Among them, GRUD [3], which is based on gated recurrent neural networks, uses hidden state decay to capture temporal dependence. BRITS [2] treats missing values as variables and imputes them based on the hidden states of a bidirectional RNN. GLIMA [23] uses a global and local neural network model with multi-directional attention to process missing values, which partly overcomes the problem that RNNs depend heavily on the outputs of nearby timestamps. A remaining problem of RNN-based models is that, over long sequences, the influence of earlier inputs becomes negligible even though they should not be ignored [2, 28]. This issue does not exist in GAN-based models, which generate data with the same distribution as the original data through the game between generator and discriminator. Examples of GAN-based imputation models include GRUIGAN [15], GAIN [28], BiGAN [8], and SSGAN [17]. These methods exploit the advantages of GANs and combine them with RNNs to improve their ability to capture temporal dependence.

3 Preliminary

Before introducing our proposed model, we give a formal definition of multivariate time series data and explain the notation used in the model.

Given a multivariate time series \(X=\left( x_{1}, x_{2}, \ldots , x_{T}\right) \in \mathbb {R}^{T \times D}\) and timestamps \(T S=\left\{ t_{1}, t_{2}, \ldots , t_{T}\right\} \), the t-th observation \(\textbf{x}_{t} \in \mathbb {R}^{D}\) consists of D features \(\left\{ x_{t 1}, x_{t 2}, \ldots , x_{t D}\right\} \) and corresponds to timestamp \(t_{t}\). We define a mask matrix M to indicate the positions of missing values in the dataset: when \(x_{i j}\) is missing, the corresponding element of the mask matrix is 0, otherwise it is 1. The mask matrix M has the same size as the input dataset. The specific formula is shown as follows.

$$\begin{aligned} m_{i j}=\left\{ \begin{array}{ll} 0 &{} \text{ if } x_{i j} \text{ is } \text{ missing } \\ 1 &{} \text{ otherwise } \end{array}\right. \end{aligned}$$
(1)

We also introduce the time interval matrix \(\delta \), which is represented as follows.

$$\begin{aligned} \delta _{i j}=\left\{ \begin{array}{lr} \delta _{i-1, j}+t_{i}-t_{i-1}, &{} \text{ if } m_{i-1, j}=0, i>0 \\ t_{i}-t_{i-1}, &{} \text{ if } m_{i-1, j}=1, i>0 \\ 0, &{} i=0 \end{array}\right. \end{aligned}$$
(2)

The time interval matrix \(\delta \) records, for each variable, the time elapsed between the current time and the last valid observation; it has the same size as the input dataset. We then introduce the time decay factor \(\alpha \), which controls the influence of past observations. When the \(\delta \) value is larger, \(\alpha \) becomes smaller, indicating that the further a missing value is from the last true observation, the less reliable its estimate is.

$$\begin{aligned} \alpha _{i}=\exp \left( -\max \left( 0, W_{\alpha } \delta _{i}+b_{\alpha }\right) \right) \end{aligned}$$
(3)
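To make the preliminaries concrete, the following is a minimal NumPy sketch of the mask matrix, the time interval matrix, and the decay factor (Eqs. 1–3); the array shapes and the decay parameters \(W_{\alpha }\) and \(b_{\alpha }\) are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def build_mask(x):
    """Mask matrix M (Eq. 1): 1 where x is observed, 0 where it is missing (NaN)."""
    return (~np.isnan(x)).astype(np.float32)

def build_delta(mask, timestamps):
    """Time interval matrix delta (Eq. 2): time elapsed since the last valid observation."""
    T, D = mask.shape
    delta = np.zeros((T, D), dtype=np.float32)
    for i in range(1, T):
        gap = timestamps[i] - timestamps[i - 1]
        # If the previous value was missing, accumulate the gap; otherwise restart from it.
        delta[i] = np.where(mask[i - 1] == 0, delta[i - 1] + gap, gap)
    return delta

def decay_factor(delta, W_alpha, b_alpha):
    """Time decay factor alpha (Eq. 3): a larger delta gives a smaller alpha."""
    return np.exp(-np.maximum(0.0, delta @ W_alpha + b_alpha))
```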

4 Model

4.1 The Overall Structure

In this section, we introduce the overall architecture of the DAGAN model, which is shown in Fig. 1. The input to DAGAN includes the time series data, the mask matrix, and the time interval matrix. DAGAN consists of a generator and a discriminator. The generator generates imputed data based on the observed values in the time series; its objective is to deceive the discriminator with the generated data. The discriminator takes the imputed time series matrix and the temporal cueing matrix as inputs and attempts to differentiate between the generated data and the real data. Specifically, we introduce the temporal cueing matrix to encode part of the missing-value information stored in the mask matrix. Below, we describe the structure of each module in detail.

Fig. 1. The overall architecture diagram of DAGAN.

4.2 Generator Network

The goal of the generator is to learn the distribution of the multivariate time series data and generate the missing values. Our generator includes a Temporal Attention layer, a Relevance Attention layer, and a Feature Aggregation layer. The Temporal Attention layer captures the temporal dependencies of the time series data; its input is the time series data with missing values, the mask matrix, and the time decay factor defined in Sect. 3. The Relevance Attention layer captures the correlations between different features; its input is the output of the Temporal Attention layer. Finally, the temporal information from the Temporal Attention layer and the feature correlation information from the Relevance Attention layer are fed into the Feature Aggregation layer to obtain the final output. The resulting complete dataset retains the true values and replaces the missing values with generated values, according to the formula shown below.

$$\begin{aligned} x_{\text{ imputed } }=\textrm{M} \odot x+(1-\textrm{M}) \odot G(x) \end{aligned}$$
(4)

Here, \(x_{\text{imputed}}\) is the complete dataset imputed from the input dataset x and the mask matrix M, which encodes the distribution of missing values in the input dataset, and G(x) is the output of the generator.

The loss function of the generator includes two parts: adversarial loss and reconstruction loss. The adversarial loss is similar to that of a standard GAN. The reconstruction loss is used to enhance the consistency between the observed time series and the reconstructed time series. The loss function of the generator is shown below:

$$\begin{aligned} {\text {Loss}}_{G}={\text {Loss}}_{g}+{\text {Loss}}_{r} \end{aligned}$$
(5)
$$\begin{aligned} {\text {Loss}}_{g}=\left\| x \odot M-G(x) \odot M\right\| _{2}^{2} \end{aligned}$$
(6)
$$\begin{aligned} {\text {Loss}}_{r}=\log \left( 1-D\left( x_{\text{ imputed } } \odot (1-M)\right) \right) \end{aligned}$$
(7)
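For illustration, the following hedged PyTorch sketch combines Eqs. 4–7; the tensor shapes, the zero-filling of missing entries in x, and the generator/discriminator interfaces are assumptions made only for this example.

```python
import torch

def generator_loss(x, mask, g_out, discriminator, eps=1e-8):
    """x: observed series with missing entries zero-filled; mask: M; g_out: G(x)."""
    # Eq. 4: keep observed values, fill missing positions with generated ones.
    x_imputed = mask * x + (1.0 - mask) * g_out
    # Eq. 6: reconstruction term on the observed entries.
    loss_rec = torch.sum(((x - g_out) * mask) ** 2)
    # Eq. 7: adversarial term on the generated (missing) entries.
    d_fake = discriminator(x_imputed * (1.0 - mask))
    loss_adv = torch.log(1.0 - d_fake + eps).mean()
    # Eq. 5: total generator loss.
    return loss_rec + loss_adv
```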
Fig. 2. The structure of DAGAN's Generator.

Improved GRU. A deeper structure helps a recurrent neural network better model the temporal structure and capture more complex relationships in the time series [21]. Based on this, we design a feature mapping module that maps the input features into a latent representation. This improves the learning ability of the RNN without increasing the complexity of aggregating over multiple time steps.

$$\begin{aligned} \boldsymbol{x}_{t}=\tanh \left( \boldsymbol{W}_{m} x+\boldsymbol{b}_{m}\right) \end{aligned}$$
(8)

where \(W_{m}\) and \(b_{m}\) are learnable parameters, x represents the input data, and tanh is a nonlinear activation function.

In our proposed model, we choose a gated recurrent unit (GRU) network to handle the time series input of the generator. The GRU is a network structure adapted from the classical RNN; it controls information transfer in the network through a gating mechanism (a nonlinear activation function, here the sigmoid function), requires fewer parameters to train, and converges faster. In a standard GRU model, the input to each GRU unit is the hidden state \(h_{t-1}\) output by the previous unit and the current input \(x_t\). Each GRU unit has an internal update gate and a reset gate. The data flow inside the GRU can be expressed as follows.

$$\begin{aligned} \textbf{h}_{t}=\left( 1-\mu _{t}\right) \odot \textbf{h}_{t-1}+\mu _{t} \odot \tilde{\textbf{h}}_{t} \end{aligned}$$
(9)
$$\begin{aligned} \tilde{\textbf{h}}_{t}=\tanh \left( W_{h} \textbf{x}_{t}+U_{h}\left( \textbf{r}_{t} \odot \textbf{h}_{t-1}\right) +\textbf{b}_{h}\right) \end{aligned}$$
(10)
$$\begin{aligned} \mu _{t}=\sigma \left( W_{\mu } \textbf{x}_{t}+U_{z} \textbf{h}_{t-1}+\textbf{b}_{\mu }\right) \end{aligned}$$
(11)
$$\begin{aligned} \textbf{r}_{t}=\sigma \left( W_{r} \textbf{x}_{t}+U_{r} \textbf{h}_{t-1}+\textbf{b}_{r}\right) \end{aligned}$$
(12)

where \(\mu _{t}\) and \(\textbf{r}_{t}\) are the update and reset gates of the GRU, respectively, and tanh, \(\sigma \), and \(\odot \) denote the tanh activation function, the sigmoid function, and element-wise multiplication.

We integrate the previously introduced time interval matrix and time decay factor \(\alpha \) into the GRU unit. We multiply the time decay factor \(\alpha _{t}\) with the GRU hidden state \(\boldsymbol{h}_{t-1}\) to obtain the new hidden state \(\boldsymbol{h}_{t-1}^{\prime }\) (Fig. 3).

$$\begin{aligned} \boldsymbol{h}_{t-1}^{\prime }=\boldsymbol{\alpha }_{t} \odot \boldsymbol{h}_{t-1} \end{aligned}$$
(13)
$$\begin{aligned} \textbf{h}_{t}=\left( 1-\mu _{t}\right) \odot \boldsymbol{h}_{t-1}^{\prime }+\mu _{t} \odot \tilde{\textbf{h}}_{t} \end{aligned}$$
(14)
Fig. 3. Standard GRU Cell vs. Our Improved GRU Cell.
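As a rough sketch of the improved GRU cell (Eqs. 8–14, Fig. 3), the following PyTorch module applies the time decay of Eq. 13 before a standard GRU update; the layer sizes, the reuse of nn.GRUCell for Eqs. 9–12, and the parameter names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DecayGRUCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.feature_map = nn.Linear(input_size, input_size)  # feature mapping (Eq. 8)
        self.cell = nn.GRUCell(input_size, hidden_size)       # standard GRU update (Eqs. 9-12)
        self.decay = nn.Linear(input_size, hidden_size)       # W_alpha, b_alpha of Eq. 3

    def forward(self, x_t, delta_t, h_prev):
        alpha_t = torch.exp(-torch.relu(self.decay(delta_t)))  # time decay factor (Eq. 3)
        h_prev = alpha_t * h_prev                               # decayed hidden state (Eq. 13)
        x_t = torch.tanh(self.feature_map(x_t))                 # mapped input (Eq. 8)
        return self.cell(x_t, h_prev)                           # new hidden state h_t (Eq. 14)
```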

Temporal Attention. Recurrent neural networks have limited memory over long sequences and tend to focus excessively on adjacent time steps. To address these limitations and enhance the network's ability to capture important information and long-term dependencies within the time series, we design a temporal attention mechanism. It weights the hidden states of each time step, so that the hidden representation extracted from each time series contains comprehensive temporal information.

Given a set of hidden states \(\textbf{H}=\{\textbf{h}_1, \textbf{h}_2, \ldots , \textbf{h}_t\}\), we calculate an importance score \(\theta _{i}\) for each hidden state and then take the weighted sum of the hidden states to obtain the context vector \(v_{t}\). In this way, we effectively alleviate the tendency of the GRU to forget the first few steps of long sequences [1]. The specific calculation is shown in Eqs. 15 and 16.

$$\begin{aligned} \theta _{i}=\frac{\exp \left( {\text {func}}\left( \textbf{h}_{t}, \textbf{h}_{i}\right) \right) }{\sum _{j=1}^{t} \exp \left( {\text {func}}\left( \textbf{h}_{t}, \textbf{h}_{j}\right) \right) } \end{aligned}$$
(15)
$$\begin{aligned} \boldsymbol{v}_{t}=\sum _{i=1}^{t} \boldsymbol{\theta }_{i} \boldsymbol{h}_{i} \end{aligned}$$
(16)

where func is the function that calculates the attention score between the current hidden state and a historical hidden state. In this paper, we use the hidden state at time t as the query vector and the dot product as the scoring function.
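A minimal sketch of this temporal attention (Eqs. 15–16), assuming the hidden states are stacked into a single tensor and the last hidden state serves as the query:

```python
import torch

def temporal_attention(hidden_states):
    """hidden_states: tensor of shape (T, hidden_size) stacking h_1, ..., h_t."""
    query = hidden_states[-1]                                    # h_t as the query vector
    scores = hidden_states @ query                               # func(h_t, h_i) = dot product
    theta = torch.softmax(scores, dim=0)                         # importance scores (Eq. 15)
    context = (theta.unsqueeze(-1) * hidden_states).sum(dim=0)   # context vector v_t (Eq. 16)
    return context, theta
```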

Relevance Attention. A multivariate time series contains multiple variables, and analyzing the correlations between them can improve the final imputation accuracy. Therefore, to exploit the potential correlations between different variables, we design a relevance attention mechanism based on self-attention. The self-attention mechanism computes a representation of sequence data by automatically assigning different weights to each element of the sequence, thereby better capturing the relationships between elements. It consists of three parts, query, key, and value, and produces the attention representation of the sequence through similarity calculation, a softmax function, and a weighted sum. The calculation formula of the self-attention mechanism is shown below.

$$\begin{aligned} {\text {SelfAttention}}(Q, K, V)={\text {Softmax}}\left( \frac{Q K^{\top }}{\sqrt{d_{k}}}\right) V \end{aligned}$$
(17)

To enhance the model's imputation ability, we follow the self-attention designs in DISAN [20], SAITS [4], and XLNet [27] and apply a diagonal mask to the self-attention mechanism, setting the diagonal entries of the attention map to \( -\infty \) so that the diagonal attention weights approach 0 after the softmax function. The input is a matrix \(\textbf{Z} \in \mathbb {R}^{d \times l}\) obtained by stacking the context vectors \(v_t\), where d is the number of variables in the multivariate time series dataset and l is the length of \(v_t\). The specific formulas for relevance attention are as follows:

$$\begin{aligned}{}[{\text {Mask}}(x)]_{(i, j)}=\left\{ \begin{array}{ll} -\infty &{} i=j \\ x_{(i, j)} &{} i \ne j \end{array}\right. \end{aligned}$$
(18)
$$\begin{aligned} \textbf{Q}=\textrm{ZW}_{\textrm{q}} \quad \textrm{K}=\textrm{ZW}_{\textrm{k}} \quad \textrm{V}=\textrm{ZW}_{\textrm{v}} \end{aligned}$$
(19)
$$\begin{aligned} T={\text {Softmax}}\left( {\text {Mask}}\left( \frac{Q K^{\top }}{\sqrt{d_{k}}}\right) \right) \end{aligned}$$
(20)
$$\begin{aligned} \text{ MaskSelfAttention } (Q, K, V)=T V \end{aligned}$$
(21)

Q, K, and V refer to the query, key, and value vectors, respectively, and \(W_q\), \(W_k\), and \(W_v\) are trainable parameters. We first use Q and K to compute similarities and then apply the softmax function to obtain similarity scores, which reflect the correlation between different variables. With the diagonal mask, the input values at time step t cannot 'see' themselves and are not allowed to contribute to their own weights, so their estimates depend only on the input values at other time steps. With this setting, relevance attention better captures the relevance between different features and improves the model's imputation ability.
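The following PyTorch module is a minimal sketch of the diagonally masked self-attention of Eqs. 17–21; the projection dimensions and parameter names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MaskedSelfAttention(nn.Module):
    def __init__(self, dim, d_k):
        super().__init__()
        self.W_q = nn.Linear(dim, d_k, bias=False)
        self.W_k = nn.Linear(dim, d_k, bias=False)
        self.W_v = nn.Linear(dim, d_k, bias=False)
        self.d_k = d_k

    def forward(self, z):
        q, k, v = self.W_q(z), self.W_k(z), self.W_v(z)              # Eq. 19
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        # Eq. 18: diagonal entries set to -inf so no position attends to itself.
        diag = torch.eye(z.size(-2), dtype=torch.bool, device=z.device)
        scores = scores.masked_fill(diag, float('-inf'))
        attn = torch.softmax(scores, dim=-1)                          # Eq. 20
        return attn @ v                                               # Eq. 21
```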

Feature Aggregation Layer. Following the idea of residual connections, we aggregate Z, which carries the temporal information, with \(\hat{\textbf{Z}}\), which carries the correlation information, and obtain the final imputation result after a linear layer.

$$\begin{aligned} \hat{x}={\text {Linear}}(\textbf{Z}+\hat{\textbf{Z}})=W(\textbf{Z}+\hat{\textbf{Z}})+\textrm{b} \end{aligned}$$
(22)

where W, b are learnable parameters, \(\textbf{Z} \in \mathbb {R}^{d \times l}\) is a Context Vector with temporal information, and \(\hat{\textbf{Z}}\) is a correlation matrix with correlation information. The linear layer produces the final interpolated values.
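A short sketch of the feature aggregation layer (Eq. 22), assuming Z and \(\hat{\textbf{Z}}\) share the same shape:

```python
import torch.nn as nn

class FeatureAggregation(nn.Module):
    def __init__(self, dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(dim, out_dim)

    def forward(self, z, z_hat):
        # Residual-style sum of the temporal and correlation representations,
        # followed by a linear projection to the imputed values (Eq. 22).
        return self.linear(z + z_hat)
```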

4.3 Discriminator Network

Following the standard GAN framework, we use a discriminator to compete with the generator, which helps the generator produce more realistic data. Unlike general generative adversarial networks, our discriminator outputs a matrix in which each value indicates how likely the corresponding entry is to be real. To help the discriminator better distinguish true values from generated values, we introduce a temporal cueing matrix inspired by GAIN [28], which reveals a portion of the missing-value information. The temporal cueing matrix C is defined as:

$$\begin{aligned} \textbf{C}=\textbf{Y} \odot \textbf{M}+0.5(1-\textbf{Y}) \end{aligned}$$
(23)
$$\begin{aligned} \textbf{Y}=\left( \textbf{y}_{1}, \cdots , \textbf{y}_{i}, \cdots , \textbf{y}_{n}\right) \in \{0,1\}^{d \times n} \end{aligned}$$
(24)

where each element of Y is randomly set to 0 or 1. The discriminator concatenates the generated time series data and the temporal cueing matrix as its input, and its network structure consists of a GRU layer and a linear layer.
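A hedged sketch of the temporal cueing matrix (Eqs. 23–24) in the GAIN style; the cue rate, i.e. the probability of revealing a mask entry, is an assumption made here (it is set to 0.9 in the experiments of Sect. 5.3):

```python
import torch

def temporal_cueing_matrix(mask, cue_rate=0.9):
    y = (torch.rand_like(mask) < cue_rate).float()  # random 0/1 matrix Y (Eq. 24)
    return y * mask + 0.5 * (1.0 - y)               # reveal part of M, 0.5 elsewhere (Eq. 23)
```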

The loss function of the discriminator is shown below:

$$\begin{aligned} {\text {Loss}}_{D}=-\left( \log \left( D\left( x_{\text{ imputed } } \odot M\right) \right) +\log \left( 1-D\left( x_{\text{ imputed } } \odot (1-M)\right) \right) \right) \end{aligned}$$
(25)

where \({\text {Loss}}_{D}\) is the classification loss of the discriminator; we want the discriminator to distinguish the generated values from the true values as accurately as possible.
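A minimal sketch of the discriminator loss (Eq. 25); the small epsilon guarding the logarithms is an implementation assumption, not part of the formula.

```python
import torch

def discriminator_loss(x_imputed, mask, discriminator, eps=1e-8):
    d_real = discriminator(x_imputed * mask)           # scores for observed (true) positions
    d_fake = discriminator(x_imputed * (1.0 - mask))   # scores for generated positions
    return -(torch.log(d_real + eps) + torch.log(1.0 - d_fake + eps)).mean()
```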

5 Experiment

5.1 Datasets

To validate the performance of our proposed model, we conduct experiments on two real datasets: the PM2.5 dataset and the Health-care dataset.

PM2.5 Dataset: This dataset is a public meteorological dataset consisting of air pollutant measurements from meteorological monitoring stations in Chinese cities. The data are collected from 2014-05-01 to 2017-02-28. To evaluate the interpolation performance, we first divide the dataset into a training set and a test set in the ratio of 80% and 20%, and then we randomly remove data from the dataset and use them as missing values for training and testing.

Health-care Dataset: This dataset is from the PhysioNet Challenge 2012 [5] and contains 4000 multivariate clinical time series, each recorded within the first 48 hours after ICU admission. Each time series contains 37 variables such as body temperature, heart rate, and blood pressure. During training, we randomly remove data points from the dataset as missing values and then fill these missing values with zeros.

5.2 Baseline

This section describes methods and models commonly used in time series imputation, applies them to the datasets described above, and compares them with the model proposed in this paper.

  1) Zero, a simple imputation method that fills all missing values with zero;

  2) Average, where missing values are replaced by the average of the corresponding features;

  3) GRUD [3], a recurrent neural network based imputation model that estimates each missing value by a weighted combination of its last observation, the global average, and the recurrent component;

  4) GRUIGAN [15], an imputation network combining a gated recurrent network and a generative adversarial network;

  5) BRITS [2], a time series imputation method based on bidirectional RNNs;

  6) E2GAN [16], an end-to-end time series imputation model based on generative adversarial networks;

  7) SSGAN [17], a semi-supervised generative adversarial network based on bidirectional RNNs.

Among the above methods, Zero and Average are simple imputation methods. GRUD and BRITS are RNN-based methods. GRUIGAN, E2GAN, and SSGAN are all generative-adversarial-network-based models. We choose these as comparison methods to demonstrate the advantages of our model.

5.3 Experimental Setup

On each dataset, we select 80% of the data as the training set and 20% as the test set. For all tasks, we normalize the values to ensure stable training. In all the deep learning baseline models, the learning rate is set to 0.001, the number of hidden units in the recurrent network is 100, and the number of training epochs is set to 30. The dimensionality of the random noise in GRUIGAN and of the feature vectors in E2GAN is 64. For SSGAN, we set the cue rate to 0.9 and the label rate to 100%. For the DAGAN model in this paper, the discriminator cue rate is set to 0.9 and the number of hidden units is set to 100. We apply an early stopping strategy during model training. We also evaluate the imputation performance of all models at different missing rates. The missing rate is the ratio of missing values to the total number of data points and reflects the severity of missingness in the dataset. We randomly remove 10%–70% of the test data points to simulate different degrees of missingness. We use the root mean square error (RMSE) to evaluate the experimental results; a smaller RMSE means that the generated values are closer to the true values. The mathematical definition of the evaluation metric is given below.

$$\begin{aligned} RMSE=\sqrt{\frac{1}{m} \sum _{i=1}^{m}\left( \text {target}_{i}-\text {estimate}_{i}\right) ^{2}} \end{aligned}$$
(26)

where target and estimate are the true and generated values, respectively, and m is the number of samples.
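For reference, a small NumPy sketch of Eq. 26, assuming the error is evaluated only on the artificially removed entries indicated by an evaluation mask:

```python
import numpy as np

def rmse(target, estimate, eval_mask):
    """RMSE over the entries flagged by eval_mask (1 = evaluated, 0 = ignored)."""
    diff = (target - estimate) * eval_mask
    return np.sqrt((diff ** 2).sum() / eval_mask.sum())
```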

5.4 Experimental Results and Analysis

Table 1 displays the experimental results on the PM2.5 and Health-care datasets; bolded results indicate the best-performing method. The first set of experiments compares the performance of the various methods at different missing rates. DAGAN performs well on both datasets, demonstrating good generalization capability. It outperforms the best baseline model on average by 13.8%, 12.2%, 8.07%, and 3.9% at missing rates of 10%, 30%, 50%, and 70%, respectively. Its temporal attention and relevance attention capture the temporal correlation and the potential correlation between features, improving imputation accuracy by leveraging the information available in the time series. Overall, deep-learning-based approaches exhibit better imputation performance than statistics-based approaches. Table 1 also reveals that the imputation accuracy of all models declines as the missing rate increases, which can be attributed to the reduced amount of information available for imputation. Nonetheless, our model still outperforms the baseline models.

5.5 Ablation Experiment

To ensure the validity of our model, we conducted ablation experiments, each repeated ten times, and recorded the average root mean square error (RMSE) at a 10% missing rate for the test set. The ablation experiments evaluated the model’s performance by removing the temporal attention mechanism, the relevance attention mechanism, and the temporal cueing mechanism, respectively. Our experimental results reveal that the removal of the temporal attention mechanism, the relevance attention mechanism, and the temporal cueing mechanism led to a reduction in performance, indicating that the optimal use of temporal and correlation information between distinct features in the dataset is crucial for achieving high interpolation accuracy. Additionally, the results indicate that our adversarial training benefits from the incorporation of a temporal cueing matrix (Table 2).

Table 1. Performance comparison of time series imputation methods under different missing rates.
Table 2. The results of ablation experiments (RMSE).

6 Conclusion

In this paper, we propose a new multivariate time series imputation model called DAGAN, which consists of two parts: a generator and a discriminator. In the generator, we use a gated recurrent neural network to learn the temporal information in the multivariate time series data and an attention mechanism to compute a weighted sum of the hidden states of the recurrent network. This enhances the model's ability to learn long-term dependencies in the time series, ensures that the model focuses on important information, and compensates for the limited memory of recurrent neural networks and their excessive focus on adjacent time steps, thereby improving imputation quality. In addition, we use a masked self-attention mechanism to correlate the different variables of the multivariate time series, so that all time steps participate in each layer, maximizing the accuracy of multivariate time series imputation. The generator takes incomplete data with missing values as input and outputs the complete imputed series, and the discriminator attempts to distinguish the generated values from the true values. Experimental results demonstrate that our model's imputation performance is better than that of the other baseline models.