1 Introduction

Multistate time series data, which record the time-varying values of variables, not only reflect the fluctuation trend of each variable but also imply the coupling relationships among variables. Analyses of multistate time series data are therefore important in many practical applications, especially in intelligent transportation systems (ITS). Most traffic services for smart cities, such as traffic signal control [1], traffic congestion forecasting [2] and incident detection [3], mine the multistate temporal characteristics of traffic flow for precise management. However, the unavoidable missing-data problem caused by hardware malfunction, transmission failure and data corruption brings great difficulties to accurate data analysis. Reduced analysis efficiency, more complicated analysis procedures and even biased conclusions caused by the differences between observed and missing data are typical problems induced by missing data [4, 5]. To avoid these problems, it is necessary to properly process the missing values (MVs) in multistate time series data.

Generally, missing-value processing methods can be divided into three categories. The first category simply deletes the samples with MVs. However, case deletion tends to discard useful information and may distort the sample distribution, especially when samples are limited or the missing rate is high [6]. The second category is single imputation, which models the data-missing process from the available partial information and estimates a reasonable value with an imputation model. This category mainly includes regression imputation (K-nearest neighbor (KNN) [7, 8], local least squares (LLS) [9, 10], nearest neighbor regression (NNR) [11], support vector machine (SVM) [12, 13]), probabilistic model imputation (probabilistic principal component analysis (PPCA) [14, 15], singular value decomposition (SVD) [16, 17]) and matrix completion imputation (low-rank matrix completion (LRMC) [18, 19] and self-representation (SRSP) [20, 21]). However, these models share a common problem: they impute a single value for each missing position and then treat the imputed data the same as the observed data in subsequent analysis. In fact, the MVs obey an implicit distribution determined by the available observed information, rather than taking a fixed value [22]. Single imputation implicitly assumes that the imputation model and its results are perfect, and fails to account for the uncertainty of missing data in the imputation process. This can be overcome by replacing each missing value with several slightly different imputed values; this kind of framework is the third category, multiple imputation, which is the most effective framework for missing-data analysis [23]. The idea of multiple imputation is to model the missing-data distribution by filling each MV multiple times, evaluating the fitness of all filled values and selecting the optimal one [24, 25]. However, the performance of multiple imputation relies on a correct distribution assumption in the selected imputation model [24]. Accurate distribution estimation is therefore imperative to increase the availability of multiple imputation.

Time series data usually exhibit both regularity and randomness over time, and show different trends and distributions under different states and conditions [3, 5]. Taking the traffic flow data of two adjacent detection points in Fig. 1 as an example, there are significant differences in the distribution and trend of traffic flow parameters in different time periods, and these differences remain obvious even for adjacent points in the same period. This multistate distribution characteristic of traffic flow further increases the difficulty of estimating the data distribution in the missing-data imputation task. Existing imputation methods adopt a fixed preset statistical distribution or a simple solution via a superficial model, and they cannot adaptively model the multistate distribution of time series data. Therefore, it is necessary to explore a more effective distribution estimation model to improve imputation performance on time series missing-data problems.

Fig. 1 Multistate time series of traffic sensor data

Recently, the generative adversarial network (GAN), as a state-of-the-art generative model, has offered a better choice for modeling data distributions. More specifically, by training on original data, a GAN can capture the original data distribution by making the distribution of generated samples approximate it [26, 27]. GANs have been successfully applied to image completion [28] and sentence generation [29], but the limitation of directly using a common GAN for MV imputation is that the network requires a complete dataset for training, which is unavailable for incomplete time series datasets. To deal with incomplete input data in the imputation task, Yoon et al. proposed the generative adversarial imputation network (GAIN), which learns the distribution of the missing part by adversarial learning between imputation and discrimination [30]. Subsequent models [5, 31,32,33] pay more attention to the multivariate characteristics of data by designing specific generators, such as multichannel or feature-convolution structures. However, the multistate distribution characteristic of time series data increases the difficulty of the distribution learning process; existing models lack a corresponding targeted structural design, which may degrade distribution learning performance. Meanwhile, a GAN obtains its input variables by random generation and computes the corresponding results; this generating scheme brings variety to the generated results but also increases their uncertainty [26, 30]. For the MV imputation task, the uncertainty of the generated results affects imputation accuracy. To date, the effective estimation of the multistate distribution of time series data and the handling of uncertainty in the imputation process remain challenging in time series imputation.

To deal with these problems, this paper proposes a novel imputation network framework that combines the distribution modeling ability of GANs with the uncertainty handling ability of multiple imputation. The framework consists of two stages. In the first stage, a time series generative adversarial imputation network (TGAIN) is constructed to overcome the difficulty of modeling the multistate distribution of time series missing values. TGAIN utilizes conditional information and a sequence generation structure to direct the data imputation process. Through large-sample adversarial training on the incomplete dataset, the well-trained TGAIN model can learn the distribution of missing data under different states, the implicit relationships between variables and the temporal information of the time series. In the second stage, a multiple imputation strategy is adopted to deal with the uncertainty of imputation and determine the 'best' filling value. Multiple input 'noises' of TGAIN's generator are utilized to generate multiple filling values that obey the learned distribution, TGAIN's discriminator evaluates the imputation fitness of each filling value, and a max-pooling structure determines the final best filling value. Comparative experiments on two real-world traffic datasets show that the proposed method is better at dealing with the uncertainty of imputation and estimating the multistate time series distribution. The main contributions of this paper can be summarized as follows:

  1. To deal with the distribution estimation of multistate time series data and the uncertainty of missing-value imputation, a novel TGAIN imputation framework is constructed by combining the generative adversarial network with multiple imputation. It is a new multiple imputation network for multistate time series data.

  2. To realize accurate distribution estimation for each state of time series data, the TGAIN network designs a conditional information and sequence generation structure to direct the generative adversarial imputation process. The well-trained TGAIN realizes multistate distribution learning and utilizes the latent temporal information of the dataset.

  3. To better capture the uncertainty of the imputation process, a multiple imputation strategy based on TGAIN is designed. Multiple input 'noises' of TGAIN's generator are utilized to generate multiple filling values that match the learned distribution, and a max-pooling structure determines the best filling value overall.

  4. The TGAIN imputation model outperforms several state-of-the-art methods under various missing patterns, even without complete observations for model training. Even at high missing rates, the imputation performance remains excellent.

The rest of this paper is organized as follows. Section 2 introduces related work on missing-data imputation. Section 3 presents the TGAIN imputation framework and algorithm. Experiments and comparisons with several state-of-the-art imputation methods are reported in Sect. 4. Section 5 concludes the paper and discusses future work.

2 Related work

Data missing is a common and challenging problem in practical applications, especially for multistate time series data. Over the past decades, a number of advanced methods have been proposed for data imputation and have significantly improved imputation performance by exploiting data correlations and the implicit data distribution [20, 34]. From the perspective of imputation structure, these methods can be divided into two categories: single imputation models and multiple imputation models.

2.1 Single imputation models

Single imputation models utilize the correlation between observed values and missing values to produce a single reasonable value to replace the missing part. This category can be roughly divided into the following three classes.

2.1.1 Regression-based imputation

Methods in this class attempt to model the inherent relationship between MVs and observed values by means of regression techniques, such as KNN [7, 8], LLS [9, 10], NNR [11] and SVM [12, 13]. Despite differences in the specific regression models, these methods share a common motivation: using observed values to predict the missing part. For example, the KNN-based imputation method [8] uses a similarity measure to find the samples in the dataset that are most similar to the samples with MVs, and estimates the MVs as the weighted average of those selected samples (see the sketch below). Instead of the weighted average used in KNN, LLS imputation [10] describes the relation between the MVs and observed values by least squares regression, allowing more flexibility. Although these methods use local data relations to recover the MVs individually, they all fail to consider consistency across the sample space and the performance impact of the multistate distribution. Even worse, if a whole series of consecutive data is lost, the imputation performance of these methods decreases rapidly.
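As a concrete illustration, the following is a minimal sketch of KNN-style imputation, not the implementation used in [8]; the donor-selection rule and the inverse-distance weighting are assumptions for illustration.

```python
import numpy as np

def knn_impute(X, k=5):
    """X: (n_samples, n_features) array with np.nan marking missing values."""
    X_imp = X.copy()
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        obs = ~miss
        # Donor candidates: samples fully observed on both the query's
        # observed positions and its missing positions.
        cand = np.where(~np.isnan(X[:, obs]).any(axis=1)
                        & ~np.isnan(X[:, miss]).any(axis=1))[0]
        cand = cand[cand != i]
        if cand.size == 0:
            continue
        d = np.linalg.norm(X[cand][:, obs] - row[obs], axis=1)
        idx = np.argsort(d)[:k]                  # k most similar donors
        w = 1.0 / (d[idx] + 1e-8)                # inverse-distance weights
        X_imp[i, miss] = (w[:, None] * X[cand[idx]][:, miss]).sum(0) / w.sum()
    return X_imp
```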

2.1.2 Probability-based imputation

In this kind of method, the complete data are assumed to follow a probabilistic distribution of a specific form with unknown parameters, e.g., a Gaussian mixture model. Based on the observed partial data, both the distribution parameters and the missing data can be estimated iteratively. Markov chain Monte Carlo (MCMC) [35] and probabilistic principal component analysis (PPCA) [14, 15] are representative methods and have shown promising results in traffic data imputation. The major disadvantage of these methods is that imputation performance depends heavily on the prior assumption about the data distribution, which is unknown in practice. In particular, owing to the multistate characteristic of time series data, it may be improper to postulate a uniform distribution for time series in different states.

2.1.3 Matrix completion-based imputation

Methods in this class organize the whole or partial data into a matrix, and the missing values are recovered based on specific assumptions about the data matrix. Low-rank matrix completion (LRMC) [18, 19] assumes that the whole data matrix has a global low-rank structure, so MVs can be recovered through rank minimization on the whole matrix (a minimal sketch is given below). More recently, matrix self-representation (SRSP) [20, 21] has been proposed, which assumes that each sample can be well represented by a linear combination of other samples. Despite encouraging results, the assumptions of these methods lead them to disregard local relations in the data.
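One common way to instantiate the rank-minimization idea is iterative singular-value soft-thresholding (a SoftImpute-style routine). The sketch below illustrates that family under assumed threshold and iteration settings; it is not the exact LRMC algorithm of [18, 19].

```python
import numpy as np

def soft_impute(X, M, tau=5.0, n_iter=100):
    """X: data matrix; M: mask (1 = observed, 0 = missing); tau: threshold."""
    X_obs = np.where(M == 1, X, 0.0)        # observed entries, 0 elsewhere
    Z = X_obs.copy()
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - tau, 0.0)         # soft-threshold singular values
        L = (U * s) @ Vt                     # low-rank estimate
        Z = X_obs + (1 - M) * L              # keep observed entries fixed
    return Z
```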

The above three kinds of single imputation methods each have their own theoretical value, but no single imputation method can effectively resolve the uncertainty in the imputation process. This problem limits the effectiveness of such algorithms.

2.2 Multiple imputation models

Multiple imputation (MI) is a general framework that incorporates uncertainty into the imputation process. MI comprises three stages: the imputation stage, in which the dataset's statistical parameters and distribution are estimated and variability is injected into the imputed values to create multiple complete datasets; the analysis stage, in which each complete dataset is analyzed using a complete-data technique; and the pooling stage, in which the analysis results are combined to yield a final result, thereby accounting for both the uncertainty in the data and the uncertainty due to missing values. The original multiple imputation [22], Bayesian MI [36] and deep learning [24] extend the multiple imputation methodology, and some improved single-imputation algorithms also draw on the idea of multiple fillings [37, 38]. The core problem is how to effectively model the data distribution at the missing positions. Even though existing methods fuse the results of different algorithms or multiple results of a single algorithm [37, 38], accurately estimating the distribution of missing data from partial observations remains a hard problem, especially for multistate time series data.

In summary, missing-data imputation is mainly founded on latent data correlations and the implicit data distribution. How to model the correct data distribution and handle the uncertainty simultaneously is the key technical difficulty in improving the effect of multiple imputation.

3 Method

To deal with the uncertainty of imputation and the multistate distribution estimation in the time series imputation task, a novel imputation framework is designed, as described in Fig. 2. The framework divides into two stages. The first stage is the distribution estimation for time series missing data. Inspired by the distribution modeling ability of GANs [26], we propose a novel TGAIN architecture that learns the missing-data distribution adversarially from incomplete data samples. Considering the multistate characteristic and the utilization of temporal correlation, a conditional vector and a seq-to-seq temporal generator are introduced into TGAIN to direct the imputation of the missing values. Through large-sample training, the trained TGAIN can learn the data distribution under different states, the implicit relationships among observations and the temporal information of the data. The second stage is multiple imputation by TGAIN to determine the best imputed values. Multiple 'noises' are input into TGAIN's generator to generate multiple filling values that obey the learned distribution, and the discriminator gives the probability that each imputed value is consistent with the original incomplete time series. A max-pooling structure selects the maximum-probability result and gives the final filling value for each missing position.

Fig. 2 Framework of time series generative adversarial imputation network (TGAIN)

Formally, we are given a collection of multivariate time series data with \(d\) dimensions \(X = [x_{t_{0}}, \ldots, x_{t_{i}}, \ldots, x_{t_{N - 1}}] \in R^{d \times N}\), where \(x_{t_{i}} \in R^{d}\) denotes the \(t_{i}\)th observation of \(X\) and \(x_{t_{i}}^{j}\) is the \(j\)th feature of \(x_{t_{i}}\).

Note that in the missing-data imputation problem, the observed time series \(X\) is incomplete; let \(\tilde{X}\) denote the incomplete time series matrix. A mask matrix \(M \in R^{d \times N}\) is introduced to indicate whether each value of \(X\) exists, i.e., if \(x_{t_{i}}^{j}\) exists, \(M_{t_{i}}^{j} = 1\); otherwise, \(M_{t_{i}}^{j} = 0\). For example:

$$\tilde{X} = \begin{bmatrix} x_{t_{1}}^{1} & * & x_{t_{3}}^{1} & * \\ * & * & x_{t_{3}}^{2} & x_{t_{4}}^{2} \\ x_{t_{1}}^{3} & x_{t_{2}}^{3} & * & x_{t_{4}}^{3} \end{bmatrix} \quad M = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \end{bmatrix}$$
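In code, the mask follows directly from the NaN pattern of the incomplete matrix; a minimal numpy sketch mirroring the example above (the concrete values are hypothetical):

```python
import numpy as np

# Toy incomplete series with np.nan at the '*' positions of the example.
X_tilde = np.array([[10.0, np.nan, 12.0, np.nan],
                    [np.nan, np.nan, 8.0, 9.0],
                    [5.0, 6.0, np.nan, 7.0]])
M = (~np.isnan(X_tilde)).astype(float)   # 1 where observed, 0 where missing
```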

The TGAIN architecture and the multiple imputation process are described in detail in the following subsections.

3.1 TGAIN architecture

To replace missing values in a multistate time series dataset, a time series generative adversarial imputation network (TGAIN) is constructed to learn, in an unsupervised manner, the distribution of the original time series under various conditions. In this custom GAN architecture, shown in Fig. 3, condition information \(C\) corresponding to \(\tilde{X}\) serves as an additional input vector to direct the data imputation process. The generator utilizes a random noise matrix and the condition vector to generate fake imputed values, and the discriminator is trained to distinguish the fake imputed parts from the real observed parts under the given conditions. As adversarial imputation training iterates, a Nash equilibrium is approached between the imputation ability of the generator and the discernment ability of the discriminator. Finally, the generator learns a mapping function \(G(z)\) that maps the random noise vector \(z\) to a realistic time series.

Fig. 3 TGAIN network architecture

3.1.1 Generator

The generator \(G\) takes \(\tilde{X}\), \(M\), a random matrix \(Z\) and the condition vector \(C\) as inputs, and outputs a complete matrix \(\overline{X}\). Here, \(Z\) must be independent of all other variables to avoid being influenced by variable uncertainty. The generator calculation is expressed as follows:

$$\overline{X} = G(\tilde{X} \odot M + (1 - M) \odot Z,C)$$
(1)
$$\hat{X} = M \odot \tilde{X} + (1 - M) \odot \overline{X}$$
(2)

where \(\odot\) denotes element-wise multiplication. Note that the directly generated matrix \(\overline{X}\) is entirely fake, whether a component was observed or missing. Thus \(\hat{X}\) is the completed data matrix obtained by replacing the missing components (*) in \(\tilde{X}\) with the corresponding values of \(\overline{X}\), as in Eq. (2).

To effectively utilize the latent data relationships (the time series tendency and the correlations among different features) and the guidance of conditional information in matrix generation, a novel generator is constructed as shown in Fig. 4. The TGAIN generator adopts a sequence generation architecture. First, the incomplete time series matrix is combined with the random matrix as \(\tilde{X} \odot M + (1 - M) \odot Z\) and fed to the generator, and a multilayer fully connected encoder extracts spatial features across variables. Then, the spatial feature vectors are fused with the conditional vector by concatenation to guide subsequent calculations. An LSTM layer captures and utilizes the temporal relations, and the final feature vectors are input to a multilayer fully connected decoder to generate the imputed matrix in chronological order.

Fig. 4 Conditional time series generator of TGAIN

The whole generation process not only utilizes the spatial-temporal information of different variables but also incorporates the conditional information, which is conducive to the accurate estimation of missing data.
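A minimal PyTorch sketch of this conditional sequence generator follows. The layer widths and depths are illustrative assumptions, not the tuned values of Fig. 11, and missing entries of the input are assumed to be pre-filled with zeros.

```python
import torch
import torch.nn as nn

class TGAINGenerator(nn.Module):
    """Sketch of Fig. 4: FC encoder -> concat condition -> LSTM -> FC decoder."""
    def __init__(self, d, c_dim, h=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d, h), nn.ReLU(),
                                     nn.Linear(h, h), nn.ReLU())
        self.lstm = nn.LSTM(h + c_dim, h, batch_first=True)
        self.decoder = nn.Sequential(nn.Linear(h, h), nn.ReLU(),
                                     nn.Linear(h, d))

    def forward(self, x_tilde, m, z, c):
        # Eq. (1): replace missing entries with noise before encoding.
        x_in = x_tilde * m + (1 - m) * z                 # (batch, N, d)
        feat = self.encoder(x_in)                        # per-step spatial features
        cond = c.unsqueeze(1).expand(-1, x_in.size(1), -1)
        out, _ = self.lstm(torch.cat([feat, cond], dim=-1))
        x_bar = self.decoder(out)                        # generated series
        x_hat = m * x_tilde + (1 - m) * x_bar            # Eq. (2)
        return x_bar, x_hat
```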

3.1.2 Discriminator

In the TGAIN framework, the discriminator \(D\) is introduced as an adversary to train \(G\). Because the task is imputation, unlike a standard GAN whose generator output is entirely real or entirely fake, here the output is composed of both real and fake components. The discriminator of TGAIN therefore attempts to distinguish which components are real (observed) and which are fake (imputed), rather than identifying whether an entire vector is real or fake. The discriminator's output amounts to predicting the mask matrix \(M\), which is predetermined by the dataset. Note that the discrimination process also requires the guidance of the corresponding conditional information, which accommodates the multistate distribution of the data.

Formally, the discriminator is a function \(D: \hat{M} = D(\hat{X},C)\), where the \(i\)th component of the output \(\hat{M}\) corresponds to the probability that the \(i\)th component of \(\hat{X}\) was observed under condition \(C\).
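Under the same assumptions as the generator sketch above, a matching discriminator can be written as a per-component mask predictor with a sigmoid output; the architecture choices are illustrative.

```python
import torch
import torch.nn as nn

class TGAINDiscriminator(nn.Module):
    """Sketch: outputs, per component, the probability it was observed."""
    def __init__(self, d, c_dim, h=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d + c_dim, h), nn.ReLU(),
                                 nn.Linear(h, h), nn.ReLU(),
                                 nn.Linear(h, d), nn.Sigmoid())

    def forward(self, x_hat, c):
        cond = c.unsqueeze(1).expand(-1, x_hat.size(1), -1)
        return self.net(torch.cat([x_hat, cond], dim=-1))  # M_hat in (0, 1)
```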

3.1.3 Conditional vector

Considering the differences in the latent relations and distributions of a multistate time series dataset, the conditional vector is constructed to direct the missing-value imputation process. It extends the whole network into a conditional model, because the generator and discriminator are conditioned on the same extra information \(C\). \(C\) can be any kind of auxiliary information for a given imputation task, such as class labels or data from other modalities. Note that the condition information \(C\) should be potentially related to the multistate characteristic of the time series data, so that \(C\) can direct the data generation and imputation process under different data distributions. Meanwhile, \(C\) should be set appropriately for each task: more conditional information increases the complexity of network learning and requires more data under each condition, whereas less conditional information may lead to poor conditional modeling performance.

In the traffic flow imputation task, traffic data show obviously different temporal distribution patterns between workdays and non-workdays and across time periods within the same day. Therefore, the week label and time label are selected as the conditional vector in this paper, and the specific partition is determined by the real traffic data distribution. To ensure the independence of the labels, they are processed by one-hot encoding [39]; that is, each label is given an effective encoding bit.

$$C = [\text{week label},\ \text{time label}]$$
(3)
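For the label lengths reported in Sect. 4 (a 7-bit week label and a 6-bit time label), the conditional vector can be built as below; the helper function is hypothetical.

```python
import numpy as np

def condition_vector(weekday, time_slot, n_slots=6):
    """weekday in 0..6, time_slot in 0..n_slots-1; one-hot [week, time], Eq. (3)."""
    week = np.eye(7)[weekday]
    time = np.eye(n_slots)[time_slot]
    return np.concatenate([week, time])

c = condition_vector(weekday=0, time_slot=2)   # e.g., Monday, third time period
```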

3.1.4 Objective

In the TGAIN network, the discriminator \(D\) is trained to maximize the probability of correctly predicting \(M\), and the generator \(G\) is trained to minimize the probability that \(D\) predicts \(M\) correctly, so we define the objective function as

$$V(D,G) \, = \, {\mathbb{E}}_{{\hat{X},M,C}} [M^{T} \log D(\hat{X},C) + (1 - M)^{T} \log (1 - D(\hat{X},C))]$$
(4)

where log denotes the element-wise logarithm and the dependence on \(G\) is through \(\hat{X}\).

Then, the objective of TGAIN is a minimax game problem given by

$$\mathop {\min }\limits_{G} \mathop {\max }\limits_{D} V(D,G)$$
(5)

Writing \(\hat{M} = D(\hat{X},C)\), Eqs. (4) and (5) can be rewritten as:

$$\mathop {\min }\limits_{G} \mathop {\max }\limits_{D} {\mathbb{E}}_{{\hat{X},M,C}} \left[ {M^{T} \log (\hat{M}) + (1 - M)^{T} \log (1 - \hat{M})} \right]$$
(6)

3.1.5 Loss

TGAIN attempts to model the latent distribution of missing data through repeated generative adversarial learning, rather than merely the statistical expectation. We therefore solve the minimax optimization problem of the network in an iterative manner.

We first optimize the discriminator \(D\) with a fixed generator \(G\) using the mini-batch method in [40]. For each sample \((\tilde{x}(j), m(j), c(j))\) in the mini-batch, we draw \(k_{D}\) independent random samples \(z(j)\). We define the discriminator loss function as

$$L_{D} (\tilde{X},M,C,Z) = \mathop {\min }\limits_{D} - \sum\limits_{j = 1}^{k_{D}} \left[ m(j)^{T} \log D\left( \hat{x}(j),c(j) \right) + \left( 1 - m(j) \right)^{T} \log \left( 1 - D\left( \hat{x}(j),c(j) \right) \right) \right]$$
(7)

Second, we optimize the generator \(G\) using the newly updated discriminator \(D\) with mini-batches of size \(k_{G}\). Note that \(G\) in fact outputs values for the entire data matrix (including the components we observed). Therefore, in training \(G\), the loss function should ensure not only that the imputed values for missing components (\(m_{j} = 0\)) successfully fool the discriminator, but also that the values output by \(G\) for observed components (\(m_{j} = 1\)) are close to those actually observed. This also ensures that the representations learned in \(G\) suitably capture the information contained in \(\tilde{X}\) (just like an auto-encoder).

To achieve this, a two-part loss function is defined to evaluate the fitness of imputation. The following paragraphs describe the masked reconstruction loss and the discriminative loss in detail.

3.1.6 Masked reconstruction loss

The masked reconstruction loss ensures that the samples generated by \(G\) are close enough to the original incomplete time series \(\tilde{X}\) on the non-missing components. It is defined by the masked squared error between the original sample \(\tilde{X}\) and the generated sample \(\overline{X} = G(\tilde{X},Z,C)\):

$$L_{M} \left( \tilde{X},\overline{X},M \right) = \mathop {\min }\limits_{G} \sum\limits_{i = 1}^{k_{G}} \left\| \tilde{x}(i) \odot m(i) - G\left( \tilde{x}(i),z(i),c(i) \right) \odot m(i) \right\|^{2}$$
(8)

3.1.7 Discriminative loss

The discriminative loss forces the samples generated by \(G\) to be as realistic as possible on the missing components. It represents the degree of authenticity of the generated sample \(G(\tilde{X},Z,C)\), and is based on the discriminator \(D\)'s output on the missing components, which represents the confidence that the generated values are real:

$$L_{G} (\tilde{X},M,C,Z) = \mathop {\min }\limits_{G} - \sum\limits_{i = 1}^{k_{G}} \left( 1 - m(i) \right)^{T} \log \hat{m}(i)$$
(9)

\(L_{G}\) is smaller when \(\hat{m}_{i}\) is closer to 1, i.e., when \(D\) is less able to identify the imputed values as imputed (\(D\) falsely categorizes them as observed).

As can be seen from their definitions, \(L_{G}\) applies to the missing components (\(m_{i} = 0\)) and \(L_{M}\) applies to the observed components (\(m_{i} = 1\)). The generator \(G\) is then trained to minimize the weighted sum of the two losses:

$$\mathop {\min }\limits_{G} \sum\limits_{i = 1}^{k_{G}} L_{M} \left( \tilde{x}(i),\overline{x}(i),m(i) \right) + \alpha L_{G} \left( \tilde{x}(i),m(i),c(i),z(i) \right)$$
(10)

where \(\alpha\) is a hyper-parameter that balances the two loss terms and affects the final imputation performance.

Table 1 illustrates the TGAIN network training process in the first stage of the imputation framework. First, we fix the generator parameters and compute \(L_{D}\) to direct the discriminator parameter update. Second, the discriminator parameters are fixed and we compute the mixed loss \(L_{M} + \alpha L_{G}\) to direct the generator parameter update. The two steps are iterated in a loop until the network loss converges. Through the adversarial learning between imputation and discrimination, the imputed fake values gradually approach the latent 'real' values. When the TGAIN network converges, the distribution of values imputed by \(G\) is consistent with the latent distribution of the missing data. This is confirmed in the convergence analysis experiment of TGAIN.

Table 1 TGAIN network iterative training pseudo-code
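Building on the generator and discriminator sketches above, the following is one plausible realization of the Table 1 iteration; drawing a single noise sample per mini-batch item and using a fixed epoch count are simplifying assumptions.

```python
import torch

def train_tgain(G, D, loader, alpha=5.0, lr=1e-3, epochs=100):
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    bce = torch.nn.functional.binary_cross_entropy
    for _ in range(epochs):
        for x_tilde, m, c in loader:          # incomplete series, mask, condition
            z = torch.randn_like(x_tilde)     # standard Gaussian noise
            # Step 1: update D with G fixed, cf. Eq. (7): D predicts the mask.
            with torch.no_grad():
                _, x_hat = G(x_tilde, m, z, c)
            loss_d = bce(D(x_hat, c), m)
            opt_d.zero_grad()
            loss_d.backward()
            opt_d.step()
            # Step 2: update G with D fixed, cf. Eqs. (8)-(10).
            x_bar, x_hat = G(x_tilde, m, z, c)
            m_hat = D(x_hat, c)
            # Masked reconstruction loss, Eq. (8).
            loss_m = ((x_tilde * m - x_bar * m) ** 2).sum() / m.sum()
            # Discriminative loss on missing components, Eq. (9).
            loss_g = -((1 - m) * torch.log(m_hat + 1e-8)).sum() / (1 - m).sum()
            loss = loss_m + alpha * loss_g    # alpha weights L_G as in Eq. (10)
            opt_g.zero_grad()
            loss.backward()
            opt_g.step()
```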

3.2 Multiple imputation by TGAIN

Through TGAIN network training, the generator \(G\) learns a mapping function \(G(z) = z \mapsto x\) that maps the random noise vector \(z\) to an imputed value satisfying the latent distribution. However, a problem remains: the random noise vector is randomly sampled from a latent space, e.g., a Gaussian distribution. This means the generated values may vary within a range as the input random noise \(z\) changes. In other words, the imputed value generated from a single random sample \(z\) may be some distance from the ideal imputed value, as shown in Fig. 5a.

Fig. 5 Schematic diagram of multiple imputation by TGAIN

Inspired by the uncertainty handling of multiple imputation [22, 23], a novel multiple imputation procedure by TGAIN is designed as the second stage of our framework. The function of this stage is to find the best vector \(z\) in the latent input space so that the generated sample \(G(z)\) is as close as possible to the latent ideal value \(\tilde{x}\). To do this, multiple random samples are input into the well-trained TGAIN generator to generate imputed values, and the well-trained TGAIN discriminator measures the imputation fitness of each generated sample. The maximum imputation fitness corresponds to the ideal imputed value, as shown in Fig. 5b. The multiple imputation network is therefore designed as a combination of multiple generate-evaluate operations with pooling, as shown in Fig. 6. Here, a max-pooling structure integrates the evaluation results, computes the maximum imputation fitness and gives the most reasonable imputed value.

Fig. 6 Multiple imputation stage by TGAIN

The multiple imputation stage of TGAIN is summarized as pseudo-code in Table 2:

Table 2 Multiple imputation pseudo-code by TGAIN
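A sketch of the stage-2 procedure follows, assuming the max-pooling of Fig. 6 is applied per missing component (pooling per sample would be an equally valid reading of the paper).

```python
import torch

@torch.no_grad()
def multiple_impute(G, D, x_tilde, m, c, n=30):
    """Draw n noise samples, score each candidate with D on the missing
    components, and keep the highest-scoring value per missing entry."""
    best_val = torch.zeros_like(x_tilde)
    best_fit = torch.full_like(x_tilde, -1.0)
    for _ in range(n):
        z = torch.randn_like(x_tilde)
        _, x_hat = G(x_tilde, m, z, c)
        fit = D(x_hat, c)                      # component-wise imputation fitness
        better = (fit > best_fit) & (m == 0)   # max-pooling over candidates
        best_val[better] = x_hat[better]
        best_fit[better] = fit[better]
    return torch.where(m.bool(), x_tilde, best_val)
```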

4 Experiments and analysis

4.1 Traffic sensor data and setting

In this study, we evaluate the imputation performance of the proposed TGAIN on two real-world traffic flow datasets: (a) traffic section volume data from fixed road sensors on I90 in Seattle, USA, and (b) traffic speed raster data from floating-car GPS in Changchun, China. The testing sites and datasets were screened to ensure data completeness, so that imputation performance can be evaluated conveniently and objectively, even though both types of data frequently suffer from missing data in practice.

4.1.1 I90 volume database

The data come from the Interstate 90 (I90) highway in Seattle, USA, collected by the Digital Roadway Interactive Visualization and Evaluation Network (DRIVENet), an open-access database (http://wsdot.uwdrive.net). The selected data contain the vehicle volume counts recorded by 15 sensors with upstream and downstream correlations. The selected sub-area road sensors are shown in Fig. 7. The time period is January 01, 2015, to December 31, 2015, with holiday data excluded to reduce experimental complexity. The sampling interval is 5 min. The traffic data from January 01, 2015, to September 30, 2015 (9 months, 75% of the total data) are used as the training dataset, and the rest (25%) as the testing dataset. To determine the optimal parameters of each experimental model and ensure an unbiased performance estimate, the training dataset is further split into training dataset A (25% of the total data), used to search for the optimal model architecture parameters, and training dataset B (50% of the total data), used to train the experimental models.

Fig. 7 Observation locations on the I90 highway in Seattle, USA

In the I90 experiment, the conditional vector of TGAIN is set as the week label concatenated with the time label; the time interval is set to 4 h, which distinguishes periods whose traffic distributions and trends differ obviously. The time label partition is shown in Fig. 8.

Fig. 8 Multistate time series traffic volume data partitioned by time label

4.1.2 Changchun speed database

This database comes from GPS equipment installed on about 15,000 taxis in Changchun, China. The acquisition period is April 01, 2018, to May 31, 2018, from 08:00 to 22:00, excluding holidays to reduce experimental complexity. The test site, on the west-to-east direction of Weixing Street in Changchun, is shown in Fig. 9. The road network is gridded at about 150 m. We used ArcGIS software to filter the dataset, mapped the raw GPS points onto road segments with the map-matching algorithm in [41] and calculated the travel speed of each floating car. The speed value of each road grid was then calculated as the average travel speed of the floating cars within 5-min intervals. The traffic data from April 01, 2018, to May 16, 2018 (41 days, 75% of the total data) are used as the training dataset, and the rest (25%) as the testing dataset. The training dataset is divided into two parts and used in the same way as for the I90 database.

Fig. 9 Test site on Weixing Street in Changchun, China

Morning and evening congestion is a common phenomenon in urban road networks. The traffic speed data show an obvious multistate characteristic across the congestion formation, dissipation and free-flow states, as shown in Fig. 10. In the Changchun experiment, the conditional vector of TGAIN is again set as the week label concatenated with the time label, but the time label is determined according to the time-varying trends of the speed data distribution. The time label partition is shown in Fig. 10.

Fig. 10 Multistate time series traffic speed data partitioned by time label

4.2 Missing pattern and evaluation index

To reflect the complex distribution of MVs, the experiments simulate three common missing patterns: (i) missing completely at random (MCAR), in which the propensity for a data point to be missing is completely random, i.e., independent of the observed data and the other missing data, so MVs appear as isolated, randomly distributed points; (ii) missing at random (MAR), in which the occurrence of an MV depends on its neighboring MVs, so the pattern looks like groups of successive MVs; and (iii) a mixture of MCAR and MAR (MIXED), with a mixing ratio of 0.5, meaning half of the MVs come from MCAR and the other half from MAR. We also define the missing ratio \(\delta\) as the ratio of the number of MVs to the total number of values, and vary \(\delta\) from 0.1 to 0.9 in steps of 0.1 to simulate imputation problems of varying difficulty. In addition, the missing ratio \(\delta\) in TGAIN's training stage is set to 0.3, a relatively high missing rate in general, to ensure relatively rich information for learning the multistate distribution. In practice, the training dataset is real detection data, which may mix samples with various missing rates (a sketch of the pattern simulation is given below).
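A simplified sketch of how such masks can be generated; the MAR burst lengths and the MIXED composition are assumptions, not the paper's exact procedure.

```python
import numpy as np

def simulate_missing(shape, delta, pattern="MCAR", seed=0):
    """Return a mask M (1 = observed, 0 = missing) with missing ratio ~delta."""
    rng = np.random.default_rng(seed)
    d, N = shape
    M = np.ones(shape)
    n_miss = int(delta * d * N)
    if pattern == "MCAR":                      # isolated random points
        idx = rng.choice(d * N, n_miss, replace=False)
        M.flat[idx] = 0
    elif pattern == "MAR":                     # bursts of consecutive misses
        while (M == 0).sum() < n_miss:
            i, t = rng.integers(d), rng.integers(N)
            M[i, t:t + rng.integers(3, 10)] = 0
    else:                                      # MIXED: half MCAR, half MAR
        M *= simulate_missing(shape, delta / 2, "MCAR", seed)
        M *= simulate_missing(shape, delta / 2, "MAR", seed + 1)
    return M
```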

To comprehensively evaluate the effectiveness of TGAIN, we compare it with several state-of-the-art imputation methods, including mean imputation, KNN [8], NNR [11], multiple imputation [22], SVD [17], PPCA [15], LLS [10], LRMC [18], SRSP [20], VIGAN [32], CollaGAN [33], GAIN [30] and MISGAN [42]. Among them, mean imputation is usually regarded as the baseline for MV imputation, while the others belong to the different classes of methods described in Sect. 2, e.g., regression models, probabilistic models and matrix completion models. VIGAN, CollaGAN, GAIN and MISGAN are GAN variants for data imputation. The experiments are implemented in Python 3.6. Each method has parameters to be set: we first initialize the model parameters according to the definitions of the corresponding algorithms and previous research [20, 30, 43], and then further optimize the parameters of each method by the particle swarm optimization (PSO) algorithm [44, 45] to achieve the best imputation performance on the traffic flow data.

In the experiments, missing scenarios are generated artificially and each imputation method is used to obtain a corresponding estimate. To quantitatively measure recovery performance, the root mean squared error (RMSE) and mean absolute percentage error (MAPE), two widely used evaluation indexes, are selected to compute the differences between the imputed and real values.

$${\text{RMSE = }}\sqrt {\frac{1}{n}\sum\limits_{i = 1}^{n} {\left( {x_{i}^{impute} - x_{i}^{true} } \right)^{2} } }$$
(11)
$${\text{MAPE}} = \frac{100\% }{n}\sum\limits_{i = 1}^{n} {\left| {\frac{{x_{i}^{impute} - x_{i}^{true} }}{{x_{i}^{true} }}} \right|}$$
(12)

where \(n\) denotes the number of MVs, and \(x_{i}^{true}\) and \(x_{i}^{impute}\) denote the real and imputed values, respectively. Considering the randomness of artificially simulated missing entries, each experiment was repeated five times for each method and the average imputation error was calculated to ensure a stable assessment of the imputation effect.
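The two indexes, restricted to the artificially removed positions, in a minimal numpy form:

```python
import numpy as np

def rmse_mape(x_impute, x_true, m):
    """Eqs. (11)-(12), evaluated only where the mask marks a value missing."""
    miss = (m == 0)
    err = x_impute[miss] - x_true[miss]
    rmse = np.sqrt(np.mean(err ** 2))
    mape = np.mean(np.abs(err / x_true[miss])) * 100.0
    return rmse, mape
```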

4.3 Network parameters setting

In the TGAIN input settings, the elements of the noise matrix follow a standard Gaussian distribution, which ensures the unbiasedness of the generated input noise \(z\). The length of the conditional vector depends on the label settings; the combination of week label and time label is adopted in both the I90 volume and Changchun speed experiments. After one-hot encoding [39], the week label length is 7 (e.g., Monday is coded [0000001] and Sunday [1000000]); the time label partition is shown in Figs. 8 and 10, and its length is 6. More importantly, the size of the input matrix \(X \in R^{d \times N}\) is a critical parameter. Because of the division by conditional labels, the length \(N\) under different \(C\) may be inconsistent, so \(N\) is set to the maximum sample length. If a sample is shorter than \(N\), it is padded with 0 elements and the corresponding elements of the marker matrix are set to 1. The I90 volume database contains data from 15 sensors and the time label interval is 4 h, so the input matrix dimension is 15 × 48. The Changchun speed database contains data from 18 sections and the maximum time label interval is 8 h, so the input matrix dimension is 18 × 96.

The TGAIN network structure has an important influence on imputation performance; the numbers of layers and nodes in the generator and discriminator are the key parameters. In the experiments, the particle swarm optimization algorithm [44, 45] is adopted to determine the optimal network parameters. First, different layer and node settings are tried for TGAIN and the corresponding imputation errors are calculated. Then, with the layer and node settings as variables and the minimum imputation error as the optimization goal, the network parameters are optimized by PSO. Figure 11 gives the optimal network parameter settings. In addition, the maximum number of training iterations is set to 20,000, the learning rate to 0.001, and the activation function and optimizer to ReLU and Adam, respectively.

Fig. 11 TGAIN network parameter settings

4.4 Convergence analysis of TGAIN

In TGAIN, the generative adversarial imputation and MV distribution learning process is optimized by iterative training. We investigate the convergence behavior of this algorithm under varying missing ratios and different missing patterns. The visualization of the training process on the I90 dataset under the MCAR pattern with \(\delta\) = 0.5 is shown in Fig. 12, and convergence curves of the G and D loss functions obtained in the experiments are shown in Fig. 13. In Fig. 12, the third row is the unbroken traffic flow space-time matrix, which represents the ideal imputed matrix, and the first row is the randomly generated missing matrix (MCAR, \(\delta\) = 0.5). The second row is the matrix imputed by TGAIN. As training iterations increase, the imputed matrix becomes increasingly similar to the unbroken matrix. Meanwhile, the generator and discriminator losses converge quickly in Fig. 13. This confirms the effectiveness and convergence of TGAIN's MV distribution learning, and shows that TGAIN realizes an effective mapping \(G(z) = z \mapsto x\) based on the learned distribution.

Fig. 12 Visualization of the TGAIN training process (MCAR, \(\delta\) = 0.5, 04:00–08:00); the process shows training and testing by G

Fig. 13 TGAIN generator and discriminator loss convergence curves (MCAR, \(\delta\) = 0.5)

4.5 Comparison experiments on I90 dataset

Tables 3, 4 and 5 list the imputation errors of the different algorithms under the MCAR, MAR and MIXED missing patterns, respectively. Several points emerge from these tables. First, MCAR and MAR are the easiest and hardest of the three patterns, respectively: continuous data missing loses more valuable information from the dataset and increases the difficulty of accurate imputation. Second, the mean-imputation baseline is worst in imputation accuracy, because it relies on a fixed distribution assumption and ignores the differences among multistate distributions. Third, the imputation error of every method increases with the missing rate. All methods perform well at low missing rates, where enough information is available to drive the model statistics and calculation. Mean imputation, SVD, PPCA, LLS, LRMC and SRSP degrade rapidly as the missing ratio increases, because these models mainly utilize local data correlations and make less use of the regularity contained in the historical dataset. The imputation errors of KNN, NNR, VIGAN, CollaGAN, GAIN, MISGAN and TGAIN grow relatively slowly, since they essentially exploit data under the same distribution state; these models are thus more robust and can accommodate missing-rate fluctuations in reality. Fourth, VIGAN, CollaGAN, GAIN and MISGAN achieve excellent imputation performance across the various missing situations, verifying the distribution-approximation ability of GANs. Among them, MISGAN and TGAIN are relatively better, because both use hint information to guide the distribution learning process. In particular, TGAIN achieves the best performance in most cases, and the comprehensive comparison confirms its ability to handle the uncertainty of imputation and the multistate distribution estimation.

Table 3 Imputation error obtained by different methods under MCAR missing pattern. (I90 dataset)
Table 4 Imputation error obtained by different methods under MAR missing pattern. (I90 dataset)
Table 5 Imputation error obtained by different methods under MIXED missing pattern. (I90 dataset)

To further illustrate imputation performance, Fig. 14 shows imputation results obtained by different methods under the MCAR pattern with \(\delta\) = 0.3. The imputation residual of TGAIN is the smallest, and its imputed tendency matches the observations well.

Fig. 14 Partial imputation results obtained by different methods (4.30 observation spot, MCAR, \(\delta\) = 0.3)

4.6 Influence of parameters on imputation error

4.6.1 The \(\alpha\) balancing the masked reconstruction and discriminative losses of TGAIN's G

According to Eq. (10), \(\alpha\) is an important factor that balances the reconstruction error of the generated matrix on the observed portion against the imputation error on the missing portion. We vary \(\alpha\) over {0.1, 0.5, 1, 3, 5, 7, 10, 20} and record the imputation error under different missing situations. Results under the MCAR pattern and different missing rates are shown in Fig. 15. The imputation error first decreases and then increases with increasing \(\alpha\), and \(\alpha = 5\) works best. This means TGAIN should first ensure that the 'real' portion of the generated matrix matches the observed portion, and then let the 'fake' portion successfully fool the discriminator. When \(\alpha\) is too large, it leads to overfitting on the 'real' portion and weakens the imputation ability of \(G\). In the experiments, \(\alpha\) is set to 5.

Fig. 15 Imputation error for different values of \(\alpha\) (MCAR)

4.6.2 The imputation number \(n\) of multiple imputation by TGAIN

The imputation number \(n\) is the key parameter of TGAIN multiple imputation; it affects the determination of the 'best' imputed value and the imputation performance. In the parameter search phase, we vary the number of random input matrices to find a satisfactory setting. Results under the MCAR pattern and different missing rates are shown in Fig. 16. The imputation performance improves as the imputation number \(n\) increases, which further verifies the success of the MV distribution learning in the TGAIN training stage and the significance of handling uncertainty in the multiple imputation stage. However, a larger imputation number requires more computing resources and computation time, so \(n\) must balance imputation performance against computational efficiency. In the experiments, the imputation number \(n\) is set to 30.

Fig. 16 Imputation error for different multiple imputation numbers

4.7 Comparison experiments on Changchun dataset

To further verify the applicability of the TGAIN algorithm, the Changchun speed dataset was used for traffic MV imputation. Table 6 lists the imputation errors of the different algorithms under the MCAR missing pattern. Speed MV imputation proves easier than volume MV imputation, because the multistate characteristic of traffic speed is more obvious and the data fluctuation range is relatively smaller. The results show that most imputation methods achieve relatively small errors (RMSE and MAPE) at low missing rates, and that the errors grow rapidly as the missing rate \(\delta\) increases. Overall, TGAIN has the minimum absolute and relative errors in most missing conditions and also exhibits more stable filling performance.

Table 6 Imputation error obtained by different methods under MCAR missing pattern. (Changchun dataset)

To better show the imputation performance under different traffic distribution states, the evaluation indexes were calculated separately for each experimental method according to the partition in Fig. 10. Figure 17 shows the RMSE and MAPE of the different methods under different traffic states on weekdays and weekends. The imputation performance varies with the traffic state: the speed data in the congestion formation state show a sharp declining trend and strong randomness, which increases the difficulty of MV imputation, so the imputation error in this state is the largest. The error in the congestion dissipation state is relatively small owing to the stability and consistent tendency of the data distribution. Because of the random fluctuation of the free-flow state, there is no obvious difference between weekdays and weekends. On the whole, TGAIN has superior imputation ability under different data states, and the results confirm its ability to handle the uncertainty of imputation and the multistate distribution estimation.

Fig. 17 Imputation errors obtained by different methods under different traffic states on weekdays and weekends (MCAR, \(\delta\) = 0.5)

5 Conclusion

To deal with the distribution estimation of multistate time series missing data and the uncertainty of the imputation process, a novel MV imputation framework is proposed based on the generative adversarial network and multiple imputation. In the first stage, a novel TGAIN is built; it uses an adversarial imputation process to capture the latent distribution of MVs and learn information from multistate time series data. An adjustable condition vector and a novel time series generator are constructed to direct the adversarial learning for each data state. In the second stage, to reduce the uncertainty of imputation, a new multiple imputation procedure by TGAIN determines the best filling value. The TGAIN network structure is skillfully combined with the multiple imputation process to overcome the defect of predefining the data distribution. We apply the proposed method to two real-world traffic sensor datasets, and the experimental results show that TGAIN multiple imputation has superior robustness and imputation performance, and deals with the uncertainty and distribution estimation of time series imputation better than other state-of-the-art methods.

Furthermore, the proposed TGAIN imputation framework can be used as a general method for multistate time series MV imputation in other fields (such as cyber-physical systems and health care). The specific functional design of the generator and conditional vector in TGAIN can be adjusted for other imputation tasks, which would be a valuable extension of TGAIN.