Selection of statistical technique for imputation of single site-univariate and multisite–multivariate methods for particulate pollutants time series data with long gaps and high missing percentage

K, Priti; Shakya, Kaushlesh Singh; Kumar, Prashant

doi:10.1007/s11356-023-27659-x

Selection of statistical technique for imputation of single site-univariate and multisite–multivariate methods for particulate pollutants time series data with long gaps and high missing percentage

Research Article
Published: 23 May 2023

Volume 30, pages 75469–75488, (2023)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Environmental Science and Pollution Research Aims and scope Submit manuscript

Selection of statistical technique for imputation of single site-univariate and multisite–multivariate methods for particulate pollutants time series data with long gaps and high missing percentage

Download PDF

472 Accesses
2 Citations
Explore all metrics

Abstract

Monitoring air contaminants has become essential to exposure science, toxicology, and public health research. However, missing values are common while monitoring air contaminants, especially in resource-constrained settings such as power cuts, calibration, and sensor failure. In contaminants monitoring, evaluating existing imputation techniques for dealing with recurrent periods of missing and unobserved data are limited. The proposed study aims to perform a statistical evaluation of six univariate and four multivariate time series imputation methods. The univariate methods are based on inter-time correlation characteristics, and the multivariate approach considers muti-site to impute missing data. The present study retrieved data from 38 ground-based monitoring stations for particulate pollutants in Delhi for 4 years. For univariate methods, missing values were simulated under 0–20% (5%, 10%, 15%, and 20%), and high 40%, 60%, and 80% missing levels having long gaps. Before evaluating multivariate methods, input data underwent pre-processing steps: selecting the target station to be imputed, choosing covariates based on the spatial correlation between multiple sites, and framing a combination of target and neighbouring stations (covariates) under 20%, 40%, 60%, and 80%. Next, the particulate pollutants data of 1480 days is provided as input to four multivariate techniques. Finally, the performance of each algorithm was evaluated using error metrics. The results show that the long interval time series data and spatial correlation of multiple stations significantly improved outcomes for univariate and multivariate time series methods. The univariate Kalman_arima performs well for long-missing gaps and all missing levels (except for 60–80%), yielding low error and high R² and d values. In contrast, multivariate MIPCA performed better than Kalman-arima for all target stations with the highest missing percentage.

Comparing Spatial and Spatio-temporal FPCA to Impute Large Continuous Gaps in Space

Spatiotemporal Exposure Prediction with Penalized Regression

Article 17 November 2022

EDA and a Tailored Data Imputation Algorithm for Daily Ozone Concentrations

Discover the latest articles, news and stories from top researchers in related subjects.

Environmental Chemistry

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Introduction

In common parlance, the issue of missing data exists across many areas of research, including exposure science, statistical surveys, epidemiological studies, and occupational health research (Junninen et al. 2004, Huisman 2009, Aslan 2010, Eekhout et al. 2012, Ramli et al. 2013, Sukatis et al. 2019, Kumar 2022). Air pollution monitoring has become an essential aspect of atmospheric exposure, public health, policy framework research, and risk communication to the public. The regulators have focused on strengthening the ground monitoring station networks to check air quality compliance with standards and for research applications. As a result, data on air quality is becoming more widely available, and the research underlying the associated health effects is also significantly growing.

Air pollution has been a significant concern in recent years due to unprecedented growth in urban centres. The rise in acute air contaminants incidences in many urban centres globally has increased the burden of disease, carrying culpability for around one out of every nine deaths annually. The WHO report of 2016 suggests that exposure to ambient and household air pollution contributes to mortality of over 7 million annually (WHO 2016). Also, air pollution is a direct indication of sustainable growth since sources of air pollution also produce climatic pollutants such as CO₂ or black carbon (Chan 2015). Monitoring of air pollutants has various spatial and temporal scales, and sampling methods can range from ground-based continuous monitoring stations as part of an ambient monitoring network (AMN) to low-cost or personal wearable sensors employed in the household or occupational settings that cover data from hours to days. Irrespective of sampling techniques, missingness can arise due to a variety of uncontrollable factors, including instrument malfunction, sensor sensitivity, maintenance or repair, calibration, and many other reasons (Wardana et al. 2022). Automated air quality monitoring stations preclude both regular and unscheduled shutdowns for maintenance. For illustration, many air monitors require 1 or 2 h every 2 weeks to examine (adjust) the zero value and air input flow, as well as a full calibration/maintenance every 6 months (i.e. a full working day shutdown) (Gómez-Carracedo et al. 2014, Little and Rubin 2019). Furthermore, remote stations experience lengthy outages due to power supply failures, problems with air aspiration pumps, or electronic processing malfunction. Such uncontrolled factors can impact time series data like air pollution as empty values can distort temporal information such as autocorrelation, trends, and seasonality. Therefore, mechanism and pattern of missingness are the most significant aspects to consider when choosing the optimal solution for imputation through preventing poor handling of missing data and misleading data interpretation.

Rubin and Little gave a mathematical explanation of the missing data mechanism and divided it into three types based on the relationship of values of attributes with unobserved/missing values (Rubin 1976). Missing at random (MAR) pattern is frequently observed in air pollution research data (Junninen et al. 2004, Plaia and Bondi 2006, Ghazali et al. 2020). Using the MAR mechanism, the cause of data missingness is explainable as missingness for an attribute is defined by the observed data, not by missing values. Missing information is recovered by using other variables for which the sample lacks missing data. In air pollution studies, missing data has a MAR pattern if data loss is due to power failure or system shutdown. When data is lost due to the inability of sensors to detect lower concentration limits of pollutants, it is missing not at random (MNAR) (Gómez-Carracedo et al. 2014). In a complex missingness structure, both MAR and MNAR missingness mechanisms exist. Furthermore, when data is missing due to unknown reasons, then it is missing completely at random (MCAR) (Hadeed et al. 2020). However, most empty values exist due to explainable conditions (Ghazali et al. 2020). Thus, understanding the selectivity of missing data and the corresponding mechanism is a pivotal step for dealing with missing data adequately.

Imputation techniques are broadly classified into two categories: Single imputation (SI) and multiple imputations (MI) to completely filled the missing data. SI provides single values for missing data items, whereas MI imputes multiple values for a given missing datum. Under the SI category, mean replacement (unconditional mean imputation) is the most widely used method in research studies. However, this method provides varied results depending on the pattern and type of missingness, pattern of missingness percentage, and gape size (short or long period of time the data are missing from the datasets) of missing values. Under the MAR mechanism, the estimated results through mean imputation provide inconsistent results through the variance of the regression coefficients, whereas the MCAR mechanism holds consistent variance but underestimates (Junger and De Leon 2015). Also, mean imputation gives more relevant results for normal data distribution than for skewed data. Median is another simple imputation technique that provides better results for skewed distribution, ameliorating the mean imputation gap (Junger and De Leon 2015, Hadeed et al. 2020). The univariate single imputation mean method tends to alter the tails of distribution as the number of observations is higher at the centre of normal distribution (Little and Rubin 2019). Another single imputation method that fills the missing values of univariate time series data of air pollutants using the last observed value of the same variable is last observation carried forward (LOCF). This method is convenient as it fixes the entire univariate time series dataset. However, even in MCAR conditions, this method can generate biased estimates (Molenberghs and Kenward 2007). Spline interpolation is another univariate inter-time imputation technique that imputes NAs values using the na.interpolation function in TS imputes package. The fitted function is a piecewise nonlinear polynomial that uses current available pollutants data to impute missing values (Wijesekara and Liyanage 2020). Other univariate methods considered for long-missing gaps, highly complex time series with trend and seasonality, and low autocorrelation are Kalman smoothing and seasonal decomposition using na.Kalman and na.seadec function, respectively. The seadec algorithm performs seasonal decomposition as a pre-processing step (Liu et al. 2020).

Multivariate time series imputation is another class of imputation techniques that handles multiple variables simultaneously, which provide single imputation or multiple imputations for missing data points. For the proposed work, all the techniques are multiple imputations based that rely on correlations between different co-variables to estimate values for missing data. Random forest imputation is a typical example of a multivariate imputation method implemented under the mice package (Little and Rubin 2019). This multivariate imputation approach fills NAs data points of the target variable using the time series of neighbouring variables. Other widely used multivariate imputation methods are predictive mean matching, multiple imputations through chain equation (MICE), expectation-maximisation, weighted predictive mean matching, random imputation, and multiple imputation principal component analysis. Predictive mean matching (PMM) is considered for multivariate imputation under the mice package of R. PMM is a hot deck-based algorithm where missing values are filled using covariates with the same distribution characteristics (Kleinke 2018). In their study, Marshall et al. (2010a, b) reported that PMM provides less biased estimates and significantly improved performance metrics in the context of missing covariate data (Marshall et al. 2010a, b). Another skewed data imputation study recommended that PMM performs better when missing data is MAR and < 50% (Marshall et al. 2010a, b). Another algorithm that stimulates multiple variate data points is midastouch (multiple imputation by automatic, distance-aided donor selection) under the MICE package. The method is based on a hot deck iterative algorithm with distance-based donor selection, and this measure of distance controls the trade-off between bias and variance of estimates (Siddique and Belin 2008, Siddique and Harel 2009). This method replaces PMM within the mice package by ‘midastouch’ (Gaffert et al. 2018). Multiple imputation PCA is another stimulation technique that estimates NAs data points and considers the similarity between individuals (rows) and the link between variables across all individuals. This method performs better when the correlation structure between variables is stronger (John et al. 2019). Regularised terms fix the problem of overfitting by penalising relatively unreasonably large parameters. In addition, these multiple imputation methods can take account of extra variability, yielding more precise results; however, such methods are only addressed a little (Schafer 1997).

Despite the fact that a large number of imputation tools are present in different statistical packages but used scarcely in air pollution studies as the most common approach is to ignore them. However, data continuity is important in time series modelling as excluding incomplete values influences temporal characteristics such as autocorrelation, seasonality, and trends. Moreover, deep learning algorithm requires a large amount of data, and some algorithms are sensitive to missing values; therefore, there are better options than complete case analysis to handle especially long missing data. Though some studies have used different imputation techniques, may it be single or multiple imputation-based methods for air quality datasets. However, its overall application still needs to be improved, with few tests of performance in real-world scenarios and little guidance regarding the imputation of air quality data.

This study presents the comparative analysis of six univariate time series methods used to impute low (0–20%) and high missing (40%, 60%, and 80%) percentage along with long missing gaps. Different approaches were used to filter the temporal components. Kalaman_arima and seadec method considered the temporal component of time series. In addition, we use spatial correlation component, i.e. multi-site, to select covariates for multiple imputation methods like PMM, MIDAS, RF, and MIPCA to impute high missing percentages under four missing categories (20%, 40%, 60%, and 80%). We examine the performance of single univariate imputation and multivariate muti-imputation methods for the short and long consecutive missing periods and provide future guidance to implement in other study settings.

Methodology

Study area

The study area considered for the proposed work is Delhi, the national capital of India, located in the northern part of Indo-Gangetic plain, which lies between the latitude of 28°24′17″ and 28°53′00″ north and longitude of 76°50′24″ and 70°20′37″ of the east with the total geographic area of 1483 Km² (700 km² urban and 783 km² rural). The climate of Delhi makes it favourable for pollution stagnation, especially during winter when the temperature reaches 22 °C to 5 °C and with no dispersion (Budhiraja et al. 2019). According to the Koppen classification system of climate Delhi has five seasons with the extreme type of climate and witnesses 714 mm of annual rainfall. From a demographic perspective, Delhi has the highest population density of 11,297 persons per sq. km as per the 2011 census compared to other Indian states/U.T. (Census, 2011; http://census2011.co.in). The key factors that cause alarming situation for Delhi’s air quality include the city’s landlocked geographical location, residual crop burning in neighbouring states of Punjab, Haryana, Uttar Pradesh, and Rajasthan, vehicular emissions, industrial pollution, and large-scale construction activities (Chatterji 2021). The city is contending with the capital’s escalating air pollution problem and associated health risks.

Data information

The empirical data of 38 continuous air quality monitoring stations (CAQMS) of Delhi is retrieved from the central pollution control board website (http://www.cpcb.nic.in/). The stations IDs have a series such as D_01, D_02, D_03, D_05…………. D_40. It is important to mention over here that the data of two stations, namely, D_04, and D_12 are not available. The geospatial distribution of these sites is shown in Fig. 1. Air pollutants data are in matrix form, (m × n), with each column n (representing variables) containing a time series of two air pollutants (PM_2.5 particulate matter with an aerodynamic diameter of less than 2.5 μm, PM₁₀ particulate matter with an aerodynamic diameter of less than 10 μm), considered and each attribute per station are studied. The particle pollutants are initially in hourly measurements but later aggregated on a 24-h basis. Thus, 1480 individuals (rows) are considered for two pollutants from February 2018 to February 2022 (4 years and 9 days). These stations are deployed in traffic, industrial, and residential zones to monitor important pollutants. While selecting the stations for study, some conditions are considered: the accessibility of particle pollution with at least 3 years from the same timespan is included.

Descriptive statistics of all monitoring stations

Appendix 1 shows descriptive statistics of all stations of the Delhi region for the PM_2.5 and PM₁₀ datasets. The detailed information can be found in Appendix 1 of five statistical characteristics: mean, 25% quantile, median, and 75% quantile, and standard deviation for each station (PM_2.5, PM₁₀), excluding missing values. The standard deviation is a measure of statistical dispersion that indicates how far observed values are dispersed from the mean of each variable for different locations. For example, small σ value for a monitoring station indicates that the monitored values of contaminants are close to the average; in contrast, a high standard deviation value depicts that actual values are far dispersed from the mean. Another measure of variability in descriptive statistics is quartiles, dividing the structured observed data (from lowest to highest) into four parts. The first quartile (25%) is defined as the value in the middle of the range between the minimum and the median, while the third quartile (75%) is defined as the value in the middle of the range between the median and the maximum (Wardana et al. 2022).

Missing data mechanism

Knowledge of the selectivity of missing data and the corresponding mechanism is a crucial starting point for dealing with missing data effectively. A correct approach to missing data is determined by whether the data is MAR or missing, not at random (MNAR). Observed data can be used to estimate missing values where the missing data pattern is MAR (or can be assumed to be MAR) (Rubin 1976). Such estimation is impossible when the data is MNAR, as estimation depends on unobserved/missing data. MNAR condition of missingness is nonignorable and generally viewed as a condition that results in biased predicted values. A third mechanism missing completely at random (MCAR) assumes that subjects with missing data are a random subset of the entire study sample, making them less susceptible to bias.

In most cases, missing data is neither MCAR nor MNAR. In the case of air quality data, the missingness condition most often observed is MAR, as reasons for missingness are known, such as routine maintenance, sensor malfunction, and power outages (Gómez-Carracedo et al. 2014). In some cases, the causes of missingness are unexplainable (e.g. MCAR). Furthermore, complete-case analysis for both MAR and MNAR data can produce biased results as such conditions have higher missing values and reduce the analysis’s precision. For the MCAR mechanism, complete case analysis (which allows for excluding incomplete observations) produces unbiased results. However, as air pollution is time series data, the listwise deletion method may break the data structure, resulting in the loss of valuable information. MAR’s missing data mechanism is more general and realistic than MCAR’s. Modern missing data methods typically begin with the MAR assumption (Van Buuren 2018). Thus, under the proposed study, we assumed MAR condition for air quality data imputation from available information.

Missing data statistics

It is important to understand the missing statistics of a dataset before selecting imputation techniques (Appendix 2). The higher percentage of missing value affects the time series of air pollutants. It would be much more challenging to perform time-related analyses, such as identifying weekly patterns or autoregressive models that predict from previous days. The percentage of missing across the two variables PM_2.5 and PM₁₀ in this study varied between 0 and 80%. Figures 2 and 3 depict that the dark areas represent the span of the monitored values of pollutant with no missing values, while the white stripes represent the missing values for PM_2.5 and PM₁₀, respectively.

Depending on the causes of the air pollution data missing, the gap size of the missing pattern can be divided into the short interval and long period missing. The missingness due to temporary power outages and routine maintenance causes short gaps in data collection, while sensor malfunctions and other critical reasons cause long-interval data missing. Missing values appears both in the discrete and consecutive missing pattern. For short-interval broken data, different level of missing percentage is considered as input for univariate time series imputation methods (20% missing rate category divided into four subcategories of 5%, 10%, 15%, and 20%) and for long-interval of missing levels 40%, 60%, and 80% is considered. Furthermore, we applied different multiple imputation methods based on the MICE algorithm in the case of long-interval missing levels. Before multivariate imputation, the pre-processing steps are applied to consider the spatial characteristics of monitoring stations while imputing missing values. The target stations to be filled under four categories, 20%, 40%, 60%, and 80% is, chosen based on the highest missing percentage (missing rates varied for each target station).

The univariate time series methods consider both low and high missing percentage monitoring stations for PM_2.5 and PM₁₀, and for multivariate imputation techniques, high missing percentage target monitoring sites are selected. The neighbouring sites were selected based on spatial characteristics of monitoring sites with a missing percentage of < 20%.

Imputation methods

The methods for imputation of time series can be broadly classified into three types: inter-time (time series), inter-variable (cross-sectional), and inter-variable + inter-time (TS cross-sectional). This study considered the first two approaches.

Univariate time series methods (inter-time series)

The univariate time series method uses temporal characteristics such as trend, seasonality, autocorrelation, and periodicity for air pollutants data. Temporal substitution techniques replace missing air pollutants data either with the mean of the whole column (time series) or stimulate the mean of neighbouring values using lagged data to fill missing gaps. One of the simplest and most popular methods to fix missing data is to replace them with the overall mean. There are better ideas than putting the overall mean in time series data like air pollution, which shows seasonality characteristics. Thus, mean imputation is generally not recommended by statisticians; it is only used in case of rapid fix only when less missing data is present. In case of large number of missing values, a fix value (i.e. mean) can change the distribution shape, standard deviation becomes smaller for imputed data compared to actual values. For high missing percentage with long consecutive gaps, mean imputation can cause shrinkage in standard deviation. Some studies have illustrated that mean imputation can work better with conditions like < 10% missing values in dataset and low correlation between variables (Raymond 1986, Tsikriktsis 2005). Though data stratification into smaller groups can slightly enhance this method (Norazian et al. 2008, Junior et al. 2016).

Another method, median, is a standard univariate imputation algorithm that does not consider the time series characteristics of air pollutants. However, it can give better results than the mean in a skewed distribution. Despite their popularity, these statistical techniques have the evident constraint of ignoring important temporal information and further affect correlation structure reasonably (Weerakody et al. 2021, Iodice D’Enza et al. 2022). However, these univariate time series substitution methods are relatively simple and do not account for any temporal variation in air pollution data.

LOCF is the univariant time-series longitudinal data method. The concept of the LOCF imputation technique is to substitute the most recently observed value for the entire missing data of the same variable. LOCF must be followed by a proper statistical analysis method that distinguishes between actual and imputed data. In time series data, the first-order Markov model assumes a similar condition as the LOCF imputation method. The probability of an observation at time ‘t’ depends on the last observation at a time (t-1) (Canales 2004). This method is simplest that takes advantage of time series characteristics. However, this algorithm has a drawback with a time series dataset with seasonality, when there are considerable differences between observations at t and the last Observation at the time (t-1) (Moritz et al. 2015). Another univariate time series method is the spline that replaces missing values based on the interpolation algorithm. This temporal interpolation method fits the observed data with a curve to stimulate NAs in air pollution data (Cho et al. 2020). The imputed air pollutants dataset can accurately simulate the temporal trend for short gap size. However, in case of more extended gap size, these simple imputation techniques like LOCF and interpolation might provide significant bias between imputed values and corresponding true values as data structure during an unobserved period is unpredictable from a univariate standpoint (Liu et al. 2020). This single imputation method ignores uncertainty in imputed values.

‘Seasonally Decomposed Imputation’ and ‘Kalman Smoothing on Structural Time Series Models’ are complex methods for imputation of missing data in univariate series since their ability to capture time-dependent characteristics like trend, seasonality, and cycles. However, they necessitate longer computation times (Moritz and Bartz-Beielstein 2017). In this study, Kalman Smoothing on the state space representation of an ARIMA model is used for imputation of time series data. However, to obtain more accurate results of imputation, smoothing algorithm is used (Agbailu et al. 2020). This technique has been developed by (Kalman 1960, Welch 2006). The algorithm imputes missing values based on lagged data points of the same variable. Let z_{(t − 1)} represents a set of lagged values and assume conditional distribution of μ_t is N(μ_t, p_t), μ_t, p_t are assumed to have been determined (Agbailu et al. 2020). The function na_kalman uses KalmanSmooth (as KalmanSmoother) and operates on state space representation of an ARIMA model obtained by auto.arima (Harvey 1990). This imputation technique based on the recursive data assimilation system of the Kalman filter (Moritz and Bartz-Beielstein 2017). To fill missing values, structural model fitted with maximum likelihood is used (Han et al. 2023).

Another univariate time series method is Seasonal decomposition (Secdec) combined with moving average to impute missing values. The Secdec algorithm removes the seasonal component of time series and afterwards algorithm like moving average perform imputation on deseasonalised dataset (Moritz and Bartz-Beielstein 2017). The air pollutant time series contains various levels of time dependency structures, such as 1 and 2 autocorrelation lags which would have been poorly captured by straightforward techniques like interpolation, mean values, random values. For this reason, among other imputation options, we choose moving average filters. In actuality, earlier Seadec tests that utilised these options produced subpar outcomes. Seadec. is therefore described as a useful technique for seasonal and autocorrelated series (Lloret et al. 2000, Benavides et al. 2022).

Multivariate timeseries methods (cross-sectional inter-variables)

Multivariate is another class of imputation method that accounts for temporal characteristics of a set of observed predictor variables. Compared to a single imputation (such as the mean), having multiple imputations eliminates uncertainty in missing values. In this case, covariates are selected as the correlation structure of the data is also an important factor in determining the performance of some multiple imputation methods (Dray and Josse 2015). The mice function of the R package MICE is a multiple imputation method that estimates the missing data and the variability associated with the imputation. MICE is an iterative algorithm; at each iteration, all variables except one are fixed and will be used as predictors, while an imputation method is chosen for the unfixed variable.

Predictive mean matching (PMM) was applied using the MICE package. The PMM method is especially effective for numerical variables without a normal distribution (Allison 2001). PMM is a semi-parametric method random case is chosen from complete cases with a predictive value that is close to the missing case (Little and Rubin 2019). Since the method uses data set values, they are accurate and there are no arbitrary imputed values (Schenker and Taylor 1996, Abayomi et al. 2008). The number of multiple imputations is 6 and maxit iterations is 20. Each target variable in the dataset is imputed using the donor variables in each iteration. These iterations should be repeated until convergence appears to have been achieved (Li et al. 1999). Several PMM variants were developed using various predictor variables (neighbouring stations) combinations. The number of imputations generated in PMM is six, and donor selection, d = 5, i.e. in each case of target stations missing value to be predicted, is matched to five cases with closest values. One of the donor values is chosen randomly, and its value is assigned to the target station with missing data (Allison 2015) and 24-h mean across the 20 iterations were used to impute daily concentrations of target stations. PMM Imputes missing values of target stations from observed values of covariates using linear regression coefficients. The advantage of PMM is that the imputed values are drawn from a range of observed values, and the possibility of imputing unrealistic values, such as negative concentrations, is eliminated (Van Buuren 2018). Similarly, two other multiple imputation methods considered for this study in the MICE package are MIDAS and Random Forest, which uses covariates to fill the target stations missing values giving multiple imputed datasets. RF method is based on Breiman’s algorithm as presented in (Breiman 2001, Doove et al. 2014).This non-parametric method predicts missing instances of features using values of other attributes. In this manner, the missing values for that feature are discovered. The process is repeated until it converges. For this study, value of parameter ntree chosen was 10.

Lastly, the multivariate imputation method was implemented on the air pollution dataset is MIPCA, implemented in the missMDA package. Josse and Husson (2009) proposed a regularised version of the iPCA algorithm (RPCA) to overcome the problem the overfitting (Iodice D’Enza et al. 2022). MIPCA provides multiple imputed datasets using the PCA model. The observed values are consistent across datasets, whereas the imputed values vary. The variability in imputed values reflects the variability in missing value prediction. According to Little and Rubin (2002), multiple imputations are appropriate because it accounts for parameter variability (Josse and Husson 2011). The Bayesian method is used for multiple imputations to reflect variability caused by missing values. The argument used for its implementation is a method.mi=‘Bayes’. The threshold for the criteria convergence is 1e⁻⁰⁴ (Audigier et al. 2016). Three steps are followed to perform MIPCA on the incomplete dataset: estimate the number of dimensions using the function estim_ncpPCA by k-fold cross-validation method to select tuning parameters. The second step is to use the function impute PCA to run the (regularised) iterative PCA algorithm with the number of dimensions selected in the first step. The third step is analysing the precision of each method by comparing the sample variance of the imputed values with the original variance (Li et al. 1999). These steps are repeated to achieve convergence.

R packages Functions used for missing values imputation

The R software (Crawley 2012) is a free and open-source statistical programming language. All single and multiple imputation methods are implemented under different R packages and are freely available from the Comprehensive R Archive Network (CRAN) at http://www.R-project.org/. We use the TS stats time series objects from base R to represent univariate time series (Zeileis and Grothendieck 2005). We implemented multiple univariate time series-based algorithms using TS packages such as mean, median (na.mean function), last observation carried forward (na.locf), spline (na.interpolation), Kalman Smoothing on the state space representation of an arima model (na. Kalman), seasonally decomposition model (na.seadec) (Moritz and Bartz-Beielstein 2017). Another R package used for multivariate imputation by chain equation is mice. The algorithms used are PMM, MIDAS , and random Forest to determine the missing values present in target variables (Van Buuren and Groothuis-Oudshoorn 2011). MissMDA package is used to for imputing missing values through MIPCA algorithm, (Stekhoven and Stekhoven 2013, Josse and Husson 2016). The evaluation metric to validate the imputation model through comparison of imputed and actual observed time series data using the goodness of fit hydroGOF package of R (Moriasi et al. 2007).

Methods for evaluation metric

There are several methods used for the evaluation of considered imputation methods. The accuracy of imputation methods is measured based on the difference between actual and imputed data through different error metrics across four levels of missingness (i.e. 0–20%, 20–40%, 40–60%, and 60–80 missing). Generally, five indicators are used to assess the performance of imputation algorithms as discussed below (Quinteros et al. 2019). In common parlance, low missing percentage result in lower RMSE/MAE/PBIAS error and higher R², and d scores. The RMSE/MAE values may differ significantly due to the physical nature of each pollutant. R² score is introduced to provide a more intuitive view of performance.

Root mean square error: RMSE is an error metric computed between the imputed value and the respective actual value time series, i.e. the standard deviation is calculated as referred to in Eq. 1.

$$RMSE\ \left(\overline{x},x\right)=\sqrt{\frac{\sum_{i=1}^n{\left(\overline{x_i}-{x}_i\right)}^2}{n}}$$

(1)

This method has been extensively used in the literature to evaluate the performance of imputation methods (Junger and De Leon 2015, Moritz et al. 2015, Wardana et al. 2022). In air pollution data, seasonality and trends are observed; summer values for particulate pollutants are low compared to winter. The upward trend can be seen during winter, but error metrics like RMSE may not be appropriate for such data with significant differences (Moritz et al. 2015). A lower value indicates that the model performed better.

Mean absolute error: MAE is an important error metric for datasets showing strong trends. The evaluation metric is based on the difference between imputed values and actual observed values in time series. This metric is least affected due to the dataset having a strong trend, i.e. the difference between datasets is high. Unlike RMSE, MAE is the more natural measure of average error that is unambiguous as RMSE varies with the variability of the error magnitude distribution, the square root of the number of errors (n^1/2), and the average-error magnitude (Willmott and Matsuura 2005) Eq. 2.

$$MAE\left(\overline{x},x\right)=\frac{\sum_{i=1}^n\left|\overline{x_i}-{x}_i\right|}{n}$$

(2)

Coefficient of determination: R² is commonly used for evaluating models as a goodness-of-fit metric. R2 is calculated by squaring the correlation coefficient between two columns and evaluating the variance between observed and predicted concentrations (Quinteros et al. 2019, Hadeed et al. 2020) Eq. 3.

$${R}^2={\left(\frac{\sum_{i=1}^n\left({x}_i-\overset{\acute{\mkern6mu}}{x_i}\right)\ast \left(\overline{x_i}-{\overset{`}{x}}_i\right)}{\sqrt{\sum_{i=1}^n{x}_i-\overset{\acute{\mkern6mu}}{x_i}}\ast \sqrt{\sum_{i=1}^n\overline{x_i}-{\overset{`}{x}}_i}}\right)}^2$$

(3)

where ${x}_i\&\overline{x_i}$ are the i_th value of actual and imputed observations, respectively, and $\overset{\acute{\mkern6mu}}{x_i}\&\overset{`}{x_i}$ are the mean of actual and imputed datasets. It indicates how well the model explains the variance in the observations when compared to using the mean of the observations as the prediction.

Percentage bias: PBIAS: The goodness of fit measures the average tendency of an imputation algorithm, i.e. if the imputed 24-h aggregated pollutants concentration within each station is larger or smaller than their observed values. A low value of PBIAS indicates accurate model simulation (zero implies optimal value; positive values indicate overestimation bias, while negative values indicate underestimation bias in the model). The metric is calculated in percentage as given in Eq. 4.

$$\textrm{PBIAS}\ \left(\overline{\textrm{x}},\textrm{x}\right)=\frac{\sum_{\textrm{i}=1}^{\textrm{n}}{x}_i-\overline{x_{\textrm{i}}}}{\sum_{\textrm{i}=1}^{\textrm{n}}{x}_{\textrm{i}}}\times 100$$

(4)

Index of agreement: Willmott (1981) developed the Index of Agreement (d) as a standardised measure of model prediction error that ranges from 0 to 1(Willmott and Matsuura 2006). A value of one indicates a strong positive match, while a value of zero indicates no agreement at all. The index of agreement can identify additive and proportional differences in observed and simulated means and variances; however, due to squared differences, it is extremely sensitive to extreme values (Legates and McCabe Jr 1999) given in Eq. 5.

$$d=1-{\left(\frac{\sum_{\textrm{i}=1}^{\textrm{n}}\left({\textrm{x}}_{\textrm{i}}-\overline{{\textrm{x}}_{\textrm{i}}}\right)}{\sum_{\textrm{i}=1}^{\textrm{n}}\left({\textrm{x}}_{\textrm{i}}-\overset{\acute{\mkern6mu}}{{\textrm{x}}_{\textrm{i}}}\right)+\left(\overline{{\textrm{x}}_{\textrm{i}}}-\overset{\acute{\mkern6mu}}{{\textrm{x}}_{\textrm{i}}}\right)}\right)}^20\le d\le 1$$

(5)

where ${x}_i\&\overline{x_i}$ are the i_th value of actual and imputed observations respectively and $\overset{\acute{\mkern6mu}}{x_i}$ is the mean of actual datasets.

Result and analysis

Initially, 38 monitoring stations and two pollutants per station were analysed. Descriptive statistics were applied on all the stations for PM_2.5 and PM₁₀, including mean, 25% percentile, median, 75% percentile, and standard deviation, as mentioned in Appendix 1. The statistical dispersion around the mean for particulate pollutants varies across all the stations, with the highest of all stations being 135.28 ± 108.83 for PM_2.5 and 308.78 ± 143.58 for PM₁₀. Next, to understand the complexity of time series data for NAs, it is necessary to know the statistics of missing components like the number of missing values, percentage of missing, number of gaps, average gap size, longest NA gap (series of consecutive NAs), and most frequent gap size (series of consecutive NA series). Appendix 2 provides detailed missing statistics information on the air pollutants dataset. A total of 1480 observations are considered for each station. The percentage of missing across 38 stations varied between 0 and 80%. The missing percent range, 0–80, is divided into four categories (0–20%, 20–40%, 40–60%, and 60–80%). Most of the stations lie between 0 and 20, missing percentage. Therefore, we further categorised 0–20% into four ranges 0–5%, 5–10%, 10–15%, and 15–20%. Furthermore, some stations were eliminated from the univariate time series experimental, with a missing percentage of < 1.0%. The largest missing gap size (NAs in rows) is 1016 days continuous, with an average gap size of 70.2 for PM_2.5 and PM₁₀. Furthermore, air pollutants data are omitted in a random manner because data may be missed due to a variety of explainable circumstances. For example, many air pollutants analysers require 1 to 2 h every 2 weeks to verify and analyse the air input flow. The unanticipated events, such as power supply, pump, and electronic processor failure, occur randomly, resulting in missingness, as shown in Figs. 2 and 3. The missing pattern is missing at random (MAR) as NAs data can be imputed based on observed data.

Univariate imputation techniques

Tables 1 and 2 show imputation results of different univariate (mean, median) and univariate time series methods (LOCF, spline, Kalman_arima, seadec) under various measurement indicators (MAE, RMSE, PBIAS, d, R²), low (0–20%) and higher missing (at 40%, 60%, and 80%) percentage, and air pollutants (PM_2.5, PM₁₀) through comparison of imputed value with actual observed value within each station. PBIAS measures the difference between observed and imputed mean concentrations of pollutants (Tables 1 and 2). The results estimated through univariate substitution methods that replace the missing values of air pollutant data with the mean and median of the single imputation value for each station show that the median imputation method consistently resulted in low PBIAS% across all levels of missingness compared to mean substitution for all sites. In the case of the mean imputation method, a decline of PBIAS% is observed across all stations with increasing missing values. Also, mean and median methods do not account for any temporal characteristic of PM_2.5 and PM₁₀ time series data. Thus, both method performance depends only on the percentage of missing, not merely on missing interval gaps (short gap/longer consecutive missing gap).

Table 1 Comparative evaluation of univariate time series imputations method using five different metrics for PM_2.5. The stations with highest missing percentage under each category are considered and for other stations results are in supplementary file (Appendix 3)

Full size table

Table 2 Comparative evaluation of univariate timeseries imputations method using five different metrics for PM10. The stations with highest missing percentage under each category are considered and for other stations results are in supplementary file (Appendix 4).

Full size table

For PM_2.5 low missing percentage, i.e. for 0–20%, the Kalman imputation yields absolute bias, i.e. the difference between actual and simulated 24-h mean by 1.87 μg/m³ (for station D_36 at 5 %missing), 4.43 μg/m³ (D_26 at 10% missing), 1.83 μg/m³ (D_13 at 15 % missing), 0.90 μg/m³(D_31 at 20% missing), and for higher missing percentage at 40% missing, this differences is around 3.64 μg/m³ (D_01), at 60% missing, 4.46 μg/m³ (D_06), and at 80% missing, imputed mean differed from observed mean by 67.39 μg/m³ (D_18). The biasness between the estimated and actual values increases with an increase in missing percentages. However, in the case of Kalman imputation, the biasness also depends on missing gap size. For instance, D_31 has a higher missing percentage than D_26. However, the missing interval in the case of D_31 has longer consecutive values (consecutive gap of 153 values) compared to D_26 (longest gap of 33 continuous missing values) refers to Appendix 2. The univariate time series Kalman_arima method accounted for the temporal characteristics of continuous readings of 4 years of data like autocorrelation and seasonal trends that potentially improved the performance of the Kalman method for long consecutive missing intervals. The method performed well for stations with long consecutive missing gaps, irrespective of the percentage of missing values.

Furthermore, compared with the mean and median, LOCF performs well for < 20% level of missing; however, for higher missing percentage, the difference between imputed and observed mean increases, and the performance decrease significantly with high values of MAE, RMSE, and PBIAS%. The NAs values filled with the spline interpolation method accurately stimulate the temporal trend of particulate pollutants at the low level of missing < 5%; however, with increasing missing percentage, the evaluation metric PBIAS% provides negative values indicating high bias between imputed values and corresponding actual values.

Correlation metric R² was used to estimate the relationship between daily observed and 24-h imputed concentrations of PM_2.5 and PM₁₀. A high R² value indicates the significant performance of imputation methods, and a low value signifies the weak relationship between observed and imputed values. Mean, median, LOCF, Kalman_arima, and sedec imputation method all produced higher R² values than the spline method, with Kalman producing the highest R² (Table 1). R² values for Kalman are around 0.99 at 0–5% missingness (D_15), 0.95 at 5–10% missingness (D_26), 0.99 at 10–15% missingness (D_13), and 0.94 at 15–20% missingness (D_31). At 40%, 60%, and 80% missingness, R² dropped significantly to around 0.85, 0.94, and 0.43, respectively for PM_2.5. Similar results can be observed for PM₁₀ for the Kalman method in Table 2. The performance of spline is significant only for ≤ 5% missing percentage.

The index of agreement (d) measures the prediction error that varies from 0 to 1. A value of 0 indicates no agreement between the simulated and observed values, and 1 indicates a perfect match. Spline results for both pollutants (Tables 1 and 2) show a significant drop in d values after 0–5 missing percentage categories. In some cases, a 0 value is observed for D_22 and D_05 (Table 1 and Appendix 3) and PM₁₀ stations D_05 and D_01 (Table 2 and Appendix 4). Mean, median, locf, and sedec all performed well except for 40–60% and 60–80%, missing the percentage category. Kalman gives the highest d values across all missing categories.

The error metric RMSE performed well for mean, median, locf, Kalman, and seadec for low missing percentage categories (0–5%, 5–10%). RMSE, like other error metrics, increased with the level of missingness (Table 1). At 10–15%, 15–20%, 20–40%, 40–60%, and 60–80% missing, Kalman and the median performed well; however, spline imputation produced the highest RMSE values for the high missing category (Table 1 and Appendix 3). In the case of PM₁₀, locf, median, and Kalman perform equally well for higher missing percentages (Table 2 and Appendix 4).

Another error metric is MAE which is less sensitive to outliers in predicted values. Similar to RMSE, the lower the value, the better the performance of the simulated method. At 0–8% missing percentage, spline performed moderately well; however, performance declined after. Mean method performance yield high MAE values as the missing percentage increases. Seadec provides fluctuating results across all levels of missing. Median, locf, and Kalman performed well across all levels of missing, with Kalman at 20–80% missing, performing best, yielding the lowest MAE values of 1.66 (D_31), at 15–20% missing, 3.64 (D_01) at 20–40% missing, 4.46 (D_06) at 40–60% missing, and 47.11 (D_18) for 60–80% missing (Table 1). Similar results for PM₁₀ are observed in Table 2.

Based on univariate time series results, the worst performing method was spline, which consistently produced high errors at low and high missing percentages except for values ≤ 5% missing level. Even in some cases, PBIAS gives negative values and MAE, and RMSE is 3–4 times higher than considered univariate methods (Tables 1 and 2). Across all levels of missingness, kalamn_arima provided the best estimates of 24-h mean PM_2.5 and PM₁₀ concentrations. At low levels of missingness (0–20%), the mean method performed well when evaluating the error metric for 1480 days. However, as missingness increases, performance becomes low. Median and locf imputation performed on average at higher levels of missingness (20–60%) (Tables 1 and 2). Henceforth, Kalman_arima is the better-performing imputation method for use with PM_2.5 and PM₁₀ data from all univariate methods across all metrics.

We further worked on different multiple imputation methods using another characteristic of time series data, i.e. inter-variable/cross-sectional (described in the “Evaluation of spatial characteristics” section and the “Multivariate imputation” section). First, the target stations were selected for both variables with the highest missing percentage under four categories at 20%, 40%, 60%, and 80%. Then, to fill in missing values in target stations, neighbouring stations were chosen based on spatial correlation values using the Pearson coefficient. Finally, after framing the data matrix of 1480*4, different MICE-based imputation methods were applied and evaluated based on five error metrics.

Evaluation of spatial characteristics

Selecting target station

Air pollutants possess strong spatial relationship, but such characteristic is scarcely used in imputing missing air pollutants data. A total of 38 monitoring stations deployed in Delhi can have highly stochastic spatial correlations among each other. To reconstruct incomplete data, the proposed study used spatial characteristics of all monitoring stations for two variables (PM_2.5 and PM₁₀) to select covariates and concurrently use these co-variables for multiple imputations to yield less biased estimates for high missing percentages (process mentioned in Fig. 4). The considered multivariate methods take advantage of the spatial relationship by using highly correlated sites as covariates to fill the missing values of target stations. There are two assumptions while selecting covariates, i.e. the combination of neighbouring stations has a significant correlation (> 0.50) with the target site and has a low missing percentage compared to the target stations to be filled. This pre-processing step is advantageous when the target station’s sensors fail to get data on air pollutants. Then the lost values can be estimated from the neighbouring station’s data. Even though the raw dataset has several pollutants, we consider two variables, PM_2.5 and PM₁₀, as target pollutants from a health perspective. The same pollutant PM_2.5/PM₁₀ is combined from 38 monitoring stations, and the correlation coefficient is calculated to examine the spatial characteristics of air contaminants.

The selection of the target station is based on the highest missing percentage from each category, i.e. 20%, 40%, 60%, and 80%. For instance, target stations selected for PM_2.5 based on the above criteria are D_31 (15.54%), D_01 (25.00%), D_06 (57.23%), and D_18 (72.64%) and similarly for PM10 D_15 (17.23%), D_01 (25.81), D_27 (59.93%), and D_18 (73.11%) under 20%, 40%, 60%, and 80% missing percent category as mentioned in Table 3. Furthermore, to fill in missing data in target stations, a combination of covariates neighbouring stations is chosen based on their coefficient correlations with the target station.

Table 3 Criteria for selection of target station to impute missing values

Full size table

Spatial correlation coefficient of pollution data to select covariates for MICE algorithm

To analyse the overall spatial correlation of air pollutants data, a length of 1480 individuals and 38 columns are considered, with a significance level (α) of 0.01, to compute Pearson coefficient ρ. Examining PM_2.5 and PM₁₀ target pollutants coefficient of correlation among air monitoring stations is an important task in cross-sectional imputation. First, the data of the same target pollutant (PM_2.5/PM₁₀) is combined for all stations into a single data matrix of m*n, where m = 1480 (no. of individuals) and n = 38 (representing columns as the number of monitoring stations) given in Fig. 4. Once the data is prepared in 1480*38 for a single pollutant further, correlation coefficients are calculated through Pearson correlation formulae for each pollutant as given in Eq. 6. Pearson correlation determines the linear relationship between a pair of monitoring stations for two time series for the same pollutant of the same timestamp.

$$r\left({D}^t,{D}^n\right)=\frac{\sum \left({D_i}^t-{\mu}_t\right)\sum \left({D_i}^n-{\mu}_n\right)}{\sqrt{\sum {\left(\left({D_i}^t-{\mu}_t\right)\left({D_i}^n-{\mu}_n\right)\right)}^2}}$$

(6)

Suppose the D^t and Dⁿ are two time series with sequences as (D₁^t, D₂^t, D₃^t⋯⋯⋯⋯D_i = 1480^t) and ( D₁ⁿ, D₂ⁿ, D₃ⁿ⋯⋯⋯⋯D_i = 1480ⁿ) of PM_2.5/PM₁₀ data for the target and neighbouring station of Delhi with the same time range (1 to 1480 days), respectively. Equation 6r(D^t, Dⁿ) indicates the correlation coefficient of the target and neighbouring stations’ time series for the same attribute. D_i^t and D_iⁿ represents i^th sampling points for D^t and Dⁿ, respectively. Further μ_t and μ_n denotes the mean of both time series D^t and Dⁿ, respectively, representing ${\mu}_t=\frac{1}{1480}\sum_{i=1}^{1480}{D_{1\dots .1480}}^t$ and ${\mu}_n=\frac{1}{1480}\sum_{i=1}^{1480}{D_{1\dots .1480}}^n$. The numerator represents the covariation of both time series D^t and Dⁿ from their mean values, and the denominator denotes the variance of D^t and Dⁿ. Correlation coefficient value range between −1 and 1. If the value of r = 1, for D^t and Dⁿ then stations are highly correlated, and if it is −1 then they are negatively correlated, and the 0 value of r represents no obvious linear relationship between the two stations.

Appendices 6 and 7 depict the target station’s correlation with different air monitoring stations for PM_2.5 and PM₁₀, respectively. While analysing the correlation matrix, the results show that target station particulate pollution significantly correlates with neighbouring stations. The target stations having Pearson coefficient value of > 0.5 Dⁿsuggests that particle pollution has significant spatial correlation and is little affected by meteorological and social activities. The correlation matrix for D^tand Dⁿ shows that particle pollution in Delhi is not uniformly dispersed as the spatial correlation between D^tand Dⁿ is independent of the geographical distance between the two as mentioned in Appendix 5. The obtained coefficient infers the strength of the relationship between D^t and Dⁿ for PM_2.5 and PM₁₀ (Appendix 6 and 7). The negative coefficient values in PM_2.5 and PM₁₀ datasets correspond to a negative relationship. Appendix 6, the r value among PM_2.5 monitoring stations range between −0.43 and 0.98. Pollutant PM_2.5 correlations are strong for target station D_06 with 31 neighbouring stations, except D_01, D_07, D_18, and D_31 have negative values. In contrast, the correlations with target stations D_31, D_01, and D_18 are weaker with most of the paired neighbouring stations except for three (presented in bold). Appendix 7 shows that the correlation coefficients between D^t and Dⁿ stations measuring PM₁₀, which range from −0.63 to 0.98, are more varied. Furthermore, based on the correlation coefficient analysis of each station, we draw different combinations of Dⁿ for each D^t ,as given in Table 4, which are eligible to deliver data to impute target sites. We carefully selected the three Dⁿ with the highest correlation coefficients (> 0.50) to D^t. Table 4 shows the selected neighbouring stations for PM_2.5 and PM₁₀, sorted from highest to lowest ρ values.

Table 4 Covariates selection for each target station based on highest correlation coefficient

Full size table

Selecting neighbouring station (covariates) to frame combination

After analysing the overall spatial correlation between the target and neighbouring stations (covariates) for the same pollutants, the next step is selecting neighbouring stations based on ρ values of paired stations (D^t and Dⁿ) and frame combinations. We selected the three highest ρ values between D^t and Dⁿ and built a data matrix of 1480*4 (one D^t and three Dⁿ) for multivariate imputation techniques. The selected neighbouring sites may differ for PM_2.5 and PM₁₀ as correlation coefficients vary, as given in Appendix 6 and 7. Based on Appendix 6 and 7 results, the number of neighbouring stations Dⁿ (as covariates) are selected as input for multivariate methods based on the MICE algorithm. Three neighbouring stations are selected for each target column (in total, four columns are used as input) D^t based on the first three highly correlated neighbouring sites with D^t as given in Table 4. The columns of the input set contain data from the target station as well as data from nearby stations. Before finalising three neighbouring stations as input in multiple imputation methods, we evaluated the imputation results for the varying number of neighbouring stations (2 to 5) for PM_2.5 and PM₁₀. It was analysed that performance is only significant if the number of neighbouring stations increased and decreased. However, it increases the computation load. Therefore, we use three neighbouring stations Dⁿ in addition to the target station D^t for multivariate imputation techniques.

Multivariate imputation

The data matrix obtained after spatial correlation analysis is 1480*4 (1480 represents daily concentration values, and four refers to one D^t and three Dⁿ columns) to impute target stations with the highest missing percentage from each category (20%, 40%, 60%, 80%) for PM_2.5 and PM₁₀ (target pollutants). The multivariate imputation methods like PMM, MIDAS, RF, and MIPCA are based on the MICE algorithm that uses target air pollutants data of Dⁿ, neighbouring sites, which are chosen based on correlation coefficient (> 0.5) to impute high missing percentage data of target stations. The highly correlated neighbouring sites (as covariates) are used to estimate high missing percentage data of (D^t). The missing percentage for covariates is fixed at < 20%. MICE-based methods provide multiple imputation values to missing data instead of just one, as in the case of the univariate method. Multiple imputation methods were used for each missing category, and evaluation metrics were used to compare the accuracy of the estimated results for PM2.5 and PM10 concentrations at each target station (Tables 5 and 6). One of the crucial features of MICE-based methods is it gives multiple imputed values for each target station missing values, as mentioned in Fig. 4 of step 6.

Table 5 Evaluation of multivariate imputation methods for PM_2.5 under four missing categories with highest missing percentage D^t

Full size table

Table 6 Performance comparison of four MICE based imputation methods for PM₁₀ four missing categories with highest missing percentage D^t

Full size table

PMM, MIDAS, RF, and MIPCA consistently resulted in the lowest PBIAS values across 0–20% and 20–40% levels of missingness; however, for PM10 at the same missing levels, the values of PBIAS were slightly higher. At 40–60% and 60–80%, the PBIAS increases as missingness increases. The multivariate method outperformed in terms of low PBIAS error compared to univariate methods across all missing levels except for kalman_arima in D_06 (Table 1). R² and d values are high at 20% and 40% missing levels for both variables, but the performance decreases as the missing percentage increases. The MIPCA method performed well for both variables. PMM, MIDAS, RF, and MIPCA showed good performance for both target pollutants PM_2.5 and PM₁₀ at 20% and 40%, with low error and high R². However, the performance of methods decreases as the level of missing increases.

From a spatial correlation perspective, PMM, MIDAS, RF, and MIPCA decrease substantially with a decrease in correlation (as given in Appendix 5). For all target stations (Tables 5 and 6), MIDAS and MIPCA imputation methods consistently produced the lowest MAE, RMSE, ad PBIAS error across all levels of missingness; however, the error metrics values increased at 60–80% missingness. MIPCA performed best in all methods for all high missing percentage target stations. PMM performed well for target stations with < 50% missing percentage (Table 5).

Statistical significance test (Wilcoxon Signed Rank Test): comparison of imputation methods

Since the Kolmogorov-Smirnov test revealed that many of our features originated from populations with non-normal distributions, the statistical Wilcoxon Signed Rank Test was used to demonstrate which method performed better (Cohen, Cohen et al. 2013). We have performed Wilcoxon signed rank test (one-tail) for comparison of univariate (mean, median, LOCF, spline, kalman_arima, seadec.) and multivariate (PMM, MIDAS, RF, PCA) methods with varying missing percentages at the significance level = 0.05. Based on the five evaluation metrics results, we have formulated two null hypotheses to evaluate the consistency of performance of Kalman and the MIPCA imputation method for long gaps and high missing percentages. Null hypothesis (H₀¹) is there is no significant difference in the performance of Kalman_arima imputation methods compared to five univariate methods (mean, median, LOCF, spline, seadec.). Null hypothesis (H₀²) is there is no significant difference in the performance of MIPCA compared to all nine considered methods (six univariate and three multivariate), especially for high missing percentages and long gaps. To determine the likelihood of Kalman and MIPCA imputation methods performance on PM_2.5 and PM₁₀ datasets and we compared nonparametric test results of target methods (Kalman and MIPCA) with different imputation techniques at the missing percentage of 20% (5%, 10%, 15%, 20%), 40%, 60%, and 80%. This resulted in a total of 142 pairwise test (70 for Kalman and 72 for MIPCA) is included in Table 7 and 8. In all tests, the pairs the sets are (K_i, K_j), where K_i is the target method (Kalman) and K_j (j = 1, 2, …., 5) are the rest five univariate methods. For multivariate method testing, the pairs are (L_i, L_j), where L_i is the target method (MIPCA) and L_j (j = 1, 2, 3, .…, 9) are nine univariate and multivariate methods. In Table 7, three conditions are considered (1) K_i > K_j (2) K_i ≅ K_j (3) K_i < K_j and similarly in Table 8, (1) L_i > L_j (2) L_i ≅ L_j (3) L_i < L_j, these conditions respectively denote the performance of target method (K_i and L_i is significantly better than K_jand L_j; (2) there is no significant difference between the performances of K_i and K_j, L_i and L_j; and (3) the performance of K_i and L_i is significantly less than K_i and Lj. Table 7 illustrates the results of Kalman imputation; it was found that at the low missing percentage (5%, 10%,15%, and 20%) out of 40 tests, Kalman is statistically significant for 14 cases and in 20 cases, it performed equally well as Wilcox test statistic closely match null hypothesis, and in 6 instances Kalman showed p-value > 0.05. From Table 7, it is observed that at 5% missing class, from 10 tests, Kalman performed statistically significant with p-value < 0.05 for 2 cases (p-value = 0.034, 0.028) and performed equally well for 6 cases and in 2 cases, it performed less than K_j (p-value = 0.425, 0.163). In 10% missing class, out of 10 tests, the low performance of Kalman remains constant (p-value = 0.855, 0.879, the number of times Kalman performs equally with others decreases to 5, while the number of wins over others increases to 3 (p-value = 0.002, 0.012, 0.020). At 15% missing level, win cases for Kalman are 4 (p-value = 0.001, 0.018, 0.027, 0.010), loss cases decrease to 1 (p-value = 0.171), and performed equally for 5 cases. At 20%, the number of cases the Kalman lost increased to 2 (p-value = 0.171, 0.228), equally performed with others decreased to 4, but the winning frequency remains the same as previous (p-value = 0.025, 0.013, 0.001, 0.001).

Table 7 Comparison of kalman with five univariate methods based on Wilcox signed rank test

Full size table

Table 8 Comparison of MIPCA with nine imputation methods based on Wilcox signed rank test

Full size table

For 40%, 60%, and 80% (high missing percentage with long gaps), the kalman method performed better with winning frequencies of 6, 8, and 9 at 40%, 60%, and 80%, respectively. On the contrary, Kalman shows insignificant results for 1, 0, 0 times (p-value= 0.866) at 40%, 60%, and 80%, respectively. Table 8, the multivariate method (MIPCA) considers four classes 20%, 40%, 60%, and 80%. A total of 72 tests were implemented; at 20% onwards, MIPCA performed significantly better than others for 10 cases, equally well for 6 cases, and in 2 cases it lost (p-value = 0.879, 0.855). At 40%, 60%, and 80% missing classes, the winning cases are 12, 14, and 18, respectively, performed equally well for 7, 4, and 0 cases, respectively, and found statistically insignificant results for 1 (p-value = 0.526), 0, and 0 at 40%, 60%, and 80%, respectively. To summarise, in the case of Kalman (Table 7), beyond 20%, the performance increases with a rising missing percentage. The overall performance of Kalman is 51.43%, significantly better than other univariate methods; in 37.14% of cases, it is comparable to the other five univariate methods and of 11.43% cases, it performed insignificant to others. Similarly, for multivariate (MIPCA), the overall performance is quite significant compared to all nine methods from 20% onwards with a 75% success rate, 23.61% cases performed equally with others, and < 1% cases performed worse than the nine considered methods.

Discussion

The missing data are commonly encountered in air pollution data due to calibration, repair of instruments, inconsistent power supply, and voltage fluctuation, which affects routine monitoring. Regular air monitoring is crucial to understand the impacts of policy changes and health impacts due to exposure and designing the daily alert system. In this context, we investigated approaches for dealing with missing data for low and high missing percentages. Among the six univariate methods evaluated across several metrics, the Kalman_arima method performed well across all missing levels except for 60–80%. Low missing percentage < 5% spline produced the best 24-h mean estimates with the lowest error and highest R² and d values. Mean, median, and locf imputation methods performed inconsistently for input time series in terms of percent missing. Seadec. It also provides uncertain results for stations with a high missing percentage. Kalman_arima imputation method performed exceptionally well for time series data for both variables with strong seasonality and trend. It may be a realistic option for imputing time-series data of this type. Effective univariate time series imputation algorithms utilised inter-time correlation characteristics. The performance of imputation is always highly dependent on the characteristics of the input time series, and therefore, for time series with strong seasonality, Kalman_arima performed well with statistically significant cases for about 51.43% cases, where p-value for higher missing percentage and long gaps were lower than 0.05 (alpha value).

The multivariate imputation methods PMM, MIDAS, RF, and MIPCA performed much better for target stations, with the highest missing percentage under 20% and 40% missing categories. At 60% and 80% performed of PMM, MIDAS, and RF significantly dropped. However, MIPCA performed highly significant than Kalman_arima at 60% and 80%. This is because multivariate imputation methods impute missing values from observed concentrations of neighbouring stations (covariates) with similar spatial characteristics. Comparison of MIPCA with nine imputation methods based on Wilcox signed rank test p-value frequency (α = 0.05), also suggest that above results based on evaluation metrics are significant. MIPCA performed highly statistically significant compared to other univariate and multivariate methods, as p-value were lower than < 0.05.

Interestingly, MIPCA performed well across all levels of missingness. One possible explanation for the superior performance of MIPCA imputation from univariate imputation is the spatial correlation characteristics of the monitoring station. The MICE algorithm captures the spatial characteristics of the neighbouring station’s time series data to fill target stations missing values. This would explain why univariate methods using partial data observed within each station (Inter-time correlations) yielded low estimates of PM concentrations, especially for high missing percentages, compared to multivariate methods using spatial characteristics (cross-sectional relationship) between target and neighbouring stations.

Relatively median showed good performance was an unexpected finding given that this method can yield biased estimates under MAR and is generally discouraged. One reason for mean imputation’s success could be that the partially observed data within each station is distributed positively skewed (Miettinen 2012, Junger and De Leon 2015, Hadeed et al. 2020). The mean imputation method performed well for low missing levels, but the results dropped as the missing percentage increased. Moreover, the method provides a constant value for a missing gap, which brings biasness in a time series with seasonal characteristics like an air pollution dataset. Therefore, this method is highly discouraged for the MAR dataset (Quinteros et al. 2019). In time series data, the first-order Markov model assumes a similar condition as the LOCF imputation method (Canales 2004). LOCF fills the missing values with t-1 observed values. LOCF performed well for short interval gaps; however, for long gap size, especially in the case of time series data having temporal characteristics like seasonality and trend, this method’s performance drops (Moritz and Bartz-Beielstein 2017). The performance of seadec. fluctuates and needs to consider the temporal component of the time series accurately as the level of missing increases. The method utilized partial data observed within each station to predict concentrations missing within the same station, which may be effective for monitoring stations located in areas completely different from one another.

Kalman_arima univariate time series approach can work efficiently for monitoring stations located in a resource-constrained location with one/ two stations. The temporal characteristics to impute values based on inter-time correlation, like seasonality, may explain this method’s success. In addition, the method considers some aspects of the structural nature of the time series and computes the probability of concentrations based on lagged observed values. In addition, to the inter-time correlation component of time series, the multivariate time series algorithms consider correlation among covariates. PMM performed better for target stations with < 50% missing percentage (Marshall et al. 2010a, b). MIDAS performed better than PMM as it selects donor values based on distance and is a higher variant of PMM. RF performed slightly better than PMM and MIDAS for 40–60% and 60–80%. The following high iteration number ensures the algorithm imputes 24-h PM_2.5 and PM₁₀ missing data with low error and high R² and d values.

One of the limitations of the proposed study is that it used the time series characteristics like inter-time correlation and inter-variable (cross-sectional) correlation separately for univariate time series and multivariate time series. The performance for a high missing percentage can be improved if both the characteristics, i.e. TS cross-sectional (inter-variable + inter-time) taken into account simultaneously. Also, univariate methods like mean, median, and locf fail to capture temporal events like trend and seasonality. This could have consequences in designing an alert system where specific tasks may be associated with high pollutant concentrations. Such a univariate imputation method may only account for these high pollutant tasks if the concentration of pollutants is observed. Imputed values are based on partial observations, which can result in under- or overestimation. Thus, considering this limitation of simple univariate methods, kalamn_arima appears to be a viable option for imputing missing data in samples and populations with high heterogeneity, and multivariate imputation method MIPCA used spatial characteristics of monitoring stations to give better results for high missing percentage target stations.

There are only a few comparable studies in the context of ambient air monitoring data over the long-time interval. The existing studies are tailored to impute missing data, which are frequently preceded and followed by observed data for short periods. However, in our proposed, there are long consecutive periods of missing data at a fixed time interval, and also, in some cases, the missing data have random start points in a column. Thus, this unique nature of missing data can explain the difference between our proposed study’s and previous studies’ imputation performance. The advantage of our study is its large sample size; however, it is limited to ambient particulate pollution in densely populated urbanised regions. Therefore, the results for other pollutants can vary depending on the contaminant, sampling duration, and co-variates selected.

For instance, (Hadeed et al. 2020) compared imputation methods for short-term monitoring of PM_2.5. Their study used datasets with varying degrees of missingness to assess the performance of mean, median, random, Markov, LOCF, Kalman, PMM, and RMM. The three univariate methods, random, Markov, and mean outperformed all missing levels. The multivariate method used predictor variables such as type of fuel, geographic location, relative humidity, and temperature to impute PM_2.5 missing concentration. However, PMM, and RMM performed worse as the covariates selected are heterogeneous. However, our study contains the following: Continuous data of 4 years which allows the seasonality and trend to be accounted for, potentially improving the univariate time series method Kalman_arima as it requires inter-time correlation to work efficiently. In addition, the study employed spatial correlation characteristics of multiple stations to select covariates to impute target stations missing concentration, and MIPCA performed well across all missing levels.

(Junger and De Leon 2015) explored missing levels ranging from 5 to 40% in their study. Twelve imputation methods, including univariate and multivariate methods, were evaluated (complete case analysis, mean, median, nearest neighbour, EM models, ARIMA, general additive models, and spline models). For 366 days, each day concentrations were collected from 10 sampling sites, and found concentrations were strongly correlated between stations. Even with low levels of missingness, conditional mean performed well, whereas unconditional imputation (median, mean) performed poorly. Multivariate methods such as conditional mean and EM-models performed remarkably well as the frequency and interval of missing data increased. Our study results were similar to those published by (Junger and De Leon 2015); however, in our study the distribution is skewed, the median performed better at low missing levels than mean. The improved performance is attributed to the inclusion of a longitudinal component and the high inter-site correlation among monitoring stations. The study of long-term monitoring data from multiple sites in our analysis could explain the commonalities in imputation performance. More complex multivariate imputation methods like PMM, MIDAS, RF, and MIPCA impute concentrations performed well based on information observed at other monitoring sites.

Conclusion

Imputation methods presented in this study yield reliable results for high missing concentrations and long consecutive gaps. The overall performance of kalamn_arima methods is influenced by missing concentration statistics like missing gap size and missing percentage. For a long consecutive missing gaps, Kalman_arima performs well irrespective of missing percentage, and long gaps specially for 1016 consecutive values and high missing percentages, MIPCA gives significant results. The relevant imputation technique for missing concentration should be chosen based on missing data statistics and variables temporal and spatial characteristics. The imputation methods can be useful in resource-constrained developing nations where still manual monitors are used, or a smaller number of continuous monitoring stations are deployed, power cuts, sensor failures, etc. However, there has been little work done in evaluating current methods for filling missing data for long consecutive gaps and high missing percentages. Our finding can help reconstruct long missing gaps and high missing percent data for deep learning prediction algorithms that requires continuous time series data without gaps for designing the alert system. Regardless of the finding, more studies are required to investigate and identify appropriate imputation technique that are pervasive to various settings.

Data availability

Air pollutants data is freely available on the website of Central Pollution Control Board, India. Any question/ inquires pertaining to the type of data or the attributes of datasets, can be directed to the corresponding author, upon reasonable request.

Code availability

NA

References

Abayomi K, Gelman A, Levy M (2008) Diagnostics for multivariate imputations. J R Stat Soc Ser C Appl Stat 57(3):273–291
Google Scholar
Agbailu AO, Seno A, Clement OO (2020) Kalman filter algorithm versus other methods of estimating missing values: time series evidence. Studies 4(2):1–9
Google Scholar
Allison P (2015) Imputation by predictive mean matching: promise & peril. Statistical Horizons
Google Scholar
Allison PD (2001) Missing data. Sage publications
Aslan S (2010) Comparison of missing value imputation methods for meteorological time series data. MS thesis, Middle East Technical University
Audigier V, Husson F, Josse J (2016) Multiple imputation for continuous variables using a Bayesian principal component analysis. J Stat Comput Simul 86(11):2140–2156
Google Scholar
Benavides IF, Santacruz M, Romero-Leiton JP, Barreto C, Selvaraj JJ (2022) Assessing methods for multiple imputation of systematic missing data in marine fisheries time series with a new validation algorithm. Aquac Fish J
Breiman L (2001) Random forests. Mach Learn 45:5–32
Google Scholar
Budhiraja B, Gawuc L, Agrawal G (2019) Seasonality of surface urban heat island in Delhi city region measured by local climate zones and conventional indicators. IEEE J Sel Top Appl Earth Obs Remote Sens 12(12):5223–5232
Google Scholar
Canales RA (2004) The cumulative and aggregate simulation of exposure framework. Stanford University
Google Scholar
Chan M (2015) Achieving a cleaner, more sustainable, and healthier future. The Lancet 386(10006):e27–e28
Google Scholar
Chatterji A (2021) Air pollution in delhi: filling the policy gaps. Massach Undergr J Econ 17
Cho B, Dayrit T, Gao Y, Wang Z, Hong T, Sim A, Wu K (2020) Effective missing value imputation methods for building monitoring data. In: 2020 IEEE International Conference on Big Data (Big Data). IEEE
Google Scholar
Cohen J, Cohen P, West SG, Aiken LS (2013) Applied multiple regression/correlation analysis for the behavioral sciences. Routledge
Google Scholar
Crawley MJ (2012) The R book. John Wiley & Sons
Doove LL, Van Buuren S, Dusseldorp E (2014) Recursive partitioning for missing data imputation in the presence of interaction effects. Comput Stat Data Anal 72:92–104
Google Scholar
Dray S, Josse J (2015) Principal component analysis with missing values: a comparative survey of methods. Plant Ecol 216(5):657–667
Google Scholar
Eekhout I, de Boer RM, Twisk JW, de Vet HC, Heymans MW (2012) Missing data: a systematic review of how they are reported and handled. Epidemiology 23(5):729–732
Google Scholar
Gaffert P, Meinfelder F, Bosch V (2018) Towards multiple-imputation-proper predictive mean matching. JSM:1026–1039
Ghazali SM, Shaadan N, Idrus Z (2020) Missing data exploration in air quality data set using R-package data visualisation tools. Bull Electr Eng Inform 9(2):755–763
Google Scholar
Gómez-Carracedo MP, Andrade J, López-Mahía P, Muniategui S, Prada D (2014) A practical comparison of single and multiple imputation methods to handle complex missing data in air quality datasets. Chemometr Intell Lab Syst 134:23–33
Google Scholar
Hadeed SJ, O’Rourke MK, Burgess JL, Harris RB, Canales RA (2020) Imputation methods for addressing missing data in short-term monitoring of air pollutants. Sci Total Environ 730:139140
CAS Google Scholar
Han H, Sun M, Han H, Wu X, Qiao J (2023) Univariate imputation method for recovering missing data in wastewater treatment process. Chin J Chem Eng 53:201–210
Google Scholar
Harvey AC (1990) Forecasting, structural time series models and the Kalman filter
Huisman M (2009) Imputation of missing network data: some simple procedures. J Soc Struct 10(1):1–29
Google Scholar
Iodice D’Enza A, Markos A, Palumbo F (2022) Chunk-wise regularised PCA-based imputation of missing data. Stat Methods Appt 31(2):365–386
Google Scholar
John C, Ekpenyong EJ, Nworu CC (2019) Imputation of missing values in economic and financial time series data using five principal component analysis approaches. CBN J Appl Stat (JAS) 10(1):3
Google Scholar
Josse J, Husson F (2009) Gestion des données manquantes en analyse en composantes principales. Journal de la société française de statistique 150(2):28–51
Josse J, Husson F (2011) Multiple imputation in principal component analysis. Adv Data Anal Classif 5(3):231–246
Google Scholar
Josse J, Husson F (2016) missMDA: a package for handling missing values in multivariate data analysis. J Stat Softw 70:1–31
Google Scholar
Junger W, De Leon AP (2015) Imputation of missing data in time series for air pollutants. Atmos Environ 102:96–104
CAS Google Scholar
Junior JRB, do Carmo Nicoletti M, Zhao L (2016) An embedded imputation method via attribute-based decision graphs. Expert Syst Appl 57:159–177
Junninen H, Niska H, Tuppurainen K, Ruuskanen J, Kolehmainen M (2004) Methods for imputation of missing values in air quality data sets. Atmos Environ 38(18):2895–2907
CAS Google Scholar
Kalman RE (1960) A new approach to linear filtering and prediction problems. Trans ASME J Basic Eng 82:35–45
Kleinke K (2018) Multiple imputation by predictive mean matching when sample size is small. Methodology: Euro J Res Methods Behav Res Methods 14(1):3
Google Scholar
Kumar P (2022) A critical evaluation of air quality index models (1960–2021). Environ Monit Assess 194(4):1–45
Google Scholar
Legates DR, McCabe GJ Jr (1999) Evaluating the use of “goodness-of-fit” measures in hydrologic and hydroclimatic model validation. Water Resour Res 35(1):233–241
Google Scholar
Li KH, Le ND, Sun L, Zidek JV (1999) Spatial–temporal models for ambient hourly PM10 in Vancouver. Environmetrics: the official journal of the Int Environ Sci 10(3):321–338
CAS Google Scholar
Little RJA, Rubin DB (2019) Statistical analysis with missing data, vol 793. John Wiley & Sons
Google Scholar
Little RJA, Rubin DB (2002) Single imputation methods. Statistical analysis with missing data. p 59–74. https://doi.org/10.1002/9781119013563.ch4
Liu X, Wang X, Zou L, Xia J, Pang W (2020) Spatial imputation for air pollutants data sets via low rank matrix completion algorithm. Environ Int 139:105713
CAS Google Scholar
Lloret J, Lleonart J, Solé I (2000) Time series modelling of landings in Northwest Mediterranean Sea. ICES Mar Sci Symp 57(1):171–184
Google Scholar
Marshall A, Altman DG, Holder RL (2010a) Comparison of imputation methods for handling missing covariate data when fitting a Cox proportional hazards model: a resampling study. BMC Med Res Methodol 10(1):1–10
Google Scholar
Marshall A, Altman DG, Royston P, Holder RL (2010b) Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study. BMC Med Res Methodol 10(1):1–16
Google Scholar
Miettinen OS (2012) Theoretical epidemiology: principles of occurrence research in medicine. Theoretical epidemiology: principles of occurrence research in medicine:359–359
Molenberghs G, Kenward M (2007) Missing data in clinical studies. John Wiley & Sons
Moriasi DN, Arnold JG, Van Liew MW, Bingner RL, Harmel RD, Veith TL (2007) Model evaluation guidelines for systematic quantification of accuracy in watershed simulations. Trans ASABE, Appl 50(3):885–900
Google Scholar
Moritz S, Bartz-Beielstein T (2017) ImputeTS: time series missing value imputation in R. R J 9(1):207
Google Scholar
Moritz S, Sardá A, Bartz-Beielstein T, Zaefferer M, Stork J (2015) Comparison of different methods for univariate time series imputation in R. arXiv preprint arXiv:1510.03924
Norazian MN, Shukri YA, Azam RN, Al Bakri AMM (2008) Estimation of missing values in air pollution data using single imputation techniques. SciAsia 34(3):341–345
CAS Google Scholar
Plaia A, Bondi A (2006) Single imputation method of missing values in environmental pollution data sets. Atmos Environ 40(38):7316–7330
CAS Google Scholar
Quinteros ME, Lu S, Blazquez C, Cárdenas-R JP, Ossa X, Delgado-Saborit J-M, Harrison RM, Ruiz-Rudolph P (2019) Use of data imputation tools to reconstruct incomplete air quality datasets: a case-study in Temuco, Chile. Atmos Environ 200:40–49
CAS Google Scholar
Ramli MN, Yahaya A, Ramli N, Yusof N, Abdullah M (2013) Roles of imputation methods for filling the missing values: a review. Adv Environ Biol 7(12 S2):3861–3870
Google Scholar
Raymond MR (1986) Missing data in evaluation research. Eval Health Prof 9(4):395–420
Google Scholar
Rubin DB (1976) Inference and missing data. Biometrika 63(3):581–592
Google Scholar
Schafer JL (1997) Analysis of incomplete multivariate data. CRC press
Google Scholar
Schenker N, Taylor JM (1996) Partially parametric techniques for multiple imputation. Comput Stat Data Anal 22(4):425–446
Google Scholar
Siddique J, Belin TR (2008) Multiple imputation using an iterative hot-deck with distance-based donor selection. Stat Med 27(1):83–102
Google Scholar
Siddique J, Harel O (2009) MIDAS: a SAS macro for multiple imputation using distance-aided selection of donors. J Stat Softw 29:1–18
Google Scholar
Stekhoven DJ, Stekhoven MDJ (2013) Package ‘missForest’. R package version 1
Sukatis FF, Noor NM, Zakaria NA, Ul-Saufie AZ, Annas S (2019) Estimation of missing values in air pollution dataset by using various imputation methods. Int J Conserv Sci 10(4):791–804
CAS Google Scholar
Tsikriktsis N (2005) A review of techniques for treating missing data in OM survey research. J Oper Manag 24(1):53–62
Google Scholar
Van Buuren S (2018) Flexible imputation of missing data. CRC press
Google Scholar
Van Buuren S, Groothuis-Oudshoorn K (2011) Mice: multivariate imputation by chained equations in R. J Stat Softw 45:1–67
Google Scholar
Wardana I, Gardner JW, Fahmy SA (2022) Estimation of missing air pollutant data using a spatiotemporal convolutional autoencoder. Neural Comput Appl:1–26
Weerakody PB, Wong KW, Wang G, Ela W (2021) A review of irregular time series data handling with gated recurrent neural networks. Neurocomputing 441:161–178
Google Scholar
Welch G (2006) An Introduction to the Kalman Filter. Univ. of North Carolina http://www.cs.unc.edu/~welch/media/pdf/kalman_intro.pdf. Accessed 10 Oct 2022
Wijesekara W, Liyanage L (2020) Comparison of imputation methods for missing values in air pollution data: case study on Sydney air quality index. In: Future of Information and Communication Conference. Springer
Google Scholar
Willmott CJ (1981) On the validation of models. Phys Geogr 2:184–194
Willmott CJ, Matsuura K (2005) Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate Res 30(1):79–82
Google Scholar
Willmott CJ, Matsuura K (2006) On the use of dimensioned measures of error to evaluate the performance of spatial interpolators. Int J Geogr Inf Sci 20(1):89–102
Google Scholar
World Health Organization (2016) Ambient air pollution: A global assessment of exposure and burden of disease
Zeileis A, Grothendieck G (2005) zoo: S3 Infrastructure for Regular and Irregular Time Series. J Stat Softw 14(6):1–27
Google Scholar

Download references

Acknowledgements

The authors are thankful to CSIR-CSIO Chandigarh for providing the necessary infrastructure and support to carry out our work and especially thanks to the Central Pollution Control Board for making air pollutants data openly available to their official website.

Author information

Authors and Affiliations

Academy of Scientific & Innovative Research (AcSIR), Ghaziabad, 201002, India
Priti K, Kaushlesh Singh Shakya & Prashant Kumar
CSIR-Central Scientific Instruments Organisation, Sector 30-C, Chandigarh, 160030, India
Priti K, Kaushlesh Singh Shakya & Prashant Kumar

Authors

Priti K
View author publications
You can also search for this author in PubMed Google Scholar
Kaushlesh Singh Shakya
View author publications
You can also search for this author in PubMed Google Scholar
Prashant Kumar
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

All authors have contributed substantially to the following: Priti K performed writing—original draft, conceptualization, data curation, formal analysis, and methodology. Kaushlesh Singh Shakya performed data curation, formal analysis, and visualization. Prashant kumar: supervision, validation, and review and editing.

Corresponding author

Correspondence to Prashant Kumar.

Ethics declarations

Ethics approval

NA

Consent to participate

NA

Consent for publication

NA

Conflict of interest

The authors declare no competing interests.

Additional information

Responsible Editor: Marcus Schulz

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

ESM 1

(DOCX 68.8 kb)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

K, P., Shakya, K.S. & Kumar, P. Selection of statistical technique for imputation of single site-univariate and multisite–multivariate methods for particulate pollutants time series data with long gaps and high missing percentage. Environ Sci Pollut Res 30, 75469–75488 (2023). https://doi.org/10.1007/s11356-023-27659-x

Download citation

Received: 02 January 2023
Accepted: 11 May 2023
Published: 23 May 2023
Issue Date: June 2023
DOI: https://doi.org/10.1007/s11356-023-27659-x

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Selection of statistical technique for imputation of single site-univariate and multisite–multivariate methods for particulate pollutants time series data with long gaps and high missing percentage

Abstract

Similar content being viewed by others

Comparing Spatial and Spatio-temporal FPCA to Impute Large Continuous Gaps in Space

Spatiotemporal Exposure Prediction with Penalized Regression

EDA and a Tailored Data Imputation Algorithm for Daily Ozone Concentrations

Explore related subjects

Introduction

Methodology

Study area

Data information

Descriptive statistics of all monitoring stations

Missing data mechanism

Missing data statistics

Imputation methods

Univariate time series methods (inter-time series)

Multivariate timeseries methods (cross-sectional inter-variables)

R packages Functions used for missing values imputation

Methods for evaluation metric

Result and analysis

Univariate imputation techniques

Evaluation of spatial characteristics

Selecting target station

Spatial correlation coefficient of pollution data to select covariates for MICE algorithm

Selecting neighbouring station (covariates) to frame combination

Multivariate imputation

Statistical significance test (Wilcoxon Signed Rank Test): comparison of imputation methods

Discussion

Conclusion

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval

Consent to participate

Consent for publication

Conflict of interest

Additional information

Publisher’s note

Supplementary information

ESM 1

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation