Abstract
Missing values in daily rainfall data can hamper analyses aimed at solving hydrological, agricultural, and climatological problems. This study attempts to select an appropriate method for estimating missing values in the daily rainfall data of Bangladesh. For this purpose, eight estimation methods and seven comparison techniques are employed. For each of five target stations, one in each region of the country, 1, 5, and 10% of the non-missing daily observations are selected at random and artificially set as missing, and the procedure is repeated 1000 times. The artificially missing values are then imputed by each of the selected methods, and the relative performance of the methods is examined using the comparison criteria. The results indicate that the arithmetic average method is the best method for the rainfall stations Chittagong and Rajshahi, in the south-east region and the north-west region, respectively. Further, the single best estimator method is the best method for the rainfall stations Sylhet and Dhaka, in the north-east region and the mid-region, respectively, and the EM-MCMC method is the best method for the rainfall station Khulna in the south-west region, as judged by the Kolmogorov-Smirnov test, the lowest bias of estimate, the value of the S index, etc.
1 Introduction
Rainfall is an important factor in hydrological studies. Rainfall occurrences provide the input to crop growth and production models. They also indicate the condition of landfills, tailing dams, and land disposal sites for liquid waste materials, which are environmentally sensitive for a region or for a country as a whole. Generally, rainfall is measured on a daily time scale and may then be aggregated into monthly or annual series. The analysis of rainfall therefore plays a significant role in agricultural, ecological, and climatological studies (Asati 2012; Williams 1998; Cong and Brady 2012; Silva et al. 2007). Besides, rainfall is a highly influential factor in flood formation. Rainfall data analysis is often hampered by gaps in otherwise consecutive records (Silva et al. 2007; Simolo et al. 2010). Missing values in rainfall data are a common problem for data analysis in many countries. Rainfall data may be missing for various reasons, such as loss of yearbooks, human errors, wars, fire accidents, occasional interruptions of automatic stations, instrument malfunctions, and network reorganizations (Simolo et al. 2010). A similar situation is observed in Bangladesh.
For an effective analysis of rainfall, it is essential to estimate the missing values of daily rainfall data. For this purpose, different authors have suggested suitable methods for specific countries or regions by comparing several missing data estimation methods with a range of comparison techniques. Such region-specific comparisons are needed because the performance of any method for estimating missing values generally depends on the nature of the missing data mechanism, the pattern of consecutive rainfall occurrences, the characteristics of neighboring stations, other intrinsic characteristics of the climate variables, etc. (Little and Rubin 1987).
To estimate the missing values of daily rainfall data, Silva et al. (2007) and Suhaila et al. (2008) compared different methods, such as inverse distance, normal ratio, arithmetic mean, aerial precipitation ratio, inverse weighting distance, and the correlation coefficient method, for Sri Lanka and Malaysia, respectively, following the suggestions of Simanton and Osborn (1980), Tabios and Salas (1985), Young (1992), Hubbard (1994), Lennon and Turner (1995), Tang et al. (1996), Xia et al. (1999), Eischeid et al. (2000), Teegavarapu and Chandramouli (2005), Ahrens (2006), Garcia et al. (2008), and Chen and Liu (2012). To compare these methods, they used criteria such as the similarity index (S index), mean absolute error (MAE), and coefficient of correlation (R).
Further, Lo Presti et al. (2010) identified the Theil method as the best among the regression-based methods (simple substitution, parametric regression, ranked regression, and Theil method) for estimating the missing values of daily rainfall data of the Candelaro River Basin, Italy. Besides, Coulibaly and Evora (2007) suggested artificial neural network (ANN) algorithms for imputing missing daily precipitation. These algorithms are adapted from a weighted interpolation technique that uses adjacent stations. Yozgatligil et al. (2013) suggested the Monte Carlo Markov Chain based on expectation-maximization (EM-MCMC) algorithm as the best technique for estimating missing values in Turkish meteorological data. These studies indicate that different techniques are appropriate for different stations or regions. Therefore, to analyze the daily rainfall data of the rainfall stations of Bangladesh, a suitable missing value estimation technique must be identified for each station or region.
Bangladesh is an agro-based country. Around 50% of the country’s labor force is engaged in agriculture. The sector's contribution to the gross domestic product (GDP) was 15.33%, with overall growth of 7.05%, in FY 2015–2016 (Bangladesh Economic Review 2016). This suggests that the analysis of daily rainfall data has a significant role in the development of the agricultural sector. To analyze the daily rainfall data of Bangladesh, several authors have applied simple techniques for handling missing data, such as omitting the missing data or replacing the missing values in a month by the average value of the same month from previous and subsequent years (Kripalani et al. 1996). However, no work has been done to date to identify the best method for estimating the missing values of daily rainfall data for different stations in Bangladesh. Therefore, this study compares several missing value estimation methods and suggests a suitable method for estimating the missing values of daily rainfall data for different rainfall stations of Bangladesh.
Following this section, the study is organized as below. The daily rainfall data and the behavior of daily rainfall missing data are discussed in Sect. 2. The different methods and their comparison techniques to identify the best method for estimating the missing value of daily rainfall data for target stations are also discussed in the same section. The discussions regarding the results obtained by applying the selected methods and comparison techniques are depicted in Sect. 3. Finally, the conclusions of the study are drawn in Sect. 4.
2 Data and methods
2.1 Data
To meet the above objective, this study considers 27 of the 35 daily rainfall recording stations of the Bangladesh Meteorological Department. Daily rainfall is measured in millimeters. These stations record daily rainfall on consecutive days. We consider five climatic sub-zones of Bangladesh according to geographical condition: the south-east, north-east, mid, south-west, and north-west regions (Rashid 1991). From each climatic sub-zone, one station is taken as the target station and the stations within 100 km of it are taken as reference stations (Tronci et al. 1986). The daily rainfall measuring stations in each climatic sub-zone, the target and reference stations of each sub-zone, the availability of rainfall data for each station, and the geographical condition of each station are shown in Table 2.
2.1.1 Overview of missing data for selected stations
Each of the selected rainfall stations contains some missing data. The percentages of missing data over the available years at the selected stations are displayed in Fig. 1. The percentage of missing observations in the 27 stations varies from 1.9% in Ambagan to 6.6% in Hatiya. The study considers different methods for estimating the missing values of daily rainfall observations, together with different techniques for comparing these methods, to identify the best method for each of the selected stations.
2.1.2 Missing data mechanism
The problems of missing data may arise due to different observational behaviors. In probabilistic terms, the patterns of missing data may be classified into three mechanisms: missing at random (MAR), missing completely at random (MCAR), and not missing at random (NMAR) (Rubin 1976; Schafer 1997; Little and Rubin 1987). Under MAR, the probability that a daily rainfall observation is missing depends on the observed data but not on the missing values themselves. Under MCAR, this probability depends neither on the missing values nor on the observed data; i.e., MCAR is a special case of MAR. Under NMAR, the probability that a daily rainfall observation is missing depends on the value of the missing observation itself.
To deal with these patterns of missing data, different authors (Dempster et al. 1977; Little and Rubin 1987; Rubin 1987; Schafer 1997; Collins et al. 2001; Graham et al. 1997) have suggested techniques such as the maximum likelihood (ML) estimation method and the multiple imputation (MI) method under the expectation-maximization (EM) algorithm based on a Bayesian framework, following the indication of Rubin (1976). Formulating a statistical model with NMAR data creates several complexities: the missing data model may not be correctly specified, the estimated parameters may contain sizable bias, etc. Therefore, to test for the NMAR mechanism in missing daily rainfall observations, Lo Presti et al. (2010) suggested verifying the following statements:
(i) The existence of a positive correlation between the missing data (yearly percentage of days with missing data in each station) and the elevation of the stations, and
(ii) The amount of missing data is affected by evident seasonal behavior; for instance, monsoon and autumn seasons are more rainy than summer, late autumn, and winter seasons.
To verify statement (i), the study computed the correlation coefficient between the proportion of missing data and the elevation of the corresponding station, which is found to be negative (r = −0.133 with p value 0.507). Although the correlation is not significant, its negative sign implies a non-positive correlation between elevation and the proportion of missing data at the corresponding stations. The result indicates that the missing daily rainfall observations for different stations of Bangladesh do not follow the NMAR mechanism. Further, to verify statement (ii), Lo Presti et al. (2010) suggested the standardized entropy (H), which is stated as below:
where p(k) is the proportion of missing observations to the total number of rainfall observations for a station in the kth month during the study period and ln k indicates the upper bound corresponding to the number of measurement months. A value of the standardized entropy (H) close to 1 indicates that the missing data distribution over the study period is not affected by seasonal behavior; i.e., the NMAR hypothesis may be rejected. For instance, Table 1 shows the standardized entropy for the target station of the south-east region of Bangladesh and its reference stations. The standardized entropy is close to 1 for the selected stations of the south-east region, which indicates that the distribution of missing rainfall observations does not follow NMAR (Table 1). Similar results are also observed for the other regions of the country.
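As a minimal sketch of this seasonality check, the snippet below computes a standardized entropy from monthly counts of missing days. It assumes that the monthly shares p(k) are normalized to sum to 1 and that the normalizing constant is the logarithm of the number of months (12 in the example); the function name and the example counts are illustrative and are not taken from the study.

```python
import math

def standardized_entropy(missing_per_month):
    """Shannon entropy of the monthly missing-data shares, scaled to [0, 1]."""
    total = sum(missing_per_month)
    k = len(missing_per_month)
    if total == 0 or k < 2:
        raise ValueError("need at least two months and one missing value")
    shares = [count / total for count in missing_per_month]
    # Months with no missing values contribute nothing to the entropy
    entropy = -sum(p * math.log(p) for p in shares if p > 0)
    return entropy / math.log(k)

# Illustrative counts only: evenly spread versus concentrated in one month
even = standardized_entropy([10] * 12)
peaked = standardized_entropy([120] + [0] * 11)
```

Under these assumptions, missing days spread evenly over the months give a value of 1, while missing days concentrated in a single month give a value of 0.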
Further, Lo Presti et al. (2010) indicated that assessing MCAR for missing rainfall observations always depends on the efficient measurement of the rainfall amount. For the measurement of rainfall, Rubel and Hantel (1999) identified three leading sources of error: (i) wind-induced losses, (ii) wetting of the walls and evaporation from the tipping bucket, and (iii) instrumental accuracy and precision, which lead to underestimation of the actual rainfall amount. Apart from these, several secondary sources of error affect the measurement of rainfall, such as splash in, splash out, wind shield, and temperature (Lo Presti et al. 2010; Goodison et al. 1998).
The Bangladesh Meteorological Department (BMD) measures rainfall at each station using natural siphon rainfall recorders and Snowdon rain gauges (Chowdhury 2013). This technique is currently popular for the efficient measurement of rainfall; however, errors may still arise from the influence of other variables, instrumental failure, weak efficiency, and the precision of the technician. Considering these arguments, the MCAR mechanism may not be appropriate for the missing rainfall data of the country. Besides, Rubin (1976) and Scheffer (2002) stated that missing rainfall observations very rarely follow MCAR. The rejection of the MCAR hypothesis thus leads us to consider the MAR mechanism for the missing rainfall observations in Bangladesh.
2.2 Methods
Several methods employed by earlier authors to estimate the missing values of daily rainfall observations are discussed in Sect. 1. The present study employs eight methods for estimating the missing values and compares them using several comparison measures. For this purpose, daily rainfall data from 2011 to 2014 (total number of days, n = 1461) are considered for each of the five target stations. From each target station, 1% (sample size, n = 14), 5% (sample size, n = 73), and 10% (sample size, n = 146) of the non-missing observations are chosen randomly and artificially set as missing. The actual values on those days are treated as the observed values. Thereafter, the different methods for estimating missing values are applied and compared to identify the suitable method for each target station. The random sample selection, the estimation, and the comparison are repeated 1000 times. Finally, the arithmetic mean of each comparison measure over the 1000 repetitions is used for the final decision on the best missing value estimation technique.
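The resampling design described above can be sketched as follows. This is an illustration under stated assumptions rather than the study's own code: the sample size is taken as the truncated product of the number of days and the missing fraction (which reproduces n = 14, 73, and 146 for 1461 days), and `estimate` and `criterion` are hypothetical callables standing in for any estimation method and any comparison measure.

```python
import random

def sample_size(n_days, fraction):
    # Truncation reproduces the sample sizes quoted in the text (assumption)
    return int(n_days * fraction)

def mask_random_days(series, fraction, rng):
    """Hide a random subset of observed days; return (masked, held_out)."""
    observed_idx = [i for i, v in enumerate(series) if v is not None]
    chosen = rng.sample(observed_idx, sample_size(len(series), fraction))
    masked = list(series)
    held_out = {}
    for i in chosen:
        held_out[i] = series[i]   # keep the true value for later comparison
        masked[i] = None          # artificially set the day as missing
    return masked, held_out

def average_criterion(series, fraction, estimate, criterion,
                      repetitions=1000, seed=0):
    """Mean of a comparison measure over repeated random maskings."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repetitions):
        masked, held_out = mask_random_days(series, fraction, rng)
        days = sorted(held_out)
        obs = [held_out[i] for i in days]
        est = [estimate(masked, i) for i in days]
        total += criterion(obs, est)
    return total / repetitions

# Illustrative series of 1461 days with a constant value
masked, held_out = mask_random_days([1.0] * 1461, 0.05, random.Random(0))
```

In this sketch, `estimate(masked, i)` returns the imputed value for day `i` and `criterion(obs, est)` returns one number per repetition, so the same loop serves every method and every comparison measure.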
2.2.1 Methods of missing value estimation for daily rainfall data
The methods employed in the study for estimating the missing values of daily rainfall data are discussed in this section. Let Ymi denote the missing value on the mth day at the ith target station in the study period (2011–2014), which is to be estimated, and let Ymj denote the rainfall amount on the mth day at the jth reference station, where i = 1, 2, 3, …, n and j = 1, 2, 3, …, k.
Arithmetic average (AA) method
This method is commonly used to estimate the missing values of daily rainfall observations (Silva et al. 2007; Xia et al. 1999; Yozgatligil et al. 2013). In this method, missing values are estimated by the arithmetic average of the concurrent observations of the reference stations which have similar features to the target station (Paulhus and Kohler 1952). The arithmetic average for estimating the missing value of the mth day at the ith target station is given by
\( {\widehat{Y}}_{mi}=\frac{1}{k}\sum_{j=1}^k{Y}_{mj} \)
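A minimal sketch of this estimator is given below, assuming the estimate is the unweighted mean of the concurrent reference-station values. Skipping reference stations that are themselves missing on that day is an added assumption, and the example values are illustrative.

```python
def arithmetic_average(reference_values):
    """Mean of the concurrent reference-station values that are available."""
    # Assumption: reference stations with no reading (None) are skipped
    available = [v for v in reference_values if v is not None]
    if not available:
        raise ValueError("no reference station reported on this day")
    return sum(available) / len(available)

# Three reporting reference stations and one missing station (made-up values, mm)
estimate = arithmetic_average([12.0, 0.0, 7.5, None])  # 6.5
```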
Normal ratio (NR) method
Paulhus and Kohler (1952) proposed the method for spatial interpolation using weights, Wi. Afterwards, several authors used the method for imputing the missing value of daily rainfall data. The weights are estimated by the ratio of total annual rainfall amount for target station, Ti, to the total annual rainfall amount for each reference station, Tj. Then, the NR method is explained as (Yozgatligil et al. 2013)
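As a hedged illustration, the sketch below assumes that the NR estimate is the average of the concurrent reference-station values, each rescaled by the ratio Ti/Tj of annual totals. The averaging over the k reference stations is an assumption about the form of the estimator, and the function name and numbers are illustrative.

```python
def normal_ratio(target_annual_total, reference_annual_totals, reference_values):
    """Average of the reference values, each rescaled by T_i / T_j."""
    if len(reference_annual_totals) != len(reference_values):
        raise ValueError("one annual total is needed per reference station")
    scaled = [
        (target_annual_total / t_j) * y_mj
        for t_j, y_mj in zip(reference_annual_totals, reference_values)
    ]
    return sum(scaled) / len(scaled)

# Made-up annual totals (mm) and daily values (mm) for two reference stations
estimate = normal_ratio(2000.0, [1000.0, 4000.0], [5.0, 20.0])  # 10.0
```

A reference station that is wetter than the target over the year is scaled down, and a drier one is scaled up, before the values are averaged.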
Normal ratio method considering the weight as correlation function (NRWC)
Young (1992) modified the NR method by taking the weight as a function of correlation, instead of the ratio of the annual rainfall amount of the target station to that of the reference station, for the selected period in which the missing value exists. To formulate the method, the weight is defined as
where rij is the correlation coefficient between the ith target station and jth reference station and nij is the number of rainfall observations for measuring correlation coefficient. Then, the NRWC is defined as
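A minimal sketch of this estimator follows. It assumes the commonly quoted form of the correlation-based weight, Wj = rij²(nij − 2)/(1 − rij²), and an estimate equal to the weighted sum of the concurrent reference values divided by the sum of the weights; both forms are assumptions here, and the names and numbers are illustrative.

```python
def nrwc_weight(r_ij, n_ij):
    """Assumed weight r^2 (n - 2) / (1 - r^2) for one reference station."""
    if not -1.0 < r_ij < 1.0:
        raise ValueError("correlation must lie strictly between -1 and 1")
    return (r_ij ** 2) * (n_ij - 2) / (1.0 - r_ij ** 2)

def nrwc_estimate(correlations, counts, reference_values):
    """Weighted mean of reference values using the correlation-based weights."""
    weights = [nrwc_weight(r, n) for r, n in zip(correlations, counts)]
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights are zero")
    return sum(w * y for w, y in zip(weights, reference_values)) / total

# Two reference stations with made-up correlations and sample sizes
estimate = nrwc_estimate([0.9, 0.5], [1000, 1000], [10.0, 2.0])
```

With these weights, a reference station that is strongly correlated with the target dominates the estimate, so the value above lies much closer to 10.0 than to 2.0.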
Inverse distance (ID) method
Shepard (1968) suggested the method for analyzing two-dimensional interpolation functions for irregularly spaced data. Then, various authors used this method for estimating the missing value of daily rainfall observations (Lam 1983; Tronci et al. 1986; Hubbard 1994; Xia et al. 1999; Eischeid et al. 2000). The method is explained as the weighted interpolation technique which is defined as
where weight, wij is explained as:
Here, p indicates the exponent of the inverse distance and dij indicates the distance of the neighboring jth reference station from the ith target station. To calculate the distance dij, the latitude and longitude values of the respective stations are converted into decimal degrees. The distance from the ith to the jth station is then computed using the Great Circle Calculator of the National Hurricane Center of the USA (National Hurricane Center of USA n.d).
The method estimates the missing observations of the meteorological or hydrological variable of interest by assigning more weight to closer points. That is, the weight decreases as the distance from the interpolated point increases. A higher value of the exponent p gives closer values a greater influence on the interpolated point (Suhaila et al. 2008). Xia et al. (1999) indicated that the usual value of p ranges from 1.0 to 6.0 and that this value is generally taken as 2. Thus, the study takes the value of p as 2.
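The sketch below illustrates the two ingredients of this method under stated assumptions. The haversine formula with a mean Earth radius of 6371 km is used as a stand-in for the online great-circle calculator, and the weights are assumed to be the inverse distances raised to the power p, normalized by their sum. The function names and example values are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def inverse_distance(distances_km, reference_values, p=2.0):
    """Weighted mean with weights proportional to distance ** -p."""
    if any(d <= 0 for d in distances_km):
        raise ValueError("distances must be positive")
    weights = [d ** -p for d in distances_km]
    weighted = sum(w * y for w, y in zip(weights, reference_values))
    return weighted / sum(weights)

# Two reference stations at made-up distances of 10 km and 20 km
estimate = inverse_distance([10.0, 20.0], [4.0, 8.0])  # 4.8 with p = 2
```

With p = 2, the station at 10 km receives four times the weight of the station at 20 km, which pulls the estimate towards the nearer value.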
Multiple imputation using EM-MCMC method
The multiple imputation method was developed by Rubin (1976, 1978) to account for the uncertainty of missing value estimates, which arises from the insufficient measurement of sampling variability. In this method, the missing values are imputed several times from an appropriate model whose parameters are estimated, so that the random variation across the imputations is incorporated, and the multiple imputed values are then averaged. To interpolate the missing data, the Monte Carlo Markov chain method-based expectation-maximization (EM-MCMC) algorithm is employed as the multiple imputation method on the basis of a Bayesian sampling procedure (Tanner and Wong 1987; Schafer 1997). The method treats the missing data according to the proportional information of the sample to estimate the parameter of interest through conditional expectations. Therefore, the EM algorithm provides an estimate of the parameters, and the imputations are generated iteratively using the MCMC procedure (Yozgatligil et al. 2013).
The daily rainfall data contain two types of observations (non-missing and missing values), written as Y = (Yoi, Ymi). Here, Yoi and Ymi indicate the non-missing value and the missing value of rainfall data, respectively, on the ith day. To perform multiple imputation using the EM-MCMC algorithm in the Bayesian framework, the unknown θ and Ymi are treated as random variables for statistical inference on the parameter θ (Schafer 1997). Then, the posterior predictive distribution is stated as
\( f\left({Y}_{mi}|{Y}_{oi}\right)=\int f\left({Y}_{mi}|{Y}_{oi},\theta \right)f\left(\theta |{Y}_{oi}\right) d\theta \)
where the functions f(Ymi| Yoi, θ) and f(θ| Yoi) indicate the conditional predictive distribution of Ymi and the posterior distribution of θ given the non-missing rainfall observations, respectively. The posterior distribution, f(θ| Yoi), is determined by augmenting Yoi with an assumed value of Ymi, which is carried out by a two-step procedure (Yozgatligil et al. 2013). The first step is to impute the missing value, Ymi, from the conditional predictive distribution, f(Ymi| Yoi, θ), in the kth step, i.e.,
The second step provides the new value of θ from the posterior distribution of non-missing data given the missing data.
These two steps are repeated through the iteration process starting with the initial value θ(0), and the process yields a Markov chain, i.e., \( \left({Y}_{mi}^{(1)},{\theta}^{(1)}\right) \), \( \left({Y}_{mi}^{(2)},{\theta}^{(2)}\right) \), \( \left({Y}_{mi}^{(3)},{\theta}^{(3)}\right) \), and so on.
The distribution of these transition counts of the Markov chain provides the joint conditional distribution, f(Ymi, θ| Yoi). If the value of parameter θ(k) satisfies the convergence of distribution, then the posterior distribution, f(θ| Yoi) , is drawn from non-missing data using this value of the parameter. Then, from the posterior predictive distribution, f(Ymi| Yoi), the \( {Y}_{mi}^{(k)} \) is considered as an appropriate selection. This method is perfectly valid, provided that the missing data of rainfall observations do not follow the NMAR mechanism (Scheffer 2002).
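The alternation between an imputation step and a posterior step can be illustrated with a deliberately simplified sketch. It is not the PROC MI procedure and not the multivariate model with reference stations as covariates: it assumes a single normally distributed variable, a noninformative prior proportional to 1/σ², and starting values taken from the observed data instead of an EM run. All names and numbers are illustrative.

```python
import random
import statistics

def data_augmentation(observed, n_missing, iterations=2000, burn_in=500, seed=1):
    """Alternate an imputation step and a posterior step for a normal model."""
    rng = random.Random(seed)
    mu = statistics.mean(observed)          # starting value for the mean
    sigma2 = statistics.variance(observed)  # starting value for the variance
    mu_draws, imputations = [], []
    for k in range(iterations):
        # Imputation step: draw the missing values given the current parameters
        missing = [rng.gauss(mu, sigma2 ** 0.5) for _ in range(n_missing)]
        completed = list(observed) + missing
        n = len(completed)
        mean = statistics.mean(completed)
        ss = sum((y - mean) ** 2 for y in completed)
        # Posterior step: draw sigma2 as ss / chi-square(n - 1), then mu from
        # its normal conditional, both given the completed data
        sigma2 = ss / rng.gammavariate((n - 1) / 2.0, 2.0)
        mu = rng.gauss(mean, (sigma2 / n) ** 0.5)
        if k >= burn_in:
            mu_draws.append(mu)
            imputations.append(missing)
    return mu_draws, imputations

# Ten made-up observed values and two missing values
mu_draws, imputations = data_augmentation([float(v) for v in range(1, 11)], 2)
```

Each retained entry of `imputations` is one set of imputed values, so averaging over several well-separated entries mimics combining multiple imputations. Daily rainfall is far from normal, so the sketch only shows the structure of the chain.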
The whole process of multiple imputations using EM-MCMC method can be done by using PROC MI in the University Edition of SAS (Yim 2015). This study used PROC MI to make the multiple imputations of daily rainfall missing data for the target stations using the concurrent rainfall data of reference stations as covariates. The underlying distribution of the data is considered to be multivariate normal in this study.
Single best estimator (SBE) method
Various authors have employed this method to estimate the missing values of daily rainfall data (Wallis et al. 1991; Xia et al. 1999; Eischeid et al. 2000). In this method, the concurrent daily rainfall of the neighboring station is taken as the estimate of the missing value at the target station, provided that the neighboring station has the highest positive correlation with the target station. This is analogous to the simple substitution or closest neighboring station method (Lo Presti et al. 2010; Garcia et al. 2006). The neighboring station is selected as the one at minimum distance from the target station, because the rainfall of the closest neighboring station generally shows the highest positive correlation with that of the target station among the neighboring stations. For instance, in the mid region, Faridpur is the closest station to the target station Dhaka (distance 57 km), and in the south-east region, Ambagan is the closest station to the target station Chittagong (distance 15 km) (Table 2). The distance measurement procedure is discussed under the inverse distance method.
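A minimal sketch of the selection rule is shown below, assuming the nearest station is used as the single best estimator. The distance for Faridpur is the one quoted in the text; the second station and all rainfall values are hypothetical.

```python
def single_best_estimator(distances_km, reference_values):
    """Return the closest reference station and its concurrent value."""
    nearest = min(distances_km, key=distances_km.get)
    return nearest, reference_values[nearest]

# "Station B" and the rainfall values are hypothetical placeholders
distances = {"Faridpur": 57.0, "Station B": 90.0}
rainfall_today = {"Faridpur": 3.2, "Station B": 0.0}
station, estimate = single_best_estimator(distances, rainfall_today)
```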
Linear regression (LR) method
To formulate the linear regression method for estimating the missing data of daily rainfall occurrences, the study considers the following estimated form (Dumedah and Coulibaly 2011; Xia et al. 1999):
\( {\widehat{Y}}_{mi}=\widehat{\alpha}+\widehat{\beta}{X}_{mj} \)
where \( {\widehat{Y}}_{mi} \) indicates the estimated value of the missing rainfall observation on the mth day for the ith target station and Xmj indicates the rainfall observed on the mth day at the closest reference station j. The closest reference station is the neighboring station at minimum distance from the target station. Here, \( \widehat{\alpha} \) and \( \widehat{\beta} \) are the parameters estimated by the least squares method from the simple linear regression model. To estimate the parameters (α and β), the daily rainfall observations of the ith target station and of the closest jth reference station are taken as the dependent and independent variables, respectively.
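The least squares fit described here can be sketched with the closed-form solution for one predictor. The function names and the example data are illustrative; `x` holds the reference-station values and `y` the target-station values on days when both are observed.

```python
def fit_simple_regression(x, y):
    """Least squares intercept and slope for y = alpha + beta * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("the reference series has no variation")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta

def predict(alpha, beta, x_mj):
    """Estimated target value for one concurrent reference value."""
    return alpha + beta * x_mj

# Made-up paired observations lying exactly on y = 1 + 2x
alpha, beta = fit_simple_regression([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
estimate = predict(alpha, beta, 4.0)  # 9.0
```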
Multiple regression (MR) method
Kemp et al. (1983), Tabony (1983), Young (1992), and Eischeid et al. (1995) explained different uses of the regression model for data interpolation and missing data estimation. Following their suggestions, Xia et al. (1999) indicated the multiple regression method for estimating the missing values of daily rainfall occurrences. Therefore, the study considers the following estimated multiple regression model as an interpolation method:
\( {\widehat{Y}}_{mi}=\widehat{\alpha}+\sum_{j=1}^k{\widehat{\beta}}_j{X}_{mj} \)
where \( {\widehat{Y}}_{mi} \) indicates the estimated value of the rainfall observation on the mth day at the ith target station and Xmj indicates the observation on the mth day at the jth reference station (j = 1, 2, 3, ..., k, where k is the number of reference stations of station i). Here, \( \widehat{\alpha} \) and \( {\widehat{\beta}}_j \) are the parameters estimated by the least squares method from the multiple regression model. To estimate the parameters (α and βj), the daily rainfall observations of the ith target station and of the jth reference stations are taken as the dependent and independent variables, respectively.
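As an illustration of the least squares step with several reference stations, the sketch below solves the normal equations by Gaussian elimination using only the standard library. The function names and the example data are illustrative, and solving the normal equations directly is a choice made for brevity, not the study's stated procedure.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: reference series are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_multiple_regression(rows, y):
    """rows[t] holds the k reference values on day t; returns [alpha, beta_1, ...]."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Made-up data generated from y = 1 + 2 * x1 + 3 * x2
rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
coefficients = fit_multiple_regression(rows, [1.0, 3.0, 4.0, 6.0, 8.0])
```

The estimate for a day with missing target rainfall is then the intercept plus the sum of each coefficient times the concurrent value of the corresponding reference station.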
2.2.2 Techniques of comparison for the missing value estimation methods
To assess the agreement between observed and estimated values, the following comparison criteria are considered in the study. To calculate each criterion, a randomly selected portion of the observed daily rainfall data of the target station is first treated as missing, and these values are then estimated by the different missing value estimation techniques. The estimated values are treated as the expected values \( \left({Y}_i^{\mathrm{est}}\right) \) and are compared with the observed values \( \left({Y}_i^{\mathrm{obs}}\right) \). Here, i (i = 1, 2, …, n) indexes the sample observations, and n is the number of sample observations.
Kolmogorov-Smirnov (K-S) test
The Kolmogorov-Smirnov test for goodness of fit is used to determine whether a method provides good estimates of the missing values (Massey 1951; Wilks 1995; Simolo et al. 2010). It is a non-parametric test based on cumulative frequency distribution functions. Here, x indicates any specific value of daily rainfall, Fn(x) indicates the cumulative proportion of observations for the observed daily rainfall distribution, and Sn(x) indicates the cumulative proportion of observations for the estimated daily rainfall distribution. Then, the Kolmogorov-Smirnov test statistic for goodness of fit is defined as
\( D=\underset{x}{\max}\left|{F}_n(x)-{S}_n(x)\right| \)
If the p value of above statistic is large, then the estimated daily rainfall observations provide a good fit to the observed rainfall observations.
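A minimal sketch of the test statistic is given below, assuming the two-sample form in which the empirical cumulative distribution of the observed values is compared with that of the estimated values. Only the statistic is computed; the p value is not. The function name and data are illustrative.

```python
def ks_statistic(observed, estimated):
    """Largest gap between the two empirical cumulative distributions."""
    def ecdf(sample, x):
        # Proportion of the sample that is less than or equal to x
        return sum(1 for v in sample if v <= x) / len(sample)
    points = list(observed) + list(estimated)
    return max(abs(ecdf(observed, x) - ecdf(estimated, x)) for x in points)

# Made-up observed and estimated values (mm)
d_partial = ks_statistic([1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0])  # 0.5
d_same = ks_statistic([0.0, 0.0, 5.0], [0.0, 0.0, 5.0])               # 0.0
```

If SciPy is available, `scipy.stats.ks_2samp` returns both this statistic and a p value for the two-sample test.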
Bias or mean of error (ME)
In statistics, bias indicates the difference between the estimator’s expected value and the true value of the parameter. A bias of 0 (zero) indicates unbiased estimation (Walther and Moore 2005). In this study, the differences between the observed value of daily rainfall amount \( \left({Y}_i^{\mathrm{obs}}\right) \) and the estimated value of the missing daily rainfall observation \( \left({Y}_i^{\mathrm{est}}\right) \) for the corresponding observed value are taken as the errors. Then, the mean of the errors gives the bias of the estimate, which is stated as (Simolo et al. 2010)
The bias is calculated for all estimation methods and the method with the minimum bias is considered as the best.
Mean absolute error (MAE)
The mean absolute error is computed as the mean of the absolute differences between the observed values and the estimated missing values of daily rainfall data. The estimation method with the lowest MAE is considered the best (Suhaila et al. 2008). Therefore, the measure is defined as
\( \mathrm{MAE}=\frac{1}{n}\sum_{i=1}^n\left|{Y}_i^{\mathrm{obs}}-{Y}_i^{\mathrm{est}}\right| \)
Root-mean-square error (RMSE)
The RMSE is frequently used to measure the difference between the values (sample and population values) predicted by a model or an estimator and the values actually observed (Li and Zhao 2001; Chai and Draxler 2014). This measure is also used to compare different estimation techniques or methods to identify the best one. The method with the lowest RMSE is the best method. The study uses the RMSE to identify the best technique or method from the difference between the observed values \( \left({Y}_i^{\mathrm{obs}}\right) \) of daily rainfall data and the estimated values \( \left({Y}_i^{\mathrm{est}}\right) \) of the missing daily rainfall data (Simolo et al. 2010). The formula for the RMSE is given below:
\( \mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^n{\left({Y}_i^{\mathrm{obs}}-{Y}_i^{\mathrm{est}}\right)}^2} \)
Coefficient of variation of root-mean-square error (CVRMSE)
The RMSE is commonly used as a scale-dependent measure of accuracy for assessing forecasting performance on time series data. However, to eliminate the scale dependency of the comparison criterion, Yozgatligil et al. (2013) suggested the CVRMSE. To compare missing value estimation techniques, the RMSE is divided by the mean of the observed daily rainfall data for the artificially created missing period, which gives the CVRMSE, \( \mathrm{CVRMSE}=\mathrm{RMSE}/\overline{Y} \), where \( \overline{Y} \) is the mean of the observed values.
Minimum CVRMSE suggests the minimum percentage of variation between observed values and estimated values of missing data for daily rainfall occurrences. So, the method with the minimum CVRMSE is considered as the best.
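The error-based criteria introduced so far (bias, MAE, RMSE, and CVRMSE) can be sketched together as follows. The sign convention for the error, observed minus estimated, is an assumption, and the function names and example values are illustrative.

```python
import math

def errors(obs, est):
    # Assumed sign convention: observed minus estimated
    return [o - e for o, e in zip(obs, est)]

def mean_error(obs, est):
    """Bias: mean of the errors."""
    e = errors(obs, est)
    return sum(e) / len(e)

def mean_absolute_error(obs, est):
    e = errors(obs, est)
    return sum(abs(v) for v in e) / len(e)

def rmse(obs, est):
    e = errors(obs, est)
    return math.sqrt(sum(v * v for v in e) / len(e))

def cvrmse(obs, est):
    """RMSE divided by the mean of the observed values."""
    mean_obs = sum(obs) / len(obs)
    if mean_obs == 0:
        raise ValueError("mean of observed values is zero")
    return rmse(obs, est) / mean_obs

# Made-up observed and estimated values (mm)
obs = [0.0, 10.0, 20.0]
est = [2.0, 8.0, 20.0]
summary = (mean_error(obs, est), mean_absolute_error(obs, est),
           rmse(obs, est), cvrmse(obs, est))
```

In this example the positive and negative errors cancel, so the bias is zero although the MAE and RMSE are not, which is why the criteria are read together.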
Standard deviation of error (ESD)
The standard deviation of error (difference between the observed and estimated value) indicates the fluctuations of the deviations. The minimum ESD is used as the criterion to identify the best technique for estimating the missing value (Silva et al. 2007). Then, it is defined as
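A minimal sketch of this criterion is given below, assuming the sample standard deviation (divisor n − 1) of the errors and the convention that an error is the observed value minus the estimated value. Both choices are assumptions, and the example values are illustrative.

```python
import statistics

def error_standard_deviation(obs, est):
    """Sample standard deviation of the errors (observed minus estimated)."""
    errs = [o - e for o, e in zip(obs, est)]
    return statistics.stdev(errs)  # uses the n - 1 divisor

# Made-up values: the errors are -2, 2, and 0
esd = error_standard_deviation([0.0, 10.0, 20.0], [2.0, 8.0, 20.0])  # 2.0
```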
Similarity index (S index)
S index is the criterion of agreement for assessing model performance which implies the percentage of agreement between the observed and estimated values. The values of S index lie between 0.0 and 1.0, where 0.0 indicates complete disagreement and 1.0 indicates perfect agreement (Wilmott 1981). The S index is used to find out the best missing value estimation technique for rainfall data (Suhaila et al. 2008). The S index is stated below:
where \( \overline{Y} \) is the mean of observed daily rainfall and n is the number of estimated or observed observations.
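As a hedged illustration, the sketch below assumes the S index takes the form of the widely used index of agreement: one minus the sum of squared errors divided by the sum of squared combined absolute deviations of the estimated and observed values from the observed mean. The function name and example values are illustrative.

```python
def similarity_index(obs, est):
    """Assumed index-of-agreement form of the S index, between 0 and 1."""
    mean_obs = sum(obs) / len(obs)
    numerator = sum((e - o) ** 2 for o, e in zip(obs, est))
    denominator = sum(
        (abs(e - mean_obs) + abs(o - mean_obs)) ** 2 for o, e in zip(obs, est)
    )
    if denominator == 0:
        raise ValueError("observed and estimated values are all equal to the mean")
    return 1.0 - numerator / denominator

# Made-up observed and estimated values (mm)
s_close = similarity_index([0.0, 10.0, 20.0], [2.0, 8.0, 20.0])
s_perfect = similarity_index([0.0, 10.0, 20.0], [0.0, 10.0, 20.0])  # 1.0
```

In the first example the squared errors sum to 8 and the denominator sums to 728, so the index is close to 1, reflecting good agreement.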
3 Results and discussions
The methods for estimating the missing values of daily rainfall observations and the techniques for comparing them were discussed in the previous section. The data used in the study were also described in Sect. 2, where the classification of the 27 selected stations into five climatic sub-zones and the selection of target and reference stations from each sub-zone are discussed in detail. Section 2 also explains that the missing data of these stations follow the MAR mechanism. This section presents the results of missing daily rainfall estimation for the five target stations under the different methods and the results of the comparison techniques used to identify the suitable method for each station, followed by a comparison of the present study with similar studies conducted in other parts of the world.
The results of the comparison criteria for the missing value estimation techniques for the target stations Sylhet (north-east region), Chittagong (south-east region), Dhaka (mid region), Khulna (south-west region), and Rajshahi (north-west region) are shown in Tables 3, 4, 5, 6, and 7, respectively. In each region, the correlation coefficient between the daily rainfall amount of the target station and that of its nearest reference station is higher than that for all other reference stations. For example, the distance between the target station Chittagong and the reference station Ambagan is the smallest (15 km), and their correlation coefficient is 0.91559, which is statistically significant (Table 2).
In Fig. 2, box plots are shown for all the stations in each of the five regions, based on n = 14, 75, and 146 observations, which were randomly selected and set as missing to represent 1, 5, and 10% missing data, respectively. Each row of the figure shows the box plots for one region at the three sample sizes (e.g., row 1 shows box plots for the stations in the south-east region for 14, 75, and 146 observations, respectively), and each column shows the box plots for the different regions at the same sample size (e.g., column 2 shows the box plots of the stations of each region for 75 observations). As expected, the box plots in column 1 show fewer outliers than those in columns 2 and 3 because they are based on the smallest sample. The same behavior can be seen within each region across sample sizes: the number of outliers increases with the sample size (e.g., the numbers of outliers for n = 14, 75, and 146 are 3, 10, and 30, respectively, at Dhaka station in the mid region and 2, 13, and 26, respectively, at Sylhet station in the north-east region). The stations of different regions can be compared by looking down a single column, that is, at a fixed sample size. In column 2 (n = 75), for the south-east region, there are a considerable number of outliers at each station and the rainfall observations are right skewed at all the stations (the median is zero for all the stations). Similar patterns can be observed for the north-east, mid, and south-west regions.
There is one extreme station in the south-west region, Satkhira, for which the third quartile is also very small (Q3 for Satkhira = 1). This might be the result of the random choice of observations; a different sample of observations would result in different box plots, but the pattern of right-skewed data remains the same for all combinations of observations. The same explanation applies to the stations of the north-west region, which have very low third quartiles (Q3 for Rajshahi = 0, Q3 for Ishwardi = 1, and Q3 for Chuadanga = 2). The presence of outliers at stations in columns 1 and 3 can be explained similarly. It should be kept in mind that these box plots represent the actual rainfall on the days that are treated as missing in the present study; they are not representative of the whole data set. Therefore, the findings from the box plots cannot be generalized to assess the geographic variation among the stations. They are presented only to help in assessing the performance of the missing value estimation techniques applied to estimate these observations.
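The link between a zero median, a small third quartile, and many outliers can be illustrated with a short sketch. It assumes the conventional box plot rule that flags values beyond 1.5 times the interquartile range from the quartiles; the rule actually used for Fig. 2 is not stated here, and the function name and sample values are hypothetical.

```python
import statistics


def boxplot_outliers(values):
    """Return values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR (assumed rule)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical daily rainfall sample (mm) with many dry days.
sample = [0, 0, 0, 0, 0, 0, 0, 0, 5, 12]
flagged = boxplot_outliers(sample)
```

When the third quartile is zero, as in this hypothetical sample, the interquartile range is zero and every rainy day is flagged, which is consistent with the right-skewed pattern described above.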
3.1 North-east region
Only one reference station (Srimangal) is identified corresponding to the target station Sylhet, which has very high elevation (Table 2). With a single reference station, only the EM-MCMC, SBE, and LR methods are applicable for estimating the missing values of daily rainfall data. Among these methods, SBE for 1, 5, and 10% missing data and EM-MCMC for 1% missing data provide a good fit according to the KS test. The efficiency measure CVRMSE gives a similar result (around 2.29) for the SBE and EM-MCMC methods. The SBE method gives the highest value of the S index among the methods for 1, 5, and 10% missing values (Table 3). The correlation coefficient between the target and reference stations for daily rainfall data is very low (0.3094), owing to the long distance (68 km) between them (Table 2). With such a weak relationship, the EM-MCMC and LR methods did not perform well. Therefore, the SBE method is the most suitable method for estimating the missing values of daily rainfall data for Sylhet station.
3.2 South-east region
For this region, nine rainfall stations surrounding the target station, Chittagong, are identified as reference stations (Table 2). The Kolmogorov-Smirnov goodness-of-fit test gives satisfactory results for all missing value estimation methods of daily rainfall observations except the regression, ID, and EM-MCMC methods for 5 and 10% data. The bias of the estimated missing values is found to be minimal for all the fitted methods other than the ID and MR methods. The S index indicates good performance for all the methods except the ID method (Table 4).
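A sketch of how a Kolmogorov-Smirnov statistic might be obtained is given below. It assumes a two-sample comparison between the empirical distributions of the observed and the estimated values, and it computes only the test statistic, not the critical value or p-value; the function name is hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical distribution functions."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    # The gap can only change at observed data points.
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap
```

A small statistic indicates that the estimated values follow a distribution close to that of the observed values, which is the sense in which a method "provides a good fit" under this assumed reading.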
The box plots of the 1, 5, and 10% daily rainfall data for the target and reference stations in this region indicate some outliers at the reference stations (Fig. 2). At these stations, daily rainfall observations show high variation owing to large differences in elevation (Table 2). The box plots also suggest that the pair-wise correlation between daily rainfall observations of the reference stations may be only moderate (Fig. 2), so the regression models may not provide a good fit for estimating missing values. The ID method does not give satisfactory results in this region because of the considerable variation in distance between the target and each of the reference stations (Table 2). Therefore, four methods (AA, NR, NRWC, and SBE) provide satisfactory performance for estimating the missing values of daily rainfall data at Chittagong station.
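For orientation, the two simplest of these four methods can be sketched as follows. The sketch assumes the textbook forms of the arithmetic average (AA) and normal ratio (NR) methods, where a station's "normal" is its long-term mean rainfall; the exact formulations used in this study are those given in Sect. 2, and all function and parameter names here are hypothetical.

```python
def arithmetic_average(reference_values):
    """AA: mean of the same-day rainfall at the reference stations."""
    if not reference_values:
        raise ValueError("at least one reference value is required")
    return sum(reference_values) / len(reference_values)


def normal_ratio(target_normal, reference_normals, reference_values):
    """NR: reference values rescaled by target/reference normals, then averaged."""
    if not reference_values or len(reference_normals) != len(reference_values):
        raise ValueError("one normal is required per reference value")
    scaled = [
        (target_normal / normal) * value
        for normal, value in zip(reference_normals, reference_values)
    ]
    return sum(scaled) / len(scaled)


# Hypothetical same-day rainfall (mm) at three reference stations.
aa_estimate = arithmetic_average([2.0, 4.0, 6.0])
nr_estimate = normal_ratio(100.0, [50.0, 100.0, 200.0], [2.0, 4.0, 6.0])
```

Under these assumed forms, NR reduces to AA when all stations share the same normal, so the two methods differ only where long-term rainfall varies between stations.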
3.3 Mid region
In this region, five reference stations neighboring the target station Dhaka are identified. Faridpur is the nearest reference station to the target station (distance 57 km), and the elevations of the reference and target stations are almost the same, except for Chandpur station (Table 2). The KS test indicates a good fit for all methods except the LR and MR methods, and the AA (for 10% missing data), NR (for 10% missing data), ID (for 5 and 10% missing data), and EM-MCMC (for 5% missing data) methods. The EM-MCMC method gives higher RMSE, MAE, and ESD than the other methods when estimating these missing daily rainfall data. However, the bias of the estimates is lowest for the SBE method, and the S indices are close to 1 for the AA, NR, NRWC, and SBE methods (Table 5).
The box plots for the 1, 5, and 10% daily rainfall data show outliers at every station (Fig. 2). The correlation coefficient of daily rainfall amount between the target station Dhaka and each of the reference stations except Faridpur is around 0.45. With such a weak relationship, the LR and MR methods may not provide a good fit. For Dhaka and Faridpur, in contrast, the correlation is 0.603. Owing to this relationship, the SBE method can be considered the best estimator of the missing values of rainfall data for Dhaka station, on the basis of its lowest bias and its higher S index value compared with all other methods.
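The role of the correlation in the SBE method can be sketched in code. The sketch assumes that SBE substitutes the same-day value from the single reference station most strongly correlated with the target, with the correlation computed over days on which the target is observed; the function names, the `None` marker for missing days, and the station labels are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def sbe_impute(target, references):
    """Fill missing target days (None) from the best-correlated reference."""
    observed_days = [i for i, value in enumerate(target) if value is not None]
    target_observed = [target[i] for i in observed_days]

    def correlation_with_target(name):
        return pearson(target_observed, [references[name][i] for i in observed_days])

    best = max(references, key=correlation_with_target)
    filled = [
        references[best][i] if value is None else value
        for i, value in enumerate(target)
    ]
    return best, filled


# Hypothetical five-day series (mm); day 3 is missing at the target.
target_series = [0.0, 5.0, None, 10.0, 0.0]
reference_series = {
    "ref_a": [0.0, 4.0, 7.0, 9.0, 1.0],
    "ref_b": [3.0, 0.0, 2.0, 1.0, 4.0],
}
best_station, completed = sbe_impute(target_series, reference_series)
```

In this hypothetical example, `ref_a` tracks the target closely and is selected, so the missing day is filled with its value of 7.0.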
3.4 South-west region
For this region, five stations surrounding the target station Khulna are identified as reference stations. The elevation of these stations is almost the same (around 2.1 m). The nearest station to the target station is Mongla (35 km) (Table 2). The AA, NR, NRWC, EM-MCMC, and SBE methods show a good fit for estimating the missing values of daily rainfall data according to the KS test. The bias and MAE of the estimates are lower for the AA method, and the CVRMSE is lower for the EM-MCMC method, than for the other methods. The S index is highest (close to 1) for the EM-MCMC method (Table 6).
The box plots for daily rainfall observations of the south-west region indicate a large number of outliers for all stations (Fig. 2). For this reason, the regression methods do not work well for estimating missing daily rainfall data. Further, the ID method also does not provide a good fit owing to the long distance between the target and reference stations. Therefore, the EM-MCMC method is found to be the best estimator for Khulna station for estimating the missing values of daily rainfall data.
3.5 North-west region
For this region, two rainfall stations are identified as reference stations for the target station Rajshahi. Ishwardi is the nearest reference station to the target station. The correlation coefficient between the target and its nearest reference station for daily rainfall data is 0.508 (Table 2). The AA, NR, NRWC, and SBE methods provide a good fit for estimating the missing values of daily rainfall data according to the KS test. The bias of the estimates is lowest for the AA and SBE methods, and the CVRMSE is lowest for the AA method. The S index is almost the same (around 0.65) for the AA and SBE methods (Table 7).
The box plots indicate high variation among the stations’ rainfall data in this region (Fig. 2); as a result, the LR, MR, ID, and EM-MCMC methods do not give satisfactory results in terms of the comparison criteria. In addition, because of the long distance from the target to the reference stations (Table 2), the ID method does not perform adequately. Therefore, the AA and SBE methods provide a good fit, with the lowest bias and a high S index value, for estimating the missing values of daily rainfall data at Rajshahi station.
3.6 Comparison with other similar studies
The present study has been conducted to suggest a suitable method for estimating the missing values in daily rainfall data in Bangladesh. The study employed eight different methods found in the literature and compared their performance using seven techniques. To the best of our knowledge, this is the first study to date that attempts to find an appropriate missing value estimation technique for Bangladesh. However, this study was inspired by similar studies conducted in other parts of the world. For instance, there have been studies to find the best method for estimating missing values in Turkish meteorological data (Yozgatligil et al. 2013), daily precipitation data from Brazil (Ferrari and Ozaki 2014), and rainfall data from Malaysia (Suhaila et al. 2008), Italy (Lo Presti et al. 2010), and the Andes region in Venezuela (Garcia et al. 2006).
Garcia et al. (2006) performed a cluster analysis to find the two closest stations corresponding to a rainfall station and filled in the missing values of the target station from those closest stations. They applied their method to daily, weekly, bi-weekly, and monthly data of 106 rainfall stations in the Andes region in Venezuela and assessed the performance of the proposed method using mean error (ME), MAE, RMSE, coefficient of correlation (r), and the Willmott agreement index (d). The authors did not compare the proposed method with any other methods.
Yozgatligil et al. (2013) suggested the EM-MCMC algorithm as the best technique for Turkish meteorological data after comparing simple and weighted arithmetic average methods, a multilayer perceptron neural network, and MCMC-based multiple imputation methods. The comparison criteria used in the study were RMSE, the coefficient of variation of RMSE (CVRMSE), and the correlation dimension technique (a branch of nonlinear dynamic time series analysis).
Ferrari and Ozaki (2014) compared the nearest neighbor method, the inverse distance-weighted ratio method, and the linear regression method for imputation of missing values in precipitation data from the state of Parana in southern Brazil according to the value of RMSE. The authors stated the inverse distance-weighted ratio method to be the most appropriate for imputing missing precipitation data from 484 stations in the area of interest.
Silva et al. (2007) compared the arithmetic mean method, the normal ratio method, and the inverse distance method for estimating missing rainfall data in Sri Lanka according to descriptive statistics of the error, RMSE, mean absolute percentage error, and the correlation coefficient. The authors also proposed a new method named aerial precipitation ratio. Stations were selected to represent each of seven major ecological zones in Sri Lanka, and monthly rainfall data were estimated using the abovementioned methods, taking the surrounding stations in each zone into account. The authors suggested different methods to be suitable for different zones in Sri Lanka, with no indication of a single best method.
Lo Presti et al. (2010) proposed methods to fill in the missing observations in daily rainfall data in the Candelaro River Basin (Italy) in two stages. In the first stage, the authors assessed the missingness mechanism present in the data and then applied four different regression methods (simple substitution, parametric regression, ranked regression, and the Theil method) to estimate the missing daily rainfall data. By studying the absolute error distribution, the authors indicated the Theil method to be the most suitable one, though it is a very complex method. The simple substitution method was also marked as an acceptable method.
The present study employed eight methods to estimate the missing rainfall data at one target station from each of the five climatic sub-zones of Bangladesh, so the methods are expected to be applicable to the other stations as well. Among the studies reviewed here, this kind of climatic or ecological division was made only by Silva et al. (2007). Before estimating the missing daily rainfall amount at the target stations, the missingness mechanism of the missing rainfall data was tested following the suggestions of Lo Presti et al. (2010). To the best of our knowledge, apart from Lo Presti et al. (2010), none of the other studies reviewed here tested the missingness mechanism. The present study chose eight methods, the highest number among all the studies reported here, from the literature mentioned above on the basis of relevance, simplicity, and relative performance in other regions. Comparing such a large number of methods allowed flexibility in choosing the best method for estimating the missing data in daily rainfall observations. Also, the seven comparison criteria used in the present study were combined from the previous studies. Apart from the present study, the K-S goodness-of-fit test has been applied only once (Simolo et al. 2010) to assess the performance of missing value estimation techniques in rainfall data. The result of the K-S test contributes significantly to the choice of the most suitable method in the present article. One of the unique elements of this study was the inclusion of box plots for the selected missing observations at the target and reference stations, which helped in understanding the actual conditions at different stations across Bangladesh on the days chosen to be missing, which in turn affect the performance of a particular missing value estimation technique.
Though this study did not propose any new method, it integrated a wide range of methods and comparison criteria, along with some descriptive measures, for estimating the missing data in daily rainfall, which should support further scientific studies involving continuous rainfall data in the future.
4 Conclusion
A suitable method for estimating missing rainfall values is of great interest to researchers worldwide. The reason for this interest is the need to use rainfall data from long series, where occasional missing values make such data very difficult to use. In the present paper, different methods have been compared in order to suggest the best possible choice under specific criteria. Although the focus of the current paper is Bangladesh, the statistical criteria used in this study can be generalized on the basis of the underlying statistical reasoning highlighted here. To estimate the missing values of daily rainfall observations at the target stations of the five climatic regions of the country, eight methods and seven comparison techniques are employed to identify the most suitable method for each station. For this purpose, three sets of artificially missing daily rainfall data (1, 5, and 10%) with 1000 repetitions are considered (Sect. 2). The performance of the estimation methods according to the comparison techniques is shown in Tables 3, 4, 5, 6, and 7, and these results are discussed in Sect. 3. From the results and discussion of the study, the following conclusions can be drawn. We have attempted to find a single method that can be suggested for all the stations in Bangladesh. To examine whether the findings of this study hold for other countries, similar studies can be repeated there. This may yield a consensus technique for the varied conditions that prevail in the nature and extent of missing values in rainfall time series.
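The repeated artificial-missingness design summarized above can be sketched as follows. This is an illustrative outline only: the placeholder imputer `mean_of_observed` is not one of the eight methods of the study, RMSE is used as the single example criterion, and the function names, the fixed seed, and the sample series are assumptions.

```python
import math
import random


def mean_of_observed(masked):
    """Placeholder imputer (not a method of the study): fill with the observed mean."""
    observed = [v for v in masked if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in masked]


def average_rmse(series, fraction, repetitions, impute, seed=0):
    """Mask a random fraction of days, impute them, and average RMSE over repetitions."""
    n_missing = max(1, int(len(series) * fraction))
    if n_missing >= len(series):
        raise ValueError("fraction leaves no observed values")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    total = 0.0
    for _ in range(repetitions):
        missing = set(rng.sample(range(len(series)), n_missing))
        masked = [None if i in missing else v for i, v in enumerate(series)]
        filled = impute(masked)
        squared_error = sum((filled[i] - series[i]) ** 2 for i in missing)
        total += math.sqrt(squared_error / n_missing)
    return total / repetitions


# Hypothetical complete daily series (mm), reused for the three missing rates.
rain = [0.0, 0.0, 12.5, 0.0, 3.2, 0.0, 0.0, 40.1, 0.0, 1.0] * 20
results = {
    rate: average_rmse(rain, rate, 1000, mean_of_observed)
    for rate in (0.01, 0.05, 0.10)
}
```

Any imputation function with the same interface could be passed in place of the placeholder, so the same loop can compare several methods on identical masked samples.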
Let us consider five measures of comparison (out of the seven measures included in this study) for identifying the best estimation technique, namely, (i) the K-S test statistic, (ii) bias, (iii) RMSE, (iv) MAE, and (v) the S index. The two other measures, CVRMSE and ESD, are set aside because the similar measures RMSE and MAE are included. The Kolmogorov-Smirnov test statistic shows that, among all the estimation techniques, only SBE gives consistently acceptable estimates for all the regions. The other measures of comparison (bias, RMSE, MAE, and S index) also confirm that SBE is consistently better as a technique for estimating missing values. In some cases, the arithmetic average and EM-MCMC methods, along with the linear or multiple regression methods, provide good estimates, but the results are not consistent across regions. Garcia et al. (2006) observed the closest station method to be the best one for filling in the missing observations of rainfall data at different time scales in the Andes region in Venezuela. Lo Presti et al. (2010) stated that the simple substitution method, which is the same as the SBE described in the present study, is an acceptable technique for missing value estimation in daily rainfall data in the Candelaro River Basin (Italy) when the similarity value is particularly high and significant. In the present study, Table 2 shows that each target station had a significant positive correlation with all its reference stations. For the SBE method, a single reference station is chosen for each target station: the one whose daily rainfall observations have the highest significant correlation with those of the target station, which also happens to be the nearest station to the target station (Table 2). Thus, the consistent performance of the SBE method has both a statistical rationale and practical significance.
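Three of the measures named above can be sketched in a few lines. The sketch assumes the usual definitions, with bias taken as the mean of estimated minus observed values (the sign convention is an assumption), and the function name is hypothetical.

```python
import math


def error_measures(observed, estimated):
    """Return bias, RMSE, and MAE of the estimates against the observed values."""
    if not observed or len(observed) != len(estimated):
        raise ValueError("observed and estimated must be non-empty and equal in length")
    errors = [e - o for o, e in zip(observed, estimated)]
    n = len(errors)
    bias = sum(errors) / n  # positive when the method overestimates on average
    rmse = math.sqrt(sum(err ** 2 for err in errors) / n)
    mae = sum(abs(err) for err in errors) / n
    return {"bias": bias, "rmse": rmse, "mae": mae}


# Hypothetical observed and estimated rainfall (mm) on three masked days.
measures = error_measures([1.0, 2.0, 3.0], [2.0, 2.0, 5.0])
```

Because RMSE squares the errors, it penalizes a few large misses on heavy-rain days more than MAE does, which is one reason to report both.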
Hence, we may conclude that the single best estimator technique is singled out in this study as a suitable choice for estimating missing values.
References
Ahrens B (2006) Distance in spatial interpolation of daily rain gauge data. Hydrol Earth Syst Sci 10:197–208
Asati SR (2012) Analysis of rainfall data for drought investigation at Agra U. P. Int J Life Sci Biotechnol Pharm Res 1(4):81–86
Bangladesh Economic Review (2016) Economic adviser’s wing, finance division, Ministry of Finance, Government of the People’s Republic of Bangladesh
Chai T, Draxler RR (2014) Root mean square error (RMSE) or mean absolute error (MAE)?—arguments against avoiding RMSE in the literature. Geosci Model Dev 7(3):1247–1250
Chen FW, Liu CW (2012) Estimation of the spatial rainfall distribution using inverse distance weighting (IDW) in the middle of Taiwan. Paddy Water Environ 10(3):209–222
Chowdhury MRK (2013) Country report: Bangladesh meteorological department (BMD), People’s republic of Bangladesh
Collins LM, Schafer JL, Kam CM (2001) A comparison of inclusive and restrictive strategies in modern missing-data procedures. Psychol Methods 6:330–351
Cong RG, Brady M (2012) The interdependence between rainfall and temperature: copula analyses. Sci World J 2012:1–11
Coulibaly P, Evora ND (2007) Comparison of neural network methods for infilling missing daily weather records. J Hydrol 341:27–41
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39:1–38
Dumedah G, Coulibaly P (2011) Evaluation of statistical methods for infilling missing values in high-resolution soil moisture data. J Hydrol 400(1–2):95–102
Eischeid JK, Baker CB, Karl TR, Diaz HF (1995) The quality control of long-term climatological data using objective data analysis. J Appl Meteorol 34:2787–2795
Eischeid JK, Pasteris PA, Diaz HF, Plantico MS, Lott NJ (2000) Creating a serially complete, national daily time series of temperature and precipitation for the western United States. J Appl Meteorol 39(9):1580–1591
Ferrari GT, Ozaki V (2014) Missing data imputation of climate datasets: implications to modeling extreme drought events. Rev Bras Meteorol 29(1):21–28
Garcia B, Sentelhas P, Tapia L, Sparovek G (2006) Filling in missing rainfall data in the Andes region of Venezuela, based on a cluster analysis approach. Rev Bras Agrometeorol 14(2):225–233
Garcia M, Peters-Lidard CD, Goodrich DC (2008) Spatial interpolation in a dense gauge network for monsoon storm events in the southwestern United States. Water Resour Res 44:W05S13. https://doi.org/10.1029/2006WR005788
Goodison B, Louie PYT, Yang D (1998) WMO solid precipitation measurement inter comparison. Final report
Graham JW, Hofer SM, Donaldson SI, MacKinnon DP, Schafer JL (1997) Analysis with missing data in prevention research. The science of prevention: methodological advances from alcohol and substance abuse research, 1, pp 325–366
Hubbard KG (1994) Spatial variability of daily weather variables in the high plains of the USA. Agric For Meteorol 68:29–41
Kemp WP, Burnell DG, Everson DO, Thomson AJ (1983) Estimating missing daily maximum and minimum temperatures. J Climate Appl Meteorol 22:1587–1593
Kripalani RH, Inamdar S, Sontakke NA (1996) Rainfall variability over Bangladesh and Nepal: comparison and connections with features over India. Int J Climatol 16(6):689–703
Lam NSN (1983) Spatial interpolation methods: a review. Am Cartographer 10(2):129–149
Lennon JJ, Turner JRG (1995) Predicting the spatial distribution of climate: temperature in Great Britain. J Anim Ecol 64:370–392
Li X, Zhao Z (2001) Measures of performance for evaluation of estimators and filters. Proc. 2001 SPIE Conf. on Signal and Data Processing, (July–August), pp 1–12
Little RJA, Rubin DB (1987) Statistical analysis with missing data. Wiley, New York
Lo Presti R, Barca E, Passarella G (2010) A methodology for treating missing data applied to daily rainfall data in the Candelaro River Basin (Italy). Environ Monit Assess 160:1–22
Massey FJ (1951) The Kolmogorov-Smirnov test for goodness of fit. JASA 46(253):68–78
National Hurricane Center of USA n.d. http://www.nhc.noaa.gov/gccalc.shtml
Paulhus JLH, Kohler MA (1952) Interpolation of missing precipitation records. Mon Weather Rev 80(8):129–133
Rashid H-e (1991) Geography of Bangladesh (2nd edition). University Press Limited, Dhaka
Rubel F, Hantel M (1999) Correction of daily gauge measurements in the Baltic Sea drainage basin. Nord Hydrol 30:191–208
Rubin DB (1976) Inference and missing data. Biometrika 63(3):581–592
Rubin DB (1978) Multiple imputation in sample surveys—a phenomenological Bayesian approach to nonresponse. Proceedings of the Survey Research Methods Section, ASA, pp 20–34
Rubin DB (1987) Multiple imputation for non-response in surveys. Wiley, New York
Schafer JL (1997) Analysis of incomplete multivariate data. Chapman & Hall, London
Scheffer J (2002) Dealing with missing data. Res Lett Inf Math Sci 3:153–160
Shepard D (1968) A two-dimensional interpolation function for irregularly spaced data. Proceedings of the Twenty-Third National Conference of the ACM, Washington, DC, pp 517–524
Silva RP, Dayawansa NDK, Ratnasiri MD (2007) A comparison of methods used in estimating missing rainfall data. J Agric Sci 3(May):101–108
Simanton JR, Osborn HB (1980) Reciprocal-distance estimate of point rainfall. J Hydraul Eng 106:1242–1246
Simolo C, Brunetti M, Maugeri M, Nanni T (2010) Improving estimation of missing values in daily precipitation series by a probability density function-preserving approach. Int J Climatol 30:1564–1576
Suhaila J, Sayang MD, Jemain AA (2008) Revised spatial weighting methods for estimation of missing rainfall data. Asia-Pac J Atmos Sci 44(2):93–104
Tabios GQ, Salas JD (1985) A comparative analysis of techniques for spatial interpolation of precipitation. Water Resour Bull 21:365–380
Tabony RC (1983) The estimation of missing climatological data. J Climatol 3:297–314
Tang WY, Kassim AHM, Abubakar SH (1996) Comparative studies of various missing data treatment methods-Malaysian experience. Atmos Res 42:247–262
Tanner MA, Wong WH (1987) The calculation of posterior distributions by data augmentation. JASA 82(398):528–540
Teegavarapu RSV, Chandramouli V (2005) Improved weighting methods, deterministic and stochastic data-driven models for estimation of missing precipitation records. J Hydrol 312:191–206
Tronci N, Molteni F, Bozzini M (1986) A comparison of local approximation methods for the analysis of meteorological data. Arch Meteorol Geophys Bioclimatol A 36:189–211
Walther BA, Moore JL (2005) The concept of bias, precision and accuracy, and their use in testing the performance of species richness estimators, with a literature review of estimators. Ecography 28:815–829
Wilks DS (1995) Statistical methods in the atmospheric sciences. Academic Press, New York
Williams P (1998) Modelling seasonality and trends in daily rainfall data. Adv Neural Inf Proces Syst 10:985–991
Wallis JR, Lettenmaier DP, Wood EF (1991) A daily hydroclimatological data set for the continental United States. Water Resour Res 27:1657–1663
Willmott CJ (1981) On the validation of models. Phys Geogr 2(2):184–194
Xia Y, Fabian P, Stohl A, Winterhalter M (1999) Forest climatology: estimation of missing values for Bavaria, Germany. Agric For Meteorol 96:131–144
Yim C (2015) Imputing missing data with SAS. SAS Global Forum 2015, April 26–29, 2015, Dallas, pp 1–21
Yozgatligil C, Aslan S, Iyigun C, Batmaz I (2013) Comparison of missing value imputation methods in time series: the case of Turkish meteorological data. Theor Appl Climatol 112(1–2):143–167
Young KC (1992) A three way model for interpolating monthly precipitation values. Mon Weather Rev 120:2561–2569
Acknowledgements
This study is supported under the HEQEP sub-project, CP-3293, in the Department of Applied Statistics, East West University, funded by the World Bank and implemented by the University Grants Commission of Bangladesh (UGC). The authors are also grateful to the Bangladesh Meteorological Department (BMD) for providing the data. We acknowledge the critical comments from the anonymous reviewers and the editor.
Cite this article
Jahan, F., Sinha, N.C., Rahman, M.M. et al. Comparison of missing value estimation techniques in rainfall data of Bangladesh. Theor Appl Climatol 136, 1115–1131 (2019). https://doi.org/10.1007/s00704-018-2537-y
DOI: https://doi.org/10.1007/s00704-018-2537-y