1 Introduction

Rainfall is a key variable in hydrological studies. Rainfall records provide the input to crop growth and production models. They also inform the management of landfills, tailing dams, and land disposal of liquid waste materials, all of which are environmentally sensitive at regional and national scales. Rainfall is generally measured on a daily time scale and may then be aggregated into monthly or annual series. The analysis of rainfall therefore plays a significant role in agriculture, ecology, and climatology (Asati 2012; Williams 1998; Cong and Brady 2012; Silva et al. 2007). It is also a highly influential factor in flood formation. Rainfall data analysis is frequently hampered by gaps in the record (Silva et al. 2007; Simolo et al. 2010), and missing values in rainfall data are a common problem in many countries. Rainfall data may be missing for various reasons, such as loss of yearbooks, human errors, wars, fire accidents, occasional interruptions of automatic stations, instrument malfunctions, and network reorganizations (Simolo et al. 2010). A similar situation is observed in Bangladesh.

For an effective analysis of rainfall, it is essential to estimate the missing values of daily rainfall data. For this purpose, different authors have recommended methods for specific countries or regions after comparing several missing data estimation methods. Such region-specific comparisons are needed because the performance of any method generally depends on the missing data mechanism, the pattern of consecutive rainfall occurrences, the characteristics of neighboring stations, and other intrinsic characteristics of the climate variables (Little and Rubin 1987).

To estimate the missing values of daily rainfall data, Silva et al. (2007) and Suhaila et al. (2008) compared different methods, such as inverse distance, normal ratio, arithmetic mean, aerial precipitation ratio, inverse weighting distance, and the correlation coefficient method, for Sri Lanka and Malaysia, respectively, following the suggestions of Simanton and Osborn (1980), Tabios and Salas (1985), Young (1992), Hubbard (1994), Lennon and Turner (1995), Tang et al. (1996), Xia et al. (1999), Eischeid et al. (2000), Teegavarapu and Chandramouli (2005), Ahrens (2006), Garcia et al. (2008), and Chen and Liu (2012). To compare these methods, they used the similarity index (S index), mean absolute error (MAE), and coefficient of correlation (R).

Further, Lo Presti et al. (2010) identified the Theil method as the best among the regression-based methods (simple substitution, parametric regression, ranked regression, and Theil method) for estimating the missing values of daily rainfall data in the Candelaro River Basin, Italy. Coulibaly and Evora (2007) suggested artificial neural network (ANN) algorithms for imputing missing daily precipitation; the algorithm is based on a weighted interpolation of data from adjacent stations. Yozgatligil et al. (2013) suggested the Monte Carlo Markov chain based on expectation-maximization (EM-MCMC) algorithm as the best technique for estimating missing values in Turkish meteorological data. These studies indicate that different techniques are appropriate for different stations or regions. Therefore, before the daily rainfall data of the rainfall stations of Bangladesh are analyzed, a suitable missing value estimation technique must be identified for each station or region.

Bangladesh is an agro-based country. Around 50% of the country’s labor force is engaged in agriculture. Its contribution to the gross domestic product (GDP) was 15.33%, with overall growth of 7.05%, in FY 2015–2016 (Bangladesh Economic Review 2016). Because agriculture depends on rainfall, the analysis of daily rainfall data has a significant role in the development of the agricultural sector. To analyze the daily rainfall data of Bangladesh, several authors have handled missing data with simple techniques, such as omitting the missing data or replacing the missing values in a month by the average value of the same month from previous and subsequent years (Kripalani et al. 1996). However, no work has been done to date to identify the best method for estimating the missing values of daily rainfall data for different stations in Bangladesh. Therefore, this study compares several missing value estimation methods and suggests a suitable method for estimating the missing values of daily rainfall data for different rainfall stations of Bangladesh.

Following this section, the study is organized as below. The daily rainfall data and the behavior of daily rainfall missing data are discussed in Sect. 2. The different methods and their comparison techniques to identify the best method for estimating the missing value of daily rainfall data for target stations are also discussed in the same section. The discussions regarding the results obtained by applying the selected methods and comparison techniques are depicted in Sect. 3. Finally, the conclusions of the study are drawn in Sect. 4.

2 Data and methods

2.1 Data

To achieve the above objective, this study considers 27 of the 35 daily rainfall recording stations of the Bangladesh Meteorological Department. Daily rainfall is measured in millimeters, and the stations record it on consecutive days. Following the geographical conditions, we consider five climatic sub-zones of Bangladesh: the south-east, north-east, mid, south-west, and north-west regions (Rashid 1991). From each climatic sub-zone, one station is taken as the target station, and the stations within 100 km of it are taken as reference stations (Tronci et al. 1986). The daily rainfall measuring stations in each sub-zone, the target and reference stations, the availability of rainfall data, and the geographical position of each station are shown in Table 2.

2.1.1 Overview of missing data for selected stations

Each rainfall station considered in the study contains some missing data. The percentages of missing data over the available years at the selected stations are displayed in Fig. 1. The percentage of missing observations in the 27 stations varies from 1.9% in Ambagan to 6.6% in Hatiya. The study applies different methods for estimating the missing values of daily rainfall observations, together with different comparison techniques, to identify the best method for each of the selected stations.

Fig. 1 Percentage of missing values (ratio of number of missing observations to total number of observations of rainfall in each station) of daily rainfall data for available years of each selected station for Bangladesh

2.1.2 Missing data mechanism

Missing data may arise from different observational behaviors. In probabilistic terms, missing data mechanisms are classified into three types: missing at random (MAR), missing completely at random (MCAR), and not missing at random (NMAR) (Rubin 1976; Schafer 1997; Little and Rubin 1987). Under MAR, the probability that a daily rainfall observation is missing depends on the observed data but not on the missing values themselves. Under MCAR, this probability depends neither on the missing values nor on the observed data; i.e., MCAR is a special case of MAR. Under NMAR, the probability that a daily rainfall observation is missing depends on the missing value itself.

To address the patterns of missing data, different authors (Dempster et al. 1977; Little and Rubin 1987; Rubin 1987; Schafer 1997; Collins et al. 2001; Graham et al. 1997) have suggested techniques such as the maximum likelihood (ML) estimation method and the multiple imputation (MI) method under the expectation-maximization (EM) algorithm based on a Bayesian framework, following the indication of Rubin (1976). Formulating a statistical model with NMAR data creates several complexities; for example, the missing data model may not be correctly specified, and the estimated parameters may contain sizable bias. Therefore, to test for the existence of the NMAR mechanism in the daily rainfall missing observations, Lo Presti et al. (2010) suggested verifying the following statements:

(i) The existence of a positive correlation between the missing data (yearly percentage of days with missing data in each station) and the elevation of the stations, and

(ii) The amount of missing data is affected by evident seasonal behavior; for instance, monsoon and autumn seasons are more rainy than summer, late autumn, and winter seasons.

To verify statement (i), the study computed the correlation coefficient between the proportion of missing data and the elevation of the corresponding station, which is negative and not significant (r = − 0.133, p value 0.507). There is thus no evidence of a positive correlation between elevation and the proportion of missing data. By this criterion, the daily rainfall missing observations for the stations of Bangladesh do not follow the NMAR mechanism. Further, to verify statement (ii), Lo Presti et al. (2010) suggested the standardized entropy (H), which is stated as below:

$$ H=-\frac{\sum_{k=1}^K\left\{\left[\ln p(k)\right]\times p(k)\right\}}{\ln K} $$
(1)

where p(k) is the proportion of missing observations to the total number of rainfall observations for a station in the kth month during the study period, K is the number of months considered, and ln K is the upper bound of the entropy. A value of H close to 1 indicates that the distribution of missing data over the study period is not affected by seasonal behavior; i.e., the NMAR hypothesis may be rejected. For instance, Table 1 shows the standardized entropy for the target station of the south-east region and its reference stations. The standardized entropy is close to 1 for these stations, which indicates that the distribution of missing rainfall observations does not follow NMAR. Similar results are observed for the other regions of the country.
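The entropy check in Eq. (1) can be illustrated with a minimal Python sketch. It assumes, as a reading not spelled out in the text, that the monthly counts of missing days are first normalized to shares p(k) that sum to 1, so that H lies between 0 and 1. The function name and the example counts are hypothetical and are not taken from Table 1.

```python
import math

def standardized_entropy(missing_per_month):
    """Standardized entropy H of the monthly distribution of missing days.

    missing_per_month: counts of missing observations, one entry per month.
    Assumption: the counts are normalized to shares p(k) summing to 1.
    """
    K = len(missing_per_month)
    total = sum(missing_per_month)
    if K < 2 or total <= 0:
        raise ValueError("need at least two months and one missing value")
    shares = [count / total for count in missing_per_month]
    # Months without missing data contribute 0, since p * ln(p) -> 0 as p -> 0.
    entropy = -sum(p * math.log(p) for p in shares if p > 0)
    return entropy / math.log(K)

# Hypothetical counts: evenly spread versus concentrated in a few months.
even = [5] * 12
seasonal = [0, 0, 0, 0, 0, 20, 25, 15, 0, 0, 0, 0]
print(round(standardized_entropy(even), 3))      # 1.0
print(round(standardized_entropy(seasonal), 3))  # clearly below 1
```

A value near 1 corresponds to missing days spread evenly over the months, whereas a strongly seasonal pattern of gaps pulls H toward 0.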

Table 1 Proportion of missing observations per month and standardized entropy for study period (2011–2014) of south-east region taking Chittagong as the target station and its reference stations

Further, Lo Presti et al. (2010) indicated that the assessment of MCAR for missing rainfall observations always depends on the efficient measurement of the rainfall amount. For the measurement of rainfall, Rubel and Hantel (1999) identified three leading sources of error: (i) wind-induced losses, (ii) wetting of the walls and evaporation from the tipping bucket, and (iii) instrumental accuracy and precision, which lead to underestimation of the actual rainfall amount. In addition, several secondary sources of error affect the measurement of rainfall, such as splash in, splash out, wind shield, and temperature (Lo Presti et al. 2010; Goodison et al. 1998).

The Bangladesh Meteorological Department (BMD) measures rainfall at each station using natural siphon rainfall recorders and Snowdon rain gauges (Chowdhury 2013). These instruments are widely used for the efficient measurement of rainfall; however, errors may still arise from, for example, the influence of other variables, instrumental failure, and limited efficiency and precision of the technician. Considering these arguments, the MCAR mechanism may not be appropriate for the missing data distribution of rainfall in the country. Moreover, Rubin (1976) and Scheffer (2002) stated that missing data of rainfall observations very rarely follow MCAR. The rejection of the MCAR hypothesis, together with the rejection of NMAR above, leads us to consider the MAR mechanism for the missing data distribution of rainfall observations in Bangladesh.

2.2 Methods

Several methods for estimating the missing values of daily rainfall observations were discussed in Sect. 1. The present study employs eight methods for estimating the missing values and compares them using several comparison measures. Daily rainfall data from 2011 to 2014 (total number of days, n = 1461) are considered for each of the five target stations. From each target station, 1% (sample size, n = 14), 5% (sample size, n = 73), and 10% (sample size, n = 146) of the non-missing observations are chosen randomly and treated as artificially created missing values. The actual values on those days are taken as the observed values. The estimation methods are then applied, and their results are compared to identify the suitable method for each target station. The random sample selection, estimation, and comparison are repeated 1000 times. Finally, the arithmetic mean of each comparison measure over the 1000 repetitions is used to choose the best missing value estimation technique.
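The repeated masking procedure described above may be sketched as follows. This is an illustrative outline rather than the code used in the study: the function names, the callback interface (`estimate`, `score`), and the synthetic series are assumptions, and truncation is assumed as the rounding rule because it reproduces the quoted sample sizes.

```python
import random

def sample_size(n_days, fraction):
    # Truncation reproduces the sizes quoted in the text (14, 73 and 146 for
    # n = 1461); the rounding rule itself is an assumption.
    return int(fraction * n_days)

def masking_experiment(observed, estimate, score, fraction,
                       repetitions=1000, seed=0):
    """Average a comparison measure over repeated random maskings.

    observed : daily values at the target station (None marks a real gap)
    estimate : hypothetical callback, estimate(masked_days) -> list of estimates
    score    : callback, score(obs_values, est_values) -> float
    """
    rng = random.Random(seed)
    candidates = [d for d, y in enumerate(observed) if y is not None]
    size = sample_size(len(observed), fraction)
    total = 0.0
    for _ in range(repetitions):
        masked = rng.sample(candidates, size)   # artificially created gaps
        obs = [observed[d] for d in masked]     # values held back as "observed"
        est = estimate(masked)
        total += score(obs, est)
    return total / repetitions

def mean_absolute_error(obs, est):
    return sum(abs(o - e) for o, e in zip(obs, est)) / len(obs)

# Synthetic series and a deliberately naive estimator, for illustration only.
rain = [float(d % 7) for d in range(1461)]
always_dry = lambda days: [0.0 for _ in days]
print(masking_experiment(rain, always_dry, mean_absolute_error, 0.05,
                         repetitions=100))
```

Fixing the seed makes the comparison between estimation methods reproducible, since every method is then scored on the same sequence of masked days.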

2.2.1 Methods of missing value estimation for daily rainfall data

The methods employed in the study for estimating the missing values of daily rainfall data are discussed in this section. Let Ymi denote the missing value on the mth day at the ith target station in the study period (2011–2014), which is to be estimated, and let Ymj denote the rainfall amount on the mth day at the jth reference station, where i = 1,2,3, …, n and j = 1,2,3, …, k.

Arithmetic average (AA) method

This method is commonly used to estimate the missing values of daily rainfall observations (Silva et al. 2007; Xia et al. 1999; Yozgatligil et al. 2013). The missing values are estimated by the arithmetic average of the concurrent observations at the reference stations that have similar features to the target station (Paulhus and Kohler 1952). The arithmetic average for estimating the missing value of mth day of ith target station is given by

$$ {\widehat{Y}}_{mi}=\frac{\sum {Y}_{mj}}{k}=\frac{1}{k}\left({Y}_{m1}+{Y}_{m2}+\dots +{Y}_{mk}\right) $$
(2)

Normal ratio (NR) method

Paulhus and Kohler (1952) proposed the method for spatial interpolation using weights, Wi. Afterwards, several authors used the method for imputing the missing value of daily rainfall data. The weights are estimated by the ratio of total annual rainfall amount for target station, Ti, to the total annual rainfall amount for each reference station, Tj. Then, the NR method is explained as (Yozgatligil et al. 2013)

$$ {\widehat{Y}}_{mi}=\frac{1}{k}\sum_{j=1}^k\frac{T_i}{T_j}{Y}_{mj}=\frac{1}{k}\left(\frac{T_i}{T_1}{Y}_{m1}+\frac{T_i}{T_2}{Y}_{m2}+\dots +\frac{T_i}{T_k}{Y}_{mk}\right) $$
(3)
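Equations (2) and (3) translate directly into a few lines of Python. The sketch below is illustrative only; the function names and the numerical values are hypothetical.

```python
def arithmetic_average(ref_values):
    """Eq. (2): mean of the concurrent reference-station values on day m."""
    return sum(ref_values) / len(ref_values)

def normal_ratio(ref_values, target_total, ref_totals):
    """Eq. (3): each reference value is rescaled by T_i / T_j, then averaged."""
    k = len(ref_values)
    scaled = [target_total / t_j * y for y, t_j in zip(ref_values, ref_totals)]
    return sum(scaled) / k

# Hypothetical day m: three reference stations report 10, 20 and 30 mm.
y_ref = [10.0, 20.0, 30.0]
print(arithmetic_average(y_ref))  # 20.0
# Hypothetical annual totals (mm) for the target and the reference stations.
print(normal_ratio(y_ref, 2000.0, [1000.0, 2000.0, 4000.0]))
```

When the target and all reference stations have the same annual total, every ratio T_i / T_j equals 1 and the NR estimate reduces to the AA estimate.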

Normal ratio method considering the weight as correlation function (NRWC)

Young (1992) modified the NR method by using a function of the correlation coefficient as the weight, instead of the ratio of the annual rainfall amount of the target station to that of the reference station, for the selected period in which the missing value exists. The weight is defined as

$$ {w}_{ij}=\kern0.5em \left[{r}_{ij}^2\left(\frac{n_{ij}-2}{1-{r}_{ij}^2}\right)\right] $$
(4)

where rij is the correlation coefficient between the ith target station and jth reference station and nij is the number of rainfall observations for measuring correlation coefficient. Then, the NRWC is defined as

$$ {\widehat{Y}}_{mi}=\frac{1}{\sum_{j=1}^k{w}_{ij}}\sum_{j=1}^k{w}_{ij}{Y}_{mj}=\frac{1}{\sum_{j=1}^k{w}_{ij}}\left({w}_{i1}{Y}_{m1}+{w}_{i2}{Y}_{m2}+\dots +{w}_{ik}{Y}_{mk}\right) $$
(5)
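A minimal sketch of Eqs. (4) and (5) is given below. The helper `pearson_r` and all numerical values are illustrative assumptions; in practice rij would be computed from the concurrent non-missing records of the two stations.

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient between two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def nrwc_weight(r, n):
    """Eq. (4): w_ij = r_ij^2 (n_ij - 2) / (1 - r_ij^2); undefined for |r| = 1."""
    return r * r * (n - 2) / (1.0 - r * r)

def nrwc_estimate(ref_values, weights):
    """Eq. (5): weighted mean of the reference-station values on day m."""
    return sum(w * y for w, y in zip(weights, ref_values)) / sum(weights)

# Hypothetical correlations of two reference stations with the target,
# each based on 1000 paired observations.
weights = [nrwc_weight(0.9, 1000), nrwc_weight(0.6, 1000)]
print(nrwc_estimate([12.0, 30.0], weights))
```

Because the weight grows rapidly as the correlation approaches 1, the better correlated reference station dominates the estimate.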

Inverse distance (ID) method

Shepard (1968) suggested the method for analyzing two-dimensional interpolation functions for irregularly spaced data. Then, various authors used this method for estimating the missing value of daily rainfall observations (Lam 1983; Tronci et al. 1986; Hubbard 1994; Xia et al. 1999; Eischeid et al. 2000). The method is explained as the weighted interpolation technique which is defined as

$$ {\widehat{Y}}_{mi}={\sum}_{j=1}^k{w}_{ij}{Y}_{mj} $$
(6)

where weight, wij is explained as:

$$ {w}_{ij}=\frac{d_{ij}^{-p}}{\sum_{j=1}^k{d}_{ij}^{-p}}\kern1.5em \mathrm{and}\kern1em \sum_{j=1}^k{w}_{ij}=1 $$
(7)

Here, p is the exponent of the inverse distance and dij is the distance from the ith target station to the neighboring jth reference station. To calculate dij, the latitude and longitude of the respective stations are converted into decimal degrees, and the distance is then computed using the Great Circle Calculator of the National Hurricane Center of the USA (National Hurricane Center of USA n.d).

The method assigns more weight to closer points when estimating the missing observations of the meteorological or hydrological variable of interest. That is, the weight decreases as the distance from the interpolated point increases. A higher value of the exponent p gives closer stations more influence on the interpolated point (Suhaila et al. 2008). Xia et al. (1999) indicated that the usual value of p ranges from 1.0 to 6.0 and that the value 2 is generally used. Thus, the study sets p = 2.
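The inverse distance weighting of Eqs. (6) and (7) may be sketched as follows. The study obtained distances from an online great-circle calculator; the haversine formula with a mean Earth radius of 6371 km is used here only as a stand-in assumption, and the function names and values are hypothetical.

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

def inverse_distance_weights(distances, p=2.0):
    """Eq. (7): weights proportional to d^(-p), normalized to sum to 1."""
    inverse = [d ** -p for d in distances]
    total = sum(inverse)
    return [v / total for v in inverse]

def inverse_distance_estimate(ref_values, distances, p=2.0):
    """Eq. (6): weighted sum of the reference-station values on day m."""
    weights = inverse_distance_weights(distances, p)
    return sum(w * y for w, y in zip(weights, ref_values))

# Hypothetical example: two reference stations at 15 km and 60 km.
print(inverse_distance_weights([15.0, 60.0]))
print(inverse_distance_estimate([40.0, 10.0], [15.0, 60.0]))
```

With p = 2, a station four times farther away receives one sixteenth of the weight before normalization, so the estimate stays close to the value at the nearest station.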

Multiple imputation using EM-MCMC method

The multiple imputation method was developed by Rubin (1976, 1978) to account for the uncertainty of missing value estimates, which arises from the insufficient measurement of sampling variability. In this method, the parameters of an appropriate model are estimated, the missing values are imputed multiple times to incorporate random variation, and the multiple imputed values are averaged. To interpolate the missing data, the Monte Carlo Markov chain method-based expectation-maximization (EM-MCMC) algorithm is employed, using a Bayesian sampling procedure, as the multiple imputation method (Tanner and Wong 1987; Schafer 1997). The method uses the information available in the sample to estimate the parameters of interest through conditional expectations. Thus, the EM algorithm provides estimates of the parameters, and the imputations are generated iteratively by the MCMC procedure (Yozgatligil et al. 2013).

Daily rainfall data with gaps contain two types of observations, non-missing and missing, written as Y = (Yoi, Ymi), where Yoi and Ymi denote the non-missing and missing values of rainfall data, respectively, on the ith day. To perform multiple imputation with the EM-MCMC algorithm in the Bayesian framework, the unknown parameter θ and Ymi are treated as random variables for statistical inference on θ (Schafer 1997). Then, the posterior predictive distribution is stated as

$$ f\left({Y}_{mi}|{Y}_{oi}\right)=\int f\left({Y}_{mi},\theta |{Y}_{oi}\right) d\theta =\int f\left({Y}_{mi}|{Y}_{oi},\theta \right)\ f\left(\theta |{Y}_{oi}\right) d\theta $$
(8)

where the functions f(Ymi| Yoi, θ) and f(θ| Yoi) denote the conditional predictive distribution of Ymi and the posterior distribution of θ given the non-missing rainfall observations, respectively. The posterior distribution, f(θ| Yoi), is obtained by augmenting Yoi with an assumed value of Ymi, in a two-step procedure (Yozgatligil et al. 2013). In the first step, at the kth iteration, the missing value Ymi is drawn from the conditional predictive distribution f(Ymi| Yoi, θ), i.e.,

$$ {Y}_{mi}^{\left(k+1\right)}\sim f\left({Y}_{mi}|{Y}_{oi},{\theta}^{\left(k\right)}\right) $$
(9)

In the second step, a new value of θ is drawn from its posterior distribution given the non-missing data and the newly imputed values, \( \left({Y}_{oi},{Y}_{mi}^{\left(k+1\right)}\right) \), i.e.,

$$ {\theta}^{\left(k+1\right)}\sim f\left(\theta |{Y}_{oi},{Y}_{mi}^{\left(k+1\right)}\right) $$
(10)

These two steps are iterated, starting from an initial value \( {\theta}^{(0)} \), and the process yields a Markov chain, i.e., \( \left({Y}_{mi}^{(1)},\kern0.5em {\theta}^{(1)}\right) \), \( \left({Y}_{mi}^{(2)},\kern0.5em {\theta}^{(2)}\right) \), \( \left({Y}_{mi}^{(3)},\kern0.5em {\theta}^{(3)}\right) \), and so on.

The distribution of these transition counts of the Markov chain provides the joint conditional distribution, f(Ymi,  θ| Yoi). If the value of parameter θ(k) satisfies the convergence of distribution, then the posterior distribution, f(θ| Yoi) , is drawn from non-missing data using this value of the parameter. Then, from the posterior predictive distribution, f(Ymi| Yoi), the \( {Y}_{mi}^{(k)} \) is considered as an appropriate selection. This method is perfectly valid, provided that the missing data of rainfall observations do not follow the NMAR mechanism (Scheffer 2002).

The whole process of multiple imputations using EM-MCMC method can be done by using PROC MI in the University Edition of SAS (Yim 2015). This study used PROC MI to make the multiple imputations of daily rainfall missing data for the target stations using the concurrent rainfall data of reference stations as covariates. The underlying distribution of the data is considered to be multivariate normal in this study.
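The alternation of the imputation step (Eq. 9) and the posterior step (Eq. 10) can be illustrated with a deliberately simplified Python sketch. It is not the PROC MI implementation used in the study: it assumes a single, fully observed reference station, a normal linear model for the target station, and a non-informative prior, and all names and data are hypothetical. Like the multivariate normal model, it can produce negative values, which a practical application would need to handle.

```python
import math
import random

def data_augmentation(y, x, iterations=200, seed=0):
    """Alternate the imputation step (Eq. 9) and the posterior step (Eq. 10).

    y : target-station values, with None marking missing days
    x : concurrent values of one reference station (fully observed)
    Assumed model: y = a + b * (x - mean(x)) + e, e ~ N(0, s2), with a
    non-informative prior on (a, b, log s2).
    Returns one completed series and the last parameter draw (a, b, s2).
    """
    rng = random.Random(seed)
    n = len(y)
    x_bar = sum(x) / n
    xc = [v - x_bar for v in x]              # centered covariate
    sxx = sum(v * v for v in xc)
    missing = [i for i, v in enumerate(y) if v is None]
    observed = [v for v in y if v is not None]
    # theta^(0): observed mean, zero slope, observed variance.
    a = sum(observed) / len(observed)
    b = 0.0
    s2 = sum((v - a) ** 2 for v in observed) / (len(observed) - 1)
    filled = list(y)
    for _ in range(iterations):
        # Imputation step: draw Y_m^(k+1) from f(Y_m | Y_o, theta^(k)).
        for i in missing:
            filled[i] = rng.gauss(a + b * xc[i], math.sqrt(s2))
        # Posterior step: draw theta^(k+1) from f(theta | Y_o, Y_m^(k+1)).
        a_hat = sum(filled) / n
        b_hat = sum(c * v for c, v in zip(xc, filled)) / sxx
        rss = sum((v - a_hat - b_hat * c) ** 2 for c, v in zip(xc, filled))
        s2 = rss / rng.gammavariate((n - 2) / 2.0, 2.0)  # rss / chi-square(n-2)
        a = rng.gauss(a_hat, math.sqrt(s2 / n))
        b = rng.gauss(b_hat, math.sqrt(s2 / sxx))
    return filled, (a, b, s2)

# Hypothetical data: the target roughly follows 2 + 3 * reference.
x = [float(i % 10) for i in range(200)]
y = [2.0 + 3.0 * v + 0.1 * ((7 * i) % 5 - 2) for i, v in enumerate(x)]
y_gappy = [None if i % 11 == 0 else v for i, v in enumerate(y)]
completed, (a, b, s2) = data_augmentation(y_gappy, x)
print(round(b, 2))  # slope draw, expected to lie near 3
```

Running the chain several times with different seeds and averaging the completed series gives a multiple-imputation estimate in the spirit of the method, while the spread between runs reflects the imputation uncertainty.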

Single best estimator (SBE) method

Various authors have employed this method to estimate the missing values of daily rainfall data (Wallis et al. 1991; Xia et al. 1999; Eischeid et al. 2000). In this method, the daily rainfall recorded at a neighboring station on the day of the missing value is taken as the estimate, where the neighboring station is the one whose data have the highest positive correlation with the target station. This is analogous to the simple substitution or closest neighboring station method (Lo Presti et al. 2010; Garcia et al. 2006). The study selects the neighboring station at the minimum distance from the target station, because in the present data the closest station also has the highest positive correlation with the target station among the neighboring stations. For instance, in the mid region, Faridpur is the closest station to the target station Dhaka (distance 57 km), and in the south-east region, Ambagan is the closest station to the target station Chittagong (distance 15 km) (Table 2). The distance measurement procedure is described under the inverse distance method.

Table 2 Classification of 27 selected stations according to climatic sub-zones with geographic position and data availability

Linear regression (LR) method

To formulate the linear regression method for estimating the missing data of daily rainfall occurrences, the study considers the following estimated form (Dumedah and Coulibaly 2011; Xia et al. 1999):

$$ {\hat{Y}}_{mi}=\hat{\alpha}+\hat{\beta}{X}_{mj},\kern1.5em i=1,2,\dots, n $$
(11)

where \( {\widehat{Y}}_{mi} \) is the estimated value of the missing rainfall observation on the mth day for the ith target station and Xmj is the rainfall observed on the mth day at the closest reference station j, i.e., the neighboring station at the minimum distance from the target station. Here, \( \widehat{\alpha} \) and \( \widehat{\beta} \) are the least squares estimates of the parameters of the simple linear regression model. To estimate α and β, the daily rainfall observations of the ith target station and of the closest jth reference station are taken as the dependent and independent variables, respectively.

Multiple regression (MR) method

Kemp et al. (1983), Tabony (1983), Young (1992), and Eischeid et al. (1995) explained different facilities of the regression model for data interpolation and missing data estimation. Following their suggestions, Xia et al. (1999) indicated multiple regression method for estimating the missing value of daily rainfall occurrences. Therefore, for estimating the missing value of daily rainfall occurrences, the study considers the following estimated multiple regression model as an interpolation method:

$$ {\widehat{Y}}_{mi}=\widehat{\alpha}+{\sum}_{j=1}^k{\widehat{\beta}}_j{X}_{mj},\kern0.75em i=1,2,\dots, n $$
(12)

where \( {\widehat{Y}}_{mi} \) is the estimated rainfall on the mth day at the ith target station and Xmj is the observation on the mth day at the jth reference station (j = 1,2,3,...,k, where k is the number of reference stations of station i). Here, \( \widehat{\alpha} \) and \( {\widehat{\beta}}_j \) are the least squares estimates of the parameters of the multiple regression model. To estimate α and βj, the daily rainfall observations of the ith target station and of the k reference stations are taken as the dependent and independent variables, respectively.
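The least squares fits behind Eqs. (11) and (12) can be sketched in plain Python by solving the normal equations; the LR method is the special case with a single reference station. The function names and data below are hypothetical, and a statistical package would normally be used instead of this hand-written solver.

```python
def fit_least_squares(X, y):
    """Least squares fit of y = alpha + sum_j beta_j * X[.][j].

    X : list of rows, one per day, each holding the k reference-station values
    y : list of target-station values on the same days
    Returns [alpha, beta_1, ..., beta_k]; fails if the columns are collinear.
    """
    rows = [[1.0] + list(r) for r in X]      # prepend the intercept column
    p = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(r[a] * r[b] for r in rows) for b in range(p)]
         + [sum(r[a] * t for r, t in zip(rows, y))] for a in range(p)]
    for col in range(p):                     # Gauss-Jordan, partial pivoting
        pivot = max(range(col, p), key=lambda i: abs(A[i][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for i in range(p):
            if i != col:
                factor = A[i][col] / A[col][col]
                A[i] = [u - factor * v for u, v in zip(A[i], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

def predict(coef, x_row):
    """Estimated target value for one day from its reference-station values."""
    return coef[0] + sum(b * x for b, x in zip(coef[1:], x_row))

# Hypothetical training days with two reference stations.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0, 8.0]                # generated as 1 + 2*x1 + 3*x2
coef = fit_least_squares(X, y)
print([round(c, 6) for c in coef])
print(round(predict(coef, [3.0, 2.0]), 6))
```

The coefficients are fitted on days when both the target and the reference stations have data and are then applied to the reference values on the days with gaps.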

2.2.2 Techniques of comparison for the missing value estimation methods

To assess the agreement between observed and estimated values, the following comparison criteria are considered in the study. To calculate each criterion, the study first treats a randomly selected portion of the observed daily rainfall data of the target station as missing and then estimates these values with the different missing value estimation techniques. The estimated values are taken as the expected values \( \left({Y}_i^{\mathrm{est}}\right) \) and are compared with the observed values \( \left({Y}_i^{\mathrm{obs}}\right) \). Here, i (i = 1, 2, …, n) indexes the sample observations.

Kolmogorov-Smirnov (K-S) test

The Kolmogorov-Smirnov goodness-of-fit test is used to determine whether a method provides good estimates of the missing values (Massey 1951; Wilks 1995; Simolo et al. 2010). It is a non-parametric test based on cumulative frequency distribution functions. Let x be any specific value of daily rainfall, Fn(x) the cumulative relative frequency of the observed daily rainfall distribution, and Sn(x) the cumulative relative frequency of the estimated daily rainfall distribution. Then, the Kolmogorov-Smirnov test statistic for goodness of fit is defined as

$$ {D}_n(x)=\underset{x}{\max}\left|{F}_n(x)-{S}_n(x)\right| $$
(13)

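If the p value of the above statistic is large, the distribution of the estimated values does not differ significantly from that of the observed values; i.e., the estimated daily rainfall observations provide a good fit to the observed rainfall observations.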
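The statistic in Eq. (13) may be computed with a short Python sketch based on the two empirical distribution functions. The function name and samples are hypothetical; the sketch returns only the statistic, and the p value would have to be obtained separately, for example with `scipy.stats.ks_2samp`, which is not used here.

```python
import bisect

def ks_statistic(observed, estimated):
    """Eq. (13): largest absolute gap between the two empirical CDFs."""
    obs, est = sorted(observed), sorted(estimated)
    largest = 0.0
    # The gap can only change at values that occur in one of the samples.
    for x in sorted(set(obs) | set(est)):
        f_n = bisect.bisect_right(obs, x) / len(obs)   # share of obs <= x
        s_n = bisect.bisect_right(est, x) / len(est)   # share of est <= x
        largest = max(largest, abs(f_n - s_n))
    return largest

# Hypothetical observed and estimated rainfall amounts (mm).
print(ks_statistic([0.0, 0.0, 5.0, 12.0], [0.0, 1.0, 4.0, 12.0]))  # 0.25
```

A statistic near 0 means the estimated values reproduce the distribution of the observed values, including the share of dry days.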

Bias or mean of error (ME)

In statistics, bias is the difference between an estimator’s expected value and the true value of the parameter; a bias of 0 (zero) indicates unbiased estimation (Walther and Moore 2005). In this study, the error is the difference between the observed value of daily rainfall amount \( \left({Y}_i^{\mathrm{obs}}\right) \) and the corresponding estimated value of the daily rainfall missing observation \( \left({Y}_i^{\mathrm{est}}\right) \). Then, the mean of the errors indicates the bias of the estimate, which is stated as (Simolo et al. 2010)

$$ ME={n}^{-1}{\sum}_{i=1}^{\mathrm{n}}{\varepsilon}_i,\kern0.75em where\kern0.5em {\varepsilon}_i={Y}_i^{obs.}-{Y}_i^{est.} $$
(14)

The bias is calculated for all estimation methods, and the method whose bias is closest to zero is considered the best.

Mean absolute error (MAE)

Mean absolute error is computed as the mean of the absolute differences between the observed values and the estimated missing values of daily rainfall data. The estimation method with the lowest MAE value is considered the best (Suhaila et al. 2008). It is defined as

$$ MAE={n}^{-1}{\sum}_{i=1}^n\left|{\varepsilon}_i\right|\kern0.5em ,\kern0.5em where\kern0.75em {\varepsilon}_i={Y}_i^{obs.}-{Y}_i^{est.} $$
(15)

Root-mean-square error (RMSE)

The RMSE is frequently used to measure the difference between the values (sample and population values) predicted by a model or an estimator and the values actually observed (Li and Zhao 2001; Chai and Draxler 2014). This measure is also used to compare the different estimating techniques or methods for identification of the best method. The method with the lowest value of RMSE indicates the best method. The study considers RMSE to measure the best technique or method using the difference between the observed values \( \left({Y}_i^{\mathrm{obs}}\right) \) of daily rainfall data and estimated values \( \left({Y}_i^{\mathrm{est}}\right) \) of daily rainfall missing data (Simolo et al. 2010). The measurement formula for RMSE is given below:

$$ RMSE=\sqrt{n^{-1}{\sum}_{i=1}^n{\varepsilon}_i^2},\kern0.75em where\kern1em {\varepsilon}_i={Y}_i^{obs.}-{Y}_i^{est.} $$
(16)

Coefficient of variation of root-mean-square error (CVRMSE)

RMSE is commonly used as a scale-dependent measure of accuracy for assessing forecasting performance on time series data. To eliminate the scale dependency of the comparison criterion, Yozgatligil et al. (2013) suggested the CVRMSE, which is the RMSE divided by the mean of the actual (observed) values. To compare missing value estimation techniques, the RMSE is divided by the mean of the observed daily rainfall data for the artificially created missing period:

$$ \mathrm{CVRMSE}=\frac{\mathrm{RMSE}}{{\overline{\mathrm{Y}}}^{\mathrm{obs}.}}\kern0.5em ,\kern1.25em \mathrm{where}\kern0.75em {\overline{\mathrm{Y}}}^{\mathrm{obs}.}={\mathrm{n}}^{-1}{\sum}_{\mathrm{i}}^{\mathrm{n}}{\mathrm{Y}}_{\mathrm{i}}^{\mathrm{obs}.} $$
(17)

Minimum CVRMSE suggests the minimum percentage of variation between observed values and estimated values of missing data for daily rainfall occurrences. So, the method with the minimum CVRMSE is considered as the best.

Standard deviation of error (ESD)

The standard deviation of error (difference between the observed and estimated value) indicates the fluctuations of the deviations. The minimum ESD is used as the criterion to identify the best technique for estimating the missing value (Silva et al. 2007). Then, it is defined as

$$ ESD=\sqrt{{\left(n-1\right)}^{-1}{\sum}_{i=1}^n{\left({\varepsilon}_i-\overline{\varepsilon}\right)}^2},\kern0.5em where\kern0.5em \overline{\varepsilon}={n}^{-1}{\sum}_{i=1}^n{\varepsilon}_i\kern0.5em and\kern0.5em {\varepsilon}_i={Y}_i^{obs.}-{Y}_i^{est.} $$
(18)
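The error-based criteria in Eqs. (14) to (18) share the same error term and can be computed together, as in the following sketch. The function name and sample values are hypothetical.

```python
import math

def error_measures(obs, est):
    """ME, MAE, RMSE, CVRMSE and ESD for paired observed/estimated values."""
    n = len(obs)
    errors = [o - e for o, e in zip(obs, est)]        # epsilon_i = obs - est
    me = sum(errors) / n                              # Eq. (14)
    mae = sum(abs(e) for e in errors) / n             # Eq. (15)
    rmse = math.sqrt(sum(e * e for e in errors) / n)  # Eq. (16)
    # Eq. (17); undefined when every observed value in the sample is zero,
    # which can happen for a small sample of dry days.
    cvrmse = rmse / (sum(obs) / n)
    esd = math.sqrt(sum((e - me) ** 2 for e in errors) / (n - 1))  # Eq. (18)
    return {"ME": me, "MAE": mae, "RMSE": rmse, "CVRMSE": cvrmse, "ESD": esd}

# Hypothetical observed and estimated rainfall amounts (mm).
for name, value in error_measures([2.0, 4.0, 6.0, 8.0],
                                  [1.0, 5.0, 5.0, 9.0]).items():
    print(name, round(value, 4))
```

In this example the positive and negative errors cancel, so ME is 0 although MAE and RMSE are both 1, which illustrates why the bias is reported alongside the other criteria rather than on its own.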

Similarity index (S index)

S index is the criterion of agreement for assessing model performance which implies the percentage of agreement between the observed and estimated values. The values of S index lie between 0.0 and 1.0, where 0.0 indicates complete disagreement and 1.0 indicates perfect agreement (Wilmott 1981). The S index is used to find out the best missing value estimation technique for rainfall data (Suhaila et al. 2008). The S index is stated below:

$$ S\ \mathrm{index}=1-\frac{\sum_{i=1}^n{\left({Y}_i^{obs.}-{Y}_i^{est.}\right)}^2}{\sum_{i=1}^n{\left(\left|{Y}_i^{obs.}-\overline{Y}\right|+\left|{Y}_i^{est.}-\overline{Y}\right|\right)}^2} $$
(19)

where \( \overline{Y} \) is the mean of observed daily rainfall and n is the number of paired observed and estimated values.
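A minimal sketch of Eq. 19 follows, with a hypothetical function name and made-up rainfall values. The denominator is zero only when every observed and estimated value equals the observed mean, in which case the index is undefined and the sketch raises an error.

```python
def s_index(observed, estimated):
    """Similarity index of Eq. 19; 1.0 means perfect agreement."""
    if len(observed) != len(estimated) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    numerator = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    denominator = sum(
        (abs(o - mean_obs) + abs(e - mean_obs)) ** 2
        for o, e in zip(observed, estimated)
    )
    if denominator == 0:
        raise ValueError("all values equal the observed mean; S index is undefined")
    return 1.0 - numerator / denominator


# Hypothetical daily rainfall (mm) on four artificially masked days.
obs = [0.0, 4.0, 10.0, 2.0]
est = [1.0, 3.0, 8.0, 4.0]
print(round(s_index(obs, est), 4))  # 0.9351
print(s_index(obs, obs))            # 1.0 (perfect agreement)
```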

3 Results and discussions

The methods for estimating missing daily rainfall values and the criteria for comparing them were described in the previous section. Sect. 2 also describes the data used in the study, including the classification of the 27 selected stations into five climatic sub-zones and the selection of target and reference stations within each sub-zone, and explains that the missing data at these stations follow the MAR mechanism. This section presents the estimates of missing daily rainfall at the five target stations under the different methods, together with the comparison criteria used to identify the most suitable method for each station, followed by a comparison of the present study with similar studies conducted in other parts of the world.

The results of the comparison criteria for the missing value estimation techniques at the target stations Sylhet (north-east region), Chittagong (south-east region), Dhaka (mid region), Khulna (south-west region), and Rajshahi (north-west region) are reported in Tables 3, 4, 5, 6, and 7, respectively. For each target station, the correlation coefficient of daily rainfall with its nearest reference station is higher than that with any other reference station. For example, the distance between target station Chittagong and reference station Ambagan is the smallest (15 km), and their correlation coefficient is 0.91559, which is statistically significant (Table 2).

Table 3 Results of comparison measures for the missing value estimation techniques applied to estimate 1, 5, and 10% missing values in north-east region
Table 4 Results of comparison measures for the missing value estimation techniques applied to estimate 1, 5, and 10% missing values in south-east region
Table 5 Results of comparison measures for the missing value estimation techniques applied to estimate 1, 5, and 10% missing values in mid-region
Table 6 Results of comparison measures for the missing value estimation techniques applied to estimate 1, 5, and 10% missing values in south-west region
Table 7 Results of comparison measures for the missing value estimation techniques applied to estimate 1, 5, and 10% missing values in north-west region

In Fig. 2, box plots are shown for all the stations in each of the five regions, based on n = 14, 75, and 146 observations that were randomly selected and set as missing, corresponding to 1, 5, and 10% missing data, respectively. Each row of the figure shows the box plots for one region at the three sample sizes (e.g., row 1 shows the box plots for the stations in the south-east region for 14, 75, and 146 observations), and each column shows the box plots for the different regions at the same sample size (e.g., column 2 shows the box plots for the stations of each region with 75 observations). The box plots in column 1 are therefore expected to show fewer outliers than those in columns 2 and 3, because they are based on the smallest sample. The same behavior is seen within each region across sample sizes: the number of outliers at each station increases with the sample size (e.g., for n = 14, 75, and 146, the numbers of outliers are 3, 10, and 30 at Dhaka station in the mid region and 2, 13, and 26 at Sylhet station in the north-east region). Stations of different regions can be compared by looking down a single column, that is, at a fixed sample size. In column 2 (n = 75), for the south-east region, every station has a considerable number of outliers and the rainfall observations are right skewed (the median is zero at all stations). Similar patterns can be observed for the north-east, mid, and south-west regions.
There is one extreme station in the south-west region, Satkhira, for which the third quartile is also very small (Q3 for Satkhira = 1). This may be a result of the random choice of observations: a different sample would give different box plots, but the right-skewed pattern remains the same for all combinations of observations. The same explanation applies to the stations of the north-west region, which have very low third quartiles (Q3 for Rajshahi = 0, Q3 for Ishwardi = 1, and Q3 for Chuadanga = 2). The presence of outliers at the stations in columns 1 and 3 can be explained similarly. It should be kept in mind that these box plots represent the actual rainfall on the days treated as missing in the present study; they are not representative of the whole data set. The box plots therefore cannot be used to assess geographic variation among the stations. They are presented only to help in assessing the performance of the missing value estimation techniques applied to estimate these observations.
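The quantities read off the box plots (median, third quartile, number of outliers) can be reproduced with a short sketch. It assumes the common 1.5 × IQR whisker rule for flagging outliers and linearly interpolated quartiles; the study does not state which conventions its plotting software used, and the sample of 14 values is hypothetical.

```python
import statistics


def boxplot_summary(values, whisker=1.5):
    """Quartiles and outlier count under the (assumed) 1.5 * IQR whisker rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "n_outliers": len(outliers)}


# Hypothetical daily rainfall (mm) on 14 masked days: mostly dry, a few wet days.
sample = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 5, 12, 40]
summary = boxplot_summary(sample)
print(summary["median"], summary["q3"], summary["n_outliers"])  # 0.0 2.75 2
```

Under this rule, a station whose third quartile is 0 has an interquartile range of 0, so every day with any rainfall is flagged as an outlier. This is consistent with the many outliers seen for right-skewed, mostly dry samples.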

Fig. 2 Box plots for daily rainfall data of five regions of Bangladesh (vertical axes indicate the amount (mm) of daily rainfall, and horizontal axes indicate the names of stations)

3.1 North-east region

Only one reference station (Srimangal) is identified for the target station Sylhet, which has a very high elevation (Table 2). With a single reference station, only the EM-MCMC, SBE, and LR methods are applicable for estimating the missing values of daily rainfall data. Among these, SBE for 1, 5, and 10% missing data and EM-MCMC for 1% missing data provide a good fit according to the KS test. The efficiency measure CVRMSE gives similar values (around 2.29) for the SBE and EM-MCMC methods. The SBE method gives the highest S index among the methods for 1, 5, and 10% missing values (Table 3). The correlation coefficient between the target and reference stations for daily rainfall data is very low (0.3094), owing to the long distance (68 km) between them (Table 2). With such a weak relationship, the EM-MCMC and LR methods did not perform well. Therefore, the SBE method is the most suitable for estimating the missing values of daily rainfall data at Sylhet station.

3.2 South-east region

For this region, nine rainfall stations surrounding the target station, Chittagong, are identified as reference stations (Table 2). The Kolmogorov-Smirnov goodness-of-fit test gives satisfactory results for all missing value estimation methods except the regression, ID, and EM-MCMC methods for 5 and 10% missing data. The bias of the estimated missing values is small for all the fitted methods other than the ID and MR methods. The S index also indicates good performance for all methods except the ID method (Table 4).

The box plots of the 1, 5, and 10% daily rainfall data for the target and reference stations in this region show some outliers at the reference stations (Fig. 2). At these stations, daily rainfall varies widely because of large differences in elevation (Table 2). The box plots also suggest that the pairwise correlations between the daily rainfall observations of the reference stations may be moderate (Fig. 2), so the regression models may not provide a good fit for estimating missing values. The ID method does not give satisfactory results in this region because the distances between the target and the individual reference stations vary considerably (Table 2). Therefore, four methods (AA, NR, NRWC, and SBE) performed satisfactorily in estimating the missing values of daily rainfall data at Chittagong station.

3.3 Mid region

In this region, five reference stations neighboring the target station Dhaka are identified. Faridpur is the nearest reference station to the target station (57 km), and the elevations of the reference and target stations are almost the same, except for Chandpur station (Table 2). The KS test indicates a good fit for all methods except LR and MR, AA (for 10% missing data), NR (for 10% missing data), ID (for 5 and 10% missing data), and EM-MCMC (for 5% missing data). The EM-MCMC method gives higher RMSE, MAE, and ESD than the other methods. However, the bias of the estimates is lowest for the SBE method, and the S indices are close to 1 for the AA, NR, NRWC, and SBE methods (Table 5).

The box plots for the 1, 5, and 10% daily rainfall data show outliers at every station (Fig. 2). The correlation coefficient of daily rainfall between the target station Dhaka and each of the reference stations except Faridpur is around 0.45. With such a weak relationship, the LR and MR methods may not provide a good fit. For Dhaka and Faridpur, the correlation is 0.603. Given this relationship, the SBE method can be considered the best estimator of the missing values of rainfall data at Dhaka station, on the basis of its lowest bias and its higher S index compared with all other methods.

3.4 South-west region

For this region, five stations surrounding the target station Khulna are identified as reference stations. The elevations of these stations are almost the same (around 2.1 m). The nearest station to the target station is Mongla (35 km) (Table 2). The AA, NR, NRWC, EM-MCMC, and SBE methods show a good fit according to the KS test. The bias and MAE of the estimates are lower for the AA method, and the CVRMSE is lower for the EM-MCMC method, than for the other methods. The S index is highest (close to 1) for the EM-MCMC method (Table 6).

The box plots for daily rainfall observations of the south-west region show a large number of outliers at all stations (Fig. 2). For this reason, the regression methods do not work well for estimating missing daily rainfall data. Furthermore, the ID method does not provide a good fit because of the long distances between the target and reference stations. Therefore, the EM-MCMC method is found to be the best estimator of the missing values of daily rainfall data at Khulna station.

3.5 North-west region

For this region, two rainfall stations are identified as reference stations for the target station Rajshahi. Ishwardi is the nearest reference station to the target station. The correlation coefficient between the target and its nearest reference station for daily rainfall data is 0.508 (Table 2). The AA, NR, NRWC, and SBE methods provide a good fit according to the KS test. The bias of the estimates is lowest for the AA and SBE methods, and the CVRMSE is lowest for the AA method. The S index is almost the same (around 0.65) for the AA and SBE methods (Table 7).

The box plots indicate high variation among the stations' rainfall data in this region (Fig. 2); because of this, the LR, MR, ID, and EM-MCMC methods do not give satisfactory results in terms of the comparison criteria. In addition, because of the long distances from the target to the reference stations (Table 2), the ID method does not perform adequately. Therefore, the AA and SBE methods provide a good fit, with the lowest bias and high S index values, for estimating the missing values of daily rainfall data at Rajshahi station.

3.6 Comparison with other similar studies

The present study was conducted to suggest a suitable method for estimating the missing values in daily rainfall data in Bangladesh. It employed eight methods found in the literature and compared their performance using seven techniques. To the best of our knowledge, this is the first study to date that attempts to find the appropriate missing value estimation technique for Bangladesh. However, this study was inspired by similar studies conducted in other parts of the world. For instance, there have been studies to find the best method for estimating missing values in Turkish meteorological data (Yozgatligil et al. 2013), daily precipitation data from Brazil (Ferrari and Ozaki 2014), and rainfall data from Malaysia (Suhaila et al. 2008), Italy (Lo Presti et al. 2010), and the Andes region in Venezuela (Garcia et al. 2006).

Garcia et al. (2006) performed a cluster analysis to find the two closest stations to a rainfall station and filled in the missing values of the target station from those closest stations. They applied their method to daily, weekly, bi-weekly, and monthly data of 106 rainfall stations in the Andes region in Venezuela and assessed the performance of the proposed method using mean error (ME), MAE, RMSE, coefficient of correlation (r), and the Willmott agreement index (d). The authors did not compare the proposed method with any other methods.

Yozgatligil et al. (2013) suggested the EM-MCMC algorithm as the best technique for Turkish meteorological data after comparing simple and weighted arithmetic average methods, a multilayer perceptron neural network, and MCMC-based multiple imputation methods. The comparison criteria used in that study were RMSE, coefficient of variation of RMSE (CVRMSE), and the correlation dimension technique (a branch of nonlinear dynamic time series analysis).

Ferrari and Ozaki (2014) compared the nearest neighbor method, the inverse distance-weighted ratio method, and the linear regression method for imputing missing values in precipitation data from the state of Parana in southern Brazil, according to the value of RMSE. The authors found the inverse distance-weighted ratio method to be the most appropriate for imputing missing precipitation data from 484 stations in the area of interest.

Silva et al. (2007) compared the arithmetic mean, normal ratio, and inverse distance methods for estimating missing rainfall data in Sri Lanka, using descriptive statistics of the error, RMSE, mean absolute percentage error, and the correlation coefficient. The authors also proposed a new method, named the aerial precipitation ratio, which selected stations representing each of the seven major ecological zones in Sri Lanka; monthly rainfall data were estimated using the abovementioned methods, taking the surrounding stations in each zone into account. The authors suggested that different methods are suitable for different zones in Sri Lanka, with no indication of a single best method.

Lo Presti et al. (2010) proposed methods to fill in the missing observations in daily rainfall data in the Candelaro River Basin (Italy) in two stages. In the first stage, the authors assessed the missingness mechanism present in the data and then applied four different regression methods (simple substitution, parametric regression, ranked regression, and the Theil method) to estimate the missing daily rainfall data. By studying the absolute error distribution, the authors indicated that the Theil method is the most suitable, although it is very complex. The simple substitution method was also considered acceptable.

The present study employed eight methods to estimate the missing rainfall data at one target station from each of the five climatic sub-zones of Bangladesh, so the methods are applicable to all other stations. Among the studies reviewed, only Silva et al. (2007) made this kind of climatic or ecological division. Before the missing daily rainfall amounts at the target stations were estimated, the missingness mechanism of the rainfall data was tested following the suggestions of Lo Presti et al. (2010). To the best of our knowledge, none of the other studies tested the missingness mechanism. The eight methods, the largest number among all the studies reported here, were chosen from the literature mentioned above on the basis of relevance, simplicity, and relative performance in other regions. Comparing such a large number of methods allowed flexibility in choosing the best method for estimating the missing data in daily rainfall observations. The seven comparison criteria used in the present study were likewise combined from the previous studies. Apart from the present study, the K-S goodness-of-fit test has been applied only once (Simolo et al. 2010) to assess the performance of missing value estimation techniques for rainfall data. The result of the K-S test contributed substantially to the choice of the most suitable method in the present article. One of the unique elements of this study was the inclusion of box plots for the selected missing observations at the target and reference stations, which helped to understand the actual conditions at different stations across Bangladesh on the days chosen to be missing; these conditions affect the performance of a particular missing value estimation technique. Although this study did not propose any new method, it integrated a wide range of methods and comparison criteria, along with some descriptive measures, for estimating the missing data in daily rainfall, which should support further scientific studies that require continuous rainfall data.

4 Conclusion

A suitable method of estimating missing rainfall values is of great interest to researchers worldwide, because occasional missing values pose a formidable difficulty in using long rainfall series. In the present paper, different methods have been compared in order to suggest the best possible choice under specific criteria. Although the focus of the paper is Bangladesh, the statistical criteria used in this study can be generalized on the basis of the underlying statistical reasoning highlighted here. To estimate the missing values of daily rainfall observations at the target stations of the five climatic regions of the country, eight methods and seven comparison techniques are employed to identify the most suitable method for each station. For these methods, three sets of missing daily rainfall data (1, 5, and 10%) with 1000 repetitions are considered (Sect. 2). The performance of the estimation methods according to the comparison techniques is shown in Tables 3, 4, 5, 6, and 7, and these results are discussed in Sect. 3. From the results and discussion, the following conclusions can be drawn. We have attempted to find a single method that can be suggested for all the stations in Bangladesh. To examine whether the findings of this study hold for other countries, the study can be repeated there as well. This may provide a consensus technique under the varied conditions that prevail in the nature and extent of missing values in rainfall time series.

Let us consider five of the seven measures of comparison included in this study for identifying the best estimation technique, namely, (i) the K-S test statistic, (ii) bias, (iii) RMSE, (iv) MAE, and (v) the S index. The two other measures, CVRMSE and ESD, are set aside because the similar measures RMSE and MAE are included. The Kolmogorov-Smirnov test statistic shows that, among all the estimation techniques, only SBE provides consistently acceptable estimates for all the regions. The other measures of comparison (bias, RMSE, MAE, and S index) also confirm that SBE is consistently better as a technique for estimating missing values. In some cases, the arithmetic average and EM-MCMC methods provide good estimates, along with the linear or multiple regression estimates, but the results are not consistent across regions. Garcia et al. (2006) observed the closest station method to be the best for filling in the missing observations of rainfall data at different time scales in the Andes region in Venezuela. Lo Presti et al. (2010) stated that the simple substitution method, which is the same as the SBE described in the present study, is an acceptable technique of missing value estimation in daily rainfall data in the Candelaro River Basin (Italy) when the similarity value is particularly high and significant. In the present study, Table 2 shows that each target station had a significant positive correlation with all its reference stations. For the SBE method, a single station is chosen for each target station: the one whose daily rainfall observations have the highest significant correlation with those of the target station, which also happens to be the station nearest to the target station (Table 2). Thus, the consistent performance of the SBE method has both statistical reasoning and practical significance. Hence, we may conclude that the single best estimator is singled out in this study as the preferred choice for estimating missing values.
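The station-selection step behind the SBE can be sketched as follows. This is only an illustration under simplifying assumptions: the reference station is picked by the highest Pearson correlation over the days on which both series are observed, the significance test mentioned in the text is omitted, and the station labels and rainfall values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no usable correlation
    return sxy / math.sqrt(sxx * syy)


def sbe_fill(target, references):
    """Fill None entries of `target` from the most correlated reference station.

    `references` maps a station label to a series aligned day by day with
    `target`. Returns the chosen label and the filled series. A gap stays
    None if the chosen reference is also missing on that day.
    """
    def corr_with_target(series):
        pairs = [(t, r) for t, r in zip(target, series)
                 if t is not None and r is not None]
        if len(pairs) < 2:
            return float("-inf")  # too few common days to judge the station
        return pearson([t for t, _ in pairs], [r for _, r in pairs])

    best = max(references, key=lambda name: corr_with_target(references[name]))
    filled = [references[best][i] if t is None else t
              for i, t in enumerate(target)]
    return best, filled


# Hypothetical daily rainfall (mm); None marks a missing day at the target.
target = [0.0, 5.0, None, 12.0, 0.0, None, 3.0]
references = {
    "ref_A": [0.0, 6.0, 9.0, 11.0, 1.0, 2.0, 4.0],
    "ref_B": [3.0, 0.0, 7.0, 2.0, 8.0, 0.0, 1.0],
}
best, filled = sbe_fill(target, references)
print(best)    # ref_A
print(filled)  # [0.0, 5.0, 9.0, 12.0, 0.0, 2.0, 3.0]
```

In this toy example ref_A tracks the target closely while ref_B does not, so the two gaps are filled with the ref_A values for those days.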