1 Introduction

Having complete and reliable time series of meteorological inputs is one of the most basic requirements for efficient environmental monitoring and modelling. However, weather stations frequently malfunction during the monitoring period (Tardivo and Berti 2012). This happens for many reasons, such as improper data registration, absence of the observer, destruction of gauging devices, power failure and elimination of incorrect data. Uninterrupted precipitation data are critical for climate and hydrological studies, especially in cases such as the frequency analysis of extreme events (e.g., floods and droughts) and the design of water systems. Breaks in the records of data acquisition systems are more serious in developing countries, where most hydrological data series are full of gaps (e.g., Dastorani et al. 2009). Thus, in order to employ these data more efficiently, the missing observations need to be reconstructed.

Reliable estimation of missing climatological data has been a topic of interest for meteorologists, hydrologists and environmental managers all over the world (e.g., Schneider 2001; Jeffrey et al. 2001; Geerts 2003; Kashani and Dinpashoh 2012; Li et al. 2013). The imputation of missing values is typically based on measurements for the same location (within-station) when data are available or on observations from nearby stations (between-station) for the same day (Kim and Pachepsky 2010). Short breaks in weather records can be reconstructed by simple within-station approaches, such as interpolation between available observations or moving averages (Kemp et al. 1983), or by using records before and after the date of missing observations in a nonlinear regression (Acock and Pachepsky 2000). However, measurements from neighboring stations have been used far more often for filling data gaps, and they also appear to give more accurate results. In this regard, Tardivo and Berti (2012) have pointed out that “when the length of the missing period increases, between-station approaches, considering the specific variability of climate in the period to be reconstructed, tend to give better results.” Regarding the application of between-station approaches, Kim and Pachepsky (2010) have referred to studies that use the geometrical distances between stations to take the values from the nearest station (Xia et al. 1999), the arithmetic averaging of observations from several adjacent stations (Willmott et al. 1994), linear regressions (LRs) to estimate the missing value as a median of the distribution of regression outputs (Ramos-Calzado et al. 2008) and distance-weighted techniques (Teegavarapu and Chandramouli 2005).

Many researchers have assessed between-station approaches for reconstructing missing meteorological and hydrological measurements in different climates of Iran. Sadatinejad (1997) investigated the reference station, normal ratio (NR), geographical coordinate (GC), simple and multiple LR and autoregressive methods for the completion of annual precipitation gaps in Esfahan province. He suggested the NR method for arid and Mediterranean climates and the LR method for semi-arid climates as the best methods. Lookzadeh (2004) compared the NR, inverse squared distance, simple and multiple LR and geostatistical techniques for filling gaps in monthly, seasonal and yearly precipitation data in the central Alborz region. The results showed that the NR method is the best method in 69.2 % of cases; however, this method gives overestimates and underestimates in wet and dry years, respectively. Sadatinejad et al. (2009) examined the NR, inverse squared distance, GC, simple and multiple LR, autoregressive, artificial neural network (ANN) and fuzzy regression methods for reconstructing missing monthly discharge measurements of the great Karoon Basin and suggested ANN as the most accurate method. In order to evaluate the performance of ANN and ANFIS models in filling breaks in flow data, Dastorani et al. (2009) also employed two traditionally used methods (the NR and the correlation methods) and concluded that all four techniques presented acceptable predictions; however, the ANFIS technique indicated a superior ability to predict missing flow values, especially in arid land stations with variable and heterogeneous data. Khorsandi et al. (2011) compared the ANN, NR, inverse distance weighting (IDW), and GC methods for filling in missing monthly precipitation records at three stations around the city of Esfahan. They suggested ANN as the most efficient approach in comparison with the other three methods.
Kashani and Dinpashoh (2012) investigated the capabilities of 11 different traditional and data-driven approaches to identify the best gap-filling techniques for hydro-meteorological observations in three distinct climates of Iran. Their results demonstrated the high importance of choosing the appropriate method for completing climatological datasets in Iran as well as other arid and semi-arid regions.

Finally, it is noteworthy that meteorological data in developing countries are full of gaps, and there is a pressing need to propose gap-filling techniques with no detailed requirements (i.e., appropriate for most parts of developing countries) and, of course, with satisfactory outcomes. The main objective of this work is to modify the GC method (Mahdavi 1998; Khorsandi et al. 2011) for reconstructing missing annual precipitation measurements based on a procedure that is applicable for regions having no detailed data. Furthermore, the improved methodology is carried out for different sites of Iran, and results are compared with estimates of the conventional GC, the NR, and the LR approaches to evaluate the efficiency of the modified GC technique and to study its suitability in various climatic and geographical conditions of Iran.

2 Materials and methods

2.1 Data description

In this study, 24 precipitation gauge stations located in different parts of Iran (Fig. 1) were selected and organized into six groups according to geographical proximity and climatic similarities. These gauges are under the supervision of the Iran Water Resources Management Company. According to the company's web portal, it acts as an agency of the Ministry of Energy to enforce the Law of Fair Distribution of Water and other rules and regulations related to water, including the management and control of water resources operations, and also to collect, prepare, provide and analyze basic information required for studying the quality and quantity of Iran's water resources (http://www.wrm.ir/english/tabid/198/Default.aspx). Because of their consistency and relevance, the data provided by the company have been used in many water resources management projects and research studies in Iran. Nevertheless, the homogeneity of the precipitation data series at the selected stations was examined with the runs test, and homogeneity was accepted at the 95 % confidence level.

Fig. 1 General location of studied regions in the map of Iran

In each group, every station was treated in turn as the target station, with the remaining stations serving as neighboring stations. Then, a portion (about 15 %) of the annual precipitation measurements at the target station was omitted, and the omitted values were estimated using different “between-station” methods.

The general properties of stations and the monitoring period (i.e., common record-length) for each group are presented in Table 1. The De Martonne aridity index was applied to determine climate types. As shown in Table 1, six groups of stations cover a wide variety of climatic conditions ranging from arid to humid.

Table 1 General characteristics of stations arranged in six groups

2.2 Estimation of missing precipitation data

2.2.1 Normal ratio method

This traditional statistical pattern recognition method utilizes the normal precipitation (Chow 1964; Linsley et al. 1988) of the target station and nearby stations (usually three) to reconstruct missing values.

The data recorded at index (neighboring) stations are weighted by the ratios of the normal annual precipitation values. Thus, the formula is given as (Abebe et al. 2000):

$$ {P}_X=\left(\frac{1}{3}\right)\left(\frac{N_X}{N_A}{P}_A+\frac{N_X}{N_B}{P}_B+\frac{N_X}{N_C}{P}_C\right) $$
(1)

where X is the target station, A, B and C are index stations, \( P_X \) is the missing precipitation record at the target station to be estimated, and P and N represent precipitation and normal precipitation (mean of index period) values, respectively.
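As an illustration of Eq. 1, the following minimal Python sketch computes an NR estimate. The function name `normal_ratio` and the station values are hypothetical and are not taken from the study; the function also accepts a number of index stations other than three, which goes slightly beyond Eq. 1 as written.

```python
def normal_ratio(n_target, neighbors):
    """Normal ratio estimate of a missing record (Eq. 1).

    n_target  -- normal (index-period mean) precipitation at the target station
    neighbors -- list of (normal precipitation, precipitation in the missing
                 year) pairs, one pair per index station
    """
    if not neighbors:
        raise ValueError("at least one index station is required")
    # Each index record is scaled by N_X / N_i, then the results are averaged.
    scaled = [n_target / n_i * p_i for n_i, p_i in neighbors]
    return sum(scaled) / len(scaled)


# Hypothetical values in mm (assumed for illustration only).
estimate = normal_ratio(400.0, [(500.0, 450.0), (320.0, 300.0), (400.0, 380.0)])
print(round(estimate, 1))
```

With these assumed inputs, the three scaled records are 360, 375 and 380 mm, so the estimate is their mean.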

2.2.2 Linear regression

In this traditional common approach, the correlation between observations at the target station and each of neighboring stations is assessed using the following equation (Eq. 2):

$$ r=\frac{{\displaystyle \sum xy-\frac{{\displaystyle \sum }x{\displaystyle \sum }y}{n}}}{\sqrt{\left[{\displaystyle \sum {x}^2-\frac{{\left({\displaystyle \sum x}\right)}^2}{n}}\right]\left[{\displaystyle \sum {y}^2-\frac{{\left({\displaystyle \sum y}\right)}^2}{n}}\right]}} $$
(2)

where y is the available data series at the target station, x is the data series of the index station, and n is the number of pairs of observations.

The correlation coefficient (r) between the measurements of the target site and the index site must either be at an acceptable level according to the Fisher table (Dastorani et al. 2009) or be significant at a chosen confidence level according to the Student t-test (Eq. 3).

$$ t=\frac{r\sqrt{n-2}}{\sqrt{1-{r}^2}} $$
(3)

If the value of the test statistic (t) falls in the critical region, which is defined by the degrees of freedom (i.e., n − 2) and the desired significance level (e.g., α = 0.05), there is a statistically significant linear relationship between y and x (see Chatterjee and Hadi 2006).
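Equations 2 and 3 can be sketched in Python as follows. The function names and the short data series are hypothetical; the critical value 3.182 used in the example is the two-sided Student t value for 3 degrees of freedom at α = 0.05, taken from standard tables rather than from the study.

```python
import math


def pearson_r(x, y):
    """Correlation coefficient between two paired series (Eq. 2)."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two paired series with at least 3 observations")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt((sum_xx - sum_x ** 2 / n) * (sum_yy - sum_y ** 2 / n))
    return numerator / denominator


def t_statistic(r, n):
    """Student t statistic for testing the significance of r (Eq. 3)."""
    if abs(r) >= 1.0:
        raise ValueError("t is undefined for a perfect correlation")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r ** 2)


# Hypothetical paired series (index station x, target station y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(x, y)
t = t_statistic(r, len(x))
T_CRITICAL = 3.182  # two-sided, alpha = 0.05, n - 2 = 3 degrees of freedom
significant = abs(t) > T_CRITICAL
print(round(r, 3), round(t, 3), significant)
```

In this assumed example the correlation looks fairly high, yet it is not significant because the series is very short, which is why the test in Eq. 3 is applied in addition to inspecting r.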

In this approach, only one index station is used to fill gaps at the target station. Thus, in this study, for each group of stations, the correlation coefficient (r) between the records of the target station and each of the other sites in the same group was computed separately, and the station having the highest significant correlation with the target site was selected as the index station. The missing observations at the target station were then predicted using the following equation:

$$ Y=a+ bX $$
(4)

where Y is the missing record to be estimated and X is the record at the index site corresponding to the missing observation at the target site (Y). The values of a and b can be estimated using Eqs. 5 and 6, respectively:

$$ a=\overline{y}-b\overline{x} $$
(5)
$$ b=\frac{{\displaystyle \sum xy-\frac{{\displaystyle \sum x}{\displaystyle \sum y}}{n}}}{{\displaystyle \sum {x}^2-\frac{{\left({\displaystyle \sum x}\right)}^2}{n}}} $$
(6)

where \( \overline{y} \) and \( \overline{x} \) are mean values of the data series, respectively, at the target and the index stations.
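A minimal Python sketch of Eqs. 4 to 6 is given below. The names `fit_line` and `predict` and the data are hypothetical, and the index station is assumed to have been selected already on the basis of its correlation with the target station.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b of Y = a + bX (Eqs. 5 and 6)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two paired series with at least 2 observations")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    b = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)  # Eq. 6
    a = sum_y / n - b * sum_x / n                                 # Eq. 5
    return a, b


def predict(a, b, x_index):
    """Estimate the missing target record from the index record (Eq. 4)."""
    return a + b * x_index


# Hypothetical common-period records (index station x, target station y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
a, b = fit_line(x, y)
missing_estimate = predict(a, b, 6.0)
print(round(a, 2), round(b, 2), round(missing_estimate, 2))
```

The coefficients are fitted only on years in which both stations have records; the gap year contributes its index-station value to `predict` alone.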

It is also feasible to estimate the precipitation at the target station (Y) using the precipitation at more than one index station (i.e., \( X_1, X_2, X_3, \dots \)) through multiple linear regression (MLR). However, since the MLR relies on more than one neighboring station, it intrinsically assumes that the distance–correlation relationship is stationary over a possibly large area that covers all index stations. This assumption is not satisfactory, especially for precipitation, because of its higher spatial variation (Tardivo and Berti 2013) compared with other meteorological variables such as temperature.

2.2.3 Geographical coordinate method

This approach utilizes the geographic coordinates (GCs) of stations to determine weight coefficients for them. The target station is taken as the origin of a Cartesian coordinate system, and the surrounding stations are located in the coordinate plane using their GCs (longitude and latitude). Then, the distance of each surrounding station from the origin of the coordinate system is computed.

This technique, as well as other IDW methods, considers a more significant role for closer surrounding stations in order to fill gaps in measurement series at the target station. Thus, the higher weight coefficients will be assigned to nearer stations as follows:

$$ {W}_i=\frac{1}{X_i^2+{Y}_i^2} $$
(7)

where \( W_i \) is the weight coefficient for station i, and \( X_i \) and \( Y_i \) are the longitude and latitude of the station, respectively, measured relative to the target station at the origin.

By calculating the weight coefficients for neighboring stations, precipitation values at the target station can be reconstructed using the following equation (Mahdavi 1998; Khorsandi et al. 2011):

$$ {P}_x=\frac{{\displaystyle \sum_{i=1}^N\left({W}_i\times {P}_i\right)}}{{\displaystyle \sum_{i=1}^N{W}_i}}=\frac{W_A\cdotp {P}_A+{W}_B\cdotp {P}_B+\cdots }{W_A+{W}_B+\cdots } $$
(8)

where \( P_i \) is the precipitation record at surrounding station i (A, B, …), \( P_x \) is the missing precipitation record at the target station to be predicted and N is the number of surrounding stations.
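Equations 7 and 8 can be illustrated with the short Python sketch below. The function names, coordinates and precipitation values are hypothetical. Following the description above, longitude and latitude differences in degrees are used directly as planar coordinates, which is an assumption of this sketch rather than a statement about how the original computations were coded.

```python
def gc_weights(target, stations):
    """Inverse squared-distance weights of the GC method (Eq. 7).

    target   -- (longitude, latitude) of the target station
    stations -- list of (longitude, latitude) of the surrounding stations
    """
    lon_t, lat_t = target
    weights = []
    for lon, lat in stations:
        # Coordinates relative to the target station placed at the origin.
        x_i, y_i = lon - lon_t, lat - lat_t
        squared_distance = x_i ** 2 + y_i ** 2
        if squared_distance == 0.0:
            raise ValueError("a surrounding station coincides with the target")
        weights.append(1.0 / squared_distance)
    return weights


def gc_estimate(weights, precipitation):
    """Weighted average of the surrounding records (Eq. 8)."""
    weighted_sum = sum(w * p for w, p in zip(weights, precipitation))
    return weighted_sum / sum(weights)


# Hypothetical coordinates (degrees) and annual precipitation (mm).
target = (51.0, 32.0)
stations = [(52.0, 32.0), (51.0, 34.0), (49.0, 30.0)]
precipitation = [200.0, 400.0, 320.0]
weights = gc_weights(target, stations)
gc_value = gc_estimate(weights, precipitation)
print(weights, round(gc_value, 2))
```

In this assumed configuration the nearest station receives a weight of 1, against 0.25 and 0.125 for the other two, so the estimate stays close to the 200 mm recorded at the nearest station.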

2.3 Performance criteria

In this study, two performance evaluation criteria are employed. The Nash–Sutcliffe model efficiency coefficient (E) and the mean absolute error (MAE) are calculated by Eqs. 9 and 10, respectively.

$$ E=1-\frac{\sum_{i=1}^n{\left({O}_i-{R}_i\right)}^2}{\sum_{i=1}^n{\left({O}_i-\overline{O}\right)}^2} $$
(9)
$$ \mathrm{MAE}={n}^{-1}\sum_{i=1}^n\left|{O}_i-{R}_i\right| $$
(10)

where \( O_i \) is the observed value, \( R_i \) is the reconstructed value, \( \overline{O} \) is the average of the observed values, and n is the number of data points.

The MAE was chosen on the basis of the results obtained by Willmott and Matsuura (2005). They recognized the MAE as a more natural and less ambiguous measure of average error than the RMSE (root mean square error). Based on their findings, inter-comparisons of average model-performance error should be based on the MAE.
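Both criteria are straightforward to compute. The sketch below is a minimal Python version of Eqs. 9 and 10; the function names and the observed and reconstructed values are hypothetical.

```python
def nash_sutcliffe(observed, reconstructed):
    """Nash-Sutcliffe efficiency coefficient E (Eq. 9)."""
    mean_observed = sum(observed) / len(observed)
    squared_errors = sum((o - r) ** 2 for o, r in zip(observed, reconstructed))
    squared_deviations = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - squared_errors / squared_deviations


def mean_absolute_error(observed, reconstructed):
    """Mean absolute error, in the units of the data (Eq. 10)."""
    absolute_errors = [abs(o - r) for o, r in zip(observed, reconstructed)]
    return sum(absolute_errors) / len(absolute_errors)


# Hypothetical omitted records (mm) and their reconstructed values.
observed = [100.0, 200.0, 300.0, 400.0]
reconstructed = [110.0, 190.0, 310.0, 380.0]
efficiency = nash_sutcliffe(observed, reconstructed)
mae = mean_absolute_error(observed, reconstructed)
print(round(efficiency, 3), mae)
```

E equals 1 for a perfect reconstruction and becomes negative when the reconstructed values fit worse than simply using the mean of the observations, which is how a strongly negative average E should be read.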

3 Results and discussion

3.1 Shortcomings of the GC method

According to the GC-based estimates for annual precipitation amounts in different sites of Iran, the GC method exhibits a low accuracy. This could be due to the following shortcomings in this approach.

  (a) When the mean annual precipitation (MAP; for the index period) at the control stations differs markedly from that at the target station, the GC method cannot estimate the missing data with appropriate accuracy. For instance, if the MAP at all or most of the adjacent stations is over 500 mm while that at the target station is less than 500 mm, the amount estimated by the GC method would be more than 500 mm, which is an overestimation. Likewise, if the MAP at the control stations is less than that at the target station, the missing values would be underestimated.

  (b) According to the basic assumption of the IDW approaches stated by Nafarzadegan et al. (2012), “nearby points ought to be more closely related than distant points to the value at the interpolate location. In other words, the IDW method applies the idea that the rate of influence in specific location decreases with increasing the distance from particular points.” Accordingly, the GC method, like other IDW techniques, allocates a greater weight to the closer control stations when estimating the missing values of the target station. Hence, it is possible that an adjacent station that is more similar to the target station in altitude and climate receives a lower weight (or even the lowest weight) than the other neighboring stations.
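The bound behind shortcoming (a) can be made explicit. Because every weight in Eq. 7 is positive, Eq. 8 is a weighted average, so the estimate cannot leave the range of the records at the control stations:

$$ \underset{i}{\min }\ {P}_i\le {P}_x=\frac{\sum_{i=1}^N{W}_i\cdotp {P}_i}{\sum_{i=1}^N{W}_i}\le \underset{i}{\max }\ {P}_i $$

Consequently, in any year in which every control station records more precipitation than the target station, the conventional GC estimate is necessarily an overestimate, whatever the distances involved, and the reverse holds when every control station records less.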

3.2 Modification of the GC method

The authors have taken advantage of the altitude ratio (\( A_i \)) and the MAP in order to modify the GC method. Two formulae are suggested. The choice between them for estimating annual precipitation records depends on the physiographic (e.g., altitudes of stations) and climatic (e.g., MAP) conditions of the target region.

  1. If there is a significant correlation at the 95 % confidence level between the altitude and the MAP of the stations in a group, the following equation is proposed:

    $$ {P}_x=\frac{\sum_{i=1}^N{W}_i\cdotp {P}_i\cdotp {C}_i\cdotp {A}_i}{\sum_{i=1}^N{W}_i}=\frac{W_A\cdotp {P}_A\cdotp \frac{{\mathrm{MAP}}_{\mathrm{T}}}{{\mathrm{MAP}}_A}\cdotp \frac{\log {E}_{\mathrm{T}}}{\log {E}_A}+{W}_B\cdotp {P}_B\cdotp \frac{{\mathrm{MAP}}_{\mathrm{T}}}{{\mathrm{MAP}}_B}\cdotp \frac{\log {E}_{\mathrm{T}}}{\log {E}_B}+\cdots }{W_A+{W}_B+\cdots } $$
    (11)
    $$ {C}_i=\frac{{\mathrm{MAP}}_{\mathrm{T}}}{{\mathrm{MAP}}_i} $$
    (12)
    $$ {A}_i=\frac{ \log {E}_{\mathrm{T}}}{ \log {E}_i} $$
    (13)

    where \( P_x \) is the annual precipitation at the target station to be reconstructed for a specific year, \( P_i \) is the annual precipitation at the control station i (A, B, …), \( W_i \) is the weighting coefficient of control station i, \( \mathrm{MAP}_{\mathrm{T}} \) is the MAP at the target station during the index period, \( \mathrm{MAP}_i \) is the MAP at control station i (A, B, …) during the index period, \( E_{\mathrm{T}} \) is the elevation of the target station, \( E_i \) is the elevation of control station i (A, B, …), \( C_i \) is the MAP ratio calculated by Eq. 12, and \( A_i \) is the logarithmic ratio of altitude calculated by Eq. 13.

    When \( \mathrm{MAP}_i \) is less than \( \mathrm{MAP}_{\mathrm{T}} \), the value of \( C_i \) (the MAP ratio) exceeds 1, which increases the amounts estimated for filling the gaps, and vice versa. The logarithmic ratio of altitude (\( A_i \)) can also play a significant role in reaching more accurate gap-filling. As mentioned above, this occurs when a strong relationship exists between the MAPs of the stations under consideration and their elevations (for instance, the higher the altitude, the higher the MAP). In such circumstances, this coefficient (\( A_i \)) will increase or decrease the values estimated by the GC method for filling the precipitation gaps according to the nature of the altitude–MAP relationship in each region.

  2. If there is no acceptable correlation between the MAP and the elevation of the stations in the target area, the altitude ratio is not used in the modified equation of the GC method; thus, the following equation is suggested:

    $$ {P}_x=\frac{\sum_{i=1}^N{W}_i\cdotp {P}_i\cdotp {C}_i}{\sum_{i=1}^N{W}_i}=\frac{W_A\cdotp {P}_A\cdotp \frac{{\mathrm{MAP}}_{\mathrm{T}}}{{\mathrm{MAP}}_A}+{W}_B\cdotp {P}_B\cdotp \frac{{\mathrm{MAP}}_{\mathrm{T}}}{{\mathrm{MAP}}_B}+\cdots }{W_A+{W}_B+\cdots } $$
    (14)
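The two modified formulae can be sketched in Python as follows. The function name `mgc_estimate` and all numerical values are hypothetical. The weights are assumed to come from Eq. 7, and the decision whether the altitude ratio applies (i.e., whether the altitude–MAP correlation is significant) is assumed to have been made beforehand. Because Eq. 13 is a ratio of two logarithms, the result does not depend on the logarithm base, so the natural logarithm is used here; elevations must be positive and different from 1 in the chosen unit.

```python
import math


def mgc_estimate(weights, precipitation, map_target, map_controls,
                 elev_target=None, elev_controls=None):
    """Modified GC estimate.

    Uses Eq. 14 when no elevations are passed and Eq. 11 when both
    elev_target and elev_controls are given.
    """
    use_altitude = elev_target is not None and elev_controls is not None
    weighted_sum = 0.0
    for i, (w_i, p_i) in enumerate(zip(weights, precipitation)):
        c_i = map_target / map_controls[i]  # MAP ratio, Eq. 12
        a_i = 1.0
        if use_altitude:
            # Logarithmic altitude ratio, Eq. 13 (independent of the base).
            a_i = math.log(elev_target) / math.log(elev_controls[i])
        weighted_sum += w_i * p_i * c_i * a_i
    return weighted_sum / sum(weights)


# Hypothetical inputs: weights from Eq. 7, precipitation and MAP in mm,
# elevations in m.
weights = [1.0, 0.25]
precipitation = [200.0, 400.0]
map_controls = [250.0, 450.0]
est_eq14 = mgc_estimate(weights, precipitation, 300.0, map_controls)
est_eq11 = mgc_estimate(weights, precipitation, 300.0, map_controls,
                        elev_target=1000.0, elev_controls=[100.0, 10000.0])
print(round(est_eq14, 2), round(est_eq11, 2))
```

Note that, as in Eqs. 11 and 14, the denominator contains the weights only, so the correction factors can move the estimate outside the range of the control-station records, which the conventional GC method cannot do.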

3.3 Comparison with other methods

In the following, outcomes obtained from the modified GC method for different parts of Iran are compared with the estimates of the conventional GC, the NR, and the LR techniques to evaluate the efficiency of the improved methodology and also to study whether the suitability of each approach varies with factors like topography and climatic zone.

As mentioned earlier, 24 stations located in different parts of Iran were selected and classified into six groups according to geographical proximity and climatic similarities. To apply the modified GC approach, the appropriate equation (Eq. 11 or Eq. 14) should be chosen based on the proposed methodology. Table 2 contains the values of the altitude and the MAP for all stations and shows the preferred equation for each group of stations based on the recommended procedure for modifying the GC approach. As shown in Table 2, the correlation coefficients for the six groups span a broad range of values, between 0.07 and 0.99. This can be interpreted as reflecting the variety of geo-climatic (Eccel et al. 2012) conditions in the six selected territories.

Table 2 Selection of the suitable equation for each group of stations based on the proposed methodology for reconstructing omitted observations

Table 3 presents the performance of the considered techniques in gap-filling at each station (i.e., a target station) based on the data of the rest of the stations in the group (i.e., neighboring stations). Of the four methods, the modified GC approach provided the best results, with an average MAE of 21.57 mm and an average E of 0.86. The NR technique had an average MAE of 22.67 mm and an average E of 0.84, and the LR method had an average MAE of 25.66 mm and an average E of 0.82. The conventional GC method exhibited the worst outcomes, with an average MAE of 60.58 mm and an average E of −2.28.

Table 3 Values of the Nash–Sutcliffe efficiency coefficient (E) and MAE criterion used for showing the accuracy of estimates obtained from applied methods

It is worth emphasizing that the accuracy of a gap-filling procedure using nearby stations varies among climate variables and has been the least successful for precipitation. This is due to the highly stochastic nature of precipitation and its higher spatial gradient compared with other weather variables such as temperature (Kim and Pachepsky 2010; Tardivo and Berti 2013), and consequently, to the difficulty of formulating it in precipitation estimation algorithms (Daly et al. 1994; Thornton et al. 1997; Xia et al. 1999). Despite these limitations, it still seems feasible to identify one or possibly two satisfactory techniques (among the examined methods) for a significant number of the stations under consideration.

Using the MAE results (Table 3), a priority-setting process was performed at each station. The methods were ordered by accuracy on the basis that the method with the lowest MAE value at a given station is the best technique (i.e., the first priority) for gap-filling at that station. The resulting ranking of methods at each station according to priority scores is shown in Table 4. This table indicates that none of the examined approaches for completing precipitation gaps is dominant in all studied regions.

Table 4 Sorting of methods according to their priority rank (class) for each station

For a more rigorous analysis of the reliability of the results, the bootstrap technique (Good 1999) could be employed to evaluate the uncertainty in the performance criteria. It should be noted that the required analysis is relatively cumbersome and quite time-consuming, because the procedure needs to be repeated many times (e.g., 1,000 times) in order to provide an adequate number of bootstrapped data series for estimating the probability distribution functions of the applied accuracy measures (MAE and E). However, valuable information can still be obtained by counting how often each method attains each priority score. Thus, the number of times each gap-filling approach falls in each priority class was counted and is presented in Table 5. Based on Table 5, the proposed methodology (the modified GC approach, MGC) occurs most often in the first priority class, and the NR technique ranks second. Meanwhile, it is worth noting that the order of methods in the lowest priority class (i.e., P4) is as follows: GC, LR, NR and MGC.

Table 5 Iteration numbers of methods for each priority class
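The priority-setting and counting steps behind Tables 4 and 5 can be sketched in Python as follows. The function names and the MAE values are hypothetical and do not reproduce the tables; ties, which the text does not address, are resolved here by the order in which the methods are listed, which is an assumption of this sketch.

```python
from collections import Counter


def priority_ranks(mae_by_method):
    """Rank the methods at one station; rank 1 has the smallest MAE."""
    ordered = sorted(mae_by_method, key=mae_by_method.get)
    return {method: rank for rank, method in enumerate(ordered, start=1)}


def count_priorities(mae_by_station):
    """Count how many times each method falls in each priority class."""
    counts = {}
    for mae_by_method in mae_by_station.values():
        for method, rank in priority_ranks(mae_by_method).items():
            counts.setdefault(method, Counter())[rank] += 1
    return counts


# Hypothetical MAE values (mm) at three target stations.
mae_by_station = {
    "S1": {"MGC": 20.0, "NR": 22.0, "LR": 25.0, "GC": 60.0},
    "S2": {"MGC": 30.0, "NR": 18.0, "LR": 24.0, "GC": 55.0},
    "S3": {"MGC": 15.0, "NR": 21.0, "LR": 40.0, "GC": 35.0},
}
counts = count_priorities(mae_by_station)
for method, counter in counts.items():
    print(method, [counter[rank] for rank in range(1, 5)])
```

Each printed row lists, for one method, the number of stations at which it ranked first to fourth, which is the layout of a frequency table such as Table 5.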

As shown in Tables 4 and 5, the conventional GC method never appears in the first priority class at any station, and more interestingly, the modified GC approach never appears in the lowest priority class. This reveals that the authors, by taking advantage of two variables (\( A_i \) and MAP), have succeeded in dramatically improving the performance of an inverse-distance technique (i.e., the conventional GC) for a wide variety of climatic and topographical conditions.

4 Conclusion

The main goal of the study was to modify the GC method for completing gaps in annual precipitation datasets. For this purpose, the authors utilized the altitude ratio (\( A_i \)) and the MAP to enhance the accuracy of the GC technique. In this respect, 24 precipitation gauge stations situated in six different territories of Iran were used to investigate the performance of the proposed methodology. Furthermore, estimates of other techniques, namely the NR, the LR and the conventional GC, were assessed to examine whether the improved approach is suitable for various geo-climatic conditions.

The average values of the model error (MAE) and model efficiency (E) measures, together with the outcomes of the priority-setting process, demonstrated that, overall, the first priority method for reconstructing missing precipitation measurements in the surveyed climatic regions was the modified geographical coordinate (MGC) technique, and the second priority approach was the NR method. The conventional GC technique never appeared in the first priority class at any of the stations under consideration, and more interestingly, the modified GC method never appeared in the lowest priority class. Hence, the proposed methodology was shown to be a reliable procedure for filling gaps in precipitation records. In other words, the authors have succeeded in dramatically improving the efficiency of an inverse-distance approach (i.e., the GC method) that performs poorly in reconstructing missing precipitation observations, and in making it preferable to traditional techniques such as the LR and the NR. Additionally, the proposed modification is widely applicable because the data it requires (e.g., \( A_i \) and MAP) are usually readily available in a large array of areas, including developing countries.