Frequency based imputation of precipitation

Dikbas, Fatih

doi:10.1007/s00477-016-1356-x


Original Paper
Published: 19 November 2016

Volume 31, pages 2415–2434, (2017)

Stochastic Environmental Research and Risk Assessment


Fatih Dikbas ORCID: orcid.org/0000-0001-5779-2801¹


Abstract

Changing climate and precipitation patterns make the estimation of precipitation, which exhibits two-dimensional and sometimes chaotic behavior, more challenging. In recent decades, numerous data-driven methods have been developed and applied to estimate precipitation; however, these methods suffer from the use of one-dimensional approaches, lack generality, require the use of neighboring stations and have low sensitivity. This paper aims to implement the first generally applicable, highly sensitive two-dimensional data-driven model of precipitation. This model, named frequency based imputation (FBI), relies on non-continuous monthly precipitation time series data. It requires no determination of input parameters and no data preprocessing, and it provides multiple estimations (from the most to the least probable) of each missing data unit utilizing the series itself. A total of 34,330 monthly total precipitation observations from 70 stations in 21 basins within Turkey were used to assess the success of the method by removing and estimating observation series in annual increments. Comparisons with the expectation maximization and multiple linear regression models illustrate that the FBI method is superior in its estimation of monthly precipitation. This paper also provides a link to the software code for the FBI method.

1 Introduction

The importance of accurate and reliable modeling, estimation and forecasting of precipitation is becoming increasingly apparent as the rapid worldwide increase in population and water demand puts pressure on limited water resources and dwindling water supplies (Leconte et al. 2013; Popp et al. 2016). Accurate and reliable observations of precipitation are essential to the performance of valid hydrologic studies; yet, many precipitation records are incomplete. Complete records improve the ability of these studies to determine spatial, temporal and quantitative variations in precipitation data, which is crucial to the design of water supply systems. Changes in the water cycle and precipitation patterns, coupled with a warming climate (Hou et al. 2014; Reager and Famiglietti 2009), increase the need for stronger precipitation models (Zhang et al. 2010).

Developments in software technologies in recent decades have allowed traditional hydraulic and data-driven models to support and complement hydrologic models (Solomatine et al. 2008). Data-driven models analyze time series data, but they should not be regarded as computational methods that ignore physical processes. Determining the spatial and temporal interrelationships between precipitation time series data implicitly captures the relationships between the drivers of precipitation, because precipitation is a function of its contributing variables. Thus, the analysis of precipitation time series data implicitly takes into account all variables that contribute to precipitation (though the relationships and variations of the variables themselves are not evaluated), and the success of estimating missing data is directly related to the level of understanding of the temporal and quantitative relationships between observed data.

Though precipitation is generally seasonal, the high variability in numerous influencing factors sometimes indicates the existence of a chaotic (Jayawardena and Lai 1994; Sivakumar 2000; Sivakumar et al. 1999) and relatively random behavior. This nonstationary and sometimes erratic behavior results in distinct variations in precipitation across space and time and makes the observation, quantification, estimation and forecasting of precipitation challenging (Wang and Lin 2015). Consequently, although there are a vast number of data-driven modeling studies that estimate hydrologic processes such as streamflow (which generally occur continuously), a very limited number of studies address the data-driven estimation of missing precipitation records. Some prominent studies that have utilized data-driven methods to estimate precipitation have applied artificial neural networks (ANNs), fuzzy rule based systems (FRBSs), genetic algorithms (GAs), support vector machines (SVMs), particle swarm optimization (PSO) and expectation maximization (EM) in the computation of results.

Lack of generality and overfitting are two of the most important problems associated with existing data-driven methods, as discussed in detail by Remesan and Mathew (2015). Both issues result in model failure when the training and testing period ranges change. Unfortunately, most data-driven hydrologic modeling studies do not even mention (or test) these issues. Another problem associated with existing methods is that time series data are generally regarded as a one-dimensional vector, which obscures how the behavior of the series varies within each cycle. For example, hydrological time series generally show an annual cycle of seasonality, with values observed in the winter months varying greatly from those observed during the summer months. Instead of a one-dimensional time series, a two-dimensional matrix containing a full cycle in each row expresses this temporal hydrological variability more comprehensibly and enables the investigation of the two-dimensional behavior of time series data (Dikbas 2016b). Detailed information about the concepts, approaches, experiences and problems associated with the data-driven modeling of hydrologic variables exists in the literature (Elshorbagy et al. 2010a, b; Maier and Dandy 2000; Maier et al. 2010; Remesan and Mathew 2015; Sikorska et al. 2015; Solomatine et al. 2008; Solomatine 2006; Yozgatligil et al. 2013).

This paper discusses the implementation of the Frequency Based Imputation (FBI) method to analyze observation data from 70 precipitation stations in Turkey. The method was first used to analyze all streamflow observations from 34 stations on the Buyuk Menderes River (Turkey) (Dikbas 2016a). The approach assumes that an individual observation in a time series is most closely linked, in quantitative terms, to data observed shortly before and after it and, if the series is periodic, to data from the same part of the cycle in other periods (i.e., the same season in different years). The method searches the observed series for pairs of adjacent clusters that match those neighboring a missing value, and then estimates the probable range and value of the missing value from these temporal relationships. It is direct and uses all existing raw data to obtain estimates of missing values, and it requires no training/testing periods or input parameters.

2 Materials and methods

2.1 Description of the frequency based imputation method

When precipitation observations are placed on a matrix with months in columns and years in rows, we expect annual fluctuations in the horizontal direction and values similar to each other in the vertical direction. In this setup, the smallest scale representing the temporal and quantitative behavior of precipitation is an adjacent pair of data on the two-dimensional matrix. This micro-statistical reasoning allows the FBI method to extract valuable information based on relationships within the dataset and provides information on the possible range of missing observations.
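The matrix layout described above can be sketched as follows. This is a minimal illustration, not the published software: the function name `to_year_matrix`, the use of `None` for a missing month and the sample values are all assumptions made for the example.

```python
from typing import List, Optional

def to_year_matrix(series: List[Optional[float]],
                   period: int = 12) -> List[List[Optional[float]]]:
    """Fold a 1-D monthly series into rows of `period` values (one row per year)."""
    if len(series) % period != 0:
        raise ValueError("series length must be a multiple of the period")
    return [series[i:i + period] for i in range(0, len(series), period)]

# Hypothetical two-year series (mm); None marks a missing month.
series = [30.0, 25.0, 20.0, 15.0, 10.0, 5.0, 0.0, 0.0, 5.0, 15.0, 25.0, 35.0,
          28.0, None, 22.0, 12.0, 11.0, 4.0, 0.0, 1.0, 6.0, 14.0, 27.0, 33.0]
matrix = to_year_matrix(series)
```

Each row then holds one annual cycle, so a horizontal neighbor is the adjacent month and a vertical neighbor is the same month in the adjacent year.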

Figure 1 illustrates the logic behind the FBI method. The blue cell at the center of Fig. 1e (January 1985) is the missing value to be estimated. The method considers that the neighbors within the 7 × 7 matrix surrounding the missing value contain the strongest clues about the expected range of the missing cell. A wider field would add cells with a poorer relationship to the data point in question: values in September or May, for example, are less likely to influence a January value than the temporally closer values from October to April. A narrower field would remove cells with a potential relationship, for example by ignoring the influences of October and April on the January value. Similarly, expanding the field vertically would bring in observations four or more years before or after the missing value, even though these values are less likely to relate to the value in question than those from closer years. The numbers in each cell in Fig. 1 are cluster values calculated by using Eq. 1 or 2 after the observed series was sorted and divided into range clusters (Appendix 1).
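As a rough sketch of the 7 × 7 field, the helper below collects the cells within three rows and three columns of a target cell. The name `neighborhood` and the `half` parameter are hypothetical, and the sketch simply clips the field at the matrix edges; the published software may treat boundaries differently (Fig. 1 centers January between the October and April columns, which a plain calendar-year matrix would not reproduce without reordering the columns).

```python
def neighborhood(matrix, row, col, half=3):
    """Return {(row_offset, col_offset): value} for cells within `half` of (row, col).

    Assumption: cells falling outside the matrix are skipped (clipped field).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    cells = {}
    for dr in range(-half, half + 1):
        for dc in range(-half, half + 1):
            r, c = row + dr, col + dc
            if 0 <= r < n_rows and 0 <= c < n_cols:
                cells[(dr, dc)] = matrix[r][c]
    return cells
```

With `half=3` an interior cell yields 49 entries (including the center), while a corner cell yields only 16.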

After the cluster index values for each cell are determined, the process of generating a cluster frequency table for each missing value begins. To this end, all adjacent cluster pairs within the neighborhood of a missing cell are searched using a data matrix. Figure 1e shows eight of the many cluster pairs in the neighborhood of the missing value for January 1985. The remaining subfigures show the locations of the matching cluster pairs. The aim of the search for matching cluster pairs is to deduce the highest probable cluster value for the missing cell. This task is accomplished by looking at the cluster values of the blue-bounded cells at the relative location of the missing January 1985 cell. These clusters show the probable values for the missing cell in January 1985 by answering the questions constructed using the searched and matched cluster pairs. One of the eight questions illustrated in Fig. 1 is:

“What might the cluster value of the missing cell in January 1985 be when the cluster value in January 1983 is 8 and the cluster value in January 1984 is 10?”

The goal here is to find the third cluster value of three vertically aligned cells when the first value is 8 and the second value is 10. One of the answers to this question is shown in Fig. 1b and is written as follows:

“The cluster value for February 1974 is 9 when the cluster value for February 1972 is 8 and the cluster value for February 1973 is 10”. In other words, the cluster value for January 1985 might be 9 based on previously observed series values.
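The vertical question and answer quoted above can be sketched as a simple scan of the cluster matrix. The function name `third_values_below` and the small grid are hypothetical; only the 8, 10, 9 column mirrors the February 1972 to 1974 example in the text, and the full method also searches horizontal and diagonal pairs, which this sketch omits.

```python
def third_values_below(clusters, first, second):
    """Collect the cluster index found directly below each vertical (first, second) pair."""
    found = []
    for r in range(len(clusters) - 2):
        for c in range(len(clusters[r])):
            top, mid, bottom = clusters[r][c], clusters[r + 1][c], clusters[r + 2][c]
            if top == first and mid == second and bottom is not None:
                found.append(bottom)
    return found

# Hypothetical cluster grid: rows are three consecutive years, columns two months.
# The second column reproduces the 8 -> 10 -> 9 sequence from the example.
grid = [[3, 8],
        [5, 10],
        [7, 9]]
answers = third_values_below(grid, first=8, second=10)
```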

For all eight cluster pairs in Fig. 1e, the probable cluster values at the relative January 1985 location in the remaining figures are found to be: 12 (2 times), 11 (2 times), 10, 9, 5 and 3. When the search for all pairs in the neighborhood of the missing value is completed, the cluster with the highest frequency is considered to have the highest probability of being the missing value. The estimated precipitation value is calculated by taking the average of the observations that generated the greatest cluster frequency. Details of how the cluster frequencies were determined and generated are provided in Appendix 2.
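The tallying step can be sketched as below, assuming each match is recorded as a (cluster index, observed value) pair. The function name, the sample matches and the tie-breaking rule (the larger cluster index wins) are assumptions for illustration; Appendix 2 describes how the frequencies are actually generated.

```python
from collections import Counter
from statistics import mean

def estimate_from_matches(matches):
    """Return (most frequent cluster, mean of the observations that produced it).

    `matches` is a list of (cluster_index, observed_value) pairs.
    Assumption: ties in frequency are broken in favor of the larger cluster index.
    """
    freq = Counter(cluster for cluster, _ in matches)
    best = max(freq, key=lambda c: (freq[c], c))
    values = [value for cluster, value in matches if cluster == best]
    return best, mean(values)

# Hypothetical matches (cluster index, observed mm).
matches = [(12, 90.0), (12, 100.0), (12, 95.0), (11, 70.0), (10, 60.0)]
best_cluster, estimate = estimate_from_matches(matches)
```

Because the estimate is an average of observations belonging to one cluster, it always lies inside that cluster's range.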

2.2 Study area and data

To test the applicability of the developed method and provided software on various climate zones, a total of 34,330 monthly total precipitation observations from 70 stations across 21 different basins in Turkey were estimated (Fig. 2). Turkey has a moderately dry climate. Average precipitation tends to be high in the coastal regions of Turkey and decreases towards the inland regions. The area around Rize on the coast of the Black Sea receives an average annual precipitation of 2200 mm, while the Salt Lake region receives 250–300 mm. The Aegean and Mediterranean coasts are wet in the winter but dry during the summer. The Black Sea coastline is the only region in Turkey that receives precipitation throughout the year. Figure 2 illustrates the average annual precipitation in Turkey between 1981 and 2010. The selected stations represent the majority of the climate and elevation zones, and cover nearly all hydrological basins in Turkey.

The General Directorate of State Hydraulic Works of Turkey observes precipitation throughout the country using pluviographs capable of measuring liquid (rainfall) and solid (snow, hail, freezing rain, grain, etc.) precipitation. Therefore, the observations used in this study include liquid precipitation and water equivalents of solid precipitation.

Table 1 outlines the descriptive statistics for all stations, including percentiles and best-fitting distributions. The highest and lowest values (excluding 0.0) are shown in bold in all tables throughout the article. The majority of the precipitation series (48/70) were found to fit the Wakeby distribution. The skewness and excess kurtosis measures indicate that the probability distributions for all stations are positively skewed and leptokurtic (except 21-007). Most stations (67/70) have a minimum monthly precipitation of 0. A total of 59% (41/70) of stations registered zero monthly precipitation in at least 5% of the observed months, while 36% (25/70) registered zero monthly precipitation in more than 10% of months and 4% (3/70) in more than 25% of months.

Table 1 Descriptive statistics, percentiles and best-fitting distributions for all stations


A comprehensive explanation of the applied steps for the estimation of the monthly total precipitation is presented below for the observations of station 07-016 in Çivril-Denizli (Turkey). The first seven values from 1962 are missing, and the total number of existing observations at station 07-016 is 521. When 12 observations (a year of data) are removed from the set to test the model’s ability to make estimations, this number decreases to 509, resulting in a missing data rate of 3.6% (Fig. 3).
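The 3.6% figure can be checked with a few lines of arithmetic. The check assumes that the seven missing values from 1962 are the only gaps in the record, so that the full matrix has 521 + 7 cells; that assumption is an inference from the numbers above rather than something the text states.

```python
existing = 521       # observations available at station 07-016
missing_1962 = 7     # values missing at the start of the record
removed = 12         # one year deliberately removed for testing

total_cells = existing + missing_1962   # assumes no other gaps in the matrix
remaining = existing - removed
missing_rate = (missing_1962 + removed) / total_cells
```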

The details of the estimation process are presented using the observed values from 1985. The entire estimation process was repeated for each missing data point. First, the software removed and estimated data for each year between 1962 and 1984. Then, the 1985 values were removed from the set and estimated. The January value was estimated first. Figure 4 shows the observed values for those months and years surrounding January 1985. The October–December columns represent values from the previous calendar year (current water year).
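The remove-and-estimate procedure can be sketched as a leave-one-year-out loop. The estimator used here is only a stand-in (the long-term monthly mean of the remaining years, one of the baselines compared later in the paper), not the FBI method itself; the function names and the toy matrix are hypothetical.

```python
from statistics import mean

def monthly_means_excluding(matrix, skip_row):
    """Stand-in estimator: per-column mean over all rows except `skip_row`."""
    estimates = []
    for c in range(len(matrix[0])):
        values = [row[c] for r, row in enumerate(matrix)
                  if r != skip_row and row[c] is not None]
        estimates.append(mean(values) if values else None)
    return estimates

def leave_one_year_out(matrix, estimator=monthly_means_excluding):
    """Remove each year in turn and estimate it from the remaining years."""
    return [estimator(matrix, r) for r in range(len(matrix))]

# Hypothetical 3-year, 2-month matrix (mm).
toy = [[10, 20],
       [20, 40],
       [30, 60]]
estimated = leave_one_year_out(toy)
```

Any estimator with the same `(matrix, skip_row)` signature could be plugged in, which keeps the removed year hidden from the estimation step.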

To assess the quantitative relationships between the observations, the observed series are sorted and divided into 2–12 clusters, as explained in Appendix 1. The greatest number of clusters (12) was chosen based on the length and variability of the time series. The results show that this number was sufficient to generate successful results. Figure 5 shows the cluster values for the field surrounding January 1985 at each clustering step. Lower values are shown in shades of red and higher values are shown in shades of green. When the observed data series is divided into two clusters, the first cluster contains the lower precipitation values (0–25.8 mm) from the sorted observations, and the second cluster contains the higher values (26.0–204.8 mm). Each data point is assigned a cluster index: 1 for the data in the first cluster and 2 for the data in the second cluster, as shown in the first table of Fig. 5.
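A possible reading of the sort-and-divide step is sketched below. Equations 1 and 2 of Appendix 1 are not reproduced in this section, so the rule used here (rank the values and cut the ranking into k groups of near-equal size) is an assumption, and `cluster_indices` is a hypothetical name. Note that this simple rule can place equal values, such as repeated zeros, in different clusters, whereas the paper reports a cluster containing only zeros.

```python
def cluster_indices(values, k):
    """Assign 1-based cluster indices by cutting the sorted values into k near-equal groups.

    Assumption: an equal-count split on ranks; not necessarily Eq. 1 or 2 of the paper.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    indices = [0] * n
    for rank, i in enumerate(order):
        indices[i] = rank * k // n + 1
    return indices

# Hypothetical monthly totals (mm).
sample = [5.0, 0.0, 80.0, 30.0]
two_clusters = cluster_indices(sample, 2)
four_clusters = cluster_indices(sample, 4)
```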

In the remaining cluster divisions (3–12), the January 1985 value (84.9 mm) is always located within the highest range of observations and thus in the last cluster (bounded in blue in Fig. 5). The temporal and quantitative relationships between the horizontally, vertically and diagonally adjacent cluster pairs in the neighborhood of the missing data are determined as explained in Sect. 2.1. Then, the relationships are used to estimate the probable cluster value of the deliberately removed data in the center of the neighborhood.

When the sorted observations of the station are divided into 12 clusters, 496 cluster pairs matching with the adjacent cluster pairs in the neighborhood of January 1985 were found in the data matrix. Eight examples of the searched pairs in the neighborhood of the missing data, and matched pairs from various regions of the data matrix are shown in Fig. 1. The process described above for 12 clusters is repeated for 2–11 clusters, and a cluster frequency table is obtained for each month of 1985 (Fig. 6).

From left to right, each column in each table shows the frequencies obtained after dividing the observed value range for station 07-016 into 2–12 clusters. Each column heading indicates the number of clusters into which the observed data range is divided. Each row heading indicates the cluster index. The Min and Max columns on the right show the cluster ranges when the number of clusters is 12. For example, Cluster 1 includes only 0 values, Cluster 2 includes values from 0.1 to 4.5 mm, Cluster 11 includes values from 65.1 to 80.3 mm and Cluster 12 includes the highest values (80.8–204.8 mm).

The frequency table for each month provides information on the possible value of the missing data point in that month. For example, the first column of the frequency table for January 1985 shows the frequency values obtained for the first (lower values) and second (higher values) clusters when the data series is divided into two clusters. The frequency of the second cluster (10,075) is higher than that of the first cluster (6623). This shows that the January 1985 value is more likely to fall in the second cluster (within the 26.0–204.8 mm range given above for the two-cluster division).

The division into three clusters yields frequencies of 1388, 2965 and 3397, respectively. The high value of the third cluster indicates that the desired value is most probably within the 45.9–204.8 mm range. Similarly, for the remaining clusters, the higher frequencies trend toward the bottom of the January 1985 cluster frequency table, indicating that the missing data point is most probably in the higher observation range.
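Reading a frequency table column by column amounts to taking the cluster with the largest frequency in each column. The sketch below uses only the two- and three-cluster frequencies quoted above for January 1985; the dictionary layout and the function name are assumptions.

```python
# Frequencies quoted in the text for January 1985, keyed by number of clusters.
frequency_table = {
    2: [6623, 10075],
    3: [1388, 2965, 3397],
}

def most_probable_cluster(frequencies):
    """Return the 1-based index of the cluster with the highest frequency."""
    return max(range(len(frequencies)), key=frequencies.__getitem__) + 1

most_probable = {k: most_probable_cluster(f) for k, f in frequency_table.items()}
```

In both columns the last cluster wins, which is the start of the "green path" toward the higher observation range.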

The larger the number of clusters, the smaller the data range covered by each cluster. The increase in the number of clusters results in a green path that highlights the highest frequencies generally observed. This green path shows the clusters with the highest probability of representing the missing value range; in contrast, the red cells indicate those clusters with a lower probability of representing the missing value. Months with highly variable observations (like June) result in fuzzy frequency tables, while months with low variability (like August) produce more distinguishable red and green patterns. For 1985, the green trends are more apparent in the January–April and July–September frequency tables.

2.3 Estimation of missing values based on cluster frequencies

The 12th column in the frequency table for January 1985 (Fig. 6) is used to estimate the missing data for that date. The clusters that occur most often provide the most likely ranges of value for the missing data. In the January 1985 example, the highest frequency (70) occurs in cluster 10, which represents the precipitation range between 53.6 and 65.0 mm. The average of the 70 observations (60.04 mm) used to generate this frequency is the most likely estimation of the missing January 1985 value. The obtained estimate will always be within the range of the averaged cluster. In the present example, the actual observed value for the January 1985 data point was 84.9 mm (within the range of the 12th cluster).

The second highest frequency (64) obtained by the example model occurred in cluster 12 (80.8–204.8 mm range). The average of the 64 observations used to generate this frequency is 110.6 mm, which is the second most probable estimate for the January 1985 value. The third highest frequency (63) occurred in clusters 11 and 6, which yield the third and fourth most likely estimates (71.2 and 25.2 mm, respectively). The green path in the January 1985 frequency table indicates that the most likely value will be within the range of clusters 10–12; and, of the first five estimations, the third estimate obtained (cluster 11) is the nearest to the real observed value. This approach is repeated for the five highest total frequency values for each month analyzed, and the five most likely estimates for each month are written to a correlation tables output file by the software. As previously stated, precipitation is relatively chaotic, and the most likely precipitation might not be the precipitation that actually occurred. Therefore, generating multiple precipitation values with a high likelihood of occurrence is very useful to scientists and practitioners who work with precipitation data.
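The ranking of estimates for January 1985 can be reproduced from the numbers quoted above. The pairing of 71.2 mm with cluster 11 and 25.2 mm with cluster 6 follows from the cluster ranges given earlier; the tie-breaking rule for equal frequencies (larger cluster index first) and all names are assumptions, and the fifth estimate is omitted because the text does not give it.

```python
# Twelve-cluster frequencies and cluster averages (mm) quoted for January 1985.
frequencies = {10: 70, 12: 64, 11: 63, 6: 63}
cluster_means = {10: 60.04, 12: 110.6, 11: 71.2, 6: 25.2}
observed = 84.9

def ranked_estimates(frequencies, cluster_means):
    """Order clusters by descending frequency (assumed tie-break: larger index first)."""
    order = sorted(frequencies, key=lambda c: (-frequencies[c], -c))
    return [(c, cluster_means[c]) for c in order]

ranking = ranked_estimates(frequencies, cluster_means)
# 1-based position of the estimate closest to the observed value.
nearest_position = min(range(len(ranking)),
                       key=lambda i: abs(ranking[i][1] - observed)) + 1
```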

The three lowest frequencies obtained for the 12 clusters occurred in clusters 1–3, indicating that the range 0.00–11.0 mm is the least likely to represent the total precipitation that occurred in January 1985. The actual 1985 data points to be tested were removed prior to the application of the method and were not known by the software at any stage of the estimation process.

The ability of the FBI method to estimate precipitation values can be compared to estimates generated using the EM and MLR methods, which are also direct methods. EM is an iterative method used to identify the maximum likelihood estimates of parameters in statistical models (Dempster et al. 1977). It also enables parameter estimation in probabilistic models with incomplete data. A good introduction to the mathematical foundations and applications of the EM method is provided by Do and Batzoglou (2008). As with the FBI method, the EM and MLR methods can generate estimates for a series by using existing observations in the series itself; they do not require preprocessing of data, and unlike methods such as ANN, they do not require the adjustment of any input parameters to improve the results. To compare these two models with the FBI method, all existing station 07-016 observations were estimated using the EM and regression modules in the missing value analysis toolbox of the IBM SPSS software. The same approach used to estimate values in the FBI method was applied: the data from each year were removed and estimated using both methods. Table 2 shows the estimates obtained for the test year using the FBI, EM and regression methods, together with the long-term monthly averages. The correlations obtained using the EM (0.713), regression (0.778) and long-term average (0.733) methods are considerably lower than the correlation found using the FBI method (0.976).

Table 2 Correlations between the observed values and the best estimates from the FBI, EM, regression and long-term average methods for 1985


To test the advantages of generating multiple estimates for a missing value, the increase in correlation with the number of estimates was assessed year by year for all observations of station 07-016. Table 3 shows the correlations between the observed values and the best estimates generated within the first 2, 3, 4 and 5 estimations for each year. Annual correlations over 0.7 occurred between the observed values and the nearest estimates in the first two estimations in 58% of cases (25/43). This rate increased to 91% (39/43) when three estimates were generated and to 100% when four or five estimates were produced. Similarly, the rate of annual correlations over 0.8 was 28, 81, 98 and 100% for the first 2, 3, 4 and 5 estimates, respectively; and the rate of annual correlations over 0.9 was 5, 33, 74 and 98%, respectively. These results indicate that increasing the number of estimates generated increases the likelihood that one of them is close to the observed value.
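The "best estimate within the first k" comparison can be sketched as follows. The data are hypothetical and the function names are assumptions; the correlation is the ordinary Pearson coefficient.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def best_of_first(ranked_estimates, observed, k):
    """For each month, pick the estimate among the first k that is nearest the observation."""
    return [min(estimates[:k], key=lambda e: abs(e - obs))
            for estimates, obs in zip(ranked_estimates, observed)]

# Hypothetical observed values and ranked estimates (most probable first), in mm.
observed = [10.0, 50.0, 90.0]
ranked = [[30.0, 12.0, 9.0],
          [40.0, 55.0, 50.0],
          [60.0, 100.0, 88.0]]
correlations = {k: pearson(best_of_first(ranked, observed, k), observed)
                for k in (1, 2, 3)}
```

The selection step needs the observed value, so this measures how often a close candidate is present among the first k estimates rather than what a user would obtain when the true value is unknown.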

Table 3 Correlations between observed values from station 07-016 and the estimates generated using the clusters with the five highest frequencies


The last column in the table (titled “Whole”) shows the correlations between the entire observed series and the series of best estimates derived from the first 2, 3, 4 and 5 estimations. A correlation value of 0.843 obtained for the first three estimations might be regarded as sufficient to estimate precipitation. Increasing the number of estimates to 4 produces a correlation of 0.912, while increasing the number to 5 yields a correlation of 0.944 for the entire series. These correlations indicate close agreement between the observations and the best of the generated estimates.

Table 4 presents the correlations between the observed values from station 07-016 and the estimates derived using the FBI, EM and regression methods, as well as the long-term averages for each year. For all years, the correlations between the FBI method estimates and the observed values exceed the correlations between the EM, regression and long-term average values and the observed data. While 98% (42/43) of the annual correlations between the FBI method and the observed values are over 0.9, all annual correlations with the compared methods are under 0.9.

Table 4 Correlations between the observed values for station 07-016 and the estimates from the FBI, EM, regression methods, and the long-term averages


The highest and lowest correlations produced by each method are shown in bold. The results reveal that the estimates produced using the EM method tend to be more similar to the long-term averages than to the observed values. This resulted in similar correlation values for the EM method and the long-term averages across the years. The correlations of the compared methods follow a similar pattern: in general, they increase or decrease together over the years. For example, the lowest annual correlations with the EM method (0.015) and the long-term averages (0.030) occurred in 1972. This year represented the sixth lowest annual correlation for the FBI method (0.904) and the third lowest for the regression method (0.006).

To compare the general performance of the methods used to estimate precipitation values at station 07-016, five statistical measures (correlation (r), Nash–Sutcliffe efficiency coefficient (E), root mean squared error (RMSE), mean absolute error (MAE) and mean bias error (MBE)) were calculated and are presented in Table 5. The FBI method performed best on all statistical measures except the MBE. The negative E value obtained using the regression method indicates that the observed mean is a better indicator of value than the regression method. The other statistical measures also reveal that the long-term averages are preferable to the regression method. As expected, the MBE for long-term averages was zero, while the MAE was lowest for the FBI method, suggesting that the FBI method estimates are closer to the observed values. The MAE and MBE statistics should be considered together because equal averages for estimates and observed values do not generally mean that the estimations are sufficiently close to the observations. The averages may be similar even though there are significant positive and negative differences between the estimates and observed values. These differences can be detected by calculating the MAE, which has advantages over the RMSE and MBE in assessing average model performance (Willmott and Matsuura 2005).
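The error measures named above can be sketched with their usual textbook definitions; r is the ordinary Pearson correlation and is omitted here. The sign convention for the MBE (estimate minus observation, so that underestimation gives a negative value) is an assumption, as are the function names and the sample data.

```python
from math import sqrt

def nash_sutcliffe(observed, estimated):
    """E = 1 - sum of squared errors / sum of squared deviations from the observed mean."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((e - o) ** 2 for o, e in zip(observed, estimated))
    return 1.0 - sse / sum((o - mean_obs) ** 2 for o in observed)

def rmse(observed, estimated):
    return sqrt(sum((e - o) ** 2 for o, e in zip(observed, estimated)) / len(observed))

def mae(observed, estimated):
    return sum(abs(e - o) for o, e in zip(observed, estimated)) / len(observed)

def mbe(observed, estimated):
    # Assumed sign convention: estimate minus observation.
    return sum(e - o for o, e in zip(observed, estimated)) / len(observed)

# Hypothetical values (mm): errors of +2, -2, +3, -3 cancel in the MBE but not in the MAE.
obs = [10.0, 20.0, 30.0, 40.0]
est = [12.0, 18.0, 33.0, 37.0]
scores = {"E": nash_sutcliffe(obs, est), "RMSE": rmse(obs, est),
          "MAE": mae(obs, est), "MBE": mbe(obs, est)}
```

In this toy case the MBE is zero while the MAE is 2.5 mm, which illustrates why the two statistics should be read together.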

Table 5 Statistical measures used to compare all observed values from station 07-016 with the estimates generated using the FBI, EM, and regression methods and the long-term averages


The graphs shown in Fig. 7 compare the observed values from station 07-016 with the estimates produced using the FBI, EM and regression methods. A very good fit is seen between the FBI method estimates and the observed values across the time series, indicating that the method is sensitive to the variations in precipitation. On the other hand, the estimates produced using the EM and regression methods lack generality and sensitivity. Figure 7 also shows that the FBI method provides lower estimates for rarely observed high precipitation values even though it produces better estimates compared to the EM and regression methods. Low estimations of extreme values occur as a result of the estimation logic behind the FBI method, which considers the frequency of observed values; it is well known that the frequency of extreme precipitation is generally low. The graphs also show that the estimates obtained for extreme values are always higher than the remaining estimates. This may be considered a disadvantage of the method; however, its ability to estimate extreme values might be improved by considering observations from nearby stations.

2.4 Application of the FBI method using the remaining 69 precipitation stations

The above discussion is based on estimates and observations for a single station (07-016). A method’s ability to estimate values for a single station is not sufficient to claim that it will be successful in estimating values for other stations. To test the FBI method’s application across multiple stations, we used the above method to estimate precipitation values for 70 stations across 21 different basins in Turkey. Stations were chosen based on location and the variation in observed values. The stations reflect various climates in Turkey, ranging from dry to wet (see the descriptive statistics of the observed series in Table 1). Table 6 presents the statistical measures (r, E, normalized root mean squared error (NRMSE), mean absolute scaled error (MASE), MAE and MBE) generated for each station based on a comparison between the FBI method estimates and the observed values from each station. The number of years for which data were available at each station is also presented in the table.
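The two scaled measures can be sketched as below. This section does not define them, so both definitions are assumptions: the NRMSE is taken as the RMSE divided by the observed range, and the MASE as the MAE divided by the mean absolute error of a naive forecast that repeats the previous observation. The paper may use other normalizations (for example, a seasonal naive forecast).

```python
from math import sqrt

def nrmse(observed, estimated):
    """RMSE divided by the observed range (assumed normalization)."""
    rmse = sqrt(sum((e - o) ** 2 for o, e in zip(observed, estimated)) / len(observed))
    return rmse / (max(observed) - min(observed))

def mase(observed, estimated):
    """MAE divided by the MAE of a one-step naive forecast (assumed scaling)."""
    mae = sum(abs(e - o) for o, e in zip(observed, estimated)) / len(observed)
    naive = sum(abs(b - a) for a, b in zip(observed, observed[1:])) / (len(observed) - 1)
    return mae / naive

# Hypothetical values (mm).
obs = [10.0, 20.0, 30.0, 40.0]
est = [12.0, 18.0, 33.0, 37.0]
scaled = {"NRMSE": nrmse(obs, est), "MASE": mase(obs, est)}
```

Under these definitions a MASE below 1 means the estimates beat the naive forecast on average.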

Table 6 Statistical measures of the comparisons between the estimated and observed values of the stations


The correlations between the results of the FBI method and the observations exceeded 0.9 for 24% (17/70) of stations and exceeded 0.85 for 79% (55/70) of stations. The minimum correlation was 0.795, and the maximum correlation was 0.944. Eleven of the 15 stations with the highest correlations are located in basins 4, 5, 6, 7, 8 and 9, which are all located within the Eastern Aegean and Eastern Mediterranean regions of Turkey. Similarly, eight of the 15 stations with the lowest correlations are in basins 12, 13, 14, 15, 16 and 18, which are located in the central and northern regions of Turkey. While the lowest 14 correlation values occurred for stations fitting the Wakeby distribution, none of the 10 stations with the highest correlations fit this distribution. Eight of the 15 best correlated stations (including the first three) instead fit the GEV distribution.

All Nash–Sutcliffe efficiency coefficients exceeded 0.591; 73% (51/70) were over 0.70 and 11% (8/70) were over 0.80. The highest Nash–Sutcliffe efficiency coefficient was 0.889. The highest NRMSE value was 0.114; 83% (58/70) of the NRMSE values fell below 0.10, while the lowest NRMSE value was 0.050. All MASE values fell below 0.40; 90% (63/70) of these values were under 0.35 and 41% (29/70) were under 0.30. The lowest MASE value was 0.237. The MAE values ranged between 6.367 (obtained for 18-003) and 27.571 (obtained for 08-006), suggesting that the high correlation value (0.942) obtained for station 08-006 may be misleading because the MAE and MBE values for the station are higher than those for the remaining stations. MBE values ranged between 0.084 and −17.667, with data from 69 stations generating negative MBE values. This indicates that the precipitation estimates generated using the FBI method have a slight negative bias. Greater bias occurred at stations where extreme values and variations were much higher than at other stations, resulting in greater differences between the estimated and observed values. Future studies might investigate ways to obtain average estimates closer to average observations to eliminate bias errors without increasing MAE. A method to overcome this bias might be to multiply all estimates by the ratio between the averages of the observed values and the estimated values for each station. This intervention should only be made if the MAE between the observed and estimated values also decreases. Furthermore, though this intervention may improve the estimation of higher values, a much larger number of values in the lower ranges might be overestimated. A selected bias correction method will not produce the best results for all data series (Ajaaj et al. 2016); thus, the selection of a bias correction method should be left to the users of the FBI method where necessary.
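The ratio-based correction suggested above, together with the condition that it be kept only when the MAE decreases, can be sketched as follows. The function names and sample values are hypothetical, and this is one possible reading of the suggestion rather than a procedure tested in the paper.

```python
def mae(observed, estimated):
    return sum(abs(e - o) for o, e in zip(observed, estimated)) / len(observed)

def ratio_bias_correct(observed, estimated):
    """Scale estimates by mean(observed) / mean(estimated); keep the result only if MAE drops.

    Returns (estimates, factor); factor is 1.0 when the correction is rejected.
    """
    mean_est = sum(estimated) / len(estimated)
    if mean_est == 0:
        raise ValueError("mean of the estimates is zero; the ratio is undefined")
    factor = (sum(observed) / len(observed)) / mean_est
    corrected = [e * factor for e in estimated]
    if mae(observed, corrected) < mae(observed, estimated):
        return corrected, factor
    return list(estimated), 1.0

# Hypothetical case 1: uniform 20% underestimation, so scaling removes all error.
accepted, factor_1 = ratio_bias_correct([10.0, 20.0, 30.0, 40.0], [8.0, 16.0, 24.0, 32.0])
# Hypothetical case 2: one missed extreme; scaling inflates the many low values.
rejected, factor_2 = ratio_bias_correct([10.0, 10.0, 10.0, 10.0, 60.0], [10.0] * 5)
```

The second case shows the risk noted in the text: matching the averages would overestimate the many low values, so the MAE rises (from 10 to 16 mm here) and the correction is discarded.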

3 Discussion and conclusions

This article assesses the ability of the FBI method to estimate non-continuous monthly precipitation data without the use of observations from neighboring stations. The goodness-of-fit measures calculated between the observed and estimated series show that the FBI method is capable of estimating monthly precipitation data obtained from various climatic zones. However, it is impossible to claim that the method will always successfully estimate values for stations in other regions without first applying the method to observations from those stations. Practical experience reported in the literature shows that no data-driven methodology provides the best results for all stations or for all variables.

This method may also be used to estimate weekly or daily precipitation data; however, given that the randomness of precipitation generally increases with decreasing observation periods, it is anticipated that the success of the method will be lower for precipitation estimates at the weekly or daily scale. The inclusion of observations from highly correlated neighboring stations may improve the estimates at these shorter time scales. Further studies may investigate the influence of neighboring stations on the estimation power of the presented method.

As noted above, the method analyzed in this study may not be suitable for the estimation of extreme observations that occur at a very low frequency. Values that occur with a very low frequency in a data series also have a low occurrence probability and will not occur frequently enough to be determined among the highest possible values. As is valid for most data-driven methods, the length of the data series used may influence the performance of the proposed method. The method may be less useful when applied to short data series, as the estimates produced by the presented method are based on the frequencies of the observed value ranges. The input dataset should have at least seven rows of input data (i.e., 7 years for monthly data) and more data will generally provide more information about the frequencies of the observations, consequently supporting the possibility of better estimations.

Another limitation of the method is that writing a software code for its implementation might not be easy for every user. With this in mind, a link to the source code written in Visual Basic is provided to the readers in Appendix 3. This will enable users to implement the FBI method on other datasets or in other research areas. Users of other programming languages or operating systems will need to convert the code.

The FBI method may be applied in many scientific disciplines, as it is a generally applicable, direct analysis method that requires no determination of input parameters, nor does it require any preprocessing of data. While most existing methodologies are one-dimensional, the FBI method is two-dimensional and has been shown to perform better when compared to the EM and MLR methods in the estimation of precipitation.

References

Ajaaj AA, Mishra AK, Khan AA (2016) Comparison of BIAS correction techniques for GPCC rainfall data in semi-arid climate. Stoch Environ Res Risk A 30:1659–1675. doi:10.1007/s00477-015-1155-9
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 39:1–38
Dikbas F (2016a) Frequency based prediction of Buyuk Menderes flows. Tek Dergi 27:7325–7343
Dikbas F (2016b) Three-dimensional imputation of missing monthly river flow data. Sci Iran 23:45–53
Do CB, Batzoglou S (2008) What is the expectation maximization algorithm? Nat Biotech 26:897–899. http://www.nature.com/nbt/journal/v26/n8/suppinfo/nbt1406_S1.html
Elshorbagy A, Corzo G, Srinivasulu S, Solomatine DP (2010a) Experimental investigation of the predictive capabilities of data driven modeling techniques in hydrology-Part 1: concepts and methodology. Hydrol Earth Syst Sci 14:1931–1941. doi:10.5194/hess-14-1931-2010
Elshorbagy A, Corzo G, Srinivasulu S, Solomatine DP (2010b) Experimental investigation of the predictive capabilities of data driven modeling techniques in hydrology-Part 2: application. Hydrol Earth Syst Sci 14:1943–1961. doi:10.5194/hess-14-1943-2010
Hou AY et al (2014) The global precipitation measurement mission. Bull Am Meteorol Soc 95:701–722. doi:10.1175/BAMS-D-13-00164.1
Jayawardena AW, Lai F (1994) Analysis and prediction of chaos in rainfall and stream flow time series. J Hydrol 153:23–52. doi:10.1016/0022-1694(94)90185-6
Leconte J, Forget F, Charnay B, Wordsworth R, Pottier A (2013) Increased insolation threshold for runaway greenhouse processes on earth-like planets. Nature 504:268–271. doi:10.1038/nature12827
Maier HR, Dandy GC (2000) Neural networks for the prediction and forecasting of water resources variables: a review of modelling issues and applications. Environ Model Softw 15:101–124. doi:10.1016/S1364-8152(99)00007-9
Maier HR, Jain A, Dandy GC, Sudheer KP (2010) Methods used for the development of neural networks for the prediction of water resource variables in river systems: current status and future directions. Environ Model Softw 25:891–909. doi:10.1016/j.envsoft.2010.02.003
Popp M, Schmidt H, Marotzke J (2016) Transition to a Moist Greenhouse with CO₂ and solar forcing. Nat Commun. doi:10.1038/ncomms10627
Reager JT, Famiglietti JS (2009) Global terrestrial water storage capacity and flood potential using GRACE. Geophys Res Lett. doi:10.1029/2009GL040826
Remesan R, Mathew J (2015) Hydrological data driven modelling: a case study approach. Springer, Switzerland. doi:10.1007/978-3-319-09235-5
Sikorska AE, Montanari A, Koutsoyiannis D (2015) Estimating the uncertainty of hydrological predictions through data-driven resampling techniques. J Hydrol Eng. doi:10.1061/(ASCE)HE.1943-5584.0000926
Sivakumar B (2000) Chaos theory in hydrology: important issues and interpretations. J Hydrol 227:1–20. doi:10.1016/S0022-1694(99)00186-9
Sivakumar B, Liong SY, Liaw CY, Phoon KK (1999) Singapore rainfall behavior: chaotic? J Hydrol Eng 4:38–48. doi:10.1061/(ASCE)1084-0699(1999)4:1(38)
Solomatine DP (2006) Data-driven modeling and computational intelligence methods in hydrology. Encyclopedia of hydrological sciences. Wiley, Hoboken. doi:10.1002/0470848944.hsa021
Solomatine D, See LM, Abrahart RJ (2008) Data-driven modelling: concepts, approaches and experiences. In: Abrahart R, See L, Solomatine D (eds) Practical hydroinformatics, vol 68, Water science and technology library. Springer, Berlin, p 17. doi:10.1007/978-3-540-79881-1_2
Wang XL, Lin A (2015) An algorithm for integrating satellite precipitation estimates with in situ precipitation data on a pentad time scale. J Geophys Res Atmos 120:3728–3744. doi:10.1002/2014JD022788
Willmott CJ, Matsuura K (2005) Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim Res 30:79–82. doi:10.3354/cr030079
Yozgatligil C, Aslan S, Iyigun C, Batmaz I (2013) Comparison of missing value imputation methods in time series: the case of Turkish meteorological data. Theor Appl Climatol 112:143–167. doi:10.1007/s00704-012-0723-x
Zhang Q, Xu C-Y, Tao H, Jiang T, Chen YD (2010) Climate changes and their impacts on water resources in the arid regions: a case study of the Tarim River basin, China. Stoch Environ Res Risk A 24:349–358. doi:10.1007/s00477-009-0324-0

Acknowledgements

I would like to thank The General Directorate of the State Hydraulic Works of Turkey for providing the data used in this study and the editors and reviewers for their valuable contributions and comments, which greatly improved the manuscript.

Author information

Authors and Affiliations

Civil Engineering Department, Pamukkale University, Denizli, Turkey
Fatih Dikbas

Corresponding author

Correspondence to Fatih Dikbas.

Appendices

Appendix 1: Determination of range clusters

For various reasons, there are generally gaps in any time series dataset, and the reliable estimation of the missing data has great value. In the FBI method, the missing data value at the center of the matrix in Fig. 8 (cell i, j) has temporal and quantitative relationships with nearby cells.

To estimate the probable range of the missing value at node i, j, the value ranges of all existing observations in the dataset should be determined. First, the observed data are sorted in ascending order and a three-dimensional vector containing the sorted data and their coordinates in the data matrix is generated. The coordinate of each data point used in this study is the observed month (column) and year (row) of the data and is unique for each observation. The coordinate information is crucial because the observation time of a given value affects the temporal and quantitative investigation of time series data. Sorting and investigating statistical relationships for a variable without considering the observation time of each individual value means ignoring information about the temporal relationship between observations.
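As a rough illustration of this sorting step, the sketch below builds the sorted vector as a list of (value, row, column) triples. It is not taken from the FBI software: the function name, the use of `None` to mark a missing cell and the sample numbers are assumptions made for the example.

```python
def sorted_observations(matrix):
    """Return (value, row, column) triples sorted by value.

    Rows are years and columns are months; None marks a missing cell
    (an assumed convention for this sketch) and is skipped.
    """
    triples = [
        (value, row, col)
        for row, year_values in enumerate(matrix)
        for col, value in enumerate(year_values)
        if value is not None
    ]
    # Tuples sort by value first; ties fall back to row, then column.
    return sorted(triples)

# Invented two-year, three-month example with one gap.
data = [
    [12.0, 30.5, None],
    [7.2, 45.1, 3.0],
]
sorted_vec = sorted_observations(data)
print(sorted_vec)
```

Keeping the row and column next to each value means the rank of an observation can later be mapped back to its position in the year-by-month matrix.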

After sorting the observations, the observed time series range is divided into 2 to n range clusters to evaluate and estimate the possible clusters into which the missing data point may fall. The value of n may increase with the amount of available data; this increase would provide more precise results, as the value range for each cluster would be narrower. The number of clusters should be chosen so that the distribution of the observed values is sufficiently represented. Currently, the maximum number of clusters is determined by running the software for various numbers of clusters. It must be noted that the selected cluster number may not be optimal for obtaining the best results, although the method may still produce successful results. A good approach to determining the maximum number of clusters might be to start with a high number of clusters (such as 50). Then, the cluster number that produces sufficient frequency values and cluster ranges might be chosen by looking at the generated frequency tables. Future studies should propose a method for determining the optimal number of clusters based on the number and variability of observations to further improve the successful estimation of missing values.

Clusters may be generated using two different approaches. In the first approach, each cluster has as equal a number of elements as possible (the clusters have varying ranges). Observed values are assigned to clusters using Eq. (1).

$$Cl_{i} = \operatorname{int}\left( \frac{i \cdot n_{cl}}{n_{d}} \right) + 1 \tag{1}$$

In the second approach, range values are equalized (the clusters have a varying number of elements). The bounds of the cluster ranges are the lowest and highest observations belonging to that range. Observed values are assigned to clusters using Eq. (2).

$$Cl_{i} = \operatorname{int}\left( \frac{\left( X_{i} - X_{min} \right) \cdot n_{cl}}{X_{max} - X_{min}} \right) + 1 \tag{2}$$

In the above equations, $n_{d}$ is the total number of observations in the sorted data vector; $i$ is the rank (index number) of the observation in the sorted data vector (ranging from 1 to $n_{d}$); $n_{cl}$ is the number of clusters used to divide the sorted data vector; $Cl_{i}$ is the cluster index assigned to the $i$-th observation (ranging from 1 to $n_{cl}$); $\operatorname{int}()$ is the function converting a decimal number into an integer; $X_{i}$ is the $i$-th observation in the sorted data series; and $X_{min}$ and $X_{max}$ are the minimum and maximum observations.
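The two assignment rules in Eqs. (1) and (2) can be illustrated with a short sketch. It rests on two assumptions that are not stated in the text. First, taken literally with the rank running from 1 to the number of observations, Eq. (1) gives one more than the number of clusters for the last observation, so the sketch uses a zero-based rank to keep every index within the stated range. Second, Eq. (2) gives the same overshoot for the maximum observation, so the result is capped at the number of clusters. The function names and sample values are hypothetical.

```python
def equal_count_clusters(sorted_values, n_cl):
    """Eq. (1) with a zero-based rank (assumption): similar cluster sizes."""
    n_d = len(sorted_values)
    return [(rank0 * n_cl) // n_d + 1 for rank0 in range(n_d)]

def equal_range_clusters(sorted_values, n_cl):
    """Eq. (2), capped at n_cl (assumption): equal-width value ranges."""
    x_min, x_max = sorted_values[0], sorted_values[-1]
    if x_max == x_min:
        return [1] * len(sorted_values)  # degenerate series, single cluster
    span = x_max - x_min
    return [min(int((x - x_min) * n_cl / span) + 1, n_cl) for x in sorted_values]

# Invented, already sorted sample with a long upper tail.
values = [0.0, 2.0, 5.0, 9.0, 40.0, 100.0]
by_count = equal_count_clusters(values, 3)  # [1, 1, 2, 2, 3, 3]
by_range = equal_range_clusters(values, 3)  # [1, 1, 1, 1, 2, 3]
print(by_count, by_range)
```

With skewed data such as this sample, equal-width ranges crowd most observations into the first cluster, while the equal-count rule spreads them evenly across clusters.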

Each approach has advantages and disadvantages relative to the other. The choice of clustering method depends on the diversity of the observed time series. For example, if the number of elements in certain clusters becomes too high compared to other clusters, it is better to generate clusters with an equal number of elements. For the precipitation data used in this paper, the first approach was used, so each cluster included a similar number of elements. For example, for station 07-016, the first 11 clusters cover the range 0.0–80.3 mm, while the 12th cluster alone covers the range 80.8–204.8 mm (1.54 times the cumulative range of the first 11 clusters).

Appendix 2: Generation of the cluster frequency table

The clustering process explained in Appendix 1 assigns a cluster index to each observation. The cluster index value of each cell is the key to finding the cluster value of the missing cell. When the observed range is divided into two clusters, the first cluster includes the lower values and has a cluster index of 1, and the second cluster includes the higher values and has a cluster index value of 2. All adjacent cluster pairs in the data matrix near the missing cell are searched. Frequency values for the probable clusters are set to zero prior to the initiation of the search process. At the first clustering step, there are two possible clusters (1 or 2) into which the missing data may fall. When a match for a cluster pair is found in the matrix, the frequency of the cluster value at the relative location of the missing data point is increased by one. The maximum number of unique cluster pairs near the missing data point is 158. This number decreases if there is more than one missing data point in the neighborhood. The following rules provide three examples of the 158 unique rules used to find matching cluster pairs.

1. If [$Cl(X_{i,j-2}) = a$ & $Cl(X_{i,j-1}) = b$] and if [$Cl(X_{p,q-2}) = a$ & $Cl(X_{p,q-1}) = b$ & $Cl(X_{p,q}) = c$] then freq(c) = freq(c) + 1.
2. If [$Cl(X_{i-2,j}) = a$ & $Cl(X_{i-1,j}) = b$] and if [$Cl(X_{p-2,q}) = a$ & $Cl(X_{p-1,q}) = b$ & $Cl(X_{p,q}) = c$] then freq(c) = freq(c) + 1.
3. If [$Cl(X_{i-2,j-2}) = a$ & $Cl(X_{i-1,j-1}) = b$] and if [$Cl(X_{p-2,q-2}) = a$ & $Cl(X_{p-1,q-1}) = b$ & $Cl(X_{p,q}) = c$] then freq(c) = freq(c) + 1.

In the above rules, Cl(X) is the cluster index of the observed value X; i and j are the row and column numbers of the missing node at the center of the 7 × 7 cell field; p and q are the row and column numbers of the cell that lies, relative to a matching pair, at the same position as the missing node at i, j; and a, b and c are the cluster numbers of the related cells. When the entire dataset is divided into two clusters, a, b and c might have values of 1 or 2; for n clusters, they may have values ranging between 1 and n. The values of a, b and c may differ for each rule because they may represent different locations within the data matrix. The above three rules represent the horizontal cluster pair to the left of the missing node, the vertical cluster pair above the missing node and the diagonal cluster pair to the top left of the missing node, as shown in Fig. 9a in orange, yellow and green, respectively. Figure 9b shows the location of the first pair match for the first rule. With the first match, the frequency of the cluster number of the cell at the relative location of the missing data point is increased by one (the cell at p, q shown in pink). This is done because the cluster value at cell p, q is a probable value for the missing node at i, j, given that both cells have the same cluster pairs to the left. The search for the same pair then continues until all matching pairs are found and the frequencies of the clusters at the corresponding cells p, q are increased by one (for each match, the values of p and q might be different because the matching pairs will be at different locations within the data matrix).

After the search for the first cluster pair is completed, the above process is repeated for the next pair until all pairs near the missing data point have been searched and the total frequencies for each probable cluster determined. The clusters with the highest frequencies will be the most likely clusters into which the missing node will fall. Some cluster frequencies might remain at zero, indicating that it is unlikely that the missing data point will fall within that cluster.
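The search described above can be sketched for the three example rules only (the remaining rules of the 158 are not listed in this appendix and are therefore not covered). This is a minimal illustration rather than the published Visual Basic implementation: the names `cell`, `EXAMPLE_RULES` and `cluster_frequencies` are hypothetical, `None` marks a missing cell, and cells beyond the matrix edge are treated as absent, which is an assumption.

```python
def cell(cl, r, c):
    """Cluster index at (r, c); None if outside the matrix or missing."""
    if 0 <= r < len(cl) and 0 <= c < len(cl[0]):
        return cl[r][c]
    return None

# Offsets of each cluster pair relative to the target cell (row, column).
EXAMPLE_RULES = [
    ((0, -2), (0, -1)),    # rule 1: horizontal pair to the left
    ((-2, 0), (-1, 0)),    # rule 2: vertical pair above
    ((-2, -2), (-1, -1)),  # rule 3: diagonal pair to the top left
]

def cluster_frequencies(cl, i, j, n_cl, rules=EXAMPLE_RULES):
    """Count, per cluster, how often the pairs next to (i, j) recur elsewhere."""
    freq = {c: 0 for c in range(1, n_cl + 1)}  # frequencies start at zero
    for (dr1, dc1), (dr2, dc2) in rules:
        a = cell(cl, i + dr1, j + dc1)
        b = cell(cl, i + dr2, j + dc2)
        if a is None or b is None:
            continue  # pair incomplete near the missing node, rule unusable
        for p in range(len(cl)):
            for q in range(len(cl[0])):
                c = cl[p][q]
                if (p, q) == (i, j) or c is None:
                    continue
                if cell(cl, p + dr1, q + dc1) == a and cell(cl, p + dr2, q + dc2) == b:
                    freq[c] += 1  # same pair found: cluster c is a candidate
    return freq

# Invented cluster matrix (two clusters); the missing node is at row 3, column 2.
cl = [
    [1, 2, 2, 1],
    [1, 2, 2, 2],
    [2, 1, 2, 1],
    [1, 2, None, 1],
]
freq = cluster_frequencies(cl, 3, 2, 2)
print(freq)  # {1: 2, 2: 3}
```

In this toy matrix the horizontal rule finds three matches and the vertical rule two, while the diagonal pair does not recur, so cluster 2 ends with the higher frequency.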

In the first step, the observed data range is divided into two clusters. After the frequencies of both clusters have been determined, the observed range is divided into three clusters. This time, the cells in the data matrix will have cluster values ranging from 1 to 3. The process applied to the two clusters above is repeated for the three clusters. For the missing value, the frequency of the three probable clusters will be zero to start. Then, all cluster pairs near the missing data point will be searched, and the frequencies of the clusters found at the relative location of the missing data point will be increased by one for each cluster pair match. The clustering, searching and cluster frequency determination process continues until it has been applied for the maximum number of clusters. During this process, a cluster frequency table is generated to show the frequencies of the clusters determined at each clustering step. The highest frequency values in this table indicate the most likely clusters into which the missing data point will fall.
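The repetition over increasing cluster numbers can be sketched as a loop that re-clusters the data and recounts the frequencies at each step. The example below is a simplified, self-contained illustration under several assumptions: it uses only the three example rules from this appendix, equal-count clustering with a zero-based rank, `None` for missing cells, a rectangular matrix, and invented precipitation values. All function names are hypothetical.

```python
RULES = [((0, -2), (0, -1)), ((-2, 0), (-1, 0)), ((-2, -2), (-1, -1))]

def cluster_matrix(values, n_cl):
    """Equal-count clustering of all observed cells; None stays None."""
    obs = sorted((v, r, c) for r, row in enumerate(values)
                 for c, v in enumerate(row) if v is not None)
    cl = [[None] * len(row) for row in values]
    for rank0, (_, r, c) in enumerate(obs):
        cl[r][c] = rank0 * n_cl // len(obs) + 1
    return cl

def get(cl, r, c):
    """Cluster index at (r, c); None if outside the matrix or missing."""
    return cl[r][c] if 0 <= r < len(cl) and 0 <= c < len(cl[0]) else None

def frequencies(cl, i, j, n_cl):
    """Frequencies of clusters 1..n_cl for the missing cell at (i, j)."""
    freq = [0] * n_cl
    for (r1, c1), (r2, c2) in RULES:
        a, b = get(cl, i + r1, j + c1), get(cl, i + r2, j + c2)
        if a is None or b is None:
            continue  # this pair is not available next to the missing cell
        for p in range(len(cl)):
            for q in range(len(cl[0])):
                c = cl[p][q]  # None for missing cells, including (i, j)
                if (c is not None and get(cl, p + r1, q + c1) == a
                        and get(cl, p + r2, q + c2) == b):
                    freq[c - 1] += 1
    return freq

def frequency_table(values, i, j, max_clusters):
    """One row of frequencies per clustering step, from 2 to max_clusters."""
    return {n_cl: frequencies(cluster_matrix(values, n_cl), i, j, n_cl)
            for n_cl in range(2, max_clusters + 1)}

# Invented values; the missing cell is at row 3, column 2.
values = [
    [5.0, 60.0, 70.0, 10.0],
    [8.0, 65.0, 80.0, 90.0],
    [75.0, 12.0, 85.0, 15.0],
    [9.0, 95.0, None, 20.0],
]
table = frequency_table(values, 3, 2, 4)
for n_cl, row in table.items():
    print(n_cl, row)
```

Each row of the resulting table corresponds to one clustering step. With so few observations and only three rules, ties and zero frequencies are to be expected, which is consistent with the remark that longer series provide more information about the frequencies.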

A dataset might have more than one missing value. The above method can be applied to each missing data point in the set and a frequency table generated for each missing cell. As the locations of the missing data points in the matrix will be different from one another, the neighbors of each missing cell will be unique; consequently, the frequency table for each missing data point will also be unique. To avoid repetition, cluster frequency table samples and details about how the estimates are calculated using the cluster frequencies are presented in the Application of the FBI Method section.

Appendix 3: The frequency based imputation software

The software developed to implement the method used in this study was written in Visual Basic in the Microsoft Visual Studio environment. The software is a console application that makes use of the interoperability feature, which enables synchronous operation of Microsoft Visual Basic and Microsoft Excel. The flowchart in Fig. 10 shows the general application procedure of the developed method and the software.

The first step in the application of the method is to read all observed values in the selected time series from the input file. The file is an Excel spreadsheet containing a two-dimensional matrix of the observed data. In this study, the columns in the data file represent months and the rows represent years. For each run, all observed data for a single station is evaluated. The method requires no preprocessing of data and uses all observed values from a station to generate the frequency tables for each observation; estimations are then made for the entire series. No observations are ignored and no smoothing occurs.

The software generates four output files containing the frequency tables, the estimations and their correlations with removed observations and statistical measures comparing the observed and estimated series to one another. Conditional formatting is used in the output files to visualize the differences between the values. The code is separated into distinct sections and explanations about the implementation of the method by the software are provided in the code itself.

The frequency based imputation software is distributed under the terms of the GNU General Public License version 3, and a copyright notice is provided at the beginning of the code. The software code may be downloaded using the following link: https://www.dropbox.com/s/l9eavvjiywipl19/FrequencyBasedImputation.vb?dl=0.

About this article

Cite this article

Dikbas, F. Frequency based imputation of precipitation. Stoch Environ Res Risk Assess 31, 2415–2434 (2017). https://doi.org/10.1007/s00477-016-1356-x

Published: 19 November 2016
Issue Date: November 2017
DOI: https://doi.org/10.1007/s00477-016-1356-x
