1 Introduction

Access to reliable time series of climatic variables with sufficient length is critical for the study of several geophysical and environmental processes (Beniston et al. 2012). Changes in environmental conditions (Beniston et al. 2012; Kottek et al. 2006); vegetation patterns (Sandholt et al. 2002; Weng et al. 2004), as well as changes on growth, development; and the adaptation of plants and animals (Zavala 2004; Pörtner 2001) are usually explained by climatic variables.

Unfortunately, climate records with the desired length are not always available. Indeed, the densification of weather networks has only occurred recently with substantial investments in standard weather stations and the incorporation of automatic weather stations. According to the National Climatic Data Center, the number of weather stations in the USA varied from little more than 3000 in 1900 to almost 95,000 in 2000. In addition, available data sets usually show discontinuities due to sensor failure or human error.

Several methods have been developed to fill out discontinuities in temperature and precipitation time series. Some of them are deterministic (producing the same output for given initial conditions) and others are stochastic, which provide likely realizations of the variable of interest.

Gap filling methods usually rely on time series for temporal estimation and spatial neighbors as regression variables to extrapolate values in space (Tardivo and Berti 2014). For the case of temperature, Daly (2006) compared six commonly used methods for estimating missing climatic data and showed that Daymet (local regression), PRISM (local regression), and a Regional regression perform better. Moreover, the author also noted a significant influence of the coast (included as distance from coastline) over minimum temperature, especially at scales ≤ 10 km. Tardivo and Berti (2012) used data from a set of stations within a regional domain to fill gaps in extreme temperature data, interpolating values using only meteorological variables and without the inclusion of geographic or spatial variables that could have improved estimations.

In the case of rainfall, literature shows several examples of gap filling and data generation methods according to climate projections from general circulation models (e.g., Diez et al. 2005; Maidment et al. 2012; Nagata 2011; Rojas et al. 2010; Schmidli et al. 2006; Fowler et al. 2007; Vrac and Naveau 2007). Another common approach is the use of stochastic downscaling techniques from global circulation models, but in this case, estimates of individual values are only likely realizations of climate conditions (Castro et al. 2013). In general, these methods usually capture trends but tend to be biased when are disaggregated in time (Casanueva et al. 2012). Other methods use algorithms based on existing data sets. For instance, Ramos-Calzado et al. (2008) presented a method based on error propagation theory taking into account the uncertainty of the precipitation measurements. Another example of this approach is found in the method named CUTOFF (Feng et al. 2014), where the estimate of a missing value is obtained using similar observed temporal information from the nearest spatial neighbors.

One common problem of simple gap filling methods is that they do not always consider local topography due to the relatively coarse grid size (Schmidli et al. 2006) having problems when representing rainfall variability observed as a consequence of specific local conditions (Colle 2004). Climatic processes respond to a number of topographic variables, such as a slope, aspect, latitude, and longitude, which are correlated with spatial gradients (e.g., the coastal influence on temperature) across the land (McCune 2007; Daly 2006). For this reason, Digital Elevation Models (DEM) are often the most useful variable for spatial estimation of precipitation (Hong et al. 2005; Lookingbill and Urban 2003; Ruiz-Arias et al. 2009), allowing us to capture the altitudinal gradient of these variables.

The use of information from remote sensing is another alternative to account for local scale effects (see examples in Bustos and Meza 2014, Sobrino et al. 2004; Suga et al. 2003; Wan and Li 1997), especially when satellite observations have a fairly good spatial resolution (i.e., between 30 m and 1 km), and is able to capture changes in temperature as a function of land physiography. Unfortunately, these satellite observations have moderate to low temporal resolution, and their use is limited by the presence of clouds. In addition, the moment in which the sensor captures land images may not be adequate to infer climatic conditions. For instance, if maximum temperatures are usually observed after midday, and the satellite passes before noon, the consequent estimate from the processed image can only be regarded as a proxy for maximum temperature (Sandholt et al. 2002).

Regarding record extension, only few methods deal with meteorological time series and their extension into the past (Begert et al. 2005; Perry and Hollis 2005; Sherwood et al. 2008), mainly due to the lack of covariates with the same spatial and temporal structure as the data series. In these cases, estimates not only tend to be biased, but their reliability is sometimes questionable (Chen and Hwang 2000). Some examples of record extension can be found in the case of frost days (Perry and Hollis 2005), the use of memory processes for the case of temperature (Blender et al. 2008), and an interesting approximation followed by Jung-Woo and Yakov (2010) to combine artificial neural networks and regression trees. The authors showed that this method improved accuracy and was more robust when compared to artificial neural networks and regression trees alone.

The NCEP/NCAR Reanalysis project corresponds to a systematic effort to produce data sets for climate monitoring and research that date back to 1948 (Kalnay et al. 1996; Kistler et al. 2001). This data set has been compared with meteorological observations that are not part of the assimilation network to develop forecast systems and/or describe local weather behavior (Bengtsson et al. 2004; Beniston et al. 2012; Bojanowski et al. 2014; Fuka et al. 2013; Harnik and Chang 2003; Kubik et al. 2012; Linares-Rodriguez et al. 2011; Maidment et al. 2012; Montecinos and Aceituno 2003; Wright et al. 2009). Bao and Zhang (2012) used four different Reanalysis products (i.e., NCEP/NCAR, NCEP–CFSR, ERA-Interim, and ERA-40) to evaluate relations between variables such as temperature, relative humidity, and wind speed on the Tibetan Plateau and the same variables measured in a network of 11 radiosondes stations (none of them assimilated into these four Reanalysis data sets). They found consistency between the values of temperature and wind speed measured in the field with those delivered by Reanalysis for an average of 3 months, but not for relative humidity, for which a significant bias was found. The Reanalysis NCEP/NCAR provided the most consistent results. Nevertheless, these relationships can be more consistent in certain places, given the amount of remotely sensed data that provide real information for assimilation interpolations (Ramella and Haimberger 2014).

The use of Reanalysis variables for climatic studies and applied meteorological models has been described before (Bao and Zhang 2012; Beniston et al. 2012; Betts et al. 2006; Fuka et al. 2013; Harnik and Chang 2003; Maidment et al. 2012; Saha et al. 2010; Wright et al. 2009), but few studies have tried to find temporal or spatial dependences of environment variables with predictors from the Reanalysis projects. Linares-Rodriguez et al. (2011) used four variables (total cloud cover, surface temperature, total water vapor column, and total ozone column) from the ERA-interim Reanalysis (Simmons et al. 2007) to generate global radiation data at a daily scale through artificial neural networks. Kubik et al. (2012) used information from MERRA Reanalysis (Rienecker et al. 2011) to obtain wind data through simulations.

This study proposes a methodology for the estimation of maximum and minimum temperatures and precipitation using variables from the NCEP/NCAR Reanalysis project at daily scale. Since the spatial correlation of climatic variables is somehow captured by large-scale features of the atmosphere detectable within the Reanalysis data, it should be possible to find statistical relationships between large-scale variables and local ground observations. In this context, a downscaling methodology for low spatial resolution Reanalysis data (2.5° Lat. × 2.5° Lon.) was generated to estimate daily maximum (Tx) and minimum (Tn) temperatures and Precipitation (Pp) in order to fill data gaps and extend time series back to 1950. The method was applied to nine locations in the Maipo River basin (Central Chile). The estimates are the result of the interaction of observed data at the station level and Reanalysis grid values with high statistical similarity within a region representing the local climate of the area. This joint approach has already been reported by some authors but only as a downscaling approach (Bastola and Misra 2014; Brands et al. 2012; Hwang et al. 2013; Misra et al. 2012; Yoshimura and Kanamitsu 2008); thus, its use in gap filling and record extension needs to be evaluated.

2 Methods

2.1 General characteristics of the study area

The region under study corresponds to the Maipo basin in Central Chile (33° S). Mean annual temperature values are around of 14 °C (20 °C in summer and 7.5 °C in winter). Rainfall is concentrated in the winter months (June to August) with 80% of the total ~ 350 mm falling during these months (Fig. 1). Snow accumulation occurs above 1500 m.a.s.l. during winter. The region is known to have a significant El Niño Southern Oscillation (ENSO) footprint. Montecinos and Aceituno (2003) reported a relationship between precipitation anomalies and ENSO, indicating a statistical dependence between the amount of rainfall in a year and the increase in sea surface temperature in the eastern equatorial Pacific.

Fig. 1
figure 1

Seasonal variation of mean temperature (black line) and precipitation (bars) at the Quinta Normal meteorological station

2.2 Meteorological data

We selected ten meteorological stations with daily records of temperature and precipitation (Table 1), seven from the Dirección General de Aguas (National Water Directorate; DGA) and three from the Dirección Meteorológica de Chile (Meteorological Directorate of Chile; DMC). These stations are spread out and represent different locations within the Maipo basin (Fig. 2). Nine stations were used in the reconstruction experiment. The remaining one (Quinta Normal station) is the only one with a complete time series from January 1, 1914 to the present, so it was used as an independent data set in the algorithm. We evaluated the procedure with and without the inclusion of this reference station as a way to measure the effect of its presence and contrast results in the reconstructed time series against real data.

Table 1 General characteristics of selected stations in the Maipo basin
Fig. 2
figure 2

Spatial distribution of ten selected stations over the Maipo basin

Prior to the development of the model, Tn and Tx data had been grouped by month and transformed into standardized anomalies (mean = 0 and standard deviation = 1) to avoid heteroscedasticity in residuals (Royer and Poirier 2010). In each month, we evaluated the presence of outliers, removing them following the methodology described by Laurikkala et al. (2000) for univariate data.

Since Pp is represented as an occurrence and amount process, we first transformed their values into a time series of binary variables representing occurrences (1) and non-occurrences (0) of the event (defined strictly as days with precipitation greater than zero or not, respectively). For the amount process, we selected only values greater than zero and expressed them as standardized anomalies.

We noted a non-homogeneous distribution of the frequency values registered by precipitation gauges over the ten analyzed stations (Fig. 3). Data shows extraordinarily large frequencies of values in 0.5 classes that suggest a bias in precipitation records, normally associated to measurement errors. Given the large number of observations that fall in each of the 0.5-mm classes, we grouped all values in 0.5-mm range intervals (i.e., those values between 0.1 and 0.4 were rounded to 0.5, those between 0.6 and 0.9 were rounded to 1.0, etc.). Alternatively, we could have introduced assumptions on the distribution of values to disaggregate the data with the risk of invalidating the model for gap filling. Note that for a general methodology, this step is not essential and is to be regarded as a convenient solution, given the nature of the observations available.

Fig. 3
figure 3

Histogram of precipitation in the Cerro Calán station for the 0–5-mm range

2.3 Reanalysis variables

We selected 14 variables from the Reanalysis NCEP/NCAR (Table 2). We used 12 of them for Tn and Tx estimation and the full set (including precipitable water surface and precipitation rate) in the model for the estimation of precipitation occurrence and amount at each station. We selected data from the 500-mbar pressure level and surface data. Some of the variables used are available at both levels, like air temperature and the east-west (u) and south-north (v) components of the Geostrophic Wind (Table 2). We selected the surface level, because it is the closest to the level at which ground observations are taken, whereas the 500-mbar level variables were chosen because some meteorological variables, like precipitation and frosts, are sensitive to large-scale phenomena such as the position and strength of the South American Anticyclone. Moreover, variables from the 500 mbar are normally included in weather forecasting models (Gershunov and Cayan 2003; Flannigan and Wotton 2001) as they usually show significant correlation with ground observations.

Table 2 Reanalysis variables used in the algorithm for gap filling and record extension

2.4 Proposed algorithm

We restricted the domain of Reanalysis data to the coordinates 17.5° to 50° S and 65° to 175° W (Fig. 4). The method is independent of the selected area as it is based on selecting the pixels with the highest level of correlation. However, the two major controllers of Central Chile’s climate can be found in this region (the South American Anticyclone and the El Niño Southern Oscillation phenomenon). Although no mechanistic explanation is provided in this model, and any highly correlated pixel could be used, we believe that the use of this window reduces the risk of incorporating spurious relations. Nevertheless, the majority of the cells selected are found in the 5 × 5 square grid whose centroid corresponds to the cell where the basin is located. Therefore, the method could be applied using only the closest neighbors.

Fig. 4
figure 4

Selected spatial window for searching correlations between station data and Reanalysis variables

We extracted 546 time series (the number of pixels in the restricted Reanalysis domain) for the 14 NCEP/NCAR Reanalysis variables. Data of each Reanalysis variable were previously grouped by month and expressed as standardized anomalies over the period of interest (1950–2012). We correlated each variable of each pixel with the temperature and precipitation series of the nine locations for each month. Then, for each Reanalysis variable, we searched for the pixel (grid cell) having the best (i.e., highest in absolute value) Pearson’s correlation coefficient and extracted its monthly data to create a vector of predictors with the highest linear individual correlation. In a few cases, a log or square root transformation was needed to obtain higher correlation values (Wang and Murphy 2004). The process is repeated independently for the three climatic variables (Fig. 5).

Fig. 5
figure 5

Diagram of the gap filling and record extension algorithm. Additional steps for SPEI calculations are shown

2.4.1 Temperature estimation

After selecting the Reanalysis standardized values of the corresponding months with maximum absolute correlation, a reconstructed time series of monthly dependent and independent variables was built. Orthogonal scores of a principal component analysis with Varimax rotation (Drosdowsky and Chambers 2001) were used to avoid multicollinearity and verify those Reanalysis variables that show the highest influence on the selected scores used in the regression model, according to the eigenvalues selection proposed by Peres-Neto et al. (2003). Finally, for Tn and Tx, a generalized linear mixed model (identity link function) was fitted (Eq. 1) assigning month as an additional variable. The model is fitted following a bidirectional stepwise procedure using Akaike information criterion (Akaike 1974) to select the most parsimonious model.

$$ {\widehat{\mu}}_{ij}=\left({\beta}_0+{\beta}_j+\left(\sum \limits_{h=1}^{12}{\beta}_h\times {X}_{ih}\right)+\theta \times {\mu}_{ij\_ QN}\right) $$
(1)

Here, \( {\widehat{\mu}}_{ij} \)corresponds to the estimate of the anomaly of Tn or Tx for day i in month j; β0 is the model intercept; βj corresponds to the coefficient associated to each month; βh corresponds to fixed effect coefficients for each principal component, Xih; θ is the coefficient associated to recorded value of the anomaly of temperature (either Tx of Tn) at Quinta Normal station (μij _ QN), which is the station with a complete record in this experiment; Xih corresponds to the principal component scores of the standardized predictors from the Reanalysis data (i.e., the 12 variables from surface and 500-mbar levels).

The individual contribution of each Reanalysis variable is evaluated from the PCA loadings. We examined the frequency of absolute values of significant loadings (p > 0.05) over 0.3.

2.4.2 Precipitation estimation

We obtained the binary vector of occurrences (1) and non-occurrences (0) of precipitation for each station. Sometimes, a threshold different than zero is used to classify days at dry/wet days. This method is not sensitive to the selected threshold unless it is applied to a very dry region where the selection of a threshold modifies the relative frequency of days with precipitation. Since precipitation shows a strong seasonal behavior, and summer months usually have few rain events, we fitted a model by season, generating four subtables (in locations with a higher frequency of rainy days, it would be advisable to do this by month). Two more Reanalysis variables were included in this case: precipitable water surface (PWS) and precipitation rate (PR). Once again, we searched for the best linear relationship within station binary data and Reanalysis anomalies (Wang and Murphy 2004).

For the occurrence process, only standardized anomalies of predictors are needed. The algorithm applies a GLMM with a logit link function to estimate the probability of the occurrence of precipitation. In this case, the season enters as an additional variable in the model. The occurrence of a precipitation event on a day is determined to occur when the estimated probability exceeds the 0.5 threshold. Model coefficients and PCA best predictor vector scores were found using a bidirectional stepwise process.

We followed the same method used for temperature estimation (Eq. 1) to detect the presence of recurrent specific components in the model using loading analysis. In this case, we used the logit function of the predictand \( {\widehat{\mu}}_{ij} \) which is calculated as:

$$ \mathrm{logit}\left({\widehat{\mu}}_{ij}\right)={\log}_e\left(\frac{{\widehat{\mu}}_{ij}}{1-{\widehat{\mu}}_{ij}}\right) $$
(2)

Note that the variables involved in the estimation of precipitation occurrence would also be present in a model that predicts precipitation amounts. In fact, there is a strong (nonlinear) relationship between the estimates of the precipitation occurrence in the observed time series and the amount registered (Fig. 6). With this, we can estimate the daily precipitation amounts from the probabilities of occurrence obtained with the binomial GLMM. To achieve this, a generalized linear model with a gamma link function was fitted. Here, the predictor variables are the log values of the discretized rainfall amount for Quinta Normal (a way to assess the spatial local effect of a representative neighbor station) and the estimated probabilities of occurrence for each station obtained from the GLMM in Eq. 2.

Fig. 6
figure 6

Occurrence probabilities of a precipitation event calculated at 5-mm intervals from a generalized linear mixed model for the Cerro Calán station

2.5 Performance of the algorithm

To assess the performance of our method, we compared it against two common interpolation procedures, the inverse distance weighting and spline methods. We selected the values recorded in 1979 at the Pudahuel and Melipilla stations to evaluate the results of the applied methods, and we used Quinta Normal, Cerrillos, and Pirque station as neighbors for the IDW and spline methods. The year 1979 was selected, since temperature and precipitation records are complete for all five.

In addition to the comparison described before, and as a way to evaluate how dependent the method is on pixel (grid cell) selection with the highest absolute correlation value, we applied the proposed method using not the best pixel in terms of absolute correlation coefficient, but the Reanalysis data for the pixel (grid) where the station was located. In this case, we compared the results obtained for both methods for the Cerro Calán station, using the model R2 and RMSE values.

At each station, we performed a cross-validation experiment to test the robustness of the algorithm. We randomly selected a subsample containing 10% of the observations. The model was then fit using the remaining 90% of the observations and applied to the excluded data set to evaluate the performance of the method comparing observed vs estimated values and calculating goodness of fit statistics. The procedure was repeated 100 times to determine mean and variance of RMSE, MAE, Bias, etc. In addition, the Standardized Precipitation Evapotranspiration Index (SPEI) (Paulo et al. 2012; Vicente-Serrano et al. 2010) was used as a pseudo-independent validation method, as it depends on nonlinear combinations of the estimated variables. Potential evapotranspiration is needed for the SPEI calculation, which is estimated from the modified Hargreaves equation (Doggers and Allen 2002). We calculated extraterrestrial radiation (Ra) using the latitude of the stations. We computed a 3-month aggregated SPEI value for the time period 1980–2010 in all stations separately. R2 and RMSE were obtained from the regression between two SPEI series and used to evaluate whether the algorithm was capable of capturing temporal trends and could be used for filling and extension of climatic variables at the basin level.

All the steps and procedures described were conducted in the statistical software R, using the NCEP/NCAR Reanalysis data. Required packages were “ncdf” (Pierce 2011), “RNCEP” (Kemp and Kemp 2012), “reshape” (Wickham 2007), “nlme” (Pinheiro et al. 2012), “lme4 (Bates et al. 2012), “lmerTest” (Kuznetsova et al. 2013), and SPEI (Beguería and Vicente-Serrano 2013).

3 Results

3.1 Minimum and maximum temperatures

The highest correlation values between the Reanalysis data and temperature variables were found in grid cells (pixels) that are located close to the stations. Figure 7 shows an example of Pearson’s correlation values for the Reanalysis variable air temperature at 500 mbar (ATM500) and Tn for the Cerro Calán station. Since Reanalysis values are based on several coarser observations, one would expect this behavior and eventually restrict the domain only to pixels that are in the surroundings of the location whose data record is to be expanded (or missing data is to be filled). In accordance with what is expected theoretically, other variables such as the 500-mbar geopotential height (HGT500mbar), relative and specific humidity at 500mbar (RH500mbar, SH500mbar), and air temperature at 2 m (AT2m), show positive associations. The opposite is verified for the remaining predictors.

Fig. 7
figure 7

Spatial distribution of correlation coefficients between the Cerro Calán station data (33.2° S, 70.5° W) and Reanalysis variable ATM500mbar for July

As a way to generalize the previous results, we calculated correlations between a regional mean of minimum and maximum temperatures as well as mean precipitation and correlated them with the Reanalysis variables for a domain composed only of the closest 25 pixels. Figure 8 presents the frequency with which each pixel is selected as the best linear correlation and, thus, becomes a candidate for its inclusion in a GLMM for the case of minimum temperature. (The same conclusion was obtained when looking at the figures corresponding to maximum temperature and precipitation with a slight tendency of maximum temperature to select more pixels outside the closest domain; data not shown.) It is clear that pixels closer to the basin centroid (33.58° S, 70.59° W) are more likely to be selected based on highest absolute correlation.

Fig. 8
figure 8

Relative frequency of selection of pixels (grid cells) in the 5 × 5 grid whose centroid corresponds to the location of the basin (dashed grid) for the estimation of minimum temperature. See Table 2 for the description of the Reanalysis variables

3.1.1 Precipitation

In the case of precipitation, we found that the Reanalysis variables PWS, PR, RH500mbar, SH500mbar, AT2m, UW2m, UW500mbar, and VW500mbar have a consistent positive association with almost all station data. In the case of AT2m, the results were somewhat surprising, because in this type of regime, precipitation has a negative correlation with maximum temperature and a positive association with minimum temperature.

In this case, there is also a strong association between Reanalysis variables within the nearest pixels and the precipitation variable of the basin; however, only the variables NLR2m, PWS, PR, RH500mbar, and VW2m concentrated more than 60% of the maximum absolute correlations in the central pixel or its eight neighbors (a figure that increase up to 70% when we included 16 border pixels).

3.2 Estimations

3.2.1 Maximum and minimum temperatures

Since the algorithm for temperature estimation involves the development of one model per station, and that each variable can be estimated with up to 26 coefficients in Eq. 1, we only present the summary statistics of model performance rather than reporting the individual regression coefficients. The mean R2 value for the estimation of both minimum and maximum temperatures was 0.72 for the case where the reference station of Quinta Normal (QN) was not included, and 0.75 when added (details by station are presented in Table 3). The Cerro Calán station showed the highest correlation (R2 = 0.78) using only the Reanalysis variables; this value increases to 0.81 when QN station was included. The RMSEs for minimum and maximum temperatures were 1.40 and 1.22 °C, respectively. All coefficients are significant at 5% level.

Table 3 Summary of model fit statistics for minimum (Tn) and maximum (Tx) temperature

We observe that, in general, the predictive power of the models increased by 2.4% when the reference station QN was included. For the case of minimum temperature, such increase is 3.3%, whereas the RMSE varied from 1.57 to 1.49 °C when QN was added. Furthermore, seasonality is very important. In fact, the use of a coefficient that accounts specific effects at monthly level increases the predictive power of the GLMM by 8.8% as compared to using non-grouped data.

Regarding the contribution of the NCEP/NCAR Reanalysis variables, in all fitted models, the first and second principal components were selected. For Tn, the variables HGT500mbar, SLP, AT500mbar, and VW5000mbar had significant loading values over 0.3 contrasted with the first component (PC1) and were part in five of the nine GLMMs fitted. Using the same indicator, variables NLR2m, RH500mbar, and VW2m are more relevant in PC2. For the case of Tx, the variables HGT500mbar, SHF2m, AT2m, AT500mbar, and UW500mbar are significant in PC1, participating in the estimation for all nine stations. PC2 had NLR2m, RH500mbar, AT2m, and VW2m as the most significant variables.

HGT500mbar and AT500mbar had significant loading values with PC12. The scores of this principal component related with HGT500mbar participated in eight of the nine estimation for Tn, and in all stations for Tx. AR500mbar participated thought the PC12 in the 18 possible estimations. In 17 of the 18 estimations, at least nine of the 12 principal components participated. For the Embalse el Yeso station, only seven PCs were needed to obtain a model with the lowest AIC.

3.2.2 Precipitation

The principal components PC1 and PC4 were the only variables used to estimate the occurrence of precipitation events in the GLMM, explaining 14.38 and 8.64% of the variance. Individually, PC1 loadings were associated with HGT500mbar (above 0.3 in the all the 9 stations), UW500mbar (used in 8 stations), and AT500mbar (5 stations). Meanwhile, PC4 was related to RH500mbar (7 stations), SH500mbar (6 stations), SHF2m, and VW500 (5 stations each). Despite PC2 having significant loadings in eight of the nine stations, with the additional variable PWS, this principal component was selected in only three cases (Cerro Calán, Cerrillos, and Embalse el Yeso). The other added variable associated to precipitation (PR) had a presence in six out of nine of the loadings of PC6 and was used in the GLMM in six occasions.

Occurrence probabilities of precipitation were estimated for each station. The performance of the binomial GLMM increased by 25.5% on average (6.2–39.6%) when the reference station (QN) was added to the model. Mean hit rate values (Wilks 2006) were around 90% in all stations, with the lowest being Melipilla (87%) and the highest El Yeso Embalse (95%).

Unlike the estimation of extreme temperatures and the precipitation occurrence probabilities, the effect of the reference station in the estimation of precipitation amounts is very important (Table 4). The poorest result is observed in the station El Yeso Embalse probably due to the fact that the station has the highest elevation, and thus, some of the precipitation falls as snow in winter time. In addition, we noticed that in all precipitation estimations, we did not obtain an estimated value greater than the highest observed value during a single day. Figure 9 shows water precipitation amounts for discretized data and the final disaggregation to the complete time span for Cerro Calán station.

Table 4 Summary of model fit statistics for the estimation of precipitation amount
Fig. 9
figure 9

a Estimated vs observed precipitation data for discretized values in 5-mm intervals. b Estimation of the amount of precipitation greater than 0.5 mm. Both graphics refer to the Cerro Calán station

3.3 Comparison with other methods and validation

We compare the results with a simple and yet commonly used interpolation method. We used a regional regression model with several correlated stations and geographical variables derived from a digital elevation model as covariates (lat, long, distance from shoreline, and major water bodies, vegetation indices derived from remote sensing). The model was run for the period 2000–2012 at 1-km spatial resolution. We applied it to Pirque and Los Panguiles stations with the following results: min temperature (Tn), at los Panguiles had an RMSE value of 2.52 °C and R2 of 0.827, while at Pirque, the values were 0.8 °C and 0.876, respectively. For the case of maximum temperature (Tx), values found for Los Panguiles were an RMSE value of 2.03 and R2 of 0.907, and at Pirque, values were 1.67 °C and 0.905, respectively.

We then used IDW interpolation method to estimate Tx and IDW and spline interpolation for Pp. We estimated the values registered at Pudahuel and Melipilla stations during year 1979. These methods were compared with the results obtained using the method based on Reanalysis without the inclusion of the observations of Quinta Normal as covariate.

IDW outperforms the method based on reanalysis data. At Pudahuel station, RMSE values of 0.85 and 3.3 °C for IDW and the method based on Reanalysis were obtained, whereas RMSE values of 2.2 and 3.4 °C were obtained for these methods at the Melipilla station. IDW better reproduce heat waves (or extreme cold temperatures) as it uses observed values in the neighbor stations at the moment of interpolation. This advantage becomes more evident if the neighbor stations are close in space, as they tend to be highly correlated.

In the case of precipitation, the method based on the Reanalysis data outperforms IDW and spline methods for the Pudahuel station (RMSE of 0.78, 1.24 and 3.44 mm, respectively) and gives better results than the spline method for the Melipilla station (RMSE of 3.94 for Reanalysis, 2.6 for IDW and 14.6 mm for spline). IDW highly depends on the occurrence of precipitation in one station the occurrence in the neighbor stations. Spline functions tend to smooth values, but the effect of distance is very relevant leading to inconsistencies when stations are not strongly correlated.

Results of the cross validation experiment are presented in Table 5. While maximum and minimum temperatures show little variation among stations in terms of RMSE and Bias and show the highest R2 values, precipitation shows a higher variability. Results are more variable for the case of precipitation. In two stations (El Yeso Embalse and Melipilla), R2 values are around or less than 0.5; nevertheless, the regression coefficients of the model are all significantly different than zero at 0.05 level contributing to partially explain the variance of the observations and, thus, significantly improve the estimations of missing data.

Table 5 Summary of goodness of fit statistics of a cross validation experiment for each station. Values in parentheses correspond standard deviations of 100 simulations

To test the performance of the method in generating secondary information, we calculated the Standardized Precipitation Evapotranspiration Index. SPEI is an indicator of the temporal changes between dry periods (SPEI < 0) and wet periods (SPEI > 0). We observed a good agreement when comparing SPEI values calculated using the gap filling/record extension method and when using real data (Fig. 10) in both series for the same time period. The average of R2 was 0.87, varying between 0.78 for Melipilla station to 0.91 for Cerrillos station (Table 6). It is possible to detect climatic trends in all stations consistent with observed data.

Fig. 10
figure 10

a SPEI values calculated with real Tn, Tx, and Pp data. b SPEI values calculated with estimated data of Tn, Tx, and Pp. Both graphics refer to the Cerrillos station (R2 = 0.87)

Table 6 Coefficient of determination (R2) between SPEI values calculated from real and estimated data

4 Discussion

There is no limitation regarding of the plausible Reanalysis variables to be used in the presented methodology: however, the selection of those having the highest temporal correlation with climatic variables increase the predictive power of the method, especially on a monthly basis. Therefore, any climatic variable could be potentially estimated for reconstruction of past time series. The main limitation is the initial extension of the candidate time series being reconstructed, considering that a large temporal database is crucial for detecting temporal trends and periodicity. Also, in this case, monthly decomposition needs a sufficient number of years to produce estimates with high goodness-of-fit. This work used data with at least 30 years of continuous data. The length of the time series is related to the consistence of the obtained products, independent of the estimation procedure applied (Bengtsson et al. 2004). Nevertheless, in most cases, this criterion is used for filling temporal data gaps with its own data (Tardivo and Berti 2012) than for data extension in any time direction.

At the meteorological station level, the values estimated for some particular variable to fill temporal gaps have been the focus in the use of neighbor stations (Sterl 2001; Tardivo and Berti 2014). In this case, the restriction lies in the time series length of the shortest neighbor station, which means that the final result is limited by data length within the neighborhood. With sufficiently large series for calibration, this method can be useful in expanding records and detecting temporal trends in variables like temperature, precipitation, and solar radiation (Betts et al. 2006; Bojanowski et al. 2014; Refslund et al. 2014).

Since 1979, the assimilation algorithm of the NCEP/NCAR Reanalysis began using remote sensing data (Kalnay et al. 1996), modifying the database slightly with respect to the 1948–1971 period. This could be a source of error that may introduce problems in trend detection and be misleading.

In precipitation, the error associated with the observer can be very important. The discretization approach followed here is a solution but might not be the best one, particularly if a particular study requires greater precision in the estimation of the amount of precipitation.

The selection of potential predictors is highly relevant in improving the results of the algorithm. The selection of the Reanalysis variables was conducted by choosing those with a seasonal and meteorological component; their combined effects increase model predictive power (Hwang et al. 2013; Yoshimura and Kanamitsu 2008). In addition, the use of a reference station has a positive effect in improving estimates, especially if a strong correlation with the candidate station is found (Daly 2006), because it provides non-explained local effects. This inclusion is more relevant in the estimation of precipitation than in the case of temperature.

The local effect in each station, mainly influenced by its topography and altitude, can be detected within the heterogeneous area of the Maipo basin. Three stations (Tobalaba, Quinta Normal, and Pudahuel) are located in the Santiago metropolitan area, with a marked heat island effect, which can affect estimates (Buyadi et al. 2013). However, monthly aggregated estimates in the case of extreme temperature and seasonal estimations of rainfall appear to mask this negative effect.

Not all complementary variables showed the same degree of contribution to the models in the algorithm. AT2m, AT500mbar, and RH500mbar were the most significant in models of extreme temperatures, while UW500mbar played a significant role in estimating occurrences of precipitation. The variables HGT500mbar and RH500mbar were found to have a high presence through their weight in the precipitation and both temperature models.

Concerning spatial correlation selection, over 70% of these were located in a 25-pixel neighborhood around the central pixel from the station location for almost all complementary variables. Some selected pixels had low geographical and physiographical association with the basin. However, we decided not to omit them, because their inclusion improved the final adjustment. Both positive and significantly negative correlations appeared along the defined spatial window, and although there were no spatial criteria, we selected only those pixels with a greater absolute correlation, no matter where they were.

One potential limitation of the method proposed here that arises when the length of the time series to be filled or expanded is significantly large is that the assumption of homogeneity of variance may no longer hold as well as other characteristics of the distribution. In our example, we tested the hypothesis that the reconstructed minimum temperature followed the normal distribution using the Kolmogorov-Smirnov test. The hypothesis is rejected only in 10% of the 108 time series (9 stations times 12 months). On the other hand, only half of the time series fulfill the assumption of variance homogeneity when looking at the Bartlett homogeneity test. Because of the statistical properties of a parametrical method, extreme temperature values are normally underestimated (we detected a deviation of 3.6 and 3.4 °C for maximum and minimum temperatures, respectively). If the number of data gaps or length of the time series to be reconstructed is large, there is a risk of obtaining a heteroscedastic model that fails in the estimation of extreme temperature values.

5 Conclusion

We have presented a methodology for filling gaps and expanding data records of minimum and maximum temperatures and precipitation within a GLMM framework, using climatic information from the Reanalysis project. Results show that the estimations of temperature and precipitation are good and correctly represent the spatio-temporal pattern observed in the basin under study. The use of a reference station improves the estimations of precipitation amounts, although the effect is not significant if the model is fitted considering only the probabilities of rainfall occurrences extracted from the binomial GLMM.

This methodology can be used to fill gaps and expand data records of information gathered since 1950. Since it is not exclusive, new weather variables may be included for estimation. It is important to emphasize that in the future, land use associated with the weather station may be considered in order to include the effect of urban heat islands, water bodies, and other land uses that significantly influence local climatic variables.