1 Introduction

Sri Lanka is a tropical country situated in the North Indian Ocean, between 5° 55′ and 9° 51′ North latitude and between 79° 42′ and 81° 53′ East longitude. Its surface is irregular: low-lying coastal plains run inland from the northern and eastern shores, and the south-central part of the country contains the highest mountains. The rainfall pattern is mainly influenced by the Asian monsoon system, which depends on the winds of the Indian Ocean and the Bay of Bengal. This provides the basis for dividing the year into four climatic seasons: two monsoon periods and two inter-monsoon periods. The Southwest Monsoon (SWM) occurs from May to September and the Northeast Monsoon (NEM) from December to February. Between them fall the First Inter-Monsoon (IM1), from March to April, and the Second Inter-Monsoon (IM2), from October to November.

Sri Lanka has an agriculture-based economy; rainfall is the primary source of water for rice, the staple food of Sri Lankans, as well as for other agricultural crops. Floods and droughts cause sizable reductions in crop yield [1], although droughts can be managed with occasional, efficient irrigation. Good rainfall data are therefore necessary so that irrigation projects can be planned efficiently. Given these direct and indirect benefits, it is vital to investigate the variability in rainfall and its relationship with topographic variables.

Measuring rainfall at every point in space is not practical. The Department of Meteorology, Sri Lanka (DMSL) collects rainfall amounts from more than 1000 meteorological stations located throughout the country, and rainfall in Sri Lanka has previously been estimated from these data using different statistical methods.

A domain, or subpopulation, is considered large if the domain sample size is large enough to make direct estimates. A domain can be a state, county, district, etc. It is considered small if the domain-specific sample size is not large enough to make direct estimates. The term “small areas” is used to denote such domains [2]. Small Area Estimation (SAE) comprises several statistical techniques that can be used to make estimates for small areas with an adequate level of precision.

According to [3], one possible way to estimate rainfall is to use SAE techniques. In SAE, the data obtained at discrete, irregularly located points are considered a subpopulation of the total population of interest, which is all the stations maintained by DMSL. Here, a domain-specific sample may not be large enough to yield direct estimates with adequate precision, and many domains of interest may have zero sample size. These domains are referred to as “small areas” [2]. Hence Small Area Estimation can be used to estimate parameters for such domains.

SAE has been used in studies in various fields [2], but few of them concern the estimation of rainfall. [1] applied the nested error regression model to estimate the area under corn and soybeans for each of 12 counties in North Central Iowa, using farm interview data in conjunction with LANDSAT satellite data. The authors calculated the ratio of the model-based standard error of the empirical best linear unbiased prediction (EBLUP) estimate to that of the survey regression estimate. This ratio decreased from about 0.97 to 0.77 as the number of sample segments, n, decreased from 5 to 1; the reduction in standard error is considerable when n < 3. [4] carried out a study on robust small area estimation. Basic area-level and unit-level models have been studied in the literature to obtain EBLUP estimators of small area means. [5] used a regression synthetic estimate to produce county estimates of wheat production in Kansas. For this they used a non-probability sample of farms, assuming a linear regression model (without the small area effect) relating the wheat production of a farm in a county to predictor variables with known county totals. [6] generated small area statistics for household income in the Southern Province of Sri Lanka, using a composite estimator that combines the two broad types of estimators, direct (sample-based) and indirect (model-based). [7] states that in Uganda, estimates of under-five mortality based on survey data have only been available at national and regional levels. That study uses small area estimation techniques in a Hierarchical Bayes framework to derive estimates of the relative risk of under-five mortality down to district level.

Another commonly used technique is spatial interpolation of rainfall amounts [8, 9]. In spatial interpolation, the rainfall amounts at the stations maintained by DMSL are treated as a set of sample points, and an interpolation method is used to predict rainfall values at every other point. For a point where the value is unknown, some form of weighted average of the rainfall values at surrounding stations is taken as the prediction. Generally, nearby points receive higher weights than distant points. Kriging is one of the most common spatial interpolation techniques for spatially continuous data such as rainfall.
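The weighted-average idea can be illustrated with inverse distance weighting, one of the simplest deterministic interpolators. The following is only a minimal sketch and is not the method used in this study; the function name `idw_predict`, the power parameter and the station values are hypothetical.

```python
import math

def idw_predict(stations, target, power=2.0):
    """Inverse-distance-weighted average of station values at `target`.

    stations: list of (x, y, rainfall) tuples in projected coordinates.
    target:   (x, y) of the location where rainfall is unknown.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, rain in stations:
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:
            return rain  # target coincides with a station
        w = 1.0 / d ** power  # nearer stations receive larger weights
        weighted_sum += w * rain
        weight_total += w
    return weighted_sum / weight_total

# Hypothetical stations (km, km, mm); the values are illustrative only.
stations = [(0.0, 0.0, 100.0), (10.0, 0.0, 200.0), (0.0, 10.0, 300.0)]
estimate = idw_predict(stations, (1.0, 1.0))
```

Because the target lies close to the first station, the estimate stays close to that station's value. Kriging differs in that the weights are derived from a fitted model of spatial correlation rather than from distance alone.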

[10] studied geostatistical approaches for incorporating elevation into the spatial interpolation of rainfall. The author presented three multivariate geostatistical algorithms for incorporating a digital elevation model into the spatial prediction of rainfall: simple Kriging with varying local means, Kriging with an external drift, and collocated Kriging. Cross validation was used to compare the prediction performance of the three geostatistical interpolation algorithms with a straightforward linear regression of rainfall against elevation. [11] compared two statistical methods for spatial interpolation of Canadian monthly mean climate data. Thirty-year monthly mean minimum and maximum temperature and precipitation data from regions in western and eastern Canada were interpolated using thin-plate smoothing splines (ANUSPLIN) and a statistical method termed ‘Gradient plus Inverse-Distance-Squared’ (GIDS). Data were withheld from approximately 50 stations in each region, and monthly mean values for each climatic variable at those locations were predicted using both methods. [12] carried out a comparative analysis of different techniques for spatial interpolation of rainfall data to create a serially complete monthly time series of precipitation for Sicily, Italy. In that study, deterministic spatial interpolation algorithms such as inverse distance weighting, simple linear regression, multiple regression, geographically weighted regression and artificial neural networks, and geostatistical models such as ordinary Kriging and residual ordinary Kriging, were applied to the mean annual and monthly rainfall data. The interpolation methods were fitted using a subset of the available rainfall data set (the modeling set), while the remaining subset (the validation set) was used to compare the results.

Kriging is an interpolation technique that considers both the distance and the degree of variation between known points when estimating values in unknown areas. The prediction is a weighted linear combination of the known sample values around the point to be estimated. Several types of Kriging exist, each suited to a different type of spatial data and each with different underlying assumptions; examples include simple, ordinary, universal and block Kriging. In this study universal Kriging is used. In universal Kriging, the expected values of the sampled points are modeled as a polynomial trend, and Kriging is carried out on the difference between this trend and the values of the sampled points.
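In symbols, the usual textbook formulation of universal Kriging can be written as follows. The notation is an assumption introduced here for illustration and does not appear in the source: \( f_{k} \) are the trend basis functions (with \( f_{0} \equiv 1 \)), \( \delta(s) \) is a zero-mean spatially correlated residual, \( s_{1}, \ldots, s_{n} \) are the station locations and \( s_{0} \) is the prediction location.

$$ Z(s) = \sum_{k=0}^{p} \beta_{k} f_{k}(s) + \delta(s) $$
$$ \hat{Z}(s_{0}) = \sum_{i=1}^{n} \lambda_{i} Z(s_{i}), \qquad \sum_{i=1}^{n} \lambda_{i} f_{k}(s_{i}) = f_{k}(s_{0}), \quad k = 0, \ldots, p $$

The weights \( \lambda_{i} \) minimize the prediction variance subject to these constraints, which make the predictor unbiased for any value of the trend coefficients; the constraint for \( k = 0 \) forces the weights to sum to one.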

Each method has its own advantages and disadvantages. The purpose of this paper is to evaluate and compare the accuracy of SAE and Kriging techniques in estimating seasonal rainfall in Sri Lanka.

2 Methodology

This study is based entirely on data available from DMSL and the online application GPS Visualizer. Monthly total rainfall data, along with the longitude and latitude of 100 meteorological stations maintained by DMSL, were available for the year 2011. [3] used these data to estimate precipitation in Sri Lanka using spatial interpolation. The elevations of the 100 stations were obtained using GPS Visualizer.

Although an extensive network of meteorological stations collects rainfall data throughout the country, many stations do not function properly, for various reasons. Rainfall records therefore contain missing as well as suspicious data. Of the 100 meteorological stations, 24 had missing values for some months. Since 24% of the stations had missing data, imputation would distort the structure of the dataset, so stations with missing observations were discarded. Suspicious data were also discarded.

The remaining 75 meteorological stations were used in this study. Of these, 61 stations were used for model building. They were selected using stratified sampling, with each province treated as a stratum and allocation proportional to area. The remaining 14 stations were used for model validation.

Latitude and longitude locate an exact position on the surface of the earth, but they are not uniform units of measure. Because a degree of latitude or longitude does not correspond to a standard length, it is difficult to measure distances between points or areas accurately, or to display data on a map or computer screen. GIS analysis and mapping require a more stable coordinate framework such as a projected coordinate system, a two-dimensional representation of the earth with constant lengths, angles and areas across the two dimensions. For this study, geographic coordinates were projected to x and y coordinates using the R statistical software.
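The source does not state which projection was applied. As a rough illustration of what such a conversion does, the sketch below uses a simple equirectangular approximation on a spherical earth; this is an assumed stand-in, not the projection used in the study, and the reference point (7.9° N, 80.8° E) is a hypothetical choice near the centre of the island.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation

def project_equirectangular(lat_deg, lon_deg, lat0_deg, lon0_deg):
    """Return (x, y) in km east and north of the reference point (lat0, lon0)."""
    lat0 = math.radians(lat0_deg)
    # East-west distances shrink with the cosine of the reference latitude.
    x = EARTH_RADIUS_KM * math.radians(lon_deg - lon0_deg) * math.cos(lat0)
    y = EARTH_RADIUS_KM * math.radians(lat_deg - lat0_deg)
    return x, y

# Hypothetical reference point and a location one degree north-east of it.
origin = project_equirectangular(7.9, 80.8, 7.9, 80.8)
x_km, y_km = project_equirectangular(8.9, 81.8, 7.9, 80.8)
```

For an area the size of Sri Lanka the distortion of this approximation is small, but a proper map projection (for example a transverse Mercator grid) would normally be preferred for analysis.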

3 Small Area Estimation

Three different small area models [13, 14] were fitted to find out the most suitable model for each season.

3.1 Generalized Linear Mixed Model (GLMM)

The Generalized Linear Mixed Model (GLMM) is an extension of the generalized linear model (GLM). In mixed models the linear predictor includes a random effects term in addition to the fixed effects term. Random effects help in fitting the model by accounting for different types of hidden structure. GLMMs are commonly used in small area estimation. Most small area models can be considered special cases of the following general linear mixed model, Rao (2003).

$$ \mathbf{y}^{P} = \mathbf{X}^{P} \beta + \mathbf{Z}^{P} \mathbf{v} + \mathbf{e}^{P} $$
(1)

Here \( \mathbf{e}^{P} \) and \( \mathbf{v} \) are independent with \( \mathbf{e}^{P} \sim N\left( 0, \sigma^{2} \boldsymbol{\varphi}^{P} \right) \) and \( \mathbf{v} \sim N\left( 0, \sigma^{2} \mathbf{D}\left( \boldsymbol{\lambda} \right) \right) \), where \( \boldsymbol{\varphi}^{P} \) is a known positive definite matrix and \( \mathbf{D}\left( \boldsymbol{\lambda} \right) \) is a positive definite matrix which is structurally known except for some parameters \( \boldsymbol{\lambda} \), typically involving ratios of variance components of the form \( \sigma_{i}^{2} /\sigma^{2} \). \( \mathbf{X}^{P} \) and \( \mathbf{Z}^{P} \) are known design matrices and \( \mathbf{y}^{P} \) is an N × 1 vector of population y-values.
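As a short worked step, the first two moments of the population vector follow directly from model (1) and the independence of the random effects and the errors. The transpose notation \( (\cdot)^{T} \) is introduced here for illustration.

$$ E\left( \mathbf{y}^{P} \right) = \mathbf{X}^{P} \beta, \qquad \operatorname{Var}\left( \mathbf{y}^{P} \right) = \sigma^{2} \left[ \boldsymbol{\varphi}^{P} + \mathbf{Z}^{P} \mathbf{D}\left( \boldsymbol{\lambda} \right) \left( \mathbf{Z}^{P} \right)^{T} \right] $$

The second term shows how the random effects induce correlation between units that share a row pattern in \( \mathbf{Z}^{P} \), for example units belonging to the same area.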

3.2 Unit Level Models Without Area Level Variances

The model used here is,

$$ y_{ij} = \mu_{ij} + \varepsilon_{ij} $$
$$ \mu_{ij} = \alpha + \beta x_{ij} $$
(2)

where \( y_{ij} \) is the unit level target variable, \( x_{ij} \) is the unit level covariate, and the \( \varepsilon_{ij} \) are assumed to be normally distributed random variables with mean 0 and variance \( \sigma^{2} \).
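With a single covariate, model (2) is an ordinary linear regression, and its parameters can be estimated by least squares. The sketch below shows the closed-form estimates; the function name `fit_simple_ols` and the data are hypothetical, and the source does not state which estimation routine was used.

```python
def fit_simple_ols(x, y):
    """Least-squares estimates (alpha, beta, sigma2) for y = alpha + beta * x + error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    residuals = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]
    # Unbiased estimate of the common error variance (two fitted parameters).
    sigma2 = sum(r * r for r in residuals) / (n - 2)
    return alpha, beta, sigma2

# Hypothetical covariate (elevation in m) and an exactly linear response.
elevation = [10.0, 50.0, 120.0, 300.0, 800.0]
response = [2.0 + 0.01 * e for e in elevation]
alpha_hat, beta_hat, sigma2_hat = fit_simple_ols(elevation, response)
```

Because the illustrative response is exactly linear, the fit recovers the intercept 2 and slope 0.01 with a residual variance of essentially zero.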

3.3 Unit Level Models with Area Level Variances

This model allows the within-area variation to differ between areas:

$$ y_{ij} = \mu_{ij} + \varepsilon_{ij} $$
$$ \mu_{ij} = \alpha + \beta x_{ij} $$
(3)

where \( y_{ij} \) is the unit level target variable, \( x_{ij} \) is the unit level covariate, and the \( \varepsilon_{ij} \) are assumed to be normally distributed random variables with mean 0 and variance \( \sigma_{i}^{2} \), where \( \sigma_{i}^{2} \) is the variance of the units in area i. Here the districts of the country are taken as areas. Hence there are 24 areas in the model.
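One simple way to fit a model of this kind is a two-step procedure: estimate a variance for each area from the residuals of an initial fit, then refit with weights inversely proportional to those variances. The sketch below illustrates that idea under stated assumptions; it is not necessarily the estimation method used in this study, and the function names, the pooled-variance fallback for areas with fewer than two units, and the data are all hypothetical.

```python
from collections import defaultdict

def area_variances(residuals, areas, min_units=2):
    """Estimate sigma_i^2 per area as the mean squared residual within the area."""
    grouped = defaultdict(list)
    for r, a in zip(residuals, areas):
        grouped[a].append(r)
    pooled = sum(r * r for r in residuals) / len(residuals)
    variances = {}
    for a, rs in grouped.items():
        # Assumption: areas with too few units fall back to the pooled variance.
        variances[a] = sum(r * r for r in rs) / len(rs) if len(rs) >= min_units else pooled
    return variances

def fit_weighted_ols(x, y, weights):
    """Weighted least-squares estimates (alpha, beta) for y = alpha + beta * x."""
    w_total = sum(weights)
    mean_x = sum(w * xi for w, xi in zip(weights, x)) / w_total
    mean_y = sum(w * yi for w, yi in zip(weights, y)) / w_total
    sxx = sum(w * (xi - mean_x) ** 2 for w, xi in zip(weights, x))
    sxy = sum(w * (xi - mean_x) * (yi - mean_y) for w, xi, yi in zip(weights, x, y))
    beta = sxy / sxx
    return mean_y - beta * mean_x, beta

# Hypothetical residuals from an initial fit, labelled by area (district).
residuals = [1.0, -1.0, 3.0, -3.0, 0.5]
areas = ["A", "A", "B", "B", "C"]
variances = area_variances(residuals, areas)
weights = [1.0 / variances[a] for a in areas]  # requires every variance to be positive
```

Units from areas with large residual variance then carry less weight in the refit, which is the practical effect of allowing \( \sigma_{i}^{2} \) to vary by area.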

After each model was fitted, rainfall values were predicted for the stations in the validation data set. The root mean squared error (RMSE) and the correlation coefficient between predicted and observed rainfall values were then used to compare the three small area models and to find the most appropriate model for each season.
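The two comparison measures can be computed as in the following minimal sketch. The function names are hypothetical, and the correlation shown is the Pearson coefficient, which is assumed here because the source does not name the coefficient explicitly.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def pearson_r(observed, predicted):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)
```

The two measures capture different things: a model that systematically underestimates high rainfall can still show a high correlation while having a large RMSE, so both are worth reporting.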

4 Kriging

4.1 Trend Surface Analysis

The initial step of spatial interpolation using the classical geostatistical method is to estimate the mean function of the process under study. The main objective of trend surface analysis is to explain as much of the variation in rainfall as possible with the available covariates. For spatial data such as rainfall, geographic variation may depend on the elevation (z) of the location in addition to the projected coordinates x and y. Only the linear (first order) and quadratic (second order polynomial) trend models were fitted, because the number of coefficients to be estimated increases with the order and only a limited number of data points were available. When the regression model was fitted to the untransformed response variable, with the x and y coordinates and elevation z as explanatory variables, the residual plots appeared to violate the assumptions of constant variance and normality. To overcome this problem, a natural log transformation was applied to the response variable. After the log transformation the residual plots appeared satisfactory.

First order polynomial

$$ \ln \left( {r_{xxx} } \right) = \beta_{0} + \beta_{1} x + \beta_{2} y + \beta_{3} z $$
(4)

Second order polynomial

$$ \ln \left( {r_{xxx} } \right) = \beta_{0} + \beta_{1} x + \beta_{2} y + \beta_{3} z + \beta_{4} xy + \beta_{5} yz + \beta_{6} xz + \beta_{7} x^{2} + \beta_{8} y^{2} + \beta_{9} z^{2} $$
(5)

where rainfall is denoted by \( r_{xxx} \) and xxx denotes the season being modeled (IM1, SWM, IM2, or NEM).
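The first order trend in Eq. (4) can be fitted by ordinary least squares on the log scale. The sketch below solves the normal equations with a small Gaussian elimination routine so that it needs no external libraries; the function names and the data are hypothetical, and the study itself used R rather than this code.

```python
import math

def solve_linear_system(a, b):
    """Solve a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - s) / m[i][i]
    return coef

def fit_first_order_trend(points, rainfall):
    """Fit ln(r) = b0 + b1*x + b2*y + b3*z and return [b0, b1, b2, b3]."""
    design = [[1.0, x, y, z] for x, y, z in points]
    log_r = [math.log(r) for r in rainfall]
    k = 4
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * v for row, v in zip(design, log_r)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict_trend(beta, point):
    """Back-transform the fitted log-scale trend to the rainfall scale."""
    x, y, z = point
    return math.exp(beta[0] + beta[1] * x + beta[2] * y + beta[3] * z)

# Hypothetical stations: (x km, y km, elevation km) and rainfall generated from a known trend.
true_beta = [5.0, 0.01, -0.02, 0.5]
points = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.1), (0.0, 10.0, 0.2),
          (10.0, 10.0, 0.5), (5.0, 3.0, 1.0), (2.0, 8.0, 0.3)]
rainfall = [predict_trend(true_beta, p) for p in points]
fitted_beta = fit_first_order_trend(points, rainfall)
```

Note that exponentiating a log-scale prediction estimates the median rather than the mean of rainfall on the original scale. The second order model in Eq. (5) would only add the six product and squared columns to the design matrix.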

4.2 Estimating Spatial Correlation: The Variogram

The residuals from the trend surface, i.e. the residuals of the fitted mean functions, were further analyzed for their spatial structure. In geostatistics, spatial correlation is modeled by the variogram.

Sample variograms of the residuals were obtained for each season (IM1, SWM, IM2 and NEM) and then used to fit a suitable variogram model for each season. Only certain models (i.e. mathematical functions) that are known to be positive definite are used in modeling the variogram. Of the many variogram model types available, only the spherical, exponential and Gaussian models were used. For each season, several variogram models were fitted using different values of the sill and range parameters in the R statistical software. The nugget was always forced to zero. Initial sill and range values were chosen by inspecting the sample variogram. For each fitted model the sum of squared errors (SSE) was obtained, and the model type with the lowest SSE was selected as the final variogram model.
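The procedure described above can be sketched as follows: compute a sample variogram from the residuals, evaluate candidate models with a zero nugget, and keep the candidate with the lowest SSE. This is an illustrative sketch only; the function names are hypothetical, the model formulas follow one common parameterization, and the SSE here is unweighted, which may differ from the criterion reported by the R routine used in the study.

```python
import math

def empirical_variogram(points, values, bin_width, n_bins):
    """Return [(mean_distance, semivariance)] for each non-empty lag bin."""
    dist_sum = [0.0] * n_bins
    gamma_sum = [0.0] * n_bins
    count = [0] * n_bins
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            h = math.hypot(points[i][0] - points[j][0], points[i][1] - points[j][1])
            b = int(h // bin_width)
            if b < n_bins:
                dist_sum[b] += h
                gamma_sum[b] += 0.5 * (values[i] - values[j]) ** 2
                count[b] += 1
    return [(dist_sum[b] / count[b], gamma_sum[b] / count[b])
            for b in range(n_bins) if count[b] > 0]

def spherical(h, sill, rng):
    if h >= rng:
        return sill
    ratio = h / rng
    return sill * (1.5 * ratio - 0.5 * ratio ** 3)

def exponential(h, sill, rng):
    return sill * (1.0 - math.exp(-h / rng))

def gaussian(h, sill, rng):
    return sill * (1.0 - math.exp(-(h / rng) ** 2))

def sse(empirical, model, sill, rng):
    """Unweighted sum of squared differences between sample and model variogram."""
    return sum((gamma - model(h, sill, rng)) ** 2 for h, gamma in empirical)

def select_model(empirical, candidates):
    """candidates: list of (name, model_function, sill, range); returns the lowest-SSE one."""
    return min(candidates, key=lambda c: sse(empirical, c[1], c[2], c[3]))

# Hypothetical residuals at three collinear stations (coordinates in km).
sample = empirical_variogram([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)], [0.0, 1.0, 2.0], 1.0, 3)
```

In practice each candidate would be tried with several sill and range values, as described above, rather than with a single pair per model type.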

4.3 Directional Variograms

Directional variograms were obtained in the directions 0°, 45°, 90° and 135° to see whether the structure of the variogram changes from one direction to another. In other words, directional variograms were used to explore possible anisotropy. (Geometric) anisotropy is detected when the same sill is present in all directions but the range changes with direction. In this study the directional variograms did not show any directional dependence, so no adjustment for anisotropy was required [15,16,17,18,19,20,21].
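A directional variogram restricts the station pairs to those whose separation vector points roughly along a given direction. The sketch below classifies each pair into one of the four directions with an angular tolerance of 22.5° and averages the semivariance per direction over a single lag class for brevity. The function names, the tolerance and the data are assumptions made for illustration.

```python
import math

def pair_azimuth(p, q):
    """Azimuth of the pair in degrees within [0, 180), measured clockwise from north (the y axis)."""
    dx = q[0] - p[0]
    dy = q[1] - p[1]
    return math.degrees(math.atan2(dx, dy)) % 180.0

def direction_class(azimuth, directions=(0.0, 45.0, 90.0, 135.0), tolerance=22.5):
    """Return the direction whose angular distance to `azimuth` is within the tolerance."""
    for d in directions:
        diff = abs(azimuth - d) % 180.0
        diff = min(diff, 180.0 - diff)  # directions are axial: 0 and 180 coincide
        if diff <= tolerance:
            return d
    return None

def directional_semivariances(points, values, max_lag):
    """Mean semivariance per direction over all pairs separated by at most `max_lag`."""
    sums, counts = {}, {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            h = math.hypot(points[i][0] - points[j][0], points[i][1] - points[j][1])
            if h > max_lag:
                continue
            d = direction_class(pair_azimuth(points[i], points[j]))
            if d is None:
                continue
            sums[d] = sums.get(d, 0.0) + 0.5 * (values[i] - values[j]) ** 2
            counts[d] = counts.get(d, 0) + 1
    return {d: sums[d] / counts[d] for d in sums}

# Hypothetical residuals that vary only from west to east.
grid = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
resid = [0.0, 1.0, 0.0, 1.0]
by_direction = directional_semivariances(grid, resid, 2.0)
```

In this toy example the north-south direction shows no variability while the others do, which is the kind of directional dependence the study looked for and did not find.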

5 Results

In this study, 14 of the 75 meteorological stations were reserved for validation. These stations were not used in the model building process.

Under SAE, the general linear mixed model was selected for IM1, SWM and IM2, while the unit level model without area (district) level variances was selected for NEM. Using the model selected for each season, rainfall at the 14 validation stations was then predicted.

For the variogram, a Gaussian model was selected only for IM1; for the remaining seasons the exponential model gave the lowest SSE. Once the trend and the variogram had been fitted for each season, rainfall at the same 14 locations was predicted using universal Kriging.

Root Mean Squared Errors (RMSEs) and correlation coefficients between observed and predicted rainfall values were then calculated on the same validation data set for SAE and Kriging. These values, together with scatter plots of observed and fitted rainfall values, are used to compare the two techniques.

Table 1 shows that for seasons IM1 and SWM the Kriging models have the smaller RMSEs between observed and fitted rainfall values. The models fitted to IM2 have large RMSEs under both techniques, which means that neither technique tracked the variation in rainfall during the Second Inter-Monsoon (IM2) well. For the Northeast Monsoon (NEM), the SAE model has the smaller RMSE between observed and fitted values. Thus, in terms of RMSE, neither technique is uniformly better; each is appropriate for different seasons.

Table 1. Root Mean Squared Errors of SAE and Kriging

Table 2 gives the correlation coefficients between observed and fitted values. For seasons IM1 and SWM the Kriging models have higher correlation coefficients (>0.9) than the SAE models. For IM2 the models fitted using both techniques show only a weak correlation between observed and fitted values, with correlation less than 0.5. For NEM both the SAE and Kriging models have high and very similar correlation coefficients, approximately 0.95 and 0.91 respectively.

Table 2. Correlation coefficient of SAE and Kriging

Scatter plots of observed and fitted rainfall values for IM1, SWM, IM2, and NEM are given in Figs. 1, 2, 3 and 4 respectively.

Fig. 1. Plot of observed and fitted values of SAE & Kriging - IM1

Fig. 2. Plot of observed and fitted values of SAE & Kriging - SWM

Fig. 3. Plot of observed and fitted values of SAE & Kriging - IM2

Fig. 4. Plot of observed and fitted values of SAE & Kriging - NEM

Figure 1 shows that in IM1 the observed rainfall values have a better linear relationship with the Kriging predictions than with the SAE predictions. At higher rainfall values the SAE model substantially underestimates rainfall at most points.

Figure 2 shows the scatter plot of observed and fitted rainfall values for SWM. The Kriging model fitted to SWM shows a good linear relationship between observed and fitted rainfall. Compared with Kriging, the small area model underestimates rainfall at higher values.

Figure 3 shows that for the Second Inter-Monsoon (IM2) the predictions of neither the SAE nor the Kriging model have a good linear relationship with the observed rainfall.

The scatter plot of observed and fitted rainfall values for the Northeast Monsoon (NEM) is given in Fig. 4. There is a good linear relationship between the observed rainfall values and the predictions of both the SAE and Kriging models. At higher rainfall values, however, the Kriging predictions are much poorer than those of SAE, since rainfall is heavily underestimated by the Kriging model. This is probably why Kriging has a large RMSE for NEM.

Extrapolating and interpolating rainfall under SAE and Kriging requires a set of prediction locations. The 4 × 4 km grid of Sri Lanka constructed by Nayanee (2012) was used for this purpose. The elevation of each grid location was obtained using the online application ‘GPS Visualizer’. For SAE, the district of each grid location was also identified. Seasonal rainfall was then extrapolated for each season using the final model selected under small area estimation. Similarly, rainfall over the region of interest, that is over Sri Lanka, was interpolated using universal Kriging. Here the emphasis is on mapping rainfall.

Figs. 5, 6, 7 and 8 show the corresponding maps of extrapolated and interpolated seasonal rainfall using SAE and Kriging for the year 2011.

Fig. 5. Map of extrapolated/interpolated rainfall using SAE & Kriging - IM1

Fig. 6. Map of extrapolated/interpolated rainfall using SAE & Kriging - SWM

Fig. 7. Map of extrapolated/interpolated rainfall using SAE & Kriging - IM2

Fig. 8. Map of extrapolated/interpolated rainfall using SAE & Kriging - NEM

According to the DMSL, during the IM1 period the entire south-western sector at the hill country receives about 250 mm of rainfall, with localized areas on the south-western slopes receiving more than that. The maps in Fig. 5 show that the predictions of both techniques reflect this pattern, but the higher rainfall in localized areas on the south-western slopes is clearly visible only in the Kriging map.

In SWM both maps show widespread rainfall in the south-western part of Sri Lanka, with no effective rains in the Dry zone (see Fig. 6). During this period the highest rainfall is received at the mid-elevations of the western slopes, while the south-western coastal belt experiences lower rainfall. The Kriging map is clearly in line with this.

During IM2, depressions and cyclones occur in the Bay of Bengal and influence the weather system. This is the period in which rainfall is most evenly distributed over the whole country. This may be why neither technique captured the variation in rainfall in IM2.

Throughout NEM the dominant wind direction is northeast. As stated by DMSL, during this period the north-eastern slopes of the hill country receive higher rainfall than the rest of the dry zone. The maps of both techniques illustrate this, and the Kriging map in particular clearly shows the higher rainfall received by the north-eastern slopes of the hill country.

6 Discussion

At the modeling stage of this study, only three explanatory variables (longitude, latitude, and elevation) were considered. If explanatory variables that are strongly correlated with seasonal rainfall can be found, the predictions of both methods should improve. The availability of good auxiliary information is therefore vital for a technique like small area estimation. Many other variables may be related to rainfall, such as temperature, distance to the sea, humidity, and slope. In this study only 14 stations were used for validation. If more stations could be allocated to validation, it would be easier to identify the behavior of the models and a more effective comparison could be carried out.

7 Conclusion

In general, this study finds that, considering both the Root Mean Squared Errors and the correlation coefficients between observed and fitted rainfall values, Kriging does a better job than small area estimation in estimating seasonal rainfall in Sri Lanka for IM1 and SWM. This may be due to the underestimation of high rainfall values by the small area models.

Neither the SAE nor the Kriging models were very successful in estimating rainfall during IM2. This may be because neither technique was able to capture a good relationship between rainfall in IM2 and the topographical variables.

For NEM both SAE and Kriging gave a high correlation between observed and fitted rainfall values, approximately 0.95 and 0.91 respectively. This indicates that, in terms of correlation, both techniques performed comparably well in that season, although SAE gave the smaller RMSE.