1 Introduction

Rainfall is the most important hydro-climate variable because of its significance for sustainable water management (Chowdhury and Beecham 2013). The accurate prediction of water availability is immensely beneficial for making management decisions and can assist with the sustainable operation of water resource systems.

Rainfall prediction is a challenging task because of the dynamic nature of climate phenomena and random fluctuations involved in the physical process. Such a prediction is particularly challenging in Australia where in a long-term analysis the rate of change in the frequency and intensity of rainfall extremes can often be greater than the rate of change for average rainfall (Garnaut 2008).

Rainfall prediction models can be classified into two categories: physical and data-driven models. Physical models use physical laws to model the processes that contribute to rainfall. Data-driven models use historical data to make predictions. The most frequently applied data-driven models are Multiple Linear Regression (MLR) (Chattopadhyay et al. 2010; Mekanik et al. 2013), Artificial Neural Networks (ANNs) (Abbot and Marohasy 2012, 2014; Chattopadhyay et al. 2010; Mekanik et al. 2013), and k-Nearest-Neighbours (k-NN). Studies have shown that data-driven models, in general, give better results than physical models (Abbot and Marohasy 2012, 2014).

Rainfall predictions are made for various time periods, including weekly, monthly and seasonal predictions. In rainfall prediction, the month is used to define the start, duration and end of the rainy season. Moreover, monthly rainfall data provide a more accurate intra-year rainfall distribution than seasonal rainfall data (Omotosho et al. 2000). Such information may significantly improve decisions on irrigation needs and their timing, on water conservation strategies for dams and on the operation of water infrastructure (Omotosho et al. 2000; Sharma et al. 2013). Accurate prediction of streamflow a month ahead is essential information that helps water resource managers plan efficiently (Wang et al. 2011).

Australia’s climate is highly variable, with high-rainfall tropical regions in the north and the driest desert region in the interior. The performance of prediction models varies between climatic zones. Therefore, a comparative assessment of models depending on meteorological variables and climatic conditions is important. Such a comparison may help a decision maker to choose appropriate models depending on input variables and climatic zones. There are several papers on the comparison of rainfall prediction models (see, for example, Abbot and Marohasy 2012; Aksoy and Dahamsheh 2009; Mekanik et al. 2013). To the best of our knowledge, the comparison of models depending on meteorological variables and climatic zones has not been studied.

The aim of this paper is a comparative assessment of the performance of various data-driven prediction models depending on meteorological variables and climatic conditions. These models include linear SVMReg, SVMReg with the RBF kernel, MLR, k-NN and ANNs with one hidden layer and without a hidden layer. Rainfall data with five meteorological variables (maximum and minimum temperatures, evaporation, vapour pressure and solar radiation) over the period 1970 to 2014 from 24 geographically diverse weather stations across Australia are used to evaluate the models. The prediction performance of the models is evaluated by comparing observed and predicted rainfall using three performance measures: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and Coefficient of Efficiency (CE).

There are several aspects that distinguish this paper from other papers on the comparison of rainfall prediction models. First, in this paper we consider the most well-known prediction models, whereas other papers consider only a few of them. Second, we use the most important meteorological variables as input variables, which have not previously been considered together as inputs to compare rainfall prediction models. Third, the performance of prediction models is compared using data from weather stations distributed over all of Australia and located in different climatic zones.

2 Models and Methods

2.1 Support Vector Machines for Regression

Consider the training data \(X = \{(x^{1}, y_{1}), (x^{2}, y_{2}), \ldots, (x^{k}, y_{k})\} \subset \mathbb{R}^{n} \times \mathbb{R}\), where \(x^{i}\) is an input vector and \(y_{i} \in \mathbb{R}\) is the corresponding output, \(i = 1, \ldots, k\), and k is the number of observations in the training set. Given an ε > 0, the aim of SVMReg is to find a function f(x) that has at most ε deviation from the targets \(y_{i}\) for all the training data. In the linear SVMReg, the regression function f is written as \(f(x) = w^{T} x + b\), where \(w \in \mathbb{R}^{n}\) is the weight vector, \(b \in \mathbb{R}\) and T stands for the transpose of a vector. The parameters w and b are estimated by solving the following minimization problem (Collobert and Bengio 2001; Müller et al. 1997; Smola and Schölkopf 2004):

$$\begin{array}{@{}rcl@{}} \left\{\begin{array}{llll} \text{minimize}\quad & R = \frac{1}{2}w^{T} w + C{\sum}_{i = 1}^{k} (\xi_{i} + \xi^{*}_{i}), \\ \text{subject to} & y_{i} - [w^{T} x^{i} + b] \leq \varepsilon+\xi_{i},\\ & [w^{T} x^{i} + b] - y_{i} \leq\varepsilon+\xi^{*}_{i},\\ & \xi_{i},\xi^{*}_{i} \geq 0. \end{array}\right. \end{array} $$
(1)

Here C > 0 is a penalty parameter which determines the trade-off between the flatness of f and the amount up to which deviations larger than ε are tolerated. \(\xi_{i}\) and \(\xi^{*}_{i}\) are slack variables introduced to deal with infeasibility. The linear SVMReg model is extended to the non-linear SVMReg model using kernel functions, for example, the radial basis function (RBF) (see Collobert and Bengio 2001; Müller et al. 1997; Smola and Schölkopf 2004, for details). There have been several applications of the SVMReg model for rainfall prediction; see, for example, Feng et al. (2015), Kisi and Cimen (2012), Lin et al. (2009), Mercer et al. (2013), and Nayak and Ghosh (2013).
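As an illustration of problem (1), the following minimal Python sketch evaluates the objective R for a given pair (w, b), taking each slack variable at its smallest feasible value, and also shows a common form of the RBF kernel. It is not the implementation used in this study (which relies on an R package); the function names, the kernel width parameter `gamma` and the toy data are assumptions made for illustration only.

```python
import math

def svr_primal_objective(w, b, xs, ys, C, eps):
    """Evaluate R in problem (1) for fixed (w, b) with the smallest feasible slacks."""
    total_slack = 0.0
    for x, y in zip(xs, ys):
        f = sum(wj * xj for wj, xj in zip(w, x)) + b  # f(x) = w^T x + b
        xi = max(0.0, y - f - eps)        # target lies above the eps-tube
        xi_star = max(0.0, f - y - eps)   # target lies below the eps-tube
        total_slack += xi + xi_star
    flatness = 0.5 * sum(wj * wj for wj in w)         # (1/2) w^T w
    return flatness + C * total_slack

def rbf_kernel(x, z, gamma):
    """RBF kernel exp(-gamma * ||x - z||^2); gamma is an assumed width parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Hypothetical toy data: one input variable, three observations.
xs = [[1.0], [2.0], [3.0]]
ys = [1.0, 2.0, 4.0]
value = svr_primal_objective(w=[1.0], b=0.0, xs=xs, ys=ys, C=10.0, eps=0.5)
```

In this toy case only the third observation falls outside the ε-tube, so only one slack variable contributes to the penalty term.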

2.2 Multiple Linear Regression

MLR is an extension of the simple linear regression method, where two or more independent variables are used to predict one dependent variable through the least squares method. A general form of the MLR model can be presented as \(y = \beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{p} x_{p} + \varepsilon\), where y is the dependent variable, \(x_{1}, \ldots, x_{p}\) are independent variables, \(\beta_{0}\) is the y-intercept, \(\beta_{1}, \ldots, \beta_{p}\) are the regression coefficients of the corresponding independent variables and ε is the noise in the data. MLR models have been used for rainfall prediction in Aksoy and Dahamsheh (2009), Ramirez et al. (2005), and Shukla et al. (2011).
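The least squares fit described above can be sketched in a few lines of Python. This is an illustrative sketch only, assuming the coefficients are obtained from the normal equations solved by Gauss-Jordan elimination; the function names `fit_mlr` and `predict_mlr` and the toy data are hypothetical and are not taken from the study.

```python
def fit_mlr(X, y):
    """Least-squares estimate [beta0, beta1, ..., betap] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # leading 1 carries the intercept beta0
    p = len(rows[0])
    # Augmented system [A^T A | A^T y], where A is the design matrix.
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        if abs(M[col][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        for r in range(p):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][p] / M[i][i] for i in range(p)]

def predict_mlr(beta, x):
    """Evaluate beta0 + beta1*x1 + ... + betap*xp for one input vector x."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

# Hypothetical noise-free data generated from y = 2 + 3*x1 - x2.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, 5.0, 1.0, 4.0, 7.0]
beta = fit_mlr(X, y)
```

Because the toy data contain no noise, the fitted coefficients recover the generating values up to rounding error.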

2.3 k-Nearest Neighbours Method

The k-NN method is a non-parametric statistical pattern recognition procedure, extended to time series prediction in Yakowitz and Karlsson (1987) (see also Al-Qahtani and Crone 2013).

Next, we briefly describe k-NN for univariate time series. Consider a finite time series u(t), t = 1,2,…, m without input variables. In the first step, the series is transformed into equal-length d-dimensional feature vectors: ud(t) = (u(t), u(t − 1),…, u(t − (d − 1))). Here d < m is a predetermined integer called the embedding dimension and t ≥ d. In the next step, either a set of m − d overlapping vectors with t = (d, d + 1,…, m − d) or a set of m/d non-overlapping vectors with t = (d,2d,…, m − d) is defined. These vectors are called d-histories.

In the third step, either the distance or the correlation between the last observed vector ud(m) = (u(m), u(m − 1),…, u(m − (d − 1))) and all d-histories is computed. Any distance function can be used here, although the Euclidean distance is predominantly used in k-NN. In the fourth step, the calculated distances are ranked and the k vectors with the lowest distance from the target feature vector are selected. These k feature vectors are then used for a prediction, often using a simple arithmetic mean with equal weights. In the case of correlations, the k vectors with the highest correlation with the target feature vector are used to form a prediction. The k-NN method for univariate time series is extended to multivariate time series by constructing such vectors for each input variable.
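The four steps above can be sketched as follows for the univariate case with overlapping d-histories and the Euclidean distance. The sketch makes one assumption that the text leaves implicit: the prediction is taken as the equally weighted mean of the values that immediately follow the k nearest d-histories. The function name `knn_forecast` and the toy series are hypothetical.

```python
import math

def knn_forecast(u, d, k):
    """One-step-ahead k-NN forecast for a univariate series u."""
    m = len(u)
    target = u[m - d:]  # last observed d-history, in chronological order
    candidates = []
    # Overlapping d-histories that end before the last observation,
    # so that each one has an observed successor value.
    for end in range(d, m):
        history = u[end - d:end]
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(history, target)))
        candidates.append((dist, u[end]))  # (distance, successor value)
    candidates.sort(key=lambda c: c[0])    # rank by distance
    nearest = candidates[:k]
    return sum(successor for _, successor in nearest) / len(nearest)

# Hypothetical periodic series with pattern 1, 2, 3.
series = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0]
forecast = knn_forecast(series, d=2, k=2)
```

The vectors are stored in chronological order rather than the reversed order used in the notation above; this does not affect the Euclidean distance as long as both vectors use the same ordering.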

2.4 Artificial Neural Networks

ANNs consist of simple neurons and links that process information in order to find relationships between inputs and outputs. ANNs take inputs, apply an activation function that combines them into a single value, and produce an output. The activation function generally consists of the combination and the transfer functions. The combination function assigns weights to inputs and combines the weighted inputs into a single value. The transfer function produces an output. The sigmoid, hyperbolic tangent and step functions are widely used as transfer functions. There exist many algorithms to train neural networks, but the back propagation algorithm and its variations are the most computationally efficient (see, for example, Haykin 2001, for more details). Applications of ANNs for rainfall prediction can be found in Abbot and Marohasy (2012), Abbot and Marohasy (2014), Aksoy and Dahamsheh (2009), Awan and Bae (2014), Karamouz et al. (2008), Lorrai and Sechi (1995), Mekanik et al. (2013), Ramirez et al. (2005), and Shukla et al. (2011).
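The split of the activation function into a combination function and a transfer function can be illustrated with a forward pass through a network with one hidden layer. This is a minimal sketch under stated assumptions: each neuron has a bias term, the hidden units use the sigmoid transfer function and the output unit is linear. These choices, the function names and the weights are illustrative and are not taken from the study; training by back propagation is not shown.

```python
import math

def sigmoid(z):
    """Sigmoid transfer function."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, transfer):
    """Combination function (weighted sum plus bias) followed by a transfer function."""
    combined = sum(w * x for w, x in zip(weights, inputs)) + bias
    return transfer(combined)

def forward_one_hidden(inputs, hidden_layer, output_unit):
    """Forward pass: sigmoid hidden units, then a linear output unit (assumed)."""
    hidden = [neuron(inputs, w, b, sigmoid) for w, b in hidden_layer]
    w_out, b_out = output_unit
    return neuron(hidden, w_out, b_out, lambda z: z)

# Hypothetical weights: two inputs, two hidden units, one output.
hidden_layer = [([0.3, -0.7], 0.0), ([1.2, 0.4], 0.0)]
output_unit = ([2.0, 4.0], 1.0)
out = forward_one_hidden([0.0, 0.0], hidden_layer, output_unit)
```

With all inputs equal to zero and zero hidden biases, each hidden unit outputs sigmoid(0) = 0.5, so the output is 1 + 2(0.5) + 4(0.5) = 4.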

3 Data

Historical monthly rainfall data was taken from the Scientific Information for Land Owners (SILO) database, available at www.longpaddock.qld.gov.au/silo/. SILO is an enhanced climate database hosted by the Queensland Government Department of Science, Information Technology and Innovation. The data is reliable and quality checked.

There are six major climatic zones in Australia: temperate, grassland, desert, tropical, subtropical and equatorial (Australian weather and seasons 2013). We selected two weather stations from the tropical zone, two from the subtropical zone, five from the desert zone, seven from the temperate zone and eight from the grassland zone. The number of stations selected in each zone reflects the area of that zone. The equatorial zone is not considered as its area is small.

We used data on six meteorological variables from 24 weather stations for the period January 1970 to December 2014 to develop prediction models. The meteorological variables used in this study are monthly rainfall, maximum temperature (TMax), minimum temperature (TMin), evaporation (Evap), vapour pressure (VP), and solar radiation (Rad). These variables were selected because they are interdependent and influence precipitation. There are 540 records for each weather station. The geographic details and climatic zones of these stations and descriptive statistics of the monthly rainfall are given in Table 1, and a location map is given in Fig. 1. The average monthly rainfall varies across these sites from 15.07 mm to 125.87 mm.

Table 1 Geographic details, climatic zones, elevation (m), and minimum, maximum and average monthly rainfall values for weather stations

Fig. 1 Location map

Correlations between rainfall and input variables for each weather station are given in Table 2. In this table, for each station we also present the number of high (H) (between − 1 and − 0.5 and between 0.5 and 1), medium (M) (between − 0.5 and − 0.3 and between 0.3 and 0.5) and low (L) (between − 0.3 and − 0.1 and between 0.1 and 0.3) correlations and the number of no correlations (N) (between − 0.1 and 0.1). In all locations there is at least a low correlation between rainfall and some input variables. In most locations, more specifically in 20 out of 24, there is at least one medium correlation. Finally, in 14 out of 24 locations there are high correlations between rainfall and some input variables. These observations justify the use of meteorological variables for rainfall prediction.
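The H/M/L/N grouping used in Table 2 can be expressed as a short classification rule. The sketch below assumes the Pearson correlation coefficient (the text does not name the coefficient used) and assigns boundary values such as 0.5 or 0.3 to the higher class, which is also an assumption since the stated ranges share their endpoints. The function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_class(r):
    """Map a correlation r to H, M, L or N using the ranges given in the text."""
    a = abs(r)
    if a >= 0.5:
        return "H"  # high: 0.5 to 1 in absolute value
    if a >= 0.3:
        return "M"  # medium: 0.3 to 0.5
    if a >= 0.1:
        return "L"  # low: 0.1 to 0.3
    return "N"      # no correlation: below 0.1

# Hypothetical correlation values for one station's five input variables.
labels = [correlation_class(r) for r in (-0.62, 0.35, -0.20, 0.05, 0.51)]
```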

Table 2 Correlations between monthly rainfall and input meteorological variables

4 Implementation and Evaluation of Models

The statistical package R, version 3.2.2, is used to implement all models. R is an environment for statistical computing and graphics, including time series analysis, clustering, classification, modeling and statistical tests (R Core Team 2013).

We use the R package nnet for ANNs (Venables and Ripley 2002), kknn for k-NN (Schliep and Hechenbichler 2016) and e1071 for SVMReg (Meyer et al. 2015). In implementing k-NN, the most important step is the selection of the number of neighbours. Different values were evaluated ranging from 1 to 12 and the model with the minimum RMSE value was selected. We implement ANNs both without hidden layer and with one hidden layer, linear SVMReg and SVMReg with the RBF kernel function.
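The selection of the number of neighbours can be sketched as a simple search over k = 1,…, 12 that keeps the value with the minimum RMSE. The sketch below is written in Python for illustration and uses a plain, unweighted k-NN regression rather than the R package named above; the function names, the evaluation set and the toy data are assumptions, since the text does not state on which data the RMSE for this selection was computed.

```python
import math

def knn_predict(train_X, train_y, x, k):
    """Unweighted k-NN regression: mean target of the k nearest training points."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return sum(train_y[i] for i in order[:k]) / k

def rmse(obs, pred):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def select_k(train_X, train_y, eval_X, eval_y, k_values=range(1, 13)):
    """Return (best_k, best_rmse) over the candidate numbers of neighbours."""
    best_k, best_err = None, float("inf")
    for k in k_values:
        if k > len(train_X):
            break  # cannot use more neighbours than training points
        preds = [knn_predict(train_X, train_y, x, k) for x in eval_X]
        err = rmse(eval_y, preds)
        if err < best_err:  # strict inequality keeps the smallest k on ties
            best_k, best_err = k, err
    return best_k, best_err

# Hypothetical toy data with a single input variable.
train_X = [[0.0], [1.0], [2.0], [3.0]]
train_y = [0.0, 1.0, 2.0, 3.0]
best_k, best_err = select_k(train_X, train_y, [[0.1], [2.9]], [0.0, 3.0])
```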

Prediction models were developed using all combinations of input variables without repetition. There are a total of fifteen such combinations. The best combination for each model is then selected according to the prediction performance measures described below. All models were developed for each weather station using training data sets consisting of 360 records and evaluated using test data sets consisting of 180 records.

The prediction performance of the models was evaluated by comparing observed and predicted rainfall using three measures of prediction accuracy calculated from the test sets: the Root Mean Squared Error (RMSE), the Mean Absolute Error (MAE) and the Coefficient of Efficiency (CE). It is well known that MAE is less sensitive to outliers than RMSE. Small values of RMSE and MAE indicate small deviations of the predictions from actual observations.

CE, proposed in Nash and Sutcliffe (1970), is a normalized statistic that determines the relative magnitude of the residual variance compared with the data variance. CE ranges from −∞ to 1. An efficiency CE = 1 means a perfect prediction. An efficiency of 0 indicates that the model predictions are as accurate as the mean of the observed data, and an efficiency −∞ < CE < 0 occurs when the observed mean is a better predictor than the model.
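The three performance measures can be written down directly from their standard definitions. The following Python sketch is illustrative only; the function names and the toy values are assumptions, and CE is computed in the usual Nash and Sutcliffe form, one minus the ratio of the residual sum of squares to the sum of squared deviations of the observations from their mean.

```python
import math

def rmse(obs, pred):
    """Root Mean Squared Error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mae(obs, pred):
    """Mean Absolute Error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def ce(obs, pred):
    """Coefficient of Efficiency: 1 - residual sum of squares / data sum of squares."""
    mean_obs = sum(obs) / len(obs)
    residual = sum((o - p) ** 2 for o, p in zip(obs, pred))
    total = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - residual / total

# Hypothetical observed and predicted monthly rainfall values (mm).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
scores = (rmse(observed, predicted), mae(observed, predicted), ce(observed, predicted))
```

Predicting the observed mean for every month gives CE = 0, and a perfect prediction gives CE = 1, in line with the interpretation above.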

5 Results and Discussion

All models are trained using data from Jan 1970 to Dec 1999 and tested using data from Jan 2000 to Dec 2014 with each combination of input variables in all 24 locations. Negative predicted values were adjusted to zero rainfall before the calculation of performance measures. The best combination of input variables for each model was determined using test data and RMSE and MAE as primary performance measures.
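The chronological split and the adjustment of negative predictions can be sketched as follows. This is an illustrative Python sketch, not the code used in the study; the function names are hypothetical, and the records are represented only by their (year, month) labels.

```python
def chronological_split(records, n_train=360):
    """Split time-ordered records into the first n_train for training and the rest for testing."""
    return records[:n_train], records[n_train:]

def clip_negative(predictions):
    """Replace negative predicted rainfall by zero before computing performance measures."""
    return [max(0.0, p) for p in predictions]

# Monthly labels from January 1970 to December 2014 (540 records).
months = [(year, month) for year in range(1970, 2015) for month in range(1, 13)]
train, test = chronological_split(months)

# Hypothetical raw model outputs (mm), including negative values.
adjusted = clip_negative([12.4, -3.1, 0.0, 55.2, -0.5])
```

The split keeps the time order intact, so the 360 training records end in December 1999 and the 180 test records start in January 2000.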

Tables 3 and 4 summarize the prediction performance of the models with the best combinations of input variables. In the tables, the best results among all models are highlighted in bold.

Table 3 Prediction performance of models in the temperate and grassland zones
Table 4 Prediction performance of models for monthly rainfall prediction in desert, tropical and subtropical zones

Results for the temperate zone are presented in Table 3 and illustrated in Fig. 2. These results show that the SVMReg(RBF) and ANN(1) models outperform the other models. According to all performance measures, SVMReg(RBF) provides the best predictions at four out of seven stations, all of which are coastal stations. ANN(1) gives the best results at Koppio and Dookie. At Moss Vale, these two models demonstrate similar performance.

Fig. 2 Graphical display of the performance of models in the temperate zone

The best predictions with SVMReg(RBF) for Port Elliot and Cape Otway had inputs TMax, TMin, Evap and Rad; for Peppermint Grove TMax, TMin and VP; for Moss Vale TMax, TMin, Evap and VP; while for Orbost the best combination was the full set of five variables. The most accurate predictions with ANN(1) model for Koppio had TMax and TMin as inputs; for Dookie Rad was the only input variable; while for Moss Vale the best combination was the full set of five variables.

Fig. 3 Graphical display of the performance of models in grassland zone

Results presented in Fig. 2 show that, according to RMSE and MAE, all models provide their best predictions at Port Elliot and their worst predictions at Moss Vale. CE indicates that all models performed well at Port Elliot and Koppio, but worse at Orbost and Moss Vale. Models failed to predict extreme rainfall values at all locations.

Table 3 presents monthly rainfall prediction results and Fig. 3 illustrates the performance of models in the grassland zone. For this zone we also include a visual comparison of observed and predicted rainfall over the test period which is given in Fig. 6.

Fig. 4 Graphical display of the performance of models in desert zone

These results show that the SVMReg(RBF) and ANN(1) models outperform the other models at most locations. However, the performance of the other models is not always significantly different from that of SVMReg(RBF) and ANN(1). At least one performance measure indicates that ANN(0) is the best at three locations, MLR at two, and SVMReg(linear) and k-NN at one location each.

Results presented in Fig. 3 show that the performance measures RMSE and MAE give different results from CE. RMSE and MAE indicate that, with respect to some tolerance, all models have the lowest prediction error at Dowerin and the highest prediction error at Newry. According to CE, all models except ANN(1) provide predictions with the lowest error at Richmond and the highest error at Annuello. The ANN(1) model predictions have the lowest error at Richmond and Alexandria and the highest error at Blinman. Figure 6 demonstrates that all models follow the series patterns at Newry and Alexandria; however, this is not true for Warren. Models failed to predict extreme rainfall values at all three locations.

Fig. 5 Graphical display of the performance of models in tropical and subtropical zones

Fig. 6 Observed rainfall (grey line) vs model predictions (dotted line) for grassland zone

Fig. 7 Scatter plot of the performance measures for all 24 locations

The SVMReg(RBF) model produced best predictions with inputs TMax, TMin and VP at Newry and Ningaloo, with inputs TMax, TMin, VP, Rad at Warren, Richmond, Annuello and Dowerin and with the full set of five inputs at Alexandria and Blinman.

Table 4 presents results for monthly rainfall predictions in the desert zone and Fig. 4 illustrates the performance of the models. One can see that, overall, in the desert zone the ANN(0) and SVMReg(RBF) models produce better predictions than the other models. SVMReg(linear) provides the best prediction for Marree and MLR gives the best prediction for the Wilcannia weather station. The subset of best input variables strongly depends on location. For example, the SVMReg(RBF) model gave the best predictions for Henbury with inputs Evap, VP and Rad; for Boulia with TMax, TMin, VP, and Rad; and for Marree with the full set of input variables.

The performance measures RMSE and MAE imply that all models give predictions with the lowest error at Marree and with the highest error at Henbury and Wiluna. The performance measure CE is not in full agreement with RMSE and MAE. According to it the best performance of models is at Boulia and the worst performance is at Marree and Wiluna.

Monthly rainfall prediction results and an illustration of the performance of the models for the tropical and subtropical zones are given in Table 4 and Fig. 5, respectively.

Results show that the k-NN model gives the best predictions at Palmerville and Fairymead. At Katherine, the k-NN model’s results are similar to the best results obtained by SVMReg(RBF). According to the performance measures RMSE and CE, ANN(1) gives the best predictions at Yamba. Again, the subset of input variables providing the best performance of the models depends on location. For example, the SVMReg(RBF) model gives the best predictions at Katherine with inputs TMax, TMin, Evap and VP; at Palmerville with inputs TMax, TMin and Rad; and at Yamba with the full set of input variables.

Figure 5 shows that all three performance measures agree in determining the location with the best prediction results: all of them indicate Katherine. However, there is some inconsistency in determining the location with the worst prediction results. RMSE identifies Fairymead, MAE identifies Yamba and CE identifies Fairymead (except for the k-NN model) as the location with the worst prediction results. k-NN gives the worst prediction at Yamba (Fig. 6).

The scatter plot of the three performance measures for all 24 locations and models is given in Fig. 7. This figure shows that RMSE and MAE give similar results on the quality of prediction in all climatic zones, while CE does not always follow their patterns. There is some disagreement between RMSE and MAE on one side and CE on the other side on the quality of predictions.

6 Conclusions

This paper reports results on a comparison of monthly rainfall prediction models using meteorological input variables. Data from 24 weather stations distributed over five climatic zones in Australia are used for this purpose. The data set for each station consists of 540 records (from January 1970 to December 2014) and six meteorological variables: one output variable (rainfall) and five input variables (maximum and minimum temperatures, vapour pressure, evaporation and solar radiation). The use of different climatic zones allowed us to study the performance of the prediction models under different climate and hydrological regimes.

Six prediction models were selected for comparison: SVMReg(linear), SVMReg with the RBF kernel function (SVMReg(RBF)), ANN without hidden layer (ANN(0)), ANN with one hidden layer (ANN(1)), k-NN and Multiple Linear Regression (MLR). All the selected models were developed for each weather station using training sets and evaluated using test sets. The prediction performance of the models was evaluated by comparing observed and predicted rainfall using the performance measures RMSE, MAE and CE.

Based on obtained results we can draw the following conclusions:

  1. Among all six models, SVMReg(RBF) and ANN(1) are the most accurate for rainfall prediction. Although the k-NN and ANN(0) models give the best predictions for some locations, they are not as accurate as SVMReg(RBF) and ANN(1) for many other locations. The two linear models, SVMReg(linear) and MLR, are in general not accurate models for rainfall prediction. The SVMReg(RBF) and ANN(1) models are especially accurate in the temperate, grassland and desert zones.

  2. In the tropical and subtropical zones, the k-NN model is the most suitable model for monthly rainfall prediction: in these zones it obtained the best predictions or predictions close to the best. In these zones, prediction errors of all models are higher than those for other climatic zones because of higher rainfall variability and extreme values.

  3. All six models at all locations, with very few exceptions, fail to predict extreme rainfall.

  4. The prediction performance of all six models varies considerably both within and across climatic zones. In the tropical and subtropical zones, predictions have a large deviation from the actual rainfall observations.

  5. The performance measures RMSE and MAE give approximately similar results, while in some locations CE provides results opposite to those given by RMSE and MAE. This is very clear from the scatter plot of the performance measures for all 24 locations given in Fig. 7. One reason for such behavior of the performance measures is extreme rainfall values. When there is a large number of extreme rainfall values, the RMSE measure is better than the MAE measure because it takes the effect of these values into account.

  6. Results show that both RMSE and MAE should be considered as primary measures to identify the subset of best input variables, that is, the subset of input variables which provides the best prediction. This is because these measures identify almost the same subset of input variables across all weather stations for a given climatic zone, whereas for the CE measure this subset varies significantly even within a climatic zone.

  7. We use results from Abbot and Marohasy (2012, 2014) to compare the performance of their ANN models with that of the ANN model presented in this paper. These two papers and the current paper use data from the same weather stations in Queensland, Australia. However, the sets of input variables are not the same. The comparison is based on the RMSE measure and it shows that there is no significant difference in the performance of the ANN models presented in these papers. However, this comparison cannot be considered conclusive as the data used are not the same. There are no results with other models on similar data sets and therefore it is not possible to compare their performance.

Rainfall is a very complex climate variable. It is controlled by physical processes involving random fluctuations. The relationship between rainfall and climatic or meteorological variables is highly nonlinear. Results confirm that data-driven modelling presents a powerful approach for rainfall prediction. Models which are able to capture nonlinearities are the most suitable for such predictions. Our results on the SVMReg(RBF) and ANN(1) models confirm this conclusion. However, results from this paper also show that mainstream models are not always successful for rainfall prediction and that there is a need for better models. Such models should be able, in particular, to predict extreme rainfall events, which are a real challenge for existing models.