1 Introduction

Rainfall is one of the most important elements of the hydrological cycle. Because of the complexity of the precipitation process (from convection to cloud formation to rainfall), accurately predicting the occurrence of rainfall episodes over a given location is difficult without appropriate tools. The regional and seasonal dependence of rainfall adds further to the complexity of the problem. In the Indian context, the Asian summer monsoon season accounts for more than 80% of the total rainfall over the country (e.g., Ding and Sikka 2006). Although the seasonal pattern of rainfall is fairly consistent, researchers still find it difficult to provide a credible tool for predicting rainfall episodes of different categories (e.g., French et al. 1972). Heavy rainfall can lead to major floods, as seen over Mumbai in July 2005, over Chennai and Bengaluru in October 2005, and more recently over Kerala in August 2018 (Goswami et al. 2006; Rajeevan et al. 2008). The occurrence of such extreme rainfall events is increasing regionally as well as globally, and the events show no clear pattern because they vary widely across geographical regions (Parthasarthy et al. 1995; Dash et al. 2009). In recent years, some heavy rainfall events have led to severe floods in many parts of the country and have caused large-scale loss of life and property.

Since the 1950s, numerical weather forecasting has advanced remarkably. A wide range of rainfall forecasts is produced with Numerical Weather Prediction (NWP) at regional and global scales. Synoptic weather forecasting is generally dominated by computer-based NWP models, which are based on the governing equations of the atmosphere and the parameterization of physical processes. Over the last five to six decades, these models have served as the backbone of weather forecasting, and their credibility has increased steadily since their inception (e.g., Bauer et al. 2015; Subrahamanyam et al. 2019).

Conventionally, NWP models are the primary tools for predicting irregular heavy rainfall events. Although NWP models provide fairly reliable predictions of extreme weather events, this traditional approach occasionally yields inconsistent results for heavy rainfall events. In such cases, Machine Learning (ML), a branch of Artificial Intelligence (AI), offers a promising alternative approach for the prediction of rainfall events. Recently, there has been growing interest in the use of AI in weather and climate forecasting, and ML algorithms have gained wide acceptance in several research domains (Schneider et al. 2017; O’Gorman and Dwyer 2018; Reichstein et al. 2019). In ML, computers are trained to learn a probable relationship between different variables from a large training dataset; such models are reported to be easier to evaluate and to perform well in comparison with NWP models (Mosavi et al. 2012; Mekanik et al. 2013).

Several studies have addressed the prediction of rainfall and other parameters at regional as well as global scales (Schneider et al. 2017; Scher and Messori 2018). Goswami and Srividya (1996) designed a neural network to predict rainfall over the Indian region. Venkatesan et al. (1997) used an error back-propagation approach for the prediction of Indian summer monsoon rainfall. Sahai et al. (2000) also used Artificial Neural Network (ANN) methods for the prediction of all-India summer monsoon rainfall. Chaudhuri and Chattopadhyay (2005) developed a feed-forward neural network for the estimation of maximum surface temperature and relative humidity. Chattopadhyay (2007) carried out a comparative study using different ANNs for the prediction of rainfall. Vathsala and Koolagudi (2017) predicted the peninsular Indian summer monsoon rainfall using a multi-layered perceptron (MLP) model. Karmakar et al. (2009) developed an ANN model for estimation of the rainfall over the Indian region. Wang and Sheng (2010) proposed a Generalized Regression Neural Network (GRNN) for forecasting the yearly rainfall over China. Hung et al. (2009) forecasted the hourly rainfall in Bangkok using a Generalized Feed Forward Neural Network. Nagahamulla et al. (2011) used an ANN to predict the seasonal monsoon rainfall in Sri Lanka, and Kashiwao et al. (2017) developed a model for hourly rainfall prediction in Japan using an MLP and a Radial Basis Function Neural Network (RBFNN). Yan et al. (2020) developed a deep learning approach for precipitation forecasting.

Although AI-based meteorological prediction has its advantages, it comes with its own set of challenges and limitations. Dueben and Bauer (2018) documented the challenges and the various design choices involved in ANN-based forecasting. Most of the studies mentioned above were limited to the prediction of hourly, daily, monthly or average rainfall over a region using different AI- and ANN-based algorithms. The performance of these methods was found to depend largely on the training method and the historical datasets used. Whenever the volume of training data was insufficient and did not cover a large dynamic range of values, the ANN and ML algorithms failed to provide promising results. Despite good progress in AI and ML techniques, there are very few studies on the prediction of heavy rainfall events over a given location.

The aforementioned studies addressed the prediction of seasonal, monthly and daily rainfall, but we found no study in the literature on the prediction of heavy rainfall days, which affect the livelihood of many people and cause havoc in the country. Therefore, in the present study, we propose an AI-based approach for the prediction of heavy and light rainfall events for Sriharikota, a coastal station on the eastern coastline of the Indian peninsula. Currently, there is a whole gamut of AI algorithms to choose from; however, this work focuses only on the efficacy of Gaussian Process Regression (GPR), which is a nonparametric, Bayesian approach to regression. We have used rainfall data for a period of 80 years, from 1901 to 1980, to train the AI model, and subsequently allowed the model to predict light and heavy rainy days from 1981 to 2016. This article is organized into four sections. Section 2 provides details about the model, data and method of analysis. Section 3 describes the important results in the context of the AI/ML approach. Section 4 summarizes the results and their implications for future prediction techniques.

2 Data pre-processing and description of GPR algorithm

In the present work, we focus on an AI/ML technique for predicting light to heavy rainy days at Sriharikota (13.82°N, 80.22°E), a coastal station on the eastern coastline of the Indian peninsula (figure 1). The climate of Sriharikota is mainly humid, and the station experiences both the southwest and the northeast Indian monsoons. The climatological average annual rainfall over Sriharikota is about 1311.9 mm, and more than 50% of it occurs between October and December. About 14% of the average annual rainy days are moderate to heavy rain event days. The study of heavy rainfall rates is essential for launch missions such as the Polar Satellite Launch Vehicle (PSLV) and the Geo-Synchronous Satellite Launch Vehicle (GSLV), and for pre-launch day-to-day activities at Satish Dhawan Space Centre (SDSC), Sriharikota. The main database is the high-resolution, spatially gridded (0.25° × 0.25°) daily rainfall data of the India Meteorological Department (IMD), extracted for Sriharikota for a period of 116 years from 1901 to 2016 (Rajeevan et al. 2006). The IMD data provide the 24-h accumulated rainfall amount on a daily basis. We have classified the rainfall data into nine distinct, standard categories from ‘No Rain’ to ‘Exceptionally very high rain’ (IMD). Table 1 lists the standard rainfall thresholds used by IMD to assign rainy events to the different classes. Here, we have chosen the ‘heavy rain’ and ‘light rain’ cases to demonstrate the potential of the ML-based model.

Figure 1 Box denotes the selected location.

Table 1 Classification of various categories of rainfall. The percentages in the table can be converted to the equivalent number of days per year by multiplying the fraction (percentage/100) by 365, which shows how many days of a particular rain category can be expected in a year.
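The counting of rainy days per category and the percentage-to-days conversion mentioned in the caption can be sketched as follows. This is a minimal illustration only: the threshold values `LIGHT_RANGE` and `HEAVY_RANGE`, the function names and the sample records are assumptions made for the example and should be replaced by the thresholds of Table 1 and the actual IMD series.

```python
from collections import Counter

# Hypothetical thresholds in mm per day (assumed for illustration only);
# the values actually used in the study are those listed in Table 1.
LIGHT_RANGE = (2.5, 7.5)
HEAVY_RANGE = (64.5, 124.4)


def count_days_per_year(records, lower, upper):
    """Count, for each year, the days whose 24-h rainfall lies in [lower, upper].

    records: iterable of (year, rainfall_mm) pairs, one pair per day.
    Years without any matching day do not appear in the result.
    """
    counts = Counter()
    for year, rain_mm in records:
        if lower <= rain_mm <= upper:
            counts[year] += 1
    return dict(counts)


def percent_to_days(percent, days_in_year=365):
    """Convert a percentage of days in a year to an equivalent number of days."""
    return percent / 100.0 * days_in_year


# Made-up sample records: (year, 24-h accumulated rainfall in mm).
sample = [
    (1901, 0.0), (1901, 3.0), (1901, 70.2),
    (1902, 5.1), (1902, 6.0), (1902, 130.0),
]
light_days = count_days_per_year(sample, *LIGHT_RANGE)
heavy_days = count_days_per_year(sample, *HEAVY_RANGE)
```

The yearly counts produced this way are the kind of series that is later passed to the regression model; a category frequency of 10%, for example, corresponds to `percent_to_days(10)`, i.e., 36.5 days per year.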

2.1 Gaussian processes regression (GPR)

In the present study, we have used a GPR model for the prediction of light and heavy rainy days. Gaussian process models are supervised learning models that can be used for both classification and regression problems: in classification the outputs are discrete class labels, whereas in regression they are predictions of continuous quantities. A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution, and it is fully described by its mean and covariance functions. The covariance function, also called the kernel, plays a crucial role in determining the shape of the prior and posterior of the GPR. Rasmussen and Williams (2006) describe Gaussian processes and their covariance functions in detail. Kernel functions in the GPR model characterize the correlation among different input points in the process. Various kernel functions can be used for the GP model, such as the radial basis function (RBF), Matérn, rational quadratic, exp-sine-squared and dot-product kernels. The inverse of the length-scale determines the relevance of the input data. In the present study, we have used the RBF kernel as the covariance function of the GPR model and a White kernel to represent the noise level. Liu et al. (2001) also used an adaptive RBF neural network for estimation of rainfall. Equations (1) and (2) represent the general form of the GPR model (Rasmussen and Williams 2006), and figure 2 depicts the flowchart of the GPR processes performed in the present study.

Figure 2 Flowchart of the implementation of the GPR model in the present study.

Let us assume a training dataset \(\mathcal{D} = \{(x_{i}, y_{i}) \mid i = 1, \ldots, n\}\), where \(y_{i}\) is the daily accumulated rainfall corresponding to a given day \(x_{i}\). For this finite dataset, the function values \(f(x_{1}), f(x_{2}), \ldots, f(x_{n})\) are random variables that jointly obey a Gaussian distribution. A general GP is specified by its mean function m(x) together with its covariance function k(x, x*). The mean function is given in equation (1) and the observation model in equation (2):

$$ m(x) = E\left[ f(x) \right] $$
(1)
$$ Y = F \times X + \varepsilon $$
(2)

where X is the input vector, F is the weight of the model and Y is the observed target. The error associated with the GP is given by ε, with \(\varepsilon \sim \mathcal{N}(0, \sigma_{n}^{2})\), where \(\sigma_{n}^{2}\) stands for the variance of the noise.

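The GP predictive distribution at a test input x* is Gaussian and is given by

$$ m\left( x^{*} \right) \mid X, Y, x^{*} \sim \mathcal{N}\left( \mu \left( m\left( x^{*} \right) \right), \sigma_{n}^{2} \left( m\left( x^{*} \right) \right) \right) $$
(3)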
$$ \mu \left( m\left( x^{*} \right) \right) = k^{*{\text{T}}} \left( K + \sigma_{n}^{2} I \right)^{-1} Y $$
(4)
$$ \sigma_{n}^{2} \left( m\left( x^{*} \right) \right) = k^{**} - k^{*{\text{T}}} \left( K + \sigma_{n}^{2} I \right)^{-1} k^{*} $$
(5)

where K is the kernel matrix constructed as \(K_{ij} = k(x_{i}, x_{j})\), \(k^{**} = k(x^{*}, x^{*})\), \(k^{*} = k(X, x^{*})\), and k is the squared exponential covariance function. Equation (6) gives the form of the squared exponential function, which has been successfully applied in different applications.

$$ k_{{{\text{SE}}}} \left( {x_{i} ,x_{j} } \right) = \sigma_{{{f}}}^{2} \exp \left[ {{-}\left( {1/2} \right) \, \left( {x_{i} {-}x_{j} } \right)^{\text T} M^{{{-}2}} \left( {x_{i} {-}x_{j} } \right)} \right] $$
(6)

where \(\theta = \{M, \sigma_{f}^{2}, \sigma_{n}^{2}\}\) is the set of hyperparameters and \(\sigma_{f}^{2}\) defines the variance of the kernel function. The optimal hyperparameters \(\theta\) are obtained by maximum likelihood estimation, i.e., by minimizing the negative log marginal likelihood \(L(\theta)\), expressed as:

$$ L(\theta ) = - \log \left( {P\left( {Y|X,\theta } \right)} \right). $$
(7)
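As an illustration of how equations (4)–(7) fit together, the following is a minimal NumPy sketch for one-dimensional inputs and a zero-mean GP. The function names, the scalar `length_scale` standing in for M, and the toy data are assumptions made for this example; it is not the implementation used in the study. For Gaussian noise, the negative log marginal likelihood of equation (7) takes the standard closed form \(\tfrac{1}{2} Y^{\text{T}} (K + \sigma_{n}^{2} I)^{-1} Y + \tfrac{1}{2} \log |K + \sigma_{n}^{2} I| + \tfrac{n}{2} \log 2\pi\), which is what the last function evaluates.

```python
import numpy as np


def se_kernel(a, b, length_scale, sigma_f):
    """Squared exponential kernel of equation (6) for 1-D inputs.

    For scalar inputs, M reduces to a single length-scale.
    """
    diff = a[:, None] - b[None, :]
    return sigma_f**2 * np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_predict(x_train, y_train, x_test, length_scale, sigma_f, sigma_n):
    """Predictive mean and variance, following equations (4) and (5)."""
    K = se_kernel(x_train, x_train, length_scale, sigma_f)
    k_star = se_kernel(x_train, x_test, length_scale, sigma_f)
    k_ss = se_kernel(x_test, x_test, length_scale, sigma_f)
    # Cholesky factor of (K + sigma_n^2 I); avoids forming an explicit inverse.
    L = np.linalg.cholesky(K + sigma_n**2 * np.eye(len(x_train)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = k_star.T @ alpha
    v = np.linalg.solve(L, k_star)
    var = np.diag(k_ss) - np.sum(v**2, axis=0)
    return mean, var


def neg_log_marginal_likelihood(x_train, y_train, length_scale, sigma_f, sigma_n):
    """L(theta) of equation (7) for a zero-mean GP with Gaussian noise."""
    n = len(x_train)
    K = se_kernel(x_train, x_train, length_scale, sigma_f)
    L = np.linalg.cholesky(K + sigma_n**2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    # sum(log(diag(L))) equals half the log-determinant of (K + sigma_n^2 I).
    return 0.5 * y_train @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * n * np.log(2 * np.pi)


# Made-up toy data, for illustration only.
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.sin(x_train)
x_test = np.array([1.5, 2.5])
mean, var = gp_predict(x_train, y_train, x_test, length_scale=1.0, sigma_f=1.0, sigma_n=0.1)
nll = neg_log_marginal_likelihood(x_train, y_train, 1.0, 1.0, 0.1)
```

In practice, \(L(\theta)\) would be passed to a numerical optimizer to select the hyperparameters; the sketch only evaluates it at fixed values.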

To assess the prediction accuracy of the GPR model, the time-series of heavy rainfall data was divided into two sets, training (80%) and testing (20%). In this experiment, the yearly numbers of heavy and light rainfall days observed over 116 years (1901–2016), along with their corresponding time stamps, are given as input to the GPR model. After appropriate training and cross-validation, the GPR model was able to predict heavy and light rainfall days, as discussed in section 3.
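A chronological 80/20 split and a GPR fit of this kind might look as follows in scikit-learn. This is a sketch under stated assumptions: the text does not name the software used, the yearly counts here are synthetic stand-ins generated at random, and the initial kernel settings (`length_scale=10.0`, `noise_level=1.0`), `normalize_y=True` and the number of optimizer restarts are illustrative choices rather than values from the study.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Synthetic stand-in for the yearly counts of heavy rainfall days (not real data).
rng = np.random.default_rng(0)
years = np.arange(1901, 2017)
days = rng.poisson(lam=4.0, size=years.size).astype(float)

# Chronological 80/20 split: the earliest years train, the latest years test.
n_train = int(0.8 * years.size)
X = years.reshape(-1, 1).astype(float)
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = days[:n_train], days[n_train:]

# The RBF term models the signal; the WhiteKernel term models the noise level.
kernel = 1.0 * RBF(length_scale=10.0) + WhiteKernel(noise_level=1.0)
gpr = GaussianProcessRegressor(
    kernel=kernel,
    normalize_y=True,          # assumed preprocessing choice
    n_restarts_optimizer=2,    # restarts of the marginal-likelihood optimizer
    random_state=0,
)
gpr.fit(X_train, y_train)

# Predictive mean and standard deviation for the held-out years.
y_pred, y_std = gpr.predict(X_test, return_std=True)
```

The fitted hyperparameters can be inspected through `gpr.kernel_`, and `y_std` gives a per-year uncertainty that point-prediction models do not provide.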

3 Experiment results and discussion

Figure 3(a and c) shows the yearly occurrence of heavy and light rainy days, and figure 3(b and d) shows the corresponding histograms, over Sriharikota during 1901–2016. The figure shows an increasing trend in heavy rainfall days (0.342 per decade with 95% confidence) and a decreasing trend in light rainy days (−0.602 per decade with 95% confidence). These observations corroborate existing results: the frequency of occurrence of heavy rain events has increased significantly over India during 1950–2015 (Goswami et al. 2006; Kulkarni et al. 2020), and a decreasing trend has been observed in moderate rain events (Goswami et al. 2006). The rise in heavy rainfall events is seen over several geographical locations and sometimes it exceeds the seasonal mean rainfall (Westra et al. 2013). The histograms of heavy and light rainy days in figure 3(b and d) are approximately Gaussian, with skewness of 0.47 and 0.19, respectively.

Figure 3 Time-series of yearly occurrence of heavy and light rainy days during 1901–2016 (a and c) and histograms of heavy and light rainy days (b and d).
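A per-decade trend and a skewness value of the kind quoted for figure 3 can be computed as sketched below. The text does not state which estimators were used, so an ordinary least-squares line and the moment-based skewness are assumed here; the function names and the synthetic series are illustrative only.

```python
import numpy as np


def trend_per_decade(years, counts):
    """Slope of an ordinary least-squares line, expressed per decade."""
    slope, _intercept = np.polyfit(years, counts, deg=1)
    return 10.0 * slope


def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (biased estimator)."""
    values = np.asarray(values, dtype=float)
    centred = values - values.mean()
    m2 = np.mean(centred**2)
    m3 = np.mean(centred**3)
    return m3 / m2**1.5


# Synthetic, exactly linear series with a slope of 0.03 days per year.
years = np.arange(1901, 2017)
counts = 2.0 + 0.03 * (years - 1901)
decadal_trend = trend_per_decade(years, counts)
```

A skewness close to zero, as for a symmetric sample, is consistent with an approximately Gaussian histogram, while positive values indicate a longer right tail.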

After the GPR model is trained, its results can be displayed as a response plot, i.e., the predicted response vs. record number. Figure 4 shows the prediction of heavy rainfall days using the GPR model. The blue curve shows the data (1901–1988) used to train the model, and the green and red lines indicate the actual and predicted heavy rainfall days, respectively, from 1989 to 2016. Overall, the predictions are good and reproduce the variability of heavy rainfall days; interestingly, the model also captures the peak years of heavy rainfall days. Figure 5 is the same as figure 4 but for light rainfall days, and it also shows good agreement between predictions and observations. We also used other models, namely k-nearest neighbour (KNN), random forest and decision tree, to predict the heavy rainfall days (figure 6). All of these models were able to reproduce the variability in heavy rainfall days. We further evaluated the performance of the GPR model using validation parameters, namely the Root Mean Square Error (0.161), Mean Square Error (0.026) and Mean Absolute Error (0.126) of the predicted results. The performance of the GPR model is compared with that of the other models using the same statistics, and the comprehensive comparison is given in table 2.

Figure 4 The prediction of heavy rainfall days using the GPR model.

Figure 5 Same as figure 4, but for light rainfall days.

Figure 6 The prediction of heavy rainfall days using the KNN, random forest and decision tree models.
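A comparison of this kind can be sketched with scikit-learn regressors as below. The input features, hyperparameters and software used for these models are not given in the text, so everything here is an assumption: the year is used as the only feature, the model settings are library-style defaults, and the yearly counts are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the yearly counts of heavy rainfall days (not real data).
rng = np.random.default_rng(1)
years = np.arange(1901, 2017, dtype=float).reshape(-1, 1)
days = rng.poisson(lam=4.0, size=len(years)).astype(float)

# Chronological 80/20 split, as for the GPR model.
n_train = int(0.8 * len(years))
X_train, X_test = years[:n_train], years[n_train:]
y_train, y_test = days[:n_train], days[n_train:]

# Hyperparameters below are illustrative assumptions, not values from the study.
models = {
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "Random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "Decision tree": DecisionTreeRegressor(max_depth=4, random_state=0),
}

# Fit each model on the training years and score it on the held-out years.
rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse[name] = float(np.sqrt(np.mean((pred - y_test) ** 2)))
```

With the year as the sole feature, neighbour- and tree-based models return values drawn from the last part of the training period for all later years, so any real comparison would depend strongly on the features supplied to them.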

Table 2 Comparison of accuracy of various models.
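The three error statistics used for the comparison follow directly from the residuals between predicted and observed values. A minimal sketch, with a hypothetical function name and made-up numbers, is:

```python
import numpy as np


def regression_errors(observed, predicted):
    """Return MSE, RMSE and MAE of predictions against observations."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residual = predicted - observed
    mse = float(np.mean(residual**2))
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),          # RMSE is the square root of MSE
        "MAE": float(np.mean(np.abs(residual))),
    }


# Made-up values, for illustration only.
errors = regression_errors(observed=[3, 5, 2, 4], predicted=[2, 5, 4, 4])
```

Because RMSE is the square root of MSE, the two reported values can be cross-checked: 0.161 squared is about 0.026, which matches the reported Mean Square Error.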

We extended the model to the prediction of daily average rainfall over the Indian region (60°–100°E; 0°–35°N) using 36 years (1979–2014) of data from the Global Precipitation Climatology Project (GPCP). The GPCP daily rainfall product integrates various global satellite precipitation datasets over land and oceans with a gauge analysis over land (Huffman et al. 1997). These data have been widely used for various regional and global precipitation studies. The GPR model is trained on 1979–2005 (27 years) at each grid point (1° × 1°) over the entire Indian region and validated for 2006–2014. Figure 7 shows the GPR-predicted rainfall along with the GPCP rainfall for July 2006. Overall, the GPR prediction shows good agreement with the GPCP observations (R² = 0.84). During July, which is the peak month of the SW-monsoon season, a large amount of rainfall is observed over the head Bay of Bengal and the Western Ghats regions, as seen in figure 7(a). The GPR model was able to reproduce the locations of heavy rainfall; however, the magnitudes of the predicted rainfall were underestimated relative to the GPCP observations, and the high rainfall over Gujarat and Maharashtra was not captured. Thus, the present study highlights the potential use of GPR in the prediction of weather episodes such as heavy and light rainy days. However, limitations remain in the predicted magnitude, and the amplitude of the predicted rainfall could be improved. Further, continuous evaluation or modification of the AI-based model is required, particularly for heavy rainfall events. Blending ML-based models with conventional NWP output may also yield more accurate forecasts, which needs further study.

Figure 7 (a) GPCP daily rainfall and (b) GPR-predicted daily rainfall for July 2006.
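The agreement between a predicted and an observed rainfall field can be summarized as sketched below. The text does not define R², so the squared Pearson correlation over grid points is assumed here; the function name and the small made-up field are illustrative only.

```python
import numpy as np


def field_agreement(observed, predicted):
    """Squared correlation, RMSE and mean bias between two gridded fields."""
    obs = np.asarray(observed, dtype=float).ravel()
    pred = np.asarray(predicted, dtype=float).ravel()
    valid = np.isfinite(obs) & np.isfinite(pred)  # skip missing grid cells
    obs, pred = obs[valid], pred[valid]
    r = np.corrcoef(obs, pred)[0, 1]
    return {
        "R2": float(r**2),  # assumed definition: squared Pearson correlation
        "RMSE": float(np.sqrt(np.mean((pred - obs) ** 2))),
        "bias": float(np.mean(pred - obs)),  # negative means underestimation
    }


# Made-up 2 x 3 field (mm per day) with one missing cell.
observed = np.array([[2.0, 4.0, 6.0], [8.0, 10.0, np.nan]])
# A prediction that has the right spatial pattern but damped magnitudes.
predicted = 0.5 * observed + 1.0
stats = field_agreement(observed, predicted)
```

In this toy case the spatial pattern is reproduced perfectly, so the squared correlation is 1, while the bias is negative. This illustrates how a high R² can coexist with underestimated rainfall magnitudes, which is why the bias or RMSE is worth reporting alongside it.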

4 Conclusion and future scope

In the present communication, GPR is applied to historical time-series rainfall data to predict heavy and light rainfall days over Sriharikota, located on the east coast of India. The model predicted heavy and light rainfall days with reasonably good accuracy (RMSE = 0.161). It reproduced the variability and captured the peak years of heavy rainfall days, and its performance was found to be better than that of the three other ML-based models. Further, the GPR model was trained using GPCP daily rainfall data and used to predict the spatial distribution of monthly rainfall (1° × 1°) over the Indian region, which also showed good agreement with the observations (R² = 0.84 and RMSE = 3.324). These results encourage the use of the GPR model as one of the promising machine learning tools for predicting heavy rainfall days over a given location; such days affect the average rainfall over the region, ecosystems and the landscape through erosion processes, and lead to major floods. It is recommended that, in future, the GPR model be coupled with other ML-based models so that the prediction accuracy can be improved in terms of magnitude as well as the occurrence of heavy rainfall days. In a changing climate, where anthropogenic emissions of aerosols influence cloud properties and hence rainfall, the model predictions made in the present study need to be evaluated continuously to improve the model.