1 Introduction

Establishing a reliable and accurate probability distribution is vital for hydrology, meteorology, and other related fields of study. The choice of distribution particularly affects the verification and analysis of heavy rainfall and the improvement of simulation results using hydrological models (Deguenon et al. 2009; Hanum et al. 2015). Accurate probability distributions of rainfall data can be used to forecast precipitation amounts several hours ahead. For these reasons, various approaches for determining rainfall probability distributions have been actively researched (Coe and Stren 1982; Baxevani and Lennartsson 2015).

Gamma (Coe and Stren 1982) or log-normal distributions are commonly used as probability distributions for rainfall data, although mixed distributions have recently become popular. The gamma distribution does not perform well for heavy precipitation (high rainfall with low frequency), where high accuracy is required. For this reason, the generalized Pareto (GP) distribution has been applied in extreme rainfall cases (Deguenon et al. 2009); however, it cannot be used if the frequency of light rain is high. To compensate for these deficits, this research focuses on spliced distributions, which fit a different distribution on each part of the support and can consider covariates for seasonality and teleconnections. Hanum et al. (2015) showed that fitting a spliced distribution consisting of a gamma distribution and a Pareto distribution to tropical heavy rainfall data from Jakarta, Indonesia, gave a better result than fitting a single gamma, Pareto, or GP distribution (see also Li et al. 2012). However, a spliced distribution is generally neither continuous nor differentiable at the threshold between the two distributions. A hybrid distribution has therefore been developed to compensate for this problem, although it satisfies only continuity at the threshold and not differentiability. Previous studies have attempted to splice the gamma distribution with a GP distribution. For example, Baxevani and Lennartsson (2015) proposed a precipitation generator composed of a hybrid gamma and GP distribution after fixing the threshold as a constant.

Stochastic weather generators are often used to simulate time series of weather variables on a daily scale. They can also be used to temporally downscale climate information, such as monthly or seasonal forecasts (Wilks and Wilby 1999; Benestad et al. 2008). The parametric weather generator can incorporate covariates for seasonality and teleconnections associated with El Niño (Furrer and Katz 2007). However, this approach tends to underestimate the observed inter-annual variance of seasonally aggregated variables, a phenomenon termed “overdispersion” and shown in Fig. 1 (Buishand 1978; Katz and Parlange 1998; Benestad et al. 2008). Recently, filtered time series of seasonal total precipitation obtained by locally weighted scatter-plot smoothing (LOESS, Cleveland 1979; Hastie and Tibshirani 1990), or a hidden variable reflecting the unobserved seasonal shifts in climate regimes, have been incorporated as covariates in the GLM-based weather generator (Kim et al. 2012; Kim and Lee 2017). However, there are still some issues concerning the choice of the smoothing parameter.

Fig. 1 Overdispersion phenomenon in GLM weather generator

The purpose of this study is to investigate the conditions under which an existing hybrid distribution becomes differentiable, and to suggest a modified hybrid gamma and generalized Pareto distribution for rainfall data that satisfies both continuity and differentiability, which is essential for maximum likelihood estimation of the parameters of the hybrid distribution. For this purpose, we evaluate the appropriateness of the modified hybrid gamma and GP distribution model through various simulation results. In addition, we demonstrate the practical application of the model using daily summer precipitation amounts observed in Seoul, Korea, from 1961 to 2011. Finally, we make a suggestion for additional studies to produce a weather generator that can reduce overdispersion with our proposed distribution.

2 Modified Hybrid Gamma and Generalized Pareto Distribution

Here, we first describe the spliced distribution suggested by Klugman et al. (2004) and Nadarajah and Bakar (2014). We then separately introduce the gamma and GP distributions before finally presenting our modified hybrid gamma and GP distribution.

2.1 Spliced Distribution

A spliced distribution is typically constructed using several distribution functions f1(x), f2(x), … , fk(x). The general form of the spliced distribution denoted by f(x) can be expressed as follows (see Klugman et al. 2004; Nadarajah and Bakar 2014):

$$ f(x)=\left\{\begin{array}{ccc}\kern0.5em {a}_1{f}_1^{\ast }(x)& \mathrm{if}& {c}_0<x<{c}_1\\ {}\kern0.5em {a}_2{f}_2^{\ast }(x)& \mathrm{if}& {c}_1<x<{c}_2\\ {}\begin{array}{c}\vdots \\ {}\kern0.5em {a}_k{f}_k^{\ast }(x)\end{array}& \begin{array}{c}\\ {}\mathrm{if}\end{array}& \begin{array}{c}\vdots \\ {}{c}_{k-1}<x<{c}_k\end{array}\end{array}\right. $$
(2.1)

where ai is the mixing weight and satisfies \( {\sum}_{i=1}^k{a}_i=1,\left({a}_i>0\right) \) and \( {f}_i^{\ast }(x) \) denotes the truncated probability density function, which is of the form

$$ {f}_i^{\ast }(x)=\frac{f_i(x)}{\int_{{\mathrm{c}}_{i-1}}^{c_i}{f}_i(x) dx},\kern0.5em i=1,2,\dots, k. $$

In the special case of (2.1), a simple spliced distribution combines the head part of the probability density function f1(x) with the tail part of f2(x), and can be shown as follows:

$$ f(x)=\left\{\begin{array}{ccc}\kern0.5em {a}_1{f}_1^{\ast }(x)& \mathrm{if}& -\infty <x<\theta \\ {}\kern0.5em & & \\ {}{a}_2{f}_2^{\ast }(x)& \mathrm{if}& \theta <x<\infty \end{array}.\right. $$
(2.2)

As mentioned before, a1, a2 are positive mixing weights that satisfy a1 + a2 = 1, and \( {f}_1^{\ast }(x) \) and \( {f}_2^{\ast }(x) \) have the following forms

$$ {f}_1^{\ast }(x)=\frac{f_1(x)}{F_1\left(\theta \right)}\ \mathrm{and}\ {f}_2^{\ast }(x)=\frac{f_2(x)}{1-{F}_2\left(\theta \right)}, $$

where F1(x) and F2(x) are cumulative distribution functions corresponding to the density functions f1(x) and f2(x), respectively. Note that θ denotes the limit of the domain and is regarded as a model parameter (Klugman et al. 2004). However, it is not generally guaranteed that a spliced distribution of the form (2.2) is a valid continuous density function. Thus, a spliced distribution requires the following condition to be continuous: f(θ−) = f(θ+). In addition, it needs another critical condition such as differentiability (Bakar et al. 2015). To be differentiable at every point x, the probability density function f(x) is required to satisfy the following condition at threshold θ:

$$ {f}^{\prime}\left(\theta -\right)={f}^{\prime}\left(\theta +\right). $$

Under the above conditions, the spliced distribution with probability density function f(x) is differentiable on its support. In addition, these conditions yield relationships between the head and tail parameters, which commonly reduces the number of free parameters of the spliced distribution.

According to the continuity condition, the mixing weights a1, a2 in (2.2) can be represented as

$$ {a}_1=\frac{1}{1+\delta }\ \mathrm{and}\ {a}_2=1-{a}_1=\frac{\delta }{1+\delta }, $$

where

$$ \delta =\frac{f_1\left(\theta \right)\left[1-{F}_2\left(\theta \right)\right]}{f_2\left(\theta \right){F}_1\left(\theta \right)}. $$

Note that δ > 0 and it is well known that mixing weights that depend on the combined distribution parameters give better fits than constant weights (Scollnik 2007). However, this function might not always have an explicit form.
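As a minimal sketch of how the continuity condition fixes the mixing weights, the following Python snippet computes δ, a1, and a2 for a two-piece spliced density of the form (2.2) and checks that the two pieces meet at θ. The exponential head and tail, their rates, the threshold value, and all function names are hypothetical choices made only for illustration.

```python
import math

def spliced_weights(f1, F1, f2, F2, theta):
    """Return (a1, a2, delta) implied by continuity of the spliced density at theta."""
    delta = f1(theta) * (1.0 - F2(theta)) / (f2(theta) * F1(theta))
    a1 = 1.0 / (1.0 + delta)
    return a1, 1.0 - a1, delta

def spliced_pdf(x, f1, F1, f2, F2, theta):
    """Two-piece spliced density: truncated f1 below theta, truncated f2 above."""
    a1, a2, _ = spliced_weights(f1, F1, f2, F2, theta)
    if x <= theta:
        return a1 * f1(x) / F1(theta)
    return a2 * f2(x) / (1.0 - F2(theta))

def exp_pdf(rate):
    """Density of an exponential distribution with the given rate."""
    return lambda x: rate * math.exp(-rate * x) if x >= 0 else 0.0

def exp_cdf(rate):
    """Distribution function of an exponential distribution with the given rate."""
    return lambda x: 1.0 - math.exp(-rate * x) if x >= 0 else 0.0

# Hypothetical example: exponential head (rate 1) and exponential tail (rate 0.25).
f1, F1 = exp_pdf(1.0), exp_cdf(1.0)
f2, F2 = exp_pdf(0.25), exp_cdf(0.25)
theta = 2.0

a1, a2, delta = spliced_weights(f1, F1, f2, F2, theta)
left = a1 * f1(theta) / F1(theta)            # value of the head piece at theta
right = a2 * f2(theta) / (1.0 - F2(theta))   # value of the tail piece at theta
```

Because the weights are computed from the component densities, any change in the head or tail parameters or in θ changes a1 and a2 as well, which is the sense in which the weights depend on the combined distribution parameters.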

Under both continuity and differentiability conditions of the probability density function f(x), the equation \( {f}_1^{\prime}\left(\theta \right){f}_2\left(\theta \right)-{f}_1\left(\theta \right){f}_2^{\prime}\left(\theta \right)=0 \) is satisfied at threshold θ. Alternatively, it can be achieved by solving

$$ \frac{\partial }{\partial \theta}\log \left(\frac{f_1\left(\theta \right)}{f_2\left(\theta \right)}\right)=0, $$
(2.3)

as shown by Bakar et al. (2015). In general, the parameters of both the head density f1(x) and the tail density f2(x), as well as the mixing weights a1, a2, depend on threshold θ. However, threshold θ can itself be obtained only after estimating all parameters of the spliced distribution. Thus, a certain number of iterations is necessary.
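The step from the two smoothness conditions to (2.3) can be written out explicitly. The following is a short sketch that uses only the definitions in (2.2) and assumes that f1(θ) and f2(θ) are nonzero. Continuity and differentiability at θ read

$$ \frac{a_1{f}_1\left(\theta \right)}{F_1\left(\theta \right)}=\frac{a_2{f}_2\left(\theta \right)}{1-{F}_2\left(\theta \right)}\ \mathrm{and}\ \frac{a_1{f}_1^{\prime}\left(\theta \right)}{F_1\left(\theta \right)}=\frac{a_2{f}_2^{\prime}\left(\theta \right)}{1-{F}_2\left(\theta \right)}. $$

Dividing the second equation by the first removes the mixing weights and the normalizing constants, so that

$$ \frac{f_1^{\prime}\left(\theta \right)}{f_1\left(\theta \right)}=\frac{f_2^{\prime}\left(\theta \right)}{f_2\left(\theta \right)}\kern0.5em \iff \kern0.5em {f}_1^{\prime}\left(\theta \right){f}_2\left(\theta \right)-{f}_1\left(\theta \right){f}_2^{\prime}\left(\theta \right)=0\kern0.5em \iff \kern0.5em \frac{\partial }{\partial \theta}\log \left(\frac{f_1\left(\theta \right)}{f_2\left(\theta \right)}\right)=0. $$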

2.2 Modified Hybrid Gamma and Generalized Pareto Distribution

For frequent light rainfall, the gamma distribution is commonly cited as appropriate because it has a short right tail; for data with a long right tail, however, it might not fit well. In contrast, the GP distribution generally fits well when the right tail is long, as in the case of heavy rainfall, although it might not fit well when the right tail is short. In addition, data loss occurs because the GP distribution is truncated below the threshold value.

Therefore, we suggest a modified hybrid gamma and generalized Pareto distribution that uses the gamma distribution when rainfall is light and the GP distribution when rainfall is heavy.

Suppose that the head part, f1(x), is a gamma distribution with shape parameter α and scale parameter β, and the tail part, f2(x), is a GP distribution with location θ, scale parameter σ, and shape parameter ξ. Then, the threshold θ satisfying (2.3) is given by

$$ \theta =\left(\alpha -1\right)\beta . $$
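This expression can be recovered from (2.3) under one reading of the condition, offered here as a sketch rather than as the original derivation. The GP density evaluated at its own location parameter is f2(θ) = 1/σ, which is treated as not depending on θ, so only the gamma term contributes to the derivative:

$$ \frac{\partial }{\partial \theta}\log \left(\frac{f_1\left(\theta \right)}{f_2\left(\theta \right)}\right)=\frac{\partial }{\partial \theta}\left[\left(\alpha -1\right)\log \theta -\frac{\theta }{\beta }-\log \Gamma \left(\alpha \right)-\alpha \log \beta +\log \sigma \right]=\frac{\alpha -1}{\theta }-\frac{1}{\beta }=0, $$

which gives θ = (α − 1)β. Under this reading, θ coincides with the mode of the gamma density.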

Thus, threshold θ of the modified hybrid gamma and GP distribution only depends on the parameters of the gamma distribution when α > 1. In addition, the δ value for mixing weights can be expressed as follows:

$$ \delta =\frac{\sigma {\theta}^{\alpha -1}\exp \left(-\theta /\beta \right)}{\Gamma \left(\alpha, \theta /\beta \right)\ {\beta}^{\alpha }}, $$

where Γ(α, θ/β) is the lower incomplete gamma function. Note that δ is determined by parameters of both the GP and gamma distributions. Therefore, the probability density function of the proposed distribution is

$$ f(x)=\left\{\begin{array}{ccc}\ \frac{1}{\left(1+\delta \right)\ {\beta}^{\alpha}\Gamma \left(\alpha, \theta /\beta \right)}{x}^{\alpha -1}\exp \left(-x/\beta \right)& \mathrm{if}& 0<x\le \theta \\ {}\kern0.5em & & \\ {}\ \frac{\delta }{\left(1+\delta \right)\ \sigma }{\left(1+\frac{\xi \left(x-\theta \right)}{\sigma}\right)}^{\left(-\frac{1}{\xi }-1\right)}& \mathrm{if}& \theta <x<\infty \end{array},\right. $$
(2.4)

where α > 1.
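The proposed density can be evaluated numerically with a short Python sketch. It assumes that the head uses the usual gamma kernel x^(α−1) exp(−x/β) on 0 < x ≤ θ, that θ = (α − 1)β, and that ξ > 0. The function names and the series routine for the lower incomplete gamma function are illustrative helpers, and the parameter values are those of Simulation 1.

```python
import math

def lower_inc_gamma(a, x, tol=1e-15, max_terms=10_000):
    """Lower incomplete gamma, integral of t^(a-1) e^(-t) over (0, x), by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for k in range(1, max_terms):
        term *= x / (a + k)          # next term: x^k / (a (a+1) ... (a+k))
        total += term
        if term < tol * total:
            break
    return total * math.exp(a * math.log(x) - x)

def hybrid_pdf(x, alpha, beta, xi, sigma):
    """Hybrid gamma/GP density with theta = (alpha - 1) * beta (assumes alpha > 1, xi > 0)."""
    theta = (alpha - 1.0) * beta
    g = lower_inc_gamma(alpha, theta / beta)
    delta = sigma * theta ** (alpha - 1.0) * math.exp(-theta / beta) / (g * beta ** alpha)
    if x <= 0.0:
        return 0.0
    if x <= theta:
        # Gamma head, truncated at theta and weighted by a1 = 1 / (1 + delta).
        return x ** (alpha - 1.0) * math.exp(-x / beta) / ((1.0 + delta) * beta ** alpha * g)
    # GP tail with location theta, weighted by a2 = delta / (1 + delta).
    z = 1.0 + xi * (x - theta) / sigma
    return delta / ((1.0 + delta) * sigma) * z ** (-1.0 / xi - 1.0)

# Parameter values of Simulation 1.
alpha, beta, xi, sigma = 5.0, 4.0, 0.3, 8.0
theta = (alpha - 1.0) * beta
left = hybrid_pdf(theta, alpha, beta, xi, sigma)           # head value at the threshold
right = hybrid_pdf(theta + 1e-9, alpha, beta, xi, sigma)   # tail value just above it
```

Comparing `left` and `right` is a quick numerical check of continuity at the threshold; integrating the head piece over (0, θ) should return the weight 1/(1 + δ).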

2.3 Parameter Estimation of the Proposed Distribution

Here, we compute the maximum likelihood estimators of the parameters of the modified hybrid gamma and GP distribution. From the general form of the likelihood function of the spliced distribution, the log-likelihood function of (α, β, ξ, σ, θ) can be expressed as follows:

$$ {\displaystyle \begin{array}{l}\log L\left(\alpha, \beta, \xi, \sigma, \theta \right)=-n\log \left(1+\delta \right)+{\sum}_{x_i\le \theta}\log {f}_1\left({x}_i\right)-M\log {F}_1\left(\theta \right)\\ {}+{\sum}_{y_i>\theta}\log {f}_2\left({y}_i\right)+m\log \delta \end{array}} $$

where n = M + m, \( M={\sum}_{i=1}^nI\left({x}_i\le \theta \right) \), \( m={\sum}_{i=1}^nI\left({y}_i>\theta \right) \), f1(x) is the probability density function of the gamma distribution, F1(x) is the cumulative distribution function of the gamma distribution, and f2(x) is the probability density function of the GP distribution. As the maximum likelihood estimator of (α, β, ξ, σ, θ) has no explicit form, we use the differential evolution (DE) algorithm to maximize the log-likelihood logL(α, β, ξ, σ, θ) by global optimization. Note that the DE algorithm, unlike classical optimization methods, does not require the objective function to be differentiable, and so is useful for finding approximate solutions when the objective has non-differentiable points, multiple local minima, or non-linearities.
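The log-likelihood can be coded directly as an objective for a numerical optimizer. The sketch below is one possible implementation, not the original code. It assumes that θ is tied to (α − 1)β, that ξ > 0, and that log(1 + δ) is subtracted once per observation, as follows from taking the product of the hybrid density over all n data points. It returns the negative log-likelihood so that a minimizer can be used, and the function names are illustrative.

```python
import math

def lower_inc_gamma(a, x, tol=1e-15, max_terms=10_000):
    """Lower incomplete gamma, integral of t^(a-1) e^(-t) over (0, x), by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for k in range(1, max_terms):
        term *= x / (a + k)
        total += term
        if term < tol * total:
            break
    return total * math.exp(a * math.log(x) - x)

def neg_log_lik(params, data):
    """Negative log-likelihood of the hybrid gamma/GP model for positive data."""
    alpha, beta, xi, sigma = params
    # Assumed parameter region; anything outside is rejected with an infinite cost.
    if alpha <= 1.0 or beta <= 0.0 or xi <= 0.0 or sigma <= 0.0:
        return math.inf
    theta = (alpha - 1.0) * beta
    g = lower_inc_gamma(alpha, theta / beta)
    log_delta = (math.log(sigma) + (alpha - 1.0) * math.log(theta) - theta / beta
                 - math.log(g) - alpha * math.log(beta))
    log_norm = math.log1p(math.exp(log_delta))     # log(1 + delta)
    log_F1 = math.log(g) - math.lgamma(alpha)      # log of the gamma CDF at theta
    total = 0.0
    for x in data:
        if x <= 0.0:
            return math.inf
        if x <= theta:
            log_f1 = ((alpha - 1.0) * math.log(x) - x / beta
                      - math.lgamma(alpha) - alpha * math.log(beta))
            total += log_f1 - log_F1 - log_norm
        else:
            z = 1.0 + xi * (x - theta) / sigma
            log_f2 = -math.log(sigma) - (1.0 / xi + 1.0) * math.log(z)
            total += log_f2 + log_delta - log_norm
    return -total
```

Returning infinity outside the assumed parameter region is a simple way to keep a derivative-free optimizer such as DE inside the valid domain without a separate constraint mechanism.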

3 Numerical Studies

We focus on the class of problems where the behavior of distributions over (or below) a high (or low) threshold is of interest; i.e. those that characterize extreme events. As mentioned in the introduction, a mixture of the gamma and GP distributions with a threshold has emerged as an efficient way to generate more realistic weather scenarios for impact assessments.

3.1 Simulation Study

In this section, we report the simulation results for the optimal threshold of the proposed distribution and examine the efficiency of the model estimation method. The maximum likelihood estimates of the parameters and the threshold of the spliced distribution are provided. The parameters α and β belong to the gamma head, ξ and σ belong to the GP tail, and θ is the threshold, which also serves as the location parameter of the GP distribution. Each sample is drawn from Eq. (2.4), and the mixing weights are represented as a function of the other parameters of the model. To compare and summarize the performance of the simulation results, we consider the mean square error (MSE), defined as follows:

$$ MSE=\frac{1}{\mathrm{N}}{\sum}_{i=1}^N{\left({\widehat{\theta}}^{(i)}-\theta \right)}^2 $$

where N is the number of iterations, \( {\widehat{\theta}}^{(i)} \) is the estimated threshold at the ith iteration, and θ is the true threshold.
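One way to draw samples from the hybrid density and to compute the MSE is sketched below in Python. The sampling scheme is an assumption, since the text does not state how samples were generated: the tail is chosen with probability δ/(1 + δ) and sampled by inverting the GP distribution function, and the head is sampled by rejecting gamma draws that exceed θ. All function names are illustrative.

```python
import math
import random

def lower_inc_gamma(a, x, tol=1e-15, max_terms=10_000):
    """Lower incomplete gamma, integral of t^(a-1) e^(-t) over (0, x), by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for k in range(1, max_terms):
        term *= x / (a + k)
        total += term
        if term < tol * total:
            break
    return total * math.exp(a * math.log(x) - x)

def tail_weight(alpha, beta, sigma):
    """Return (theta, a2) with theta = (alpha - 1) * beta and a2 = delta / (1 + delta)."""
    theta = (alpha - 1.0) * beta
    g = lower_inc_gamma(alpha, theta / beta)
    delta = sigma * theta ** (alpha - 1.0) * math.exp(-theta / beta) / (g * beta ** alpha)
    return theta, delta / (1.0 + delta)

def sample_hybrid(n, alpha, beta, xi, sigma, rng):
    """Draw n values from the hybrid gamma/GP model (assumes alpha > 1, xi > 0)."""
    theta, a2 = tail_weight(alpha, beta, sigma)
    out = []
    for _ in range(n):
        if rng.random() < a2:
            # GP tail: invert F(x) = 1 - (1 + xi (x - theta) / sigma)^(-1/xi).
            u = rng.random()
            out.append(theta + sigma / xi * ((1.0 - u) ** (-xi) - 1.0))
        else:
            # Gamma head: reject draws above the threshold.
            while True:
                x = rng.gammavariate(alpha, beta)   # beta is the scale parameter
                if x <= theta:
                    out.append(x)
                    break
    return out

def mse(estimates, truth):
    """Mean square error of repeated estimates around the true value."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)

# One sample of size 500 under the Simulation 1 parameters.
rng = random.Random(0)
sample = sample_hybrid(500, 5.0, 4.0, 0.3, 8.0, rng)
```

In a full simulation, `sample_hybrid` would be called N times, the model refitted to each sample, and `mse` applied to the N threshold estimates.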

Table 1 shows the simulation results from the modified hybrid gamma and GP distribution with α = 5, β = 4, ξ = 0.3, σ = 8, θ = 16, and N = 100 (Simulation 1). For each simulation, the sample size n is set to 500, 1000, and 2000; as the sample size increases, the maximum likelihood estimates of the parameters become more stable. Figure 2 plots the fitted modified hybrid gamma and GP distribution for Simulation 1 and shows that the proposed distribution is estimated effectively. Table 2 shows the simulation results under different parameter conditions: α = 9, β = 7, ξ = 0.2, σ = 3, θ = 56, and N = 100 (Simulation 2). Tables 3 and 4 show the results for a simulation with a relatively small threshold, where α = 2, β = 4, ξ = 0.7, σ = 4, θ = 4 (Simulation 3), and a simulation with a very large threshold, where α = 14, β = 20, ξ = 1, σ = 4, θ = 260 (Simulation 4), respectively. The overall simulation results indicate that, as the sample size increases, the MSE decreases and the estimates become more stable. Figures 3, 4, and 5 show the fitted modified hybrid gamma and generalized Pareto distribution corresponding to the results of Simulations 2, 3, and 4, respectively.

Table 1 MSE of modified hybrid distribution parameters with relatively small threshold values
Fig. 2 Fitted modified hybrid gamma and generalized Pareto distribution with relatively small threshold values

Table 2 MSE of modified hybrid distribution parameters with relatively large threshold values
Table 3 MSE of modified hybrid distribution parameters with very small threshold values
Table 4 MSE of modified hybrid distribution parameters with large threshold values
Fig. 3 Fitted modified hybrid gamma and generalized Pareto distribution with relatively large threshold values

Fig. 4 Fitted modified hybrid gamma and generalized Pareto distribution with very small threshold values

Fig. 5 Fitted modified hybrid gamma and generalized Pareto distribution with very large threshold values

3.2 Real Data

To verify and demonstrate the performance of the proposed modified hybrid gamma and GP distribution on real rainfall data, we use the daily summer precipitation amounts observed at 62 weather stations in Korea from 1961 to 2011 and obtain the maximum likelihood estimates. Each station has 4692 daily observations, of which we used 2051, excluding those recorded as 0 (no rain). Descriptive statistics for Seoul are summarized in Table 5. The 50-year rainfall data are positively skewed, and the maximum value differs greatly from both the 1st and 3rd quartiles. The data feature many instances of low rainfall and only a few large rainfall amounts (Table 6). Summary statistics for the estimated parameters of the modified hybrid gamma and generalized Pareto distribution at the 62 weather stations during 1961–2011 are provided in Table 7. Plots of the fitted modified hybrid gamma and generalized Pareto distribution, together with histograms of the rainfall data at several weather stations in Korea during 1961–2011, are provided in Fig. 6. In addition, Fig. 7 and Fig. 8 show the estimated parameters of the modified hybrid gamma and generalized Pareto distribution at the 62 weather stations during 1961–2011. Some parameters share a geographical trend with the threshold parameter (θ).

Table 5 Descriptive statistics for rainfall data in Seoul during 1961–2011
Table 6 Approximate maximum likelihood estimates of modified hybrid gamma and generalized Pareto distribution using rainfall data in Seoul during 1961–2011
Table 7 Summary statistics for estimated parameters of modified hybrid gamma and generalized Pareto distribution using rainfall data in 62 weather stations during 1961–2011
Fig. 6 Fitted modified hybrid gamma and generalized Pareto distribution with histogram of rainfall data in several weather stations (Seoul, Incheon, Gwangju, Daejeon) in Korea during 1961–2011

Fig. 7 Estimated threshold parameters (θ) of modified hybrid gamma and generalized Pareto distribution using rainfall data in 62 weather stations during 1961–2011

Fig. 8 Estimated parameters (α, β, ξ, σ) of modified hybrid gamma and generalized Pareto distribution using rainfall data in 62 weather stations during 1961–2011 (α, β, ξ, σ, clockwise from top left)

To determine the maximum likelihood estimates of the proposed distribution, we used the DE algorithm. Through multidimensional global optimization, we found the approximate maximum likelihood estimates that minimize the negative log-likelihood function. In addition, we calculated the standard error of each estimator, that is, its standard deviation, after deriving the Fisher information from the Hessian matrix. These results are summarized in Table 6. Finally, we compared the goodness of fit of the gamma distribution, GP distribution, and modified hybrid gamma and GP distribution using the rainfall data, which confirmed that the proposed model gives a better fit.
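The estimation workflow described here (DE minimization followed by Hessian-based standard errors) can be illustrated with a small self-contained Python sketch. It is not the implementation used in this study: the DE variant (rand/1/bin), its settings, and the finite-difference Hessian are assumptions, and a one-parameter exponential model stands in for the hybrid likelihood so that the result can be checked against the closed-form estimate n/Σx.

```python
import math
import random

def differential_evolution(objective, bounds, rng, pop_size=20,
                           generations=300, F=0.8, CR=0.9):
    """Minimal DE/rand/1/bin minimizer; returns (best_point, best_value)."""
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    cost = [objective(p) for p in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Three distinct population members, all different from the target i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)      # coordinate that is always mutated
            trial = list(pop[i])
            for j in range(dim):
                if rng.random() < CR or j == j_rand:
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    lo, hi = bounds[j]
                    trial[j] = min(max(v, lo), hi)   # clip to the search box
            trial_cost = objective(trial)
            if trial_cost <= cost[i]:                # greedy selection
                pop[i], cost[i] = trial, trial_cost
    best = min(range(pop_size), key=cost.__getitem__)
    return pop[best], cost[best]

# Toy stand-in for the hybrid likelihood (assumption): exponential data with rate 2.
rng = random.Random(1)
data = [rng.expovariate(2.0) for _ in range(500)]
n, total = len(data), sum(data)

def neg_log_lik(p):
    """Negative log-likelihood of an exponential model with rate p[0]."""
    lam = p[0]
    return -n * math.log(lam) + lam * total

best, best_cost = differential_evolution(neg_log_lik, [(0.01, 10.0)], rng)
lam_hat = best[0]

# Observed information: central-difference second derivative of the objective.
h = 1e-4 * lam_hat
hess = (neg_log_lik([lam_hat + h]) - 2.0 * neg_log_lik([lam_hat])
        + neg_log_lik([lam_hat - h])) / h ** 2
std_err = math.sqrt(1.0 / hess)   # standard error from the inverse information
```

For a model with several parameters, the same idea applies with the full Hessian matrix: the standard errors are the square roots of the diagonal of its inverse.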

4 Concluding Remarks

In this study, we first introduced a general spliced distribution together with the gamma distribution, which forms the head of the curve, and the generalized Pareto distribution, which forms the tail. Then, we examined the threshold condition for our proposed distribution and defined a new probability density function accordingly. We further derived the likelihood function for the distribution and obtained approximate maximum likelihood estimates in multiple simulations, using the DE algorithm for minimization. At the same time, by presenting the MSE for each sample size, we evaluated the precipitation generator model according to the size of the sample. Finally, we used 2051 measurable daily summer precipitation amounts observed in Seoul, Korea, from 1961 to 2011. The estimated threshold of the modified hybrid gamma and generalized Pareto distribution was 0.1455. After deriving the Fisher information from the Hessian matrix, we also presented the standard error of the maximum likelihood estimator.

This study represents the first attempt to use a modified hybrid approach, which will be built on in future research. Our work has two major advantages. First, whereas the threshold is usually fixed as a constant in a spliced distribution that includes a generalized Pareto distribution, the modified hybrid gamma and generalized Pareto distribution proposed in this study allows the threshold and other parameters to be estimated simultaneously. The result therefore differs from that of a general mixed distribution because it satisfies both continuity and differentiability at the threshold. Second, generating rainfall data using the modified hybrid gamma and generalized Pareto distribution will reduce the overdispersion that occurs in existing parametric weather generators and provide more accurate probability estimates.