1 Introduction

Floods are among the most common natural hazards and pose an increasing risk to human life and the environment. At the drainage basin scale, flood risk is an important consideration in the planning of water infrastructure projects. The design of hydraulic structures (e.g., dam spillways, diversion canals, dikes, and river channels), urban drainage systems, and cross-drainage structures (e.g., culverts and bridges), as well as reservoir management and flood hazard mapping, requires risk analysis of floods as a design criterion. However, because hydrological data are limited and hydrological variables carry considerable uncertainty, effective methods for assessing flood risk are essential. Various methods have traditionally been developed for the probabilistic assessment of flood risk (also referred to as flood frequency analysis). Flood frequency analysis consists of establishing the relationship between flood quantiles and their non-exceedance probabilities or, equivalently, their return periods. The conventional method of frequency analysis expresses observed annual peak discharges as a function of recurrence interval or exceedance probability (Stedinger et al. 1993). Such univariate applications cannot describe floods adequately, because a flood is a multivariate phenomenon characterized by the peak, volume, and duration of the flood hydrograph.

In most of the hydrologic literature, the application of multivariate modeling is mainly confined to bivariate cases. Krstanovic and Singh (1987) used bivariate normal and exponential distributions for modeling the joint distribution of flood peak and volume, as well as for obtaining the conditional distribution of flood volume given peak discharge. Similarly, several studies have used conventional bivariate distributions for flood frequency analysis, including the bivariate generalized extreme value distribution (Adamson et al. 1999; Raynal and Salas 1987; Yue et al. 1999; Yue 2001a; Nadarajah and Shiau 2005; Escalante 2007), the bivariate gamma distribution (Yue 2001b; Yue et al. 2001), the bivariate normal distribution (Goel et al. 1998; Yue 1999), the bivariate lognormal distribution (Yue 2000), and the bivariate exponential distribution (Choulakian et al. 1998). Very few studies have dealt with trivariate distributions for flood frequency analysis. Examples are the trivariate Gumbel distribution (Escalante and Raynal 1998) and the trivariate generalized extreme value distribution (Escalante and Raynal 2008).

Many of the conventional models have a major limitation: they assume a fixed probability distribution for all flood properties (i.e., all flood properties are assumed to be well represented by a single family of probability distributions). In practice, this may not be the case, as the flood variables may follow different distributions. Moreover, conventional estimates of flood exceedance depend heavily on the right tail of the underlying frequency distribution, which is the part most difficult to estimate from observed flood data. Hence, it is desirable to evaluate the marginal distribution of each flood variable separately and, at the same time, to have a function that joins these marginal distributions while preserving the dependence structure of the flood variables. In this context, the theory of copulas provides some advantages. Copulas are multivariate distribution functions used for capturing the association or dependence between two or more random variables (Joe 1997). A further advantage of copulas for multivariate analysis is that they are invariant to strictly increasing transformations of the marginal variables (i.e., data transformations such as taking natural logarithms or applying Box–Cox transformations do not influence the copula).

Recently, copulas have been applied in the field of hydrology (Favre et al. 2004; Salvadori and De Michele 2004; Grimaldi and Serinaldi 2006; Zhang and Singh 2007; Chowdhary et al. 2011; Salvadori et al. 2011). Although many past studies have focused on bivariate analysis of peak flow and volume for analyzing flood risks, a more complete analysis is possible by considering the mutually correlated flood properties (peak flow, volume, and duration of flood events) and modeling them jointly with copulas. Very few studies have used copulas for trivariate analysis of flood flows. Grimaldi and Serinaldi (2006) compared the trivariate symmetric Frank copula, the asymmetric (fully nested) form of the Frank copula, and the Gumbel logistic distribution for the Kanawha watershed, West Virginia, and found better results using the asymmetric scheme. Zhang and Singh (2007) used the symmetric Gumbel–Hougaard copula for trivariate modeling of peak flow, volume, and duration, and noted that the copula fits the empirical joint distribution better than the trivariate normal distribution. Serinaldi and Grimaldi (2007) applied fully nested Archimedean copulas to two different hydrologic data sets, viz., trivariate flood frequency analysis and multivariate sea wave frequency analysis. Genest et al. (2007) applied meta-elliptical copulas for multivariate frequency analysis of the annual spring flood of the Romaine River in Québec, Canada, and concluded that meta-elliptical copulas can better model the dependence structure of random variables when observed differences between the bivariate margins restrict the use of exchangeable copula families (i.e., the Archimedean copulas). Salvadori et al. (2011) illustrated the application of copulas to multivariate flood quantile estimation by employing the joint distribution of annual maximum flood peak, volume, and the corresponding initial water level of the dam.

In most of these studies, the researchers applied either bivariate copulas or symmetric trivariate copulas for flood analysis. They introduced simplifications in deriving return periods, such as assuming dependence between only two flood properties and/or independence of the third variable, and then used copulas to model pair-wise dependence in a bivariate way. In this study, by contrast, asymmetric copulas are applied to model the trivariate flood properties (peak flow, volume, and duration) simultaneously, instead of considering the variables pair-wise. The main aims of the present study are: (1) to develop trivariate models based on the Archimedean and elliptical classes of copulas for modeling flood characteristics, (2) to evaluate the efficacy of copulas in modeling hydrological extremes and demonstrate the potential of copula models for flood risk analysis through a case study, and (3) to evaluate the importance of multivariate return periods for estimation of flood risks through a comparative analysis of the trivariate return periods with univariate and bivariate return periods.

The remainder of the paper is organized as follows. First, the definition of the copula function and its properties are presented. Then, the details of the copulas adopted in the study (i.e., three fully nested Archimedean families of copulas and one elliptical Student’s t copula) are given. The subsequent section describes various performance measures, including distance-based statistics, graphical tests, and tail dependence measures, for selecting a suitable copula model for the multivariate dependence of flood properties. Then the flood properties and the procedures for estimating joint return periods using copulas are explained. Later, the application of the methodology to a case study is presented and the results are discussed. Finally, a brief summary and the conclusions of the study are presented.

2 Multivariate modeling using copulas

2.1 Copula function

In general, a copula function is a multivariate distribution function used for capturing the dependence between two or more random variables. Let \( X = (X_1, \ldots, X_d) \) be a random vector with continuous marginal cumulative distribution functions (CDF) \( F_1, \ldots, F_d \). Following Sklar's theorem (Sklar 1959), the relation between the joint CDF H(X) and the copula C can be represented as,

$$ H(X) = C\left\{ F_1(x_1), \ldots, F_d(x_d); \theta \right\}, \quad X \in R^d $$
(1)

where the function \( C: [0,1]^d \to [0,1] \) is called a d-dimensional copula, with association parameter θ. More details on the theoretical background and properties of various copula families can be found in Nelsen (2006). Brief details of the copulas used in the present study are presented in the following.
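As a minimal illustration of Eq. 1, the following sketch builds a bivariate joint CDF from two marginal CDFs and a copula. The exponential marginals, the Clayton copula, the function names, and all parameter values are assumptions chosen only for illustration; they are not taken from the study.

```python
import math

def exp_cdf(x, rate):
    """Exponential marginal CDF (hypothetical marginal, illustration only)."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0

def clayton_copula(u, v, theta):
    """Bivariate Clayton copula C(u, v; theta) for theta > 0."""
    if u <= 0.0 or v <= 0.0:
        return 0.0
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

def joint_cdf(x1, x2, rate1=0.5, rate2=2.0, theta=1.5):
    """Sklar's theorem (Eq. 1): H(x1, x2) = C(F1(x1), F2(x2); theta)."""
    return clayton_copula(exp_cdf(x1, rate1), exp_cdf(x2, rate2), theta)
```

Letting one argument grow large recovers the other marginal, which shows that the copula carries only the dependence and leaves the marginals free to be chosen separately.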

2.2 Multivariate Archimedean copula

The Archimedean copulas are widely applied in hydrology because they can be easily generated and are capable of capturing a wide range of dependence structures, with several desirable properties such as symmetry and associativity. The d-dimensional Archimedean copula (Nelsen 2006) is expressed as,

$$ C\left( {{u_1}, \ldots, {u_d}} \right) = {\phi^{{ - 1}}}\left( {\phi \left( {{u_1}} \right) + \phi \left( {{u_2}} \right) + \ldots + \phi \left( {{u_d}} \right)} \right) $$
(2)

where \( u_i = F_i(X_i) \) is the marginal CDF of variable \( X_i \) (i = 1,…,d); ϕ(•) is known as the generator of the copula and \( \phi^{[-1]}(\cdot) \) is the pseudo-inverse of ϕ(•). If ϕ(0) = ∞, the pseudo-inverse coincides with the ordinary inverse function (i.e., \( \phi^{[-1]} = \phi^{-1} \)), and ϕ is then known as a strict generator. For a strict generator, Eq. 2 defines a copula in every dimension d ≥ 2 if and only if the inverse generator \( \phi^{-1} \) is completely monotonic on [0, ∞) (Nelsen 2006). This symmetric form of copula is developed under the assumption of homogeneous dependence across the variables, and has the limitation that all pairwise dependencies are constrained to take the same value. For hydrologic variables, such an assumption is rarely realistic. To increase flexibility and allow for heterogeneous dependence, fully nested Archimedean (FNA) copulas have been suggested (Whelan 2004; Savu and Trede 2010). The FNA structure joins two or more ordinary bivariate or higher-dimensional Archimedean copulas by another Archimedean copula. The dependence structure of the FNA copula in the trivariate case (Savu and Trede 2010) is expressed as,

$$ C\left( {{u_1},{u_2},{u_3}} \right) = {\phi_2}\left( {\phi_2^{{ - 1}} \circ {\phi_1}\left[ {\phi_1^{{ - 1}}\left( {{u_1}} \right) + \phi_1^{{ - 1}}\left( {{u_2}} \right)} \right] + \phi_2^{{ - 1}}\left( {{u_3}} \right)} \right) = {C_2}\left[ {{C_1}\left( {{u_1},{u_2}} \right),{u_3}} \right] $$
(3)

with the condition that the first derivative of \( \phi_2^{-1} \circ \phi_1 \) is completely monotonic. In the above expression, \( \phi_2 \) and \( \phi_1 \) are Laplace transforms; that is, in Eq. 3 ϕ plays the role of the inverse of the generator as written in Eq. 2. In Eq. 3, the pair \( (u_1, u_2) \) has a bivariate margin of the form of Eq. 2 with Laplace transform \( \phi_1 \), whereas the pairs \( (u_1, u_3) \) and \( (u_2, u_3) \) have bivariate margins of the form of Eq. 2 with Laplace transform \( \phi_2 \). Thus, the three-variable asymmetric copula is composed of two bivariate copulas \( C_1 \) and \( C_2 \), in which \( C_1 \) describes the dependence between variables \( u_1 \) and \( u_2 \), and the outer copula \( C_2 \) is a function of the inner copula \( C_1 \) and \( u_3 \). It follows that this asymmetric scheme can be applied only when the correlation between the two most nested variables is stronger than the correlation between each of these variables and the third variable.

The properties of the trivariate FNA structures for the Clayton, Gumbel–Hougaard, and Frank copulas are presented in Table 1. To apply a three-dimensional FNA copula, the data must satisfy the condition that the rank correlation coefficient (i.e., Kendall’s τ or Spearman’s ρ) of the inner pair \( (u_1, u_2) \) is higher than that of the other pairs \( (u_1, u_3) \) and \( (u_2, u_3) \).
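The symmetric construction of Eq. 2 and the nested construction of Eq. 3 can be sketched for the Gumbel–Hougaard family as follows. The code uses the generator convention of Eq. 2, with ϕ(t) = (−ln t)^θ. The names `theta_inner` and `theta_outer` are illustrative and need not match the parametrization of Table 1; the constraint `theta_inner >= theta_outer >= 1` is the usual sufficient nesting condition for this family and is stated here as an assumption.

```python
import math

def gumbel_gen(t, theta):
    """Gumbel-Hougaard generator; abs(ln t) equals -ln t on (0, 1]."""
    return abs(math.log(t)) ** theta

def gumbel_gen_inv(s, theta):
    """Inverse of the Gumbel-Hougaard generator."""
    return math.exp(-(s ** (1.0 / theta)))

def archimedean_gumbel(us, theta):
    """Symmetric Archimedean copula of Eq. 2 for any number of arguments."""
    return gumbel_gen_inv(sum(gumbel_gen(u, theta) for u in us), theta)

def fna_gumbel(u1, u2, u3, theta_inner, theta_outer):
    """Trivariate fully nested copula of Eq. 3: C2[C1(u1, u2), u3]."""
    if not theta_inner >= theta_outer >= 1.0:
        raise ValueError("require theta_inner >= theta_outer >= 1")
    inner = archimedean_gumbel([u1, u2], theta_inner)          # C1(u1, u2)
    return archimedean_gumbel([inner, u3], theta_outer)        # C2(C1, u3)
```

Setting `u3 = 1` returns the inner bivariate copula of `(u1, u2)`, while setting `u2 = 1` returns the outer bivariate copula of `(u1, u3)`; this is the heterogeneous pairwise dependence that the symmetric form cannot represent. When both parameters are equal, the nested form reduces to the symmetric trivariate copula.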

Table 1 Mathematical expressions for trivariate Archimedean family of copulas and their associated properties

2.3 Student’s t copula

The Student’s t copula belongs to the class of elliptical copulas and is specified by the multivariate Student’s t distribution. Let \( \Sigma \in R^{d \times d} \) denote a symmetric shape parameter matrix (i.e., Σ is the correlation matrix of the d variables) and let \( x \in R^d \). Then the multivariate Student’s t copula for \( u = (u_1, \ldots, u_d) \in [0,1]^d \) with ϑ degrees of freedom is defined as (Mashal and Zeevi 2002),

$$ C\left( u_1, \ldots, u_d; \vartheta, \Sigma \right) = t_{\vartheta, \Sigma}^{d}\left( t_{\vartheta}^{-1}\left( u_1 \right), \ldots, t_{\vartheta}^{-1}\left( u_d \right) \right) $$
(4)

where,

$$ t^{d}_{{\vartheta , \Sigma }} {\left( x \right)} = {\int_{ - \infty }^x {\frac{{\Gamma {\left( {{\left( {\vartheta + d} \right)}/2} \right)}}}{{\Gamma {\left( {\vartheta /2} \right)}{\left( {\vartheta \pi } \right)}^{{d/2}} {\sqrt {{\left| \Sigma \right|}} }}}} }{\left( {1 + y^{T} \Sigma ^{{ - 1}} y/\vartheta } \right)}^{{ - {\left( {\vartheta + d} \right)}/2}} dy $$

Here y = {y 1,…,y d } and \( {y_i} = t_{\vartheta }^{{ - 1}}\left( {{u_i}} \right) \).

Equation 4 represents the t copula with parameters (ϑ, Σ). For ϑ > 2, the shape parameter matrix Σ can be interpreted as a positive-definite correlation matrix. The density of the multivariate t copula can be expressed as,

$$ c\left( {u;\vartheta, \Sigma } \right) = \frac{{\Gamma \left( {{{{\left( {\vartheta + d} \right)}} \left/ {2} \right.}} \right){{\left[ {\Gamma \left( {{{\vartheta } \left/ {2} \right.}} \right)} \right]}^{{d - 1}}}}}{{{{\left[ {\Gamma \left( {{{{\left( {\vartheta + 1} \right)}} \left/ {2} \right.}} \right)} \right]}^d}{{\left| \Sigma \right|}^{{{{1} \left/ {2} \right.}}}}}}\left[ {\prod\limits_{{i = 1}}^d {{{\left( {1 + {{{y_i^2}} \left/ {\vartheta } \right.}} \right)}^{{{{{\left( {\vartheta + 1} \right)}} \left/ {2} \right.}}}}} } \right]{\left( {1 + {y^T}{\Sigma^{{ - 1}}}{{y} \left/ {\vartheta } \right.}} \right)^{{ - {{{\left( {\vartheta + d} \right)}} \left/ {2} \right.}}}} $$
(5)

2.4 Copula parameter estimation

Parameter estimation is performed using the maximum pseudo-likelihood (MPL) method. The multivariate copula model has d − 1 parameters if the dependence is modeled by an FNA copula, and d(d − 1)/2 correlation parameters, in addition to the degrees of freedom ϑ, for the elliptical t copula. The MPL estimation method does not require any prior assumptions regarding the marginal distributions of the dependent variables. The procedure consists of transforming each marginal variable into an approximately uniformly distributed vector using its empirical distribution function. The copula parameters are then estimated by maximizing the pseudo log-likelihood function.

Let \( X_1 = (X_{1,1}, \ldots, X_{1,d}), \ldots, X_n = (X_{n,1}, \ldots, X_{n,d}) \) be a sample of n observations in the d-dimensional case. The empirical CDF of the kth variable can be computed by,

$$ F_k\left( X_{i,k} \right) = \frac{R_{i,k}}{n + 1}, \quad i \in \left\{ 1, \ldots, n \right\}, \quad k \in \left\{ 1, \ldots, d \right\} $$
(6)

where R i,k is rank, which is given by

$$ {R_{{i,k}}} = \sum\limits_{{j = 1}}^n {I\left( {{X_{{j,k}}} \leqslant {X_{{i,k}}}} \right)}; $$
(7)

where I(A) is a logical indicator function that takes the value 1 if A is true and 0 if A is false. The denominator n + 1 is used instead of n to avoid numerical problems at the boundaries of \( [0,1]^d \) (a standard convention in probabilistic modeling). The empirical distribution function is used as a surrogate for the unknown marginals. Substituting the empirical CDFs into the copula density and taking the logarithm of the likelihood function of the copula yields the following form,

$$ \ell \left( \theta \right) = \sum\limits_{i = 1}^n \log \left[ c_{\theta}\left\{ F_1\left( X_{i,1} \right), \ldots, F_d\left( X_{i,d} \right) \right\} \right] $$
(8)

Then, the copula parameter \( \hat{\theta } \) can be obtained by maximizing this pseudo log-likelihood function ℓ(θ).
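The two steps of the MPL procedure (Eqs. 6 to 8) can be sketched as follows for a bivariate case. The Clayton copula density and the coarse grid search are stand-ins chosen to keep the example short and dependency-free; they are assumptions for illustration and do not reproduce the trivariate models or the optimizer used in this study. Ties in the data are not handled.

```python
import math

def pseudo_observations(x):
    """Eqs. 6-7: R_i / (n + 1), where R_i = #{j : x_j <= x_i}."""
    n = len(x)
    return [sum(1 for xj in x if xj <= xi) / (n + 1) for xi in x]

def clayton_log_density(u, v, theta):
    """Log-density of the bivariate Clayton copula, theta > 0."""
    s = u ** -theta + v ** -theta - 1.0
    return (math.log1p(theta)
            - (1.0 + theta) * (math.log(u) + math.log(v))
            - (2.0 + 1.0 / theta) * math.log(s))

def pseudo_loglik(theta, u, v):
    """Eq. 8 for d = 2: sum of log copula densities at the pseudo-observations."""
    return sum(clayton_log_density(ui, vi, theta) for ui, vi in zip(u, v))

def fit_clayton_mpl(x, y, grid=None):
    """Maximize Eq. 8 over a grid of candidate theta values (illustrative search)."""
    u, v = pseudo_observations(x), pseudo_observations(y)
    grid = grid or [k / 20.0 for k in range(1, 401)]   # theta in (0, 20]
    return max(grid, key=lambda th: pseudo_loglik(th, u, v))
```

Because only ranks enter the likelihood, any strictly increasing transformation of `x` or `y` leaves the estimate unchanged, which is the practical meaning of not needing assumptions on the marginals.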

In the three-dimensional case, obtaining Σ involves computing \( \sigma_{ij} \) for i, j ∈ {1, 2, 3} from the values of Kendall’s tau, \( \tau_{i,j} \), for each pair of the three random variables. For the elliptical family of copulas, \( \sigma_{ij} = \sin(\pi \tau_{i,j}/2) \), and Σ is given as

$$ \Sigma = \left[ \begin{array}{ccc} 1 & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & 1 & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & 1 \end{array} \right] $$
(9)

However, the estimate of Σ may sometimes fail to be positive definite. In that case, a procedure based on eigenvalue decomposition is used to transform the correlation matrix into a positive-definite one (McNeil et al. 2005). Using the relationship in Eq. 9, the correlation matrix of the Student’s t copula is estimated, and a numerical search technique is then employed for estimating \( \hat{\vartheta } \) (Mashal and Zeevi 2002). In this study, a real-coded genetic algorithm (GA) is applied to find the optimal parameters of the FNA and Student’s t copulas.
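The construction of Σ from pairwise Kendall’s tau values (Eq. 9) can be sketched as follows. The function names are illustrative, and the simple pair-counting estimator of τ assumes that the series contain no ties; the positive-definiteness repair mentioned above is not included.

```python
import math
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau from concordant minus discordant pairs (no ties assumed)."""
    n = len(x)
    s = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += (prod > 0) - (prod < 0)
    return 2.0 * s / (n * (n - 1))

def t_copula_corr_matrix(x1, x2, x3):
    """Eq. 9: sigma_ij = sin(pi * tau_ij / 2), with ones on the diagonal."""
    cols = [x1, x2, x3]
    return [[1.0 if a == b
             else math.sin(math.pi * kendall_tau(cols[a], cols[b]) / 2.0)
             for b in range(3)]
            for a in range(3)]
```

For example, a pair with τ = 1/3 maps to σ = sin(π/6) = 0.5, so the transformation is nonlinear and should not be replaced by τ itself.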

3 Selection of suitable copula model

3.1 Performance measures

The appropriate dependence structure is selected by minimizing the distance between the fitted parametric copula and the empirical copula. The Anderson–Darling (AD) and Integrated Anderson–Darling (IAD) statistics are used as distance measures between the empirical copula and the fitted parametric copulas.

The empirical copula \( C_n \) computed from the pseudo-observations \( (U_{1,n}, V_{1,n}, W_{1,n}), \ldots, (U_{n,n}, V_{n,n}, W_{n,n}) \) is given by (Genest et al. 2009):

$$ C_n\left( u,v,w \right) = \frac{1}{n}\sum\limits_{i = 1}^n I\left( U_{i,n} \leqslant u, V_{i,n} \leqslant v, W_{i,n} \leqslant w \right), \quad \left( u,v,w \right) \in \left[ 0,1 \right]^3 $$
(10)

where \( (U_{i,n}, V_{i,n}, W_{i,n}) \) are pseudo-observations computed from the collected observational data \( (X_1, Y_1, W_1), \ldots, (X_n, Y_n, W_n) \),

$$ U_{i,n} = \frac{1}{n + 1}\sum\limits_{j = 1}^n 1\left( X_j \leqslant X_i \right), \quad V_{i,n} = \frac{1}{n + 1}\sum\limits_{j = 1}^n 1\left( Y_j \leqslant Y_i \right), \quad W_{i,n} = \frac{1}{n + 1}\sum\limits_{j = 1}^n 1\left( W_j \leqslant W_i \right), \quad i \in \left\{ 1, \ldots, n \right\} $$
(11)
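Equations 10 and 11 translate almost directly into code. The sketch below is a plain, unoptimized version with illustrative function names; it assumes three equally long series without ties.

```python
def pseudo_observations(x):
    """Eq. 11: scaled ranks, #{j : x_j <= x_i} / (n + 1)."""
    n = len(x)
    return [sum(1 for xj in x if xj <= xi) / (n + 1) for xi in x]

def empirical_copula(x, y, w):
    """Return the empirical copula C_n(u, v, w) of Eq. 10 as a function."""
    u_obs = pseudo_observations(x)
    v_obs = pseudo_observations(y)
    w_obs = pseudo_observations(w)
    n = len(u_obs)

    def c_n(u, v, ww):
        # Fraction of pseudo-observations lying in the box [0,u] x [0,v] x [0,ww]
        count = sum(1 for a, b, c in zip(u_obs, v_obs, w_obs)
                    if a <= u and b <= v and c <= ww)
        return count / n

    return c_n
```

The returned function is a step function on the unit cube: it only changes value at the pseudo-observations, which is why the distance statistics that follow are evaluated on a lattice.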

The expressions for AD and IAD distance measures are given as (Ané and Kharoubi 2003):

$$ {\text{AD}} = \mathop{{\max }}\limits_{{1 \leqslant i \leqslant n,1 \leqslant j \leqslant n,1 \leqslant k \leqslant n}} \frac{{\left| {{{\hat{C}}_n}\left( {\frac{i}{n},\frac{j}{n},\frac{k}{n}} \right) - {C_{{p\theta }}}\left( {\frac{i}{n},\frac{j}{n},\frac{k}{n}} \right)} \right|}}{{\sqrt {{{C_{{p\theta }}}\left( {\frac{i}{n},\frac{j}{n},\frac{k}{n}} \right)\left[ {1 - {C_{{p\theta }}}\left( {\frac{i}{n},\frac{j}{n},\frac{k}{n}} \right)} \right]}} }} $$
(12)
$$ IAD = \sum\limits_{{i = 1}}^n {\sum\limits_{{j = 1}}^n {\sum\limits_{{k = 1}}^n {\frac{{{{\left[ {{{\hat{C}}_n}\left( {\frac{i}{n},\frac{j}{n},\frac{k}{n}} \right) - {C_{{p\theta }}}\left( {\frac{i}{n},\frac{j}{n},\frac{k}{n}} \right)} \right]}^2}}}{{{C_{{p\theta }}}\left( {\frac{i}{n},\frac{j}{n},\frac{k}{n}} \right)\left[ {1 - {C_{{p\theta }}}\left( {\frac{i}{n},\frac{j}{n},\frac{k}{n}} \right)} \right]}}} } } $$
(13)

where i, j, and k index the order statistics of the random variables \( u_1 \), \( u_2 \), and \( u_3 \). Since the empirical copula is defined on a lattice ℓ, the distance statistics are defined with discrete norms. The copula family with the minimum AD and IAD distances is chosen as the best-fitting copula.

Apart from these distance-based measures, the relevance of each copula model is measured using the concept of entropy, which measures the uncertainty of the distribution \( {f_{{{X_1},{X_2},{X_3}}}}\left( {{x_1},{x_2},{x_3}} \right) \):

$$ E\left[ {{f_{{{X_1},{X_2},{X_3}}}}\left( {{x_1},{x_2},{x_3}} \right)} \right] = - \int {\int {\int {{f_{{{X_1},{X_2},{X_3}}}}\left( {{x_1},{x_2},{x_3}} \right)\ln \left[ {{f_{{{X_1},{X_2},{X_3}}}}\left( {{x_1},{x_2},{x_3}} \right)} \right]d{x_1}d{x_2}d{x_3}} } } $$
(14)

The entropy offers a measure based on the copula density, whereas the AD and IAD statistics are computed using CDFs. For the copula representation of the trivariate density, the entropy of \( {f_{{{X_1},{X_2},{X_3}}}}\left( {{x_1},{x_2},{x_3}} \right) \) is equal to the sum of the entropies of the individual marginal distributions plus the entropy of the copula density. The entropy of the univariate distribution \( f_X(x) \) is defined as,

$$ E\left[ {{f_X}(x)} \right] = - \int {{f_X}(x)\ln \left[ {{f_X}(x)} \right]dx} $$
(15)

Therefore, the discrete entropy of the copula model can be expressed as,

$$ \begin{aligned} E\left[ f_{X_1,X_2,X_3}\left( x_1,x_2,x_3 \right) \right] &= E\left[ f_{X_1}\left( x_1 \right) \right] + E\left[ f_{X_2}\left( x_2 \right) \right] + E\left[ f_{X_3}\left( x_3 \right) \right] \\ &\quad + E\left\{ c\left[ F_{X_1}\left( x_1 \right), F_{X_2}\left( x_2 \right), F_{X_3}\left( x_3 \right) \right] \right\} \end{aligned} $$
(16)

where,

$$ E\left\{ {c\left[ {{F_{{{X_1}}}}\left( {{x_1}} \right),{F_{{{X_2}}}}\left( {{x_2}} \right),{F_{{{X_3}}}}\left( {{x_3}} \right)} \right]} \right\} = - \sum\limits_{{i = 1}}^n {\sum\limits_{{j = 1}}^n {\sum\limits_{{k = 1}}^n {{c_{{p\theta }}}\left( {\frac{i}{n},\frac{j}{n},\frac{k}{n}} \right)\ln \left[ {{c_{{p\theta }}}\left( {\frac{i}{n},\frac{j}{n},\frac{k}{n}} \right)} \right]} } } $$
(17)

For visual inspection, a graphical comparison between observed data and simulated samples from each copula family is performed. This also offers qualitative assessment in finding a suitable copula model.
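The AD and IAD statistics of Eqs. 12 and 13 can be sketched as a single pass over the lattice (i/n, j/n, k/n). Both copulas are passed in as functions of three arguments; `c_emp` and `c_par` are illustrative names. At lattice points where the parametric copula equals 0 or 1 (for instance at (1, 1, 1)) the denominator vanishes; skipping those points is an assumption of this sketch, since the text does not state how they are treated.

```python
import math

def ad_iad(c_emp, c_par, n):
    """Return (AD, IAD) between an empirical and a parametric trivariate copula."""
    ad, iad = 0.0, 0.0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            for k in range(1, n + 1):
                u, v, w = i / n, j / n, k / n
                cp = c_par(u, v, w)
                denom = cp * (1.0 - cp)
                if denom <= 0.0:
                    continue  # assumption: skip degenerate lattice points
                diff = c_emp(u, v, w) - cp
                ad = max(ad, abs(diff) / math.sqrt(denom))   # Eq. 12
                iad += diff * diff / denom                   # Eq. 13
    return ad, iad
```

The loop visits n³ lattice points, so the cost grows quickly with the record length. AD reports the single worst weighted deviation, whereas IAD accumulates weighted squared deviations over the whole lattice, so the two statistics need not rank candidate copulas identically.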

3.2 Tail dependence of flood characteristics

The tail dependence coefficient (TDC) captures the concordance between extreme values in the lower-left and upper-right quadrant tails of the variables. In order to study the occurrence of extreme events, a pair-wise analysis of the upper tail dependence of the flood variables is performed for the fitted copula models. If u is a threshold value, then the upper tail dependence between two variables X and Y, denoted as \( \lambda_U \), is given as

$$ \lambda_U = \lim\limits_{u \to 1^{-}} P\left\{ F_X(x) > u \,|\, F_Y(y) > u \right\} $$
(18)

In terms of the copula, the above equation can also be expressed as (Nelsen et al. 2008),

$$ {\lambda_U} = \mathop{{\lim }}\limits_{{u \to {1^{ - }}}} \frac{{1 - 2u + C\left( {u,u} \right)}}{{1 - u}} = 2 - \mathop{{\lim }}\limits_{{u \to {1^{ - }}}} \frac{{1 - C\left( {u,u} \right)}}{{1 - u}} = 2 - {\delta '_C}\left( {{1^{ - }}} \right) $$
(19)

where the function \( \delta_C(\cdot) \) is the diagonal section of copula C, given by \( \delta_C(u) = C(u,u) \) for every u ∈ [0,1]. The coefficient \( \lambda_U \) measures the concordance between extremely high values of the random variables. If \( \lambda_U \in (0,1] \), then \( F_X(x) \) and \( F_Y(y) \) are said to show upper tail dependence, or extremal dependence in the upper tail.
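As a numerical check of Eq. 19, the ratio inside the limit can be evaluated close to u = 1. The bivariate Gumbel–Hougaard copula is used here as an example because its diagonal section is \( C(u,u) = u^{2^{1/\theta}} \), which gives the closed form \( \lambda_U = 2 - 2^{1/\theta} \) (a standard result, quoted here rather than derived in the text). The parameter value θ = 2 is arbitrary.

```python
import math

def gumbel_copula(u, v, theta):
    """Bivariate Gumbel-Hougaard copula, theta >= 1, for 0 < u, v < 1."""
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-(s ** (1.0 / theta)))

def upper_tail_ratio(copula, u):
    """Finite-u version of Eq. 19: (1 - 2u + C(u, u)) / (1 - u)."""
    return (1.0 - 2.0 * u + copula(u, u)) / (1.0 - u)

theta = 2.0
approx = upper_tail_ratio(lambda a, b: gumbel_copula(a, b, theta), 1.0 - 1e-6)
exact = 2.0 - 2.0 ** (1.0 / theta)   # closed form, about 0.586 for theta = 2
```

For the independence copula C(u, v) = uv the same ratio equals 1 − u and therefore tends to zero, which is the case of no upper tail dependence.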

Several nonparametric estimators of the TDC have been suggested in the literature, such as the \( \lambda_U^{\text{LOG}} \) estimator (Coles et al. 1999), the \( \lambda_U^{\text{SEC}} \) estimator (Joe et al. 1992), the Capéraá-Frahm-Genest estimator \( \lambda_U^{\text{CFG}} \) (Capéraá et al. 1997; Frahm et al. 2005), and the Schmidt–Stadtmüller estimator \( \lambda_U^{\text{SS}} \) (Schmidt and Stadtmüller 2006). The estimator \( \lambda_U^{\text{LOG}} \) is based on the logarithm of the copula diagonal (Coles et al. 1999). The estimator \( \lambda_U^{\text{SEC}} \) can be interpreted as the slope of the secant along the copula diagonal (i.e., close to the 45° line) and hence can misspecify the value of the TDC when the data are not concentrated along the diagonal. Except for \( \lambda_U^{\text{CFG}} \), all of these estimators require specification of a threshold. This study adopts \( \lambda_U^{\text{CFG}} \) for the TDC. If \( \{(u_1, v_1), \ldots, (u_n, v_n)\} \) is a random sample obtained from copula C(•), the bivariate upper TDC using \( \lambda_U^{\text{CFG}} \) is given by (Capéraá et al. 1997),

$$ \hat{\lambda }_U^{{CFG}} = 2 - 2\exp \left[ {\frac{1}{n}\sum\limits_{{i = 1}}^n {\log \left\{ {{{{\sqrt {{\log \left( {\frac{1}{{{u_i}}}} \right)\log \left( {\frac{1}{{{v_i}}}} \right)}} }} \left/ {{\log \left( {\frac{1}{{\max {{\left( {{u_i},{v_i}} \right)}^2}}}} \right)}} \right.}} \right\}} } \right] $$
(20)
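Equation 20 needs no threshold and can be evaluated directly from the pseudo-observations, as in the following sketch. The function name is illustrative, and the inputs are assumed to lie strictly inside (0, 1), which holds when ranks are scaled by n + 1.

```python
import math

def cfg_upper_tdc(u, v):
    """Caperaa-Frahm-Genest estimator of the upper TDC (Eq. 20)."""
    n = len(u)
    total = 0.0
    for ui, vi in zip(u, v):
        num = math.sqrt(math.log(1.0 / ui) * math.log(1.0 / vi))
        den = math.log(1.0 / max(ui, vi) ** 2)
        total += math.log(num / den)
    return 2.0 - 2.0 * math.exp(total / n)
```

For perfectly comonotone pseudo-observations (`u == v`) every ratio inside the logarithm equals 1/2, so the estimate is 2 − 2 × 1/2 = 1, the largest possible value of the TDC.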

Although the \( \lambda_U^{\text{CFG}} \) estimator assumes that the underlying copula can be approximated by an extreme-value copula, the estimator performs well even if the copula does not belong to the extreme-value class, as discussed by Frahm et al. (2005). For comparison, the estimates of \( \lambda_U^{\text{LOG}} \) and \( \lambda_U^{\text{SS}} \) are also presented, although these estimators can show higher variance than \( \hat{\lambda }_U^{\text{CFG}} \). If the diagonal section of the copula C(u,u) is differentiable for u ∈ (1−ε, 1) for some ε > 0, then,

$$ {\lambda_U} = 2 - \mathop{{\lim }}\limits_{{u \to {1^{ - }}}} \frac{{1 - C\left( {u,u} \right)}}{{1 - u}} = 2 - \mathop{{\lim }}\limits_{{u \to {1^{ - }}}} \frac{{dC\left( {u,u} \right)}}{{du}} = 2 - \mathop{{\lim }}\limits_{{u \to {1^{ - }}}} \frac{{\log C\left( {u,u} \right)}}{{\log (u)}} $$
(21)

The log estimator is based on Eq. 21 and can be expressed as

$$ \hat{\lambda }_U^{\text{LOG}} = 2 - \frac{{\log \,{{\hat{C}}_n}\left( {\left( {1 - \frac{k}{n}} \right),\left( {1 - \frac{k}{n}} \right)} \right)}}{{\log \left( {1 - \frac{k}{n}} \right)}} $$
(22)

where k ∈ {1,…,n − 1} represents the threshold to be selected. An optimal threshold is selected by applying a plateau-finding algorithm as described in Frahm et al. (2005). In the first step, the curve of \( {\hat{\lambda }_k} \) is smoothed by a nonparametric kernel function. A kernel smoother defines a set of weights \( \{W_i(x), i = 1, \ldots, n\} \) for each x and can be expressed as,

$$ \hat{f}(x) = \sum\limits_{{i = 1}}^n {{W_i}(x){y_i}} $$
(23)

where \( y_i \) are the observations to be smoothed. For a given kernel bandwidth b, the weight sequence is defined as,

$$ W_i(x) = \frac{K\left( \frac{x - x_i}{b} \right)}{\sum\limits_{j = 1}^n K\left( \frac{x - x_j}{b} \right)}, \quad \sum\limits_{i = 1}^n W_i\left( x \right) = 1 \ \text{and} \ \int K(u)\,{\text{d}}u = 1 $$
(24)

In this study, the box kernel with a specified bandwidth (say, \( b = \left\lfloor {0.005n} \right\rfloor \), \( b \in N \)) is chosen, as suggested by Frahm et al. (2005). The analytical expression of K(•) for the box-kernel estimator is given by the following window function,

$$ K(x) = \begin{cases} 1, & \left| x \right| \leqslant 1/2, \\ 0, & \text{otherwise} \end{cases} $$
(25)

where K(x) defines a unit interval centered at the origin.

Thus, the kernel-smoothed map \( k \mapsto {\hat{\lambda }_k} \) takes the means of 2b + 1 successive points of \( {\hat{\lambda }_1}, \ldots, {\hat{\lambda }_n} \), yielding a new smoothed sequence \( {\tilde{\lambda }_1}, \ldots, {\tilde{\lambda }_{{n - 2b}}} \). In the next step, a vector \( {p_k} = \left( {{{\tilde{\lambda }}_k}, \ldots, {{\tilde{\lambda }}_{{k + m - 1}}}} \right),\,\,k = 1, \ldots, n - 2b - m + 1 \) is defined with plateau length \( m = \sqrt {{n - 2b}} \). The algorithm stops at the first plateau \( p_k \) whose elements fulfill the following condition,

$$ \sum\limits_{{i = k + 1}}^{{k + m - 1}} {\left| {{{\tilde{\lambda }}_i} - {{\tilde{\lambda }}_k}} \right| \leqslant 2\sigma } $$
(26)

where σ represents the standard deviation of \( {\tilde{\lambda }_1}, \ldots, {\tilde{\lambda }_{{n - 2b}}} \). Then the TDC \( \lambda_U^{\text{LOG}} \) is estimated as arithmetic mean of the vector corresponding to the plateau,

$$ \hat{\lambda }_U^{\text{LOG}}(k) = \frac{1}{m}\sum\limits_{{i = 1}}^m {{{\tilde{\lambda }}_{{k + i - 1}}}} $$
(27)

If no plateau fulfills the stopping condition, the TDC is estimated as zero and the procedure is repeated with a different set of parameters.
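The plateau-finding steps (Eqs. 23 to 27) can be sketched as follows for a given sequence of threshold-dependent estimates \( \hat{\lambda}_k \). Several choices in this sketch are assumptions rather than details given in the text: the bandwidth is forced to be at least 1 so that short series are still smoothed, the plateau length is rounded down to an integer, and the population standard deviation is used for σ.

```python
import math
import statistics

def plateau_tdc(lam, b=None):
    """Select the TDC from the sequence lam[k] by the plateau heuristic."""
    n = len(lam)
    if b is None:
        b = max(1, math.floor(0.005 * n))   # assumption: bandwidth of at least 1
    if n - 2 * b < 1:
        raise ValueError("sequence too short for the chosen bandwidth")
    width = 2 * b + 1
    # Box-kernel smoothing: means of 2b + 1 successive points (n - 2b values)
    smooth = [sum(lam[k:k + width]) / width for k in range(n - 2 * b)]
    m = math.floor(math.sqrt(n - 2 * b))    # plateau length, rounded down
    sigma = statistics.pstdev(smooth)       # assumption: population std
    for k in range(len(smooth) - m + 1):
        plateau = smooth[k:k + m]
        # Eq. 26: total deviation from the first element of the plateau
        if sum(abs(x - plateau[0]) for x in plateau[1:]) <= 2.0 * sigma:
            return sum(plateau) / m         # Eq. 27
    return 0.0                              # no plateau found
```

The function returns the mean of the first stretch of the smoothed curve that is flat relative to the overall variability of the curve, which is how the threshold k is chosen implicitly.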

The expression for \( \lambda_U^{\text{SS}} \) is given as (Schmidt and Stadtmüller 2006),

$$ \widehat{\lambda }_{U}^{{{\text{SS}}}}(k) = \frac{n}{k}{{\overline C }_{n}}\left( {\frac{k}{n},\frac{k}{n}} \right) \approx \frac{1}{k}\mathop{\sum }\limits_{{i = 1}}^{n} I\left\{ {\left( {{{R}_{i}} > n - k} \right) \wedge \left( {{{S}_{i}} > n - k} \right)} \right\} $$
(28)

where \( {\overline{C}_n}\left( {u,v} \right) = P\left( {U > 1 - u,V > 1 - v} \right) = u + v - 1 + C\left( {1 - u,1 - v} \right) \) denotes empirical survival copula (Nelsen 2006).
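The rank-based form on the right-hand side of Eq. 28 is a simple count, as the following sketch shows. Here \( R_i \) and \( S_i \) are taken to be the ranks of the two series, and k is the number of upper order statistics used as the threshold; the function names are illustrative and ties are not handled.

```python
def ranks(x):
    """Rank of each observation: #{j : x_j <= x_i}."""
    return [sum(1 for xj in x if xj <= xi) for xi in x]

def ss_upper_tdc(x, y, k):
    """Schmidt-Stadtmuller estimator (Eq. 28) for threshold k."""
    n = len(x)
    r, s = ranks(x), ranks(y)
    # Count observations that are among the k largest in both series
    joint = sum(1 for ri, si in zip(r, s) if ri > n - k and si > n - k)
    return joint / k
```

If the k largest values of both series occur in the same events, the estimate is 1; if they never coincide, it is 0.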

4 Trivariate flood characteristics

The annual maximum peak discharge, volume, and duration of flood events are obtained from daily stream flow data. The annual peak flow is selected from the portion of the daily stream flow hydrograph containing the highest peak in each year’s stream flow record. A single-peaked flood hydrograph is shown in Fig. 1. Flood duration (D) can be determined by identifying the time of rise (point “s” in Fig. 1) and fall (point “e” in Fig. 1) of the flood hydrograph. The start of the surface runoff is marked by the sharp rise of the hydrograph, and the end of the flood runoff is identified by the inflection point on the receding limb of the hydrograph. The total flood volume is estimated between these two points. If the time of rise of the flood hydrograph is denoted by SD (day) and the time of fall by ED (day), the flood volume (V) of each flood event is determined using the following expression (Yue 2001a, b)

$$ V_i = \left( V_i^{\text{total}} - V_i^{\text{baseflow}} \right) = \sum\limits_{j = {\text{SD}}_i}^{{\text{ED}}_i} q_{ij} - \frac{D_i}{2}\left( q_{is} + q_{ie} \right), \quad \forall i = 1,2, \ldots, n $$
(29)

where \( q_{ij} \) is the observed stream flow on the jth day of the ith year, and \( q_{is} \) and \( q_{ie} \) are the observed daily stream flow values on the start and end dates of the flood hydrograph for the ith year. The annual flood peak series is constructed by,

$$ Q_i = \max \left\{ q_{ij},\ j = {\text{SD}}_i,\ {\text{SD}}_i + 1, \ldots,\ {\text{ED}}_i \right\}, \quad \forall i = 1,2, \ldots, n $$
(30)

The duration series is given by the difference between starting and ending day of the flood event, and can be expressed as,

$$ D_i = {\text{ED}}_i - {\text{SD}}_i, \quad \forall i = 1,2, \ldots, n $$
(31)

Fig. 1 Typical flood hydrograph showing flood flow characteristics
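The extraction of the three flood characteristics for one year (Eqs. 29 to 31) can be sketched as follows, with the daily flows summed from the start day to the end day as described in the text. The function name and the use of list indices for the start and end days are illustrative assumptions, and the identification of the start and end points themselves is taken as given. Volume is returned in flow units multiplied by days.

```python
def flood_characteristics(q, sd, ed):
    """Return (peak, volume, duration) of one flood event.

    q  : daily stream flow values for one year
    sd : index of the start day of the flood hydrograph (point "s")
    ed : index of the end day of the flood hydrograph (point "e")
    """
    window = q[sd:ed + 1]
    peak = max(window)                                        # Eq. 30
    duration = ed - sd                                        # Eq. 31
    baseflow = duration / 2.0 * (q[sd] + q[ed])
    volume = sum(window) - baseflow                           # Eq. 29
    return peak, volume, duration
```

For daily flows `[5, 10, 20, 50, 30, 10, 5]` with the event running from index 1 to index 5, the peak is 50, the duration is 4 days, the total flow is 120, the baseflow term is 4/2 × (10 + 10) = 40, and the flood volume is therefore 80.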

5 Estimation of flood risks

The copula models can form the basis for the estimation of various quantities, which can be very useful for risk analysis of floods, such as estimation of conditional probability distributions as well as conditional and joint return periods. The return period of a prescribed event is generally adopted as a criterion for design purposes in hydrologic projects, which provides a simple means for risk analysis. Usually, the return period is defined as “the average time elapsing between two successive realizations of the given event” (Salvadori 2004). The basic concepts of return periods are thoroughly discussed in Yue and Rasmussen (2002) and Salvadori and De Michele (2004).

5.1 Primary return period

The objective of frequency analysis of hydrologic data is to relate the magnitude of extreme events to their frequency of occurrence through the use of probability distributions (Chow et al. 1988). For the multivariate case, in which \( X_1, X_2, \ldots, X_d \) exceed their respective thresholds (\( X_1 > x_1, \ldots, X_d > x_d \)), the joint return period is computed using the joint probability (“OR” and “AND” cases) of all the events, and is known as the primary return period (Salvadori 2004). For the trivariate case, the joint primary return period in the “OR” case \( T_{{{X_1}{X_2}{X_3}}}^{\text{OR}} \) (for annual flood analysis) is given by,

$$ \begin{aligned} T_{X_1 X_2 X_3}^{\text{OR}}\left( x_1,x_2,x_3 \right) &= \frac{1}{P\left( X_1 \geqslant x_1 \vee X_2 \geqslant x_2 \vee X_3 \geqslant x_3 \right)} = \frac{1}{1 - P\left( X_1 \leqslant x_1, X_2 \leqslant x_2, X_3 \leqslant x_3 \right)} \\ &= \frac{1}{1 - F_{X_1 X_2 X_3}\left( x_1,x_2,x_3 \right)} = \frac{1}{1 - C\left( u_1,u_2,u_3 \right)} \end{aligned} $$
(32)

The joint primary return period in the “AND” case \( T_{X_1X_2X_3}^{\text{AND}} \) (for annual flood analysis) can be expressed as,

$$ \begin{aligned} T_{X_1X_2X_3}^{\text{AND}}\left( x_1,x_2,x_3 \right) &= \frac{1}{P\left( X_1 \geqslant x_1 \wedge X_2 \geqslant x_2 \wedge X_3 \geqslant x_3 \right)} \\ &= \frac{1}{1 - F_{X_1}(x_1) - F_{X_2}(x_2) - F_{X_3}(x_3) + F_{X_1X_2}(x_1,x_2) + F_{X_2X_3}(x_2,x_3) + F_{X_1X_3}(x_1,x_3) - F_{X_1X_2X_3}(x_1,x_2,x_3)} \\ &= \frac{1}{1 - F_{X_1}(x_1) - F_{X_2}(x_2) - F_{X_3}(x_3) + C(u_1,u_2) + C(u_2,u_3) + C(u_1,u_3) - C(u_1,u_2,u_3)} \end{aligned} $$
(33)

where \( C(u_1,u_2) \), \( C(u_2,u_3) \), and \( C(u_1,u_3) \) are bivariate copula CDFs for the flood characteristics.
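As a minimal sketch of Eqs. 32 and 33, the snippet below evaluates both joint primary return periods for given non-exceedance probabilities. The independence copula and the value u = 0.9 are illustrative assumptions only, not the fitted model of this study; any callable returning the bivariate and trivariate copula CDFs could be passed instead.

```python
def independence_copula(*u):
    """Stand-in copula C(u1, ..., uk) = u1 * ... * uk (assumption, not the fitted model)."""
    prod = 1.0
    for ui in u:
        prod *= ui
    return prod


def return_period_or(u1, u2, u3, copula=independence_copula):
    """Eq. 32: T_OR = 1 / (1 - C(u1, u2, u3)), in years for annual data."""
    return 1.0 / (1.0 - copula(u1, u2, u3))


def return_period_and(u1, u2, u3, copula=independence_copula):
    """Eq. 33: inclusion-exclusion probability that all three thresholds are exceeded."""
    p_all = (1.0 - u1 - u2 - u3
             + copula(u1, u2) + copula(u2, u3) + copula(u1, u3)
             - copula(u1, u2, u3))
    return 1.0 / p_all


# Each variable at its univariate 10-year level (u = 0.9), chosen for illustration.
t_or = return_period_or(0.9, 0.9, 0.9)
t_and = return_period_and(0.9, 0.9, 0.9)
print(f"T_OR = {t_or:.2f} years, T_AND = {t_and:.2f} years")
```

Under independence the “AND” probability reduces to the product of the three exceedance probabilities, which gives a simple check of the inclusion-exclusion terms.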

The conditional joint distribution function of \( X_1 \) and \( X_2 \) given \( X_3 \leqslant x_3 \) in the “OR” case is given by,

$$ \begin{aligned} F_{X_1X_2|X_3}\left( x_1,x_2|X_3 \leqslant x_3 \right) &= P\left[ X_1 \leqslant x_1, X_2 \leqslant x_2|X_3 \leqslant x_3 \right] \\ &= \frac{F_{X_1X_2X_3}\left( x_1,x_2,x_3 \right)}{F_{X_3}\left( x_3 \right)} = \frac{C\left( u_1,u_2,u_3 \right)}{u_3} \end{aligned} $$
(34)

where \( F_{X_3}(x_3) \) is the marginal CDF of the random variable \( X_3 \). The corresponding conditional return period under this condition can be expressed as,

$$ {T_{{{X_1}{X_2}|{X_3}}}}\left( {{x_1},{x_2}|{X_3} \leqslant {x_3}} \right) = \frac{1}{{1 - {F_{{{X_1}{X_2}|{X_3}}}}\left( {{x_1},{x_2}|{X_3} \leqslant {x_3}} \right)}} $$
(35)

Similarly, the conditional distribution of \( {X_1} \) given \( \left( {{X_2} \leqslant {x_2},{X_3} \leqslant {x_3}} \right) \) is given by,

$$ \begin{aligned} F_{X_1|X_2X_3}\left( x_1|X_2 \leqslant x_2, X_3 \leqslant x_3 \right) &= P\left[ X_1 \leqslant x_1|X_2 \leqslant x_2, X_3 \leqslant x_3 \right] \\ &= \frac{F_{X_1X_2X_3}\left( x_1,x_2,x_3 \right)}{F_{X_2X_3}\left( x_2,x_3 \right)} = \frac{C\left( u_1,u_2,u_3 \right)}{C\left( u_2,u_3 \right)} \end{aligned} $$
(36)

where \( C(u_2,u_3) \) is a bivariate copula CDF. The corresponding conditional return period can be expressed as,

$$ {T_{{{X_1}|{X_2}{X_3}}}}\left( {{x_1}|{X_2} \leqslant {x_2},{X_3} \leqslant {x_3}} \right) = \frac{1}{{1 - {F_{{{X_1}|{X_2}{X_3}}}}\left( {{x_1}|{X_2} \leqslant {x_2},{X_3} \leqslant {x_3}} \right)}} $$
(37)
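The conditional return periods of Eqs. 35 and 37 can be sketched as follows. The symmetric Gumbel–Hougaard copula with θ = 2 is a hypothetical stand-in used only to obtain numbers; it is not the copula fitted in this study, and the function names are introduced here for illustration.

```python
import math
from functools import partial


def gumbel_hougaard(*u, theta=2.0):
    """Symmetric Gumbel-Hougaard copula of any dimension (illustrative stand-in)."""
    s = sum((-math.log(ui)) ** theta for ui in u)
    return math.exp(-(s ** (1.0 / theta)))


def cond_return_period_pair(u1, u2, u3, copula):
    """Eqs. 34-35: return period of (X1, X2) given X3 <= x3."""
    f_cond = copula(u1, u2, u3) / u3
    return 1.0 / (1.0 - f_cond)


def cond_return_period_single(u1, u2, u3, copula):
    """Eqs. 36-37: return period of X1 given X2 <= x2 and X3 <= x3."""
    f_cond = copula(u1, u2, u3) / copula(u2, u3)
    return 1.0 / (1.0 - f_cond)


copula = partial(gumbel_hougaard, theta=2.0)  # theta = 2 is a hypothetical value
print(cond_return_period_pair(0.9, 0.9, 0.9, copula))
print(cond_return_period_single(0.9, 0.9, 0.9, copula))
```

Setting θ = 1 recovers the independence copula, for which the conditional return period of \( X_1 \) equals its univariate return period.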

5.2 Secondary return period

Flood events can be subcritical, critical, or supercritical. In this context, the concept of secondary return period is practically useful for the design of hydraulic structures (Salvadori 2004; Vandenberghe et al. 2011). Salvadori and De Michele (2004) used the Kendall distribution function to define the “secondary return period”. The primary return period predicts that a critical event is expected to appear once in a given time interval (i.e., it gives an average forecast), whereas the secondary return period provides the average time between the occurrences of two supercritical events. The probability of a supercritical event for any realization can be computed by using the Kendall distribution function in place of \( C(u) \) in the expression for the joint primary return period in the “OR” case.

For a d-dimensional distribution \( F = {C_{{\hat{\theta }}}}(U) \) and t∈(0,1] the critical probability level \( \overline{p} \) is defined as (Salvadori et al. 2011)

$$ \overline{p} = \left\{ {F(X) = t,\;X \in {R^d}} \right\} $$
(38)

Thus the isosurface \( \overline{p} \) partitions \( R^d \) into three non-overlapping regions, viz., the subcritical region, consisting of the points for which \( F(X) < t \); the critical layer \( \overline{p} \), where \( F(X) = t \); and the supercritical region, consisting of the points for which \( F(X) > t \). The return period of a supercritical or potentially dangerous event associated with the critical probability level \( \overline{p} \) can be computed as

$$ {\overline{T}_{{{X_1}{X_2}{X_3}}}}\left( {{x_1},{x_2},{x_3}} \right) = \frac{1}{{{{\overline{K}}_{{{C_{{\hat{\theta }}}}}}}\left( {\overline{p}} \right)}} = \frac{1}{{1 - {K_{{{C_{{\hat{\theta }}}}}}}\left( {\overline{p}} \right)}} $$
(39)

where \( K_{C_{\hat{\theta}}}(\cdot) \) is Kendall’s distribution function associated with the trivariate copula \( C_{\hat{\theta}}(U) \) at the critical probability level \( \overline{p} \). \( \overline{K}_{C_{\hat{\theta}}} \) denotes the corresponding survival function, i.e., the probability of the region of supercritical events, which is a multivariate analog of the univariate event {X > x}. Thus, any point in the subcritical region has a smaller joint CDF value than any point in the supercritical region. The function \( K_{C_{\hat{\theta}}}(\cdot) \) helps to project multivariate information onto a single dimension. For bivariate Archimedean copulas, an explicit expression for \( K_{C_{\hat{\theta}}} \) is available; for the fully nested Archimedean (FNA) structure, however, the form of the Kendall distribution function is complex, as shown by Okhrin et al. (2009). For the elliptical class of copulas, the function \( K_{C_{\hat{\theta}}} \) can be constructed numerically using Monte Carlo simulation.
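A numerical construction of Kendall’s distribution function and the secondary return period of Eq. 39 might look like the sketch below. It assumes a sample simulated from the fitted copula is available; here independent uniforms are used as a stand-in because the trivariate independence copula has a known closed-form Kendall function against which the estimate can be checked. The function names and the sample size of 800 are illustrative choices.

```python
import math
import random


def kendall_function_mc(sample, t):
    """Estimate K_C(t) = P(C(U) <= t) from simulated points in [0, 1]^d.

    C is replaced by the empirical copula of the sample itself, so no
    closed-form copula CDF is required.
    """
    n = len(sample)
    hits = 0
    for u in sample:
        # Empirical copula value at u: share of points dominated by u.
        w = sum(all(vj <= uj for vj, uj in zip(v, u)) for v in sample) / n
        if w <= t:
            hits += 1
    return hits / n


def secondary_return_period(sample, t):
    """Eq. 39 for annual data: 1 / (1 - K_C(t))."""
    k = kendall_function_mc(sample, t)
    return math.inf if k >= 1.0 else 1.0 / (1.0 - k)


def kendall_independence_3d(t):
    """Closed-form K_C(t) of the trivariate independence copula (check only)."""
    log_inv = -math.log(t)
    return t * (1.0 + log_inv + 0.5 * log_inv ** 2)


rng = random.Random(42)
# Stand-in sample: independent uniforms instead of draws from a fitted copula.
sample = [(rng.random(), rng.random(), rng.random()) for _ in range(800)]
k_mc = kendall_function_mc(sample, 0.5)
print(f"K_mc(0.5) = {k_mc:.3f}, closed form = {kendall_independence_3d(0.5):.3f}")
print(f"secondary return period = {secondary_return_period(sample, 0.5):.1f} years")
```

The double loop costs O(n²) comparisons, which is acceptable for samples of a few thousand points; larger simulations would call for a faster dominance count.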

6 Application

6.1 Study area

The Delaware River basin is chosen as a case study to illustrate the methodology. The basin drains an area of 34,447 km² in the states of New York, Pennsylvania, New Jersey, and Delaware. Major flood events in the basin took place in the years 1955, 1999, 2001, 2003, 2004, 2005, 2006, and 2010 (DRBC 2011). US Geological Survey stream gauge 01438500 (41° 18′ 33″ latitude and 74° 47′ 43″ longitude) at Port Jervis, New York, on the Delaware River, is used in the study. The location has a drainage area of 9,013 km² and the gauge datum is 112.7 m above sea level. About 61 years of daily streamflow data, from 1949 to 2009, are analyzed. Table 2 presents the statistical properties of the flood variables during the study period. The high kurtosis and positive skewness of the flood variables suggest that they can be best modeled by heavy-tailed distributions.

Table 2 Statistical properties of flood variables

6.2 Dependence of flood variables

The pair-wise association among flood variables, as well as the strength of dependence, is studied using graphical tools such as ranked scatter plots, chi-plots, and Kendall plots, since it is often difficult to judge the random nature and nonlinear behavior of data from a simple scatter plot. Fig. 2 presents ranked scatter plots, chi-plots, and Kendall plots of the flood variable pairs. For the peak flow–volume pair, the increased density of points close to the 45° line in the ranked scatter plot indicates that the dependence between this variable pair is the strongest, whereas a weak dependence is observed for the peak flow–duration pair. The shape of the clusters of ranked observations for the peak flow–volume and volume–duration pairs around the upper right corner of the unit square suggests the presence of stronger upper tail dependence between these flood variables. The control limits for the chi-plot are set to enclose a p value of 0.95. A strong deviation from the control limits is observed for the peak flow–volume and volume–duration pairs. For the peak flow–duration pair, a deviation from the center of the main diagonal is observed in the Kendall plot, indicating the presence of positive association between the flood variables. For quantitative assessment, the sample estimates of Pearson’s linear correlation r and two nonparametric dependence measures, viz., Spearman’s ρ and Kendall’s τ, are listed in Table 3 together with the associated p values. The dependencies are found to be significant as tested by a standard two-tailed t test.
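The two rank-based measures can be computed directly from paired observations, as in the following sketch. The peak flow and volume values are hypothetical, and the simple formulas used assume no tied observations.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau: (concordant - discordant) / number of pairs (no tie correction)."""
    s = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        prod = (xi - xj) * (yi - yj)
        s += (prod > 0) - (prod < 0)
    n = len(x)
    return s / (n * (n - 1) / 2)


def ranks(values):
    """Ranks 1..n of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r


def spearman_rho(x, y):
    """Spearman's rho from squared rank differences (assumes no ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Hypothetical annual flood peaks and volumes, for illustration only.
peak = [120.0, 95.5, 310.2, 150.8, 240.1]
volume = [300.0, 410.0, 900.5, 280.3, 700.9]
print(kendall_tau(peak, volume), spearman_rho(peak, volume))
```

Both measures depend only on ranks, so they are unchanged by the monotone marginal transformations used later to build pseudo-samples.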

Fig. 2 Graphical representation of the strength of dependence of flood variables using ranked scatter plots, chi-plots, and Kendall plots. First row (a) peak flow–volume combination; second row (b) peak flow–duration combination; and third row (c) volume–duration combination of flood variables. \( U_Q \), \( U_V \), and \( U_D \) denote pseudo-samples of peak flow, volume, and duration, respectively

Table 3 Correlations to measure dependencies among flood variables

6.3 Modeling marginal distributions

For fitting the marginal distributions, the two-parameter log-normal, two-parameter gamma, extreme value type I (Gumbel), and extreme value type II (Fréchet) distributions have been evaluated. The Gumbel and Fréchet distributions are special cases of the generalized extreme value (GEV) distribution. For the Gumbel distribution, the shape parameter (α) of the GEV tends to zero (α → 0), which corresponds to an unbounded and thin upper tail. If α > 0, the GEV distribution is termed Fréchet; the distribution is unbounded above and has a polynomially decreasing tail function, which corresponds to a long- or heavy-tailed distribution. The density forms and corresponding parameters of the distributions are presented in Table 4. The parameters of the distribution functions are estimated using the method of maximum likelihood. Table 5 lists various performance measures, namely the Akaike information criterion (AIC) and the Anderson–Darling (\( AD_n \)) and Kolmogorov–Smirnov (\( KS_n \)) statistics, for the flood data fitted with the parametric distributions. The table shows that the heavy-tailed Fréchet distribution fits the peak flow and volume data well, whereas the log-normal distribution performs better for the flood duration data. The probability density functions (PDFs), cumulative distribution functions (CDFs), and corresponding probability–probability (P–P) plots for each flood variable fitted with Fréchet distributions are shown in Fig. 3, which shows good correspondence between the theoretical distributions and the observed data.
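The fitting and scoring workflow can be illustrated for the two-parameter log-normal case, whose maximum likelihood estimates have a closed form. This is a sketch only: the duration values are hypothetical, and the other candidate distributions would need numerical optimization that is not shown here.

```python
import math


def fit_lognormal(data):
    """Closed-form ML estimates (mu, sigma) of a two-parameter log-normal."""
    logs = [math.log(x) for x in data]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
    return mu, sigma


def lognormal_cdf(x, mu, sigma):
    """Log-normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))


def lognormal_loglik(data, mu, sigma):
    """Log-likelihood of the sample under the log-normal density."""
    return sum(
        -math.log(x * sigma * math.sqrt(2.0 * math.pi))
        - (math.log(x) - mu) ** 2 / (2.0 * sigma ** 2)
        for x in data
    )


def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 ln L (smaller is better)."""
    return 2.0 * n_params - 2.0 * loglik


def ks_statistic(data, cdf):
    """Largest gap between the empirical step CDF and the fitted CDF."""
    xs = sorted(data)
    n = len(xs)
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(xs))


durations = [3, 5, 6, 8, 9, 11, 14, 20]  # hypothetical flood durations in days
mu, sigma = fit_lognormal(durations)
score = aic(lognormal_loglik(durations, mu, sigma), n_params=2)
ks = ks_statistic(durations, lambda x: lognormal_cdf(x, mu, sigma))
print(f"mu = {mu:.3f}, sigma = {sigma:.3f}, AIC = {score:.2f}, KS = {ks:.3f}")
```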

Table 4 The mathematical expressions for probability density functions and parameters of different probability distributions
Table 5 Performance of various probability models for fitting marginal distributions for flood variables
Fig. 3 Fitted marginal distributions for flood variables: a peak flow, b volume, and c duration. The first, second, and third columns show probability density functions, CDFs, and P–P plots, respectively, for the three flood characteristics

6.4 Modeling joint dependence structure using copulas

6.4.1 Bivariate copula models

First, bivariate copula models are developed to represent the joint dependence of the flood variable pairs flood peak–volume, volume–duration, and flood peak–duration. Four copula models, namely the Clayton, Gumbel–Hougaard, Frank, and Student’s t copulas, are fitted, and the corresponding results are presented in Table 6. The parameters of the copula functions are estimated using a GA-based maximum pseudo-likelihood approach. The following GA parameters are adopted: population size of 20, 200 generations, single-point crossover with a crossover rate of 0.8, Gaussian mutation function with a mutation rate of 0.01, and stochastic uniform selection. From Table 6, it can be seen that the Student’s t copula resulted in higher log-likelihood and minimum AIC values for all three flood variable pairs. These bivariate models are used for computing the return periods of bivariate and trivariate flood characteristics.
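The maximum pseudo-likelihood idea can be sketched for the bivariate Clayton copula. Note the substitutions: a plain grid search replaces the genetic algorithm used in the study, and the data are synthetic draws from a Clayton copula with a hypothetical θ = 2, so that the estimate can be compared with a known value.

```python
import math
import random


def pseudo_observations(values):
    """Rescaled ranks r / (n + 1), the usual pseudo-sample for copula fitting."""
    n = len(values)
    order = sorted(range(n), key=values.__getitem__)
    u = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        u[idx] = rank / (n + 1)
    return u


def clayton_log_density(u, v, theta):
    """Log-density of the bivariate Clayton copula, theta > 0."""
    s = u ** (-theta) + v ** (-theta) - 1.0
    return (math.log1p(theta)
            - (1.0 + theta) * (math.log(u) + math.log(v))
            - (2.0 + 1.0 / theta) * math.log(s))


def fit_clayton(x, y):
    """Grid-search maximiser of the pseudo log-likelihood (stand-in for the GA)."""
    u, v = pseudo_observations(x), pseudo_observations(y)
    grid = [0.05 * k for k in range(1, 201)]  # candidate theta values in (0, 10]

    def loglik(theta):
        return sum(clayton_log_density(a, b, theta) for a, b in zip(u, v))

    best = max(grid, key=loglik)
    return best, loglik(best)


def sample_clayton(n, theta, rng):
    """Conditional-inversion sampler, used only to create synthetic demo data."""
    xs, ys = [], []
    for _ in range(n):
        u, w = 1.0 - rng.random(), 1.0 - rng.random()  # both in (0, 1]
        v = ((w ** (-theta / (1.0 + theta)) - 1.0) * u ** (-theta) + 1.0) ** (-1.0 / theta)
        xs.append(u)
        ys.append(v)
    return xs, ys


x, y = sample_clayton(600, theta=2.0, rng=random.Random(1))
theta_hat, loglik_hat = fit_clayton(x, y)
print(f"theta_hat = {theta_hat:.2f}, pseudo log-likelihood = {loglik_hat:.1f}")
```

A one-parameter problem like this does not need a global optimizer; the GA becomes more relevant when several copula parameters must be estimated jointly.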

Table 6 Performance of various bivariate copula models for representing dependence structure (joint distributions) of flood variables

6.4.2 Trivariate copula models

The fully nested forms of the Clayton, Gumbel–Hougaard, and Frank copulas and one elliptical copula (Student’s t) are chosen to model the trivariate flood characteristics. The copula functions are fitted with the maximum pseudo-likelihood estimator, with the parameters estimated using a genetic algorithm. The GA parameters used for the nested Archimedean class of copulas are: population size of 20, 500 generations, single-point crossover with a crossover rate of 0.8, Gaussian mutation function with a mutation rate of 0.01, and stochastic uniform selection; for the Student’s t copula, a population size of 50 and 200 generations are used, and the remaining parameters are the same as above. The estimated copula parameters and corresponding log-likelihood function values are presented in Table 7, which shows the highest log-likelihood value for the Student’s t copula, followed by the Gumbel–Hougaard copula. The performance of the copula families is compared using the distance-based statistics AD and IAD and entropy tests, which are presented in Table 8. From this table, it can be seen that overall the Student’s t copula performed best for the trivariate flood data compared with the other copula families. The low performance of the Frank copula can be attributed to its radially symmetric structure, whereas flood events are often upper tail dependent.

Table 7 Estimated parameters of trivariate models for various copula families
Table 8 Comparison of copula models performance in representing the trivariate flood properties

For further visual illustration, random samples of size 500 are generated from both the Gumbel–Hougaard and Student’s t copulas, transformed back into their original units using the corresponding marginal distribution functions, and compared with the observed data, as shown in Fig. 4 (a–c for the Student’s t copula and d–f for the Gumbel–Hougaard copula). From these plots, it can be observed that the Student’s t copula performs satisfactorily, as the random pairs generated from this copula (gray dots) adequately overlap the dependence pattern of the sample data (black dots). The corresponding Kendall’s τ values from the simulated data are also presented in Fig. 4. The application of the fully nested Gumbel–Hougaard copula is restricted because two of the pair-wise correlations must be identical and no greater than the third (such as rank-based correlations satisfying \( \tau_{P,V} \geqslant \tau_{V,D} = \tau_{P,D} \)), as observed from the pair-wise simulated data.

Fig. 4 Scatter plots of observed versus 500 simulated samples of flood variable pairs from the Student’s t copula (a–c) and the fully nested Gumbel–Hougaard copula model (d–f). Solid black dots: observed samples; light gray dots: simulated samples

The maximum distance \( d_{\max} \) computed between the empirical non-exceedance probability and the Student’s t copula-based joint distribution is found to be 0.062. The critical value of the Kolmogorov–Smirnov test at the 5 % significance level is \( D_{n = 61}^{\alpha = 0.05} = 0.17 \). Hence, the Student’s t copula is accepted as a suitable model at the 5 % significance level. A scatter plot of the empirical joint CDF obtained using the Gringorten plotting position formula against the theoretical joint CDF obtained using the Student’s t copula is presented in Fig. 5. A good agreement between empirical and theoretical probabilities is observed at the tails of the distribution. Hence, the Student’s t copula is suitable for representing the trivariate joint distribution of flood properties.
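The acceptance check above can be sketched as follows. Two assumptions are made: the empirical joint probability uses the Gringorten form (m − 0.44)/(n + 0.12), with m the number of observations not exceeding a point in any coordinate, and the critical value uses the large-sample approximation 1.36/√n, which for n = 61 gives about 0.17, in line with the value quoted in the text.

```python
import math


def gringorten_joint_cdf(sample):
    """Empirical joint non-exceedance probability (m - 0.44) / (n + 0.12).

    m counts the observations that do not exceed the point in any
    coordinate, the point itself included.
    """
    n = len(sample)
    probs = []
    for x in sample:
        m = sum(all(yj <= xj for yj, xj in zip(y, x)) for y in sample)
        probs.append((m - 0.44) / (n + 0.12))
    return probs


def ks_critical_value_5pct(n):
    """Large-sample approximation 1.36 / sqrt(n) of the 5 % KS critical value."""
    return 1.36 / math.sqrt(n)


def max_distance(empirical, theoretical):
    """Largest absolute difference between two lists of probabilities."""
    return max(abs(e - t) for e, t in zip(empirical, theoretical))


critical = ks_critical_value_5pct(61)
d_max_reported = 0.062  # value reported in the text for the Student's t copula
print(f"critical value = {critical:.3f}, accepted = {d_max_reported < critical}")
```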

Fig. 5 Comparison of the empirical joint probability distribution (obtained using the plotting position formula) and the theoretical joint probability distribution obtained from the Student’s t copula

6.5 Analyzing tail dependence

In order to test the tail properties of the simulated samples from the copula families, a pair-wise tail-dependence test is performed. The upper tail dependence coefficient (TDC) for the Clayton and Frank copulas is equal to zero. Hence, these two copulas are excluded from the tail-dependence analysis. The TDC of the empirically transformed (i.e., from the ECDF) flood variable pairs is computed using the \( \lambda_U^{\text{CFG}} \), \( \lambda_U^{\text{LOG}} \), and \( \lambda_U^{\text{SS}} \) estimators. For the \( \lambda_U^{\text{SS}} \) and \( \lambda_U^{\text{LOG}} \) estimates of the observed data (sample length = 61), the optimal plateau is selected using a box kernel of bandwidth \( \left( b = \left\lfloor 0.05n \right\rfloor = 3 \right) \). The empirical TDC of the observed flood variable pairs is listed in Table 9.
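For illustration, a CFG-type estimate can be computed from pseudo-observations as sketched below. The formula is the form commonly attributed to Frahm et al. (2005) and is assumed here to match the estimator used in this study; the plateau-based LOG and SS estimators are not shown, and the sample values are hypothetical.

```python
import math


def tdc_cfg(u, v):
    """CFG-type upper tail dependence estimate from pseudo-observations in (0, 1).

    lambda = 2 - 2 * exp(mean(log(sqrt(log(1/u) * log(1/v)) / log(1/max(u, v)^2))))
    """
    n = len(u)
    total = 0.0
    for ui, vi in zip(u, v):
        num = math.sqrt(math.log(1.0 / ui) * math.log(1.0 / vi))
        den = math.log(1.0 / max(ui, vi) ** 2)
        total += math.log(num / den)
    return 2.0 - 2.0 * math.exp(total / n)


# Hypothetical pseudo-observations of two flood variables.
u_q = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85, 0.95]
u_v = [0.15, 0.20, 0.45, 0.50, 0.75, 0.80, 0.97]
print(f"lambda_U^CFG = {tdc_cfg(u_q, u_v):.3f}")
```

For perfectly comonotone pairs the ratio inside the logarithm is exactly 1/2, so the estimate equals 1; for independent pairs it fluctuates around 0.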

Table 9 Empirical tail dependence coefficient for three flood variable pairs

For comparing the TDC estimates of the Gumbel–Hougaard and Student’s t copulas, bivariate random numbers are generated with a sample size of 500. The optimal plateau for \( \lambda_U^{\text{SS}} \) and \( \lambda_U^{\text{LOG}} \) is selected using a box kernel of bandwidth \( \left( b = \left\lfloor 0.005n \right\rfloor \approx 3 \right) \). The computation is repeated for ten different runs (i.e., \( \hat{\lambda}_{n,i}, i = 1, \ldots, 10 \)). Each TDC estimate is then compared using the sample mean \( \hat{\mu}\left( \hat{\lambda}_U \right) \) and standard deviation \( \hat{\sigma}\left( \hat{\lambda}_U \right) \) of the ten runs. The results are summarized in Table 10. The resulting box plots of the simulated tail-dependence coefficients for the three estimators are presented in Fig. 6. The asymmetrical nature of the box plots shows that the sample variance of \( \hat{\lambda}_U^{\text{LOG}} \) and \( \hat{\lambda}_U^{\text{SS}} \) is higher than that of \( \hat{\lambda}_U^{\text{CFG}} \). From Table 10 and Fig. 6, it can be observed that the Student’s t copula simulates the observed upper TDC estimates better than the Gumbel–Hougaard copula, especially for the \( \hat{\lambda}_U^{\text{CFG}} \) estimator. The sample variance of the \( \hat{\lambda}_U^{\text{CFG}} \) estimator for the Student’s t copula is less than that for the Gumbel–Hougaard copula. The parametric upper TDC of the Student’s t copula involving the dependence parameter ϑ = 5.08 and \( \sigma_{ij} \) is (Frahm et al. 2005) \( \lambda_{U_{ij}}^{\text{param}} = 2 - 2 t_{\vartheta + 1}\left( \sqrt{\vartheta + 1} \sqrt{\frac{1 - \sigma_{ij}}{1 + \sigma_{ij}}} \right), \; i,j \in \{ 1,2, \ldots, n\} \), where \( t_{\vartheta + 1} \) is the CDF of a Student’s t random variable with ϑ + 1 degrees of freedom. The pair-wise parametric TDC estimates using the Student’s t copula are: peak flow and volume \( \lambda_{U_{\text{PV}}}^{\text{param}} = 0.43 \), volume and duration \( \lambda_{U_{\text{VD}}}^{\text{param}} = 0.35 \), and peak flow and duration \( \lambda_{U_{\text{PD}}}^{\text{param}} = 0.13 \), respectively.
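The parametric expression can be evaluated numerically as in the sketch below. The degrees of freedom ϑ = 5.08 are taken from the text, whereas the values of \( \sigma_{ij} \) passed in are hypothetical because the fitted correlation matrix is not restated here. The Student’s t CDF is obtained by Simpson integration of its density so that only the standard library is needed.

```python
import math


def student_t_cdf(x, dof, steps=2000):
    """Student's t CDF by Simpson integration of the density over [0, |x|]."""
    log_norm = (math.lgamma((dof + 1.0) / 2.0) - math.lgamma(dof / 2.0)
                - 0.5 * math.log(dof * math.pi))

    def pdf(z):
        return math.exp(log_norm - (dof + 1.0) / 2.0 * math.log1p(z * z / dof))

    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    area = pdf(0.0) + pdf(a)
    for k in range(1, steps):
        area += (4.0 if k % 2 else 2.0) * pdf(k * h)
    area *= h / 3.0
    return 0.5 + area if x >= 0 else 0.5 - area


def t_copula_upper_tdc(rho, dof):
    """Parametric upper TDC of a Student's t copula, valid for -1 < rho <= 1."""
    arg = math.sqrt(dof + 1.0) * math.sqrt((1.0 - rho) / (1.0 + rho))
    return 2.0 - 2.0 * student_t_cdf(arg, dof + 1.0)


dof = 5.08  # dependence parameter reported in the text
for rho in (0.2, 0.5, 0.8):  # hypothetical pair-wise correlations
    print(f"rho = {rho:.1f}: lambda_U = {t_copula_upper_tdc(rho, dof):.3f}")
```

The coefficient increases with the correlation and reaches 1 in the limit of perfect positive dependence, which offers a quick sanity check of any implementation.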

Table 10 Bivariate upper TDC estimate for Student’s t and Gumbel–Hougaard copulas
Fig. 6 Box–whisker plots of upper TDC for simulated samples from the copulas. The first row (a–c) shows the Student’s t copula; the second row (d–f) shows the Gumbel–Hougaard copula family. \( \lambda_{U_{\text{PV}}} \), \( \lambda_{U_{\text{PD}}} \), and \( \lambda_{U_{\text{VD}}} \) denote the upper TDC (\( \lambda_U \)) estimates of the peak flow–volume, peak flow–duration, and volume–duration combinations

7 Probabilistic analysis of flood variables

7.1 Primary return periods of flood peak flow conditional to volume and duration

The frequency analysis of multivariate extreme events is helpful in understanding the critical hydrologic behavior of floods at the river basin scale through the use of various combinations of flood characteristics. It can also be helpful in taking nonstructural safety measures, delineating flood plains, and developing flood mitigation strategies. From a hydrological perspective, these scenarios are important, as extreme flood events with high peak flow and long duration, and hence high volume, may be devastating at the watershed scale; on the other hand, short-duration events with high peak discharge and moderate volume may cause flash floods.

Table 11 presents the return periods obtained using the univariate marginal distributions of peak flow, volume, and duration, and the joint return periods for the “AND” and “OR” cases for bivariate as well as trivariate distributions. The joint return period in the “AND” case is longer than the joint return period in the “OR” case when the same univariate return period is assumed. The effect of including a third variable (duration in this case) in computing the joint return period can also be observed from Table 11. For all the cases considering trivariate flood properties, the joint return period in the “AND” case \( T_{\text{QVD}}^{\text{AND}}\left( q,v,d \right) \) is greater than the other return periods, and the joint return period in the “OR” case \( T_{\text{QVD}}^{\text{OR}}\left( q,v,d \right) \) is lower than the pair-wise bivariate joint return periods. Hence, it can be inferred that the simultaneous occurrence of trivariate flood characteristics is less frequent in the “AND” case and more frequent in the “OR” case.
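The ordering described above can be motivated by elementary bounds on the copula. The short derivation below is a sketch in which \( u_i = F_{X_i}(x_i) \) and \( T_i = 1/(1 - u_i) \) denote the univariate non-exceedance probabilities and return periods (notation introduced here for illustration). Since a joint non-exceedance probability cannot exceed any of its margins,

$$ C\left( u_1,u_2,u_3 \right) \leqslant C\left( u_i,u_j \right) \leqslant \min \left( u_i,u_j \right) \quad \Rightarrow \quad T_{X_1X_2X_3}^{\text{OR}} \leqslant T_{X_iX_j}^{\text{OR}} \leqslant \min \left( T_i,T_j \right) $$

and, since a joint exceedance probability cannot exceed any of the individual exceedance probabilities,

$$ P\left( X_1 \geqslant x_1 \wedge X_2 \geqslant x_2 \wedge X_3 \geqslant x_3 \right) \leqslant P\left( X_i \geqslant x_i \wedge X_j \geqslant x_j \right) \leqslant \min \left( 1 - u_i, 1 - u_j \right) \quad \Rightarrow \quad T_{X_1X_2X_3}^{\text{AND}} \geqslant T_{X_iX_j}^{\text{AND}} \geqslant \max \left( T_i,T_j \right) $$

so that, for any copula, \( T^{\text{OR}} \leqslant \min_i T_i \leqslant \max_i T_i \leqslant T^{\text{AND}} \).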

Table 11 Comparison of univariate, bivariate, and trivariate return periods for flood characteristics

The joint return periods of two flood variables conditional on the third flood variable, viz., \( T_{Q,V|D}\left( q,v|D \leqslant d \right) \), \( T_{V,D|Q}\left( v,d|Q \leqslant q \right) \), and \( T_{Q,D|V}\left( q,d|V \leqslant v \right) \), are computed using Eq. 35. For example, consider a flood event with the following characteristics: annual maximum peak discharge, q = 398.8 Mm³/day; flood volume, v = 676.6 Mm³; and flood duration, d = 14 days. By using Eq. 35, the corresponding conditional return periods are obtained as \( T_{Q,V|D}\left( q,v|D \leqslant d \right) \) = 106 years, \( T_{V,D|Q}\left( v,d|Q \leqslant q \right) \) = 4 years, and \( T_{Q,D|V}\left( q,d|V \leqslant v \right) \) = 4.4 years. The contour plots of specific joint return periods for all three combinations are plotted in Fig. 7. As seen from the scatter plots, most of the flood events have shorter primary return periods. However, in both the \( T_{Q,V|D}\left( q,v|D \leqslant d \right) \) and \( T_{V,D|Q}\left( v,d|Q \leqslant q \right) \) cases there is only one event with a primary return period of more than 100 years. One can obtain the desired information for a particular use from the contours of joint return periods (viz., flood volume and duration at a given peak discharge, flood peak and duration at a given flood volume, and flood peak and volume at a given duration). This information will be helpful for the design of flood control structures, reservoirs, spillways, etc., which require design flood hydrographs.

Fig. 7 Contour plots for conditional return periods (in years) of flood characteristics: a \( T_{\text{QV}|D}\left( q,v|D \leqslant d \right) \), b \( T_{\text{VD}|Q}\left( v,d|Q \leqslant q \right) \), and c \( T_{\text{QD}|V}\left( q,d|V \leqslant v \right) \). Historical events are shown as solid dots on the graph

The conditional return periods of peak flow at given volume and constant duration, and the conditional return periods of peak flow at given duration and constant volume, are presented in Fig. 8a and b, respectively. It should be noted that in Fig. 8 the volume and duration values correspond to various percentile values of the respective data. The skewness of the conditional return period curves for flood peak shows an increasing trend for the first case (i.e., at given conditional volume for various percentile levels and constant duration) and a decreasing trend for the second case (i.e., at given conditional duration and constant volume). In a similar way, the conditional return period of flood volume given peak discharge and duration, and/or the conditional return period of flood duration given volume and peak discharge, can also be obtained. These scenarios can be helpful in assessing flood risk for hydrologic design purposes such as the design of spillways and the construction of various flood protection structures (for example, levees, flood walls, and diversion works).

Fig. 8 Conditional return periods (in years) of flood characteristics: a \( T_{Q|\text{VD}}\left( q|V \leqslant v,D \leqslant d \right) \) of peak flow at given volume and constant duration (d = 5 days); b \( T_{Q|\text{VD}}\left( q|V \leqslant v,D \leqslant d \right) \) of peak flow at given duration and constant volume (v = 120 Mm³)

7.2 Analysis of secondary return period

The secondary return period can be useful for analyzing the risk of supercritical flood events. The relationship between the joint primary return period \( T_{\text{QVD}}\left( q,v,d \right) \) and the secondary return period \( \bar{T}_{\text{QVD}}\left( q,v,d \right) \) is also investigated. The relationships of \( \bar{K}_{C_{\Sigma, \vartheta}} \) and the secondary return period \( \bar{T}_{\text{QVD}}\left( q,v,d \right) \) with the primary return period in the “OR” case \( T_{\text{QVD}}^{\text{OR}}\left( q,v,d \right) \) are shown in Fig. 9a and b, respectively. For a higher value of the critical probability level \( \overline{p} \), a higher primary return period is observed, and the survival probability function \( \overline{K}_{C_{\Sigma, \vartheta}}\left( \overline{p} \right) \) tends to decrease, which corresponds to a decrease in the probability of supercritical events. As the probability of supercritical events decreases, the secondary return period, or mean time between the occurrences of two supercritical events, increases, and vice versa. A similar behavior is observed between \( \bar{K}_{C_{\Sigma, \vartheta}} \) and the primary return period in the “AND” case \( T_{\text{QVD}}^{\text{AND}}\left( q,v,d \right) \). It is also observed that the secondary return period is always greater than the primary return period in the “OR” case \( T_{\text{QVD}}^{\text{OR}}\left( q,v,d \right) \), but lower than the primary return period in the “AND” case \( T_{\text{QVD}}^{\text{AND}}\left( q,v,d \right) \).

Fig. 9 The relation of the primary return period \( T_{\text{QVD}}^{\text{OR}}\left( q,v,d \right) \) versus a the survival probability function \( \bar{K}_{C_{\Sigma, \vartheta}}\left( \bar{p} \right) = 1 - K_{C_{\Sigma, \vartheta}}\left( \bar{p} \right) \) and b the secondary return period \( \bar{T}_{\text{QVD}}\left( q,v,d \right) \)

An illustration is the 20-day flood event that occurred in 2006, which had a peak flow of 391.45 Mm³/day and a volume of 1,242 Mm³ (which corresponds to the 99th percentile of flood volume). The secondary return period associated with this extreme event using Eq. 39 is more than 100 years \( \left( \bar{T}_{\text{QVD}}\left( q,v,d \right) \approx 108\;\text{years} \right) \), whereas the primary return periods associated with the event for the “OR” and “AND” cases are \( T_{\text{QVD}}^{\text{OR}}\left( q,v,d \right) \) = 16.5 years and \( T_{\text{QVD}}^{\text{AND}}\left( q,v,d \right) \) = 507.1 years. Similarly, the 14-day flood event that occurred in 1955, with peak flow q = 398.79 Mm³/day (which corresponds to the 99th percentile of annual maximum flood peak) and volume v = 676.58 Mm³, has a secondary return period \( \bar{T}_{\text{QVD}}\left( q,v,d \right) \) = 14.5 years, while the primary return periods \( T_{\text{QVD}}^{\text{OR}}\left( q,v,d \right) \) and \( T_{\text{QVD}}^{\text{AND}}\left( q,v,d \right) \) are 3.8 years and 175 years, respectively. The secondary return periods of the flood variables presented in Table 11 also support this inference. This means that a structure could be underdimensioned if it is designed considering only the primary return period in the “OR” case and overdimensioned if it is designed with the primary return period in the “AND” case. Thus, the presented methodology can be very useful for risk-based hydrological design of hydraulic and water resources projects.
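The figures quoted for the two events can be restated as annual probabilities with a few lines of arithmetic. The sketch below uses only the return periods given in the text (the 2006 secondary value is the approximate 108 years) and assumes the annual-maximum convention in which the probability is the reciprocal of the return period.

```python
# Return periods (years) quoted in the text for the two illustrative events.
events = {
    "2006 (20-day)": {"T_or": 16.5, "T_secondary": 108.0, "T_and": 507.1},
    "1955 (14-day)": {"T_or": 3.8, "T_secondary": 14.5, "T_and": 175.0},
}


def annual_probability(return_period_years):
    """Annual probability implied by a return period, p = 1 / T."""
    return 1.0 / return_period_years


for name, t in events.items():
    probs = {key: annual_probability(value) for key, value in t.items()}
    ordered = t["T_or"] < t["T_secondary"] < t["T_and"]
    print(name, {k: round(p, 4) for k, p in probs.items()}, "ordered:", ordered)
```

For both events the secondary return period lies between the “OR” and “AND” values, which is the pattern behind the under- and overdimensioning remark.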

However, in recent times many concerns have been raised over climate change and its influence on hydrological extremes. It would be interesting to see how climate change may affect future floods and to quantify the associated risks. These tasks can be considered in future studies for a comprehensive evaluation of flood risks.

8 Summary and conclusions

In this study, a trivariate copula-based approach is presented for risk analysis of flood flows of the Delaware River basin at Port Jervis, New York. The association among the three mutually correlated flood properties (viz., annual flood peak flow, volume, and duration) is explored and used for dependence structure modeling using copulas. From graphical tests and rank correlations of the flood variable pairs, it is found that the dependence among the flood variables is statistically significant, and therefore a copula-based methodology is adopted for flood risk analysis. Three fully nested Archimedean copulas (viz., the Clayton, Gumbel–Hougaard, and Frank copulas) and one elliptical copula (i.e., the Student’s t copula) are used to model the joint distribution of flood variables. To test the performance of the copulas in modeling extreme flood events, apart from standard performance measures, nonparametric tail-dependence coefficients are evaluated and used to verify the efficacy of the copulas in modeling hydrological extremes. The return periods for the univariate, bivariate, and trivariate cases are estimated and a comparative analysis is performed. Also, the importance of primary and secondary return periods is analyzed.

The specific conclusions of the study are as follows:

  • The copula method is found to be a very effective tool for multivariate modeling of flood risks, as copulas effectively preserve the dependence structure of multiple flood characteristics.

  • In general, the upper tail-dependent copula families performed better in modeling extreme flood events and are also capable of capturing the tail behavior of the data well. In relative comparison, the Student’s t copula is found to be better than the other copulas in describing the bivariate and trivariate dependence structure of flood variables. This may be because the weak dependences are averaged in the other copulas. Also, in the case of the Student’s t copula, the nonparametric tail probability measures showed a close similarity with the observed TDC, which suggests that the Student’s t copula captures the extremes of the observed data well.

  • The comparative analysis of different return periods showed that it is very important to compute trivariate return periods of flood characteristics to know the expected flood risks and their magnitude of influence if they occur simultaneously. It is also noted that hydraulic structures could be underdimensioned if they are designed considering only the return period in the “OR” case, and overdimensioned if they are designed with the return period in the “AND” case. The primary and secondary return periods of flood characteristics can be very useful for effective risk-based design of water resources projects.