1 Introduction

It has been widely warned that applying a simple univariate statistical approach when the variables are dependent may result in severe under- or overestimation. Traditionally, classical multivariate distributions have been the main tool for such considerations. In that case, the marginal variables (or their transformations) must be described using identical parametric families. Copula modelling, based on a theorem given by Sklar in 1959, provides an alternative that overcomes such limitations. In its bivariate form, if x and y are two continuous random variables with continuous marginal cumulative distributions F(x) and G(y), Sklar’s theorem implies that the joint cumulative distribution H(x,y) can be described as:

$$ H(x,y) = C(u,v),\quad x,y \in \Re \;{\text{and}}\;u = F(x),\;v = G(y) $$
(1)

In this case, C: [0,1]² → [0,1] is the copula function, which couples the cumulative marginal distributions and forms the cumulative joint distribution. It can be proved that if H is known, then C, F, and G are uniquely determined (Genest and Favre 2007). Technically, therefore, the copula approach divides the problem into two mutually independent steps: fitting the marginal distributions and identifying the dependence structure among the margins (Wang et al. 2009). This ability can provide a powerful modelling scheme even “when no common probability space can be found for a given set of random variables” (Sklar 1996). Furthermore, Genest and MacKay (1986) showed how copulas can discover dependencies that traditional dependence measures, such as Pearson’s correlation coefficient, cannot capture.

In the most recent decade, the rate of publications on copula modelling and its applications has increased exponentially in various fields. The application of copula modelling in hydrology and environmental modelling was initiated by De Michele and Salvadori (2003) and was followed by Favre et al. (2004), who discussed the advantages of copula modelling in characterizing complex hydrological variables. This recognition was amplified by the dedication of a special issue of the ASCE Journal of Hydrologic Engineering (2007) to copula modelling and its application in hydrology. So far, copula modelling in hydrology has mainly been applied to the modelling of extreme events, such as flood frequency analysis (e.g. Zhang and Singh 2006; Bénard and Lang 2007; Genest et al. 2007; Karmakar and Simonovic 2008, 2009; Wang et al. 2009; Chebana and Ouarda 2009), flood retention measures (Nijssen et al. 2009), designing flood control systems (e.g. Klein et al. 2008; Osorio et al. 2009; Muhaisen et al. 2009), inland flooding analysis (e.g. Sunyer Pinya et al. 2009), flood damage calculation (e.g. Gartsman et al. 2009), and characterization of drought or low flow conditions (e.g. Shiau et al. 2006, 2007; Dupuis 2007; Serinaldi et al. 2009; Kao and Govindaraju 2010). Recently, the application of copula modelling has been extended to new areas. For example, Bárdossy (2006) discussed the applicability of copulas as a geostatistical model for deriving groundwater quality parameters. Salvadori and De Michele (2007) used copulas for finding the temporal structure of sequential storms. Chowdhary and Singh (2008) used copulas for rainfall frequency analysis. Serinaldi (2008) used copulas for unfolding the dependencies among rainfall fields. In more advanced examples, Serinaldi (2009a) used copulas as a module embedded in rainfall simulators. Haberlandt et al. (2008) implemented copulas for space–time hybrid hourly rainfall models. Leonard et al. (2008) applied copulas for considering the seasonal/annual dependencies among rainfall and streamflow. Maity and Kumar (2008) reported the application of copulas for analyzing the dependencies among teleconnected hydroclimatic variables. Bárdossy and Pegram (2009) implemented copulas to model the spatial interdependence structure of rainfall amounts together with rainfall occurrences. Karamouz et al. (2009) used copulas for deriving the joint probability of supplying both water demand and water quality in a complex reservoir operation system. In a regionalization context, Gargouri-Ellouze and Bargaoui (2009) showed that the copula parameters derived for describing infiltration can be further linked to the actual physical properties of the watershed. In the same context, Samaniego et al. (2010) performed streamflow prediction in ungauged catchments using copula-based dissimilarity measures. More recently, van den Berg et al. (2010) used copulas for spatial downscaling of rainfall fields. The applications are still growing rapidly (www.stahy.org/Topics/CopulaFunction/tabid/67/Defult.aspx).

Nonetheless, it should be noted that copula modelling has received some strong criticism. In a landmark discussion, Mikosch (2006) argued that the use of copula modelling might not be scientifically legitimate and can suffer from a lack of rationale in many circumstances (see also the reply by Genest and Rémillard 2006). Regardless of the theoretical/philosophical discussions, copula practitioners have also warned that the method may still be far from a universal tool (Bénard and Lang 2007) and can give disappointing results on some occasions (Genest et al. 2007). Maity and Kumar (2008) concluded that even though Sklar’s theorem is mathematically coherent, this does not imply that one of the classical copulas can certainly describe the dependence in every possible dataset. As a result, Ashkar (2008) argued that practitioners should be very careful not to consider copula modelling as the only approach to tackling statistical dependence. Therefore, copulas should be considered in conjunction with other models of dependence and should remain only one potential candidate for describing the dependence in a given dataset.

This paper aims at providing a pragmatic framework for performing copula modelling in real-world practical contexts and at presenting the application of copula modelling in a new practical context: assessing the long-term behaviour of reconstructed watersheds. The long-term behaviour of reconstructed watersheds can be assessed with the aid of the maximum annual water deficit, which is a measure of available water for evapotranspiration (Elshorbagy and Barbour 2007). However, estimation of the maximum annual water deficit requires knowledge of the soil moisture dynamics within the watershed, which is available only in a limited number of watersheds with extensive soil moisture monitoring. By recognizing the interdependence between the maximum annual soil moisture deficit and the annual cumulative evapotranspiration, a joint description of these variables is attempted. Given the joint model and an estimate of the annual cumulative evapotranspiration, the cumulative distribution of the maximum annual water deficit can be estimated even in nearby ungauged reclaimed sites with similar physical characteristics and reclamation strategy. Along the way, a brief review of the theory of copula modelling is presented, and three research questions are addressed: (1) how to compare and select among several competing models of dependence; (2) how the efficiency of the dependence model depends on the method used for parameter estimation; and finally, (3) how the efficiency of the model changes when the structure of the dependence model is altered. The outline of this paper is as follows: Section 2 introduces the considered application area as well as the case study and the data used for the experimentation. In Sect. 3, dependence modelling is viewed as a top-down sequential procedure, and a brief review of copula modelling is provided.
Section 4 introduces the applied test space and proposes a two-step framework to compare/falsify several competitive models of dependence. In Sect. 5, the results are reported and discussed from several perspectives, and finally Sect. 6 summarizes the paper, provides the conclusions, and highlights some directions for further research.

2 Performance assessment of the reconstructed watersheds

2.1 Context and the problem definition

The oil sands mining industry is one of the main economic resources in Canada. The extraction process of the oil can completely disrupt the natural behaviour of the mining area. Based on Leskiw (2004), the end results of oil sands mining can extend spatially to over 100 km² from each individual site. Extensive reclamation work, therefore, is required to reconstruct various functions of the natural landscape. The main reclamation strategy in these sites is to cover the tailing sand remaining from the mining process with one or multiple soil layers that enable the watershed to naturally “store and release the water” (Elshorbagy and Barbour 2007). This ability can provide the chance for re-planting the mining site and re-establishing the natural conditions in the watershed. In order to evaluate the hydrological efficiency of different soil cover options, building some prototype watersheds with an extensive monitoring program is essential. The acquired knowledge can result in proposing an optimal cover design regarding the characteristics of the site and the target vegetation. Based on Elshorbagy et al. (2006), by gathering a short period of data (i.e. a few years), a hydrological model can be calibrated with the short-term observations. Then, the verified model can be used to simulate the long-term behaviour of the watershed. Based on the long-term simulation results, Elshorbagy and Barbour (2007) proposed a novel methodology for hydrological performance assessment of reconstructed watersheds. They suggested the dynamic concept of maximum annual moisture deficit as a measure to assess the performance of the soil cover. Assuming that the averaged daily soil moisture content (\( S_{i} \)), interflow (\( I_{i} \)), and percolation below the cover depth (\( P_{i} \)) are available through long-term simulation, the daily moisture deficit (\( D_{i} \)) that contributes to evapotranspiration can be calculated as (Elshorbagy and Barbour 2007):

$$ D_{i} = S_{i} - S_{i + 1} - \left( {I_{i} + P_{i} } \right) $$
(2)

By accumulating the daily series of \( D_{i} \) within each year, it is possible to retrieve the maximum annual soil moisture deficit. The daily values of \( D_{i} \) should be accumulated only over the growing season. The maximum value of the cumulative \( D_{i} \) in each year is taken as the maximum annual soil moisture deficit. With a long-term series of these annual maxima, the long-term behaviour of the reconstructed catchment can be analysed statistically.
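As an illustration, the calculation of Eq. 2 and of the seasonal maximum of the cumulative deficit can be sketched in Python as follows. This is only a minimal sketch and not the code used in this study: the function names and the five-day series are hypothetical, and the storage series is assumed to contain one more value than the flux series because day i uses the storage on days i and i + 1.

```python
from itertools import accumulate


def daily_deficit(storage, interflow, percolation):
    """Eq. 2: D_i = S_i - S_{i+1} - (I_i + P_i).

    `storage` must hold one more entry than the flux series.
    """
    if len(storage) != len(interflow) + 1 or len(interflow) != len(percolation):
        raise ValueError("expected len(storage) == len(fluxes) + 1")
    return [
        storage[i] - storage[i + 1] - (interflow[i] + percolation[i])
        for i in range(len(interflow))
    ]


def max_seasonal_deficit(deficits):
    """Largest value reached by the running sum of D_i over one growing season."""
    return max(accumulate(deficits))


# Hypothetical five-day growing season (mm); the values are illustrative only.
S = [120.0, 115.0, 111.0, 113.0, 108.0, 110.0]
I = [0.5, 0.5, 0.0, 0.5, 0.5]
P = [0.5, 0.5, 0.0, 0.5, 0.5]

D = daily_deficit(S, I, P)
peak = max_seasonal_deficit(D)
```

Repeating the last step for every simulated year would give the annual series that is analysed statistically in the remainder of the paper.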

Although the concept of maximum annual moisture deficit can provide a reliable dynamic notion of the watershed performance, the calculation of this measure is based on the soil moisture content, which is available only in a limited number of prototype watersheds with an extensive soil moisture measurement scheme. However, given the same physical characteristics and reclamation strategy, the difference in the maximum annual water deficit between two nearby reconstructed watersheds stems from the difference between their annual cumulative evapotranspiration quantities. Therefore, if the model of dependence between annual cumulative evapotranspiration and maximum annual moisture deficit is known, the cumulative distribution of maximum annual water deficit can be estimated based on the measurements or estimations of annual cumulative evapotranspiration.

2.2 Case study

Located in a continental boreal climate regime, Young Jack Pine (YJP) is a 15-year-old planted forest on a reconstructed site on the Mildred Lake mine (57°03.8′N, 111°39.8′W, elev. ~310 m). The mine is about 40 km NNW of Fort McMurray, Alberta. The average daily temperature ranges from −18.8 to 16.8°C (Jan.–Jul.), with average annual precipitation of 456 mm. The growing season falls between mid-May and early October with 60–70 frost-free days. Most of the precipitation falls within this interval, as 313 mm typically falls during days 139–282 of each year. The reconstructed cover is composed of glacial till down to an average depth of 40 cm, placed over a deep layer of sandy soil from the oil sands extraction process. The data gathered from the site include the daily on-site meteorological data (the micrometeorological data including AET are only available during the growing seasons) as well as 4-h measurements of the soil moisture contents, soil temperature and soil suctions at five different depths. Three measurements were made in the upper till layer (depths 50, 150 and 300 mm) and the other two in the underlying tailing sand base (depths 500 and 800 mm). In order to measure the turbulent fluxes at the site, an open-path eddy covariance system was used, mounted approximately 4 m above the canopy at a height of 8.8 m. Radiant fluxes were measured with instrumentation mounted at a height of 6.3 m. Ground heat flux was obtained using a soil heat flux plate at 5 cm. Precipitation data were measured using a tipping-bucket rain gauge located less than 1 km west of the tower. In total, two years (2007–2008) of daily climate and soil data were available for YJP. Keshta et al. (2009) proposed a Generic System Dynamic Watershed model (GSDW) for hydrological modelling in reconstructed watersheds with a similar set of data. GSDW, therefore, has been used for further hydrological modelling in this paper.
In brief, the 2007 data of soil moisture and evapotranspiration in YJP were used to calibrate sixteen different conceptualizations of the GSDW model. These conceptualizations were based on different ways of incorporating soil and/or evapotranspiration data in the model structure. Accordingly, 16 competitive models were calibrated using four automatic calibration techniques. The performances of the competitive modelling alternatives were validated using the 2008 dataset, and the most accurate conceptualization/parameterization was identified. With the aid of this model and the meteorological data of Fort McMurray from the beginning of 1944 to the end of 2000, the long-term behaviour of the watershed was simulated. According to this long-term simulation, 56 pairs of maximum annual moisture deficit and annual cumulative evapotranspiration were produced. This dataset is considered for further experimentation in this paper. Table 1 shows the statistical characteristics of this dataset.

Table 1 The statistical properties of the long-term simulated annual series of maximum water deficit and the annual cumulative evapotranspiration for YJP prototype reclaimed site

3 Copula approach to modelling interdependence

3.1 Background

Considering the bivariate case, the Fréchet–Hoeffding theorem proves that the possible forms of dependence are bounded between two extremes, which can be formulated as the following inequality (Genest and Favre 2007):

$$ \max (0,u + v - 1) \le C(u,v) \le \min (u,v),\quad u,v \in [0,1] $$
(3)

The right and left bounds of Eq. 3 correspond to conditions in which x and y are deterministically dependent on each other (perfect positive and perfect negative dependence, respectively). At this juncture, stochastic independence holds if and only if C(u,v) = uv for all u,v ∈ [0,1]. Verifying this requires setting up both the marginal and the joint statistical representation of the data and can be described by the top-down framework presented in Fig. 1. Here, the task is to briefly explain the main building blocks of this system.
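The bounds of Eq. 3 and the independence case can be checked numerically for any candidate copula. The following Python sketch is illustrative only; the function names and the grid-based check are assumptions made for this example and are not part of the framework in Fig. 1.

```python
def lower_bound(u, v):
    """Frechet-Hoeffding lower bound, max(0, u + v - 1)."""
    return max(0.0, u + v - 1.0)


def upper_bound(u, v):
    """Frechet-Hoeffding upper bound, min(u, v)."""
    return min(u, v)


def independence(u, v):
    """Product copula, C(u, v) = uv, the case of stochastic independence."""
    return u * v


def within_bounds(copula, grid_size=20, tol=1e-12):
    """Check the inequality of Eq. 3 on a regular grid over the unit square."""
    points = [k / grid_size for k in range(grid_size + 1)]
    return all(
        lower_bound(u, v) - tol <= copula(u, v) <= upper_bound(u, v) + tol
        for u in points
        for v in points
    )


independence_ok = within_bounds(independence)
```

A function that violates Eq. 3 anywhere on the grid, such as min(1, u + v), is rejected by the same check and therefore cannot be a copula.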

Fig. 1 A top-down approach to static multivariate modelling

3.2 Pre-processing and marginal modelling

Nelsen (2006) suggested that the rank-based dependence measures such as Kendall’s tau and Spearman’s rho can better capture the existence of dependence compared to correlation coefficient. The empirical formulations of these measures can be described respectively as:

$$ \tau = {\frac{{n_{c} - n_{d} }}{{\frac{1}{2}n(n - 1)}}} $$
(4)
$$ \rho = 1 - {\frac{{6\sum {d_{i}^{2} } }}{n(n - 1)(n + 1)}} $$
(5)

where \( n_{c} \) and \( n_{d} \) are the numbers of concordant and discordant pairs in the data set, respectively, and \( d_{i} \) is the difference between the ranks of the two variables for the ith observation. Genest and Favre (2007) introduced two statistical hypothesis tests for verifying independence based on Kendall’s tau and Spearman’s rho.
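For illustration, Eqs. 4 and 5 can be evaluated directly from a sample. The sketch below is written in plain Python, assumes there are no ties in the data, and uses a small hypothetical sample; it does not include the independence tests of Genest and Favre (2007).

```python
from itertools import combinations


def kendall_tau(x, y):
    """Eq. 4: (n_c - n_d) / (n (n - 1) / 2), assuming no ties."""
    n = len(x)
    score = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        score += (s > 0) - (s < 0)  # +1 for a concordant pair, -1 for a discordant one
    return score / (0.5 * n * (n - 1))


def ranks(values):
    """Ranks 1..n of the observations, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r


def spearman_rho(x, y):
    """Eq. 5: 1 - 6 sum(d_i^2) / (n (n - 1)(n + 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n - 1) * (n + 1))


# Hypothetical sample of five pairs, for illustration only.
x = [1.2, 2.5, 3.1, 4.8, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
tau = kendall_tau(x, y)
rho = spearman_rho(x, y)
```

In this sample, eight of the ten pairs are concordant and two are discordant, so Eq. 4 gives 0.6, while the squared rank differences sum to 4 and Eq. 5 gives 0.8.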

It is better to keep the number of marginal variables as low as possible. This can be achieved either by using the dimension reduction techniques, such as the principal component analysis or considering the possible functional relationships among the marginal variables (e.g. Chowdhary 2008). After selecting the non-redundant margins, the marginal distributions (both empirical and theoretical) should be identified. The empirical cumulative distribution function (CDF) for a continuous random variable x can be calculated as:

$$ F_{n} (x) = {\frac{1}{n + 1}}\sum\limits_{i = 1}^{n} {1\left( {X_{i} \le x} \right)} $$
(6)

where n is the number of data points. Also, various continuous statistical distributions can be used for functional representation of the margins, which are well documented in the literature and are widely available through different software packages such as R or MATLAB. Moreover, Zhang and Singh (2006) reported that transforming the original data using transformation techniques, such as Box–Cox transformation, can be also implemented and might result in better identification of marginal CDF. This transformation of the hypothetical variable z can be defined as follows:

$$ z^{*} = {\frac{{z^{\lambda } - 1}}{\lambda }},\lambda \ne 0{\text{ or }}z^{*} = \ln (z),\lambda = 0 $$
(7)

where \( z^{*} \) is the Box-Cox transformation of z and λ is the Box–Cox parameter. After the selection of the functional families, and applying the possible transformations, the parameters of the considered distribution should be assigned. The well-known maximum likelihood estimation has been frequently used for this purpose and has been well incorporated in most of the available statistical packages.
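The empirical distribution of Eq. 6 and the transformation of Eq. 7 can be sketched as follows. This is a minimal illustration with hypothetical function names; the Box–Cox parameter is taken as given here, whereas in practice it would be estimated, for example by maximum likelihood.

```python
import math


def empirical_cdf(sample):
    """Return F_n of Eq. 6, with the 1 / (n + 1) scaling used in the text."""
    data = list(sample)
    n = len(data)

    def F_n(x):
        return sum(1 for xi in data if xi <= x) / (n + 1)

    return F_n


def box_cox(z, lam):
    """Eq. 7 for a strictly positive value z and a given parameter lambda."""
    if z <= 0:
        raise ValueError("the Box-Cox transformation requires z > 0")
    return math.log(z) if lam == 0 else (z ** lam - 1.0) / lam


# Hypothetical sample, for illustration only.
sample = [3.0, 1.0, 4.0, 2.0]
F = empirical_cdf(sample)
probabilities = [F(x) for x in sample]
transformed = [box_cox(z, 0.5) for z in sample]
```

The 1/(n + 1) scaling keeps every empirical probability strictly below one, which avoids evaluating a copula density on the boundary of the unit square.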

3.3 Copula structures

Copula modelling requires choosing a parametric mathematical structure that can provide the functional equivalent of H(x,y) through copula C(u,v); u = F(x), v = G(y). The current available copula structures can be categorized into four main classes.

3.3.1 Archimedean copulas

These structures are symmetric and associative and can be defined by continuous, strictly decreasing convex generators. One-parameter Archimedean copulas have been frequently used in hydrology (e.g. Shiau et al. 2007; Karmakar and Simonovic 2009; Sunyer Pinya et al. 2009; Osorio et al. 2009; Wang et al. 2009). The application of two-parameter Archimedean copulas has also been reported (e.g. Nijssen et al. 2009).

3.3.2 Metaelliptical copulas

These structures can be formulated based on elliptically contoured distributions (Michiels and De Schepper 2008). Well-known metaelliptical copulas are Gaussian, Student t, Cauchy and Pearson type II (e.g. Fang et al. 2002; Genest et al. 2007; Bénard and Lang 2007; Dupuis 2007; Poulin et al. 2007; Bárdossy and Pegram 2009). These structures can be further generalized; for instance, Demarta and McNeil (2004) provided a couple of extensions for t copula, called Skewed t and Grouped t copula. In practice, these copulas are more favourable when the joint distribution of three or more variables is concerned.

3.3.3 Extreme value copulas

These structures were initially proposed to deal with the joint probability of extreme events. Different systems have been suggested for constructing the extreme value copulas. Some famous structures such as Gumbel–Hougaard, Galambos, and Hüsler–Reiss systems, have been frequently implemented in hydrology (e.g. Genest and Favre 2007; Chowdhary and Singh 2008; Maity and Kumar 2008). Salvadori et al. (2007) presented a monograph with several applications of extreme value copulas in geophysical context. Recently, Salvadori and De Michele (2010) presented the application of multi-parameter extreme value copulas.

3.3.4 Other copulas

There are other copula families that cannot be classified into the three families mentioned above. These copulas include some well-known systems, such as Plackett and Farlie–Gumbel–Morgenstern (e.g. Shiau 2006; Shiau et al. 2007), as well as some less used copula models, such as the Koehler–Symanowski system (Koehler and Symanowski 1995), vine copulas (Kurowicka and Cooke 2007) and the contamination families based on Legendre polynomials (Kallenberg 2008).

3.4 Copula parameter estimation

All copula structures contain one or more parameters that should be assigned based on the data under consideration. The calibration of copula parameters can be classified mainly into five different methods.

3.4.1 Method of moments

In some copula structures, there are mathematical relationships between the copulas’ free parameter and the measures of dependence, such as Kendall’s tau, Spearman’s rho or correlation coefficient. This method of estimation is quite popular in metaelliptical copulas (e.g. Bénard and Lang 2007; Genest et al. 2007) and one-parameter Archimedean families (e.g. Zhang and Singh 2006; Maity and Kumar 2008; Sunyer Pinya et al. 2009; Osorio et al. 2009; Chebana and Ouarda 2009). The main limitation of this method is the fact that the closed-form formulas linking the copula parameters to measures of dependence do not exist for all copulas.
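As an example of such closed-form relationships, the sketch below inverts the standard links between Kendall’s tau and the parameters of the Clayton, Gumbel–Hougaard and Gaussian copulas. These three families are chosen for illustration only and do not represent the full test space of this study; the function names are hypothetical.

```python
import math


def clayton_theta(tau):
    """Clayton: tau = theta / (theta + 2), used here for 0 < tau < 1."""
    return 2.0 * tau / (1.0 - tau)


def gumbel_theta(tau):
    """Gumbel-Hougaard: tau = 1 - 1 / theta, valid for 0 <= tau < 1."""
    return 1.0 / (1.0 - tau)


def gaussian_rho(tau):
    """Gaussian copula: tau = (2 / pi) * asin(rho)."""
    return math.sin(0.5 * math.pi * tau)


tau = 0.42  # sample Kendall's tau reported for the YJP dataset in Sect. 5
estimates = {
    "Clayton": clayton_theta(tau),
    "Gumbel-Hougaard": gumbel_theta(tau),
    "Gaussian": gaussian_rho(tau),
}
```

Each estimate requires only the sample value of Kendall’s tau, which is why the method is attractive whenever such an inversion is available.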

3.4.2 The maximum pseudolikelihood estimator

This method, also known as canonical maximum likelihood, is probably the most favourable method for assigning the copula parameters when identifying the model for joint dependence is the main aim of copula modelling. This method is based on maximizing the log-likelihood function of the copula model, which can be described for the bivariate case as follows:

$$ l(\theta ) = \sum\limits_{i = 1}^{n} {\log \left[ {c_{\theta } \left\{ {F\left( {X_{i} } \right),G\left( {Y_{i} } \right)} \right\}} \right]} $$
(8)

In the maximum pseudolikelihood estimator, the marginal distributions \( F(X_{i}) \) and \( G(Y_{i}) \) are replaced by their empirical representation within the sample (Eq. 6). This method provides a general rule for estimating the copula parameters and mainly requires implementing a numerical procedure. The maximum pseudolikelihood formulation has been used frequently in hydrology (e.g. Dupuis 2007; Poulin et al. 2007; Wang et al. 2009).
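A minimal sketch of this estimator is given below for the one-parameter Clayton copula. The choice of the Clayton family, the golden-section search, the search interval and the synthetic sample are all assumptions made for the example; they do not reproduce the MATLAB implementation used in this study.

```python
import math
import random


def pseudo_obs(values):
    """Rank / (n + 1), i.e. Eq. 6 evaluated at each observation (no ties assumed)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        u[idx] = rank / (n + 1)
    return u


def clayton_log_density(u, v, theta):
    """log c_theta(u, v) for the Clayton copula with theta > 0."""
    s = u ** -theta + v ** -theta - 1.0
    return (math.log1p(theta)
            - (theta + 1.0) * (math.log(u) + math.log(v))
            - (2.0 + 1.0 / theta) * math.log(s))


def pseudo_loglik(theta, u, v):
    """Eq. 8 with the margins replaced by pseudo-observations."""
    return sum(clayton_log_density(a, b, theta) for a, b in zip(u, v))


def golden_max(f, lo, hi, tol=1e-6):
    """Golden-section search for the maximiser of a unimodal f on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    return 0.5 * (a + b)


def sample_clayton(n, theta, rng):
    """Draw n pairs from a Clayton copula by conditional inversion."""
    pairs = []
    for _ in range(n):
        u, w = 1.0 - rng.random(), 1.0 - rng.random()  # both in (0, 1]
        v = (u ** -theta * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
        pairs.append((u, v))
    return pairs


rng = random.Random(0)
x, y = zip(*sample_clayton(500, 2.0, rng))  # synthetic data with true theta = 2
u, v = pseudo_obs(x), pseudo_obs(y)
theta_hat = golden_max(lambda t: pseudo_loglik(t, u, v), 0.05, 20.0)
```

Because only the ranks of the data enter the objective, the estimate is unaffected by any misspecification of the marginal distributions.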

3.4.3 Inference from margins (IFM)

In practical circumstances, it might be preferred to use the full likelihood estimator (e.g. Joe 1997; Shiau 2006; Shiau et al. 2006, 2007; Chowdhary and Singh 2008; Karamouz et al. 2009). In this case, the empirical distributions are replaced by the fitted parametric distributions of the marginal variables. Kim et al. (2007) argued that the success of IFM is closely related to the goodness-of-fit of the marginal distributions. Therefore, the estimated copula parameters run the risk of being misidentified because of an inappropriate choice of the marginal distributions or their parameters.

3.4.4 Minimum distance method

Recently, some authors proposed new procedures by minimizing the distance between the copula and the empirical joint distribution. The objective function in the bivariate case can be described as (Foscolo et al. 2008):

$$ L(\theta ) = \sum\limits_{i = 1}^{n} {\left[ {C_{\theta } \left( {\hat{F}_{n} \left( {x_{i} } \right),\hat{G}_{n} \left( {y_{i} } \right)} \right) - \hat{C}_{n} \left( {x_{i} ,y_{i} } \right)} \right]^{2} } $$
(9)

where \( C_{\theta } \) is the parametric copula, \( \hat{F}_{n} \left( {x_{i} } \right) \) and \( \hat{G}_{n} \left( {y_{i} } \right) \) are the marginal distributions (either empirical or theoretical) and \( \hat{C}_{n} \left( {x_{i} ,y_{i} } \right) \) denotes the rank-based empirical copula (Genest and Favre 2007):

$$ \hat{C}_{n} (u,v) = \frac{1}{n}\sum\limits_{i = 1}^{n} {1\left( {{\frac{{R_{i} }}{n + 1}} \le u,{\frac{{S_{i} }}{n + 1}} \le v} \right)} $$
(10)

where \( (R_{i}, S_{i}) \) is the pair of ranks of the ith observation within the sample. To the authors’ knowledge, there is still no application of this estimation method in the context of hydrology. Genest and Favre (2007) argued that this method results in sub-optimal solutions compared to the pseudolikelihood estimator. In contrast, Foscolo et al. (2008) concluded that when the copula structure is not specified correctly, the minimum distance estimator can provide a better parameterization. These arguments will be further explored in this paper.
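The empirical copula of Eq. 10 and a minimum distance fit can be sketched as follows. The sketch reads Eq. 9 as a squared distance between the parametric copula function and the empirical copula at the observed points, uses the Clayton family and empirical margins, and is run on synthetic data; all of these are assumptions made for illustration.

```python
import math
import random


def ranks(values):
    """Ranks 1..n of the observations, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r


def empirical_copula(r, s):
    """Return the empirical copula of Eq. 10 built from the rank pairs (R_i, S_i)."""
    n = len(r)

    def C_n(u, v):
        return sum(1 for ri, si in zip(r, s)
                   if ri / (n + 1) <= u and si / (n + 1) <= v) / n

    return C_n


def clayton_cdf(u, v, theta):
    """Clayton copula C_theta(u, v) for theta > 0."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


def distance(theta, u, v, c_emp):
    """Sum of squared differences between the Clayton and the empirical copula."""
    return sum((clayton_cdf(a, b, theta) - c) ** 2 for a, b, c in zip(u, v, c_emp))


def golden_min(f, lo, hi, tol=1e-6):
    """Golden-section search for the minimiser of a unimodal f on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    return 0.5 * (a + b)


# Synthetic Clayton sample (true theta = 2), generated by conditional inversion.
rng = random.Random(1)
x, y = [], []
for _ in range(300):
    a, w = 1.0 - rng.random(), 1.0 - rng.random()
    x.append(a)
    y.append((a ** -2.0 * (w ** (-2.0 / 3.0) - 1.0) + 1.0) ** -0.5)

n = len(x)
r, s = ranks(x), ranks(y)
u = [ri / (n + 1) for ri in r]
v = [si / (n + 1) for si in s]
C_n = empirical_copula(r, s)
c_emp = [C_n(ui, vi) for ui, vi in zip(u, v)]  # computed once, reused in the search
theta_md = golden_min(lambda t: distance(t, u, v, c_emp), 0.05, 20.0)
```

The empirical copula is evaluated once outside the search because it does not depend on the copula parameter.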

3.4.5 Other estimation methods

There are other estimation methods reported in the literature. As an illustration, for the Student metaelliptical copula, Demarta and McNeil (2004) proposed a two-stage approach in which the scale matrix is first estimated by using the relationship with Kendall’s tau, and then the pseudo-likelihood function is maximized with respect to the degree-of-freedom using the estimate of the scale matrix. Tsukahara (2005) used the estimating-equation approach to find the copula parameters. Some authors have used Bayesian approach to estimate copula parameters (e.g. Huard et al. 2006; Silva and Lopes 2008). There is also a growing body of literature that implements kernel methods to estimate a smooth representation of a copula without assuming any specific parametric family (e.g. Fermanian and Scaillet 2003).

4 Experimental study

4.1 Rationale and computational platform

Considering that the true model of dependence is unknown prior to the experiment, here the aim is to identify the most credible model of dependence for our dataset based on the top-down system presented in Fig. 1. In each modelling step, several competitive options should be examined in a rigorous falsification framework and the non-falsified options should pass to the next modelling step. The final delivery of this procedure would be a model of dependence that can be used for deriving the conditional probability of maximum annual water deficit based on the given values of annual cumulative evapotranspiration.

Computational platforms for copula modelling are gradually appearing (e.g. Yan 2007). In this study, MATLAB modelling environment (MATLAB 2010) is used. MATLAB Statistics Toolbox contains 20 statistical distributions for continuous univariate description of marginal variables. Also by applying Box–Cox transformation, the Normal distribution can be further extended to the transformed data. The parameters of marginal distributions are estimated using the maximum likelihood estimation. MATLAB also contains several multivariate distributions as well as copula-based model of dependence. By coding more copula-based multivariate descriptions, optimization-based parametric estimation methods, graphical inspections and goodness-of-fit tests, a unified computational package was assembled to conduct the experiments in this study.

4.2 Test space

Michiels and De Schepper (2008) discussed that the functional structure for describing multivariate data should be diverse and relevant. Table 2 shows different modelling options used in this study. Both methods of maximum likelihood and minimum distance are implemented in their full (considering theoretical marginal distributions) and pseudo (with the empirical marginal distributions) forms for identifying the models’ parameters. The method of moments was also used for the structures in which the model parameters have direct relationship with the measures of dependence.

Table 2 The multivariate test space implemented in this study

4.3 Initial ranking of the dependence models

Assume that the parametric copula structure \( C_{\theta } \) is fitted to the bivariate data \( \left( {X_{1} ,Y_{1} } \right), \ldots ,\left( {X_{n} ,Y_{n} } \right) \) with the marginal distributions \( F_{X}(x) \) and \( G_{Y}(y) \) and the empirical copula \( \hat{C} \). At this juncture, the goodness-of-fit measures can be calculated in two different ways. In the pseudo measures, the empirical marginal distributions are used, so the task is to check the ability of the model of dependence to couple the marginal variables. In the full goodness-of-fit measures, on the other hand, the marginal variables enter through their fitted theoretical distributions \( F_{X}(x) \) and \( G_{Y}(y) \), so the task is to check the complete goodness-of-fit of the parametric structure describing the dataset. Intuitively, copula models can be ranked based on their overall goodness-of-fit and the error contribution from their marginal quantification. The overall goodness-of-fit measure can be simply defined as the average of the full and pseudo goodness-of-fit measures. The difference between the corresponding full and pseudo goodness-of-fit measures shows how much of the total error is contributed by the error in the theoretical marginal distributions. In this study, the Bayesian information criterion (BIC) has been chosen for such analysis. BIC provides better distinction between modelling options and puts an emphasis on parametric parsimony by considering the number of model parameters. BIC can be formulated as follows:

$$ BIC = n\log (MSE) +[({\text{no}}\;{\text{of}}\,{\text{fitted}}\,{\text{parameters}})\times \log (n)] $$
(11)

where n is the number of observations and MSE is the mean square of error of the copula model. By applying this procedure to an ensemble of copula models, the potential candidates can be pruned by eliminating the highly dominated options, i.e., the models that strictly result in poorer goodness-of-fit measures compared to the other competing options.
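The ranking step can be illustrated with the following sketch of Eq. 11. The two candidate models and their MSE values are hypothetical and serve only to show the arithmetic; the same number of parameters is assumed for the full and pseudo measures of each model.

```python
import math


def bic_from_mse(mse, n_obs, n_params):
    """Eq. 11: n log(MSE) + (number of fitted parameters) log(n)."""
    return n_obs * math.log(mse) + n_params * math.log(n_obs)


def overall_and_marginal(bic_full, bic_pseudo):
    """Average of the full and pseudo measures, and their difference."""
    return 0.5 * (bic_full + bic_pseudo), bic_full - bic_pseudo


n = 56  # number of simulated pairs in the YJP dataset

# Hypothetical (full MSE, pseudo MSE, number of parameters) for two candidates.
candidates = {
    "model A": (4.0e-4, 2.5e-4, 1),
    "model B": (9.0e-4, 3.0e-4, 2),
}

scores = {}
for name, (mse_full, mse_pseudo, k) in candidates.items():
    scores[name] = overall_and_marginal(
        bic_from_mse(mse_full, n, k), bic_from_mse(mse_pseudo, n, k)
    )

# Lower BIC is better, so sort by the overall measure in ascending order.
ranking = sorted(scores, key=lambda name: scores[name][0])
```

In this hypothetical case, model A has both the lower overall measure and the smaller difference between its full and pseudo measures, so model B would be a dominated option.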

4.4 Secondary goodness-of-fit test and tail analysis

There is a recent trend in the literature towards proposing formal (also known as blanket or omnibus) hypothesis tests for goodness-of-fit evaluation of copula models (Fermanian 2005). The goodness-of-fit tests can be further extended and used as a way to test the equality between two copulas (e.g. Rémillard and Scaillet 2009). The main formal approach is to compute the proper p-values through bootstrapping procedures (Genest et al. 2006). These tests are based either on the Rosenblatt transformation (e.g. Berg 2009), which converts the multivariate problem into a univariate hypothesis test, or on the empirical copula (e.g. Genest et al. 2009), which is used to calculate Cramér–von Mises and Kolmogorov–Smirnov statistics. Tests based on the Rosenblatt transformation often require double bootstrapping and high-level computational resources, which might not be widely available in practice. Genest et al. (2009) argued that the tests based on the empirical copula are more objective and only require parametric bootstrapping if the analytic expression of the copula structure exists. Here, we implement the formal test introduced by Genest et al. (2009) based on the distance between the empirical copula and the copula under the null hypothesis. The Cramér–von Mises statistic for the bivariate case can be described as follows:

$$ \hat{T}_{n} = n\int\limits_{{[0,1]^{2} }} {\left\{ {\hat{C}(u,v) - C_{\theta } (u,v)} \right\}^{2} d\hat{C}(u,v) = \sum\limits_{j = 1}^{n} {\left\{ {\hat{C}\left( {u_{j} ,v_{j} } \right) - C_{\theta } \left( {u_{j} ,v_{j} } \right)} \right\}^{2} } } $$
(12)
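A simplified sketch of the statistic in Eq. 12 with a parametric bootstrap is given below. It is not the complete algorithm of Genest et al. (2009): the null hypothesis is taken to be a Clayton copula, the parameter is re-estimated in each replicate by inverting Kendall’s tau, and the sample is synthetic. These choices, and all function names, are assumptions made to keep the example short.

```python
import random
from itertools import combinations


def ranks(values):
    """Ranks 1..n of the observations, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r


def kendall_tau(x, y):
    """Sample Kendall's tau, assuming no ties."""
    n = len(x)
    score = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        score += (s > 0) - (s < 0)
    return 2.0 * score / (n * (n - 1))


def clayton_cdf(u, v, theta):
    """Clayton copula C_theta(u, v) for theta > 0."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


def fit_and_statistic(x, y):
    """Fit Clayton by inverting Kendall's tau, then evaluate the sum in Eq. 12."""
    n = len(x)
    u = [r / (n + 1) for r in ranks(x)]
    v = [s / (n + 1) for s in ranks(y)]
    tau = min(max(kendall_tau(x, y), 1e-3), 0.99)  # keep theta positive and finite
    theta = 2.0 * tau / (1.0 - tau)
    stat = 0.0
    for ui, vi in zip(u, v):
        c_emp = sum(1 for a, b in zip(u, v) if a <= ui and b <= vi) / n
        stat += (c_emp - clayton_cdf(ui, vi, theta)) ** 2
    return stat, theta


def sample_clayton(n, theta, rng):
    """Draw n pairs from a Clayton copula by conditional inversion."""
    pairs = []
    for _ in range(n):
        u, w = 1.0 - rng.random(), 1.0 - rng.random()
        v = (u ** -theta * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
        pairs.append((u, v))
    return pairs


def bootstrap_p_value(x, y, n_boot, rng):
    """Share of bootstrap statistics that exceed the observed statistic."""
    stat, theta = fit_and_statistic(x, y)
    exceed = 0
    for _ in range(n_boot):
        xb, yb = zip(*sample_clayton(len(x), theta, rng))
        stat_b, _ = fit_and_statistic(xb, yb)
        exceed += stat_b > stat
    return stat, exceed / n_boot


rng = random.Random(2)
x, y = zip(*sample_clayton(56, 2.0, rng))  # synthetic sample of 56 pairs
stat, p_value = bootstrap_p_value(x, y, 100, rng)
```

Re-estimating the parameter inside every replicate matters because the distribution of the statistic under the null hypothesis depends on the fact that the parameter is estimated from the same sample.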

For the complete bootstrapping algorithm, refer to Genest et al. (2009) or Berg (2009). Acceptance or rejection of a copula model is based on the p-value estimated through bootstrapping: small values suggest discarding the model, whereas large values support its suitability (e.g. Mesfioui et al. 2009; Durante and Salvadori 2010). However, as argued by Serinaldi (2009b), there might be circumstances where several models of dependence cannot be falsified through goodness-of-fit tests. In these circumstances, exploring the tail dependence of the non-rejected copulas can be useful. Tail dependence is a measure for describing the dependence in the upper tail or the lower tail of the joint space. These values for copula C(u,v) can be defined as follows (Joe 1997; Poulin et al. 2007):

$$ \begin{gathered} \lambda_{U} = \mathop {\lim }\limits_{{t \to 1^{ - } }} {\frac{1 - 2t + C(t,t)}{1 - t}} \hfill \\ \lambda_{L} = \mathop {\lim }\limits_{{t \to 0^{ + } }} {\frac{C(t,t)}{t}} \hfill \\ \end{gathered} $$
(13)

where λ U and λ L represent the upper and the lower tail dependence, respectively. Several authors have proposed empirical tail dependence estimators (e.g. Schmidt 2005; Schmidt and Stadtmüller 2006; Serinaldi 2008). One of the most frequently used empirical measures of upper tail dependence was suggested by Capéraà et al. (1997) and can be expressed as:

$$ \hat{\lambda}_{U}^{CFG} = 2 - 2\exp \left[ \frac{1}{n}\sum\limits_{i = 1}^{n} \log \left\{ \frac{\sqrt{\log \frac{1}{u_{i}} \log \frac{1}{v_{i}}}}{\log \frac{1}{\max \left( u_{i}, v_{i} \right)^{2}}} \right\} \right] $$
(14)

Comparing the copula-based upper tail dependence (λ U ) with the empirical measure (Eq. 14) also provides a means of investigating the performance of the applied parametric copula (e.g. Demarta and McNeil 2004; Bénard and Lang 2007; Charpentier and Segers 2009).
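
The comparison described above can be sketched as follows. This is a minimal sketch with hypothetical function names: `cfg_upper_tail` follows Eq. 14 and expects pseudo-observations strictly inside (0, 1), `upper_tail_ratio` evaluates the expression inside the upper limit of Eq. 13 at a finite t, and the Gumbel–Hougaard family is used only as an example because its upper tail dependence has the known closed form 2 − 2^(1/θ).

```python
import math

def cfg_upper_tail(u, v):
    """Empirical upper tail dependence (Eq. 14) from pseudo-observations in (0, 1)."""
    total = 0.0
    for ui, vi in zip(u, v):
        num = math.sqrt(math.log(1.0 / ui) * math.log(1.0 / vi))
        den = math.log(1.0 / max(ui, vi) ** 2)
        total += math.log(num / den)
    return 2.0 - 2.0 * math.exp(total / len(u))

def gumbel_cdf(u, v, theta):
    """Gumbel-Hougaard copula C_theta(u, v), theta >= 1 (example family only)."""
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-s ** (1.0 / theta))

def gumbel_upper_tail(theta):
    """Closed-form upper tail dependence of the Gumbel-Hougaard copula."""
    return 2.0 - 2.0 ** (1.0 / theta)

def upper_tail_ratio(copula_cdf, t):
    """Finite-t version of the upper limit in Eq. 13: (1 - 2t + C(t, t)) / (1 - t)."""
    return (1.0 - 2.0 * t + copula_cdf(t, t)) / (1.0 - t)
```

Evaluating `upper_tail_ratio` at t close to 1 gives a quick numerical check of a closed-form tail limit before it is compared with the empirical estimate.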

5 Results and discussion

Figure 2 shows the scatter and scatter-rank plots for the 56 pairs of maximum annual moisture deficit and annual cumulative evapotranspiration produced for the YJP site. The measures of dependence for the site are 0.60, 0.42, and 0.60 for the correlation coefficient, Kendall’s tau, and Spearman’s rho, respectively. The hypothesis tests based on both Kendall’s tau and Spearman’s rho confirm that the assumption of independence can be rejected at the 5% level.
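
To make the independence test concrete, the sketch below computes Kendall’s tau and a large-sample z-test of independence. It is a sketch only: the function names are hypothetical, ties are assumed absent, and the normal approximation with variance 2(2n + 5) / (9n(n − 1)) is a standard choice that may differ from the exact test applied in this study. The call at the end uses the rounded values reported above (tau = 0.42, n = 56).

```python
import math

def kendall_tau(x, y):
    """Kendall's tau from concordant minus discordant pairs; assumes no ties."""
    n = len(x)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return 2.0 * s / (n * (n - 1))

def tau_independence_test(tau, n):
    """z-score and two-sided p-value under H0 of independence (normal approximation)."""
    var = 2.0 * (2 * n + 5) / (9.0 * n * (n - 1))
    z = tau / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Rounded values reported for the YJP site: tau = 0.42 with n = 56 pairs.
z, p = tau_independence_test(0.42, 56)
```

Under these assumptions the z-score is about 4.6, and the p-value falls far below 0.05, which agrees with the rejection of independence reported above.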

Fig. 2

The scatter plot for maximum annual water deficit and annual cumulative evapotranspiration; a original scale, b rank scale

5.1 Marginal quantification

Twenty-one different univariate distributions were considered for marginal quantification of the annual maximum water deficit and the annual cumulative evapotranspiration. For the former variable, the Box–Cox transformed Normal distribution was selected as the theoretical description of the data, as it was the only distribution that passed both the chi-squared and KS tests at the 5% level. For the latter variable, however, both the Beta distribution and the Box–Cox transformed Normal distribution satisfied one of the applied hypothesis tests. Table 3 shows the characteristics of the fitted marginal distributions and their corresponding goodness-of-fit measures. Figure 3 shows the fitted and empirical marginal distributions.

Table 3 The non-falsified marginal distributions for annual maximum water deficit and annual cumulative evapotranspiration along with their specifications
Fig. 3

Fitted versus empirical CDFs of marginal variables; a annual maximum water deficit fitted with Box–Cox, Normal distribution; b annual cumulative evapotranspiration fitted with Beta distribution; c annual cumulative evapotranspiration fitted with Box–Cox, Normal distribution

5.2 Multivariate modelling

Considering the non-falsified marginal distributions, two different approaches can be taken for building the multivariate description of the data. In the first approach, the Box–Cox transformed Normal distribution is used as the univariate description of both marginal variables. As a result, classical multivariate distributions can also be implemented for quantifying the bivariate dependence within the data. In the second approach, however, only copula-based models of dependence are used, since the marginal variables are described using different theoretical distributions. Based on the possible options for marginal distributions, multivariate structures, and parameter estimation methods (Table 2), an ensemble of 106 parametric structures was developed for describing the interdependence between the annual maximum water deficit and the annual cumulative evapotranspiration. The characteristics of the fitted models should be carefully investigated before further implementation. Wide convergence limits can indicate a lack of parametric robustness and should be taken as a reason for falsifying the fitted model. Poor goodness-of-fit measures can likewise be interpreted as a sign of model insufficiency. Based on this logic, 55 models were falsified because of a lack of appropriate convergence or very poor goodness-of-fit measures. Tables 4 and 5 show the 51 non-falsified multivariate models and their associated goodness-of-fit measures. Table 4 refers to the models in which the maximum annual water deficit and the annual cumulative evapotranspiration are described by the Box–Cox transformed Normal and Beta distributions, respectively, whereas in Table 5 both margins are fitted using the Box–Cox transformed Normal distribution.

Table 4 The feasible multivariate models developed for describing the maximum annual water deficit (fitted with Box–Cox, Normal distribution) and the annual cumulative evapotranspiration (fitted with Beta distribution)
Table 5 The feasible multivariate models developed for describing the maximum annual water deficit and the annual cumulative evapotranspiration when both marginal variables are fitted with Box–Cox, Normal distribution

Considering the BIC goodness-of-fit measures for both the full and pseudo representations of the data, Fig. 4 illustrates distinct clusters of the 51 models. In general, three clusters can be observed, from which two general arguments can be made. First, considering Table 5, the classical multivariate options are strictly dominated by the copula-based models and can therefore be eliminated from the potential solutions. Second, the majority of the feasible options (in particular in the best cluster) are estimated through either the method of moments or the minimum distance criterion. Fewer than 13% of the feasible options in the best cluster are derived using maximum likelihood methods, and these relate only to metaelliptical copulas (the t-copula and Gaussian options reported in rows 2, 4, and 6 of Table 4). This observation might be a hint in favour of further use of minimum distance estimators in copula parameter identification. This issue is investigated further in the next section.

Fig. 4

Different clusters of the feasible models in terms of their pseudo and full BIC goodness-of-fit measures. The dots and triangles represent the feasible solutions reported in Tables 4 and 5, respectively

5.3 Model selection

The 39 copula-based models located in the best modelling cluster (shown in Fig. 4) were considered for initial ranking using the procedure described in Sect. 4.3. Figure 5 shows the locations of the solutions on the surface defined by the efficiency criteria. The Pareto candidates are marked with a star (*) in Fig. 5 as well as in Tables 4 and 5. These eleven models were used for the secondary goodness-of-fit test and tail analysis based on the procedure given in Sect. 4.4. Table 6 lists the name, parameter value, and parameter estimation method of the non-dominated models, along with their upper tail limits, T n statistics, and p-values obtained from bootstrapping with 10,000 random series of the same length as the original data. According to the p-values, the Gaussian and Gumbel–Hougaard options are superior to the other options. However, only the Gumbel–Hougaard options provide an upper tail limit close to the empirical one. Based on Eq. 14, the empirical upper tail limit is 0.50, whereas the Gaussian copula theoretically implies tail independence. Among the non-dominated Gumbel–Hougaard options, the structure calibrated by the method of moments is preferred based on its estimated p-value and its closer upper tail limit relative to the sample-based measure. This copula model is selected for quantifying the conditional probability of maximum annual water deficit with respect to the annual cumulative evapotranspiration.
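
The selection of non-dominated (Pareto) candidates can be sketched generically as below. This is an illustration only: the function name and the sample points are hypothetical, and it assumes that each model is scored on criteria that are all to be minimized, which may not match the exact efficiency criteria of Sect. 4.3.

```python
def pareto_front(points):
    """Return indices of non-dominated points, assuming every criterion is minimized."""
    front = []
    for i, p in enumerate(points):
        dominated = False
        for j, q in enumerate(points):
            if j == i:
                continue
            no_worse = all(qk <= pk for qk, pk in zip(q, p))
            strictly_better = any(qk < pk for qk, pk in zip(q, p))
            if no_worse and strictly_better:
                dominated = True
                break
        if not dominated:
            front.append(i)
    return front

# Hypothetical scores (overall error, marginal error contribution) for five models.
scores = [(1.0, 5.0), (2.0, 2.0), (5.0, 1.0), (3.0, 3.0), (2.0, 6.0)]
candidates = pareto_front(scores)
```

In this made-up example the first three models are retained, because no other model is at least as good on both criteria and strictly better on one.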

Fig. 5

The location of solutions within the best modelling cluster (Fig. 4) in the surface defined by measures introduced for overall accuracy and the error contribution from marginal variables. The stars refer to the non-dominated solutions

Table 6 The upper tail dependence and the goodness-of-fit test results for the Pareto candidates shown in Fig. 5; p-values are extracted by 10,000 bootstrap samples

Figure 6 compares the empirical cumulative bivariate distribution (Eq. 10) with the selected copula-based distribution in both the pseudo and full representations. Comparing the pseudo and full subfigures explicitly shows the error contribution originating from misspecification of the theoretical marginal distributions.

Fig. 6

The empirical (empty dots) versus fitted joint cumulative distribution (lines) for maximum annual water deficit and annual cumulative evapotranspiration in both pseudo and full representation based on the selected Gumbel–Hougaard model

Before extracting further information from the fitted joint model, it is worthwhile to compare the performance of the different estimation methods. To this end, a bootstrapping study was conducted to compare the robustness of three Gumbel–Hougaard models calibrated using the method of moments, minimum distance, and maximum likelihood, respectively. The specifications of the first two models are reported in Table 6. The third model has the same margins as the first two but was calibrated using maximum likelihood. This model was falsified in our preliminary inspection because of the wide distance between the lower and upper bounds of parametric convergence. For each model, 1000 bivariate samples were generated from the fitted structure, and the model parameters were re-estimated from the generated samples using each of the estimation methods. Figure 7 shows the results of this experiment. The horizontal line shows the Gumbel–Hougaard parameter estimated using the original data set and Kendall’s tau. As can be readily seen, the method of maximum likelihood suffers from a considerable lack of robustness compared with both the method of moments and minimum distance.
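
A reduced version of such a bootstrapping experiment is sketched below for the method of moments only. It is a sketch under several assumptions rather than the procedure used in this study: the function names are hypothetical, the standard Gumbel–Hougaard form with θ ≥ 1 and the moment relation θ = 1 / (1 − τ) are assumed, samples are drawn on the copula scale by conditional inversion with bisection, and the minimum distance and maximum likelihood estimators are omitted.

```python
import math
import random

def gumbel_conditional(u, v, theta):
    """P(V <= v | U = u) for the Gumbel-Hougaard copula (theta >= 1)."""
    a, b = -math.log(u), -math.log(v)
    s = a ** theta + b ** theta
    c = math.exp(-s ** (1.0 / theta))
    return c * s ** (1.0 / theta - 1.0) * a ** (theta - 1.0) / u

def sample_gumbel(n, theta, rng):
    """Draw n pairs (u, v) by conditional inversion; v is found by bisection."""
    pairs = []
    for _ in range(n):
        u = rng.uniform(1e-9, 1.0 - 1e-9)
        w = rng.random()
        lo, hi = 1e-12, 1.0 - 1e-12
        for _ in range(60):
            mid = 0.5 * (lo + hi)
            if gumbel_conditional(u, mid, theta) < w:
                lo = mid
            else:
                hi = mid
        pairs.append((u, 0.5 * (lo + hi)))
    return pairs

def kendall_tau(pairs):
    """Kendall's tau from concordant minus discordant pairs; assumes no ties."""
    n = len(pairs)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            prod = (pairs[i][0] - pairs[j][0]) * (pairs[i][1] - pairs[j][1])
            s += (prod > 0) - (prod < 0)
    return 2.0 * s / (n * (n - 1))

def theta_from_tau(tau):
    """Moment relation theta = 1 / (1 - tau), floored at 1; assumes tau < 1."""
    return max(1.0, 1.0 / (1.0 - tau))

def bootstrap_theta_mom(theta, n, replicates, rng):
    """Re-estimate theta by the method of moments on simulated samples of size n."""
    return [
        theta_from_tau(kendall_tau(sample_gumbel(n, theta, rng)))
        for _ in range(replicates)
    ]
```

For example, `bootstrap_theta_mom(1.7, 56, 1000, random.Random(1))` returns 1000 re-estimated parameters whose spread can be inspected in the same spirit as Fig. 7; the value 1.7 is an arbitrary illustration. With the rounded Kendall’s tau of 0.42 reported earlier, the assumed moment relation gives θ ≈ 1.72, which should not be read as the exact value listed in Table 6.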

Fig. 7

The variation in the sampled copula parameters in the three Gumbel–Hougaard models calibrated using the method of moments, minimum distance and maximum likelihood, respectively

Based on the Gumbel–Hougaard copula fitted using the method of moments, the conditional probability can be estimated from the following equation:

$$ C_{\theta }^{u} (v) = P\left\{ {V \le v|U = u} \right\} = {\frac{\partial }{\partial u}}C(u,v) $$
(15)
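
For reference, and assuming the standard one-parameter form of the Gumbel–Hougaard copula with θ ≥ 1 (this parameterization is an assumption here and is not restated from the earlier sections), the derivative in Eq. 15 has a closed form:

$$ C_{\theta}(u,v) = \exp\left\{ -\left[ (-\ln u)^{\theta} + (-\ln v)^{\theta} \right]^{1/\theta} \right\} $$

$$ C_{\theta}^{u}(v) = \frac{\partial}{\partial u} C_{\theta}(u,v) = C_{\theta}(u,v)\,\frac{(-\ln u)^{\theta - 1}}{u}\left[ (-\ln u)^{\theta} + (-\ln v)^{\theta} \right]^{1/\theta - 1} $$

Setting θ = 1 recovers the independence case, in which the conditional probability reduces to v. Solving the second expression for v at a fixed u and a target probability gives the corresponding conditional quantile on the copula scale, which is then mapped back to the original variable through the inverse of its marginal distribution.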

Figure 8 provides a graphical presentation of the conditional probabilities. Based on this figure, and given a value for annual cumulative evapotranspiration, the possible quantities and the corresponding CDF for maximum annual water deficit can be identified. Figure 9 compares the CDFs of maximum annual water deficit when the annual cumulative evapotranspiration estimates are at the 10th, 50th and 90th percentiles. The joint model not only captures the uncertainties in maximum annual water deficit with respect to the estimates of annual cumulative evapotranspiration, but can also be used in nearby reconstructed watersheds with similar physical characteristics and soil cover, where soil-moisture measurements are not available to directly assess the performance of the reclamation soil cover.

Fig. 8

Conditional CDF for maximum annual water deficit given the annual cumulative evapotranspiration, derived for the YJP prototype site

Fig. 9

Different CDFs for maximum annual water deficit given the annual cumulative evapotranspiration at its 10, 50 and 90% exceedance probabilities, derived for the YJP prototype site

6 Summary and conclusion

The majority of hydrologic variables are interdependent. Considering the joint characteristics in the case of dependence not only provides a more realistic representation of the involved variables, but also offers a means of quantifying one variable with respect to the others. In this study, the interdependence between annual cumulative evapotranspiration and the maximum annual water deficit in reconstructed watersheds was studied. Determining the maximum annual water deficit requires extensive measurements of soil moisture at different depths, which are available in only a few prototype watersheds. On the other hand, estimating the annual cumulative evapotranspiration can be much easier in practice. Therefore, if the joint model of interdependence is known, the maximum annual water deficit can be approximated from the estimated annual cumulative evapotranspiration. The current study has applied the copula framework to this problem. A diverse test space was considered for setting up the marginal distributions, describing the multivariate joint model and estimating the models’ parameters. A simple framework for initial ranking of the copula models was proposed. A goodness-of-fit test and tail analysis were applied to select the most credible copula model among an ensemble of potential candidates. It was concluded that the Gumbel–Hougaard copula provides the most credible model of dependence for the considered dataset. Moreover, it seems that the method of moments can provide the most reliable way of adjusting the copula parameters. However, this requires a direct relationship between the measures of dependence and the model parameters, which is not available for many copula structures. Comparing the general rules of copula parameter estimation, i.e. maximum likelihood versus minimum distance, the results of this study supported the superiority of the latter technique; however, this can by no means be extended to other problems and/or datasets. Further research in this direction is suggested.

Copula modelling is potentially useful in various fields within the scope of hydrology and environmental modelling. However, it should be noted that copula modelling is valid if and only if the considered variables and their margins are continuous random variables; Embrechts (2009) warned that discrete margins can cause severe problems. The works of Genest and Nešlehová (2007) as well as Mesiar and Komorníková (2009) on discrete copulas and copulas based on partial knowledge can be promising. Also, copula modelling, at least in the field of hydrology and environmental modelling, has mainly been considered for constructing static stochastic distributions rather than for describing multivariate stochastic processes. Describing the dependence within dynamic datasets is still fairly untouched in the field and is a very interesting topic for future investigations. Regarding the application context reported in this paper, one can question the uncertainties propagated into the copula modelling by the uncertainties in the applied field measurements and/or hydrologic model. This will be the next step in our investigation.