1 Introduction and literature review

One of the main objectives of regional frequency analysis (RFA) is the estimation of extreme event quantiles (e.g. floods and droughts) at sites where little or no hydrological data are available. In general, RFA procedures have two main steps, namely the delineation of homogeneous regions (DHR) and regional estimation (RE) (e.g. Chebana and Ouarda 2007, 2008; Ouarda et al. 2008a). For each of these two steps, a large number of methodologies have been proposed (Ouarda et al. 2008b). Canonical correlation analysis (CCA) is one of the most commonly used methods for DHR. It consists in identifying linear combinations of the variables within each group for which the canonical correlation is maximal. Ouarda et al. (2008a) demonstrated the advantages of CCA by comparing its performance to other techniques such as the hierarchical cluster analysis approach. However, note that in Shu and Ouarda (2007), CCA was used not for the DHR step, but to form a canonical physiographic space over which an artificial neural network (ANN) is then employed to estimate flood quantiles.

CCA is an important statistical tool for multivariate data analysis. However, its results are often difficult to interpret. In addition, this approach is linear by construction and, hence, is not able to adequately describe non-linear relationships between variables. Therefore, CCA may not be suitable for representing hydrological processes in the DHR step. Two groups of variables are usually considered in RFA: (i) hydrological variables and (ii) meteorological and/or physiographical characteristics of the watersheds (Ouarda 2013). Hydrological processes are relatively complex because of the variability in the response of watersheds, which does not generally result from a linear relationship between the hydrological and the physiographical characteristics (e.g. Chen et al. 2008; Xu et al. 2010; Chebana et al. 2014). Hydrological processes and their inherent non-linearities cannot be adequately represented by linear relationships. One aspect of the non-linearity is represented by the rainfall-runoff relationship. Indeed, the variations of meteorological variables and flows are linked by a non-linear relationship (Riad and Mania 2004). This non-linear behavior depends strongly on the physiographic characteristics of the watersheds. For instance, surface runoff is strongly influenced by the soil storage capacity and soil infiltration.

A number of statistical tools have been proposed in the literature to deal with the additional complexity associated to non-linearity in a variety of fields (e.g. Bolton et al. 2003; Yin 2007). Among the proposed techniques, we can mention non-linear principal component analysis (NL-PCA) (Rumelhart et al. 1985; Kramer 1991) and non-linear CCA (NL-CCA) (Dauxois and Nkiet 1998; Hsieh 2000). NL-PCA has been applied in various fields such as chemistry (Kramer 1991), image processing (Botelho et al. 2005) and atmospheric sciences (e.g. Sengupta and Boyle 1995; Monahan 2000). Sengupta and Boyle (1995) applied NL-PCA to average monthly rainfall data in the United States. Compared to conventional PCA, results showed that the non-linear approach is a more effective data reduction tool. It was also demonstrated that NL-PCA represented better the variation of variables than ordinary PCA. However, this method presents some technical drawbacks (Malthouse 1998).

Although the above constraints of NL-PCA also persist for NL-CCA (Hsieh 2000), the latter seems to provide better results than CCA. NL-CCA has been used in several fields, such as voice conversion (e.g. Zhihua and Zhen 2010), biomedicine (e.g. Campi et al. 2013), medicine (e.g. Wang et al. 2005) and sociology (e.g. Frie and Janssen 2009). A number of techniques related to NL-CCA have been proposed in the literature. For instance, Dauxois and Nkiet (1998) introduced measures of association between two random variables based on NL-CCA. Among the most studied non-linear methods associated with CCA, we can mention the neural network approach (NN) (Hsieh 2000), genetic algorithms (GA) (Kruger et al. 2004) and kernel-based methods (Akaho 2001; Hardoon and Shawe-Taylor 2009). Recently, Nagai (2013) proposed an approach based on cross validation to optimize the NL-CCA parameters. In terms of applications, the non-linear method based on NN was adopted in a number of studies in meteorology and climatology. For example, Wu and Hsieh (2002) studied the El Niño-Southern Oscillation using NL-CCA based on the NN approach (CCA-NN). They showed the ability of CCA-NN to detect non-linearity between surface wind stress and sea surface temperature. Hsieh (2001) also applied CCA-NN to study the relationship between sea level pressure in the tropical Pacific and sea surface temperature. Results revealed the ability of this model to characterize non-linearity between variables, which was not the case with the conventional CCA.

Other studies have addressed the non-linear aspects of categorical (qualitative) variables. Gifi (1990) presented two techniques and algorithms, namely OVERALS and CANALS, to deal with such qualitative variables. However, the variables treated in RFA are quantitative and continuous. Therefore, the latter methods are not applicable in the context of the present study. In Table 1, all non-linear approaches discussed previously are summarised, including their advantages and drawbacks. Note that methods designed for quantitative variables are more flexible than those for categorical ones.

Table 1 Summary of common methods of NL-CCA

Despite strong evidence concerning the non-linearity of hydrological processes, NL-CCA approaches have not yet been considered in hydrology. In RFA, non-linear approaches can account for possible non-linearities in order to determine the most representative homogeneous regions and lead to a better regional estimation. The purpose of the present paper is to deal with the issue of non-linearity in RFA by introducing NL-CCA in the DHR step in order to improve its performance and representativeness.

The present paper is organized as follows: In the following section, the potential of NL-CCA in the DHR step is developed. In order to verify and validate the usefulness of the NL-CCA approach for the modelling of hydrological processes, a comparative study is carried out in Sect. 3 using three different datasets from North America (Quebec, Arkansas and Texas). These approaches are used in the delineation of hydrological neighborhoods where the obtained results are presented and discussed in Sect. 4. The conclusions of this work are reported in Sect. 5.

2 Background and methodology

In this section we present a brief description of the use of CCA in RFA, as well as a description of the NL-CCA method and its application to RFA.

2.1 Canonical correlation analysis in RFA

CCA is a multivariate analysis method used to identify the correlations that may exist between two groups of variables. It has been applied in a number of fields, such as seasonal climate forecasting (e.g. Barnett and Preinsendorfer 1987), management science (e.g. Tishlert and Lipovetsky 1996), forecasting of accident risk modeling (e.g. Michael and Raymond 2003), river thermal regime modeling (e.g. Guillemette et al. 2009), water quality estimation (e.g. Khalil et al. 2011) and especially flood frequency estimation (e.g. Ouarda et al. 2001).

As mentioned above, in RFA, variables of interest are mainly hydrological and physiographical variables. We denote by Y the vector describing hydrological variables, and by X the vector containing meteorological and/or physiographical variables. Considering linear combinations of variables \( X_{1} ,X_{2} , \ldots ,\,X_{q} \) and \( Y_{1} ,Y_{2} , \ldots ,\;Y_{r} \), we obtain a new canonical space composed of canonical vectors \( U_{i} \) and \( V_{i} \) such that:

$$ U_{i} = a_{i1} X_{1} + a_{i2} X_{2} + \cdots + a_{iq} X_{q} $$
(1)
$$ V_{i} = b_{i1} Y_{1} + b_{i2} Y_{2} + \cdots + b_{ir} Y_{r} $$
(2)

where \( i = 1, \ldots ,p \) with \( p = \hbox{min} \left( {r,q} \right) \). The canonical space is built under constraints of unit variance and maximum correlation between pairs of canonical variables. Let \( \Lambda \) be a p-by-p diagonal matrix composed of canonical correlation coefficients given by:

$$ \lambda_{i} = corr\left( {U_{i} ,V_{i} } \right);\quad i = 1, \ldots ,p $$
(3)

Once the first pair of canonical variables \( (U_{1} ,\,V_{1} )_{p} \) is obtained, other canonical pairs are obtained subject to the constraint \( corr\left( {U_{i} ,V_{j} } \right) = 0 \) for i ≠ j. Note that all distinct hydrological canonical variables (as well as distinct physiographical variables) are also uncorrelated (Ouarda et al. 2001).

In order to improve quantile estimations in RFA, CCA is commonly used for the determination of neighborhoods of target sites. For an ungauged site, the canonical meteorological-physiographical information \( U_{0} \) is usually known but the hydrological information \( V_{0} \) is not available. The hydrological mean position of the target site S is given by \( \Lambda U_{0} \). Hence, a 100(1 − α) % confidence level neighborhood is identified using the Mahalanobis distance between the mean position of the target site \( \Lambda U_{0} \) and the positions of other sites V, such that:

$$ \left( {V - \Lambda U_{0} } \right)^{\prime } \left( {I_{p} - \Lambda^{2} } \right)^{ - 1} \left( {V - \Lambda U_{0} } \right) \le \chi_{\alpha ,p}^{2} $$
(4)

where \( P( {\chi_{p}^{2}\, \le\, \chi_{\alpha ,p}^{2} } ) = 1 - \alpha \) and \( \chi_{p}^{2} \) has a Chi squared distribution with p degrees of freedom. Expression (4) is used to define an ellipsoid representing the neighborhood region for the ungauged site associated with \( \Lambda U_{0} \) (Ouarda et al. 2001). The equation of the ellipsoid has the following form:

$$ \frac{{\left( {V_{1} - {{\Lambda}_{1}} U_{01} } \right)^{2} }}{{a^{2} }} + \frac{{\left( {V_{2} - {{\Lambda}_{2}} U_{02} } \right)^{2} }}{{b^{2} }} = 1 $$
(5)

where V1 and V2 denote the hydrological canonical variables, \( \Lambda_{1} \) and \( \Lambda_{2} \) are the canonical correlation coefficients, (\( {{\Lambda}_{1}} U_{01} \), \( {{\Lambda}_{2}} U_{02} \)) are the coordinates of the center of the ellipsoid and a and b denote respectively the semi-major axis (or focal) and the semi-minor axis (Ballard 1981). Expression (5) is the equation of an ellipsoid in an orthonormal base (two orthogonal unit vectors), where axes are parallel to the coordinate system axes.
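As a rough illustration of expression (4) in the two-dimensional case (p = 2), the sketch below selects the neighbors of a target site. It assumes that \( \Lambda \) is diagonal, so that the quadratic form reduces to a sum of scaled squared deviations, and it uses the closed form \( \chi_{\alpha ,2}^{2} = - 2\ln \alpha \), which holds only for two degrees of freedom. The function name `neighborhood` and the canonical scores are hypothetical.

```python
import math

def neighborhood(v_sites, u0, lambdas, alpha):
    """Return indices of sites whose canonical scores satisfy expression (4).

    Assumes p = 2 and a diagonal matrix of canonical correlations, so that
    (I - Lambda^2)^-1 is diagonal with entries 1 / (1 - lambda_i^2).
    """
    # For 2 degrees of freedom, P(chi2 <= c) = 1 - exp(-c / 2),
    # hence the (1 - alpha) quantile is -2 * ln(alpha).
    chi2_quantile = -2.0 * math.log(alpha)
    selected = []
    for idx, v in enumerate(v_sites):
        d2 = sum(
            (v_i - lam * u_i) ** 2 / (1.0 - lam ** 2)
            for v_i, u_i, lam in zip(v, u0, lambdas)
        )
        if d2 <= chi2_quantile:
            selected.append(idx)
    return selected

# Made-up canonical scores, for illustration only.
lambdas = (0.81, 0.27)
u0 = (0.5, -0.2)
v_sites = [(0.405, -0.054), (3.0, 3.0), (0.9, 0.3)]
neighbors = neighborhood(v_sites, u0, lambdas, alpha=0.2)
```

Under these assumptions, comparing (4) with (5) gives \( a^{2} = \chi_{\alpha ,2}^{2} (1 - \Lambda_{1}^{2} ) \) and \( b^{2} = \chi_{\alpha ,2}^{2} (1 - \Lambda_{2}^{2} ) \), so a smaller α produces a larger ellipse and hence more neighbors.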

2.2 Nonlinear CCA using a neural network approach (CCA-NN)

An artificial neural network (ANN) is a fairly simple mathematical model whose design is inspired by the functioning of biological neurons (Bishop 1995). It consists essentially of several neurons generally organized in layers. The output of each neuron results from the weighted sum of its inputs, transformed by a transfer function. Different transfer functions can be used (Duch and Jankowski 1999). ANNs have been widely used in a number of fields, such as in geology where Li et al. (2014) utilized the back-propagation (BP) neural network approach to forecast the geological hazard linked to bank destruction and landslides, and in hydrology where Zaier et al. (2010) used ANNs to model lake ice thickness, and Chen et al. (2014) used ANNs to model the rainfall-runoff relationship. As previously indicated, ANNs were integrated in RFA for instance by Ouarda and Shu (2009) and by Aziz et al. (2014) for the estimation of flood quantiles at ungauged sites.

In the meteorological field, Hsieh (2000) developed a NL-CCA version based on ANN (CCA-NN). The CCA-NN approach consists in establishing non-linear combinations between groups of original variables (X and Y) and the new canonical variables (U and V) via a transfer function. Consider the following hidden layer:

$$ h_{k}^{(x)} = f\left( {\left( {W^{(x)} x + b^{(x)} } \right)_{k} } \right);\quad k\,{\text{and}}\,n = 1,\ldots,l $$
(6)
$$ h_{n}^{(y)} = f\left( {\left( {W^{(y)} y + b^{(y)} } \right)_{n} } \right) $$
(7)

where \( W^{(x)} \) and \( W^{(y)} \) are weight matrices, \( b^{(x)} \) and \( b^{(y)} \) are vectors of bias parameters, k and n denote respectively the indices of the elements of the vectors \( h^{(x)} \) and \( h^{(y)} \), and l denotes the number of hidden neurons. The transfer function f, the same for x and y, is generally set to the hyperbolic tangent function (Hsieh 2000):

$$ f\left( x \right) = \frac{{e^{x} - e^{ - x} }}{{e^{x} + e^{ - x} }} $$
(8)

Multivariate canonical neurons U and V are determined from a linear combination of respective neurons \( h^{(x)} \) and \( h^{(y)} \) (but from a non-linear combination with respect to x and y):

$$ U = w^{(x)} h^{(x)} + \overline{b}^{(x)} $$
(9)
$$ V = w^{(y)} h^{(y)} + \overline{b}^{(y)} $$
(10)

Without loss of generality, U and V are assumed to have zero mean. Thus, we have

$$ \overline{b}^{(x)} = - \left\langle {w^{(x)} h^{(x)} } \right\rangle \quad {\text{and}}\quad \,\overline{b}^{(y)} = - \left\langle {w^{(y)} h^{(y)} } \right\rangle $$
(11)

where \( \left\langle z \right\rangle \) is the empirical mean of variable z.
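The forward mapping of Eqs. (6), (9) and (11) can be sketched in a few lines. The weights below are arbitrary, untrained values chosen only to make the example run; in CCA-NN they would be obtained by optimizing the cost function, which is not shown here. The function names are hypothetical.

```python
import math

def hidden_layer(x, weights, biases):
    """Eq. (6): h_k = tanh((W x + b)_k) for each hidden neuron k."""
    return [
        math.tanh(sum(w_kj * x_j for w_kj, x_j in zip(row, x)) + b_k)
        for row, b_k in zip(weights, biases)
    ]

def canonical_variate(samples, weights, biases, out_weights):
    """Eqs. (9) and (11): U = w . h + b_bar, with b_bar = -<w . h>."""
    raw = []
    for x in samples:
        h = hidden_layer(x, weights, biases)
        raw.append(sum(w_k * h_k for w_k, h_k in zip(out_weights, h)))
    b_bar = -sum(raw) / len(raw)   # enforces a zero empirical mean
    return [r + b_bar for r in raw]

# Hypothetical, untrained parameters: 3 inputs, l = 2 hidden neurons.
W_x = [[0.4, -0.2, 0.1], [0.3, 0.5, -0.6]]
b_x = [0.1, -0.3]
w_x = [0.7, -1.2]
X = [[1.0, 0.5, -0.3], [0.2, -1.1, 0.8], [-0.6, 0.4, 1.5], [0.9, 0.9, -0.2]]
U = canonical_variate(X, W_x, b_x, w_x)
```

The same two functions apply to the hydrological side, with \( W^{(y)} \), \( b^{(y)} \) and \( w^{(y)} \) in place of their x counterparts, to obtain V.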

A limitation of the CCA-NN is that, once applied to the original data, it provides only one pair of canonical variables, i.e. one for the physiographical variables and one for the hydrological variables. This may lead to ignoring a part of the information since it is not guaranteed that the first pair of canonical variables covers a significant part of the explained variance. To overcome this problem, the notion of modes was considered (Hsieh 2000). It consists in applying CCA-NN on the original datasets. The obtained result, denoted \( x^{\prime} \), is related to the first mode. For the second mode, the CCA-NN is applied to the initial data, i.e. the set x, excluding the first mode. In other words, we determine the unexplained information in the previous mode by reapplying the procedure on the new variables:

$$ I_{2} = x - x^{\prime} $$
(12)

Based on Eq. (12) we get:

$$ J_{2} = y - y^{\prime} $$
(13)

where y′ is the result of the first iteration, y is the matrix of original data.

The same procedure applies for higher order modes by considering each time the residual of the previous mode as input. The number of iterations, m, should be at least equal to the lowest number of variables, p in our case. The final result consists in summing up the results of all considered iterations:

$$ x_{estimated} = x^{\prime} + x^{\prime\prime} + \cdots + x^{m} $$
(14)

where \( x^{m} \) is the result of the m-th iteration, \( m \ge p \). Therefore, the use of several modes may increase the percentage of the information contained in the resulting canonical variables.
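The residual scheme of Eqs. (12) and (14) can be sketched as follows. Fitting a CCA-NN is beyond a short example, so the hypothetical `fit_mode` below is only a stand-in: it returns a rank-one linear approximation of its input (projection onto the leading right singular direction, found by power iteration) and is not the neural network model. Only the loop that feeds each residual into the next mode reflects the procedure described above.

```python
import math

def fit_mode(data, n_iter=200):
    """Stand-in for one CCA-NN mode: rank-one approximation of `data`.

    NOT the neural model; each row is projected onto the leading right
    singular direction, found by power iteration on data^T data.
    """
    q = len(data[0])
    w = [1.0] * q
    for _ in range(n_iter):
        # w <- data^T (data w), then normalise
        scores = [sum(r_j * w_j for r_j, w_j in zip(row, w)) for row in data]
        w = [sum(s * row[j] for s, row in zip(scores, data)) for j in range(q)]
        norm = math.sqrt(sum(w_j ** 2 for w_j in w))
        if norm == 0.0:
            return [[0.0] * q for _ in data]   # nothing left to explain
        w = [w_j / norm for w_j in w]
    scores = [sum(r_j * w_j for r_j, w_j in zip(row, w)) for row in data]
    return [[s * w_j for w_j in w] for s in scores]

def extract_modes(x, n_modes):
    """Eqs. (12) and (14): fit each mode on the residual of the previous one."""
    residual = [row[:] for row in x]
    estimate = [[0.0] * len(x[0]) for _ in x]
    for _ in range(n_modes):
        mode = fit_mode(residual)
        estimate = [[e + m for e, m in zip(er, mr)]
                    for er, mr in zip(estimate, mode)]
        residual = [[r - m for r, m in zip(rr, mr)]
                    for rr, mr in zip(residual, mode)]
    return estimate, residual

x = [[1.0, 2.0], [2.0, 1.5], [3.0, 3.5], [4.0, 3.0]]
x_estimated, leftover = extract_modes(x, n_modes=2)
```

With this linear stand-in, two modes are enough to reproduce a two-column data set, so `leftover` is numerically zero. With CCA-NN the residual after m modes is generally not zero, which is why the number of modes is a modelling choice.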

2.3 Adaptation of CCA-NN to regional frequency analysis

For more clarity and to avoid confusion, it is important to note that in the approach proposed by Shu and Ouarda (2007), a CCA-based ANN model is used for flood quantile estimation without considering the DHR step, and the CCA employed there is the linear one. The aim of the linear CCA in Shu and Ouarda (2007) is to filter the signal from the original data and apply the ANN model on the canonical variables. In the present work, however, the non-linear version of CCA using ANN (CCA-NN) is introduced in order to identify homogeneous regions, while a log-linear regression model is used in the RE step.

Several versions of CCA-NN may be considered depending on the selected cost functions (canonical correlation, mean square error MSE, mean absolute error MAE). Indeed, Cannon (2008) introduced a robust version of CCA-NN based on the biweight midcorrelation coefficient as a new measure of correlation instead of the Pearson correlation. After choosing the cost functions, canonical variables can be obtained and hence one can determine the hydrological neighborhood for an ungauged site. In the non-linear case, the variables V1 and V2 denote the hydrological canonical variables of the first and second mode, respectively, and \( \Lambda_{1} \) and \( \Lambda_{2} \) are the canonical correlation coefficients of the two modes. Identifying the physiographical coordinates of an ungauged site, U01 and U02, is performed using relation (9).

Similarly to the neighborhood of the linear case, the non-linear one can be obtained using the same constraint. However, the equation of the ellipsoid is different from the linear case (5), since the axes are not parallel to those of the coordinate system.

Let Y denote an array of hydrological data and V the corresponding canonical variable, thus we can write:

$$ Y = h\,(V) $$
(15)

Therefore by substituting (15) in (13) we obtain:

$$ h_{2} (V_{2} ) = h_{1} (V_{1} ) - y^{\prime} $$
(16)

Note that h, \( h_{1} \) and \( h_{2} \) are known non-linear functions. Hence, the angle \( \theta = (V_{1} ,V_{2} ) \) is different from \( \pi /2 \). Since the axes of the ellipsoid are always perpendicular, the ellipsoid is then rotated through an angle \( \phi \) relative to the coordinate system \( (V_{1} ,Z) \). As illustrated in Fig. 1, \( (V_{1} ,Z\,) \) is an orthonormal basis with Z = sin(θ) V2. The equation of the ellipsoid in the non-linear canonical space is given by:

$$ \frac{{\left( {P_{1} - {{\Lambda}_{1}} U_{01} } \right)^{2} }}{{a_{1}^{2} }} + \frac{{\left( {P_{2} - {{\Lambda}_{2}} U_{02} } \right)^{2} }}{{b_{1}^{2} }} = 1 $$
(17)

where:

$$ P_{1} = V_{1} \cos \phi - Z\sin \phi \,\quad {\text{and}}\,\quad P_{2} = V_{1} \sin \phi + Z\cos \phi $$
(18)

Note that the angle is the same for all sites and with different values of α. It depends only on θ: \( \phi = f(\theta ) \). Equation (5) related to the linear CCA is a special case of (17) with a zero angle of rotation \( \phi \) and \( \theta = \pi /2 \).
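A minimal sketch of Eqs. (17) and (18), assuming that the semi-axes \( a_{1} \), \( b_{1} \) and the angles θ and \( \phi \) are already known. The function name and all numerical values are illustrative, and the equations are implemented exactly as written above.

```python
import math

def inside_rotated_ellipse(v1, v2, center, a1, b1, theta, phi):
    """Check whether the site (v1, v2) lies inside the ellipse of Eq. (17).

    center = (Lambda_1 * U_01, Lambda_2 * U_02); theta is the angle between
    the V1 and V2 axes and phi is the rotation angle of the ellipse.
    """
    z = math.sin(theta) * v2                        # Z = sin(theta) V2
    p1 = v1 * math.cos(phi) - z * math.sin(phi)     # Eq. (18)
    p2 = v1 * math.sin(phi) + z * math.cos(phi)
    value = ((p1 - center[0]) / a1) ** 2 + ((p2 - center[1]) / b1) ** 2
    return value <= 1.0

center = (0.4, -0.05)
# theta = pi/2 and phi = 0 recover the axis-aligned ellipse of Eq. (5).
aligned_in = inside_rotated_ellipse(1.3, -0.05, center, 1.0, 1.7, math.pi / 2, 0.0)
aligned_out = inside_rotated_ellipse(1.5, -0.05, center, 1.0, 1.7, math.pi / 2, 0.0)
# A rotated case with made-up angles.
rotated = inside_rotated_ellipse(1.3, -0.05, center, 1.0, 1.7,
                                 math.radians(80), math.radians(21))
```

Applying this test to every gauged site yields the non-linear neighborhood of the target site, in the same way that expression (4) does in the linear case.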

Fig. 1 Geometrical definition of the homogeneous region in the non-linear canonical space

Similarly to CCA, the objective of NL-CCA consists in reducing the dimensions of hydrological and physiographical/meteorological spaces by taking into account the relationships between the considered variables. However, the construction of CCA reflects only linear relationships. The use of NL-CCA is necessary especially in the presence of non-linear structures. Note that the non-linearity in the hydrological processes is related to the non-linearity treated in NL-CCA.

To get a clear view of the correlation structure, it is essential to locate the source of interactions between variables. Note that the non-linearity in NL-CCA exists between the canonical and original variables of the same set, e.g. between U and physiographic variables. However, the non-linearity that occurs through the hydrological process is between hydrological variables Y and physiographical ones X. We show that these two types of nonlinearities are connected. Indeed, in the NL-CCA context, the canonical variables can be written as:

$$ U_{i} = f_{1} (X_{i} )\,\quad {\text{and}}\,\quad V_{i} = f_{2} (Y_{i} ) $$
(19)

where \( f_{1} \) and \( f_{2} \) are non-linear functions (or linear in the case of CCA) and \( i = 1, \ldots ,p \). The simplest situation is the linear case, whereas more complex relations lead to the same correlation:

$$ U_{i} \approx \lambda_{i} V_{i} \; $$
(20)

The symbol \( \approx \) indicates that both sides are approximately equal. Using relation (19), we obtain:

$$ U_{i} \approx \lambda_{i} f_{2} \left( {Y_{i} } \right) \approx h(Y_{i} )\; $$
(21)

Substituting Eq. (19) into (21), we get:

$$ h(Y_{i} ) \approx f_{1} (X_{i} ) $$
(22)

which leads to

$$ Y_{i} \approx k(X_{i} ) $$
(23)

where k(.) is a general function (if h is invertible, k would be equal to \( h^{ - 1} \circ f_{1} \)).

Thus non-linear relations described by (19) are equivalent to non-linear relationships between the two groups of original variables (23). On the other hand, the presence of non-linearity in hydrological processes, between X and Y, leads to a non-linearity between canonical variables. Therefore, it is necessary to use the nonlinear approach in the context of RFA.

2.4 Regional estimation

Among the various RE methods, the most popular ones are the index-flood and regression models (Ouarda 2013). In this paper we focus on the multivariate log-linear regression model, since it is more appropriate to use with CCA and with the available datasets. The relationship between flood quantiles (Y) and the physiographical/meteorological characteristics (X) is generally described by a power product model. With a log-transformation, the following log-linear model is obtained:

$$ \log {\kern 1pt} (Y) = \beta \log {\kern 1pt} (X) + \varepsilon $$
(24)

where β is a vector of parameters and ε represents the error (see Pandey and Nguyen (1999) for instance).
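To make the log-linear model (24) concrete, the sketch below fits it by ordinary least squares in the simplest setting: a single explanatory variable and an explicit intercept, i.e. a power law \( Y = e^{\beta_{0}} X^{\beta_{1}} \). This is a simplification of the multivariate model used in the paper; the function names and the synthetic data are assumptions made for illustration.

```python
import math

def fit_power_law(x, y):
    """Least-squares fit of log(y) = b0 + b1 * log(x)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1

def predict(b0, b1, x_new):
    """Back-transform the fitted model: y = exp(b0) * x_new ** b1."""
    return math.exp(b0) * x_new ** b1

# Synthetic, noise-free data generated from y = 2 * x ** 0.8.
area = [50.0, 120.0, 400.0, 950.0, 2300.0]
quantile = [2.0 * a ** 0.8 for a in area]
b0, b1 = fit_power_law(area, quantile)
estimate = predict(b0, b1, 600.0)
```

In the neighborhood approach, such a regression would be fitted using only the sites retained in the neighborhood of the target site and then evaluated at the characteristics of that site.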

2.5 Evaluation criteria

To assess the performance of the proposed techniques, different criteria are used. Each model is evaluated using the following five indices: the Nash criterion (NASH), which provides a general evaluation of the quality of the estimation; the root mean squared error (RMSE), which provides information about the accuracy of the estimator on an absolute scale; the relative RMSE (RMSEr), which relates to the relative scale; and the mean bias (BIAS) and the relative mean bias (BIASr), which provide a measure of the magnitude of overestimation or underestimation of a model. These indices are estimated based on a jackknife resampling procedure (e.g. Ouarda et al. 2001). It consists in temporarily removing each site and considering it as an ungauged one. The regional estimate is then compared to the local estimate, and the performance of each method is evaluated.
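The five indices and the jackknife loop can be sketched as below. The text does not spell out the formulas, so the code uses commonly adopted definitions, which should be read as assumptions; in particular, the sign convention of the bias (regional minus local) is assumed. The estimator passed to `jackknife` is a toy placeholder, not one of the models of this paper.

```python
import math

def evaluation_indices(local, regional):
    """Return NASH, RMSE, RMSEr (%), BIAS and BIASr (%).

    `local` holds at-site estimates, `regional` the estimates obtained when
    each site is treated as ungauged. Bias is taken as regional minus local,
    which is an assumed sign convention.
    """
    n = len(local)
    mean_local = sum(local) / n
    errors = [r - q for q, r in zip(local, regional)]
    rel_errors = [e / q for e, q in zip(errors, local)]
    sse = sum(e ** 2 for e in errors)
    return {
        "NASH": 1.0 - sse / sum((q - mean_local) ** 2 for q in local),
        "RMSE": math.sqrt(sse / n),
        "RMSEr": 100.0 * math.sqrt(sum(e ** 2 for e in rel_errors) / n),
        "BIAS": sum(errors) / n,
        "BIASr": 100.0 * sum(rel_errors) / n,
    }

def jackknife(sites, local, estimator):
    """Leave each site out in turn and estimate it from the remaining ones."""
    regional = []
    for i, site in enumerate(sites):
        others = sites[:i] + sites[i + 1:]
        others_q = local[:i] + local[i + 1:]
        regional.append(estimator(site, others, others_q))
    return evaluation_indices(local, regional)

def mean_estimator(site, others, others_q):
    """Toy estimator (hypothetical): mean quantile of the remaining sites."""
    return sum(others_q) / len(others_q)

local_q = [10.0, 12.0, 8.0, 11.0]
scores = jackknife(["s1", "s2", "s3", "s4"], local_q, mean_estimator)
perfect = evaluation_indices(local_q, local_q)
```

A perfect regional model gives NASH = 1 and zero for the other four indices, while the toy mean estimator gives a NASH below 1.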

The correlation coefficient and the proportion of explained variance are also used as evaluation criteria in the present work. The explained variance is deduced from the correlations between canonical components and initial variables (Van Den Wollenberg 1977):

$$ \sigma_{E}^{2} \left( {U_{i} } \right) = \frac{1}{q}\sum\limits_{j = 1}^{q} {[corr\,(U_{i} ,X_{j} )]^{\,2} } $$
(25)

In a similar way, expression (25) is also valid for hydrological variables \( Y_{j} \), j = 1, …, r, and canonical variables \( V_{i} \), i = 1, 2.
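Expression (25) amounts to averaging squared Pearson correlations, as the short sketch below shows. The data are made up, and the function names are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def explained_variance(canonical, variables):
    """Eq. (25): mean squared correlation between one canonical variable
    and each original variable (given as a list of columns)."""
    return sum(pearson(canonical, col) ** 2 for col in variables) / len(variables)

u1 = [0.2, -1.0, 0.7, 1.4, -1.3]
columns = [
    [2.0 * u + 1.0 for u in u1],      # perfectly correlated with u1
    [1.0, 1.0, -1.0, -1.0, 0.0],      # only partly correlated with u1
]
share = explained_variance(u1, columns)
```

Because each term is a squared linear correlation, this measure rewards linear association between a canonical variable and the original variables, even when the canonical variable itself was built non-linearly.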

3 Case study

3.1 Data

The data used in this study cover three regions in North America, namely the province of Quebec (Canada), and the states of Arkansas and Texas (USA). The data from Arkansas and Texas are available in Tasker et al. (1996).

The first region includes 151 hydrometric stations and is located in the southern part of the province of Quebec, between 45° and 55° N. The considered physiographical and meteorological variables are those used previously by Chokmani and Ouarda (2004): the mean basin slope (PMBV), the basin area (BV), the proportion of the basin area covered with lakes (PLAC), the annual mean total precipitation (PTMA) and the annual mean degree-days (DJBZ). Hydrological variables are at-site flood quantiles standardized by basin area to eliminate the scale effect (specific quantiles), denoted QST for a return period T. For each site, the most appropriate statistical distribution has been identified in order to estimate the quantiles corresponding to different return periods. Two specific quantiles are selected for this study, namely the 10-year and the 100-year quantiles.

The second case study concerns data from the state of Arkansas in the southern United States. The data stem from a hydrometric network composed of 204 gauging stations with drainage areas ranging from 0.13 to 6890 km². The same data were used by Tasker et al. (1996), namely the area (A), the slope of the main channel (S), the mean annual precipitation (P), the mean elevation of the watershed (EL), the length of the main stream (L), and estimated flood quantiles, QST, corresponding to return periods of T = 2, 5, 10, 25 and 50 years.

The last region covers a hydrometric network of 69 stations in the state of Texas. Basin areas range between 86 and 101,000 km². The variables used are those indicated in Tasker et al. (1996), i.e. five physiographic variables (A, S, P, EL and L) and five flood quantiles which are the same as those considered in the Arkansas case study.

3.2 Model design

In order to determine the homogeneous region, both CCA and CCA-NN analyses were carried out in the DHR step using r = 2 hydrological variables and q = 5 physiographical variables for all case studies (Quebec, Arkansas and Texas).

To build a model able to provide flood quantile estimation using the neighborhood approach, the CCA and CCA-NN approaches are coupled to a log-linear regression (24) in the RE step (denoted CCA & LR and CCA-NN & LR respectively). For comparison purposes, two regression models are considered in the non-linear case, according to the explanatory input variables, either directly using the initial data (X) or using the physiographical canonical variables (U1, U2). The latter is denoted CCA-NN & CLR and has the advantage of considering only the useful information with a smaller number of variables.

To compare the obtained results with the different approaches presented in Chebana and Ouarda (2008), we discuss essentially the results related to Quebec. Results associated with the other two regions will be presented briefly. Several versions of CCA-NN with different cost functions were treated (Correlation coefficient/Mean absolute error COR/MAE, biweight midcorrelation coefficient/Mean absolute error BICOR/MAE and biweight midcorrelation coefficient/Mean square error BICOR/MSE). In the section below, only the results associated with BICOR/MSE are presented and discussed since this version provides the lowest evaluation criteria values. This finding is in agreement with the conclusion presented by Cannon (2008). In addition, it should be noted that the choice of the transfer function is an important step in ANN modeling, as it can significantly affect the results. In the hydrological literature, the sigmoid and the hyperbolic tangent functions are most commonly used as non-linear transfer functions (Dawson and Wilby 2001; Yonaba et al. 2010). In this regard, several transfer functions belonging to the sigmoid function class were tested (the arctangent, the hyperbolic tangent and the sigmoid), and the hyperbolic tangent function yielded the best results. Hence, this transfer function (8) is employed for all case studies in the neurons of the hidden layers. The outputs of this model are canonical variables when the model is designed for forward mapping, and original variables in the case of inverse mapping. In the current application, three NNs were considered, where the first ensures the forward mapping, while the second and the third perform the inverse mapping.

After extracting the first CCA-NN mode, the extraction of the second mode is carried out by taking the residual as input, i.e., the original data minus the first CCA-NN mode, as in (12). Hence, we obtain the canonical variables in the non-linear space. Based on the Mahalanobis distance (4), the hydrological neighborhood of each ungauged site is determined.

4 Results

In this section, we present the results of the regional flood estimation procedure where the CCA-NN approach is considered for the DHR step. First, preliminary results are presented in order to study the relationships between variables. Figure 2 presents scatter plots of flood quantiles and physiographical/meteorological variables for Quebec. The examination of the scatter plots shows different forms of relationships between variables. We note, for instance, the existence of non-linear relations. The most notable ones are those between the variable basin area (BV) and the rest of the variables. Table 2 presents the correlation coefficients between the hydrological and the physiographical variables. Despite the existence of a relatively strong positive correlation between flood quantiles and PLAC on one hand, and negative linear correlation between quantiles and PTMA on the other hand, we can observe from Fig. 2 that these structures are rather non-linear. Further correlation measures are also evaluated between these variables. Figure 3 shows the correlation coefficients obtained by other correlation measures with respect to the Pearson correlation. This empirical comparison shows differences between measures, expressed by values higher or lower than those based on Pearson correlation. These behaviors indicate the existence of other dependence structures that are more complex than linearity.

Fig. 2 Scatter plot of physiographical and hydrological variables, Quebec

Table 2 Correlation between hydrological and physiographical variables-Quebec

Fig. 3 Empirical comparison between the Pearson correlation and other measures of correlation (the Kendall tau, the Spearman rho and the biweight midcorrelation), Quebec

By carrying out a linear CCA, the canonical correlation coefficients (3) are \( \lambda_{1} \, = 0.81 \) and \( \lambda_{2} \, = 0.27 \). In Chebana and Ouarda (2008), representations of data in the canonical spaces (not presented here to avoid repetition) show that the relationship between the first two canonical variables \( \left( {U_{1} \,,V_{1} } \right) \) can be considered to be linear, unlike variables \( \left( {U_{2} \,,V_{2} } \right) \) where linearity is relatively low.

In the following, results related to the CCA-NN are presented and discussed. Figure 4 presents the scatterplot of the study sites in the non-linear canonical spaces: physiographical \( \left( {U_{1} \,,U_{2} } \right) \) and hydrological \( \left( {V_{1} \,,V_{2} } \right) \). It is also convenient to present data in the spaces (U1, V1) and (U2, V2) to get prior information about the estimation error (Chebana and Ouarda 2008). This is illustrated in Fig. 5 for the non-linear case. A nearly linear relationship is observed between the two canonical variables (U1, V1). This is not the case for the couple (U2, V2). However, the CCA-NN scatterplot seems to be more linear than the scatterplot of the data set in the linear space (U2, V2) presented in Chebana and Ouarda (2008). This may be explained by the fact that the canonical correlation coefficients obtained from CCA-NN (\( \lambda_{1} \, = 0.90 \) and \( \lambda_{2} \, = 0.36 \) using (3) and (20)) are higher than their counterparts deduced from CCA.

Fig. 4 Data set in the non-linear canonical spaces: a physiographical and b hydrological, Quebec

Fig. 5 Data set in the non-linear canonical spaces: a (U1, V1) and b (U2, V2), Quebec

The explained variance (25), for the first two components, is respectively 51.16 and 97.36 % (vs 56.92 and 99 % in the linear CCA). Therefore, the canonical variables deduced from the linear CCA explain the variance of the variables slightly better than those corresponding to CCA-NN. This may be due to the linearity induced by the correlation coefficient in the expression of the explained variance. However, this does not affect the results significantly since the selection of canonical variables is based essentially on the canonical correlation coefficients.

In the following, we study the difference between the linear and non-linear approaches in identifying the hydrological neighborhood. The neighborhoods of selected stations are presented for both CCA and CCA-NN approaches in Fig. 6. We observe a remarkable difference between the two approaches. Indeed, using the CCA, the neighborhood of each site is an ellipsoid with a zero angle of rotation. The non-linear method identified a rotated ellipsoid with a rotation angle ϕ ~ 21°. Unlike CCA, the orientation of the CCA-NN ellipsoid tends to follow the shape of the data dispersion. For instance, the non-linear neighborhood of station 030340 (n = 45) identified 31 neighboring stations while the linear one identified a classical neighborhood with 39 stations, for the same value of α, \( \alpha_{CCA - NN} = 0.2 \). This means that the CCA-NN requires a smaller number of stations to reach the same RMSE as CCA. The optimal value of α corresponds to the minimum RMSEr. Figure 7 presents the variation of RMSEr for different values of α using CCA-NN. It can be seen that the optimal value αCCA-NN is 0.2. Note that for high values of α, the performance criteria tend to infinity.

Fig. 6

DHR results shown for stations 030340, 030420 and 02717 using: a CCA and b CCA-NN approaches, n = 45, 49 and 150 respectively—Quebec

Fig. 7

RMSEr variation as a function of the α parameter for hydrological variables QS10 and QS100—Quebec

To assess the magnitude of the obtained results and their impact on RFA, we proceed to the RE step. Table 3 presents the jackknife results for all considered approaches through the criteria cited above. It can be seen that the NASH values of the linear and non-linear models are nearly equal and sufficiently high to indicate acceptable results. For instance, for a return period of 100 years, the NASH of CCA is equal to 0.70 while it is equal to 0.71 for the non-linear case. Results also indicate that the RMSE values of CCA-NN & LR and CCA & LR are almost equal, whereas the RMSEr values of the estimates computed by the CCA-NN & LR model are considerably lower than those of the linear model. By comparing the results with those obtained with the iterative procedure in Chebana and Ouarda (2008) and Wazneh et al. (2013) for the same data set, it can be seen that the proposed model, CCA-NN & LR, leads to the best results among all models in terms of RMSEr. Indeed, while the linear approaches resulted in an RMSEr value of about 38 % for the quantile QS10 and 44 % for the quantile QS100, the CCA-NN & LR RMSEr values are around 34 % for the quantile QS10 and 41 % for the quantile QS100. It is also observed that the CCA-NN & LR results in both spaces, canonical and original, are very similar and are significantly better than those of the other models, i.e. the linear approach and the iterative procedure.
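The jackknife step can be sketched as a leave-one-out loop around a log-linear regression. The example below is only a simplified illustration: it assumes a power-law model fitted by ordinary least squares in log space, it refits on all remaining sites rather than on the neighborhood of each target site, and the function name `jackknife_log_linear` as well as the synthetic predictors are hypothetical.

```python
import numpy as np


def jackknife_log_linear(X, q):
    """Leave-one-out quantile estimates from a log-linear regression.

    X : (n_sites, n_predictors) strictly positive site characteristics
    q : (n_sites,) strictly positive at-site quantiles
    Returns the (n_sites,) array of jackknife estimates.
    """
    X = np.asarray(X, dtype=float)
    q = np.asarray(q, dtype=float)
    n = q.shape[0]
    # Design matrix in log space: log q = b0 + b1 log x1 + ... + bk log xk.
    A = np.column_stack([np.ones(n), np.log(X)])
    y = np.log(q)
    q_hat = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # temporarily treat site i as ungauged
        beta, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        q_hat[i] = np.exp(A[i] @ beta)  # back-transform to the original scale
    return q_hat


# Noise-free synthetic power law, for illustration only.
rng = np.random.default_rng(1)
X_demo = rng.uniform(1.0, 100.0, size=(30, 2))
q_demo = 2.0 * X_demo[:, 0] ** 0.8 * X_demo[:, 1] ** -0.3
q_hat_demo = jackknife_log_linear(X_demo, q_demo)
print("largest relative error:", np.max(np.abs(q_hat_demo / q_demo - 1.0)))
```

To mimic the procedure discussed here more closely, the rows kept for each fit would be further restricted to the stations falling in the neighborhood of site i.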

Table 3 Jackknife validation results-Quebec

For all considered models, the BIAS is very close to zero, with a slight improvement for the CCA-NN & LR approach. According to the BIASr criterion, the CCA-NN & CLR leads to the best results. However, in comparison with the results reported in Wazneh et al. (2013), the BIASr of the proposed models is higher in absolute value (about −6 and −7 % for QS100 and QS10, respectively, using the CCA-NN & LR, versus around −2 and −3 % with the iterative procedure). This may be explained by the choice of the ANN parameters in the CCA-NN method. In fact, several parameters must be fixed from the beginning to guarantee an optimum solution, such as the penalty parameters, which are chosen so as to avoid over-fitting. The optimization of these parameters is based on the RMSEr criterion. Consequently, the model loses in terms of BIASr, although the latter remains of the same order of magnitude as for the linear approaches.
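For reference, the five criteria discussed above can be computed from paired at-site and jackknife quantiles as in the following sketch. It assumes the definitions commonly used for these criteria; in particular, the sign convention for the bias (estimate minus observation) is an assumption, and the function name `jackknife_criteria` and the numerical values are hypothetical.

```python
import numpy as np


def jackknife_criteria(q_obs, q_est):
    """NASH, RMSE, RMSEr (%), BIAS and BIASr (%) for paired quantiles."""
    q_obs = np.asarray(q_obs, dtype=float)
    q_est = np.asarray(q_est, dtype=float)
    err = q_est - q_obs  # assumed sign convention: estimate minus observation
    rel = err / q_obs    # relative error at each site
    return {
        "NASH": 1.0 - np.sum(err ** 2) / np.sum((q_obs - q_obs.mean()) ** 2),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "RMSEr": float(100.0 * np.sqrt(np.mean(rel ** 2))),
        "BIAS": float(np.mean(err)),
        "BIASr": float(100.0 * np.mean(rel)),
    }


# Hypothetical values: every site is overestimated by 10 %.
q_obs_demo = np.array([100.0, 200.0, 400.0])
crit_demo = jackknife_criteria(q_obs_demo, 1.1 * q_obs_demo)
for name, value in crit_demo.items():
    print(f"{name}: {value:.3f}")
```

Because the relative criteria weight every site equally regardless of its magnitude, a model tuned on RMSEr can improve RMSEr while RMSE and NASH, which are dominated by the largest quantiles, remain almost unchanged.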

Figure 8 presents the estimation error for the flood quantiles QS100 and QS10 using both the CCA & LR and the CCA-NN & LR models. One can observe that, overall, the CCA-NN & LR leads to smaller estimation errors than the linear model, CCA & LR. In particular, the improvement for some sites is significant. For instance, for site 66, which has a particular location in both the linear and non-linear canonical spaces, the estimation error goes from −4.13 using CCA & LR to −2.3 using CCA-NN & LR.

Fig. 8

Estimation error resulting from the CCA & LR and CCA-NN & LR models—Quebec

In the following, selected results related to Arkansas and Texas are presented. Without loss of generality, we will focus on specific quantiles corresponding to return periods of 10 and 50 years.

Table 4 presents the canonical correlation coefficients as well as the percentages of explained variance for these two regions resulting from linear and non-linear CCA. Results indicate that, similarly to the region of Quebec, the canonical correlation coefficients are higher using CCA-NN than using CCA. This means that the non-linear components capture more information than the linear ones. However, as was the case for the Quebec case study, the explained variance of CCA is slightly higher than that of CCA-NN.

Table 4 Correlation coefficients and percentage of explained variance for CCA & LR and CCA-NN & LR relative to Arkansas and Texas

Table 5 summarises the results of the jackknife procedure using linear and non-linear analysis for these two regions. These results confirm the superiority of the non-linear approach. Indeed, when applied to the Arkansas data, the CCA-NN & CLR model improves the RMSEr of QS10 by about 2 % over the linear model CCA & LR, and that of QS50 by about 10 %. Similarly, results for the Texas region indicate that the non-linear models perform better than CCA. The improvement in RMSEr is even larger for Texas than for the Arkansas case study, with a significant improvement in BIASr.

Table 5 Jackknife validation results

5 Conclusions

This study has focused on the use of CCA-NN & LR methods in the context of RFA. The CCA approach has been successfully used for the delineation of homogeneous regions in RFA. However, this approach is not capable of representing the possible non-linear relationships between the variables of interest. To overcome the CCA limitations, several non-linear methods have been developed and used in other fields. CCA-NN and CCA-K are among the most prominent and most commonly used non-linear CCA methods.

In the current work, CCA-NN is presented and adapted to the RFA context. The method is also applied to three different regions to study its robustness in dealing with the non-linearity of hydrological processes. In order to assess the performance of this method, its results are compared to those of linear CCA. Results show that CCA-NN can be adopted to represent the non-linear behavior of hydrological processes and provides a more accurate and flexible delineation of homogeneous neighborhoods, leading to a better regional estimation. However, similarly to other ANN-based approaches, this method has a number of drawbacks, such as the identification of optimum parameters and the selection of the transfer function. The latter requires the non-linear relationship to be empirical, i.e., dependent on the data, whereas in the current work and previous works, the hyperbolic tangent function was considered.