1 Introduction

Regional frequency analysis (RFA) procedures are commonly used in hydrology to estimate flood and low-flow quantiles at sites where little or no hydrological data are available. Generally, RFA includes two main steps: delineation of homogeneous regions (DHR) and regional estimation (RE) (e.g. Chebana et al. 2014; Chebana and Ouarda 2007; Ouarda 2016). In this context, climatic, morphometric and physiographic characteristics of the watershed are widely used to describe geomorphic processes (e.g. Baumgardner 1987; Hadley and Schumm 1961; Marchi and Dalla Fontana 2005; Tramblay et al. 2010) in order to predict hydrological variables using RFA approaches (e.g. Dawson et al. 2006; Dodangeh et al. 2014; Goswami et al. 2007; Seidou et al. 2006; Tsakiris et al. 2011).

A number of physio-meteorological variables, such as basin area, basin slope, precipitation characteristics and land use, are commonly used in hydrology, and more specifically in RFA procedures. They are considered the most relevant variables for these studies because of their high correlation with the hydrological variables (Chokmani and Ouarda 2004). In addition to these commonly considered variables (a more exhaustive list is given in Table 1), drainage network characteristics (Jung et al. 2017) and tectonic setting (e.g. Ahmadi et al. 2006; Hamed et al. 2014) may have a strong impact on hydrological dynamics, and are consequently related to flood quantiles. However, they have not yet been well investigated or integrated in RFA studies. Indeed, the assessment of morphometric and physiographic variables requires the analysis of a number of stream characteristics (e.g. stream ordering, bifurcation ratio, texture ratio, stream length ratio). These variables characterize the basin shape as well as the drainage system, and can be useful for modelling the hydrological dynamics. Youssef et al. (2011) also indicated that the circularity ratio, number of orders and drainage density have a direct impact on the hydrological risk. Hence, integrating these variables in the procedures for the regionalization of extreme hydrological events may enhance RFA results. Variables related to drainage network systems are already used in several morphometric and hydrologic studies (e.g. Ameri et al. 2018; Biswas et al. 1999; Kaliraj et al. 2015; Pareta and Pareta 2011; Rai et al. 2017; Ratnam et al. 2005; Reddy et al. 2004; Sivasena Reddy and Janga Reddy 2013; Vijith and Satheesh 2006; Youssef et al. 2011) and may also be useful in regionalization studies.
These variables can be extracted using classical approaches, such as topographic maps and field examination, or advanced techniques based on remote sensing and Digital Elevation Models (DEM). Remote sensing techniques coupled with GIS tools are increasingly popular. Indeed, they make it possible to calculate the various basin characteristics from a DEM quickly and efficiently, which was not possible in the past.

Table 1 Predictor variables used in a number of previous regionalization studies

During the last decades, the focus in RFA has been mainly on the development of new delineation and estimation methods (e.g. Durocher et al. 2015; Ouali et al. 2016; Wazneh et al. 2016). Meanwhile, the list of physiographical and meteorological variables used as predictors has seen little evolution. In the present study, a number of commonly used RFA approaches are applied to test and evaluate the potential improvements that may result from the adoption of new physiographic variables.

The objective of this work is to propose the use of new physiographical variables related to the basin shape and drainage network and to argue for their usefulness. To evaluate their added value for quantile prediction in RFA, they are computed and used for a set of 151 basins in Quebec (Canada). More specifically, the objective is to use both the standard and extended databases to predict quantiles associated with several return periods, and to compare their prediction performances. In this work, standard RFA methods are considered for quantile prediction, namely canonical correlation analysis (CCA) (Ouarda et al. 2000) and the region of influence (ROI) (Burn 1990) for DHR, including a case with no DHR, as well as the log-linear regression model (LLRM) and the generalized additive model (GAM) (Hastie and Tibshirani 1987) for RE.

The present paper is structured as follows: Sect. 2 reviews the new physiographic and morphometric variables proposed in this work and details their characteristics. Section 3 briefly presents the theoretical background of the CCA and ROI approaches for the delineation of neighborhoods and of the LLRM and GAM for the regional estimation. The adopted methodology and the developed regional models are detailed in Sect. 4. Section 5 describes the study area and the datasets used. The results are presented and discussed in Sect. 6, and the conclusions of the work are summarized in the last section.

2 Variables characterizing drainage networks

Drainage network characteristics and evolution depend closely on the prevailing climatic, physiographic, and topographic conditions of the basin (Jung et al. 2015). These conditions determine the drainage network configuration which, in turn, can affect the hydrological response of the watershed (Howard 1990), and consequently hydrological quantile estimation. The new physiographical variables considered in this work are presented herein. Table 2 summarizes the definitions and standard mathematical equations used to determine these variables.

Table 2 Morphometric variables definitions

2.1 Stream order (U)

The stream order of a basin is the highest stream order within the basin, where a first-order stream is one starting at the source. A number of stream ordering systems are available in the hydrological literature. The simplest and most widely used one is the Strahler system, originally introduced by Horton (1945) and then modified by Strahler (1952). This method is based on a hierarchical ranking of streams. When two first-order streams join, a second-order stream is formed, and so on. Several researchers have directly correlated the stream order with stream flow (e.g. Blyth and Rodda 1973; Stall and Fok 1967). Blyth and Rodda (1973) also observed that during dry periods, first-order streams represent less than 20% of the total length of the drainage network. At the maximum development of the drainage network, the total length of first-order streams constitutes over 50% of the total basin stream length. Thus, stream order frequency, especially the frequency of first-order streams, may be well correlated with the hydrological response of the watershed.
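The hierarchical ranking described above can be illustrated with a short sketch. It assumes the usual Strahler rule (the order increases only where two tributaries of the same highest order meet) and uses a hypothetical dictionary-based network; neither the data structure nor the segment names come from the study.

```python
def strahler_order(segment, upstream):
    """Return the Strahler order of `segment`.

    `upstream` maps a segment id to the list of segments flowing into it.
    Source segments are absent from the mapping or map to an empty list.
    This representation is an illustrative assumption.
    """
    tributaries = upstream.get(segment, [])
    if not tributaries:
        return 1  # a stream starting at the source is of order one
    orders = sorted((strahler_order(t, upstream) for t in tributaries),
                    reverse=True)
    # Two tributaries sharing the highest order raise the order by one;
    # otherwise the highest tributary order is carried downstream.
    if len(orders) >= 2 and orders[0] == orders[1]:
        return orders[0] + 1
    return orders[0]


# Hypothetical network: four sources merge pairwise, then join at the outlet.
network = {"outlet": ["a", "b"], "a": ["s1", "s2"], "b": ["s3", "s4"]}
basin_order = strahler_order("outlet", network)  # order 3 for this network
```

The basin stream order U is then the order obtained at the outlet segment.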

2.2 Texture ratio (RT)

The texture ratio (RT) characterizes the basin drainage texture and is one of the most important factors in drainage morphometric analysis because of its strong relationship with the underlying lithology, the infiltration capacity and the topographic characteristics of the terrain (Schumm 1956). High RT values indicate the presence of soft rocks that are highly sensitive to erosion (Ameri et al. 2018), and consequently high and rapid surface runoff.

2.3 Circularity ratio (RC)

The circularity ratio (RC) is defined as the ratio of the area of a catchment to the area of the circle having the same perimeter as the catchment. It is an important variable that helps characterize the basin shape. It is affected by the length and frequency of streams, geological structures, land use and cover, and the slope of the catchment (Dar et al. 2014; Vijith and Satheesh 2006). RC values range between 0 and 1. Basins with RC values close to 1 have a circular form and a short concentration time, and hence a high peak flow. Low RC values are associated with strongly elongated basins and with lower runoff.
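Since a circle of perimeter P has area P²/(4π), the definition above reduces to RC = 4πA/P². The following minimal sketch evaluates it; the function name and the requirement of consistent units (e.g. km² and km) are illustrative assumptions.

```python
import math


def circularity_ratio(area, perimeter):
    """RC = basin area / area of the circle with the same perimeter.

    A circle of perimeter P has area P**2 / (4*pi), hence RC = 4*pi*A / P**2.
    `area` and `perimeter` must use consistent units (e.g. km2 and km).
    """
    if area <= 0 or perimeter <= 0:
        raise ValueError("area and perimeter must be positive")
    return 4.0 * math.pi * area / perimeter ** 2


# A perfectly circular basin gives RC = 1; a square basin gives pi/4.
rc_circle = circularity_ratio(math.pi * 2.0 ** 2, 2.0 * math.pi * 2.0)
rc_square = circularity_ratio(1.0, 4.0)
```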

2.4 Stream length ratio (RL)

The stream length ratio (RL) was defined by Horton (1945) as the ratio between the mean length of the streams of a given order and the next lower order. It is based on Horton's law (1945) of stream length that indicates the existence of a direct geometric relationship between the mean length of the streams of a given order and the next lower order. The RL between successive stream orders changes under the effect of the topographic and slope variability, and has a significant relationship with surface runoff and the erosional stage of the watershed (Sreedevi et al. 2005).

2.5 Mean bifurcation ratio (MRB) and weighted mean bifurcation ratio (WMRB)

The bifurcation ratio (RB) is defined as the ratio between the number of streams of a given order and the number of streams of the next-higher order in a drainage network. It characterizes the impact of the geological structures on the drainage network. Strahler (1957) indicated that the RB shows only a small range of variation between regions, except where geological control is strong. Chow (1964), Strahler (1964) and Verstappen (1983) indicated that, in general, the geological structures have a negligible impact on drainage networks if the mean bifurcation ratio (MRB) of the watershed lies between 3 and 5. A higher value of this variable indicates some degree of geological control (Agarwal 1998). This variable can also characterize the watershed’s shape. A high RB value is generally associated with an elongated basin, while a low RB value is likely to be associated with a circular basin (Gajbhiye 2015; Taofik et al. 2017). Strahler (1953) proposed a more representative measure of bifurcation, called the weighted mean bifurcation ratio (WMRB). It consists of multiplying the ordinary RB obtained for each pair of successive orders by the total number of streams involved in the ratio and subsequently taking the mean of these values. Schumm (1956) used this approach to determine the WMRB of the drainage system at Perth Amboy (N.J.). Pareta and Pareta (2011) and Bajabaa et al. (2014) also used this variable in hydrologic and morphometric analysis studies.
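As a hedged illustration, the sketch below computes RB, MRB and WMRB from a list of stream counts per order. The weighted mean is normalized by the sum of the weights, which is the form commonly used in morphometric studies; this normalization and the example counts are assumptions, not values from the study.

```python
def bifurcation_ratios(stream_counts):
    """RB for each pair of successive orders.

    `stream_counts[u - 1]` is the number of streams of order u.
    """
    return [stream_counts[i] / stream_counts[i + 1]
            for i in range(len(stream_counts) - 1)]


def mean_bifurcation_ratio(stream_counts):
    """MRB: arithmetic mean of the RB values."""
    rb = bifurcation_ratios(stream_counts)
    return sum(rb) / len(rb)


def weighted_mean_bifurcation_ratio(stream_counts):
    """WMRB: each RB is weighted by the number of streams involved in it."""
    rb = bifurcation_ratios(stream_counts)
    weights = [stream_counts[i] + stream_counts[i + 1] for i in range(len(rb))]
    return sum(r * w for r, w in zip(rb, weights)) / sum(weights)


# Hypothetical basin with 40, 10, 2 and 1 streams of orders 1 to 4.
counts = [40, 10, 2, 1]
rb = bifurcation_ratios(counts)                   # [4.0, 5.0, 2.0]
mrb = mean_bifurcation_ratio(counts)              # 11/3
wmrb = weighted_mean_bifurcation_ratio(counts)    # 266/65
```

In this example the WMRB is pulled towards the ratios of the lower orders, which involve many more streams than the single ratio at the top of the network.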

2.6 RHO coefficient (ρ)

The RHO coefficient (ρ) is defined as the ratio between the RL and the RB of the watershed. It characterizes the relationship between the physiographic development of the watershed and the drainage density, and permits the assessment of the storage capacity of the drainage network (Horton 1945). This variable is affected by several climatic, geologic, biologic, geomorphologic and anthropogenic factors (Mesa 2006).

2.7 Drainage density (DD)

The drainage density (DD) was introduced in the hydrological literature by Horton (1932) as the total length of stream networks per unit area. DD expresses the closeness of the spacing of streams, and provides a quantitative measurement of landscape dissection and runoff potential (Magesh et al. 2011). It results from interacting factors controlling the surface runoff, such as the infiltration capacity, the climatic conditions and the vegetation cover of the watershed (Máčka 2001; Patton 1988; Reddy et al. 2004; Verstappen 1983).

2.8 Stream frequency (FS)

The stream frequency (FS) is the number of stream segments of all orders per unit area (Horton 1932, 1945). It depends on the rock characteristics, infiltration capacity, vegetation cover, relief, amount of rainfall and subsurface permeability (Hajam et al. 2013), and reflects the texture of the drainage network (Magesh et al. 2011). In general, a high FS is associated with impermeable subsurface, sparse vegetation, high relief conditions and low infiltration capacity (Reddy et al. 2004; Shaban et al. 2005).

2.9 Infiltration number (IF)

The infiltration number (IF) is defined by Faniran (1968) as the product of the DD and the FS. It allows the characterization of the watershed infiltration capacity (Hajam et al. 2013). This variable is inversely proportional to the infiltration capacity of the basin. The higher the IF values, the lower will be the infiltration and the higher will be the runoff (Pareta and Pareta 2011).
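The three definitions given in Sects. 2.7 to 2.9 chain together directly, as the following minimal sketch shows. The function names, units (km and km²) and numerical values are illustrative assumptions.

```python
def drainage_density(total_stream_length_km, area_km2):
    """DD: total length of the stream network per unit area (km/km2)."""
    return total_stream_length_km / area_km2


def stream_frequency(n_streams, area_km2):
    """FS: number of stream segments of all orders per unit area (1/km2)."""
    return n_streams / area_km2


def infiltration_number(dd, fs):
    """IF: product of the drainage density and the stream frequency."""
    return dd * fs


# Hypothetical basin: 120 km of streams, 200 segments, 80 km2.
dd = drainage_density(120.0, 80.0)     # 1.5 km/km2
fs = stream_frequency(200, 80.0)       # 2.5 segments/km2
if_number = infiltration_number(dd, fs)  # 3.75
```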

2.10 Ruggedness number (RN)

The ruggedness number (RN) is often used to evaluate the flood potential of streams (Patton and Baker 1976) and it usually combines the impact of slope steepness with its length (Strahler 1964). This variable allows describing the structural complexity of the terrain. Watersheds characterized by high RN values are highly subject to erosion and therefore susceptible to an increased peak flow (Sreedevi et al. 2013).

3 Theoretical background

In this section, we briefly present the statistical approaches adopted in the present work. We define an RFA model as a two-step procedure beginning with a neighborhood identification method and then performing regional estimation. We consider two different methods for each step, which are described below.

3.1 Delineation of homogeneous regions

3.1.1 Canonical correlation analysis (CCA)

The CCA method is detailed in Ouarda et al. (2001) in the context of RFA, where it is commonly used to identify groups of basins having the same hydrological response. This method reduces the dimension of the problem by establishing pairs of canonical variables based on linear transformations of two groups of random variables. Let \(X = \left( {X_{1} ,X_{2} , \ldots ,X_{m} } \right)\) and \(Y = (Y_{1} ,Y_{2} , \ldots ,Y_{n} )\) be two sets of random variables containing, respectively, the m physio-meteorological variables and the n hydrological variables of N gauged sites. Based on these variables, the linear combinations Vi and Zi of the variables X and Y and the canonical correlation coefficients λ1, …, λp (with λi = corr (Vi, Zi)) can be computed.

Using the CCA method, the considered basins can be represented as points in the spaces of the uncorrelated canonical variables (Vi, Zj), where i ≠ j. It is then possible to examine the similarity of the point patterns in these spaces, i.e., the ability of the physio-meteorological canonical variables to predict the hydrological variables. Point patterns that are sufficiently similar are associated with a sub-group of basins belonging to the same statistical population, and vice versa. The similarity between basins is measured using a Mahalanobis distance.

3.1.2 Region of influence (ROI)

Like the CCA, the ROI method (Burn 1990) allows the identification of a hydrological neighborhood for a given target site, here based on a Euclidean distance, generally a weighted one. This distance measures the similarity of watersheds in a multidimensional space of physio-meteorological variables. A more detailed description of the approach can be found, for example, in Burn (1990) and GREHYS (1996).
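The distance-based neighborhood idea can be sketched as follows. This is only an illustration under stated assumptions: attribute vectors are taken as already standardized, the weights are hypothetical, and the neighborhood is formed by keeping a fixed number of nearest sites rather than by applying a distance threshold.

```python
import math


def weighted_euclidean(x, y, weights):
    """Weighted Euclidean distance between two attribute vectors."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


def roi_neighborhood(target, sites, weights, size):
    """Return the ids of the `size` gauged sites closest to the target.

    `sites` maps a site id to its (standardized) attribute vector.
    """
    ranked = sorted(
        sites, key=lambda s: weighted_euclidean(target, sites[s], weights))
    return ranked[:size]


# Hypothetical sites described by two standardized attributes.
sites = {"A": (0.0, 0.0), "B": (1.0, 1.0), "C": (3.0, 0.0), "D": (0.5, -0.5)}
neighbors = roi_neighborhood((0.0, 0.0), sites, weights=(1.0, 1.0), size=2)
```

With equal weights, the two sites nearest to the target at the origin are A and D.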

3.2 Regional estimation approaches

3.2.1 Linear regression model

The linear regression model or the log-linear regression model (LLRM) is commonly used to find a linear relationship between the hydrological variable (such as the flood quantile QT corresponding to a return period T) and the physio-meteorological characteristics of a watershed (X1, X2, …, Xm), and it is defined as (e.g. Girard et al. 2004; Pandey and Nguyen 1999):

$$\log \left( {E\left( {Y/X} \right)} \right) = \beta _{0} + \mathop \sum \limits_{j = 1}^{m} \beta_{j} \log (X_{j} ) + \varepsilon$$
(1)

where X is a matrix whose columns correspond to a set of m explanatory variables, β0 and βj are unknown parameters to be estimated using the least-squares method (Pandey and Nguyen 1999) and ε is the model error.
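A minimal sketch of Eq. (1) is given below: the response and the predictors are log-transformed and the coefficients are obtained by ordinary least squares through the normal equations. The helper names and the synthetic data are assumptions for illustration; a production analysis would normally rely on a statistical package rather than this hand-written solver.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_llrm(x_rows, y):
    """Least-squares fit of log(y) = b0 + sum_j bj * log(x_j).

    All predictors and responses must be strictly positive.
    Returns [b0, b1, ..., bm].
    """
    design = [[1.0] + [math.log(v) for v in row] for row in x_rows]
    log_y = [math.log(v) for v in y]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(design, log_y)) for i in range(p)]
    return solve_linear(xtx, xty)


def predict_llrm(beta, row):
    """Back-transform the fitted log-linear relation to the original scale."""
    return math.exp(beta[0] + sum(b * math.log(v)
                                  for b, v in zip(beta[1:], row)))


# Synthetic check: data generated exactly from y = 2 * x1**0.5 / x2.
x_rows = [[1.0, 1.0], [2.0, 1.0], [4.0, 2.0],
          [8.0, 3.0], [3.0, 5.0], [10.0, 2.0]]
y = [2.0 * a ** 0.5 / b for a, b in x_rows]
beta = fit_llrm(x_rows, y)  # close to [log(2), 0.5, -1.0]
```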

3.2.2 Generalized additive model

GAM was developed by Hastie and Tibshirani (1987). It is an extension of the generalized linear model (GLM). This model allows for a response distribution other than Gaussian and for a non-linear relationship between response and explanatory variables through smooth functions (Hastie and Tibshirani 1987; Wood 2006), which may lead to a closer description of the hydrological processes involved. The GAM formula is given by Wood (2006):

$$g \left( {E\left( {Y/X} \right)} \right) = \beta_{0} + \mathop \sum \limits_{j = 1}^{m} S_{j} (X_{j} ) + \varepsilon$$
(2)

where g is a monotonic link function and \({\text{S}}_{{\text{j}}}\) are smooth functions of explanatory variables \({\text{X}}_{{\text{j}}}\).

The estimation of the smooth functions \({\text{S}}_{{\text{j}}}\) is carried out using splines, which are piecewise polynomial functions linked at points named knots. Generally, the smooth functions \({\text{S}}_{{\text{j}}}\) are defined as follows:

$$S_{j} \left( x \right) = \mathop \sum \limits_{i = 1}^{q} \beta_{ji} b_{ji} (x)$$
(3)

where \({\upbeta }_{{{\text{ji}}}}\) are unknown parameters and \({\text{b}}_{{{\text{ji}}}}\) are the spline basis functions.

4 Methodology

4.1 Regional models

In this study, we apply all combinations of the two DHR methods (CCA, ROI) in conjunction with the RE models (LLRM and GAM) presented in Sect. 3. The RE models are also considered with all stations (i.e. without defining any neighborhood). This results in six possible combinations for each dataset, the standard one (STA) and the extended one (EXTD). Thus, the following regionalization approaches are evaluated (Fig. 1):

  • ALL/LLRM (STA and EXTD): LLRM used without neighborhoods (all stations) and with variables selected from the STA and the EXTD datasets using the backward stepwise procedure.

  • ALL/GAM (STA and EXTD): GAM used without neighborhoods (all stations) and with variables selected from the STA and the EXTD datasets using the backward stepwise procedure.

  • CCA/LLRM (STA and EXTD): LLRM used with neighborhoods identified by the CCA method and with variables selected from the STA and the EXTD datasets using the backward stepwise procedure.

  • CCA/GAM (STA and EXTD): GAM used with neighborhoods identified by the CCA method and with variables selected from the STA and the EXTD datasets using the backward stepwise procedure.

  • ROI/LLRM (STA and EXTD): LLRM used with neighborhoods identified by the ROI method and with variables selected from the STA and the EXTD datasets using the backward stepwise procedure.

  • ROI/GAM (STA and EXTD): GAM used with neighborhoods identified by the ROI method and with variables selected from the STA and the EXTD datasets using the backward stepwise procedure.

Fig. 1 Different combinations and considered models

The CCA and ROI methods are used in the DHR considering two different sets of physio-meteorological variables. The first set includes variables from the STA dataset, namely the area (AREA), mean basin slope (MBS), percentage of the area occupied by lakes (PLAKE), mean annual total precipitation (MATP), mean annual degree days below 0 °C (DDBZ) and the longitude of the centroid of the catchment (LONGC). The second one comprises variables from the EXTD dataset, which are PLAKE, MATP, DDBZ, LONGC, RT and RC. These variables are selected based on their correlation with the hydrological variables (Table 3), since the principle of the CCA is based on correlations. For simplicity and consistency with the CCA, the variables for the ROI are also selected based on correlation levels.

Table 3 Correlation between hydrological and physiographical variables

The classical procedures of ROI and CCA lead to neighbourhoods whose sample sizes vary greatly from one target site to another. Indeed, for a given threshold value, sites located near the centre of the cloud of points, in the Euclidean space for ROI and in the canonical space for CCA, are expected to include more sites within their neighbourhoods than sites located on the edge of the cloud (Leclerc and Ouarda 2007). Since the accuracy of the estimates obtained by regression models is sensitive to the sample size, it was decided to fix the neighbourhood size for all target stations. This size is chosen with a standard jackknife procedure and optimized using the optimization procedure of Ouarda et al. (2001) developed in the Matlab environment.

LLRM and GAM are used in this study as RE models. GAM is fitted using the R package mgcv (Wood 2006). In this work, the thin plate regression spline is used as the basis bji (.) of the smooth function \({\mathrm{S}}_{\mathrm{j}}(.)\) in Eq. (3). This basis is chosen for its reduced computation time, its flexibility and its lower number of parameters compared to other smoothers (Wood 2006). The link function g considered in Eq. (2) is the identity function since the log-transformed quantiles are approximately normal (as in Ouali et al. 2017).

4.2 Selection of explanatory variables

The variable selection procedure differs between the two RFA steps: a correlation-based selection is considered for DHR, and a stepwise method is used for RE, as is standard in RFA studies. Based on the correlation between the physio-meteorological variables and the hydrological variables (Table 3), six variables are identified for DHR (see above).

For the RE step, four variable selection methods are first tested in this study, namely the forward, backward, stepwise and shrinkage approaches (Heinze et al. 2018). Table 4 presents the results obtained from each variable selection approach applied to QS10, which can be considered the most reliable quantile. It can be seen that, regardless of the selection method considered, several new variables are selected in the final model. This suggests that the new variables in the EXTD are potentially useful for RFA.

Table 4 Variables selection results for QS10 case (with different methods)

To evaluate whether the new variables are predictive of the target quantiles, the backward stepwise selection procedure is adopted for both LLRM and GAM. It has already been applied successfully with the same dataset (STA) and in the same context by Chebana et al. (2014), Ouarda et al. (2018) and more recently Msilini et al. (2020). The backward stepwise selection procedure starts from an initial model comprising all available variables and progressively eliminates the variable with the highest p value (for the null hypothesis that the corresponding coefficient in Eq. (1) for LLRM, or smooth term in Eq. (3) for GAM, is zero). The procedure stops when the number of variables remaining in the model drops below a specific number (Fig. 2). This number is chosen as the one minimizing the RRMSE estimated by jackknife.
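The elimination loop can be sketched generically as follows. The model fitting itself is abstracted behind two hypothetical callables, `p_values` (refits the model and returns a p value per variable) and `jackknife_rrmse` (returns the jackknife RRMSE of a variable subset); both names are assumptions, and the stopping size is interpreted here as the number of variables retained.

```python
def backward_elimination(variables, p_values, n_keep):
    """Drop the least significant variable until `n_keep` variables remain.

    `p_values(selected)` is a user-supplied callable (hypothetical here)
    that refits the model on `selected` and returns {variable: p value}.
    """
    selected = list(variables)
    while len(selected) > n_keep:
        p = p_values(selected)
        worst = max(selected, key=lambda v: p[v])  # highest p value
        selected.remove(worst)
    return selected


def select_by_rrmse(variables, p_values, jackknife_rrmse, sizes):
    """Among the candidate sizes, keep the subset with the lowest RRMSE.

    `jackknife_rrmse(selected)` is a user-supplied callable (hypothetical).
    """
    candidates = [backward_elimination(variables, p_values, k) for k in sizes]
    return min(candidates, key=jackknife_rrmse)


# Toy illustration with fixed, made-up p values (not results of the study).
toy_p = {"AREA": 0.001, "RT": 0.02, "RC": 0.3, "DD": 0.6, "MBS": 0.04}
kept = backward_elimination(
    list(toy_p), lambda sel: {v: toy_p[v] for v in sel}, n_keep=3)
```

In a real application the p values change at every refit, which is why the callable is re-evaluated inside the loop.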

Fig. 2 Backward elimination process

4.3 Models validation

For each RFA model, a jackknife procedure (also called leave-one-out cross-validation) is used to evaluate its performance. It consists of considering, in turn, each gauged site as an ungauged one and then comparing the regional estimate to the observed value. This comparison is performed through several criteria. First, the Nash criterion (NASH) gives a global assessment of the prediction quality. Second, the root mean squared error (RMSE) provides information about the accuracy of the prediction on an absolute scale, while the relative RMSE (RRMSE) removes the impact of each site’s order of magnitude from the RMSE values and gives information about the accuracy of the prediction on a relative scale. Finally, the bias (BIAS) and the relative bias (RBIAS) measure the magnitude of the systematic overestimation or underestimation of a model. These criteria are formulated as follows:

Nash criterion:

$${\text{NASH}} = 1 - \frac{{\mathop \sum \nolimits_{i = 1}^{N} (y_{i} - \hat{y}_{i} )^{2} }}{{\mathop \sum \nolimits_{i = 1}^{N} (y_{i} - \overline{y})^{2} }}$$
(4)

Root-mean-square error:

$${\text{RMSE}} = \sqrt {\frac{1}{N}\mathop \sum \limits_{i = 1}^{N} (y_{i} - \hat{y}_{i} )^{2} }$$
(5)

Relative root-mean-square error:

$${\text{RRMSE}} = 100\sqrt {\frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \left[ {\frac{{(y_{i} - \hat{y}_{i} )}}{{y_{i} }}} \right]^{ 2} }$$
(6)

Mean bias:

$${\text{BIAS}} = \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} (y_{i} - \hat{y}_{i} )$$
(7)

Relative mean bias:

$${\text{RBIAS}} = 100 \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \frac{{(y_{i} - \hat{y}_{i} )}}{{y_{i} }}$$
(8)

where \(y_{i}\) and \(\hat{y}_{i}\) are, respectively, the local and regional quantile estimates at site i, \(\overline{y}\) is the mean of the local quantile estimates, and N is the number of stations.
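Equations (4) to (8), together with the leave-one-out loop, can be sketched as follows. The `fit` and `predict` callables are placeholders for any regional model and are assumptions of this example, as are the numerical values.

```python
import math


def jackknife_predictions(x, y, fit, predict):
    """Leave-one-out: treat each site in turn as ungauged and predict it.

    `x` is a list of predictor rows and `y` the list of at-site values;
    `fit(x_train, y_train)` and `predict(model, row)` are user-supplied.
    """
    estimates = []
    for i in range(len(y)):
        model = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        estimates.append(predict(model, x[i]))
    return estimates


def validation_criteria(y, y_hat):
    """NASH, RMSE, RRMSE (%), BIAS and RBIAS (%) as in Eqs. (4) to (8)."""
    n = len(y)
    y_bar = sum(y) / n
    err = [a - b for a, b in zip(y, y_hat)]   # y_i - y_hat_i
    rel = [e / a for e, a in zip(err, y)]     # relative errors
    return {
        "NASH": 1.0 - sum(e * e for e in err)
        / sum((a - y_bar) ** 2 for a in y),
        "RMSE": math.sqrt(sum(e * e for e in err) / n),
        "RRMSE": 100.0 * math.sqrt(sum(r * r for r in rel) / n),
        "BIAS": sum(err) / n,
        "RBIAS": 100.0 * sum(rel) / n,
    }


# Made-up local and regional estimates at three sites.
scores = validation_criteria([1.0, 2.0, 4.0], [1.0, 1.0, 5.0])
```

With the sign convention of Eqs. (7) and (8), a positive BIAS or RBIAS means that the regional model underestimates the local quantiles on average.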

5 Case study and datasets

The data used in this study comprise two datasets, the STA and the EXTD, covering 151 stations located in the southern part of Quebec, Canada (Fig. 3). The STA, which contains the geographical coordinates of the stations and commonly used physio-meteorological variables, was considered in previous studies (e.g. Durocher et al. 2015; Shu and Ouarda 2007; Wazneh et al. 2016). The EXTD dataset combines the STA dataset with less common variables representing drainage network properties. The stations are operated by the Ministry of Sustainable Development, Environment, and Fight Against Climate Change.

Fig. 3 Geographical location of the studied stations in Quebec, Canada

The considered hydrological variables (\(Y\) in the theoretical background) are at-site quantiles standardized by the basin area (specific quantiles), denoted by QS10, QS50 and QS100, where 10, 50 and 100 are the return periods. Descriptive statistics of the hydrological and physio-meteorological variables of the STA (not presented here to avoid repetition) can be found, for example, in Durocher et al. (2015). The hydrological variables were identified in Kouider et al. (2002a) using a local frequency analysis at each gauged site. Data series with at least 15 years of measurements were considered for the analysis. The basic assumptions of stationarity, homogeneity and independence were verified and the appropriate statistical distributions were fitted to the data. The probability distributions identified as appropriate are mainly the inverse gamma and the two-parameter log-normal. For more details about this study, the reader may refer to the report of Kouider et al. (2002b). The new physiographical variables considered in the EXTD are summarized in Table 5. These variables are derived from drainage networks extracted using the D8 method based on the DEMs (Jenson and Domingue 1988; O'Callaghan and Mark 1984). This technique is implemented in ArcGIS (Arc Hydro).

Table 5 Descriptive statistics of new physiographical variables

The D8 method operates on a digital elevation model (DEM), which is essentially a grid of elevation values. For each cell, water is assumed to flow in the direction of the steepest slope among its eight neighbors. The resulting direction grid can then be used to estimate flow accumulation, which is obtained by summing the weights of all grid cells flowing into each downslope cell in the output grid, i.e. by simulating the flow path. Based on the obtained flow accumulation grid, the drainage networks can be extracted, with the stream head locations corresponding to accumulation values exceeding a constant threshold value (see for instance Tarboton et al. 1991).
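The D8 logic can be illustrated on a small grid. The sketch below is a simplified pure-Python illustration, not the ArcGIS/Arc Hydro implementation used in the study: it assumes unit cell weights, leaves pits and flat cells without a flow direction, and takes stream cells as those whose accumulation reaches a user-chosen threshold (the usual convention, assumed here).

```python
import math

# Offsets (d_row, d_col) to the eight neighbors of a cell.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
             (0, 1), (1, -1), (1, 0), (1, 1)]


def d8_flow_direction(dem, cell_size=1.0):
    """Offset of the steepest-descent neighbor of each cell (None if none)."""
    n_rows, n_cols = len(dem), len(dem[0])
    direction = [[None] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            best = 0.0
            for dr, dc in NEIGHBORS:
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    # Diagonal neighbors are sqrt(2) cell sizes away.
                    dist = cell_size * math.hypot(dr, dc)
                    slope = (dem[r][c] - dem[rr][cc]) / dist
                    if slope > best:
                        best, direction[r][c] = slope, (dr, dc)
    return direction


def flow_accumulation(dem, direction):
    """Number of upstream cells draining into each cell (unit weights)."""
    n_rows, n_cols = len(dem), len(dem[0])
    acc = [[0] * n_cols for _ in range(n_rows)]
    # Visiting cells from highest to lowest guarantees that every cell is
    # complete before its contribution is passed to its downslope neighbor.
    cells = sorted(((dem[r][c], r, c)
                    for r in range(n_rows) for c in range(n_cols)),
                   reverse=True)
    for _, r, c in cells:
        d = direction[r][c]
        if d is not None:
            acc[r + d[0]][c + d[1]] += acc[r][c] + 1
    return acc


def stream_cells(acc, threshold):
    """Cells whose flow accumulation reaches the threshold."""
    return [(r, c) for r, row in enumerate(acc)
            for c, a in enumerate(row) if a >= threshold]


# Hypothetical 3 x 3 DEM draining towards the lower-right corner.
dem = [[9, 8, 7],
       [8, 5, 4],
       [7, 4, 1]]
direction = d8_flow_direction(dem)
acc = flow_accumulation(dem, direction)   # [[0, 0, 0], [0, 3, 1], [0, 1, 8]]
streams = stream_cells(acc, threshold=3)  # [(1, 1), (2, 2)]
```

In this example all eight other cells eventually drain to the lower-right cell, which is the outlet, and the extracted stream follows the diagonal from the central cell to the outlet.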

In this work, the DEMs were hydrologically corrected based on information from the National Hydro Network (NHN). This correction was carried out using the DEM Reconditioning process, which is an implementation of the “AGREE” method. It consists of adjusting the DEM by imposing linear features as a reference. The reference in this case is the NHN.

The DEMs used have a spatial resolution of ~ 20 m grid cells and are obtained from the Natural Resources Canada database (https://www.nrcan.gc.ca/earth-sciences/geography/topographic-information/download-directory-documentation/17215). Note that the drainage networks of six cross-border watersheds are extracted using United States Geological Survey (USGS) data distributed with ~ 30 m grid cells (https://earthexplorer.usgs.gov/).

CCA requires the normality of all variables. Hence, some variables need to be transformed. The normality of each variable is visually assessed with a normal probability plot, which plots the empirical quantiles against the theoretical Gaussian quantiles; the points should be approximately aligned if the variable is normal. The logarithmic transformation is considered for the hydrological variables, AREA, MBS, MATP, DDBZ and RT, and a square root transformation for PLAKE and RC. The LONGC is used without transformation since it is approximately normal.

6 Results and discussion

A correlation analysis is carried out in order to investigate the relationships between variables. Table 3 shows the list of the variables selected for the DHR step based on their high correlation level with the hydrological variables. One can see the existence of relatively high negative correlations between the hydrological variables and the AREA, PLAKE, DDBZ and RT. We also note the presence of important positive correlations between the response variables and the MATP and RC variables. The linear correlation coefficients between the variable RT, which is one of the most important new variables, and the specific quantiles QS10 and QS100 are -0.53 and -0.51 respectively. However, those between the RT variable and the at-site flood quantiles Q10 and Q100 are 0.87 and 0.86 respectively. Positive and high correlation values indicate that the increase in RT is associated with a relatively fast and high hydrologic response and consequently an increased risk of erosion. This is consistent with what is stated in Ameri et al. (2018). The second important new variable in terms of correlation level is the RC characterizing the basin shape. Higher RC values (close to 1) are associated with circular basins with low concentration time and high hydrological response hence the positive correlation.

The identification of the neighborhood requires the determination of the optimal number of stations to be used in the RE step. To this end, the optimization procedure of Ouarda et al. (2001) is used. Based on a selected criterion such as RMSE, RRMSE, BIAS or RBIAS the optimal size of neighborhoods can be identified. The optimal size of the neighborhoods should be large enough to ensure that RE can be carried out effectively, but not too large in order to maintain an acceptable degree of homogeneity within the neighborhoods. In this study, we obtain nopt (STA) = 85 sites and nopt (EXTD) = 78 sites with respect to the RRMSE, which is the most important criterion (Hosking and Wallis 2005), for the CCA approach. For the ROI method, the obtained optimum sizes are nopt (STA) = 54 sites and nopt (EXTD) = 44 sites with respect to the same criterion.

The backward stepwise selection method is considered for each quantile (QS10, QS50 and QS100) and for each model (LLRM and GAM). In the present study, the optimal number of variables in GAM, which is the most complex model, is found to be seven. Table 6 shows the seven selected variables for each quantile and model combination. We note the selection of three new variables (RN, MRL and DD).

Table 6 Explanatory variables selected for the various regression models

The jackknife procedure results for all considered combinations are presented in Table 7. The best overall performances are obtained with the EXTD, especially with ROI/GAM/EXTD followed by the CCA/GAM/EXTD approaches. Based on the high NASH values (0.79) and the lowest RRMSE values (29.24% for QS100), the ROI/GAM/EXTD combination gives the most precise estimates compared to all other approaches. According to RBIAS, all models underestimate flood quantiles but the least biased model is ROI/LLRM/EXTD (-1.38% for QS100). However, compared to the ROI/GAM/EXTD approach, the difference is low (around -1.8% for QS100).

Table 7 Jackknife validation results

Note that GAM applied to EXTD (with and without the neighborhoods) outperforms LLRM applied to EXTD and STA. This may be explained by the ability of GAM to take into account possible nonlinear relationships between predictor and response variables, and also by the important impact of the new variables.

We also notice that the use of the EXTD leads to larger improvements with the ROI method than with the CCA approach. Wazneh et al. (2016) also obtained better results with the ROI than with the CCA approach.

To further explain the previous results, the relative errors for the best combinations (ROI/GAM and CCA/GAM), with the stations ordered according to their area, are given in Figs. 4 and 5, respectively. It can be seen that the EXTD performs well, especially for large basins. Indeed, for the large watersheds the relative errors decrease considerably with the EXTD. This result is also supported by Fig. 6, where one can note that the lowest specific quantiles, which are usually associated with sites with large basin areas, are well estimated with the EXTD. A significant improvement can also be seen for some specific sites that have exceptionally large relative errors with STA. Four such sites (030401, 030402, 041903 and 042607) were identified previously by Chokmani and Ouarda (2004), Durocher et al. (2015) and Ouali et al. (2017) as particular stations with underestimated areas. The integration of more accurate variables dealing with the drainage network considerably improves the quantile estimates corresponding to these sites.

Fig. 4 Relative errors associated with the local quantile QS100 calculated with ROI/GAM/STA and ROI/GAM/EXTD

Fig. 5 Relative errors associated with the local quantile QS100 calculated with CCA/GAM/STA and CCA/GAM/EXTD

Fig. 6 Relative errors using ROI/GAM/STA and ROI/GAM/EXTD as a function of QS100

Jackknife estimates using the ROI/GAM and CCA/GAM approaches (for QS100) are illustrated in Figs. 7 and 8, respectively. One can see that these models perform better when combined with the EXTD than with the STA. The points in the scatter diagrams of the at-site and regional estimates are less dispersed when using the EXTD than the STA. In addition, the values of the coefficient of determination R² show that the linear agreement between the local and the regional specific quantile estimates is stronger when using the EXTD than the STA.

Fig. 7 Specific regional quantile versus local estimates using ROI/GAM/STA and ROI/GAM/EXTD approaches for QS100

Fig. 8 Specific regional quantile versus local estimates using CCA/GAM/STA and CCA/GAM/EXTD approaches for QS100

Results also indicate that sites with high specific quantile values (more than 0.7 m³/s·km²), which are generally associated with small basins with an area of less than 800 km², are underestimated using the two datasets. This result can be explained by the fact that traditional neighborhood approaches (CCA and ROI) lead to an underestimation for sites with small basin areas, as shown in Wazneh et al. (2016). This may be the cause of the negative RBIAS values obtained in this work, and suggests the usefulness of developing specific regional models for small basins.

Figures 9 and 10 present the smooth functions of the response variable log(QS100) as a function of the STA and the EXTD explanatory variables, respectively. We notice that the variables PLAKE, DDBZ, AREA and DD show a complex nonlinear relationship (nonlinear smooth function curves and high edf values), while the variables LONGC, MALP, MCL, MBS and MRL present linear relations.

Fig. 9 Smooth functions of QS100 for the predictor variables included in the regional models ALL/GAM/STA, CCA/GAM/STA and ROI/GAM/STA. The dotted lines represent the 95% confidence intervals. The vertical axes denote the spline of each explanatory variable

Fig. 10 Smooth functions of QS100 for the predictor variables included in the regional models ALL/GAM/EXTD, CCA/GAM/EXTD and ROI/GAM/EXTD. The dotted lines represent the 95% confidence intervals. The vertical axes denote the spline of each explanatory variable

A particular case of interest in the EXTD concerns the relationship between the hydrological variable and the DD values. One can see that the higher the DD values are, the lower the hydrological response will be. This result contradicts what is commonly observed in practice (Melton 1957). In fact, the correlation between the DD variable and the specific quantile is negative (-0.11), while the correlation between the flood quantile and the DD variable is positive (0.13). Thus, the DD variable depends on the size of the watershed, and its effect is reversed in this case study because the specific quantile, that is, the quantile divided by the basin area, is used.
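This sign reversal can be reproduced with a small numerical illustration. The values below are purely hypothetical and are not taken from the study data: DD is assumed to increase with basin area, and the flood quantile is assumed to grow more slowly than the area, so that dividing by the area changes the sign of the correlation.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the mechanism
area = np.array([1.0, 2.0, 3.0, 4.0])   # basin area (arbitrary units)
dd = np.array([1.0, 2.0, 3.0, 4.0])     # DD, assumed to increase with area
q = np.array([1.0, 1.4, 1.7, 2.0])      # flood quantile, grows slower than area

qs = q / area                           # specific quantile

r_q = np.corrcoef(dd, q)[0, 1]          # correlation of DD with the flood quantile
r_qs = np.corrcoef(dd, qs)[0, 1]        # correlation of DD with the specific quantile

print(f"corr(DD, Q)  = {r_q:.2f}")
print(f"corr(DD, QS) = {r_qs:.2f}")
```

In this toy setting the first correlation is positive and the second negative, which is the same qualitative pattern as the one reported for the DD variable.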

We also notice that the MRL and MCL variables are found to be inversely related to the hydrological response. An increase in these variables is associated with a decrease in the MBS and hence a decrease in the hydrological response.

It can also be seen that the relationship between log(QS100) and PLAKE is decreasing for the majority of PLAKE values, but increases for the highest values of PLAKE. However, the number of points is very limited in the high PLAKE range and more effort will be required to understand the effect of this variable on the flow regime for this range. In general, lakes act as a sponge absorbing the excess water during extreme events, which explains the decreasing relationship between log(QS100) and PLAKE.

The LONGC in this study is an indicator of the station's proximity to the Atlantic Ocean and therefore reflects the influence of the ocean on the local climate. Finally, the variability in the relationship between the DDBZ values and the hydrological response may indirectly reflect the seasonal impact of temperature on the flow regime. The same patterns were observed previously by Chebana et al. (2014) for the DDBZ and PLAKE variables.

7 Conclusions

Through a case study in the province of Quebec, the present study shows the relevance of considering drainage network characteristics for quantile prediction in RFA. This result is supported by the variable importance in the RFA models, which shows that five new variables, namely RT, RC, DD, MRL and RN, are particularly useful for the specific case of Quebec. Prediction accuracy is also improved using the new variables, especially when considering small neighborhoods and nonlinear models, as shown by the superior accuracy of the ROI/GAM/EXTD combination. This improvement also appears to be more pronounced for large basins.

By focusing on the drainage network and basin shape, the new physiographical variables integrate more information about the underlying hydrogeological flows and thus, indirectly, make the link between the groundwater and the surface water flows. This added information allows for a better description of the hydrological dynamics involved and consequently leads to better flood quantile estimates.

The present study paves the way for several perspectives. In particular, drainage network characteristics should be evaluated further in a wider variety of settings, including different climates and catchment geologies. The increasing complexity of the databases used in RFA, to which this research contributes, also highlights the need for methodological developments that allow a more efficient use of this extensive information, as classical approaches may be limited in this regard. Future research should thus focus on how to take advantage of the interactions between the newly proposed variables in quantile estimation, as well as on the potential nonlinear impact of the considered variables.