1 Introduction

Floods are a major natural disaster, causing millions of dollars in damage and the loss of human lives every year. Flood modeling has been an area of active research since the inception of engineering hydrology (Cunnane 1989). Globally, it has recently come into focus because of the numerous devastating floods that have swept across many countries, and recent episodes of flooding have been attributed to global warming (ABOM 2012). To decrease flood damage and save human lives, flood modeling is undertaken to estimate floods associated with different return periods, known as design floods. Many methods are available for design flood estimation. At-site flood frequency analysis is the most direct method and serves as a benchmark for assessing the accuracy of regional flood estimation methods and rainfall-runoff modeling (Rahman et al. 2013).

Flood frequency analysis (FFA) is an active area of investigation in statistical hydrology. The primary objective of FFA is to relate the magnitudes of extreme events to their frequency of occurrence using probability distributions (Hamed and Rao 1999); that is, FFA determines the relationship between flood quantiles and their non-exceedance probability. FFA is a major component of hydrological studies, as it is the basis of hydraulic design for infrastructure such as dam spillways, diversion canals, dikes, river channels, urban drainage systems, and cross-drainage structures (e.g., culverts, bridges, dips), as well as of flood risk mapping. The social and economic implications of FFA require accurate statistical procedures. For example, when setting the design discharge of a dam spillway for an accepted level of risk, the value should be estimated as accurately as possible: a flood quantile estimated below the real value increases the risk of failure, whereas an overestimated value increases the cost of the spillway unnecessarily (Francés 1998).

Selection of the design return period depends on the nature and scale of the project, the consequences of flooding, economic criteria, possible human casualties, and hydrological factors. Flood underestimation can result in loss of life and property, while overestimation increases investment costs. Hydrologists have long searched for appropriate methods of improving the accuracy of flood estimation (Saghafian et al. 2014). For sites with long records of measured floods, the general univariate methodology is to derive a fitted distribution describing the exceedance probability of the annual maximum flood (USWRC 1982). At-site FFA methods involve collecting a data series for a site, choosing probability distribution functions, and using the data series to estimate the parameters of the underlying distribution (Hamed and Rao 1999). These traditional methods have a number of advantages, such as simplicity, low cost (when sufficient data are available), the ability to incorporate risk analysis, and flexibility for time series of different lengths. They nevertheless have limitations and sources of uncertainty, which may be enumerated as follows (Alberta Transportation 2001):

  • Record length;

  • Data reliability and systematic errors in observed discharges;

  • Heterogeneous data resulting from climatic and land-use changes in the watershed;

  • Heterogeneous data due to floods generated by different mechanisms (rain or snowmelt);

  • Use of inappropriate statistical distributions;

  • Extrapolation of the fitted distribution;

  • Errors in estimating distribution parameters.

Numerous studies in the literature follow the traditional FFA approach. Kamal et al. (2016) conducted FFA for two hydrometric gauges on the Ganga River, India, in order to estimate floods with different return periods; they identified the lognormal and Gumbel probability distributions as the best-fitting statistical distributions. Zhang et al. (2017) concluded that the generalized extreme value (GEV) distribution was the best statistical distribution for 34 stations in the Pearl River Delta over a period of about 60 years. Lam et al. (2017) used the probabilistic regional envelope curve (PREC) approach and spatial information on maximum floods and achieved considerable improvement in FFA results, particularly for floods with return periods greater than 100 years.

Traditional FFA methods for determining a design flood are based on data from systematic records (Frances et al. 1994). Systematic records are collected during periods of systematic stream gaging, usually a continuous series of years, in which flood data is observed and recorded annually, regardless of magnitude. A nonsystematic record is one that is collected and recorded sporadically, without definite criteria, usually in response to actual, perceived, or anticipated major flooding (HFAWG 2008). Nonsystematic data includes historical flood data recorded before the systematic period and paleoflood data obtained from analysis of proxy data. Both historical and paleoflood data can provide additional important sources of information beyond the systematic period with which to estimate flood quantiles (Frances et al. 1994).

2 Flood Frequency Using Systematic and Nonsystematic Data: Proposed Approach

A review of common engineering practice in different countries shows that, from a practical viewpoint, approaches for integrating historical floods with systematic flood data in flood risk analysis are lacking. Several methods have been suggested for incorporating systematic and nonsystematic data into flood frequency analysis, including empirical and nonparametric methods, parametric methods based on the historically weighted moments method, the expected moments algorithm, maximum likelihood, probability-weighted moments, and L-moments (Cohn et al. 1997; Cohn and Stedinger 1987; England et al. 2003; Frances et al. 1994; Halbert et al. 2016; Kjeldsen et al. 2014; Parkes and Demeritt 2016; Payrastre et al. 2013; Salinas et al. 2016; Strupczewski et al. 2014).

An extraordinarily large flood peak that occurs during a period of systematic record (known as a high outlier) is a controversial element of traditional FFA methods (HFAWG 2008). It can cause difficulties in the selection of the best-fit flood frequency distribution and in parameter estimation. Retention, modification, or removal of such outliers presents problems for satisfactorily fitting a parametric frequency distribution to the sample and can significantly affect the chosen frequency distribution and the estimation of its parameters. For small samples in particular, it can cause the T-year event to be under- or overestimated and produce high uncertainty in upper-quantile estimates. All procedures for treating outliers ultimately require judgment involving both mathematical and hydrologic considerations (USWRC 1982).

There is a lack of consensus about how to use data for extraordinarily large floods. The Interagency Committee on Water Data, in Bulletin 17B, recommends that flood peaks considered to be high outliers be compared with historic flood data and flood information at nearby sites (USWRC 1982). If available information indicates that a high outlier is the maximum for an extended period of time, the outlier is treated as historic flood data. If no useful historic information is available to adjust high outliers, they should be retained as part of the systematic record (USWRC 1982). The Civil Projects Branch of Alberta Transportation (2001) recommends that an unusually high value should not be excluded from analysis unless there is reason to question the reliability of its reported value or to believe that it represents a special phenomenon likely to recur only at long intervals. The records of other adjacent or analogous stations should be checked for evidence of synoptic or similarly incompatible high events. Benson (1968) suggested that future work should address outliers or rare floods and noted that “in any case, any major modifications would have to meet the test conforming to the data satisfactorily.” Similarly, the NRC (1988) recommended focusing on the extreme tails of the probability distribution to better estimate extreme flood probabilities. Lamontagne et al. (2016) applied the expected moments algorithm (EMA) to incorporate potentially influential low floods (PILFs) with other observed floods using the lognormal and log-Pearson type III distributions; this approach is considered a major advance in flood frequency analysis that improves flood estimation, particularly for high return periods. Halbert et al. (2016) compared local and regional flood frequency approaches, with special emphasis on the effect of information on extreme floods and the assumptions associated with regional approaches. Their results showed that a relatively limited level of regional heterogeneity may significantly affect the performance of regional approaches, and they illustrated the added value of information on extreme floods, whether historical floods or recent floods observed at ungauged sites, in both local and regional approaches. Other references addressing outliers in flood frequency analysis include Hawkins (1980) and USWRC (1982).

This paper presents an innovative methodology for employing extraordinarily large flood data in flood estimation using probability distributions when no historical or paleoflood data are available in the region. An extensive review of the related literature shows a lack of transparent, practical guidelines for using extraordinarily large flood data in FFA. In the proposed methodology, after the identification and verification of exceptional outlier values in the data series, the appropriate weight of these data is determined by allocating a suitable historical period through a sensitivity analysis. Because this historical period is unknown, the selection criterion is the minimization of the residual indices of the statistical distribution fitted to the data series. More precisely, in this approach the extraordinarily large floods are treated as historic flood data, even though no historical information exists. To determine the effect of direct involvement of the extraordinarily large flood on flood quantile estimates, FFA is performed both with and without the extraordinarily large flood, applying 15 probability distributions and three parameter estimation methods. Chi-square and Kolmogorov–Smirnov goodness-of-fit tests, as well as the mean absolute relative deviation (ARD) and mean squared relative deviation (MSD), are used to identify the best-fit probability distributions at all hydrometric stations (Mohammadpour et al. 2014). The generalized extreme value (GEV), three-parameter lognormal (LN3), log-Pearson type III (LP3), and Wakeby (WAK) probability distributions are used to incorporate and adjust the extraordinarily large floods with the other systematic data. The length of the (unknown) historical period is determined using sensitivity analysis. Finally, flood quantiles for different return periods are estimated, and the results are compared with those obtained when the extraordinarily large flood is used without adjustment. The details of the proposed methodology are provided in the subsequent sections of this article.

3 Methods

3.1 Study Area

The Golestan Dam watershed was selected as the study area. It is located in Golestan Province in northeastern Iran, between 53°10′ and 56°30′ E longitude and 36°50′ and 37°24′ N latitude, and is situated between the Alborz mountain range to the south and the Caspian Sea to the north. The climate of this area is mild; the average annual rainfall is less than 650 mm. Although most rainfall occurs in winter, the summers are not entirely dry, and floods of large magnitude mainly form during the summer as a result of high-intensity storms. The mean annual temperature is 16 °C with 75% humidity. Flooding is the main natural hazard in this area and frequently causes loss of life and damage to property (Sharifi et al. 2002). The watershed drains 5155 km² of land into the ∼90 million m³ Golestan Dam reservoir. The minimum, average, and maximum elevations of the watershed are 53, 935, and 2050 m, respectively (LAR 2000; Sheshangosht et al. 2010). Figure 1 shows the location of the watershed, the boundaries of the main sub-watersheds, the drainage network, and the hydrometric stations.

Fig. 1 Location of Golestan Province and the study watershed in Iran

3.2 Frequency Analysis

The availability of data is a major factor in frequency analysis. Estimating the probability of occurrence of extreme floods is an extrapolation based on limited data; thus, the larger the database, the more accurate the estimates will be. In the present study, FFA was performed at three hydrometric stations within the Golestan Dam watershed, namely Tamer, Tangrah, and Galikesh. Discharge data measured at the hydrometric stations were obtained from the Ministry of Energy of Iran, which operates the stations. The procedure began with quality control of the annual maximum instantaneous flood data. The station codes, longitudes and latitudes, drainage areas, periods of record, record lengths, and altitudes are summarized in Table 1. The data considered for FFA are the annual instantaneous maximum flood peak series shown in Fig. 2. Annual flood data series must be independent, random, homogeneous, and free of trends. These characteristics were examined using the consolidated frequency analysis (CFA) package from Environment Canada (Pilon and Harvey 1994) for nonparametric testing. The tests used were the Spearman tests for independence and trend, a general randomness test, and the Mann-Whitney split-sample test for homogeneity.
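A minimal sketch of this kind of nonparametric screening is given below; it uses a hypothetical peak series and standard SciPy implementations of the Spearman and Mann-Whitney tests rather than the CFA package itself.

```python
# Illustrative screening of an annual maximum series (not the CFA package):
# a Spearman test against observation order checks for trend, and a Mann-Whitney
# test on the two halves of the record checks for homogeneity. Values are hypothetical.
import numpy as np
from scipy import stats

annual_peaks = np.array([120.0, 85.0, 240.0, 310.0, 95.0, 150.0, 410.0, 60.0,
                         175.0, 220.0, 134.0, 98.0, 265.0, 188.0, 142.0, 76.0])

# Spearman rank correlation between observation order and magnitude (trend test)
rho, p_trend = stats.spearmanr(np.arange(len(annual_peaks)), annual_peaks)

# Mann-Whitney split-sample test for homogeneity between the two half-records
half = len(annual_peaks) // 2
u_stat, p_homog = stats.mannwhitneyu(annual_peaks[:half], annual_peaks[half:],
                                     alternative="two-sided")

print(f"Spearman trend test: rho={rho:.3f}, p={p_trend:.3f}")
print(f"Mann-Whitney homogeneity test: U={u_stat:.1f}, p={p_homog:.3f}")
```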

Table 1 Stream gauging stations in the Golestan Dam Watershed used in this study
Fig. 2 Variations of the discharge values for the stream gauging stations

In this study, 15 probability distributions were considered using the FREQ program in MATLAB developed by Hamed and Rao (1999). The distributions examined included the normal (NRM), two-parameter lognormal (LN2), three-parameter lognormal (LN3), exponential (EXP), two-parameter gamma (G2), Pearson type III (P3), LP3, GEV, extreme value type I (EV1), Weibull (WEI), four-parameter Wakeby (WK4), five-parameter Wakeby (WK5), and generalized Pareto (PAR) distributions, several of which are widely used in hydrologic frequency analysis. These probability distributions were selected following global recommendations for their use in at-site FFA (Cunnane 1989). In practice, the true probability distribution of the data at a site or in a region is unknown.

FREQ applies three methods of estimating distribution parameters, using both biased and unbiased estimators: the method of moments (MOM), the maximum likelihood method (MLM), and the probability-weighted moments method (PWM). Details of these methods are available in the literature (Hamed and Rao 1999). The estimated parameters are used to compute quantile estimates for different return periods, or to compute the return period for a given flood magnitude. This is achieved by substituting the parameter estimates into the distribution function and using the relationship between the return period (T) and the probability of non-exceedance (F), F = 1 − 1/T. Chi-square and Kolmogorov–Smirnov goodness-of-fit tests, together with ARD and MSD, were used to identify the best-fit probability distributions at all hydrometric stations.
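Before turning to the goodness-of-fit indices defined in Eqs. (1)-(3), the quantile computation can be sketched as follows; the sample series is hypothetical, and SciPy's GEV fit by maximum likelihood stands in for the FREQ program.

```python
# Minimal sketch (not the FREQ program): fit a GEV distribution to an annual maximum
# series by maximum likelihood (MLM) and compute a quantile via F = 1 - 1/T.
import numpy as np
from scipy import stats

annual_peaks = np.array([120.0, 85.0, 240.0, 310.0, 95.0, 150.0, 410.0, 60.0,
                         175.0, 220.0, 134.0, 98.0, 265.0, 188.0, 142.0, 76.0])

# Fit the GEV distribution (shape, location, scale) by maximum likelihood
shape, loc, scale = stats.genextreme.fit(annual_peaks)

# Quantile for a 100-year return period: non-exceedance probability F = 1 - 1/T
T = 100.0
F = 1.0 - 1.0 / T
q100 = stats.genextreme.ppf(F, shape, loc=loc, scale=scale)
print(f"Estimated 100-year flood: {q100:.1f} m3/s")
```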

$$ ARD=\frac{1}{N}\sum_{i=1}^{N}\left|{q}_i(T)\right| $$
(1)
$$ MSD=\frac{1}{N}\sum_{i=1}^{N}\left[{q}_i(T)\right]^2 $$
(2)

and

$$ {q}_i(T)=\frac{{\widehat{Q}}_i(T)-{Q}_i(T)}{Q_i(T)} $$
(3)

where \( {\widehat{Q}}_i(T) \) and \( {Q}_i(T) \) are the computed and observed discharge values, respectively.
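A minimal sketch of these indices, assuming hypothetical computed and observed quantile values, is given below.

```python
# Relative-deviation goodness-of-fit indices of Eqs. (1)-(3); example values are hypothetical.
import numpy as np

def ard_msd(q_computed, q_observed):
    """Mean absolute and mean squared relative deviations (Eqs. 1-3)."""
    rel_dev = (q_computed - q_observed) / q_observed   # q_i(T), Eq. (3)
    ard = np.mean(np.abs(rel_dev))                     # Eq. (1)
    msd = np.mean(rel_dev ** 2)                        # Eq. (2)
    return ard, msd

q_obs = np.array([150.0, 220.0, 310.0, 410.0])   # observed discharges
q_fit = np.array([140.0, 235.0, 300.0, 455.0])   # quantiles from a fitted distribution
print(ard_msd(q_fit, q_obs))
```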

3.3 Detection of Extraordinarily Large Floods

The term “outlier” generally refers to single data points that appear to depart significantly from the trend of the remaining data (HFAWG 2008). A high outlier is an extraordinary flood that occurs during the period of systematic record. This type of outlier is more common in flood distributions in which one tail is stretched out relative to the normal distribution; it is especially common in “heavy-tailed” distributions such as the Pareto, and such values may be called “statistical” outliers (HFAWG 2008). There are no universally accepted criteria for the detection and handling of outliers in flood data; detection of outlier floods is performed using statistical tests and graphical methods (Alberta Transportation 2001). In this study, extraordinarily large flood peaks were identified using single- and multiple-outlier detection tests, namely the Grubbs and Beck (1972) test and the Spencer and McCuen (1996) test. The Bulletin 17B guidelines recommend the Grubbs and Beck (G-B) test for detecting outliers. In the G-B test, the quantities \( X_H \) and \( X_L \) are calculated as follows:

$$ {X}_H=\exp\left(\overline{X}+{K}_N S\right) $$
(4)
$$ {X}_L=\exp\left(\overline{X}-{K}_N S\right) $$
(5)

where \( \overline{X} \) and S are the mean and standard deviation of the natural logarithms of the sample, respectively, and \( K_N \) is the G-B statistic tabulated for various sample sizes and significance levels by Grubbs and Beck (1972). The following approximation (Eq. 6) was used at the 10% significance level, where N is the sample size. Sample values greater than \( X_H \) are considered high outliers, while those less than \( X_L \) are considered low outliers (Hamed and Rao 1999).

$$ {K}_N=-3.62201+6.28446{N}^{\frac{1}{4}}-2.49835{N}^{\frac{1}{2}}+0.491436{N}^{\frac{3}{4}}-0.037911 N $$
(6)
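The G-B screening of Eqs. (4)-(6) can be sketched as follows; the annual peak series is hypothetical, and the implementation is illustrative rather than the FREQ/CFA code used in this study.

```python
# Grubbs-Beck high/low outlier thresholds (Eqs. 4-6) on the natural logs of the peaks.
import numpy as np

def grubbs_beck_thresholds(peaks):
    """Return (X_H, X_L) outlier thresholds at the 10% significance level."""
    x = np.log(peaks)
    n = len(peaks)
    # Approximation of the 10% critical value K_N (Eq. 6)
    k_n = (-3.62201 + 6.28446 * n ** 0.25 - 2.49835 * n ** 0.5
           + 0.491436 * n ** 0.75 - 0.037911 * n)
    mean, std = x.mean(), x.std(ddof=1)       # statistics of the log series
    x_h = np.exp(mean + k_n * std)            # Eq. (4): high-outlier threshold
    x_l = np.exp(mean - k_n * std)            # Eq. (5): low-outlier threshold
    return x_h, x_l

annual_peaks = np.array([120.0, 85.0, 240.0, 310.0, 95.0, 150.0, 410.0, 60.0,
                         175.0, 220.0, 134.0, 98.0, 265.0, 188.0, 142.0, 76.0])
x_h, x_l = grubbs_beck_thresholds(annual_peaks)
high_outliers = annual_peaks[annual_peaks > x_h]
print(f"X_H = {x_h:.1f}, X_L = {x_l:.1f}, high outliers: {high_outliers}")
```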

The Spencer and McCuen (1996) test (S-M) uses the following composite equations to determine the critical deviate \( K_N \) for sample size N:

$$ {K}_N={C}_1{N}^2+{C}_2 N+{C}_3\qquad \text{for}\ 10\le N\le 15 $$
(7)
$$ {K}_N={C}_4+{C}_5\,{e}^{-{C}_6 N}{N}^{C_7}\qquad \text{for}\ 16\le N\le 89 $$
(8)
$$ {K}_N={C}_4+{C}_5\,{e}^{-90{C}_6}\,{90}^{C_7}\left[1+\left(N-90\right)\left(\frac{C_7}{90}-{C}_6\right)\right]\qquad \text{for}\ 90\le N\le 150 $$
(9)

The coefficients C1 to C7 for five skew values (from −1 to 1), sample sizes from 10 to 150, three significance levels (10%, 5%, and 1%), both high and low outliers, and one, two, and three outliers are presented in Spencer and McCuen (1996). The critical values for detecting low outliers in distributions with negative skew and high outliers in distributions with positive skew were averaged to decrease sampling variation; similarly, the values for detecting high outliers in distributions with negative skew and low outliers in distributions with positive skew were averaged. To perform a consecutive test for three high (or low) outliers, the two discharges farthest from, but on the same side of, the mean are first removed. The mean, standard deviation, and skew of the logarithms of the flood record are then computed for the sample of N − 2 values. If the third most extreme discharge in the original sample exceeds the threshold of Eqs. (4) and (5), all three discharges are considered outliers and the test is complete. If it does not exceed the threshold, the second most extreme discharge is returned to the sample and the moments are recomputed for N − 1 values. If the second most extreme discharge in the original sample exceeds the resulting threshold, there are two outliers; if not, the moments are computed for the entire sample and a test for one outlier is performed (Spencer and McCuen 1996).
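The sequential logic of this test can be sketched as follows. Because the tabulated S-M coefficients C1 to C7 are not reproduced here, the G-B approximation of Eq. (6) stands in for the critical deviate; a faithful implementation would substitute Eqs. (7)-(9) with the appropriate coefficients. The peak series is hypothetical.

```python
# Illustrative consecutive high-outlier test (up to three outliers), as described above.
import numpy as np

def critical_deviate(n):
    """Stand-in critical deviate (10% level, Eq. 6); replace with S-M Eqs. (7)-(9)."""
    return (-3.62201 + 6.28446 * n ** 0.25 - 2.49835 * n ** 0.5
            + 0.491436 * n ** 0.75 - 0.037911 * n)

def consecutive_high_outliers(peaks, max_outliers=3):
    """Return the number of consecutive high outliers detected (0 to max_outliers)."""
    logs = np.sort(np.log(peaks))[::-1]              # log magnitudes, largest first
    for k in range(max_outliers, 0, -1):
        sample = logs[k - 1:]                        # moments on the N-(k-1) smallest values
        k_n = critical_deviate(len(sample))
        threshold = sample.mean() + k_n * sample.std(ddof=1)
        if logs[k - 1] > threshold:                  # k-th most extreme exceeds the threshold
            return k
    return 0

annual_peaks = np.array([120.0, 85.0, 240.0, 310.0, 95.0, 150.0, 410.0, 60.0,
                         175.0, 220.0, 134.0, 98.0, 265.0, 188.0, 142.0, 76.0])
print(consecutive_high_outliers(annual_peaks))
```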

3.4 Historical Adjustment Procedure for Extraordinarily Large Floods

The GEV, LN3, LP3, and WAK probability distributions were used to incorporate extraordinarily large floods with the other systematic data using the CFA package from Environment Canada, version 3.1 (Pilon and Harvey 1994). The Bulletin 17B (USWRC 1982) historically weighted moments procedure (B17H) was used to estimate the parameters of the GEV and LP3 distributions. In B17H, the historically adjusted sample mean (\( \tilde{M} \)), sample variance (\( {\tilde{S}}^2 \)), and skew coefficient (\( \tilde{G} \)) are computed as follows (a computational sketch is given after the notation list below):

$$ \tilde{M}=\frac{W\sum X+\sum {X}_Z}{H- WL} $$
(10)
$$ {\tilde{S}}^2=\frac{W\sum {\left( X-\tilde{M}\right)}^2+\sum {\left({X}_Z-\tilde{M}\right)}^2}{H- WL-1} $$
(11)
$$ \tilde{G}=\frac{H-WL}{\left(H-WL-1\right)\left(H-WL-2\right)}\left[\frac{W\sum {\left(X-\tilde{M}\right)}^3+\sum {\left({X}_Z-\tilde{M}\right)}^3}{{\tilde{S}}^3}\right] $$
(12)

where the weighting factor W is defined as:

$$ W=\frac{H-Z}{N+L} $$
(13)

The historically adjusted rank (\( \tilde{m} \)) of each flood magnitude is computed as:

$$ \tilde{m}=E\qquad \text{when}\ 1\le E\le Z $$
(14)
$$ \tilde{m}=WE-\left(W-1\right)\left(Z+0.5\right)\qquad \text{when}\ Z+1\le E\le \left(Z+N+L\right) $$
(15)

The historically weighted plotting position of each event is given by:

$$ \tilde{P}P=\frac{\tilde{m}-a}{H+1-2a}\times 100 $$
(16)

where:

E: event number when events are ranked in order from greatest to smallest magnitude; E ranges from 1 to (Z + N)
X: logarithmic magnitude of systematic peaks, excluding zero-flood events, peaks below the base level, and high or low outliers
\( X_Z \): logarithmic magnitude of a historic peak, including a high outlier that has historic information
N: number of X values
\( \tilde{M} \): historically adjusted mean
\( \tilde{m} \): historically adjusted order number of each event, used to compute the plotting position on probability paper
\( \tilde{S} \): historically adjusted standard deviation
\( \tilde{G} \): historically adjusted skew coefficient
\( \tilde{P}P \): plotting position in percent
Z: number of historic peaks, including high outliers that have historic information
H: number of years in the historic period
L: number of low values to be excluded, such as zero flows, incomplete record years (below the measurable base), and identified low outliers
a: constant characteristic of the plotting position formula (a = 0, 0.4, and 0.5 for the Weibull, Cunnane, and Hazen formulas, respectively)
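A minimal computational sketch of Eqs. (10)-(16) is given below; it assumes no zero flows or low outliers (L = 0), base-10 logarithms, a hypothetical 15-year systematic record, and a single extraordinarily large flood treated as the largest event of an assumed 100-year historical period.

```python
# Historically weighted moments and plotting positions (Eqs. 10-16), illustrative only.
import numpy as np

def b17h_moments(systematic_peaks, historic_peaks, H, L=0):
    """Historically adjusted mean, standard deviation, and skew of the log peaks."""
    x = np.log10(systematic_peaks)                   # X: logs of systematic peaks
    xz = np.log10(historic_peaks)                    # X_Z: logs of historic peaks / adjusted outliers
    N, Z = len(x), len(xz)
    W = (H - Z) / (N + L)                                                # Eq. (13)
    M = (W * x.sum() + xz.sum()) / (H - W * L)                           # Eq. (10)
    S = np.sqrt((W * ((x - M) ** 2).sum() + ((xz - M) ** 2).sum())
                / (H - W * L - 1))                                       # Eq. (11)
    G = ((H - W * L) / ((H - W * L - 1) * (H - W * L - 2))
         * (W * ((x - M) ** 3).sum() + ((xz - M) ** 3).sum()) / S ** 3)  # Eq. (12)
    return M, S, G

def historically_weighted_pp(n_systematic, n_historic, H, L=0, a=0.5):
    """Historically weighted plotting positions in percent for ranks E = 1..Z+N."""
    W = (H - n_historic) / (n_systematic + L)          # Eq. (13)
    E = np.arange(1, n_historic + n_systematic + 1)    # events ranked largest to smallest
    m = np.where(E <= n_historic, E,
                 W * E - (W - 1) * (n_historic + 0.5))                  # Eqs. (14)-(15)
    return (m - a) / (H + 1 - 2 * a) * 100                              # Eq. (16), Hazen for a = 0.5

# Hypothetical example: 15-year systematic record plus one extraordinarily large flood
# treated as the largest event of an assumed 100-year historical period.
systematic = np.array([120.0, 85.0, 240.0, 310.0, 95.0, 150.0, 60.0, 175.0,
                       220.0, 134.0, 98.0, 265.0, 188.0, 142.0, 76.0])
historic = np.array([410.0])
M, S, G = b17h_moments(systematic, historic, H=100)
print(f"adjusted log-space mean={M:.3f}, std={S:.3f}, skew={G:.3f}")
print(historically_weighted_pp(len(systematic), len(historic), H=100))
```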

The MLM was used to estimate the parameters of the LN3 distribution, except on the rare occasions when an MLM solution could not be obtained, in which case historically weighted moments were used as a backup. A least-squares algorithm similar to that of Houghton (1978) was used to estimate the parameters of the WAK distribution (Öztekin 2011). The length of the historical period was unknown and was determined using sensitivity analysis. The statistical high-outlier threshold was used to determine the historical-adjustment threshold. In this step, ARD and MSD were used to identify the best-fit probability distribution and the best empirical plotting position formula, respectively.
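The sensitivity analysis on the unknown historical period length can be sketched as follows. The sketch assumes the b17h_moments and historically_weighted_pp helpers and the hypothetical systematic and historic arrays from the preceding sketch are in scope; the candidate range of H and the use of SciPy's standardized Pearson type III deviates as LP3 frequency factors are also assumptions, not the CFA configuration actually used.

```python
# Trial historical period lengths are scored by the ARD of an LP3 fit against the
# historically weighted Hazen plotting positions; the H with minimum ARD is retained.
import numpy as np
from scipy import stats

def ard_for_period(H, systematic, historic, a=0.5):
    """Mean absolute relative deviation (Eq. 1) of an LP3 fit for a trial period H."""
    M, S, G = b17h_moments(systematic, historic, H)
    peaks = np.sort(np.concatenate([historic, systematic]))[::-1]     # ranked floods
    F = 1.0 - historically_weighted_pp(len(systematic), len(historic), H, a=a) / 100.0
    q_fit = 10 ** (M + stats.pearson3.ppf(F, G) * S)                  # LP3 quantiles
    return np.mean(np.abs((q_fit - peaks) / peaks))                   # ARD, Eq. (1)

candidates = range(60, 201, 10)                     # trial historical period lengths (years)
best_H = min(candidates, key=lambda H: ard_for_period(H, systematic, historic))
print(f"Optimum historical period (minimum ARD): {best_H} years")
```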

4 Results and Discussion

Basic characteristics of the flows at the hydrometric stations, for the observed floods and their natural logarithms, are presented in Table 2. The results of the detection of extraordinarily large floods are presented in Table 3. The G-B test identified one high-outlier flood at the 5% significance level at the Tangrah and Galikesh hydrometric stations. The S-M test identified three high-outlier floods at the 5% significance level at the Tangrah hydrometric station. No low-outlier floods were identified at any station.

Table 2 Basic characteristics of the flows in hydrometric stations
Table 3 Results of determining outliers in hydrometric stations

Figure 3 shows the normal probability plots for the observed floods and their natural logarithms. The extreme tails of the normal probability plots indicate the existence of one high outlier at the Galikesh and Tamer stations and three at the Tangrah station. Figure 3 also illustrates that the natural-logarithm transformation renders the data from all stations approximately normal. The outlier tests and the probability plots together indicate one high outlier at the Tamer and Galikesh stations and three at the Tangrah station. It should be noted that the statistical outlier threshold at the Tamer station was 1747 m³/s in the G-B test and 1311 m³/s in the single S-M test at the 10% significance level. These thresholds correspond to the mean plus 13 and 9.5 standard deviations, respectively, which are very high limits.

Fig. 3 Normal probability plots: (a) observed data and (b) natural logarithms of the observed data

A synopsis of the identified best-fit probability distributions, together with the MSD and ARD values, is presented in Table 4. The goodness-of-fit results show that the superior distribution at the Tamer hydrometric station for the complete data series (without historical adjustment) is the WEI distribution, whereas for the historically adjusted case the LP3 distribution provides the best fit at all hydrometric stations. The best parameter estimation method for the complete series (without historical adjustment) after deleting the extraordinarily large floods was the unbiased PWM method. B17H was adopted to estimate the LP3 distribution parameters in the adjustment procedure. The best empirical plotting position formula was the Hazen formula (a = 0.5).

Table 4 Results of selecting the best distribution functions based on the ARD and MSD criteria

Note that ARD and MSD were calculated both for the extraordinarily large floods only and for all flood observations; the best-fit selection was based on the average of these results. Figure 4 illustrates the variation of the ARD and MSD indices at the hydrometric stations for the LP3 distribution and the Hazen relationship. The optimum historical periods obtained for the Tamer, Tangrah, and Galikesh hydrometric stations were 100, 150, and 90 years, respectively. It was found that the choice of a (the constant parameter of the plotting position formula) had a strong effect on ARD and MSD. The goodness-of-fit tests indicate that the adjustment for extraordinarily large floods produced a substantial and significant improvement in the results. The return periods for the largest floods are shown in Table 4, and the observed and estimated flows for the hydrometric stations are shown in Fig. 5. For instance, based on Table 4, the ratio of Q1000 without the adjustment procedure to Q1000 with the historical adjustment procedure at the Tamer, Tangrah, and Galikesh hydrometric stations was 0.74, 8.1, and 0.46, respectively. This result shows that a lack of attention to the real position of extraordinarily large floods leads to under- or overestimation. Although confidence intervals are usually determined on the basis of sampling uncertainty, they may also be illustrated by plotting limit curves at multiples of the standard error above and below the fitted frequency curve; an interval of ±1.65 standard errors, which is often used, roughly defines a 90% confidence interval. Figure 5 also shows the confidence intervals of the calculated return periods. The mathematical basis for combining standard errors from two or more sources is presented in IACWD (1982).

Fig. 4 Variations of the ARD and MSD for the stream gauging stations

Fig. 5 Observed and estimated flows, and the 90% confidence intervals, for the stream gauging stations

5 Conclusion

An extraordinarily large flood peak within a period of systematic record causes problems in the selection of the best-fit flood frequency distribution and in parameter estimation. The present study suggests that extraordinarily large floods be treated as historic flood data even when no recorded historical information is available. The length of the historical period may be unknown and can be determined using sensitivity analysis. In this study, LP3 was the best-fit probability distribution for the historical adjustment procedure, and the best plotting position formula was the Hazen equation. Two procedures using extraordinarily large floods (with and without historical adjustment) were compared in terms of goodness of fit and return period flood quantiles. The comparison indicates that the historical adjustment procedure is a viable alternative to the without-adjustment procedure from an operational perspective, simply because the without-adjustment procedure effectively ignores the largest floods. Ignoring the real position of extraordinarily large floods in the without-adjustment procedure, or excluding these floods from the analysis, produced unreasonable results (over- and underestimation). These results could substantially improve the estimation of the probabilities of rare floods for the efficient design of hydraulic structures, risk analysis, and floodplain management. Further study of the expected moments algorithm (EMA) for adjusting extraordinarily large floods, and of the use of probable maximum flood (PMF) values to adjust the FFA procedure, is recommended.