1 Introduction

Knowing the fluvial loads and concentrations of pollutants is essential to establishing the state of a waterbody and its tributary watershed and to assessing trends in water quality. Load estimates of several parameters, such as Total Nitrogen (TN), Total Phosphorus (TP), and Total Suspended Sediments (TSS), are used to establish budgets for nonpoint and point pollution sources and to design remediation/mitigation strategies. The regulatory framework in the United States used to establish total maximum daily loads (TMDLs) for impaired waterbodies was created in response to the Clean Water Act and its modifications (FWPCA 2002). The framework depends heavily on accurate and reliable computation of fluvial loading to control degradation and restore designated use(s). Computation of loads, particularly for regulatory compliance, is still not standardized and may never be, because it is based on data, and the resources allocated to collect data vary widely.

There are several sampling methods, each with strengths and weaknesses, that may be employed for estimating loads. Methods that rely on extensive sampling produce reliable load estimates but are resource-intensive, whereas methods that do not need extensive sampling are often inaccurate, particularly at shorter timescales. Decision-makers may try to balance the desired accuracy of load estimation against the costs of obtaining the estimates for their region (the optimization decision). The difficulty lies in expressing the desired accuracy as a cost that can be compared with the cost of obtaining accurate loads. One potential way to estimate the value of the desired accuracy is to use the cost of the pollution-abatement management measures that may be driven by the optimization decision.

With the success that the United States has enjoyed in controlling the majority of point sources, further reduction of fluvial loads for regulatory purposes requires nonpoint (diffuse) pollution abatement measures, which are often expensive and frequently unreliable. If best management practices (BMPs) with limited control ability and those that are practically infeasible in producing results (e.g., pet waste management education, control of illicit discharges, and reduction in urban growth) are excluded from consideration, the lifecycle costs per pound (including capital and operational costs) of controlling nonpoint pollution of TN, TP, and TSS are $151–$14,449, $1,851–$70,342, and $4–$69, respectively, for some areas of the Chesapeake Bay watershed region (CWP 2013). The wide ranges reflect the type of BMP, its efficiency in controlling the pollutant, and other local conditions such as soil type and land value. Further, there are theoretical reasons to believe that the cost of reducing impairment follows a convex shape, where the cost per unit reduction first decreases with economies of scale and then increases with diminishing marginal returns on the resources invested in impairment reduction (Wainger 2012). It is reasonable to believe that in most scenarios, where the easier, cheaper methods have already been implemented, the economies of scale have been exhausted and further reductions will require significantly higher costs per unit reduction of impairment.

1.1 Fluvial Loads Sampling and Computation Methods

The most accurate estimates of loads may be obtained using continuous measurements of flow rates and in-stream concentrations of the water quality parameters of interest. Near-continuous recording (every 15 min to hourly) of flow may be done using a variety of techniques, and some parameters, such as temperature, nitrate-nitrogen, and dissolved oxygen, may also be measured near-continuously. However, the measurement of many parameters requires laboratory analysis, which practically rules out long-term high-frequency water quality measurement. Nevertheless, with judiciously frequent water quality measurements, excellent estimates of fluvial loads may be made (He et al. 2018; Johnes 2007; Kronvang and Bruhn 1996; Kumar et al. 2013; Moyer et al. 2012; Park and Engel 2015; Stenback et al. 2011). For example, the concentration of any parameter in a stream typically does not change much during non-storm baseflow periods; thus, very good to excellent load estimates may be obtained for non-storm periods using weekly or bi-weekly water quality sampling. During storm events, when the concentrations of the parameters of interest (and, consequently, the load) may be expected to vary rapidly, compositing methods that yield an event mean concentration (EMC) for the storm may be employed to obtain accurate storm-event loads.

In resource-constrained scenarios, reasonably frequent composite storm or baseflow sampling is not feasible, and regression methods are often used to estimate constituent concentrations. Regression-based methods estimate the constituent concentration by relating flow and other readily measurable parameters to the concentration of the constituent of interest (He et al. 2018; Kumar et al. 2013). Typically, regression-based methods require only calendar-based (e.g., monthly) sampling, as the measured concentration data are used solely to calibrate a regression equation. The putative trade-off of this approach is reduced reliability of the estimated loads in exchange for lower cost.

There is considerable evidence that calendar-based sampling methods do not perform adequately in smaller watersheds, and even in larger watersheds significant differences have been found in long-term studies (Horowitz et al. 2015; Kumar et al. 2013; Robertson and Roerish 1999; Stenback et al. 2011). To improve regression-based fluvial load estimation beyond plain calendar-based sampling, different sampling methods, such as hydrologically based sampling, storm chasing, sampling on the rising or falling limb of the hydrograph, and adaptive cluster sampling, have been used with varying degrees of success (Arabkhedri et al. 2010; Horowitz et al. 2015; Robertson and Roerish 1999; Sadeghi et al. 2008; Sadeghi and Saeidi 2010).

The United States Geological Survey (USGS) has developed a modified weighted-regression method, Weighted Regression on Time, Discharge, and Season (WRTDS), to address some of the issues with regression-based fluvial load estimation schemes (Hirsch et al. 2010). The WRTDS method has been shown to perform well in several scenarios (Beck and Hagy 2015; Lee et al. 2016; Sprague et al. 2011; Zhang et al. 2016). The good performance of WRTDS when used with modified calendar-based sampling, and the prevalence of the method, particularly after its adoption by the USGS, motivated its use in this study.

1.2 Decision Making Framework for Monitoring

A comprehensive framework for developing a monitoring program (Fig. 1) relies on linked relations between several components, including the natural system under observation, the objective of the monitoring program, the sampling scheme, the field collection and analysis methods, and cost-effectiveness (Maher et al. 1994). The natural system under investigation and the desired objective are used to derive an observed indicator; the cost-effectiveness and sampling scheme can then be optimized. Broad monitoring design plans, such as the US TMDL Effectiveness Monitoring Plan and the EU Water Framework Directive guidance on monitoring plans (Allan et al. 2006), are often too general to be directly applicable to choosing among the options available for the processes in Fig. 1. In this study, we develop and assess a simplified framework relating cost-effectiveness and the performance of the modeling method to the choice of a sampling method for resource-constrained monitoring operations.

Fig. 1 A simple framework for designing a sampling program, adapted from Maher et al. (1994)

2 Study Area

This study utilized data collected at two stations, ST30 and PR01, marked in Fig. 2, which also shows their drainage areas. Data for both stations were obtained from the Occoquan Watershed Monitoring Laboratory (OWML). ST30, on Broad Run in northern Virginia, drains an area of about 2.29 × 10² km², and PR01, on the Potomac River, drains an area of about 2.9 × 10⁴ km². The much smaller ST30 watershed is relatively uniform in elevation and slope, whereas the PR01 watershed spans four states, includes parts of the Appalachian Mountains, and has much greater variation in elevation and slope.

Fig. 2 Location of the two stations used in this study. The watershed of the PR01 station (on the Potomac River) spans four states and the District of Columbia; the Little Falls Dam is a short distance upstream of the PR01 station. The ST30 station (on Broad Run) drains the smaller watershed abutting the Potomac River watershed but not part of it

3 Methods

3.1 Load Computation

The two methods of load computation utilized in this study are:

1. Direct Method, which used the OWML dataset of weekly/bi-weekly water quality data for the three parameters of interest during non-storm flows and EMC for storm events, along with near-continuous (15 min to hourly) flow measurements. The direct method represents the best model for estimating load with extensive sampling schemes.

2. WRTDS Method, which is based on a weighted-regression technique described by Hirsch et al. (2010). In this study, for the WRTDS method, we used the same OWML weekly/bi-weekly water quality data during non-storm flows but used discrete samples that were also taken during storm events, along with the daily average flow. The WRTDS method is the representative method used for estimating loads with modified calendar-based sampling schemes.

3.1.1 Direct Method

The direct method of fluvial load computation is an extension of the first principles of load calculation. In this method, a water quality concentration is assigned to every recorded flow reading. For non-storm flows, regular periodic discrete sample measurements (weekly, bi-weekly) are used to interpolate concentrations at every point where flow is recorded (usually hourly during non-storm flows). For storm events, the flow-composite EMC value is assigned to all recorded storm-time flows; see Kumar et al. (2013) for more details on the composite sampling method employed. Loads for the desired period are then computed by aggregation. In this paper, we consider this load (Direct Load or DL) to be the reference, and all bias is computed relative to it.
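As an illustration, a minimal Python sketch of the direct method follows. It assumes a pandas Series `flow` of near-continuous discharge, a Series `grab` of weekly/bi-weekly grab-sample concentrations, and a table `storms` of storm windows with their flow-composite EMCs; all names and column labels are hypothetical, and the sketch omits the quality-control steps of an operational program.

```python
import pandas as pd

def direct_load(flow, grab, storms):
    """Direct-method load: interpolate grab-sample concentrations onto the
    flow record, overwrite storm windows with their composite EMCs, and
    integrate flow x concentration over time.

    flow   : pd.Series of discharge (m^3/s), datetime-indexed (~hourly)
    grab   : pd.Series of non-storm concentrations (mg/L), datetime-indexed
    storms : pd.DataFrame with columns ['start', 'end', 'emc'] (mg/L)
    """
    # Interpolate weekly/bi-weekly grab samples to every flow timestamp.
    conc = grab.reindex(grab.index.union(flow.index)).interpolate(method="time")
    conc = conc.reindex(flow.index)

    # Assign the flow-composite EMC to all recorded storm-time flows.
    for s in storms.itertuples():
        conc.loc[s.start:s.end] = s.emc

    # Load per time step: Q (m^3/s) * C (mg/L = g/m^3) * dt (s) -> grams.
    dt = flow.index.to_series().diff().dt.total_seconds().fillna(0)
    load_g = flow * conc * dt

    # Aggregate to daily loads in Kg/day.
    return load_g.resample("D").sum() / 1000.0
```

Monthly or annual loads are then obtained by further aggregation of the daily series.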

3.1.2 WRTDS Method

The WRTDS method uses a five-parameter equation (Eq. 1) to estimate concentration from flow and time. This approach differs from most other regression methods in that the parameters of Eq. 1 (\( {\hat{\beta}}_0 \) to \( {\hat{\beta}}_4 \)), instead of being fixed, are estimated for every combination of Q (daily average flow) and t (time) at which concentration is to be computed. Thus, a unique set of coefficients is estimated for every combination of Q and t in the period of record. The advantage of this approach is that it remains unbiased while fitting a wider class of regression surfaces than the parametric functions used in most other regression methods. The data set used to compute \( {\hat{\beta}}_0 \) to \( {\hat{\beta}}_4 \) is weighted based on the distance from Q and t at the estimation point. A weight (w) is computed for each of three distances, a) time, b) season, and c) discharge, using the “tri-cube weight function” (Eq. 2), and the net weight is taken as the product of these three weights. A 7-year half-window width is used for the time (trend) distance, a 0.5 (decimal time) half-window width for the seasonal distance, and a 2.0 (ln[Q]) half-window width for the discharge distance. These half-window widths (h) are similar to those used and found to be optimal by the USGS for non-tidal load computation in the Chesapeake Bay watershed (Hirsch et al. 2010; Moyer et al. 2012). All load computations for the WRTDS method were performed using the R package Exploration and Graphics for RivEr Trends (EGRET) developed by the USGS [https://github.com/USGS-R/EGRET/wiki, Access Date: 02/14/2019]. A much more detailed explanation of the WRTDS method may be obtained from the same webpage and the USGS publication by Hirsch and De Cicco (2015). A sketch of the weighting and estimation step follows Eq. 2 below.

$$ \ln (c)={\hat{\beta}}_0+{\hat{\beta}}_1\ln (Q)+{\hat{\beta}}_2(t)+{\hat{\beta}}_3\sin \left(2\pi t\right)+{\hat{\beta}}_4\cos \left(2\pi t\right)+\varepsilon $$
(1)

where

c: is the concentration (mg/L)

Q: is the observed daily flow (m³/s)

t: is the decimal time (years)

\( {\hat{\beta}}_0 \) to \( {\hat{\beta}}_4 \): are the regression coefficient estimates

ε: is the unexplained variation

$$ w=\left\{\begin{array}{ll}{\left(1-{\left(\frac{d}{h}\right)}^3\right)}^3 & \mathrm{if}\ \left|d\right|\le h\\ 0 & \mathrm{if}\ \left|d\right|>h\end{array}\right. $$
(2)

where

w: is the weight

d: is the distance from the estimation point to the data point

h: is the half-window width
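To make Eqs. 1 and 2 concrete, a minimal Python sketch of the weighting and estimation step is given below. The array names are assumptions, the half-window defaults follow the values quoted above, and the sketch omits the retransformation bias correction and other refinements of the EGRET implementation, which should be used for actual load computation.

```python
import numpy as np

def tricube(d, h):
    """Tri-cube weight (Eq. 2): (1 - (|d|/h)^3)^3 inside the half-window, else 0."""
    d = np.abs(d)
    return np.where(d <= h, (1.0 - (d / h) ** 3) ** 3, 0.0)

def wrtds_estimate(t, lnq, lnc, t0, lnq0,
                   h_time=7.0, h_season=0.5, h_flow=2.0):
    """Estimate ln(c) at the point (t0, lnq0) by locally weighted regression (Eq. 1).

    t, lnq, lnc : sample decimal times, ln(daily flows), and ln(concentrations)
    t0, lnq0    : estimation point; h_* : half-window widths quoted in the text
    """
    # Three distances: overall time, season (wrapped to <= 0.5 yr), and ln(Q).
    d_time = t - t0
    frac = np.abs(t - t0) % 1.0
    d_seas = np.minimum(frac, 1.0 - frac)
    d_flow = lnq - lnq0

    # Net weight is the product of the three tri-cube weights.
    w = tricube(d_time, h_time) * tricube(d_seas, h_season) * tricube(d_flow, h_flow)

    # Weighted least squares on the five-parameter surface of Eq. 1.
    X = np.column_stack([np.ones_like(t), lnq, t,
                         np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], lnc * sw, rcond=None)

    x0 = np.array([1.0, lnq0, t0, np.sin(2 * np.pi * t0), np.cos(2 * np.pi * t0)])
    return x0 @ beta  # estimated ln(concentration) at (t0, lnq0)
```

Repeating this fit for every (Q, t) combination in the period of record yields the unique coefficient sets described above.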

Calibration and Performance of WRTDS

To assess the performance of the WRTDS model, the coefficient of determination (r²) between observed and predicted concentrations was computed. The computation was on daily concentrations, not loads. An r² value greater than 0.6 is considered satisfactory for water quality modeling (Donigian 2002). Several other, more detailed methods of assessing performance, such as residual analysis, are available in the EGRET package; only r² was used in this study because the other, often graphical, methods are difficult to apply in a simplified decision-making scenario.
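As a sketch, the performance check used here reduces to computing r² between observed and WRTDS-predicted daily concentrations; the array names below are hypothetical.

```python
import numpy as np

def r_squared(obs, pred):
    """Coefficient of determination (r2) between observed and predicted values."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    ss_res = np.sum((obs - pred) ** 2)        # residual sum of squares
    ss_tot = np.sum((obs - obs.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# e.g., accept the model if r_squared(c_obs, c_wrtds) > 0.6 (Donigian 2002)
```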

3.1.3 Load Comparisons

Load flux at the two stations was computed by the two methods, WRTDS (WL) and the Direct Load Method (DL), for three parameters (TN, TP, and TSS) at three averaging timescales (annual, monthly, and daily). A total of thirty-six (36) fluvial load time series were computed (2 stations × 2 methods × 3 parameters × 3 averaging times) for the period from 1989 to 2003. This fifteen-year period was chosen because during it OWML was collecting discrete samples during storm events (necessary for calibrating WRTDS) at ST30; discrete storm sampling was discontinued in 2004.

Matched-pair comparisons were made to establish the difference in the daily, monthly, and annual load fluxes computed by the two approaches. Because normality of the paired differences in flux could not be established (even after log transformation), a non-parametric matched-pair signed-rank test (R wilcox.test) was performed comparing log-transformed DL with WL.

The matched-pair signed-rank test, in which the differences between the two datasets are tested against a null hypothesis of zero median, was used to identify statistically significant (α = 0.05) differences. An unbiased magnitude-of-difference (δ) between loads was calculated with the Hodges-Lehmann estimator, as suggested by Helsel and Hirsch (2002), to estimate the difference over the fifteen-year period. For matched pairs, the Hodges-Lehmann estimator (Eq. 3) is computed as the median of the Walsh averages (all pairwise averages) of the paired differences. For the daily time series, the difference data were found to be serially correlated, and an autoregressive AR(1) model was found suitable for the daily differences based on the autocorrelation and partial autocorrelation function plots (not shown). Testing was therefore performed on ‘pre-whitened’ data, as discussed in von Storch (1995) for AR(1) removal: pre-whitening was performed using Eq. 4, and the residual time series was used for testing. A sketch of this testing sequence follows the definitions below.

$$ \delta =\operatorname{median}\left\{\frac{X_i+{X}_j}{2};\kern0.5em 1\le i\le j\le n\right\} $$
(3)

where

δ: is the Hodges-Lehmann estimator

\( {X}_i,{X}_j \): are the paired (log-scale) differences

$$ {X}_t^{\prime }={X}_t-{r}_1{X}_{t-1} $$
(4)

where

Xt: is the daily difference at time t

\( {X}_t^{\prime } \): is the residual (pre-whitened) series used for testing

r1: is the lag-1 sample serial correlation
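A minimal Python sketch of this testing sequence (Eqs. 3 and 4), using scipy.stats.wilcoxon and synthetic stand-in data, might read as follows; `dl` and `wl` are hypothetical matched arrays of daily loads.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
dl = np.exp(rng.normal(5.0, 1.0, 365))        # synthetic direct loads (Kg/day)
wl = dl * np.exp(rng.normal(0.05, 0.2, 365))  # synthetic WRTDS loads, slight bias

# Paired log-scale differences (Delta = ln(DL) - ln(WL)).
x = np.log(dl) - np.log(wl)

# Pre-whitening (Eq. 4): remove lag-1 serial correlation before testing.
r1 = np.corrcoef(x[:-1], x[1:])[0, 1]         # lag-1 sample serial correlation
x_pw = x[1:] - r1 * x[:-1]

# Matched-pair Wilcoxon signed-rank test against a zero median difference.
stat, p_value = wilcoxon(x_pw)

# Hodges-Lehmann estimator (Eq. 3): median of the Walsh averages of the paired
# differences, retransformed to a multiplicative magnitude-of-difference.
i, j = np.triu_indices(len(x))                # all pairs with i <= j (O(n^2) pairs)
delta = float(np.exp(np.median((x[i] + x[j]) / 2.0)))
print(f"p = {p_value:.4f}, delta = {delta:.3f}")
```

Because the differences are computed on a natural log scale, the retransformed δ is multiplicative: δ = 0.9, for example, indicates that WL exceeds DL by roughly 10%.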

To estimate flow-independent concentration trends, the Kendall-Theil robust slope was computed on the residuals obtained after a LOWESS function was used to explain variations in average concentration with the observed average flows for the period. These trends were computed for all cases. All statistical tests were performed using well-established R (wilcox.test) and Python (theilslopes in scipy.stats) libraries. Further data partitioning based on flow (Low, Medium, and High) was done to analyze the difference in loads computed by DL and WL for different categorical flows. For this partitioning, low flows were defined as periods (daily, monthly, or annual) where the average flows (averaged over the timescale of interest) fell between the 0th and 25th percentiles of observed flows, medium flows between the 26th and 74th percentiles, and high flows between the 75th and 100th percentiles.
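A minimal sketch of the flow-independent trend computation and the flow partitioning follows, using the scipy and statsmodels counterparts of the libraries named above; the LOWESS smoothing fraction is an illustrative assumption.

```python
import numpy as np
from scipy.stats import theilslopes
from statsmodels.nonparametric.smoothers_lowess import lowess

def flow_independent_trend(t, conc, flow, frac=0.5):
    """Kendall-Theil robust slope on LOWESS residuals of concentration vs. flow.

    t    : decimal time (years);  conc : average concentrations (mg/L)
    flow : matching average flows (m^3/s);  frac : LOWESS smoothing fraction
    """
    # Explain concentration variation with flow, then work with the residuals.
    fitted = lowess(conc, flow, frac=frac, return_sorted=False)
    resid = conc - fitted

    # Robust slope of residuals against time, with 95% confidence bounds.
    slope, intercept, lo, hi = theilslopes(resid, t, alpha=0.95)
    return slope, (lo, hi)

def flow_partition(flow):
    """Label each period Low/Medium/High by percentile of the observed flows."""
    q25, q75 = np.percentile(flow, [25, 75])
    return np.where(flow <= q25, "Low", np.where(flow < q75, "Medium", "High"))
```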

4 Results

To calibrate the WRTDS model and compute WL, an average of 43 and 52 samples per year was used for ST30 and PR01, respectively. In addition to the discrete samples used to calibrate WRTDS, flow-composite sampling was performed during the study period for the computation of DL. Based on r², all models (except TP and TSS at ST30) showed acceptable results (> 0.6) on the daily timescale. The TP and TSS models at ST30, with r² of 0.55 and 0.50, were still used because the WRTDS-estimated loads represent the best method of computing loads from a modified calendar-based sampling regime where only daily flows are available. It may be noted that WRTDS is not recommended for small flashy watersheds where flows may vary substantially within a day (Hirsch and De Cicco 2015).

Figures 3 and 4 show the differences between the two load computation methods at the two stations for the three timescales. For both ST30 and PR01, the interquartile range (represented by the width of the box plot) of the difference between the two methods is smallest for the annual timescale, followed by the monthly and then the daily timescale. The spread of the difference, quantified as the interquartile range, is smaller for TN than for TP and TSS at both stations. The majority of the differences [Δ = ln(DL) − ln(WL)] are negative at the daily timescale at both stations, suggesting that WL > DL. For the monthly and annual timescales, the median difference is closer to zero.

Fig. 3 Load flux and the log differences observed at ST30. The vertical boxplots show the log-transformed flux in Kg/day, and the horizontal boxplots show the difference (Δ) computed as ln(DL) − ln(WL)

Fig. 4 Load flux and the log differences observed at PR01. The vertical boxplots show the log-transformed flux in Kg/day, and the horizontal boxplots show the difference (Δ) computed as ln(DL) − ln(WL)

Table 1 shows the significance of the differences and the retransformed multiplicative magnitude-of-difference (δ). Because the differences were computed on a natural log scale, they are multiplicative, not additive. A statistically significant δ is observed for all parameters on the daily timescale, except for TN at PR01. At the monthly timescale, δ for TP and TSS at PR01 are significant. At the annual timescale, δ for TN at PR01 and TSS at ST30 were found to be statistically significant.

Table 1 The statistical significance of the difference between the two methods of flux computation and the magnitude of the difference. Note that the sample sizes for the annual, monthly, and daily comparisons are 15, 180 (15 × 12), and 5477, respectively

Table 2 shows the slopes of the flow-independent concentration trendlines. None of the annual-data slopes by either method was found to be significantly different from zero. Significant trends were observed in the monthly and daily flow-independent data, except for TN at PR01 on the monthly timescale. The significance and the direction of the trendline, represented by the sign of the slope, were similar for both methods across all timescales and parameters. The magnitudes of the computed slopes varied widely for TSS but were largely similar (within the 95% confidence bounds) for TN and TP.

Table 2 Slopes (changes in mg/L per year) of the trend line along with the 95% confidence bound. Note that a negative slope shows a decline of the parameter and a positive slope an increase. The magnitude of the slope represents the average change over the study period for the parameter of interest

Analysis of the difference between the methods after flow partitioning (high, medium, and low) suggests that the differences vary with the flow partition, although no pattern is evident. As observed in Figs. 3 and 4, the difference in the prediction of TN is smaller than that of TP and TSS for all computed scenarios. With DL as the reference, WL seems to under-predict loads for all three parameters on the annual timescale at ST30 during low flows; WL predictions at other times are more closely aligned with DL. At PR01, on the annual timescale for TP and TSS, there is wide under-prediction and over-prediction by WL at low and high flows, respectively. TN at PR01 on the annual timescale seems to be invariant to the flow partitions.

5 Discussion

In natural systems, it is reasonable to expect that the larger and more sudden variations in concentration and flow occur during storm events. The sampling method used for DL captures all storm event loads and is therefore likely to yield loads closer to the true values. Thus, all load bias discussed here is referenced to DL.

The performance of the modified calendar-based sampling regime represented by WL for PR01 seems to be good (r² > 0.6). The low r² for TP and TSS at ST30, however, calls into question the applicability of modified calendar-based sampling for load computation at ST30. ST30, with its shorter storm-associated flow events (less than a day), may be classified as a small watershed, and for small watersheds a model based on daily flows may not capture all variations. It is important to recognize that loads are the product of flow and concentration, and the much larger flow values (measured more reliably) may mask the poor performance of the modified calendar-based sampling. Increasing the load averaging timescale from daily to monthly or annual may also improve performance.

The values of δ close to 1 for both ST30 and PR01 at the annual and monthly timescales (Table 1), except for TSS at ST30 at the annual averaging timescale, suggest a lack of systematic bias in the loads computed with WL relative to the DL reference. For TSS at ST30 on the annual timescale, WL overestimates loads by about 29% over the study period. On the daily timescale, however, evidence of systematic bias exists: 5 of 6 comparisons are statistically significant, and 4 of 6 show more than a 10% difference. Some uncertainty is expected in sample collection and in the wet-chemistry testing of the various constituents. For small watersheds, Harmel et al. (2006) quantified the “typical-minimum” net uncertainty of TN, TP, and TSS load measurement to be on the order of 11%, 8%, and 7%, respectively, which can serve as a reference for evaluating the bias in the loads estimated by WL (the corresponding typical maximums are 70%, 110%, and 53%, and the typical averages 29%, 30%, and 18%). Using the typical minimum, a conservative assumption, most of the long-term loads at the annual and monthly timescales are within the limits of the errors expected in DL (Table 2). Daily loads are generally more variable and fall outside the typical-minimum values.

The flow-independent trends produced by the two methods are similar in direction and statistical significance. The magnitudes of the trend slopes are similar (not identical, but within the confidence bounds) except for TSS at ST30. The low overall δ and the similar trends indicate that for long-term planning, where variation in any one period is not very important, modified calendar-based sampling may be a good substitute for extensive sampling in both the large PR01 and the small ST30 watersheds. The statistical formulation used to compute WL also allows direct computation of flow-normalized trends (distinct from the flow-independent trends discussed here), which may make the method more attractive for long-term trend analysis.

The WL method may not represent reality in cases where variation in each time period is important, such as calibrating water quality models at a daily time step for source allocation of loads. Regression methods and WRTDS have been used in the Chesapeake Bay watershed to calibrate parts of the Bay Model (Easton et al. 2017). For a watershed model calibrated with WL at a daily time step, it is reasonable to expect that the calibration will try to account for the wide daily discrepancy (from DL), resulting in an inadequate system representation as parameters are adjusted to match WL. The performance of WRTDS for the smaller ST30 watershed (low r² for two of the three parameters) strongly suggests that WRTDS should be avoided for smaller watersheds. Hirsch and De Cicco (2015) have suggested that WRTDS with a daily time step may not be appropriate for small flashy watersheds, defined as those “where discharge at the stream gage commonly changes by an order of magnitude or more within a given day.”

As discussed in the introduction, strategic sampling during wet or dry weather has been suggested as a way to improve the performance of regression methods. Our results based on flow partitioning suggest that it is not easy to determine when to sample more (i.e., at which flows load computation errors are frequent). In the two watersheds studied, the difference between WL and DL varied with parameter, flow, and station in such a manner that no discernable correlation could be observed. It is important to note that the number of samples per year used to drive WRTDS in this study, on average 43 for ST30 and 52 for PR01, is high for a typical resource-constrained watershed monitoring program. Even for the extensively monitored nine large rivers in the Chesapeake Bay watershed, fewer than 20 samples per year are collected on average, with wide variation among parameters; the WRTDS protocol recommends at least 20 samples per year (12 calendar-based and 8 stormflow). The number of samples collected per year from 1985 to 2010 was as low as 6 for suspended sediments for some rivers and as high as 41 for TN and TP at the Potomac River (Moyer et al. 2012). No attempt has been made in this study to assess the impact of a reduction in samples; with fewer samples, the results of this study may not hold and would likely be worse for regression-based methods. Shorter-term parallel sampling (sampling intensively with EMCs for all storms and grab samples for some storms, allowing computation of both WL and DL), as suggested by Kumar et al. (2013), may be used to identify and correct systematic biases in regression-based methods.

The choice of sampling method is often based on available resources. From an economic perspective, the costs associated with the construction and upkeep of nonpoint BMPs may be used as a surrogate for the benefits obtained from the investment in sampling. In a watershed where low-cost nonpoint BMP options such as street sweeping are used, the extensive sampling needed to estimate DL may not be necessary. As the control cost per pound of pollutant rises, investment in extensive sampling becomes more justifiable, either to control cost if the regression methods are over-predicting or to confirm the loads and ensure attainment of load targets. Even the slight but statistically significant over-prediction by the WRTDS method for TN at PR01 on the annual timescale (δ = 0.9) may imply roughly 2400 Kg/day of load to be remediated. In a typical TMDL allocation scenario, this load would be distributed among point and non-point sources. Given that the median urban stormwater TN remediation cost is about 3160 $/Kg (James River watershed), even a fraction of the 2400 Kg/day assigned to the urban region would imply an extremely high excess cost; at the full 2400 Kg/day, 2400 × 3160 ≈ $7.6 million per day.

Without an analysis such as the one presented in this study, there is no good way to understand the biases in loads predicted by modified calendar-based sampling, and the cost of remediation if WL over-predicts is very high. It may be argued that, from a policymaker’s perspective, there is no scenario in which modified calendar-based sampling should be used without an intensive-sampling comparison. In cases where WL under-predicts loads, the result may be non-attainment of the designated use: an undesirable and often expensive outcome. Nevertheless, a simple decision-making framework (Fig. 5) for whether to use modified calendar-based sampling may rely on the performance of the regression equation, the type of control measures available, and the estimated bias from using the regression equation; a schematic sketch of this logic is given after Fig. 5. If the performance of the regression equation is ‘acceptable’ and ‘inexpensive’ control measures are available, the application of WL may be justified. The ‘acceptable’ threshold can be based on the literature (e.g., r² > 0.6); ‘inexpensive’ is more subjective and may depend on the location being analyzed. If either condition is not met, a rigorous regime of ‘parallel sampling’ may be undertaken, in which sampling allows the computation of both DL and WL. Kumar et al. (2013) estimated that 12–72 months of parallel sampling may be required to compute an accurate load estimate, depending on the parameter and the size of the watershed. Finally, based on the bias estimate and a comparison with the errors typically expected in load computation, a decision on whether to use modified calendar-based sampling may be made. This framework can be applied to any new or existing monitoring program, especially where non-point control measures may be required in the short term. The strength of the framework is its simplicity and the ability to assess bias objectively; however, it may often recommend parallel sampling that requires monitoring for years before a long-term method can be adopted.

Fig. 5 A simple decision-making framework to assess the applicability of models that use modified calendar-based sampling data
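To make the decision logic of Fig. 5 concrete, a schematic Python sketch follows; the thresholds (r² > 0.6 from the literature, an illustrative ‘inexpensive’ cost cut-off, and a bias tolerance loosely based on the typical-minimum uncertainties of Harmel et al. 2006) are assumptions for illustration, not prescriptions.

```python
def sampling_decision(r2, control_cost_per_kg, bias=None,
                      r2_ok=0.6, cheap_cost=50.0, bias_tol=0.11):
    """Schematic of the Fig. 5 decision logic (all thresholds illustrative).

    r2                  : calibration performance of the regression model
    control_cost_per_kg : lifecycle BMP cost ($/Kg) for the parameter
    bias                : |multiplicative bias - 1| from parallel sampling,
                          or None if parallel sampling has not yet been done
    """
    if r2 > r2_ok and control_cost_per_kg <= cheap_cost:
        # Acceptable model and inexpensive controls: WL may be justified.
        return "use modified calendar-based sampling (WL)"
    if bias is None:
        # 12-72 months of parallel sampling (Kumar et al. 2013) yields DL and WL.
        return "undertake parallel sampling to estimate bias"
    if bias <= bias_tol:
        # Bias within the errors typically expected in load computation.
        return "use modified calendar-based sampling (WL)"
    return "retain extensive sampling (DL)"
```

In practice, the cost cut-off and bias tolerance would be set from local BMP cost data and the errors expected in DL for the watershed in question.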

6 Conclusion

In this study, for the watersheds analyzed, a statistically significant lack of similarity between the loads computed by modified calendar-based sampling and the loads computed via extensive sampling was shown at the annual and monthly timescales for four comparisons: PR01 (TN annual; TP and TSS monthly) and ST30 (TSS annual). There was not enough evidence to reject similarity for the eight other scenarios: PR01 (TN monthly; TP and TSS annual) and ST30 (TN and TP monthly and annual; TSS monthly). Thus, if the goal of a monitoring program is to estimate long-term annual mean loadings for a TMDL (or similar) allocation scenario, either DL or WL may work for some parameters. For daily loads, five of six comparisons show a difference, with TN at PR01 the exception. Thus, if the goal is to estimate short-term daily loadings, for example to calibrate a water quality model that captures storm loads, WL is unlikely to provide reliable estimates. Overall, these results indicate agreement between the loads computed by WL and DL in exactly half of all cases (9 of 18).

The flow-independent trends showed better agreement between the methods than the fluvial loads did, with similar magnitudes for all statistically significant slopes and the same trend direction at the two stations for all three averaging timescales. Thus, if the goal of the monitoring program is to estimate long-term trends for assessing progress towards long-term targets, either WL or DL may be used.

Given that the remediation measures often needed to control non-point loads are very expensive, we have argued that parallel sampling, which allows load computation by both DL and WL, should be conducted for some period. Using a modified calendar-based sampling method directly, without that information, may not be advisable. Some degree of parallel sampling will allow watershed managers to perform comparisons like the one presented in this paper and to assess periodic and long-term bias. In the absence of any prior information, an exhaustive sampling method that enables the framework discussed should be employed.