Introduction

Standardized freshwater bioassessments are used worldwide to ascertain the relative health and quality of aquatic resources. Community structure-based assessment models such as multimetric indices (MMIs) are typically calibrated within states or regions, but some models have been developed nationally (e.g., Stoddard et al. 2008) and continentally (e.g., Hering et al. 2006). For decades, benthic macroinvertebrates have served as excellent indicators of biological health in flowing waters (Resh and Jackson 1993); they are the most widely studied group and are typically considered one of the most sensitive to anthropogenic disturbance. Macroinvertebrates can quickly recolonize habitats under improving chemical or physical conditions and are well-suited for use with MMIs. In the USA, the Federal Water Pollution Control Act, also known as the Clean Water Act (CWA) of 1972, is a comprehensive statute with the stated Congressional objective to “restore and maintain the chemical, physical, and biological integrity of the Nation’s waters.” Many federal, state, and tribal entities in the USA use multimetric indices to assess this biological integrity for CWA programs. Since 2002, the West Virginia Department of Environmental Protection (WVDEP) has used a statewide macroinvertebrate multimetric index (developed using family-level macroinvertebrate data) for biological assessments called the West Virginia Stream Condition Index (WVSCI) (Gerritsen et al. 2000). The index uses six metrics: total no. of families, no. of Ephemeroptera + Plecoptera + Trichoptera (EPT) families, % EPT individuals, family-level Hilsenhoff Biotic Index (HBI), % 2 dominant families, and % Chironomidae individuals. WVDEP has used the WVSCI to characterize patterns of stream condition and to measure biological attainment of the aquatic life use (ALU). In the development of the WVSCI, Gerritsen et al. (2000) found no distinct natural classification patterns at the family level (based on an original set of 109 reference sites); thus, the WVSCI is not tailored to geographic or seasonal variation and has been used statewide over one broad index period (April–October). However, Gerritsen et al. (2000) recommended a re-evaluation of classification and metric suitability when more detailed genus-level data became available.

While the scientific debate over the cost-effectiveness of finer taxonomic resolution continues (Jones 2008), it is widely accepted that genus- or species-level data more accurately represent the composition of the aquatic community and increase our ability to detect a variety of impacts. In the USA, Carter and Resh (2001) found that most government agencies favored genus- and species-level taxonomy. Some research acknowledges that coarser family-level data can detect severe and obvious degradation in streams (Lenat and Resh 2001; Bailey et al. 2001), but genus- and species-level information can also detect more subtle and complex effects and impairments (Lenat and Resh 2001; Schmidt-Kloiber and Nijboer 2004; Arscott et al. 2006). Several states adjacent to WV currently use genus-level MMIs (KY, OH, MD, and PA). Internationally, genus- or species-level taxonomy is widely used in MMIs (e.g., Vlek et al. 2004; Baptista et al. 2007; Verandes and Cortes 2010), but family-level taxonomy is frequently used for multivariate predictive models (e.g., Reynoldson et al. 2001). In southern WV, Pond et al. (2008) compared family- and genus-level metrics and taxonomy in assessing the effects of surface coal mining and reported that genus-level metrics detected impacts more effectively than the WVSCI and its component metrics.

Our objectives were to develop and validate a genus-level index of most probable stream status (GLIMPSS) for bioassessment of West Virginia streams. This paper describes the refinement of WV’s assessment tools, the goal of which was to create robust and easy-to-apply MMIs that account for natural variation and can accurately distinguish between reference and environmentally stressed benthic communities within WV streams.

Methods

Environmental setting and data collection

Woods et al. (1996) described the WV landscape as highly forested and topographically dissected, encompassing four level III ecoregions: Blue Ridge (ecoregion 66), Ridge and Valley (ecoregion 67), Central Appalachian Mountains (ecoregion 69), and Allegheny Plateau (ecoregion 70). WVDEP estimates that over 88,000 km of streams (1:24,000-scale National Hydrography Dataset) occur within the state. Because of the high relief and topographic dissection, much of this stream length consists of 1st- and 2nd-order headwater streams (1:24,000 scale). Three quarters of WV is composed of sedimentary sandstones and shales (ecoregions 69 and 70); much of the remaining, higher-relief area in the eastern part of the state (ecoregion 67) is folded sedimentary and metamorphic rock, and a very small portion of the state (<2 %) lies within ecoregion 66. WV contains diverse temperate mesophytic and hardwood forests where mineral resource extraction (e.g., coal, oil, and gas), silviculture, agriculture, and urban development are some of the most common sources of impacts to aquatic resources.

All data collection followed WVDEP standard operating procedures (WVDEP 2011). Briefly, biologists collected macroinvertebrates from riffle/run habitats within a 100-m reach using a 0.5-m-wide rectangular frame kick net (500 μm mesh). A composite of four 0.25 m2 kick samples represented approximately 1 m2 of stream bottom substrate. Taxonomists identified organisms to genus or the lowest practical level (e.g., Nematoda) from sorted random 200-organism (±20 %) subsamples. Taxa were coded for various traits (pollution tolerance value, functional feeding group, habit) based on published literature (e.g., Lenat 1993; Barbour et al. 1999; Merritt et al. 2008) and further adapted from neighboring state lists. Within the 100-m reach, field crews measured in situ water quality (dissolved oxygen, pH, specific conductance, and temperature) and assessed habitat using US EPA Rapid Bioassessment Protocol (RBP) Habitat Assessment procedures following Barbour et al. (1999). This latter procedure qualitatively evaluates habitat components such as epifaunal substrate quantity and quality, degree of embeddedness, sediment deposition, channel alteration, stream bank stability, bank vegetative protection, and riparian zone width. Water chemistry data availability varied depending on WVDEP’s programmatic survey types (e.g., probabilistic, long-term trend monitoring, targeted total maximum daily load monitoring). Metals were collected at 69 % of all sites (concurrent with benthic sampling), while nutrients were collected at 54 % of all sites. In contrast, fecal coliform bacteria samples were analyzed at 90 % of all sites. A full suite of water chemistry parameters (e.g., nutrients, metals, bacteria, ions) was collected at approximately 50 % of all sites at the time of benthic sampling.

Dataset

The genus-level dataset (n = 3,737) represented 2,354 unique streams and 3,411 unique stations sampled from March–October, 1999–2009. Streams with more than one station averaged 3.9 km between stations. Overall, the dataset consisted of approximately 33.3 % probabilistically selected sites and 66.7 % targeted sites. We removed samples containing <100 individuals from the development dataset. While most of these low-abundance samples indicate severe chemical or habitat impairment, they might also include samples influenced by drought or spate conditions. As standard protocol, WVDEP reviews all available information to help evaluate whether samples with <100 individuals are related to natural or anthropogenic factors prior to assessing the site; conservatively, we excluded these low-abundance sites from MMI development. Therefore, benthic samples used in this analysis ranged from 100 to 240 individuals. We also omitted larger rivers (>150 km2) and true limestone streams (which make up <1 % of stream kilometers in the state). We excluded same-day duplicate samples and any additional samples collected from the same site within a 5-year period from the development dataset to reduce bias and confounding. To estimate within-site variability due to spatial patchiness, annual changes, and method error, we analyzed duplicates and annual revisits separately. We modeled a stressor gradient using only those samples with a full suite of water chemistry and habitat data (1,617 of the 3,737, 70.3 % of which were probabilistically selected). Prior to any analyses, the dataset was randomly divided into calibration (CAL; 70 % of all sites) and validation (VAL; remaining 30 % of all sites) sets.
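For illustration, the screening rules and the random calibration/validation split described above could be implemented roughly as follows. This is a minimal sketch assuming a pandas DataFrame with hypothetical column names (site_id, abundance, catchment_km2, limestone, sample_date); it is not the WVDEP database code, and the same-day duplicate rule is subsumed by the 5-year window for brevity.

```python
import pandas as pd

# Hypothetical input: one row per benthic sample
samples = pd.read_csv("wv_benthos_samples.csv", parse_dates=["sample_date"])

# Drop low-abundance samples, large rivers, and true limestone streams
screened = samples[(samples["abundance"] >= 100)
                   & (samples["catchment_km2"] <= 150)
                   & (~samples["limestone"])]

# Keep one sample per site per 5-year window (earliest visit retained)
screened = screened.sort_values("sample_date")
keep, last_kept = [], {}
for idx, row in screened.iterrows():
    prev = last_kept.get(row["site_id"])
    if prev is None or (row["sample_date"] - prev).days > 5 * 365:
        keep.append(idx)
        last_kept[row["site_id"]] = row["sample_date"]
development = screened.loc[keep]

# Random 70/30 split into calibration (CAL) and validation (VAL) sets
cal = development.sample(frac=0.70, random_state=1)
val = development.drop(cal.index)
```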

The development of a biological assessment MMI uses a reference condition approach (Hughes 1995; US EPA 1996; Barbour et al. 1999; Stoddard et al. 2006). The sensitivity of indicator metrics was evaluated by comparing the range of values among all reference sites in a class to sites known to be stressed by chemical or physical factors. The WVDEP relies on a combination of quantitative and qualitative physical and chemical attributes and narrative criteria to identify reference-quality streams (Table 1). Additionally, we selected candidate reference sites by examining historic data (if available) and by consulting with regional professionals who have knowledge of local streams. To be classified as reference, a site must meet all of the conditions listed in Table 1. However, in areas where high-quality reference sites are scarce, a site could be listed as reference even if it failed one of the criteria, provided best professional judgment indicated that the site could be included without degrading the reference condition classification. Thus, we would characterize most WV reference sites as “minimally disturbed” but retained sites considered to be “least disturbed” (after Stoddard et al. 2006).

Table 1 Screening criteria used to select reference and stressed sites for metric testing

Following EPA guidance (US EPA 1996; Barbour et al. 1999), the data were first divided into reference (REF), stressed (STRESS), and other, or non-reference (non-REF), sites. We defined STRESS sites as meeting at least one of the abiotic criteria (physical and chemical) shown in Table 1. These criteria are similar to the original WVSCI stress site criteria (Gerritsen et al. 2000) and cover a broad range of potential stressor variables across WV. Non-REF sites included all other sites that were not classified as either REF or STRESS; they were used in combination with REF and STRESS sites for deriving disturbance gradients, metric correlations, and metric standard values, and in precision analyses. Overall (combined CAL and VAL), the data represented 391 REF, 2,384 non-REF, and 962 STRESS sample sites (see Table 1). Figure 1 depicts the distribution of REF and STRESS sites among WV ecoregions and major catchments.

Fig. 1
figure 1

Locations of calibration and validation reference and stressed sites used for index development, distributed among major WV watersheds and ecoregions. Numbers represent level III ecoregion codes

Data analysis

Community classification

Macroinvertebrate assemblages were grouped into natural classes based on inferences from community similarity analysis and ordinations (Barbour et al. 1999). We evaluated several combinations of seasonal and geographic factors to explain natural variability in macroinvertebrates found at REF sites. Seasonal categories (e.g., spring and summer) were biologically relevant (i.e., they relate to known life histories of native taxa; e.g., Sweeney 1984). Geographical stratification followed level III ecoregions after Woods et al. (1996). We evaluated the effect of stream size (e.g., mean stream width) on community structure and final GLIMPSS scores with correlation analysis. Classification schemes were evaluated with mean similarity analysis (MEANSIM) (Van Sickle 1997) and non-metric multidimensional scaling (NMDS) (McCune and Grace 2002). The following combinations of strata were evaluated: level III ecoregion (67, 69, and 70); season (spring (March through May), summer (June–early October)); bioregion (combined level III mountain ecoregions (67/69), plateau (70)); level III ecoregion × season; and bioregion × season.

We calculated classification strength using MEANSIM with a Bray–Curtis similarity matrix of reference site communities (Hawkins and Norris 2000) by comparing the average within-class similarity (W) to the average between-class similarity (B) among strata combinations. The classification strength (CS) is simply W minus B. NMDS of REF sites was performed with PC-ORD™ (v. 6, Gleneden Beach, OR) and although this ordination method is non-parametric, genus-level invertebrate abundances were log (x + 1) transformed to reduce any effect of skewed abundance distributions on the outcome of the ordination. NMDS was performed using the Bray–Curtis coefficient on six dimensions with 250 real runs and 250 randomized runs. Taxa observed at less than 2.5 % of all REF sites were omitted from the ordination (NMDS based on 158 taxa out of 322 total taxa). By excluding these infrequent taxa, multivariate analyses are more robust and patterns are more evident (McCune and Grace 2002). These omitted taxa were added back into the calculations for all metrics and the final GLIMPSS.
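As a rough illustration of the classification-strength calculation, the sketch below pools all within-class and between-class pairwise similarities; Van Sickle's (1997) MEANSIM weights classes somewhat differently, so this is an approximation, and the array and label names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def classification_strength(abund, classes):
    """CS = mean within-class similarity (W) minus mean between-class
    similarity (B), using Bray-Curtis similarity of log(x + 1) abundances."""
    sim = 1.0 - squareform(pdist(np.log1p(abund), metric="braycurtis"))
    n = len(classes)
    within, between = [], []
    for i in range(n):
        for j in range(i + 1, n):
            (within if classes[i] == classes[j] else between).append(sim[i, j])
    W, B = np.mean(within), np.mean(between)
    return W, B, W - B

# abund: REF sites x taxa abundance array; classes: stratum label per site
# W, B, CS = classification_strength(abund, classes)
```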

Metric selection and index construction

Forty-one biological metrics were calculated from queries programmed in the WVDEP database (see Appendix 1 for a complete list). Ecological attributes included six categories: richness, composition, dominance, tolerance, trophic or functional feeding group (FFG), and habit (Barbour et al. 1999). Taxon traits (e.g., pollution tolerance values, habit, and feeding) are coded in the WVDEP database and derived from published literature (e.g., Merritt et al. 2008) or neighboring state attribute tables (e.g., PA and KY); trait metrics were based on WVDEP-specific designations. Evaluation of metrics followed published techniques (e.g., Blocksom 2003; Hering et al. 2006; Whittier et al. 2007; Stoddard et al. 2008; Blocksom and Johnson 2009). After classification, we evaluated metrics within each stratum by running them through a step-wise performance process. This process evaluated sensitivity (i.e., ability to distinguish between REF and STRESS sites), response to a human disturbance gradient, sufficient range and scope for detecting impairment, and lack of redundancy with other metrics. Representation of the six ecological categories was considered during the metric selection process.

In the first step, we evaluated metric sensitivity using discrimination efficiency (DE) (Gerritsen et al. 2000). Percent DE was calculated as the number of STRESS site metric values that fell below the reference-set 25th centile (or above the 75th centile for negative-response metrics), divided by the total number of STRESS sites and multiplied by 100. We removed metrics with DE values of less than 65 % unless the metric offered additional ecological information desired for the index (e.g., habit, trophic, and tolerance) and passed the other selection criteria (see below). Metrics with the highest DE were considered for inclusion in the multimetric index.
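The DE calculation described above can be expressed compactly as follows (a minimal sketch; function and variable names are illustrative):

```python
import numpy as np

def discrimination_efficiency(ref_values, stress_values, direction="positive"):
    """Percent of STRESS-site metric values beyond the REF 25th centile
    (below it for positive metrics, above the 75th for negative metrics)."""
    ref_values = np.asarray(ref_values, dtype=float)
    stress_values = np.asarray(stress_values, dtype=float)
    if direction == "positive":       # metric decreases with stress
        threshold = np.percentile(ref_values, 25)
        hits = np.sum(stress_values < threshold)
    else:                             # metric increases with stress
        threshold = np.percentile(ref_values, 75)
        hits = np.sum(stress_values > threshold)
    return 100.0 * hits / len(stress_values)

# A candidate metric is kept when its DE is at least 65 %, e.g.:
# discrimination_efficiency(ref_ept_richness, stress_ept_richness) >= 65
```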

Second, metrics significantly (p < 0.01) and at least moderately correlated (|r| ≥ 0.25) with the disturbance gradient were retained for further evaluation. We modeled a human disturbance gradient from linear combinations of abiotic factors at 1,617 sites using principal components analysis (PCA) (McCune and Grace 2002). We evaluated individual metric response along the disturbance gradient (PC1 scores) using Pearson correlation. The variables used in the PCA included same-time measures of pH; log transformations of temperature, dissolved oxygen, fecal coliform, conductivity, sulfate, chloride, total phosphorus, nitrite–nitrate, total suspended solids, total aluminum, total iron, and total manganese; and 7 of the 10 EPA RBP habitat metrics (channel flow status, velocity regime, and frequency of riffles were excluded).
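A minimal sketch of this step is shown below. It assumes the abiotic variables have already been log-transformed where appropriate and standardizes them before the PCA (i.e., a correlation-based PCA), which the text does not specify; names are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def disturbance_gradient(env):
    """PC1 site scores from a PCA of standardized abiotic variables
    (env: sites x variables array). The sign of PC1 is arbitrary and may
    need to be flipped so that higher values indicate more disturbance."""
    z = StandardScaler().fit_transform(env)
    return PCA(n_components=3).fit_transform(z)[:, 0]

def metric_response(metric_values, pc1_scores):
    """Pearson correlation of a candidate metric with the disturbance gradient."""
    return pearsonr(metric_values, pc1_scores)   # returns (r, p)

# Metrics are retained when p < 0.01 and |r| >= 0.25
```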

Third, we evaluated redundancy between metric pairs using Pearson correlation (pairs with r > 0.75 were considered highly correlated) and further compared pairs using scatterplots. For pairs exceeding 0.75, both metrics were judged to add information and were retained for further analysis if nonlinear relationships were apparent or if there was sufficient dispersion (i.e., scatter) of the paired metric data (Barbour et al. 1999).

The fourth step removed metrics with limited range: richness metrics consisting of <5 taxa and abundance metrics with a range of <10 % (Klemm et al. 2003; Blocksom and Johnson 2009). The final step in metric selection evaluated a form of variability referred to as “scope of impairment” (SOI) (modified after Klemm et al. 2003; Blocksom and Johnson 2009). We examined boxplots of the REF site metrics within individual strata. Metrics were rejected if the interquartile range of REF values was greater than the range between zero and the lower quartile, that is, an interquartile coefficient of >1.
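The redundancy and variability screens from the last three steps could be sketched as follows (illustrative names; the scatterplot review of redundant pairs remains a manual step):

```python
import numpy as np
import pandas as pd

def redundant_pairs(metric_table, cutoff=0.75):
    """Flag metric pairs whose Pearson |r| exceeds the redundancy cutoff.
    metric_table: DataFrame of candidate metric values (columns) by site (rows)."""
    corr = metric_table.corr(method="pearson").abs()
    cols = list(corr.columns)
    return [(a, b, corr.loc[a, b])
            for i, a in enumerate(cols)
            for b in cols[i + 1:]
            if corr.loc[a, b] > cutoff]

def scope_of_impairment(ref_values):
    """Interquartile coefficient of REF metric values: IQR divided by the
    lower quartile; metrics with a value > 1 are rejected."""
    q1, q3 = np.percentile(ref_values, [25, 75])
    return np.inf if q1 == 0 else (q3 - q1) / q1

# redundant_pairs(metric_table) lists pairs needing scatterplot review;
# scope_of_impairment(ref_metric_values) <= 1 passes the variability screen
```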

Metric scoring and aggregation into an index

The most sensitive and responsive, and least redundant and variable, metrics were aggregated for each stratum so that indicators could best contribute to the final GLIMPSS. These metrics included “positive” (i.e., increase with improving water or habitat quality) and “negative” (decrease with improving quality) scoring metrics. For scoring purposes, metrics were normalized by first calculating the 5th and 95th centiles of each metric based on all sites within each stratum. Using these upper and lower centiles (ceilings and floors), we scored metrics following Blocksom (2003), Hering et al. (2006), Whittier et al. (2007), and Blocksom and Johnson (2009). Normalization (scoring) of a positive-direction metric was based on the following equation: (metric value − floor)/(ceiling − floor)*100; negative-direction metrics were scored as: (ceiling − metric value)/(ceiling − floor)*100. This resulted in dimensionless and equally weighted metric scores on a scale of 0–100. Scores greater than 100 (metric values beyond the ceiling in the direction of better condition) were capped at 100, and negative scores (metric values beyond the floor in the direction of worse condition) were set to zero. All metric scores were then averaged to produce the final index score (after Gerritsen et al. 2000). Four interpretive condition categories (very good, good, degraded, and severely degraded) were established based on REF distributions. GLIMPSS impairment thresholds were based upon REF 5th centiles within each stratum (separating “good” from “degraded”). Very good conditions were assigned when scores exceeded the 25th centile of REF scores; scores <50 % of each impairment threshold represented severely degraded conditions.
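The scoring and aggregation formulas above translate directly into code (a minimal sketch with illustrative names):

```python
import numpy as np

def score_metric(values, floor, ceiling, direction="positive"):
    """Normalize a metric to 0-100 using stratum-specific 5th/95th centiles
    (floors and ceilings)."""
    values = np.asarray(values, dtype=float)
    if direction == "positive":
        scores = (values - floor) / (ceiling - floor) * 100.0
    else:
        scores = (ceiling - values) / (ceiling - floor) * 100.0
    return np.clip(scores, 0.0, 100.0)   # cap at 100, floor at 0

def glimpss_score(metric_scores):
    """Average the normalized metric scores (metrics x sites) into the index."""
    return np.mean(metric_scores, axis=0)

# For each stratum, the impairment threshold is then the 5th centile of the
# index scores at REF sites, e.g.: np.percentile(ref_index_scores, 5)
```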

Index performance: sensitivity and responsiveness

GLIMPSS impairment thresholds were based upon the 5th centile of the reference distribution within each stratum. The CAL GLIMPSS was evaluated for DE and response (i.e., correlation) to the human disturbance gradient (PC1 from the 1,617-site stressor-gradient dataset) in each stratum. We then used the independent VAL dataset to calculate the classification efficiency (CE), or the ability of GLIMPSS to correctly assign sites to either reference or stress categories (Southerland et al. 2007). CE differs from DE in that it was calculated as the number of VAL REF sites scoring above the 5th centile of the development REF distribution, plus the number of VAL STRESS sites scoring below that threshold, divided by the total number of VAL sites. We also compared VAL sites to the PC1 disturbance gradient with correlation analysis.
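Expressed as code, the CE calculation is simply (illustrative names):

```python
import numpy as np

def classification_efficiency(val_ref_scores, val_stress_scores, threshold):
    """Percent of validation sites correctly classified: REF sites scoring
    above the development REF 5th-centile threshold plus STRESS sites
    scoring below it, divided by the total number of validation sites."""
    correct = (np.sum(np.asarray(val_ref_scores) > threshold)
               + np.sum(np.asarray(val_stress_scores) < threshold))
    total = len(val_ref_scores) + len(val_stress_scores)
    return 100.0 * correct / total
```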

Reproducibility

GLIMPSS measurement error and associated precision (a type of performance measure) were estimated within individual strata from same-day duplicates collected at 90 sites, following Stribling et al. (2008). First, the GLIMPSS mean square error (MSE, or variance) was calculated; the square root of the MSE (the RMSE) provides the standard deviation, which is an estimate of measurement error. We used the standard deviation to calculate 90 % confidence intervals (90 % CI) around a single observation using the standardized z-score. The coefficient of variation (%CV) and relative percent difference (RPD) followed standard calculations. Estimates of precision are essential to the interpretation of the index because they help identify variability from sampling method and laboratory processing and can be used to compare precision among field teams or between entities. Estimates of measurement error also provide a means to compare GLIMPSS scores between single site observations. To estimate natural temporal and method variability, we repeated this analysis with GLIMPSS scores for 30 REF sites revisited in different years within the same index period. These estimates can be used to detect long-term trends in GLIMPSS scores at individual sites through time.
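A minimal sketch of these precision statistics is given below; the pooled-duplicate variance estimator and the 1.645 multiplier for a 90 % CI are common choices assumed here, not taken verbatim from Stribling et al. (2008).

```python
import numpy as np

def precision_from_duplicates(scores_a, scores_b, z=1.645):
    """Precision statistics from paired (same-day duplicate) GLIMPSS scores."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    diffs = a - b
    mse = np.sum(diffs ** 2) / (2 * len(diffs))    # pooled variance of duplicates
    rmse = np.sqrt(mse)                            # standard deviation (measurement error)
    ci90 = z * rmse                                # 90 % CI half-width, single observation
    grand_mean = np.mean(np.concatenate([a, b]))
    cv = 100.0 * rmse / grand_mean                 # coefficient of variation (%)
    rpd = np.mean(200.0 * np.abs(diffs) / (a + b)) # mean relative percent difference
    return {"RMSE": rmse, "90% CI": ci90, "%CV": cv, "RPD": rpd}
```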

Relationship between WVSCI and GLIMPSS

To compare the GLIMPSS with the WVSCI, the WVSCI impairment threshold (5th centile) was first updated using all current REF sites used in GLIMPSS development (391 vs. the original 109 used by Gerritsen et al. 2000); this increased the original threshold from 68.0 to 71.6. We then calculated and compared each index’s impairment rate (i.e., the number of sites assessed as impaired, both overall and within individual strata).

GLIMPSS with family-level Chironomidae

Within each stratum, we developed a second index, termed GLIMPSS (CF), differing only in the use of family-level vs. genus-level taxonomic identification of a group of dipterans known as Chironomidae (non-biting midges). Although it is well known that some chironomid genera are responsive and can indicate certain stressors (Rosenberg 1992), chironomid taxonomy increases sample processing time and requires additional expertise that may be lacking in some laboratories. We used the identical datasets used to calibrate and validate the GLIMPSS (i.e., CAL and VAL; REF, non-REF, and STRESS) to re-analyze the GLIMPSS (CF) metrics, with all chironomid genera in the benthic enumeration table first converted (collapsed) to the family level. Stratum-specific GLIMPSS metrics that involve chironomid genera include: no. of total genera, no. of intolerant genera (pollution tolerance value (PTV) < 4), no. of intolerant genera (PTV < 3), no. of clinger genera, HBI, % tolerant (PTV > 6), % 5 dominant genera, no. of scraper genera, and no. of shredder genera. These metrics were then recalculated with chironomids collapsed to the family level. The % Orthocladiinae metric, a chironomid sub-family metric, was automatically omitted from the GLIMPSS (CF); two highly comparable analogs (% Chironomidae and % Chironomidae + Annelida) were tested as potential replacement metrics. All other metrics that comprise the GLIMPSS were retained and unmodified (e.g., no. of Ephemeroptera genera, % EPT (minus Cheumatopsyche)).
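The genus-to-family collapse step could look like the following sketch, assuming an enumeration table with one row per taxon, hypothetical 'family' and 'taxon' columns, and one count column per sample (the actual WVDEP table structure is not specified here):

```python
import pandas as pd

def collapse_chironomids(bugs):
    """Collapse all chironomid genera to the family level prior to
    recalculating the GLIMPSS (CF) metrics."""
    bugs = bugs.copy()
    is_midge = bugs["family"] == "Chironomidae"
    bugs.loc[is_midge, "taxon"] = "Chironomidae"          # replace genus names
    count_cols = [c for c in bugs.columns if c not in ("family", "taxon")]
    return bugs.groupby(["family", "taxon"], as_index=False)[count_cols].sum()
```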

Metric re-testing and confirmation involved checking the DE, PCA correlation, redundancy magnitude, range, and SOI of the newly modified GLIMPSS metrics from the CAL dataset. The upper (95th) and lower (5th) centiles for metrics were recalculated from the full dataset and GLIMPSS scores were calculated for all CAL and VAL sites. As in the development of the full GLIMPSS, the GLIMPSS (CF) was similarly evaluated for sensitivity and responsiveness. Finally, we compared the full GLIMPSS and the modified GLIMPSS (CF) models using scatterplots and Pearson correlation coefficients.

Results

Community classification

MEANSIM showed that bioregion–season (CS = 18 %) and ecoregion–season (CS = 17 %) had the highest classification strengths at REF sites (Table 2). Ecoregion and bioregion alone produced the lowest CSs (10 and 12 %, respectively), indicating the stronger influence of seasonality on structuring REF benthic communities in these regions. In the NMDS analysis, a three-dimensional solution was fitted after 110 iterations. Axes (i.e., dimensions) 1 and 2 were 100 % orthogonal and explained most of the variance between the original distance matrix and the final configuration (28 and 23 %, respectively), and stress was relatively low (18.9 %). Axis 3 explained only 11 % of the variance and was not plotted. Visual inspection of the NMDS ordinations (Fig. 2a–d) showed results comparable to the MEANSIM analysis (i.e., classification strength can be visualized in the ordinations). The best ordination was bioregion–season (Fig. 2d), which had the least scatter across strata and good clustering within strata. Although ecoregion–season had high CS, there was considerable overlap of ecoregion 67 and 69 sites in both seasons (Fig. 2c).

Table 2 Similarity analysis (MEANSIM) results for strata combinations using Bray–Curtis coefficients from REF calibration sites based on log-transformed (x + 1) abundance data showing within (W) and between (B) similarity
Fig. 2
figure 2

NMDS ordination of CAL reference assemblages for ecoregion (a), season (b), ecoregion–season (c), and bioregion–season (d). Data in parentheses report the percent variance of axis contribution to the ordination. Bioregions include mountains (combined ecoregions 66, 67, and 69) and plateau (ecoregion 70). MT Sp Mountain Spring, MT Su Mountain Summer, PL Sp Plateau Spring, PL Su Plateau Summer. Dashed lines in (d) drawn by eye for emphasis

Some environmental factors (e.g., elevation and Julian day) were more highly correlated with NMDS axis scores than others (e.g., longitude and latitude). Intuitively, bioregion–season can be largely explained by elevation (i.e., mountain areas of ecoregions 67 and 69 vs. lower elevations in ecoregion 70) and Julian day (spring vs. summer index period). The strong significant influence of Julian day and temperature on axis 1 (r = 0.70 and 0.72, respectively) confirms that benthic communities changed substantially along a seasonal (spring to summer) continuum. Stream width, longitude, latitude, elevation, pH, and specific conductance were not significantly correlated (p > 0.05) with axis 1 scores (all |r| < 0.28). Only elevation was significantly correlated with axis 2 scores (r = 0.49, p < 0.05), as characterized by the separation of mountain and plateau regions along axis 2.

The ordination results indicated that splitting the data into common seasons (i.e., spring and summer) in addition to bioregions provided good separation. Thus, we chose four classification strata for index development purposes: Mountain Spring (MT Sp; CAL n = 86 sites), Mountain Summer (MT Su; CAL n = 126 sites), Plateau Spring (PL Sp; CAL n = 31), and Plateau Summer (PL Su; CAL n = 28 sites).

The stressor gradient

A synthetic human disturbance gradient was modeled with PCA. Table 3 lists the eigenvalues, percent variance explained by each axis, and factor coefficients of each environmental variable on the respective axes. PC axis 1 explained roughly 26 % of the variance in the dataset (eigenvalue = 5.1) and was judged suitable for use as a human disturbance gradient. Specific conductance had the highest correlation of the chemical variables with axis 1 (+0.47), followed closely by temperature (+0.46). However, daily temperatures fluctuate widely, and the correlation based on instantaneous readings might simply reflect regional and seasonal patterns that were largely controlled for by our selected stratification. Habitat metrics were also strongly correlated with axis 1 (+0.58–0.69). Measures of pH and total metals were mostly correlated with axis 2 (>+0.50), but this axis accounted for only 13 % of the total variance (eigenvalue = 2.6). Appendix 1 summarizes general stratum-specific abiotic data for REF and STRESS sites.

Table 3 PCA results (eigenvalues and percent variance accounted for) and factor correlation coefficients of the first three components based on 1,617 sites

Metric evaluation and selection

The stepwise selection process yielded GLIMPSS models that incorporate 10 metrics in both MT strata (spring and summer), 8 metrics in the PL Sp stratum, and 9 metrics in the PL Su stratum. We selected one or more metrics from each of the six ecological categories in all strata, except PL Sp, where no FFG or dominance metric met the metric test criteria. Table 4 summarizes metric performance (i.e., sensitivity, responsiveness, and redundancy) by bioregion–season for the selected metric sets. See Appendix 2 for stratum-specific results from all metric testing.

Table 4 List of selected GLIMPSS metrics by stratum with %DE, correlation to PCA axis 1 (p < 0.05), and maximum correlation value for selected metric pairs as a measure of redundancy (Redun; absolute value)

While many metrics performed consistently, other metrics showed wide ranges of sensitivity across strata, indicating differences in biological potential from seasonal and geographic factors or sensitivity to region-specific stressors. For example, the % 5 dominant taxa metric was very sensitive in the PL Su (DE = 74.8) but was rejected from further consideration in PL Sp (DE = 47.1). By comparison, % Orthocladiinae was very sensitive in the MT Sp (DE = 87.3) but had much less discrimination ability in the PL Su (DE = 60.2). Although no. of EPT genera had high DE in all strata, we chose to use independent measures of Ephemeroptera, Plecoptera, and Trichoptera genera where possible, in order to benefit from the known diagnostic capability of these individual metrics (Karr and Chu 1999). Moreover, the no. of Trichoptera genera metric was not sensitive in many strata (except MT Sp). While %EPT is a commonly used metric in many MMIs, it was apparent that component orders were not always discriminatory (e.g., % Trichoptera and % Plecoptera components failed DE and SOI in many strata). However, excluding the tolerant and abundant caddisfly Cheumatopsyche (after Pond et al. 2003) often increased %EPT performance. As in the DE analysis, we observed differences in responsiveness (PC 1 correlation) between metrics across strata. The most notable of these was no. of scraper genera, which was significantly related to increasing disturbance in the MT Sp (r = −0.45) but was much weaker in the MT Su (r = −0.24). This could be because scraper richness naturally declines (i.e., reduced range) in summer when streams are more fully canopied, rather than depicting a loss of sensitivity to stress. Several metrics (e.g., HBI, no. of intolerant genera, and no. of clinger genera) consistently showed high correlation to PC 1 in all strata. Overall, there were few redundant pairs, and most metrics appeared to offer somewhat different information, as indicated by correlations well below 0.75. Metrics that had high DE and stressor responsiveness rarely showed redundancy. For example, in the PL Su, no. of Plecoptera genera was highly redundant with no. of intolerant genera (PTV < 3) (0.93); however, since no. of intolerant genera (PTV < 3) had higher range and DE, it was selected as the preferred metric. For the selected metric sets, maximum redundancy magnitudes (within each stratum) are reported in Table 4. Appendix 2 tabulates those metrics that passed DE and PCA tests but failed the SOI test. For instance, % scrapers had good DE (79 %) and moderate PC 1 correlation (r = −0.34) in the MT Sp, but had an unacceptable SOI of 1.42. In the MT Su stratum, % Ephemeroptera (minus Baetis) had a similarly good DE (77 %), but an SOI of 1.37. In these cases, we selected comparably sensitive and responsive metrics with acceptable SOIs. Out of all metrics, only % Annelida failed the range test. Within each stratum, we reported CAL metric upper (95th) and lower (5th) centiles (Table 5) and scored metrics for each site in the CAL dataset.

Table 5 Calibration standard values (ceilings = upper 95th, and floors = lower 5th centiles) for metrics (by stratum) used for CAL and VAL scoring purposes
Fig. 3
figure 3

Boxplots of calibration (CAL) and validation (VAL) GLIMPSS scores between REF, non-REF, and STRESS categories in MT Sp (a), MT Su (b), PL Sp (c), and PL Su (d). Number of sites in each stratum shown as n = CAL/VAL. Percent discrimination efficiency (DE) for CAL and percent classification efficiency (CE) for VAL also provided. Dashed line represents approximate 5th centile of REF, indicating the impairment threshold. See Fig. 2 and text for stratum definitions

Index performance

To examine the sensitivity of GLIMPSS, we plotted score distributions using box plots and then calculated DE within individual strata. GLIMPSS discriminated between REF and STRESS sites with a high degree of efficiency (>75 %) (Fig. 3). Non-REF sites fell into an intermediate position between REF and STRESS with respect to score distribution. DE was greatest in the PL Su stratum, where 89 % of the STRESS sites fell below the 5th centile of the REF distribution. The independent VAL (n = 118 REF sites, 288 STRESS sites across all strata) CE (i.e., % of correctly classified REF and STRESS sites) was excellent for all strata (Fig. 3), indicating successful validation (CE ranged from 89 to 95 %). GLIMPSS was also responsive to increasing stress (as indicated by PC 1); stratum-specific CAL and VAL GLIMPSS scores were significantly correlated (p < 0.001) with the disturbance gradient. Correlations with PC 1 ranged from −0.60 to −0.70 for both CAL and VAL MT Sp and MT Su sets; correlations in PL Su were −0.66 and −0.62 for CAL and VAL, respectively. PL Sp had the lowest correlation to disturbance at CAL and VAL sites (−0.48 and −0.42, respectively).

Fig. 4
figure 4

Box plots of REF vs. non-REF and STRESS site classes using GLIMPSS (CF) scores from full dataset (total number of sites per stratum and class provided) among MT Sp (a), MT Su (b), PL Sp (c), and PL Su (d) strata. Dashed line represents approximate 5th centile of REF distribution (impairment threshold). See Fig. 2 and text for stratum definitions

REF GLIMPSS relationships with natural gradients

Correlations of natural abiotic variables (Julian day, latitude, longitude, stream width, and elevation) with REF site GLIMPSS scores were weak (|r| < 0.22) and non-significant (p > 0.05), indicating that seasonal and geographical stratification effectively accounted for most natural variation. However, scores in the PL Su were significantly negatively correlated (albeit only moderately) with stream width (r = −0.44) and conductivity (r = −0.45). Although specific conductance can increase with watershed disturbance, it might also represent a gradient of geologic or edaphic variability within this ecoregion.

Final metric scoring

Data from the CAL and VAL sets were combined and upper (95th) and lower (5th) centiles were recalculated to produce final GLIMPSS scoring benchmarks for the four strata. Appendix 3 (Table 11) provides final GLIMPSS upper (95th) and lower (5th) centiles for all metrics in each stratum. Overall, only minor adjustments from the CAL values resulted from using the larger dataset. A few metrics showed substantial variation across seasons (e.g., % EPT (minus Cheumatopsyche) and no. of intolerant genera in PL Sp vs. PL Su) or bioregions (e.g., no. of intolerant genera (PTV < 4) in MT Sp vs. PL Sp); however, other metrics were not considerably different across strata.

Reproducibility

Precision of GLIMPSS was satisfactory based on same-day replicate sampling (Table 6). PL Su had the highest variability of all strata with respect to %CV and RPD (although these statistics were highly dependent on the low population mean). Overall, estimates of measurement error (SD) were fairly stable across strata (Table 6). We also evaluated the precision of GLIMPSS scores at REF sites by comparing annual revisits to 30 sites within the same index period (Table 6). These sample events occurred between 1 and 8 years apart, and we assumed that REF site environmental condition was unchanged, so changes in the MMI should not result from anthropogenic disturbance. Differences in scores, however, do incorporate many factors: timing within an index period, preceding weather patterns, natural disturbance, and any unknown human impacts that may have occurred within the catchment between sample years. Despite these factors, SD, %CV, and RPD were very good and considerably lower than those from same-day duplicates.

Table 6 GLIMPSS precision estimates and statistics from same-day duplicates (n = 90) and annual revisits at REF sites (n = 30 sites) among all strata and within individual strata

Comparison of WVSCI and GLIMPSS

The indices differed in assessing impairment 15 % of the time; the majority of agreements were at the high- and low-end scoring sites. Based on the full 3,737-site dataset, GLIMPSS rated 2,218 sites as impaired vs. 2,088 for the WVSCI (130 fewer sites); however, there were stratum-specific differences in impairment rates. For example, in MT Sp (total n = 697) and MT Su (total n = 1,530), GLIMPSS assessed 10 and 9 % more impaired sites than WVSCI, respectively. In contrast, in PL Sp (total n = 653), GLIMPSS rated only 1 % more sites as impaired, while in PL Su (total n = 857), the family-level WVSCI rated 9 % more sites as impaired than the GLIMPSS. Using <50 % of the threshold to demarcate the worst condition category, the GLIMPSS rated 626 sites (out of 3,737) as “severely degraded,” whereas the WVSCI rated 258 sites.

Comparison of GLIMPSS vs. GLIMPSS (CF) performance

Using identical metric evaluation techniques, the process yielded sets of modified metrics that were as robust as those in the full model. The final GLIMPSS (CF) incorporates 10 metrics in the MT Sp, 9 metrics in the MT Su, 7 metrics in the PL Sp, and 9 metrics in the PL Su (see Appendix 3, Table 12). Using mean DE, mean correlation with PCA, and mean redundancy magnitude across metrics within each stratum (Table 7), we conclude that, on average, the GLIMPSS (CF) metrics performed similarly to, or slightly better than, the full GLIMPSS metric sets, indicating that the selected GLIMPSS (CF) metrics are highly suitable for assessment purposes. As with the full GLIMPSS, the 95th and 5th centiles of the CAL and VAL datasets were recalculated and all metrics were scored for each site within individual strata. Final upper (95th) and lower (5th) centiles for each GLIMPSS (CF) metric within strata are shown in Appendix 3 (Table 12). Similar to the full GLIMPSS, box plots (Fig. 4) indicate excellent sensitivity of GLIMPSS (CF). The simplified index discriminated between REF and STRESS sites with a high degree of efficiency (≥75 % of STRESS sites scored below the 5th centile of the REF distribution). DE was greatest in the PL Su stratum, where ~90 % of the STRESS sites fell below the 5th centile of the REF distribution.

Table 7 Comparison of metric performance values for calibration data: discrimination efficiency (%DE), responsiveness (r value for PCA), and redundancy magnitude (r value for selected metric pairs) between stratum-specific GLIMPSS metrics and GLIMPSS (CF) metrics

As expected, the two models were strongly correlated (Fig. 5). There was slightly more scatter in MT Su and PL Su compared with other strata. In each stratum, the full GLIMPSS tended to score higher than the GLIMPSS (CF), as depicted by points lying above the 1:1 line. Adjustments to the impairment thresholds (based on the 5th centile of REF) ranged from only 2 to 4 points; however, in the lower left quadrants of the plots (impaired sites), larger differences were noted, indicating that the GLIMPSS (CF) rated impaired sites as being in worse condition than did the full GLIMPSS.

Fig. 5
figure 5

Relationships between full GLIMPSS and modified GLIMPSS (CF) for MT Sp (a), MT Su (b), PL Sp (c), and PL Su (d) strata. Solid lines represent 1:1 relationship while dashed lines approximate the 5th centile thresholds of REF sites. See Fig. 2 and text for stratum definitions

Final benchmarks based on the reference distribution

For both GLIMPSS and GLIMPSS (CF), we elected to retain the 5th centile of reference scores as the impairment threshold, as with the original WVSCI (after Gerritsen et al. 2000). Although other states often use the 10th or 25th centile of the reference distribution as impairment thresholds, our rationale for using the 5th centile is based on our confidence in the quality of the reference condition (see abiotic data in Appendix 1). Because of this high confidence in reference conditions, we used the 25th centile of the reference distribution as an additional benchmark to identify and protect high-quality biological assemblages found in WV. Values below the impairment threshold were partitioned to provide categories that reflect increased stress to biological communities (e.g., degraded, severely degraded). Table 8 presents stratum-specific GLIMPSS and GLIMPSS (CF) scoring criteria indicating the 5th and 25th centile values and the equal bisection of the impairment range.

Table 8 Final GLIMPSS and GLIMPSS (CF) scoring criteria for all strata

Discussion

Development of MMIs for WV wadeable streams followed common standardized techniques for selecting metrics (Barbour et al. 1999; Klemm et al. 2003; Stoddard et al. 2008) so that the final index was sensitive, responsive, and precise. Moreover, by selecting metrics from key ecological attribute categories (e.g., richness, composition, tolerance, dominance, trophic groups, and habit), a more comprehensive view of stream condition is obtained because the index encompasses and integrates multiple features of a stream’s biological system (Barbour et al. 1999). Many state water agencies have incorporated MMIs into their routine CWA monitoring and assessment programs; most have calibrated their own MMIs using state-specific data. Regionally based MMIs also exist (e.g., Klemm et al. 2003; Stoddard et al. 2005), and large-scale, nationally based bioassessment models have been developed by the US EPA (e.g., Stoddard et al. 2008); European countries and the European Union also assess streams with nationally based indices. Interestingly, variations and combinations of the metrics selected for state or regional MMIs reveal that similar macroinvertebrate groups (e.g., Ephemeroptera richness and chironomid abundance) and traits (e.g., intolerants, scrapers, and shredders) are frequently shared.

The GLIMPSS and GLIMPSS (CF) are robust yet practical tools for evaluating impacts to water quality, instream and riparian habitat, and aquatic wildlife in wadeable riffle-run streams, based on their sensitivity, responsiveness, precision, and independent validation. Confounding of macroinvertebrate responses by seasonal (Julian day) and geographical (latitude, longitude, and elevation) influences was effectively removed by stratification. Stream size was also not significantly correlated to GLIMPSS scores (except in PL Su). Barbour et al. (1999) and Karr and Chu (1999) considered site classification (e.g., ecoregions, seasonal index periods) critical to developing refined biocriteria using MMIs. Overall, the GLIMPSS models can be used for a broad spectrum of water resource management programs, including (1) assessing streams for ALU attainability, including characterizing the existence and extent of point and nonpoint source stressors, (2) targeting and prioritizing watersheds for remedial or preventive programs, (3) evaluating the effectiveness of nonpoint source best management programs and mitigation projects, and (4) identifying exceptional quality streams for enhanced protection and conservation. Future analyses should focus on using GLIMPSS models to compare relationships with different stressor and land use types, and to identify stressor thresholds for maintaining and protecting ALUs.

Genus-level taxonomy offered obvious improvements for bioassessments compared with the family-level WVSCI. First, genus-level taxonomy identified distinct seasonal and geographical classification strata, helping to refine ecological expectations across WV. Second, it allowed for the selection of several additional indicator metrics with larger response ranges that could better track stressors in different seasons and within particular bioregions. Since macroinvertebrate communities were well clustered within individual strata based on NMDS, we chose to construct individual MMIs for each bioregion and season. Vlek et al. (2004) argued that developing separate MMIs for each stratum avoided selection of metrics based on natural differences in streams (or seasons), rather than on the ability to detect degradation. In our dataset, several metrics differed seasonally and regionally in their ability to distinguish stressed sites and in their measured response to a stressor gradient. Our results confirmed the need to test and select metrics within a classification scheme such as bioregion and season; this approach led to stratum-specific differences in impairment rates, possibly indicating regional differences in stressor intensity and in the quality of the REF assemblages (Vlek et al. 2004).

While more sites were rated as impaired with GLIMPSS as compared with WVSCI overall, PL Su was an exception. This suggests that, in general, WVSCI over-identified degradation at sites in this stratum because WVSCI metric scoring and the impairment benchmark were based on statewide expectation criteria which were weighted heavily by REF sites in mountain ecoregions (Gerritsen et al. 2000). However, PL Su GLIMPSS had the highest CE and assessed nearly 3x more “severely degraded” sites than WVSCI. Overall, GLIMPSS rated more than twice the number of sites as “severely degraded” compared with WVSCI across all strata.

The use of genus-level PTVs, the associated increase in range of the modified genus-level HBI, and the increase in range of taxa richness metrics together increase the sensitivity and response range of the index (Lenat and Resh 2001). Some of the more obvious improvements in metrics using genus-level data were the increases in range within specific indicator groups (e.g., Ephemeroptera, Plecoptera, clingers, scrapers) and the finer resolution of macroinvertebrate PTVs. For instance, common mayfly families (Heptageniidae and Baetidae) have family-level PTVs of 4. In WV, as many as 10 heptageniid genera and 13 baetid genera are known to occur, with PTVs ranging from 0 to 4 and 2 to 10, respectively. Similarly, the family-level PTV for craneflies (Tipulidae) is 3, but 19 genera are reported in WV streams with genus-level PTVs ranging from 2 to 7. Comparatively, Pond et al. (2008) reported that the genus-level HBI was more sensitive than the family-level HBI, had a wider range of response, and showed stronger correlations to stressors among WV unmined and coal-mined catchments. For those metrics that count taxa richness, the gain in information can be substantial (e.g., loss of several heptageniid mayfly genera at a site would occur before the family was extirpated). We found this to be the case with other families, further confirming the benefit of using genus-level macroinvertebrate data for more accurate assessments. Interestingly, the commonly employed total number of taxa metric (no. of total genera) was redundant with, and not nearly as sensitive or responsive as, other higher-performing metrics in some strata.

Since metric scoring and the GLIMPSS reference distributions differ across strata (seasons or regions), it is difficult to directly compare index scores between strata without further standardization. Therefore, a “percent of threshold” value for each sample, calculated as the actual stratum-specific MMI score divided by the impairment threshold value and multiplied by 100, accomplishes this standardization. Thus, when compared with the 5th percentile of the reference distribution within a particular stratum, a percent of threshold value >100 % indicates an unimpaired site, while a value <100 % indicates impairment. This method can also be applied to other benchmarks in Table 8 to interpret relative site ratings (e.g., the 25th percentile of REF; scores <50 % of the threshold indicate severely degraded condition).
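A minimal sketch of this standardization follows; the score and threshold values shown are hypothetical, not the published Table 8 benchmarks.

```python
def percent_of_threshold(score, threshold):
    """Standardize a stratum-specific GLIMPSS score as a percent of its
    impairment threshold; >100 % is unimpaired, <100 % is impaired."""
    return 100.0 * score / threshold

# Hypothetical example: a MT Sp score of 61.2 against an assumed stratum
# threshold of 53.0 yields ~115 %, i.e., above the impairment threshold.
print(round(percent_of_threshold(61.2, 53.0), 1))
```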

Conclusions and recommendations

A refined MMI developed for WV wadeable streams incorporates genus-level taxonomy and regional and seasonal stratification. The GLIMPSS is applicable for use in most moderate- to high-gradient streams throughout the state, including streams that do not flow year-round (based on the characteristics of the dataset used to develop the index). However, future sampling and analysis are required to develop assessments for true limestone springs, non-riffle/run low-gradient streams, and larger rivers in WV. To correctly apply GLIMPSS, it is critical that WVDEP standardized field methodology (e.g., target habitat, sampling gear, and sample area) and laboratory processing (e.g., subsampling and taxonomic identification) be closely followed. Although the delineation of seasons is relatively straightforward, professional judgment should be applied when sample dates fall close to season cutoffs. For example, after a cooler-than-normal spring, sites at higher elevations may exhibit spring-like communities well into June. We recommend a 2- to 3-week buffer between seasons to remove seasonal uncertainties and improve assessment performance. For streams that transcend plateau and mountain regions, the dominant contributing catchment area above the site should determine which suite of metrics and final assessment criteria should be applied. Autecological designations (e.g., tolerance values for HBI and intolerant genera, FFGs for scraper or shredder genera, or habits for clinger genera) must be identical to those used by WVDEP because the GLIMPSS is specifically calibrated to those designations. Whether to use the full GLIMPSS or the GLIMPSS (CF) depends on chironomid taxonomic expertise and the laboratory effort applied to the benthic sample. It was clear that the two indices performed comparably. However, the slight increase in sensitivity with the CF model suggests that although chironomid genera are known to be good indicators of certain stressors, it is permissible to exclude genus-level chironomid taxonomy for broader condition assessments with MMIs in WV streams.

The fundamental improvement of genus- over family-level data is representativeness. Compared with family-level taxonomic data, genus-level assessments more accurately represent the composition and diversity of the aquatic community in WV’s flowing streams. Using macroinvertebrate data (one assemblage) to represent ALU (which can also include mussels, salamanders, fish, and a host of other aquatic life) is valid, robust, and cost-effective for identifying degraded streams. At a minimum, using genus-level macroinvertebrate data is critical in order to better detect impacts to aquatic resources, instead of relying on assessments based on coarser taxonomic resolution.