Keywords

1 Introduction

The US Environmental Protection Agency (USEPA) defines biological assessment as the “…evaluation of the condition of a waterbody using biological surveys and other direct measurements of the resident biota in surface waters” [1]. Investigations that fall under this broad definition may be focused on any level of biological organization, from studies of subcellular effects of toxic compounds [2] to ecosystem-scale assessments using multiple taxonomic assemblages [3]. The terms biological assessment, bioassessment, biological monitoring, and biomonitoring are often used interchangeably. For clarity, we restrict our discussion to the term bioassessment.

The value of aquatic organisms as pollution indicators has been recognized by scientists for over 100 years. The Saprobiensystem of Kolkwitz and Marsson [4], most probably the first bioassessment index, was a system for quantitatively rating the tolerance of aquatic organisms to sewage pollution, much akin to modern pollution tolerance values. This concept has been adapted and modified many times, and both the concept and use of the word “saprobity” persist in contemporary literature [5, 6]. The practice of bioassessment invokes the concept of biological integrity, defined as “the capability of supporting and maintaining a balanced, integrated, adaptive community of organisms having a species composition, diversity, and functional organization comparable to that of natural habitat of the region” [7, 8]. Practitioners conducting bioassessments assume that biotic integrity reflects overall ecological integrity, which describes the state of an ecosystem with respect to biology as well as physical and chemical factors [8]. Therefore, the purpose of using biota to assess environmental conditions is that they integrate the effects of all environmental factors to which they are exposed over their entire life-spans and habitat ranges [8, 9]. However, because biota are responsive to such a multitude of environmental factors acting over multiple temporal and spatial scales, determining clear and unambiguous relationships between biota and anthropogenic impairment remains a challenging and active area of research.

This chapter addresses the development of numerical indices, based on biological assemblage-level data, to make inferences regarding anthropogenic stress to freshwater ecosystems. We follow the framework of Fauth et al. [10] in defining the terms community and assemblage. Communities refer to all organisms within the spatial boundaries of the system of interest. For bioassessment, the spatial boundaries of communities are generally artificial constructs, rather than distinct, natural boundaries, and are chosen based on some combination of scientific, logistical, and political criteria. The term assemblage refers to a taxonomically defined subset of a given community, for example, the benthic macroinvertebrate assemblage of a stream system.

The general objective of all bioassessments is to separate the signal of anthropogenic impairment effects from the noise of effects related to natural variations in space and time that are not related to anthropogenic impairment. Evaluation of the relative importance of these two effects requires measurement or estimation of variables related to anthropogenic impairment, which we refer to hereafter as stressors, as well as those related to natural variation, which we refer to as natural environmental variables.

The assemblages chosen for bioassessments depend on the expertise and resources available to investigators, public interest, and on those that are most expected to respond to anthropogenic stress. Algae, fish, and macroinvertebrates are the most commonly used assemblages, and numerous examples of useful bioassessment indices exist for each. Investigations comparing these assemblages commonly show that they respond differently to anthropogenic stress, and each represents a unique aspect of ecological integrity [1113]. Therefore, we focus on describing the analytical methods used for the development of contemporary indices, not on comparing the usefulness of different assemblages. We do not address descriptions of field and laboratory methods, but do note that sampling methodology [14, 15], sampling effort [16, 17], and taxonomic resolution [18, 19] have important and well-documented effects on bioassessments. Our focus is on perennial streams and rivers, as these systems dominate the literature and are the focus of most bioassessment programs. We also provide examples from lakes, impoundments, and wetlands when they enhance our discussion. The analytical methods presented here are also applicable to other aquatic systems and assemblage types.

Biological integrity is defined by one or a series of bioassessment metrics, which are quantitatively defined aspects of assemblages that are expected to vary in response to impairment. Some investigators favor the use of multiple metrics, which are aggregated within a multimetric index (MMI). MMIs provide a checks-and-balances system to account for variable responses of metrics to multiple stressors. Others prefer single-metric indices, most notably the observed-to-expected (O/E ) index, which we introduce in Sect. 3.1. Developers of MMIs commonly group metrics based on the general type of ecological information they express. Metrics from different ecological categories are included in MMIs in order to reduce the redundancy of information and increase the explanatory power of indices. In practice, a vast array of assemblage-level metrics has been used for aquatic bioassessments.

To provide relevant examples, we have assembled a short list of fish, macroinvertebrate, and algae metrics, which we group into four broad categories (Table 1). Diversity and composition metrics are taxonomy-based metrics associated with assemblage characteristics such as richness, evenness, diversity, and dominance. Trait-based metrics incorporate information on ecological habits, habitats, morphology, life history, and life cycle characteristics of populations in the assemblage of interest. Pollution tolerance metrics are numerical ratings of the degree to which individuals in the assemblage are tolerant to stressors. Individual condition metrics are associated with visually apparent morphological anomalies of individual specimens. In addition to these general metric types, the absolute abundance of fish and biomass of algae are also sometimes used, although the absolute abundance of macroinvertebrates is rarely used.

Table 1 Selected fish, macroinvertebrate, and algae metrics used as indicators of anthropogenic impairment

In Sect. 2, we introduce the most widely used method for bioassessment, the Reference Condition Approach (RCA) [3032]. In Sect. 3, we discuss predictive modeling of aquatic assemblages, which is conducted to control for the effects of natural environmental variation in order to obtain an unambiguous determination of anthropogenic effects. Once selected and properly calibrated for natural environmental variation, metrics are used individually, or are aggregated within an MMI, to provide a scoring system that reflects the assemblage-inferred level of anthropogenic impairment at a given study site. This process, as well as methods for evaluating the performance of metrics and indices, is reviewed in Sect. 4. In Sect. 5, we take a broader perspective and present interviews with three experts who provide valuable insights into some of the most important emerging issues and challenges in the field of bioassessment.

2 The Reference Condition Approach

Reference conditions serve as surrogates for negative controls, representing the assemblage characteristics at test sites that would occur in the absence of impairment. Following the RCA, reference conditions are derived from assemblage data at least-disturbed reference sites (sensu Stoddard et al. [32]). In comparison to experimental studies, where variation among replicates is carefully controlled and expected to be minimal, variation among reference site assemblages is high and difficult to predict. Replicate samples from the same site are generally insufficient to account for this variation. Therefore, the RCA calls for the use of multiple reference sites in order to adequately account for the effects of natural environmental variation on assemblages [3032].

We use the term reference to broadly encompass streams considered to be in least-impaired conditions, though as discussed in Sect. 2.2, the actual level of impairment at reference sites is highly variable among studies [32]. We use the term impaired to refer, in general, to sites that are subject to the deleterious effects of stressors and the term test sites to refer to those of unknown impairment status (i.e., those for which bioassessments are needed). This chapter focuses on the RCA, although alternative approaches for estimating reference conditions may be used when system conditions and data availability warrant (Sect. 2.1). Application of the RCA proceeds by first screening multiple potential reference sites to determine if they reflect appropriate least-impaired conditions (Sect. 2.2), then classifying the screened reference sites as to expected or quantified patterns of variability among their biotic assemblages (Sect. 2.3).

2.1 Alternatives to the Reference Condition Approach

Though not often available, data describing past assemblages may provide valuable information for inferring reference conditions. Investigations of sediment records, historical accounts of landscape conditions, and museum records have been used to infer past environmental conditions and assemblage composition in aquatic systems [3336]. Historical approaches, while important, lack broad applicability for bioassessment. In lotic ecosystems, sediment deposition is generally insufficient to provide a historical record. Moreover, data that describe assemblage composition before anthropogenic development occurred may reflect conditions that are no longer attainable given the effects of factors acting at large spatial and temporal scales such as atmospheric deposition of pollutants and global climate change.

When anthropogenic impacts are spatially discernable, a paired-site approach may be useful. For example, lotic sites impacted by point source pollution such as mine effluent [37, 38] or municipal wastewater [39] may be paired with upstream sites above the source of stressors. Plafkin et al. [40] referred to paired upstream sites as controls, though this implies that confounding natural environmental factors on the upstream–downstream comparison are being actively controlled, which is generally not feasible. For bioassessments conducted over large spatial scales (e.g., ecoregions), the paired-site approach is problematic because much assemblage variation is driven by local-scale differences in natural environmental variables [41, 42]. Assemblage variation among reference sites, that is, variation that is not likely caused by impairment, is typically much larger than variation among replicate samples collected at a single site. Therefore, comparisons of replicate samples from a single reference and test site pair often do not provide a realistic representation of the effects of impairment on assemblage characteristics [43].

2.2 Reference Site Screening

Reference site screening is the process of estimating the degree of anthropogenic impairment at study sites, such that those with acceptably low stress levels may be designated as reference sites. Screening criteria vary among studies based on both data availability and on the opinions of investigators as to which criteria are most appropriate. Some have advocated for the use of professional judgment regarding whether observed assemblages represent reference conditions [44, 45]. Professional judgment may also be applied to the environmental conditions at the sites, providing a more independent, and potentially less biased, means of estimating impairment status [46]. Advocates of professional judgment often favor consensus opinions among groups of experts [45, 47]. Potential criticisms of such approaches include a lack of empirical support for decisions regarding reference designations and circular logic when sites are designated as reference based on the assemblage that is also used for bioassessment. However, Davies and Jackson [45] recently showed that the opinions of trained experts were highly consistent when rating ecological integrity based on assemblage data, leading the authors to assert that well-informed professional opinions provide reliable, ecologically relevant benchmarks for bioassessment.

Independent stressor variables (i.e., those not related to the assemblage used for bioassessment) are often used to estimate impairment status. Screening approaches that employ both professional judgment and independent stressor variables produce reference sites that are of higher ecological integrity than those selected using either approach alone [48]. Variables used to estimate anthropogenic stress include physicochemical water quality parameters, land-cover features derived using a geographic information system (GIS) that indicate development, and physical habitat quality assessments based on site observations. Studies often employ a filtering approach, whereby threshold levels for each measured stressor variable are set to designate sites as reference or impaired [21, 49]. The filtering criteria used to select reference sites are generally evaluated with an all approach for reference sites and any approach for non-reference sites. For example, Blocksom et al. [49] required that sites meet all of the reference criteria listed in Table 2 to be designated as reference, but considered sites impaired if any one of the impaired criteria was met.

Table 2 Filtering criteria used by Blocksom et al. [49] to differentiate between reference and impaired streams in the Mid-Atlantic Highlands (USA)

The level of anthropogenic impairment considered acceptable varies greatly among studies, and many authors fail to provide clear descriptions of how reference conditions are defined. Recognizing this problem, Stoddard et al. [32] advocated for the use of the terms minimally disturbed condition (MDC) to describe expected conditions in the absence of substantive anthropogenic disturbance, least-disturbed condition (LDC) to describe the best available conditions present under current disturbance regimes, and best attainable conditions (BAC) to describe the expected conditions when all avoidable sources of anthropogenic influence are removed (BAC is generally intermediate between LDC and MDC). MDC sites, within virtually unimpacted, near pristine watersheds, are rarely available, and most often reference sites are chosen based on the best available conditions (LDC sites; e.g., [50, 51]).

The reliability of bioassessments depends largely on the existence of a sufficient number of reference sites to encompass the range of natural variability in the study region. Given the pervasiveness of human impacts on aquatic ecosystems, achieving a sufficient number of reference sites is often difficult, and may be impossible if standards regarding the acceptable level of impairment are unrealistically high [50, 51]. Impairment is highly variable among geographic regions because development pressure is nonuniform. Recognizing this, Yates and Bailey [50] developed a novel strategy for selecting reference sites that allowed for flexibility in the standards used for screening to select LDC sites within study regions exposed to different impairment regimes. This is a pragmatic strategy, as flexibility in the level of impairment allowed is unavoidable in areas where impairment is pervasive; however, clear comparisons of bioassessment results among studies are hindered when the standards used to select reference sites vary.

2.3 Reference Site Classification

Reference sites that successfully pass the screening process are used to predict assemblage conditions under minimal impairment. Classification of reference sites is intended to increase the precision and accuracy of these predictions by grouping sites inhabited by similar ecological assemblages. Broadly, there are two major types of classification systems: (1) typologies, wherein sites are grouped based on spatial proximity and/or similarity in their natural environmental variation, and (2) biotic classifications, which employ statistical analyses on assemblage data to group sites. We begin by discussing the two as distinct approaches, although bioassessment programs often use elements of both to develop the best site classifications. Throughout this section, we discuss analytical techniques that are specialized for the analysis of assemblage data. McCune and Grace [52] and Legendre and Legendre [53] present additional details on most of the analytical methods presented here. Software for conducting most of the techniques is available free of cost in the R statistical programming language [54].

The objective of classification is to maximize assemblage similarity within classes, while maintaining sufficient replication to allow for statistical comparisons of test sites with the reference classes. Many measures of assemblage similarity exist [52, 53]. The Bray–Curtis coefficient [55] for abundance data and the Sorenson coefficient [56, 57], the equivalent of Bray–Curtis for presence-or-absence data, are most commonly used. Both are well suited to the numerical structures of assemblage datasets. Similarity may also be calculated using bioassessment metric values in place of taxonomic data (a sound but under-used technique, [58]).

Similarity is summarized as the mean within-class similarity (W, the mean of all pairwise similarities of sites within classes) and the mean between-class similarity (B, the mean of all pairwise similarities of sites not in the same class). The precision of a classification is described by the relationship of W to B, referred to as the classification strength [58]. High classification strength is indicated by a large positive difference or large ratio of within- to among-class similarity (i.e., high W–B or W/B). Predictions regarding assemblage conditions are most reliable when classification strength is high. Multivariate techniques such as MEANSIM [58], analysis of similarity [59], multiresponse permutation procedure [60], and nonparametric, multivariate analysis of variance [61] are used to test the hypothesis that classification strength is higher than expected by chance, providing an indication of whether the classification improves the reliability of predictions regarding reference conditions.

2.3.1 Typological Site Classification

Typological site classifications are based on a priori judgments regarding the site conditions that best group reference sites with similar assemblages. Early typological approaches focused on coarse-scale, map-based classifications (e.g., ecoregions) [62, 63]; however, typological classifications that do not account for the effects of local-scale variables typically exhibit much lower classification strength than biotic classifications [43, 64, 65]. Typological classifications are a convenient and useful tool that should be at least considered as an initial step toward site classification [65, 66]. Like all classifications, the effectiveness of a priori-defined typologies should be assessed by a posteriori, quantitative evaluations of the assemblages of interest [66]. For example, investigators in Virginia (USA) observed a striking difference in stream macroinvertebrate assemblage structure between low-gradient coastal plain sites and upland piedmont and mountain sites, requiring the use of separate bioassessment indices for coastal and non-coastal sites (Fig. 1) [67, 68]. Because assemblages are affected by both regional- and local-scale environmental factors, typological classifications that consider smaller spatial-scale variables as well as large-scale zones may provide comparable, or greater, classification strength than biotic classifications [64, 69].

Fig. 1
figure 1

Nonmetric multidimensional scaling ordination of macroinvertebrate assemblages, identified at the family level, from 269 least-impaired, reference Virginia stream sites. Symbols: (open triangles) coastal plain sites; ( filled circle) non-coastal sites. Distances between sites correspond to their proximity in Bray–Curtis distance space. Percentages indicate the percent of variance in the Bray–Curtis coefficients explained by the axis coordinates. Adapted from Dail et al. [67] with permission

2.3.2 Biotic Site Classification

Biotic classification of reference sites came to prominence with the introduction of the River Invertebrate Prediction and Classification System (RIVPACS, [70, 71]). Agglomerative cluster analysis is used in many variations of the RIVPACS approach, including those developed for the USA [72] and the Australian River Assessment Scheme (AUSRIVAS, [73]), both of which employ presence/absence data for clustering, and the Benthic Assessment of Sediment (BEAST) method of Canada [74] which clusters based on abundance data. Agglomerative clustering proceeds from the bottom up, progressively grouping sites of increasingly dissimilar taxonomic composition. Most investigators cluster sites based on Bray–Curtis or Sorenson dissimilarity [7276], although other measures such as Euclidean distance can be employed [77]. Figure 2 shows a cluster analysis of 46 Kentucky (USA) stream reference sites where genus-level macroinvertebrate data were collected by the USEPA.

Fig. 2
figure 2

Agglomerative cluster dendrogram generated using the flexible-beta method (β = −0.30) on a Bray–Curtis dissimilarity matrix of genus-level macroinvertebrate data at 46 least-impaired Kentucky stream sites. Branch lengths correspond to dissimilarities between sites and clusters. Bioregions are regional classifications as described by Pond et al. [78]: (open triangle)—Mountain; ( filled circle)—Miss. Valley-Interior River; (open square)—Pennyroyal; (open circle)—Bluegrass. The Mountain and Bluegrass regions separate perfectly. Some overlap occurs for other bioregions because Julian day and latitude (not included in the analysis) were also important variables related to assemblage structure in these bioregions. Data courtesy of Gregory Pond, USEPA

The standard RIVPACS approach classifies sites using two-way indicator species analysis (TWINSPAN, [79]). In contrast to agglomerative cluster analysis, TWINSPAN is a divisive technique, whereby sites are progressively divided based on taxa (indicator species) that best differentiate them. Also unlike agglomerative clustering, the user cannot choose a dissimilarity measure [80]. TWINSPAN has received criticism for poor performance and seemingly arbitrary methodology [52, 81]. However, based on comparisons with other techniques, the developers of RIVPACS concluded that TWINSPAN performed well, and the method is still used within the current RIVPACS framework [80].

Relationships among sites can be visualized using a variety of ordination techniques, which reduce the n-dimensional hyperspaces created by ecological distance matrices to fewer (usually 2 or 3) dimensions that best explain the overall pattern of variability (Fig. 1) [52, 53]. When assemblage–environment responses are assumed to be linear, principal components analysis (PCA) is commonly used, whereas when responses are assumed to be unimodal, reciprocal averaging-based techniques such as correspondence analysis (CA) and detrended correspondence analysis (DCA) are often used. We agree with others [52, 82] in preferring nonmetric multidimensional scaling (NMS) to these techniques because NMS includes no assumptions regarding the underlying data distribution and is highly effective at explaining assemblage structure while reducing dimensionality. Ordinations are often used for exploratory purposes, for example, to confirm classifications made using other analyses [76], but also may be used directly for site classification [67, 83].

Multivariate analyses on assemblage data aid in showing the user where distinctions between classes may occur. However, subjectivity in drawing distinctions among classes is unavoidable, as investigators must decide on the appropriate level of similarity at which to consider sites within the same class [84]. The final decision is made as a compromise between including as many reference classes as possible, while still including enough replicate sites within classes to adequately represent within-class assemblage variability among sites. Bowman and Somers [85] recommend “a minimum of 20, but preferably 30–50 reference sites per group,” though this may be overly optimistic given the data constraints experienced in many studies.

2.3.3 Conclusions Regarding Site Classification

Hawkins et al. [43] and Melles et al. [86] draw distinctions between geography-based methods in which reference classes can be clearly delineated within discrete spatial units and geography-independent methods driven by patterns in assemblage variation regardless of physical location. Typological classifications which include map-based delineations of classes are geography dependent, whereas biotic classifications, focused on patterns of assemblage variation, are geography-independent. However, the most effective classifications consider both geography-dependent and geography-independent factors, for example, limitations on the spatial scale over which biotic classifications are developed can increase their classification strength [87]. While biotic classifications provide precise descriptions of the patterns of variability with respect to the assemblage of interest, the resulting classifications may not be applicable to other assemblages. Inclusion of geography-dependent variables that implicitly encompass a wide range of environmental factors provides a more comprehensive classification of sites [88]. A priori typological classification based on large-scale variables (e.g., ecoregions) provides useful, convenient, and easily communicated initial classifications of sites, though classifications are often improved when supplemented by smaller-scale variables that are not spatially discrete (e.g., flow regime; [89]) or not associated with geography (e.g., sampling date; [75]). Good scientific practice requires that the effectiveness of a priori approaches be evaluated with a posteriori evaluations of relationships between classes and biota [6467, 88].

3 Predictive Modeling of Aquatic Assemblages

The objective of predictive modeling for bioassessment is to control for the confounding effects of natural environmental variables so that the effects of stressors on metrics can be clearly evaluated. The methods used to meet this objective are as diverse and varied as the assemblages themselves. As an introduction to the core concepts in predictive modeling, we outline the basic steps of the RIVPACS framework and present an example of its use in Table 3 and Fig. 3. Although now 30 years old, the framework remains relevant and is still used with little modification of the original methodology [70, 75, 80]. The conceptual basis and technical details of RIVPACS have been described thoroughly by others [71, 72, 90]. Following our introduction to RIVPACS, we present several promising recent advancements in predictive modeling.

Table 3 Example calculation of O/E based on biotic classes and distances in environmental space shown in Fig. 2
Fig. 3
figure 3

Reference and test sites displayed in environmental distance space as defined by two discriminant function axes. Discriminant function scores were simulated from normal distributions with variance = 1. Symbols: ( filled circle) reference class centroids, (open circle) test site, (1, 2, 3) reference sites, (d1, d2, d3) distances in discriminant function space between test sites and reference class centroids

3.1 The Observed-to-Expected (O/E ) Index

The standard metric derived from RIVPACS-type predictive modeling approaches is the observed-to-expected ratio (O/E ), a single-metric index that compares the observed taxonomic richness at a study site to the expected richness under minimally impaired conditions. The O/E ratio indicates the degree of “taxonomic completeness” (sensu Hawkins [91]) of the test site. O/E values less than one indicate that taxa expected to be present if the sites were unimpaired are absent.

Following biotic classification of reference sites, RIVPACS employs Multiple Discriminant Analysis (MDA) to develop linear functions that best describe the relationships of natural environmental variables to the biotic classes. The discriminant functions are used to determine the distance, in environmental variable space, of the test site to the biotic reference classes, which in turn are used to estimate the probabilities that the test site belongs in each reference class (referred to here as class probabilities). For all native taxa in the study region, the proportion of reference sites within a given biotic class where a taxon is present represents the probability of observing that taxon at a site in that class (referred to as the probabilities of capture). The probabilities of capture of a given taxon within each reference class, and the class probabilities of the test site for each reference site, are used to estimate the probability of capturing the taxon at the test site assuming unimpaired conditions. The expected richness at the test site (the E in O/E ) is given by summing probabilities of capture at the test site for all taxa (see Fig. 3 and Table 3 for additional details).

O/E values greater or less than one indicate departures from what is predicted under unimpaired conditions. Simpson and Norris [92] recommended that O/E values below the tenth percentile of the reference site distribution indicate impairment, with the extent of impairment increasing as the ratio decreases. They also postulated that O/E values greater than one may indicate areas of exceptionally high natural biodiversity or those subject to mild impairment that artificially increases richness.

A common modification to the basic framework is to exclude rare taxa from the analysis, as their inclusion can result in a site receiving an O/E score near one when the assemblage observed deviates considerably from statistical expectations. Several authors have indicated that excluding taxa with probabilities of capture less than 0.5 (producing the O/E 0.5 index) improves accuracy and precision [72, 93, 94]. As an alternative, Van Sickle [93] adapted the Bray–Curtis dissimilarity measure to compare observed and expected assemblages (referred to as BC) and showed that BC was generally more accurate than O/E for identifying impairment across a wide range of assemblages and study systems. O/E 0.5 and BC indices developed for Appalachian stream macroinvertebrate assemblages exhibited similar accuracy and precision [75].

3.2 Advances in Predictive Modeling

The widespread application of RIVPACS-type models has inspired many alternative approaches. Recognizing that assemblages occur along continuous environmental gradients, investigators have developed nearest neighbor methods that compare the environmental similarity of test sites to each individual reference site, rather than to the average assemblage of each class as is done using RIVPACS [83, 95]. Modeling approaches often skip the biotic classification step and predict assemblage characteristics at reference sites directly using natural environmental variables [9698]. Direct prediction approaches may allow for different sets of environmental variables to be used as predictors for each taxon. Though appealing in this respect, the development of separate models for each taxon may be overly complex for taxon-rich systems.

In contrast to the long history of predictive modeling for O/E indices [70], until recently, developers of MMIs rarely employed predictive modeling to account for natural environmental variability. McCormick et al. [99] used linear regression to control for the effects of watershed size on a fish MMI. Equations derived from the regression of metric values on watershed size at reference sites were applied to test sites, and the residuals from the regression were used to indicate deviations from the expected metric values in the absence of impairment. Oberdorff et al. [100] expanded this approach, modeling metrics based on a suite of natural environmental variables using logistic regression (for presence/absence metrics) and multiple linear regression (for abundance-based metrics). Variations on this residualization technique have been developed for more advanced modeling strategies such as prediction tree approaches (discussed below), improving both the accuracy and precision of MMIs by removing the confounding effects of natural environmental variables [21, 101, 102].

Although conventional techniques such as MDA and linear and logistic regression have provided utility for predictive modeling, several newer methods better account for the variable, often nonlinear and interactive effects of environmental predictors on biota. The generalized additive modeling approach of Yuan [103] shows the flexibility of this nonparametric regression technique for predicting variable responses among different taxa to a suite of environmental factors. Bayesian frameworks provide a comprehensive evaluation of uncertainty in predictive models [104106] and have been used for this purpose in MMI development. Machine learning techniques, including artificial neural networks (ANNs) and ensemble prediction trees, where models are iteratively trained at prediction to minimize error, have received much recent attention for predictive modeling in ecology [107110]. Though the method is not yet widely used, support vector machines have performed favorably compared with other machine learning techniques for predicting the occurrence of macroinvertebrate taxa [111, 112].

ANNs structure predictor–response relationships in a manner similar to vertebrate neurological systems. Variables are represented as neurons connected by a multitude of axons representing the possible interrelationships among variables [113]. ANNs have shown substantial improvement over traditional RIVPACS-type models for predicting the richness of macroinvertebrate and fish assemblages [97, 114].

Prediction tree approaches such as classification and regression trees account for the complex effects of both continuous and categorical predictors by recursively bisecting the dataset into groups that are increasingly similar with respect to the response variable after each division [52, 53, 115]. Ensemble prediction tree approaches such as random forests and boosted regression trees combine the results of hundreds to thousands of trees to reduce prediction error. Random forests have been used to model assemblage metrics directly [21, 101, 116] and to define relationships between environmental variables and predefined biotic classes, effectively replacing MDA as used in RIVPACS [12, 101]. Comparisons of random forests to boosted regression trees, a related ensemble tree method, indicate that the latter may provide superior performance [108, 117].

In the absence of suitable reference sites, investigators have used whole-set approaches that employ all sites in the dataset, rather than only reference sites, to control for the effects of natural environmental variables. Most whole-set approaches involve the use of regression techniques to model the responses of metrics to stressors and then to estimate metric values at the point where the model estimates that no impairment occurs [118, 119]. Because few to no minimally impaired sites are included in these analyses, they are effectively estimates by extrapolation of a stressor–response gradient and therefore may be subject to greater prediction errors than models for which reference sites are available. Such errors, however, may be unavoidable when test sites cannot be matched with comparable reference sites. As an alternative whole-set approach, Chessman and Royal [120] estimated the tolerance limits and preferences of macroinvertebrates to substrate, temperature, and flow conditions across an extensive dataset of Australian rivers. These limits were then used to predict the presence of taxa and derive O/E values at test sites, which exhibited stronger correlations with stressor gradients than O/E values derived using the AUSRIVAS method.

A case for using the whole-set approach as a replacement for the RCA was recently presented [121]. Data simulations were conducted to model scenarios in which natural environmental variables and stressors affected biotic metrics independently and also interactively. Metrics that were model-adjusted using the whole-set approach exhibited more accurate and precise relationships with the simulated stressor gradient than metrics adjusted using the RCA. The difference in performance was greatest when stressors and natural environmental variables interacted, as the RCA cannot account for such interactions. While the authors present a compelling case, additional field-based empirical comparisons of the whole-set approach to the RCA are needed.

4 Index Development and Performance Evaluation

We begin this section by discussing methods for evaluating the performance of metrics and indices (Sects. 4.14.4). For MMIs, these characteristics should be evaluated in order to include the best-performing metrics in the final index. Final metric selection, scoring, and aggregation are discussed in Sect. 4.5. After scoring and aggregation of metrics within an MMI or alternatively the development of an O/E index, performance should be re-evaluated using the finished index scores, ideally using independent data not used for index construction (Sect. 4.5). Further information on MMI development has been presented by others [1, 32, 122]. For clarity, these works present index development in a stepwise manner; however, it is important to note that index development is an iterative, rather than a linear process. Metrics that are acceptable based on one criterion (e.g., numerical range, Sect. 4.1) may subsequently be considered unacceptable based on another criterion (e.g., accuracy, Sect. 4.2), requiring the evaluation of new metrics.

4.1 Numerical Range

Assemblage data are often plagued with abundant zeros due to the patchy distribution of biota among habitats, and metrics related to rare taxa typically have narrow numerical ranges. Metrics with limited ranges, and those for which many sites in the dataset exhibit the same value, are unlikely to exhibit clear numerical responses to stressors [123]. Others have presented guidelines for acceptable numerical ranges for metrics, though these vary among studies [123125]. Simple distribution plots of metric values often provide clear indications of highly limited metrics (e.g., see Fig. 2 in [122]).

4.2 Accuracy and Precision

We broadly define accuracy as the degree to which a given metric or index is quantitatively related to variations in anthropogenic stress. Accurate metrics and indices exhibit low Type II error rates by correctly identifying impairment and low Type I error rates by correctly identifying reference conditions. As others have indicated [126], the precise impairment state of a system, and therefore the absolute accuracy of metrics, can never be truly known. We therefore use the term accuracy to refer to estimated accuracy for identifying impairment, as indicated by relationships of metrics with a priori-selected stressor variables.

Relationships between metrics with continuously varying stressor variables may be expressed using correlation analysis. The objective is often to assess the responsiveness of metrics to overall stressor gradients. PCA is commonly used to aggregate individual stressor variables into a comprehensive index of impairment. Metrics and indices can then be evaluated as to the strength of their correlation with this aggregate stressor index [124, 127, 128]. Tests of whether metrics differ significantly from reference conditions provide binary results regarding whether a metric classifies sites correctly. Standard approaches such as ANOVA, however, may indicate statistically significant differences that are not biologically meaningful [123, 129]; therefore, specialized techniques have been developed that are more practical for determining whether metrics differ from reference conditions [129131]. The magnitude of the departure from reference conditions is most important, regardless of statistical significance. To this end, test statistics such as ANOVA F-statistics or t-scores, rather than p-values, are used to determine the degree to which metrics differentiate between reference and stressed sites [116, 123, 125]. Estimates of Type II error rates are given by choosing a threshold value in the reference site distribution that indicates impairment (e.g., 5th or 25th percentile for metrics that decrease with stress and the 95th or 75th percentile for those that increase with stress) and determining the proportion of impaired sites where metric scores exceed this threshold (for metrics that increase with stress), indicating that impairment has not been correctly identified [132]. Barbour et al. [133] developed a similar, graphical approach for evaluating the degree to which metrics and indices discriminate between reference and impaired conditions (Fig. 4). Distribution-based methods such as these are especially susceptible to the confounding effects of outliers, which should be carefully scrutinized to determine whether they are caused by imprecise metrics or site misclassification.

Fig. 4
figure 4

Box plots of simulated data illustrating the method of Barbour et al. [133] for evaluating the discriminatory power of metrics and indices. Boxes represent 25th and 75th percentiles; whiskers represent non-outlier maximum and minimum values. The metric is expected to decrease with impairment. Top left: discriminatory power = 0 (lowest), as the reference and impaired site interquartile ranges (IQRs) overlap and include both medians. Top right: discriminatory power = 1 because the IQR overlap includes only one median value. Bottom left: discriminatory power = 2 because the overlap does not involve either median. Bottom right: discriminatory power = 3 (highest) because the IQRs do not overlap

Measures of precision describe the reliability of metrics and indices for consistently indicating site conditions. Those that exhibit high variability that is not attributable to environmental predictors are not useful for bioassessment. Precision is expressed by measures of variability in metric or index values among samples, most commonly as the standard deviation (SD) or coefficient of variation (CV). Variance partitioning is conducted to determine the relative importance of the three primary sources of variation: among-site spatial variation, within-site spatial variation, and temporal variation [29, 134].

Temporal precision is often evaluated using the signal-to-noise ratio (S/N ) [135], which is the ratio of metric variance among sites to variance among multiple visits at the same site. When evaluated using both stressed and reference sites, S/N reflects both accuracy and temporal precision. Stevenson et al. [125] set S/N > 2 as the acceptable ratio for diatom metrics. Stoddard et al. [123] indicated that acceptable S/N values should vary, based on organisms’ generation times, from >1 for algae to >4 for fish (though these preliminary guidelines require further evaluation). Within-site spatial precision is reflected by metric variability in samples collected at the same site and time, which may be affected by sampling error among spatially or temporally replicated samples [134, 136] or by variation among bioassessments employing different protocols [134, 137].

The among-site precision of metrics and indices may be assessed among sites within the same reference class, thus limiting the confounding effects of environmental variability on the evaluation. For both O/E indices and MMIs, the ratio of observed-to-expected metric values at reference sites should differ negligibly from one, and thus the standard deviation (SD) for this ratio should be nearly equal to its coefficient of variation (CV). The distribution of observed-to-expected metric or index values provides a graphical illustration of both accuracy and precision (Fig. 5) [72].

Fig. 5
figure 5

A graphical comparison of the accuracy and precision of two O/E indices (after methods in [72]). Figure panels depict frequency distributions of O/E values for two indices at reference sites. The data were simulated for example purposes only. The O/E index in the top panel is relatively accurate (mean = 1.01) and precise (SD = 0.08). In contrast, the index depicted in the bottom panel is less accurate (mean = 0.70) and less precise (SD = 0.16, also note the greater spread of the distribution). This approach can be modified to assess the value distribution of any metric or index at reference sites (e.g., [125])

4.3 Metric Redundancy

Redundant metrics may respond in the same manner to stress and if used together result in overly complex indices that include metrics that do little to increase (and may decrease) accuracy. The problem of redundancy has long been recognized, although the best approach for minimizing it remains unclear. Some investigators prefer to focus on ecological redundancy by including metrics from different ecological categories [122, 128], while others focus on reducing statistical redundancy by evaluating pairwise correlations among metrics and choosing only one metric within each pair that is correlated [124]. Combinations of these approaches may be employed, which consider both ecological and statistical redundancy [138].

Correlations among metrics generally reduce MMI precision and accuracy, but these characteristics appear to be most related to the mean pairwise correlation rather than the maximum correlation among metrics in an index [139]. To best reduce the mean correlation among a group of potential metrics, multivariate analyses such as PCA and cluster analysis can be used to aggregate correlated metrics [21, 101, 102]. The correlation of metric errors (e.g., residuals of stressor–metric regressions) rather than the correlation of metric values may be more appropriate for judging redundancy, a concept whose applicability should be further evaluated [140].

4.4 Metric Aggregation and Scoring

MMI development is completed by aggregating the best-performing metrics to derive an index score. The number of metrics used in the index varies among studies, and the choice is rarely supported by clear empirical justification [139]. Professional judgment is often used to select metrics based on best overall performance, although ordered stepwise processes have been recommended and present more comprehensive and objective options [139, 140]. To express metrics on an equivalent numerical scale, raw values are commonly rescaled to reflect percent or proportional comparability to values in the reference site distribution or to the distribution of all sites producing metric scores on continuous 0–100 or 0–1 point scales that increase with impairment. Blocksom [49] reviewed the details of these and other common scoring methods. After scoring, metrics are nearly always aggregated into an index by simple averaging, although other methods, such as differential weighting based on relative importance [131, 141] or to account for variations in metric precision [131], have been used. Alternative aggregation strategies for MMIs represent yet another area where additional research is needed.

4.5 Index Validation

Validation of the index with independent data provides the most comprehensive evaluation of performance. Validation typically proceeds by randomly selecting subsets of impaired and reference sites, which are excluded from the dataset used for index development and used for a posteriori evaluation of the performance characteristics described above. The feasibility of index validation depends on the amount of data available, as statistical power is compromised by dividing datasets for this purpose. Categorical approaches for validating index accuracy are data expensive, as the validation set must be divided according to impairment status. When only a few sites are available, index accuracy may be validated by analyzing for correlations of index scores with stressor gradients, which requires fewer validation sites. This approach is especially useful in highly developed landscapes where there are few reference sites [127, 142]. Index accuracy is often prioritized over other performance characteristics, although more thorough validation strategies also evaluate precision [21, 106, 143]. Evaluation of index bias, as indicated by relationships between indices and natural environmental variables, provides further information on index performance [21, 101, 106, 116].

The effectiveness of predictive models used for reference site classification is often evaluated using cross-validation by constructing models using only subsets of the available data. Leave-one-out cross-validation is a data-efficient method in which reference sites are excluded from the dataset one at a time. After each exclusion, the classification and modeling process is repeated. Each left-out site is then classified using the model constructed without that site. The proportion of agreements between original and cross-validated classifications, relative to the total number of reference sites, is a measure of the effectiveness of the model. This technique can be used with any site classification approach [114, 134, 144]. Null models, which are constructed by predicting metric values across all reference sites, with no classification, are useful for evaluating all types of classifications. If classification strength does not exceed that of the null model, then the classification provides no advantage [145].

5 Expert Interviews: Challenges and Important Considerations in Bioassessment

To provide a broader and more comprehensive perspective, we conducted interviews with three expert practitioners and developers of bioassessment programs. Their responses to our questions, provided here in a question-and-answer format, have been summarized with a focus on emerging issues relevant to bioassessment in aquatic systems.

Expert 1

Michael Barbour, Ph.D.—Adjunct Senior Scientist, Mote Marine Laboratory, Sarasota, FL, USA, and retired Director, Center for Ecological Sciences, Tetra Tech, Inc., Owings Mills, MD, USA

Q: What factors limit the potential for increased use of genetic information in bioassessment surveys? Is it likely that molecular genetic analysis will replace traditional taxonomic approaches, or will these processes be used in conjunction with each other?

A: A major challenge in the use of genetic data for bioassessment will be in determining how reference conditions are expressed and how to account for the effects of natural environmental variability on reference populations. It is unlikely that genetic analysis will replace traditional taxonomic approaches in the near future. Evolving DNA methods, however, should help to decrease taxonomic uncertainty and improve our evaluations of aquatic assemblages.

Q: What are the most important factors to consider in developing a bioassessment program?

A: Adherence to the Critical Elements Process in the design and implementation of bioassessment protocols should provide an objective means of evaluating the rigor of regulatory assessment programs and a basis for comparing data quality among programs [146]. This process is used to evaluate programs with respect to 13 critical elements within three general categories: study design, methods implementation, and data interpretation. The process considers logistic feasibility and cost-effectiveness, calling for the highest methodological and data-quality standards that are reasonably attainable given existing technological and monetary constraints. When high-quality methods exist, new bioassessment programs should employ methods consistent to these to maximize efficiency (i.e., use of preexisting datasets) and historical significance.

Q: What are the most important recent developments that have improved comparability among assessment programs?

A: In addition to the critical elements process, the Biological Condition Gradient approach (BCG, [45]) provides a framework for developing consistent, meaningful, and understandable aquatic life use standards and is applicable to a wide variety of monitoring strategies and assemblages. The BCG establishes a baseline by employing best professional judgment within an organized framework whereby experts assign bioassessment samples to ecological condition tiers. Biologists trained to use the BCG produce highly consistent evaluations of site conditions. The use of the BCG should greatly facilitate the comparability of bioassessments conducted by different agencies and using different protocols.

Large-scale monitoring programs, such as the USEPA National Aquatic Resource Survey, are of great importance. This nationwide assessment program includes standardized sampling protocols and a probabilistic study design for the assessment of US streams, rivers, lakes, wetlands, and coastal waters. The ongoing intercalibration exercise, a key component of the European Union (EU) Water Framework Directive (EC 2000/60/EC; [147]), and resulting multi-country aquatic ecosystem surveys are other important examples. The major advantage of these programs is the development of consistent and rigorous protocols that allow for large-scale biological assessments of aquatic ecosystems.

Expert 2

Simone D. Langhans, Ph.D.—Humboldt Research Fellow, Leibniz Institute of Freshwater Ecology and Inland Fisheries, Berlin, Germany

Q: What factors are most important in limiting comparability among bioassessment schemes?

A: Variability in the definition of reference conditions (i.e., the allowable amount of impairment within the reference dataset) can hinder the comparison of index scores from different assessment programs. Index scores are typically an expression of how similar a site is from the reference state; therefore, the use of similar reference criteria facilitates comparability among assessment indices.

The expression of index scores in a continuous manner, rather than as categorical ratings, is helpful when aggregating scores derived from different assessment schemes. Due to their discrete nature, categorical ratings may differ for scores that are actually quite similar; therefore, the most reliable aggregated indices are based on continuous scoring systems. For management purposes, categorical ratings can be applied after aggregation. Langhans et al. [141] present a method for standardizing metric scores or attribute measures among indices to a continuous 0–1 scale. The method can be applied to both categorical and continuous scales and preserves the best professional judgment of developers regarding the interpretation of score or measurement values.

Q: What aspects of the relationships of anthropogenic activities and ecological integrity are most misunderstood by nonscientists?

A: The watershed (or catchment) concept is of great importance, though often not understood by the layperson. Following from the simple concept that water flows downhill, water bodies integrate the effects of human activities everywhere within their watersheds. Responsible management of aquatic ecosystems must consider not only in-stream and local effects but also effects at much larger spatial scales. The range of relevant spatial scales that should be considered increases with the size of the water body, with the largest river systems integrating the effects of anthropogenic activities over thousands of square kilometers.

Q: In considering streams in different natural settings and at different positions along the general gradient of impairment, which are the most important candidates for preservation or restoration?

A: How best to prioritize conservation and restoration efforts for aquatic ecosystems is currently a popular and important topic in the EU. A strategy that considers the conservation of existing ecological integrity and the restoration of impaired systems simultaneously is best. When biological assemblage objectives are given high priority, the most effective areas for restoration are those in close proximity to high-quality conservation areas because the conservation areas provide sources for recolonization. For example, Tonkin et al. [148] evaluated the likelihood of recolonization by invertebrates at 21 river restoration sites in Germany. They determined that the density of occurrence of a taxon at surrounding sites (proportion of sites with the taxon present) and the distance to the nearest potential source site were important factors for predicting recolonization.

Expert 3

Gregory J. Pond, M.S.—Aquatic Biologist, USA Environmental Protection Agency, Region III, Wheeling, WV, USA

Q: Given the myriad protocols currently employed to conduct assemblage-level assessments, what considerations should be made by investigators and managers to select the most accurate, precise, and cost-effective strategies?

A: Protocols should be flexible, thoroughly documented in standard operating procedures, and based on the varying assessment goals, characteristics of the system being studied and available funding. For example, species-level macroinvertebrate data may in some cases produce the most effective assessment results. In other situations, temporal variation among samples may produce confusion at high levels of taxonomic resolution, for example, if errors in identification increase at times of the year when early instars predominate in the samples. In such cases, coarser taxonomic resolution at the genus or family level may be necessary to avoid inconsistency among samples.

If assessment on a large spatial scale is a priority, investigators may use less time-intensive methods to assess a greater number of sites within the time and monetary constraints of the project. In an attempt to provide a spatially comprehensive assessment of Pennsylvania (USA) waters, the state’s Department of Environmental Protection (DEP) conducted rapid surveys focused on macroinvertebrate assemblage characteristics that could be evaluated in the field by trained biologists without the need of extensive sampling and laboratory processing. This protocol allowed DEP to conduct an initial screening of several thousand stream sites over a 2-year period within budgetary constraints. However, the accuracy and precision of field-based rapid surveys such as this are likely far lower than would be expected from more intensive sampling and processing protocols that produce quantitative, genus- or species-level datasets.

Data consistency and comparability are also of great concern. When USEPA conducts bioassessments, the protocols developed by the state are typically followed. This insures that the data collected are comparable with those produced by state biologists and that the methods have been calibrated for the region of interest. Natural variability of system-specific characteristics should also be considered. For example, Virginia (USA) is currently developing a new protocol for swamp streams, which have not been previously assessed for regulatory purposes.

Q: What are your thoughts regarding the use of continuous environmental variables within a predictive modeling framework versus typological approaches for reference site classification?

A: Large-scale typological approaches have generally been insufficient in accounting for the variation in natural environmental factors that affect biota among aquatic systems. Typological approaches, however, are convenient, easy to use and understand, and can be effective within relatively small and homogenous geographic regions. Natural environmental gradients often persist within typological categories, and care should be taken to ensure that typological approaches do not oversimplify these gradients. Predictive modeling is more analytically intensive and requires more precise data, but generally provides more reliable results in highly heterogeneous regions. In the development of an O/E model for Central Appalachian streams, Pond and North [75] determined that subecoregion [149], Julian day and latitude were the most important natural predictors of reference macroinvertebrate assemblages. For that study region, a predictive modeling approach was chosen as the best strategy, given the importance of the continuous variables Julian day and latitude. A typological approach, where reference conditions are developed at the subecoregion level, could also be effective; however, the effects of latitude and seasonality should be carefully observed and potentially controlled.

6 Conclusions

In this chapter, we provided an introduction to the major components of assemblage-level bioassessments of aquatic systems. Macroinvertebrates, fish, and algae are the most commonly used assemblages, although the methods described here are applicable to, and have been successfully used with, other biotic assemblages [150152]. One emerging strategy is in the use of prokaryote assemblages, which has historically been limited because many prokaryotes are not readily cultured in the laboratory. However, emerging technologies that allow for quantification of assemblage composition through DNA sequencing have largely removed this limitation, making prokaryote assemblage assessments an emerging new option for bioassessment of aquatic systems [152, 153]. Rapidly evolving DNA sequencing methods have the potential to greatly enhance not only bioassessments using prokaryotes but also those using assemblages that have traditionally been evaluated by identification of specimens based on morphological characteristics [154157].

Regardless of the assemblage type chosen or the methods used for identifying taxa in the assemblage, the most challenging aspect of bioassessments has been, and remains, the difficulty in separating environmental effects on assemblages that are the result of naturally varying factors such as climate and geology from those caused by anthropogenic factors. The use of the RCA, coupled with advanced predictive modeling methods such as machine learning techniques, has enhanced our ability to predict how assemblages should vary based on natural environmental factors. Such enhanced predictive power should ultimately allow for more accurate determination of assemblage variation patterns that indicate impairment. Despite these advancements, predictive modeling and the use of the RCA are greatly confounded by the lack of suitable reference sites in many regions. To this end, alternative strategies that employ both impaired and reference sites to derive expected reference conditions have been proposed [118121] and warrant further evaluation to determine their widespread applicability. Because of the scarcity of reference sites in many regions and the high potential for complex interactions between natural environmental factors and stressors, the development of additional data-efficient methods for predicting expected assemblages under unimpaired conditions and for quantifying deviations from these expectations is much needed.

An additional challenge for contemporary bioassessment programs is the shifting baseline syndrome (sensu Hawkins et al. [43]) wherein future climate change is likely to alter temperature and precipitation regimes globally, thus changing the assemblage compositions that might reasonably be expected under minimally impaired conditions. To meet this challenge, spatially and temporally extensive monitoring is essential to derive realistic reference conditions. Several large-scale assessment programs have been recently developed, such as the EU Water Framework Directive, the US Geological Survey’s National Water Quality Assessment program, the US EPA’s National Aquatic Resources Survey, the US National Science Foundation’s National Ecological Observatory Network, and the Canadian Biological Monitoring network. These programs include rigorous and thoroughly documented bioassessment protocols focused on monitoring aquatic assemblages over large spatial and long temporal scales. Data produced by these important programs will enhance our ability to overcome the inherent challenges in evaluating ecological integrity when least-impaired reference conditions are rare, highly variable among regions, and changing in response to global climate change.