Introduction

Biological monitoring of rivers and streams has its origins over a century ago (Metcalfe 1989; Cairns and Pratt 1993; Dolédec and Statzner 2010) and a wide variety of taxonomic groups have been suggested for use, ranging from viruses and bacteria to plants, macroinvertebrates, and fish (Hellawell 1984). In examining the advantages and disadvantages reported for benthic diatoms, zooplankton, macroinvertebrates, and fish, Resh (2008) concluded that of all these potential bioindicators, macroinvertebrates provided the highest return on investment in terms of information gained for research funds spent. Indeed, benthic macroinvertebrates are the most widely used group for biological monitoring of streams and rivers around the world (Bonada et al. 2006).

Most currently used macroinvertebrate analyses for water quality assessments involve the use of tolerance values (TVs), often in calculating scores of biotic indices (Carter and Resh 2013). The use of TVs, which describe resistance of organisms to pollution, has a long-standing tradition in aquatic biomonitoring programs. For example, the initial attempts at biological monitoring (i.e., the saprobien system of Kolkwitz and Marsson 1902) were based on the premise that different taxa have different pollution tolerance (Bonada et al. 2006).

The long-standing tradition of using TVs has resulted in a variety of basic assumptions about the tolerance of benthic macroinvertebrates that we believe have become entrenched in aquatic biomonitoring programs and are assumed to be correct. These assumptions include that: (1) Arthropoda are less pollution tolerant than non-Arthropoda; (2) Insecta are less tolerant than non-Insecta; (3) non-Oligochaeta are less tolerant than Oligochaeta; (4) other benthic macroinvertebrate taxa are less tolerant than Oligochaeta + Chironomidae; (5) other benthic macroinvertebrate taxa are less tolerant than Isopoda + Gastropoda + Hirudinea; (6) Ephemeroptera + Plecoptera + Trichoptera (EPT) are less tolerant than Odonata + Coleoptera + Heteroptera (OCH); (7) EPT are less tolerant than non-EPT insects; (8) Diptera are less tolerant than other Insecta; (9) Bivalvia are less tolerant than Gastropoda; (10) Baetidae are more tolerant than other Ephemeroptera; and (11) Hydropsychidae are more tolerant than other Trichoptera. These assumptions were selected because they include the most commonly used biomonitoring metrics reported by state agencies in the US (from Carter and Resh 2013, their Table 4) and from perusing international programs (Resh 2007).

We believe that these assumptions were based on the results of earlier biomonitoring studies conducted in particular regions, such as in Europe (described by Metcalfe 1989), South Africa (Chutter 1972), and North America (Hilsenhoff 1982, 1987). Tolerance was often determined by a group of organisms' response to dissolved-oxygen deficits resulting from sewage inputs. These tolerances generally have been assumed to hold true for regions and pollutants other than for the ones for which they were developed. However, we have not found any reports indicating that the accuracy or robustness of the above-described assumptions has been statistically tested.

The objectives of this study are to: (1) compile available information about how TVs reported for benthic macroinvertebrates are developed in different regions around the world; (2) statistically evaluate the validity of the 11 basic assumptions (described above) concerning the TVs of specific groups of benthic macroinvertebrates, combinations of benthic-macroinvertebrate groups, and of tolerance-based metrics; and (3) assess how TVs vary geographically and within macroinvertebrate groups at different taxonomic levels, and how the methods used to derive TVs influence them.

Methods

The progression of our methods is as follows; we: (1) compiled information on TVs worldwide; (2) converted the TVs to the same scaling system; (3) tested the validity of the 11 basic assumptions using permutation tests and bootstrapping applied to the worldwide dataset containing all regions; (4) used five subsets of this worldwide dataset to evaluate potential effects of interdependencies among regions on the results; (5) used one subset of this worldwide dataset to evaluate the effect of Best Professional Judgment (BPJ); and (6) used k-medoids clustering to examine the geographic distribution of TVs within the macroinvertebrate groups at different taxonomic levels and the influence of their derivation methods.

Obtaining information on TVs

We collected information on TVs used in 29 regions located on six continents and Oceania (Table 1). We only used information based on numerical scores assigned to individual benthic-macroinvertebrate families, which resulted in the exclusion of metrics based on multi-metric indexes, biological traits, or molecular data. Most TV databases are published in the grey literature rather than in peer-reviewed journals, and some TVs could be obtained only through email contact with researchers directly involved in the score development. We typically used only one set of TVs from each country. However, we used three sets from the US (New York, Midwest, California) and two sets from China (Eastern region and Yangtze River) because of their large size.

Table 1 Data sources used in the analyses

Scaling scores

In many regions, TVs are based on an 11-point numerical scale that represents a gradient of pollution resistance. For example, in the US, TVs typically range from 0 to 10, with 0 representing highly intolerant organisms and 10 highly tolerant (e.g., Hilsenhoff 1987; Lenat 1993). A 10-point scale is used in Europe, but it is in reverse order with 1 representing highly tolerant and 10 highly intolerant (e.g., Armitage et al. 1983). Similarly, the eastern region of China uses a system with ten classes, with 1 representing tolerant and 10 intolerant (Wang and Yang 2004). Other regions use different scales (e.g., France uses a scale from 9 to 1; Austria, Czech Republic, Germany, and Slovakia from 0 to 4; Mekong River Basin from 100 to 1; South Africa from 15 to 1; Costa Rica from 9 to 0; and Brazil from 1 to 4; see Table 1 for references). Japan uses only two classes of tolerance, A and B (Tsuda and Morishita 1974), and was thus excluded from the worldwide analysis because of the limited potential to differentiate TVs among taxa.

To compare the TVs from different regions, we converted all the original scores to a uniform 10-point scale ranging from 1 to 10, where 1 represents least tolerant and 10 most tolerant. The conversion used linear interpolation, and the converted scores were rounded to the nearest whole integer.
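The rescaling step can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `rescale_tv`, its parameters, and the round-half-up treatment of ties are assumptions made for illustration, since the text specifies only linear interpolation followed by rounding to the nearest integer.

```python
import math


def rescale_tv(score, least_tolerant, most_tolerant, new_min=1, new_max=10):
    """Linearly map a regional score onto the common 1-10 scale.

    least_tolerant / most_tolerant are the endpoints of the ORIGINAL scale,
    named by meaning so that reversed scales are handled by the same formula.
    """
    fraction = (score - least_tolerant) / (most_tolerant - least_tolerant)
    converted = new_min + fraction * (new_max - new_min)
    # Assumption: ties are rounded half up (Python's round() rounds half to even).
    return int(math.floor(converted + 0.5))


# US-style scale: 0 = highly intolerant, 10 = highly tolerant.
print(rescale_tv(0, least_tolerant=0, most_tolerant=10))   # 1
print(rescale_tv(10, least_tolerant=0, most_tolerant=10))  # 10

# European-style reversed scale: 10 = highly intolerant, 1 = highly tolerant.
print(rescale_tv(10, least_tolerant=10, most_tolerant=1))  # 1
print(rescale_tv(1, least_tolerant=10, most_tolerant=1))   # 10
```

Passing the endpoints by meaning rather than by numeric order lets one function cover both scale directions described above.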

Statistical comparisons

We restricted our analyses to permutation tests and bootstrapping, which are non-parametric re-sampling techniques that can be used with categorical data (Anderson 2001; Good 2005). These two methods do not require the assumptions of traditional parametric tests (e.g., normal distributions and homogeneity of variance, as for a t-test) to be met (Collingridge 2013). These tests were the best available choice because the TVs had non-normal distributions resulting from different methods of development or assignment, and from differences in professional knowledge in the different regions (P. deValpine, University of California, Berkeley, personal communication).

We tested: (1) the validity of the 11 basic assumptions described in the Introduction; (2) which taxonomic groups above the level of order are significantly different from one another in terms of these assumptions; (3) which aquatic insect orders and biomonitoring metrics (e.g., EPT, OCH, and non-EPT taxa) based on these orders are significantly different from each other; and (4) which regions clustered together, based on their TVs. The False Discovery Rate (Benjamini and Hochberg 1995) correction was applied to all p values. All permutation and bootstrapping procedures were conducted in R statistical software.
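For readers unfamiliar with the False Discovery Rate correction, the following Python sketch shows the standard Benjamini and Hochberg step-up adjustment. The analyses themselves were run in R; the function name `benjamini_hochberg` and the example p values are hypothetical and serve only to illustrate the arithmetic.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values, in the input order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p values from four comparisons.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

Each raw p value is multiplied by the number of tests divided by its rank, and the running minimum keeps the adjusted values from decreasing as the raw values increase.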

The validity of 11 basic assumptions regarding TVs

We used permutation tests to evaluate the first nine of the 11 assumptions described above; the final two were evaluated using bootstrapping. Both permutation and bootstrapping tests are non-parametric methods of re-sampling used to test significance (Legendre and Legendre 2012). We performed permutation tests for the first nine of the 11 assumptions because we were examining if the TVs of two distinct metrics were significantly different from each other. In contrast, we performed bootstrapping for the last two assumptions because we were examining if the TV of a certain family is significantly different from the rest of the TVs in its respective order (i.e., the precision of the sample). In testing the assumptions, we first calculated the average TV for each benthic-macroinvertebrate family among all the regions examined, which we refer to as the family average. For the permutation tests, we used these family averages to calculate the pre-permutation value (i.e., the value derived directly from the TV databases) of the metrics analyzed (e.g., EPT and OCH) or of taxonomic groups (e.g., Arthropoda, Insecta, Oligochaeta).

The permutation tests included 10,000 random permutations, each of which included the appropriate number of families specific to the metric(s) or taxonomic group(s) being compared. For example, each permutation comparing Arthropoda (272 families) versus non-Arthropoda (87 families) included a total of 359 family averages (272 + 87), which were randomly drawn from the pool of family averages contained in the Arthropoda (i.e., 272 were drawn) and non-Arthropoda (i.e., 87 were drawn) groups, and were distributed accordingly. Based on the random combination of family averages that occurs during each permutation, we calculated permutated values (i.e., values representing the average of family averages from the randomization procedure). In each permutation, we calculated the difference between the average of two permutated values, which represent the two metric(s) or taxonomic classification(s). This permutation procedure was repeated 9,999 times to generate a distribution of all the differences between the permutated values. Lastly, we included the difference between the averages of the pre-permutation values as the 10,000th iteration in this distribution. The likelihood of the difference between two pre-permutation values occurring in this distribution provides the p value, which indicates whether the difference is statistically significant.
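The permutation procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' R code: the function name `permutation_test`, the example family averages, and the two-sided definition of the p value (the share of differences at least as large in absolute value as the pre-permutation difference) are all assumptions, because the text does not state how the tail was defined.

```python
import random


def permutation_test(group_a, group_b, n_iter=10_000, seed=0):
    """Permutation test on the difference between two group means.

    group_a and group_b are lists of family averages. The pre-permutation
    difference is included as the final iteration, as in the text.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = sum(group_a) / n_a - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    diffs = []
    for _ in range(n_iter - 1):
        # Randomly reassign family averages to two groups of the original sizes.
        rng.shuffle(pooled)
        first, second = pooled[:n_a], pooled[n_a:]
        diffs.append(sum(first) / len(first) - sum(second) / len(second))
    diffs.append(observed)
    # Assumption: two-sided p value.
    extreme = sum(1 for d in diffs if abs(d) >= abs(observed))
    return observed, extreme / n_iter


# Hypothetical family averages on the 1-10 scale (not data from this study).
less_tolerant = [2, 3, 2, 4, 3, 2, 3, 4]
more_tolerant = [8, 9, 7, 8, 9, 8, 9, 7]
difference, p_value = permutation_test(less_tolerant, more_tolerant)
print(difference, p_value)
```

Because the pre-permutation difference is itself part of the distribution, the smallest p value this sketch can return is 1 divided by the number of iterations.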

We used bootstrapping to compare the TV of the individual families examined to that of other members of their order (e.g., the TV of the mayfly family Baetidae to that of other taxa in the order Ephemeroptera). For example, we first used the family averages to calculate the average TV of Ephemeroptera. Then, we calculated the difference between the family average of Baetidae and the average TV of Ephemeroptera, which we refer to as the pre-bootstrapping difference. The family average of Baetidae was fixed, whereas the average of Ephemeroptera was variable because we re-sampled family averages within the order with replacement to obtain a bootstrapped average TV score (e.g., we drew 34 family averages with replacement from the 34 family averages within Ephemeroptera to calculate a bootstrapped average, which varied with each iteration). Subsequently, we calculated the difference between the family average of Baetidae and this bootstrapped average of the order, repeated this procedure for 9,999 iterations, and included the pre-bootstrapping difference as the 10,000th iteration. The likelihood of the pre-bootstrapping difference occurring in this distribution provides the p value, which indicates whether the difference is significant. This bootstrapping procedure was also used to compare Hydropsychidae to other Trichoptera.
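A minimal Python sketch of this bootstrapping comparison follows. It is not the authors' implementation. The function name `bootstrap_family_vs_order` and the example values are hypothetical, and the way the p value is read off the distribution (twice the smaller share of differences falling at or beyond zero) is an assumption, because the text does not specify that step.

```python
import random


def bootstrap_family_vs_order(family_avg, order_avgs, n_iter=10_000, seed=0):
    """Compare one fixed family average with the bootstrapped mean of its order."""
    rng = random.Random(seed)
    n = len(order_avgs)
    # Pre-bootstrapping difference: fixed family average minus the order average.
    observed = family_avg - sum(order_avgs) / n
    diffs = []
    for _ in range(n_iter - 1):
        # Draw n family averages with replacement from the n in the order.
        resample = rng.choices(order_avgs, k=n)
        diffs.append(family_avg - sum(resample) / n)
    diffs.append(observed)
    # Assumption: two-sided p value based on how often the difference
    # reaches or crosses zero in the bootstrapped distribution.
    at_or_below = sum(1 for d in diffs if d <= 0) / n_iter
    at_or_above = sum(1 for d in diffs if d >= 0) / n_iter
    return observed, min(1.0, 2 * min(at_or_below, at_or_above))


# Hypothetical family averages for one order (not data from this study).
order_averages = [3, 4, 5, 4, 3, 5, 4, 4, 3, 5]
print(bootstrap_family_vs_order(4.0, order_averages))
print(bootstrap_family_vs_order(9.0, order_averages))
```

Only the order average varies between iterations; the family average stays fixed, which mirrors the description above.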

Accounting for non-independence

The worldwide dataset contains scoring systems that are potentially not independent of one another because one program may use the same (or modified) TVs from another program. However, there presumably is some independent thought applied by researchers in each region to modify TVs, such as by BPJ. To explore this issue of potential non-independence, we repeated all the permutation tests and bootstrapping described above on five subsets of the entire dataset that were grouped according to the scoring system they derived TVs from (Table 1), including the use of a: (1) locally derived methods group, as done in the Mekong River basin, China (Eastern), and South Africa; (2) Hilsenhoff method group, including US (Midwest, California, and New York), Canada, and China (Yangtze); (3) Trent index method group, including France and Belgium; (4) Saprobien System method group, including Germany, Slovakia, Austria, Latvia, Czech Republic, and Brazil; and (5) Biological Monitoring Working Party (BMWP) method group, including all the remaining 13 countries. We then repeated the procedures described above to test the robustness of the results from using the entire dataset.

Accounting for potential effects of BPJ

We formed a final subset because many of the regions examined depend somewhat on BPJ to determine their TVs. Although not documented, we believe that BPJ often relies on some of the 11 basic assumptions that our study aimed to test. To test this circularity issue, we prepared a subset consisting of the five regions whose methods are least related to BPJ, including the: (1) Mekong River Basin, where TVs are based on environmental condition; (2) China (Eastern), which is diversity-index based; (3) Brazil and Germany, which are Saprobien-system based; and (4) France, which is Trent Index-based. We repeated all the permutation tests and bootstrapping described above on this subset to test the robustness of the results from using the entire dataset.

Differences tested among taxonomic groups and among metrics

We tested for differences among ten higher taxonomic groups above the level of order, including Insecta, Arthropoda, non-Oligochaeta, Bivalvia, non-Insecta, Hirudinea, non-Arthropoda, Isopoda, Gastropoda, and Oligochaeta. For each of these groups, the lower and upper 95 % confidence intervals (CI) of the average were determined using bootstrapping. Non-overlap in the 95 % CI indicated significant difference between the two averages.
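The confidence-interval comparison can be sketched in Python as below. The percentile method used here is an assumption, since the text does not say which type of bootstrap interval was computed, and the names `bootstrap_mean_ci` and `intervals_overlap` and the example family averages are hypothetical.

```python
import random


def bootstrap_mean_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Mean of each resample (drawn with replacement), sorted ascending.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def intervals_overlap(ci_a, ci_b):
    """True if two (lower, upper) intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Hypothetical family averages for one group (not data from this study).
group = [4, 5, 4, 6, 5, 4, 5, 5, 6, 4]
print(bootstrap_mean_ci(group))

# Intervals reported in the Results for Insecta and Oligochaeta.
print(intervals_overlap((4.4, 4.8), (8.5, 9.0)))  # False
```

Under the rule stated above, two groups are treated as significantly different only when `intervals_overlap` returns `False`.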

We tested for differences among aquatic Insecta orders (Plecoptera, Trichoptera, Ephemeroptera, Odonata, Diptera, Heteroptera, Lepidoptera), and commonly used metrics (EPT, OCH, and non-EPT insects) for benthic-macroinvertebrate biomonitoring. Orders with TVs for <4 families, which included Neuroptera (3) and Megaloptera (2), were excluded because of their small sample size. For each of the groups examined, the lower and upper 95 % CI of the average were determined using permutation tests. Non-overlap in the CI indicates significant difference between the two averages.

Differences among regions

In terms of regional similarities, we based our expectations solely on geographical proximity and climate type, although we were aware that other related factors, such as altitude and cultural/historical ties, could affect the similarity of TVs. Consequently, our expectations were that: (1) TVs in North America (Midwest, California, and New York in USA, and Canada) should differ because of the distance among these regions; (2) in Europe, north-temperate countries should differ from Spain because of climate; (3) Spain should differ from Central Europe because of climate; (4) however, the Central European countries, including Czech Republic, Poland, Austria, Slovakia, Germany, Belgium, and Latvia, should not differ among themselves; (5) Asian countries (Thailand, India, Mekong River basin, China) should differ from other, non-Asian countries because of geographic and climatic differences, and we also expected these Asian countries or regions to be grouped according to their geographic proximity; (6) in South America, Colombia, Bolivia, and Ecuador–Peru should not differ because of location and climatic similarities; and (7) lastly, in Latin America, Costa Rica and Chile should differ because of the distance between them.

An alternative hypothesis was that countries would be grouped by the methods used for deriving TVs (Table 1). To test these hypotheses, we first prepared a modified Euclidean-distance matrix among countries to represent the differences among their scoring systems. The modified Euclidean distance is the Euclidean distance divided by the number of families in common between any two given countries. This distance matrix was subsequently used to perform cluster analysis using the k-medoids method (Kaufman and Rousseeuw 1987), which is a non-hierarchical clustering technique that achieves maximum within-cluster homogeneity without relying on hierarchy. Another advantage of k-medoids is that it is based on the Partitioning Around Medoids (PAM) algorithm, which minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances (Reynolds et al. 1992). Therefore, the result from k-medoids is robust to noise and outliers. The k-medoid procedures were performed by using the "cluster" package in R (Maechler et al. 2013).
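As an illustration of the modified Euclidean distance defined above, the following Python sketch computes it for two regions. The function name `modified_euclidean`, the dictionary layout (family name mapped to rescaled TV), and the example TVs are hypothetical. Restricting the Euclidean distance to the families shared by both regions is also an assumption, as the text states only that the distance is divided by the number of families in common.

```python
import math


def modified_euclidean(region_a, region_b):
    """Euclidean distance over shared families, divided by their number.

    region_a and region_b map family names to TVs on the common 1-10 scale.
    """
    shared = region_a.keys() & region_b.keys()
    if not shared:
        raise ValueError("the two regions have no families in common")
    distance = math.sqrt(sum((region_a[f] - region_b[f]) ** 2 for f in shared))
    return distance / len(shared)


# Hypothetical TVs for two regions (not data from this study).
region_1 = {"Baetidae": 5, "Perlidae": 1, "Chironomidae": 9}
region_2 = {"Baetidae": 4, "Perlidae": 1, "Tubificidae": 10}
print(modified_euclidean(region_1, region_2))  # 0.5
```

Applying the function to every pair of regions yields the distance matrix that a k-medoids routine, such as the one in the R "cluster" package used here, takes as input.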

To determine the most appropriate number of groupings for the cluster analysis, we first calculated the Average Silhouette Width (ASW) of each cluster for groups of different sizes. A higher ASW value indicates that the dataset is more highly clustered and each cluster is more likely to be homogeneous (Rousseeuw 1987). Kaufman and Rousseeuw (2005) suggest that an ASW value ≥0.5 indicates the data are structured, 0.25–0.5 indicates the possibility of structure, and <0.25 indicates lack of structure. We defined the most appropriate number of groups to be that number resulting in the highest ASW, with the condition that this ASW must be ≥0.25. Lastly, we superimposed the grouping onto an NMDS (Non-metric Multidimensional Scaling) ordination plot, which is a statistical method widely used to visualize the similarity of objects in large, complex datasets (Legendre and Legendre 2012; Oksanen et al. 2013).
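The selection rule for the number of groups can be sketched in Python as follows. This is an illustration only: it computes the overall ASW from a distance matrix and a set of cluster labels using the standard silhouette definition, and it assumes that the candidate clusterings have already been produced by a separate k-medoids run. The names `average_silhouette_width` and `choose_k` and the toy distance matrix are hypothetical.

```python
def average_silhouette_width(dist, labels):
    """Overall ASW for a square distance matrix and one label per object."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # a singleton cluster contributes a silhouette of 0
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


def choose_k(dist, labelings, threshold=0.25):
    """Pick the k with the highest ASW, or None if that ASW is below threshold."""
    scores = {k: average_silhouette_width(dist, lab) for k, lab in labelings.items()}
    best_k = max(scores, key=scores.get)
    return (best_k if scores[best_k] >= threshold else None), scores


# Toy example: four objects at positions 0, 1, 10, and 11 on a line.
positions = [0, 1, 10, 11]
dist = [[abs(p - q) for q in positions] for p in positions]
candidate_labelings = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
print(choose_k(dist, candidate_labelings))
```

In the toy example the two-group solution has the higher ASW and exceeds the 0.25 threshold, so it is the one selected.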

Results

Analysis of the methods used to assign TVs

BPJ is the method most widely used worldwide for assigning TVs, and most regional TVs are modified from those used in other geographical areas. For example, 72 % of the regions examined reported using BPJ, at least in part, to develop TVs (Table 2; methods 2, 3, 4, and 5). Over two-thirds (79 %) of programs drew scores from one region and modified them to fit another region (Table 2; methods 3, 4, 5, and 8, excluding Germany where Saprobien TVs were originally developed). The next-most-used method of TV assignment was based on the Saprobien System, which uses the frequency of occurrence of species or taxonomic units at different saprobity (pollution) levels to assign the TV (or saprobity score) to each taxon. Other approaches used in a single region or country are based on either using a mathematical equation that relies on site disturbance scores assigned by BPJ of human impacts to assign TVs (method 6) or using the Shannon–Wiener taxonomic diversity of sites to assign TVs (method 7).

Table 2 Examples of tolerance value (TV) assignment methods used worldwide with a description of how widely used each method is and where it is applied at the regional level

Testing tolerance assumptions

We found that eight out of 11 of the basic assumptions tested were statistically valid (Table 3). The exceptions were that Gastropoda and Bivalvia were not significantly different in their TVs, and that neither Baetidae nor Hydropsychidae were significantly different from Ephemeroptera or Trichoptera, respectively.

Table 3 Taxonomic comparisons underlying 11 basic assumptions regarding tolerance values (TVs) of benthic macroinvertebrates

The coefficients of variation (CV) of the different orders, combination of orders, and metrics indicated that the variability was generally low (Table 3). The metrics Oligochaeta, and Oligochaeta + Chironomidae, had the lowest CVs, 3–10 %, whereas the metrics Insecta, Arthropoda, non-Oligochaeta taxa, all taxa except Oligochaeta + Chironomidae, Isopoda + Gastropoda + Hirudinea, EPT, Diptera, and other Insecta besides Diptera all had CVs in excess of 30 %.

Differences among higher-level taxonomic groups

The higher taxonomic groups showed statistically significant differences among each other (Fig. 1a). Insecta had the lowest TVs (95 % CI of mean = 4.4–4.8), whereas Oligochaeta had the highest (95 % CI of mean = 8.5–9.0). Insecta had significantly lower TVs than all other groups examined except Arthropoda, whereas Oligochaeta had significantly higher TVs than all other groups. Except for non-Oligochaeta, Arthropoda had significantly lower TVs than all of the more tolerant groups (i.e., those to the right of Arthropoda in Fig. 1a). Non-Oligochaeta also had significantly lower TVs than all the more tolerant groups to the right in Fig. 1a. Bivalvia had significantly lower TVs than Oligochaeta. Non-Insecta had significantly lower TVs than Hirudinea and Oligochaeta. Non-Arthropoda had significantly lower TVs than Hirudinea and Oligochaeta but not Isopoda. Isopoda were significantly less tolerant than Oligochaeta, but not Gastropoda or non-Arthropoda (Fig. 1a).

Fig. 1 Comparison of average tolerance values (TVs) for (a) higher classifications of aquatic organisms and (b) aquatic insect orders and metrics based on these orders. The error bars represent the 95 % confidence interval, which was determined using permutation tests. Non-overlap of this interval indicates statistical significance. EPT: Ephemeroptera, Plecoptera, and Trichoptera; OCH: Odonata, Coleoptera, Heteroptera. The numbers in parentheses are the number of families included in the averages

Differences among aquatic insect orders and metrics based on these orders

The aquatic insect orders and metrics based on these orders also showed statistically significant differences from each other (Fig. 1b). Plecoptera had significantly lower TVs (95 % CI of mean = 1.9–2.6) than all the other groups examined (95 % CI = 3.3–6.8), whereas Lepidoptera had the highest (95 % CI = 5.8–6.8). The EPT orders when combined had a significantly lower TV than all the more tolerant groups to the right in Fig. 1b, except for Trichoptera and Ephemeroptera. However, the TV of EPT was significantly higher than that of Plecoptera alone. Trichoptera had significantly lower TVs than all the more tolerant groups except Ephemeroptera, but significantly higher TVs than Plecoptera. Ephemeroptera had significantly lower TVs than all the more tolerant groups to the right in Fig. 1b. Odonata had significantly lower TVs than all the more tolerant groups. OCH had significantly lower TVs than all the more tolerant groups except Coleoptera and non-EPT Insects. Non-EPT Insects, Diptera, Heteroptera, and Lepidoptera were not significantly different from each other.

Accounting for potential effects of BPJ and non-independence

When we validated our results from our worldwide dataset using a subset of the five regions with non-BPJ-derived methods (e.g., Mekong River Basin, China [Eastern], and Germany), we found results similar to those from the worldwide dataset. However, these five regions showed two contrasting results. TVs of Diptera were not significantly higher than those of other Insecta, and TVs of Bivalvia were significantly lower than those of Gastropoda. In terms of non-independence issues, however, differences among the five method groups were not statistically significant and were also comparable to those derived from the entire dataset with only two exceptions. First, when considering only locally developed methods, Insecta TVs were not significantly lower than those of non-Insecta. Second, when considering the Saprobien System users, Diptera were not significantly more tolerant than other Insecta.

Tests of hypotheses about differences among regions

The combination of TVs among the regional scoring systems examined clustered best into five groupings, based on an ASW of 0.26 (indicating the possibility of structure). These groupings ranged in number of regions from one (France) to 14 (Great Britain, Spain, Poland, Thailand, India, Australia, New Zealand, South Africa, Costa Rica, Colombia, Ecuador–Peru, Chile, Bolivia, and Brazil) (Fig. 2). The other three groups had two (Egypt and Mekong River Basin), six (Belgium, Germany, Austria, Slovakia, Latvia, Czech Republic), and six (all four North American and two Chinese) regions. When we reexamined the data with regions arranged into the five clustered groupings, the results were generally the same as those from the analysis performed on the worldwide dataset.

Fig. 2 K-medoid clusters superimposed on a 2-D NMDS ordination plot. Groups are indicated by bold capital letters (i.e., A, B, C, D, E) and regions within each group share the same symbol. Dotted lines indicate the perimeter of each group in n-dimensional space

We only found support for one of our seven a priori geographic hypotheses regarding the distribution of TVs, which was (hypothesis 6) that Colombia, Bolivia, and Ecuador–Peru should group together. However, even in this case they were also grouped with Great Britain, Spain, Poland, Thailand, India, Australia, New Zealand, South Africa, Costa Rica, Chile, and Brazil. In the US, the Midwest, California, and New York were not different (reject hypothesis 1). In Europe, north temperate countries (UK) were not different from Spain (reject hypothesis 2). Spain was in the same group as Poland, a central European country (reject hypothesis 3). Poland was in a different group than the other central European countries (reject hypothesis 4). Thailand and India were together, but the Mekong and China were in different groups (reject hypothesis 5). Costa Rica and Chile were in the same group (reject hypothesis 7).

When we examined regions to determine if the clustering followed the type of scoring system used, we found some agreement. For example, group B (Fig. 2) comprised all of the regions using BMWP-derived scores. However, Brazil, which does not use this system, was also in group B. In contrast, the two systems used in China clustered together, but they are based on different scoring systems.

In terms of the variability of TVs, the modified Euclidean distance between regions grouped by scoring system used (i.e., locally derived, Hilsenhoff, Trent Index, Saprobien, and BMWP; Fig. 3) indicated that the BMWP-derived scores had the least variability (i.e., the smallest difference between the first and third quartiles) and, not unexpectedly, locally derived scores had the highest variability. Regions that used the Hilsenhoff-derived scores and those that used the Saprobien-derived scores were more similar in their TVs, but with more variation than those regions using BMWP-derived scores. We note that the above pattern could have resulted simply from statistical artifact, because there were more BMWP-derived systems (13) than Hilsenhoff-derived (5) or Saprobien-derived systems (6). The Trent Index-derived TVs have very low variability because they only include Belgium and France.

Fig. 3 Box and whisker plot showing tolerance value (TV) variation among regions within the five groups of methods used to calculate TVs. Bold horizontal lines represent the median of pairwise distances between regions, lower and upper ends of boxes represent the first and third quartiles, respectively, and the lower and upper error bars represent the minimum and maximum, respectively

Discussion

The lack of information provided by programs about how TV scores are developed is surprising given how much weight TVs are given in regulatory decisions, such as those regarding the location and need for wastewater treatment plants and other urban planning issues (Chessman 1995; Purcell et al. 2002). Moreover, we were surprised that most programs use TVs developed by others, albeit with modifications.

Most modified TVs were based on local knowledge and BPJ. The poor performance of multi-metric biotic indices has been attributed in some cases to incorrect TVs assigned to some taxa for heavy metals (Hickey and Clements 1998). Given that innovative projects are underway to reinvent urban water infrastructure for ecosystem benefits, such as streamflow and wetland augmentation with recycled water (Bischel et al. 2013; Halaburka et al. 2013; Lawrence et al. 2013), reliable TVs could play a key role in evaluating the success of such habitat rehabilitation efforts.

Typically, BPJ is informed by both environmental and macroinvertebrate data and relies to a large degree on institutional knowledge, and its programmatic applications have differed among regions and over time (Carter and Resh 2013). For example, in the US, most state-based TVs are based on adaptations of the original Hilsenhoff values, whereas in the UK TVs are assigned by a commission of experts (BMWP, described in Hawkes 1997). Hellawell (1978) found a strong relationship of TVs resulting from the BMWP with taxonomic diversity metrics. However, Washington (1984) questioned the value of diversity indices in terms of their appropriateness to water quality assessments.

Carter and Resh (2013) examined the issue of BPJ and TVs for the programs in the different US states. They found that TVs are generally reported as total tolerance to pollution, tolerance to organic pollution, or tolerance to metals. Most TVs reported by state programs refer to organic loading (87 %), but far fewer refer to total tolerance (13 %) and metal tolerance (28 %). The source for TVs also varied among US programs. Local expertise accounted for 31 % of program choices, followed by values from Barbour et al. (1999) (27 %). Values from Hilsenhoff (1982, 1987, 1988, 1998) (13 %) and Lenat (1993) (9 %) were less commonly cited as used.

The lack of geographic specificity, taxonomic resolution, and stressor specificity of TVs represents a limitation in their use. However, several US states are developing TVs that are specific to their regions and for specific stressors such as metals, acid mine drainage, and sediments (Carter and Resh 2013). Bonada et al. (2006) suggested that TVs would be more intuitive to apply if they were on a linear scale where, for example, a TV of 10 would represent an organism that has twice the tolerance of an organism with a score of 5.

Most (eight of 11, or 73 %) of the assumptions that were the basis for our hypotheses were supported in our worldwide analysis of TVs using the entire dataset, and the results were generally similar following our use of subsets of these data. There were two exceptions. First, Bivalvia and Gastropoda were not different from each other in their TVs in the analysis of the entire dataset (but they were different in the subsets). However, this failure to detect a significant difference could relate to the relatively low number of families in the original data for Bivalvia and Gastropoda relative to other groups. Bivalvia had nine families and Gastropoda had 30 families. The lower number of families would result in reduced statistical power. In contrast, the average number of families in the other groups for which there were significant differences examined was 128 (SD = 117).

Second, neither the Baetidae nor the Hydropsychidae were significantly different from other families in their respective orders (i.e., Ephemeroptera and Trichoptera, respectively). Both are commonly occurring and abundant families that are often found in mildly polluted waters in some regions (Ratia et al. 2012; Xu et al. 2013). In particular, filter-feeding Hydropsychidae tend to benefit from increases in particulate matter, such as from wastewater treatment plant or fish farm effluents (Paul 2011; Guilpart et al. 2012). However, as with all generalizations about the perceived higher tolerance of Baetidae and Hydropsychidae, there are many exceptions, and broad generalizations are difficult at taxonomic levels above species (Lenat and Resh 2001). The lack of any significant difference in our analysis likely reflects the different levels of tolerance seen by species in these families that occur in the different regions of the world and the difficulty of applying family-level TVs.

Plecoptera have long been regarded as the most pollution intolerant of the aquatic insect orders (e.g., Friedrich et al. 1992), and this widely held view was also supported in our results, which showed that Plecoptera had lower TVs than all the other groups, including Ephemeroptera and Trichoptera. It is generally accepted that Plecoptera evolved in cold mountain streams where oxygen stress was minimal (Zwick 2000), and we would expect lower TVs because oxygen depletion accompanies a wide range of human disturbances. However, several studies report Plecoptera occurring in organically polluted streams (Bispo et al. 2002; Tomanova and Tedesco 2007) and metal-polluted acidic streams (Rosemond et al. 1992; Sjøbakk et al. 1997; Ruse and Herrmann 2000). Furthermore, there is high variability among the species of this order in terms of tolerating metal pollution (Ruse and Herrmann 2000) and low oxygen concentrations (Tomanova and Tedesco 2007).

The potential interdependence in the scores among the regions examined led us to conduct an internal validation based on a subset of the entire dataset. This subset included only TVs that were derived from methods other than BPJ. We saw general agreement in results between this subset and the entire dataset.

The multi-dimensional illustration of clustering among countries (expressed in two dimensions in Fig. 2) indicates that TVs were not distributed as expected, but that methods of TV assignment and geographic proximity were often factors in explaining these differences. For example, the North American and Central European countries each clustered according to their expected locations (Fig. 2, groups A and D). These two clusters each have their own respective methods of TV assignment, i.e., the US uses the Hilsenhoff system and Central Europe the Saprobien system (Table 2). Likewise, group B countries clustered together because they all use the BMWP system, with Brazil being an exception. Instead, Brazil uses values derived from the Saprobien system and also uses values at the generic or family level whereas Central Europe generally uses species level, which may explain it not being in group D. However, in group B, Brazil is at the extreme edge of the grouping (Fig. 2). It is unclear to us why France does not cluster with other regions. TVs coming from the Mekong River Basin and Egypt likely cluster together because they were developed along two large river systems, the Mekong and the Nile. We also note that other factors could lead to the clustering pattern that we found (Fig. 2). For example, altitude could be influential. As the number of macroinvertebrate families decreases with increasing altitude, the sites at higher elevation become more similar to each other than to those at lower altitudes (Prat et al. 2009; Villamarin et al. 2013). This could explain why Costa Rica and Chile were grouped together, because the TVs of both countries do not differentiate between high- and low-altitude areas.

Several methods have been proposed to improve TVs. For example, there have been programs that based TVs entirely on environmental values (e.g., for the Mekong River Basin, MRC 2007, 2010; Resh et al. 2013). In the western US, Whittier and Van Sickle (2010) used multi-year environmental data and principal component analysis to construct a synthetic disturbance variable, which was used with averages and weights of taxa to calculate TVs. Another multivariate approach assumes a Gaussian response of populations to multiple environmental variables (Juggins 1997). Such an approach thus calculates, on a taxon-by-taxon basis, both the degree of intolerance using the mode of population abundance (i.e., the environmental optimum) and the range of tolerance using the standard deviation around the mode (Bonada et al. 2004). Other new techniques use generalized additive models (GAMs) to define taxa as tolerant, intermediately tolerant, or sensitive to phosphates, pH, suspended solids, or other specific stressors (Yuan 2004). Smith et al. (2007) defined the sensitivity of macroinvertebrates to nutrients and Utz et al. (2009) defined their sensitivity to land-use coverage at the catchment scale.
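As a rough illustration of the optimum-and-tolerance idea, one common approximation to a Gaussian response is abundance-weighted averaging along a single environmental gradient: the optimum is the abundance-weighted mean of the environmental variable, and the tolerance is the abundance-weighted standard deviation around that optimum. The sketch below assumes this simplification; it is not necessarily the calculation used in the studies cited above, and the function name, the gradient, and the abundance values are hypothetical.

```python
from math import sqrt


def wa_optimum_tolerance(env_values, abundances):
    """Abundance-weighted optimum and tolerance of one taxon on one gradient.

    env_values : value of the environmental variable at each site
    abundances : abundance of the taxon at the same sites (non-negative)
    """
    if len(env_values) != len(abundances):
        raise ValueError("env_values and abundances must have equal length")
    total = sum(abundances)
    if total <= 0:
        raise ValueError("taxon must be present at one or more sites")
    # Optimum: weighted mean of the gradient, weights = abundances.
    optimum = sum(x * y for x, y in zip(env_values, abundances)) / total
    # Tolerance: weighted standard deviation around the optimum.
    variance = sum(y * (x - optimum) ** 2
                   for x, y in zip(env_values, abundances)) / total
    return optimum, sqrt(variance)


# Hypothetical data: a conductivity-like gradient at five sites and the
# abundance of one invented taxon at those sites.
gradient = [50, 100, 200, 400, 800]
abundance = [12, 30, 18, 4, 0]

optimum, tolerance = wa_optimum_tolerance(gradient, abundance)
print(f"optimum = {optimum:.1f}, tolerance = {tolerance:.1f}")
```

A taxon with a low optimum and a narrow tolerance on a disturbance gradient would be treated as sensitive, whereas a wide tolerance indicates a taxon that occurs across much of the gradient.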

The globalization of TVs does have clear limitations. Families may be present, for example, in tropical regions (Thorne and Williams 1997; Resh 2007) that do not occur in the temperate regions that typically develop these scores. To solve this problem of missing families, scoring systems in Ecuador–Peru used extensive literature studies to determine their TVs (Rios-Touma et al. 2013). However, globalization can provide synergy. An awareness of what methods are used among different regions to develop TVs and which methods are most effective is a clear advantage, whereas decreased awareness can delay biomonitoring advances. For example, the US multimetric system and the UK multivariate-based systems developed largely independently of each other (Resh and Yamamoto 1994). It has taken over two decades for the advantages of each approach to be combined in current biomonitoring programs (Carter and Resh 2013). Because TVs are the foundation of many regional multimetric approaches that are being used or under development, it is crucial for these scores to be as accurate as possible.

There are also local or national influences that can hinder the improvement of TVs in biological monitoring programs. For example, once a monitoring system is shown to be appropriate for a region or country, and is in place, changing it can be quite difficult even when its limitations are demonstrated and potential improvements proposed. Examples of where changes from long-used indices have been made include some countries in Europe (e.g., Italy and France), and these changes have been strongly influenced by the mandate of the European Water Framework Directive.

In summary, we collected and examined all the literature that we could find reporting family-level TVs. We subsequently standardized those values and applied non-parametric statistical methods (e.g., permutation tests and bootstrapping) to test 11 basic assumptions about the tolerance of benthic macroinvertebrates, and to examine the geographic and climatic relationships of their TVs among regions. Our comprehensive, global-scale study reveals that those basic TV assumptions are generally supported, and suggests the need for new, perhaps more robust methods of TV development and the reporting of how TVs are assigned.
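For readers unfamiliar with bootstrapping, the following minimal sketch shows a percentile bootstrap confidence interval for the difference in mean TVs between two groups of families. It is offered only as an illustration of the general technique and not as the exact procedure used here; the function name, its parameters, and the TV values are hypothetical.

```python
import random


def bootstrap_ci_mean_diff(group_a, group_b, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        res_a = rng.choices(group_a, k=len(group_a))
        res_b = rng.choices(group_b, k=len(group_b))
        diffs.append(sum(res_a) / len(res_a) - sum(res_b) / len(res_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical standardized family-level TVs (0-10 scale); values are invented.
ept_families = [1, 2, 2, 3, 1, 4, 2, 3, 2, 1, 3, 2]
non_ept_insects = [5, 6, 4, 7, 5, 6, 8, 5, 6, 4, 7, 6]

low, high = bootstrap_ci_mean_diff(ept_families, non_ept_insects)
print(f"95% interval for the difference in mean TV: [{low:.2f}, {high:.2f}]")
```

An interval that excludes zero would be consistent with a difference in mean tolerance between the two groups, subject to the usual caveats about the independence and representativeness of the underlying TVs.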