Introduction

Current knowledge of diatom autecology is incomplete, having been gleaned from studies that were not specifically designed to determine the environmental requirements of common species. Frequently cited environmental preferences are often based on qualitative best professional judgment derived from studies within a limited geographic area or region (e.g., Lange-Bertalot, 1979; van Dam et al., 1994) or on syntheses of studies with different objectives, sampling designs, and spatial and temporal scales (e.g., Lowe, 1974; Beaver, 1981; van Dam et al., 1994). As a consequence, species environmental preference lists are often incomplete and inconsistent, and autecological information about common species is lacking in many areas. More recently, numerical modeling approaches have been employed to quantitatively characterize the relationships between diatoms and environmental variables in streams. While gradient analyses are commonly employed to elicit overall assemblage patterns (e.g., Biggs, 1990; Leland, 1995; Pan et al., 1996), species optima for key environmental variables are most often determined by weighted-averaging (WA) techniques (e.g., Pan et al., 1996; Leland & Porter, 2000; Winter & Duthie, 2000; Potapova & Charles, 2003).

WA techniques provide a quantitative evaluation of diatom autecology and in many cases expand our knowledge of diatom species preferences. In streams, WA models have been used to develop species optima and tolerances for conductivity (e.g., Leland, 1995; Leland et al., 2001; Munn et al., 2002; Potapova & Charles, 2003), pH (Kovács et al., 2006), phosphorus (e.g., Pan et al., 1996; Winter & Duthie, 2000; Soininen & Niemelä, 2002; Schönfelder et al., 2002; Potapova et al., 2004; Ponader et al., 2007), nitrogen (e.g., Leland, 1995; Ponader et al., 2007), sulfate (Potapova & Charles, 2003), major cations (Potapova & Charles, 2003), and dissolved inorganic carbon species (Schönfelder et al., 2002; Potapova & Charles, 2003). In addition, WA optima have successfully been used to reconstruct environmental conditions in lakes, including pH (e.g., Birks et al., 1990; Dixit et al., 1999) and total phosphorus (Dixit et al., 1992; Hall & Smol, 1992; Bennion, 1994), and wetlands (e.g., Gaiser & Taylor, 1995; Bunting et al., 1997; Cooper, 1999).

However, WA approaches are limited by their simplicity and underlying assumptions (see summary in Imbrie & Webb, 1981; Birks et al., 1990). Of primary concern when dealing with the relationships between diatom species and environmental variables is that WA modeling assumes that the variable of interest is the sole variable responsible for determining the species distribution. The importance of other environmental variables is implicitly included in the calculation of the WA optima. However, WA cannot explicitly illustrate the interactions among environmental variables. Consequently, environmental variables must be interpreted one at a time. In reality, most studies have displayed interactions among environmental predictors and stream algal assemblages. Interactive relationships between pH and nutrients were demonstrated in streams in the Illinois River Basin (Leland & Porter, 2000) and in Finnish rivers (Soininen & Niemelä, 2002). The development of WA models, for example, nutrient models, has often been less successful when other environmental variables, such as pH, are important in structuring the diatom assemblage (e.g., Hall & Smol, 1992; Reavie et al., 1995; Pan & Stevenson, 1996). Because WA cannot explicitly address other environmental variables that influence patterns of species abundances, the applicability of models developed in one area to other areas is questionable. A second assumption of WA modeling that often does not hold true for stream diatom assemblages is that species abundance forms a unimodal relationship with the environmental variable of interest. While Gaussian responses of diatom abundance to physiological environmental variables (e.g., pH and salinity) have been reported (ter Braak & van Dam, 1989; Juggins, 1992), this relationship is often not the case for resource variables such as total phosphorus (Potapova et al., 2004).

Recently, advanced regression techniques, such as generalized linear models (GLM) and generalized additive models (GAM), which place fewer assumptions and constraints on species–environmental relationships, have been used to model these relationships (see Guisan et al., 2002 for review). Potapova et al. (2004) used GLM to model relationships between diatom relative abundance and total phosphorus for northern Piedmont streams. While these models implicitly incorporate biotic and environmental interactions (Guisan & Zimmerman, 2000), they are additive and still do not explicitly model interactions or incorporate hierarchical structure.

Relationships between diatoms and environmental variables in streams are complex, often with several variables operating through hierarchical interactions. Strong relationships between algal biomass and nutrients are not often shown in streams (e.g., Leland, 1995). Diatom autecology would benefit from approaches that allow interactions between variables to be explicitly incorporated. Increasingly, regression tree approaches have been used to examine species–environmental relationships in plants and animals (e.g., Iverson & Prasad, 1998; De’ath & Fabricius, 2000; O’Conner & Wagner, 2004; Hershey et al., 2006). Regression trees (RT) and classification trees (CT) are useful for visually facilitating interpretation, revealing data structures, and displaying interactions (Clark & Pregibon, 1993; De’ath & Fabricius, 2000). RT and CT models performed better than analysis of variance and linear regression in predicting the abundance of coral taxa from environmental variables (De’ath & Fabricius, 2000), and the authors suggest using regression tree approaches to select simpler interaction terms for regression models. CT models had higher predictive power than both GLM and GAM for modeling vegetation species (Franklin, 1998; Vayssières et al., 2000 in De’ath, 2002). Another advantage of RT analysis is that it can also be used to identify thresholds or change points along environmental gradients, where species abundances change from one state to another (Qian et al., 2003). RT change point analysis was successfully used to identify total phosphorus concentrations at which the relative abundance of tolerant macroinvertebrate species shifted in wetlands (Qian et al., 2003) and the level of wetland impairment at which macrophytes, diatoms, and zooplankton abundances shifted (Lougheed et al., 2007). While Pan et al. (1999) used RT analysis to explore the hierarchical relationships between algal biomass and environmental variables in streams, RT approaches have not been used to explore individual diatom species distribution in streams or to identify change points along environmental gradients. RT and CT are non-parametric, and therefore, they are well-suited for diatom species relative abundance data that often contain many zero values.

The objective of this study was to use both weighted-average and regression tree approaches to explore the relationships between environmental variables and common diatom taxa in Mid-Atlantic Highlands (MAHA) streams. In addition, RT analysis was used to identify change points along environmental gradients (pH and total phosphorus), where taxa shift from low to high relative abundances. TP and pH were selected for change point analysis because these two variables have been shown to be most influential in controlling MAHA stream diatom assemblages (Pan et al., 1996). Regression tree models developed on relative abundance data were compared to classification models developed using both presence–absence and relative abundance categories to determine if different data transformations reveal different relationships between species and environmental variables. Change points identified by RT analysis were compared to optima derived from WA methods to determine if change point analysis can classify diatom taxa’s environmental preferences more precisely. This work will further our understanding of the autecology of common stream diatom taxa and of how environmental variables interact to determine their abundance.

Materials and methods

Study area

Streams were located in the Mid-Atlantic Highlands region of the U.S.A. (Fig. 1). The region has been delineated into several ecoregions (Omernik’s (1987) Level III ecoregion classification), which include the Northern Appalachian Plateau and Uplands, the North Central Appalachian Mountains, the Blue Ridge Mountains, the Central Appalachian Ridges and Valleys, the Central Appalachian Mountains, and the Western Allegheny Plateau. Topography of the ecoregions included a mixture of mountains, plateaus, plains, and valleys. Data were collected as part of the U.S. Environmental Protection Agency’s Environmental Monitoring and Assessment Program (EMAP). Stream sites in the region were selected using a stratified random sampling design (Herlihy et al., 1998). The stream population in the region was defined on the basis of digitized versions of the U.S. Geological Survey’s 1:100,000-scale topographic maps. To obtain a more equitable distribution of stream sizes, sample probabilities were set so that roughly equal numbers of first-, second-, and third-order streams would be sampled. A total of 256 unique sites, each sampled once for periphyton and water chemistry between late April and early July in 1993 or 1994, were used for this analysis.

Fig. 1 Map showing location of the Mid-Atlantic Highland region, U.S.A. and location of study sites. The solid lines are ecoregional boundaries

Sampling design

A study reach was established extending to either side of each selected sample site, with a total length equal to 40 times the average wetted channel width (minimum length 150 m). Each study reach was divided into 10 equal-length intervals, and 11 cross-section transects were set up (including one transect at each end of the reach).
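
As a rough illustration of the reach layout described above, the following Python sketch computes the reach length and transect positions from a mean wetted width. The function name `reach_layout` and the example width are hypothetical and do not come from the EMAP field protocol.

```python
def reach_layout(mean_wetted_width_m):
    """Return (reach_length_m, transect_positions_m) for one study reach."""
    # Reach length is 40 times the mean wetted width, but never below 150 m.
    reach_length = max(40.0 * mean_wetted_width_m, 150.0)
    # Ten equal intervals give eleven transects, including both reach ends.
    interval = reach_length / 10
    positions = [i * interval for i in range(11)]
    return reach_length, positions


# Hypothetical stream with a 5 m mean wetted width.
length, transects = reach_layout(5.0)
print(length, len(transects), transects[1])
```

For narrow streams the 150 m minimum takes over, so transect spacing never falls below 15 m under this sketch.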

Field sampling

Periphyton samples were collected from erosional habitats at each of the 11 transects and combined into a composite sample. Transects with visible water movement were considered erosional habitat. At each transect, periphyton was collected from a 12-cm² area of the stream bed using a 1.5-cm long piece of 3.9-cm diameter PVC pipe as a template. Periphyton was scraped from the upper surfaces of cobbles with a toothbrush and rinsed with stream water. Composite periphyton samples were then preserved with formalin. Stream water samples were taken near the middle of the stream in flowing water. Detailed field methodology can be found in Pan et al. (1996). Detailed information on the analytical procedures used for each of the analyses can be found in U.S. Environmental Protection Agency (1987).

Laboratory analyses

An aliquot of homogenized algal suspension was acid-cleaned and mounted in HYRAX® to identify and enumerate diatom species (Patrick and Reimer, 1966). A minimum of 500 diatom valves were counted at 1,000× magnification. Diatom taxonomy followed mainly Krammer & Lange-Bertalot (1986, 1988, 1991a, b) and Patrick & Reimer (1966, 1975).

Data analysis

Weighted-average models: Weighted-averaging regression and calibration were used to quantify relationships between the relative abundance of individual diatom taxa and environmental variables (Birks et al., 1990). Each taxon’s optimum was calculated as the mean of the measured environmental variable weighted by the abundance of the taxon at all sites. Tolerance was calculated as the abundance-weighted standard deviation of the environmental variable around the optimum. Tolerance values were corrected for bias by taking into account the effective number of occurrences (Hill’s N2) (Hill, 1973). Models were developed with tolerance down-weighting and inverse de-shrinking. Leave-one-out jack-knife cross-validation was used to validate the models. WA regression and calibration and model validation were performed using C2 v. 1.4 (Juggins, 2003).
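
The optimum and tolerance calculations can be sketched in a few lines of Python. This is a minimal illustration and not the C2 implementation: the function names and the four-site data are hypothetical, and the bias correction shown (dividing by the square root of 1 − 1/N2) is one commonly used form that is assumed here rather than taken from the text.

```python
import math


def wa_optimum(abundance, env):
    """Abundance-weighted mean of an environmental variable across sites."""
    total = sum(abundance)
    return sum(y * x for y, x in zip(abundance, env)) / total


def wa_tolerance(abundance, env):
    """Abundance-weighted standard deviation of the variable around the optimum."""
    total = sum(abundance)
    optimum = wa_optimum(abundance, env)
    variance = sum(y * (x - optimum) ** 2 for y, x in zip(abundance, env)) / total
    return math.sqrt(variance)


def hills_n2(abundance):
    """Effective number of occurrences: 1 / sum of squared abundance proportions."""
    total = sum(abundance)
    return 1.0 / sum((y / total) ** 2 for y in abundance)


def corrected_tolerance(abundance, env):
    """Tolerance with an assumed N2 correction; undefined when N2 equals 1."""
    n2 = hills_n2(abundance)
    return wa_tolerance(abundance, env) / math.sqrt(1.0 - 1.0 / n2)


# Hypothetical relative abundances (%) of one taxon and pH at four sites.
abund = [0.0, 10.0, 30.0, 10.0]
ph = [5.0, 6.5, 7.0, 7.5]
print(wa_optimum(abund, ph), wa_tolerance(abund, ph), hills_n2(abund))
```

Sites where the taxon is absent carry zero weight, so they influence neither the optimum nor the tolerance.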

Regression tree change point analyses: RT analysis was used to explore the relationships between each of the 10 most commonly occurring benthic diatom taxa and in-stream environmental variables and to identify biological thresholds along environmental gradients. TP and pH change points were identified as the first or second splits of the RT for each diatom taxon. RT provides an alternative to linear and additive models for regression problems. RT can identify a set of important predictors (both numeric and categorical), automatically handle interactions between predictor variables, and illustrate hierarchical relationships among predictor variables (Venables and Ripley, 2002). Regression trees belong to the family of top-down induction methods for decision trees, which develop a hierarchical set of recursive, binary partitioning rules for classifying objects based on their values for several attributes of each object (Quinlan, 1986). In this study, the various water quality parameters (Table 1) are the “attributes” or predictor variables and the relative abundances of diatom taxa (Table 2) are the “objects.” The “leaves” of the decision tree represent the classes, and the “nodes” represent attribute-based decision criteria with a “branch” for each possible outcome. Classes can be categorical (CT) or continuous (RT).

Table 1 Environmental variables (median (med), minimum (min), and maximum (max) values) used in predicting relative abundance of 10 most frequently occurring diatom taxa in Mid-Atlantic Highland streams. Frequency of occurrence of each variable in the regression trees (RT) and classification trees (CT) based on presence–absence data of all 10 species is also provided
Table 2 Summary data for 10 most frequently occurring diatom taxa (mean and maximum (max) relative abundance, and number sites) in Mid-Atlantic Highlands streams

To start, all objects are placed in one node (root node). The tree is developed by splitting the objects into two subsets or child nodes based on the decision rule that results in two groups with the greatest within-group homogeneity or purity. Decision rules are then applied to each group separately until either all classes have been sorted or the tree has reached maximum complexity. The effectiveness of the decision rule at each split is evaluated as a function of the decrease in impurity achieved by dividing the sample according to that rule. In our study, impurity within each child node was measured as deviance by the Gini Index (Therneau & Atkinson, 1997). The deviance after each split is calculated as

$$ D_{i,\text{child}} = D_{i,\text{L}} + D_{i,\text{R}} $$

where $D_{i,\text{L}}$ is the deviance in the left child node, $D_{i,\text{R}}$ is the deviance in the right child node, and $D_{i,\text{child}}$ is the total deviance after the split (Breiman et al., 1984). For any given node, the decision rule that maximizes the reduction in deviance ($\Delta D = D_i - D_{i,\text{child}}$) is selected. For regression trees, this reduction in deviance is equivalent to maximizing the between-group sum of squares (Therneau & Atkinson, 1997).
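
To make the split criterion concrete, the sketch below searches a single predictor for the threshold that maximizes the reduction in deviance, with deviance taken as the within-node sum of squares. It is a simplified illustration rather than the rpart algorithm, and the function names and six-site data are hypothetical.

```python
def sum_of_squares(values):
    """Node deviance for a regression tree: squared deviations from the node mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, deviance_reduction) for the best binary split of y on x."""
    pairs = sorted(zip(x, y))
    parent = sum_of_squares([v for _, v in pairs])
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # tied predictor values cannot be separated by a threshold
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        # Reduction in deviance: parent deviance minus summed child deviances.
        reduction = parent - (sum_of_squares(left) + sum_of_squares(right))
        if reduction > best[1]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, reduction)
    return best


# Hypothetical pH values and relative abundances (%) at six sites.
ph = [4.5, 5.0, 5.8, 6.2, 7.0, 7.6]
abundance = [1.0, 2.0, 1.0, 30.0, 28.0, 32.0]
threshold, reduction = best_split(ph, abundance)
print(threshold, reduction)
```

In a full tree the same search is repeated over every predictor at every node, and the threshold returned for the chosen predictor plays the role of a change point.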

For CT, the end point of each tree is characterized by the distribution of objects in each of the classes along with a hierarchical set of decision rules (predictor variables) that define it. For RT, the end point of each tree is the predicted mean of the response variable (e.g., relative abundance of Achnanthidium minutissimum), the number of objects in each group, and the hierarchical set of decision rules (predictor variables) that define it. Thus, the decision tree can be used to classify other sets of data based on these rules. Only a subset of attributes may be encountered on a particular path from root to leaf, and attributes may be encountered more than once. For each split, alternative splitting variables are presented and evaluated using an improvement index. The improvement index is calculated as the number of sites in the branch times the change in the impurity index. It is the relative size of the improvement index, rather than its absolute value, that indicates the utility of a variable in splitting the data (Therneau & Atkinson, 1997).
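
For a classification tree, the improvement index can be illustrated with the Gini impurity. The sketch below assumes the index is the number of sites in the parent node times the drop from parent impurity to the size-weighted impurity of the two child nodes; the function names and the ten-site presence–absence data are hypothetical.

```python
def gini(labels):
    """Gini impurity: one minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    proportions = [labels.count(c) / n for c in set(labels)]
    return 1.0 - sum(p * p for p in proportions)


def improvement(parent, left, right):
    """Sites in the parent node times the change in impurity from the split."""
    n = len(parent)
    child_impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
    return n * (gini(parent) - child_impurity)


# Hypothetical node: a taxon present at six of ten sites.
parent = ["present"] * 6 + ["absent"] * 4

# A split that separates the classes completely.
clean = improvement(parent, ["absent"] * 4, ["present"] * 6)

# A split whose children have the same class mix as the parent.
uninformative = improvement(
    parent,
    ["present"] * 3 + ["absent"] * 2,
    ["present"] * 3 + ["absent"] * 2,
)
print(clean, uninformative)
```

Comparing the two values shows why only the relative size of the index matters: competing variables are ranked against each other at the same node, where the number of sites is identical.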

A cross-validation procedure was used to determine when to stop partitioning the data. Cross-validation occurs by dividing the original data into several, mutually exclusive datasets, and producing trees of different sizes (numbers of splits). The dataset not used to build the series of trees is then used to evaluate the predictions of trees. The final tree size is selected by examining the plots of relative predictive error versus number of splits. For this study, the 1-SE stopping rule was used (Therneau & Atkinson, 1997). For this rule, any tree that is within one standard error of the tree with the lowest relative predictive error is considered as being equivalent to this tree and the simplest model (fewest number of splits) among those within 1-SE is selected. If splitting the relative abundance data based on measured environmental variables did not result in reduced predictive error, no tree was developed for that taxon.
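
The 1-SE rule can be sketched as a small selection function over a cross-validation table. The table layout (number of splits, cross-validated relative error, standard error) and all values below are hypothetical and only illustrate the selection logic.

```python
def one_se_rule(cv_table):
    """Pick the number of splits using the 1-SE rule.

    cv_table holds (n_splits, cv_error, cv_se) tuples, one per candidate tree.
    """
    # Tree with the lowest cross-validated relative error.
    best = min(cv_table, key=lambda row: row[1])
    cutoff = best[1] + best[2]
    # All trees within one standard error of that minimum count as equivalent.
    eligible = [row for row in cv_table if row[1] <= cutoff]
    # Among equivalent trees, keep the one with the fewest splits.
    return min(eligible, key=lambda row: row[0])[0]


# Hypothetical cross-validation results for one taxon.
cv_results = [
    (0, 1.00, 0.08),
    (1, 0.80, 0.07),
    (2, 0.70, 0.07),
    (3, 0.68, 0.07),
    (5, 0.69, 0.07),
]
print(one_se_rule(cv_results))
```

A result of zero splits corresponds to the case in which no tree is developed for a taxon, because splitting did not reduce predictive error beyond one standard error.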

Diatom abundances were measured as continuous variables, allowing regression tree models to be applied directly. For the regression models, species data were double-square-root transformed to stabilize variance. Because diatom data are species-rich and are generated by fixed-count methodologies, the relative abundance of a taxon is often zero at many sites, resulting in right-skewed data. To deal with this skew, we also created trees for data transformed into presence–absence (a special case of classification with two groups). RT and CT analyses were performed using the rpart package for R (Therneau & Atkinson, 1997; Atkinson & Therneau, 2000).
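
The two data treatments can be written compactly. This is a small sketch with hypothetical function names and values; it treats any non-zero relative abundance as a presence, which is an assumption about how the categories were formed.

```python
import math


def double_sqrt(rel_abundance):
    """Double-square-root (fourth-root) transform of relative abundances."""
    return [math.sqrt(math.sqrt(a)) for a in rel_abundance]


def presence_absence(rel_abundance):
    """Recode relative abundance as a two-class response for a classification tree."""
    return ["present" if a > 0 else "absent" for a in rel_abundance]


# Hypothetical relative abundances (%) of one taxon at five sites.
rel_abund = [0.0, 0.0, 1.0, 16.0, 81.0]
print(double_sqrt(rel_abund))
print(presence_absence(rel_abund))
```

The transform compresses large values far more than small ones while leaving zeros at zero, which is why the many-zero problem still calls for the presence–absence recoding.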

Results

Diatom species and environmental characteristics

Water chemistry variables are presented in Table 1. Stream water pH ranged between 3.4 and 8.4. Total phosphorus ranged between 1 and 108 μg l−1. A total of 619 diatom taxa were identified from the 256 stream sites. Average taxa richness at a site was 28 (range: 6–68). The relative abundances of the ten most common taxa (based on frequency) are presented in Table 2. Achnanthidium minutissimum (Kütz.) Czarnecki was the most common taxon, with a mean relative abundance of 25%, and was present at 239 sites. This taxon dominated the diatom assemblage at most sites, having a relative abundance greater than 25% at 110 sites and a relative abundance greater than 10% at 179 sites. A. biasolettianum (Grun.) Round & Bukht., a small, stalked taxon, similar in morphology to A. minutissimum, was also very common, having a relative abundance greater than 10% at 68 sites. Other common taxa were much less dominant in a sample. For the remaining taxa among the 10 most common, the number of sites at which relative abundance was greater than 10% ranged from 2 to 35.

Weighted-average models

WA pH models had relatively high predictive power (WA r² = 0.70, WA jack-knifed r² = 0.64) and low root-mean squared error of prediction (WA RMSE = 0.44, WA jack-knifed RMSEP = 0.49). Performance of WA TP models was lower (WA r² = 0.33, WA jack-knifed r² = 0.30, WA RMSE = 2.1 μg l−1, WA jack-knifed RMSEP = 2.3 μg l−1). The pH optima for the ten most commonly occurring taxa were all approximately neutral to slightly alkaline, ranging from 7.0 (Gomphonema parvulum (Kütz.) Kütz.) to 7.5 (Nitzschia dissipata (Kütz.) Grun.; Table 3; Fig. 2). Species TP optima ranged from oligotrophic to mesotrophic based on the suggested trophic boundaries in streams presented in Dodds et al. (1998; Table 4; Fig. 3). A. minutissimum had the lowest TP optimum (11 μg l−1), while Planothidium lanceolatum (Bréb.) Round & Bukht. had the highest TP optimum (30 μg l−1).

Table 3 Regression tree change points (CP) predicting higher relative abundance for pH and acid-neutralizing capacity (ANC), weighted-average pH optima, and van Dam et al. (1994) pH classification for 10 most frequently occurring diatom taxa in Mid-Atlantic Highlands streams
Fig. 2 Relationship between pH and relative abundance for (A) Achnanthidium biasolettianum, (B) Achnanthidium minutissimum, (C) Fragilaria capucina, (D) Planothidium lanceolatum, and (E) Reimeria sinuata at all sites. Solid lines indicate regression tree change points for pH. Dashed lines indicate weighted-average pH optima. Only taxa where pH was an important predictor in the RT and/or CT were included

Table 4 Regression tree change points (CP) predicting higher relative abundance for total phosphorus (TP, μg l−1), weighted-average (WA) TP optima, published WA optima (with study range TP in parenthesis), and van Dam et al. (1994) trophic classification for 10 most frequently occurring diatom taxa in Mid-Atlantic Highlands streams
Fig. 3 Relationship between total phosphorus (TP) and relative abundance for (A) Achnanthidium minutissimum, (B) Gomphonema parvulum, and (C) Planothidium lanceolatum at all sites. Solid lines indicate regression tree change points for TP. Dashed lines indicate weighted-average TP optima. Solid circles indicate sites used in change point calculation. Open squares indicate sites not used in change point calculation because they fell out at the first RT break. Only taxa where TP was an important predictor in the RT and/or CT were included

Regression tree change point analysis

Regression trees were developed for nine of the ten most commonly occurring taxa, with predictive power ranging between 0.18 and 0.40 (Fig. 4). Based on cross-validation, no RT could be developed for Synedra ulna (Nitz.) Ehr. We present detailed results of the RT for A. minutissimum relative abundance data as an illustrative example. The final RT had two splits (Fig. 4b, r² = 0.34). Further splits did not increase predictive r² or reduce relative predictive error. The explanatory variables used were pH and TP. The first split was on pH, with low pH predicting low A. minutissimum relative abundances. Alternatively, the first split could have been on ANC, as the improvement index for this variable was only slightly lower than that of pH (Table 5). Twenty-six sites had a pH < 6.1 and a low (0%–28%) predicted relative abundance of A. minutissimum (left node). The cross-validation indicates that this node was relatively homogeneous and not further divided. For sites with pH > 6.1 (right node), the second split was on TP. Higher relative abundance was predicted for sites with lower TP concentrations. For the 177 sites with TP < 28 μg l−1, highest relative abundance (0%–83%) was predicted. These nodes were not split further.

Fig. 4 Regression tree structure for relative abundance of (a) Achnanthidium biasolettianum, (b) Achnanthidium minutissimum, (c) Encyonema minutum, (d) Gomphonema parvulum, (e) Fragilaria capucina, (f) Nitzschia dissipata, (g) Planothidium lanceolatum, (h) Nitzschia palea, and (i) Reimeria sinuata. Terminal nodes give maximum relative abundance for that branch and number of sites in that group (in parenthesis). The values beside each split represent the critical threshold of given variables, which provide the basis for that split. Only taxa where relative predictive power of the RT was >0 were included

Table 5 Predictor variables for each split and their improvement index (in parenthesis) and alternative split variables and their improvement index (in parenthesis) for regression tree (RT) models

Overall, the variability in relative abundance of common diatom species within the MAHA region was determined primarily by pH/buffering capacity and secondarily by nutrient concentration, particularly TP (Table 5). The first split was on pH for the RT of A. biasolettianum, A. minutissimum, Fragilaria capucina Desm., and Reimeria sinuata (Greg.) Koc. & Stoerm. and on acid-neutralizing capacity (ANC) for the RT models of Encyonema minutum (Hilse ex Rab.) D. Mann, G. parvulum, and N. dissipata (Fig. 4). In addition, pH and/or ANC were important predictors in the RT of all species, except N. palea (Kütz.) W. Sm. For all species where pH was an important predictor, lower pH predicted lower abundances. TP concentration was an important predictor of relative abundance for several species, including A. minutissimum, P. lanceolatum, and G. parvulum (Table 5). TP was an alternative predictor for the trees of both F. capucina and N. palea, with improvement indices very similar to the primary variable chosen for the split (Table 5). Eleven of the nineteen environmental variables were not selected in any of the cross-validated regression trees. These included measures of dissolved cations, total suspended solids, turbidity, total nitrogen, ammonium, and both dissolved organic and inorganic carbon. For most RT, two or three splits were able to predict taxon relative abundance (based on cross-validation r² and relative predictive error) as well as trees with more splits. Thus, for most species, pH and TP values are enough to predict relative abundances. High predictive power of RT using all sites was found for A. minutissimum (r² = 0.34) and P. lanceolatum (r² = 0.40; Fig. 4), two taxa with wide ranges in relative abundances. RT performed poorly for species with narrower ranges in relative abundances, including N. palea and G. parvulum, and no successful tree could be built for S. ulna.

Classification trees based on presence–absence data with high predictive power and low misclassification rates (range: 22%–30%) were developed for N. dissipata, N. palea, P. lanceolatum, and R. sinuata (Fig. 5). CT showed that the presence of common diatom taxa was determined by pH/buffering capacity and nutrient concentration. For N. dissipata and R. sinuata, lower pH/lower buffering capacity predicted the absence of these species. Similar to the RT for P. lanceolatum, its CT predicted the presence of this taxon at higher TP concentrations. While chloride was the primary predictor of the first split of the CT for N. palea, its improvement index was not much higher than that of TP (Improvement Index: Cl = 24, TP = 17; Table 6), suggesting that TP could also have been selected for the first split. Higher concentrations of TP also predicted the presence of N. palea. No classification tree based on presence–absence data was developed for A. minutissimum, as it was only absent from 17 sites.

Fig. 5 Classification tree structure based on presence–absence for (a) Nitzschia dissipata, (b) Nitzschia palea, (c) Reimeria sinuata, and (d) Planothidium lanceolatum. Terminal nodes give number of sites in each class (presence/absence). Predicted class for each branch shown in bold. The values beside each split represent the critical threshold of given variables, which provide the basis for that split. MR = misclassification rate. Only taxa where relative predictive power of the CT was >0 were included

Table 6 Predictor variables for each split and their improvement index (in parenthesis) and alternative split variables and their improvement index (in parenthesis) for classification tree (CT) models. The improvement index is calculated as the number of sites in the branch times the change in impurity index. The number of splits to retain for each species was based on cross-validation results and the 1-SE stopping rule

Change points identified by RT occurred under a range of pH conditions for the common diatom taxa. For A. biasolettianum, A. minutissimum, and F. capucina, change points from higher to lower abundance were at slightly acidic pH (5.7–6.7), while for R. sinuata, the change point was at neutral (7.1) pH (Fig. 4). TP change points occurred under a range of TP conditions (Fig. 4). For P. lanceolatum, the TP change point was 18 μg l−1, with shifts from high to low abundance occurring at this concentration. For A. minutissimum, TP change points only occurred for pH > 6.1. The TP change point was 28 μg l−1 for sites with pH > 6.1, with shifts from high relative abundance at sites below this concentration to medium relative abundance at sites with TP greater than 28 μg l−1. For G. parvulum, the TP change point only occurred at sites with low ANC. For sites with ANC < 955 μeq/l, TP > 6 μg l−1 predicted high relative abundance. Change points for pH were not identified for E. minutum, G. parvulum, and N. dissipata. Change points for TP were not identified for A. biasolettianum, E. minutum, F. capucina, or N. dissipata.

Discussion

Regression tree and weighted-averaging approaches provided different, yet complementary, information on the complex relationships between common stream diatoms and environmental variables. While WA provides information on the optimal conditions for a taxon for a single environmental variable, RT both highlights the interactive effects of multiple predictors and can identify breakpoints at which a taxon’s abundance changes from one state to another. In our study, change points identified by regression trees highlighted the interaction between stream acidity (pH and ANC) and TP in shaping the relative abundance of common diatoms in MAHA streams. Stream water pH and ANC were important determinants of diatom species composition in another study of MAHA streams (Pan et al., 1996). Acid mine drainage affects approximately 4% of the streams in the MAHA region (US EPA, 2000) and has been shown to have profound effects on stream periphyton communities (Verb & Vis, 2000; Brake et al., 2004). For most common taxa, RT illustrated that TP is only an important variable under circumneutral to alkaline conditions. The performance of our WA models also reflected the importance of pH in these streams. While the WA pH model performed well (r² = 0.70), performance of the WA TP model was poor (r² = 0.33). Nutrients and pH often interact to structure stream diatom assemblages (e.g., Leland & Porter, 2000; Soininen & Niemelä, 2002). WA nutrient optima models have been less successful when pH gradients are important (e.g., Hall & Smol, 1992; Reavie et al., 1995; Pan & Stevenson, 1996). Environmental change points identified by RT take into account hierarchical relationships between environmental variables, and therefore, might provide more accurate information on where diatom species abundances shift along environmental gradients.

The autecological information gained through our study augments previous work on diatom species–environmental relationships in streams. The WA pH optima for the 10 most commonly occurring taxa were circumneutral to alkaline (7.0–7.5), agreeing with WA optima developed for Swedish and Hungarian streams (Kovács et al., 2006). WA pH optima of A. biasolettianum, G. parvulum, N. dissipata, P. lanceolatum, and S. ulna agreed with van Dam et al.’s (1994) autecological classification (Table 3). RT change point analysis predicted higher relative abundance at higher pH for all taxa where pH was an important predictor variable. The change points from low to high abundance identified for all species were lower than their WA optima (Fig. 2). While WA provides an idea of the environmental conditions where abundance is maximized, change point analysis provides information as to where along the environmental gradient the onset of major shifts in abundance occurs.

RT change point analysis appeared to characterize certain taxa’s relationship with TP better than WA approaches. Performance of the WA TP model was weak, indicating that species optima derived from this method might not be accurate. The WA TP optima developed in our study tend to be lower than published optima (Table 4). This might be due to the lower maximum TP concentration in streams in our study compared with many other studies (Table 4). An assumption of WA optima calculations is that the species distribution is unimodal with respect to the environmental variable of interest and the optimum will be calculated as the weighted midpoint of this distribution. Consequently, if the maximum TP concentration in a study is high, calculated TP optima will tend to be higher. For both P. lanceolatum and G. parvulum, change points predicting a shift to high relative abundance were at lower TP concentrations than optima generated by WA in this study. While both of these taxa are considered eutraphentic by van Dam et al. (1994), indicating tolerance to elevated TP, our results suggest that increases in abundance may occur at relatively low TP concentrations (6 μg l−1 and 18 μg l−1 for G. parvulum and P. lanceolatum, respectively). In contrast, for A. minutissimum, the change point indicating a shift to lower relative abundance occurred at a higher TP concentration than our calculated WA optimum. This species is characterized as oligotrophic to eutraphentic (van Dam et al., 1994), indicating that it can tolerate a wide range of nutrient conditions. Because change point analysis identifies where species shift relative abundance states, it may provide more refined environmental preferences than WA optima, which are based solely on a mathematical average. Based on our change point analysis, A. minutissimum may be considered to tolerate oligotrophic to mesotrophic conditions.

While WA optima approaches are common in the diatom autecological literature, the utility of regression tree approaches in exploring species–environmental preferences has yet to be demonstrated. In this study, RT and CT with high predictive power and low misclassification rates were developed for several of the common diatom species, including A. minutissimum, P. lanceolatum, R. sinuata, N. dissipata, and N. palea. We feel that the strength of RT and CT analysis may depend on the distribution of individual species throughout the study region and how well their abundance is characterized at any given site. In our study, RT had the highest predictive power for species that were dominant within samples and had a wide range of relative abundances throughout the study sites (e.g., A. minutissimum). We feel that RT works well in these instances because the abundance of these species has been well characterized within the sample and can therefore be accurately modeled. Transformation of relative abundance data into categories and subsequent CT analysis might provide an alternative to RT for less common species whose relative abundance might not be well characterized by the fixed-count methodologies commonly employed in diatom studies. Our CT analyses had misclassification rates comparable to CT of fish abundance (misclassification rate 22–25%; Hershey et al., 2006) and tree species (misclassification rate 20–71%; Iverson & Prasad, 1998). N. dissipata, N. palea, and R. sinuata have low ranges in relative abundance and were found at less than half of the sites in the study. While presence–absence abundance categories seem to provide a successful alternative when developing trees for less well represented and characterized diatom species, there are a few issues of concern with these types of data transformations. For N. dissipata, P. lanceolatum, and R. sinuata, CT predicted presence better than absence. For example, the CT for N. dissipata predicted presence correctly 90% of the time, but predicted absence correctly only 60% of the time. This finding contrasts with CT for fish species abundance, where misclassification of fish absence was only slightly higher (25%) than misclassification of species presence (22%; Hershey et al., 2006). For diatom relative abundances generated through fixed-count methods, zero relative abundance does not necessarily mean that a species is absent from a site, but rather that it was not encountered during the fixed count. For larger species, such as fish, where counts are usually generated by trapping and identifying most of the individuals within the stream reach, presence–absence data probably reflect true presence–absence.
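The point about fixed counts can be made concrete with a minimal sketch. It assumes each counted valve is an independent draw from the assemblage, so the chance of missing a taxon with true relative abundance p in a count of n valves is (1 − p)^n. The count sizes and the true abundance below are hypothetical; they are not taken from the study.

```python
def prob_missed(true_rel_abundance, valves_counted):
    """Probability that a taxon present at a site is absent from a fixed count.

    Assumes independent draws, i.e. a binomial model of the count.
    true_rel_abundance is a proportion between 0 and 1.
    """
    if not 0.0 <= true_rel_abundance <= 1.0:
        raise ValueError("relative abundance must be a proportion in [0, 1]")
    if valves_counted < 0:
        raise ValueError("count size cannot be negative")
    return (1.0 - true_rel_abundance) ** valves_counted


# Hypothetical rare taxon making up 0.2% of the valves at a site
true_abundance = 0.002
for count_size in (300, 500, 600):  # hypothetical fixed-count sizes
    missed = prob_missed(true_abundance, count_size)
    print(f"{count_size} valves: P(recorded as absent) = {missed:.2f}")
```

Under this simple model a rare taxon is recorded as absent in a substantial share of counts at sites where it occurs, which is one way that apparent absences could blur the absence class a CT tries to predict.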

In conclusion, RT analysis illustrated the importance of the interaction between pH and TP in shaping the abundance of common diatom species in the MAHA region and complemented traditional weighted-averaging approaches to modeling species–environmental relationships. Change points identified by RT analysis provided more refined information on where relative abundance of common diatom species shifted along pH and TP gradients. Several authors have put forth hierarchical models of variables structuring stream algal distribution and abundance (e.g., Biggs, 1996; Stevenson, 1997). While our study, and most other diatom autecological studies, focused only on water quality variables, we feel that regression tree approaches have the potential to increase our understanding of how interactions among environmental variables shape stream diatom assemblages. In addition, we feel that tree approaches could be used in conjunction with other modeling approaches (e.g., generalized additive models) to provide insight into interaction terms and also as a framework for developing and interpreting WA optima.