Introduction

The distribution of tree species in remote areas is currently not well known. Most mountainous regions and remote islands, for instance, still lack detailed maps of plant species distribution. This is also true for Alaska. To date, literature on Alaska’s tree species is only available as atlases with coarse range maps (Hultén 1968; Viereck and Little 2007). However, reliable information about even the simple presence or absence of a single species within its broad range cannot be derived from such maps. Furthermore, species distribution information is given in eco-classifications (Viereck et al. 1992) and in forestry-related articles (e.g. Farr and Harris 1979; LaBau and Alden 2000). Several maps show ecosystem or vegetation types of Alaska (Fleming 1997; Gallant et al. 1995); however, they depict plant communities and do not show single species distributions as such. Articles concerned with single tree species either report on a certain region within Alaska (Hennon and Trummer 2001; Murray 1980) or focus on species genetics (Viereck and Foote 1970). An exhaustive report on single species’ ranges could not be found. Alaska is home to over 15 tree species, and such species are of great interest for the assessment of wildlife habitat, vegetation type classification, and adaptive resource management. As large areas of Alaska are very difficult to access, there is a need for advanced approaches to mapping tree species. Here we investigate a predictive modeling approach that uses publicly available tools, data and environmental variables to predict tree species distribution and that promises high accuracy.

Need for a species distribution model (SDM) of trees in Alaska

Plant–climate-relationships and the importance of various other environmental factors for the geographical distribution of plant species have been recognized early (Whittaker 1967) and are widely used to explain biogeographical patterns (e.g. Ellenberg 1988; Walter 1985). SDMs use these concepts to determine the ecological niche of a species based on several environmental variables. The ecological niche can be projected into geographical space, resulting in a predictive map of the species’ distribution (Franklin 1995; Tsoar et al. 2007). SDMs are widely applied for the study of plant species distribution (e.g. Engler et al. 2004; Franklin 1998; Guisan et al. 1998). They are a crucial tool for obtaining better maps, which are needed to facilitate further research on the species themselves (Parviainen et al. 2008), for developing informed hypotheses on wildlife and habitat (Guisan and Zimmermann 2000), for classifying plant communities and for assessing their change in composition or distribution (Ferrier and Guisan 2006; Zimmermann and Kienast 1999), and also for improving ecological theory and knowledge (Dunning et al. 1995). Furthermore, maps derived through predictive modeling are used to improve floristic and faunistic atlases (Araújo et al. 2005; Prasad et al. 2007-ongoing), to assess the impact of land-use change (Dunning et al. 1995) or to help decide on conservation priorities (Margules and Austin 1994). They are an inherent tool in modern Adaptive Management (Huettmann 2007; Walters 1986). Developing SDMs using publicly available data is easier, faster, and less expensive than mapping in the field. A more detailed literature overview on the use of SDMs and what it entails can also be found in Guisan and Thuiller (2005).

Concept of open access in predictive modeling

Open access (OA) offers an improved principle of sharing high-quality scientific information among scientists as well as with the global public. It also makes scientific methods transparent and repeatable for everybody, which should add to their credibility and trustworthiness. This concept has become feasible due to recent advances in computing, databases and online data delivery. OA is a recent movement that is promoted virtually worldwide by ICSU, OECD, CODATA, NSF, and the European Union, as well as by global policies such as the Rio Convention and megascience programs such as the IPY. Recent publicly funded science in the US and Canada is based on such OA principles, which are becoming a requirement for publication and funding (National Research Council 2003; Interagency Working Group on Digital Data 2009). This paper provides a further example of applied OA principles. A list of freely available datasets and tools used in this study can be found in Table 1.

Table 1 Open access datasets and tools used in this study

White spruce

Our modeled species, white spruce (Picea glauca [Moench] Voss), is one of the most common tree species in Alaska and is of ecological and commercial importance, occupying approximately 25% (121,000 km²) of Alaska’s boreal forest (Labau and van Hees 1990). Good data are available for it overall, but critical data gaps remain throughout its distribution in Alaska and its wilderness areas. White spruce occurs both in floodplains and uplands (Viereck et al. 1986; Walker et al. 1986). It is the dominant treeline species in the two main mountain ranges of Alaska, and forms large stands in the highlands of Interior Alaska (Juday et al. 1999), occurring from 100 m to treeline (300–1,600 m) (Viereck and Little 2007). White spruce appears to grow best on south-facing slopes and well-drained, sandy soils along the edges of lakes and rivers, but not in areas with continuous permafrost (Viereck et al. 1992). White spruce is known to be an important habitat for moose (MacCracken and Viereck 1990; Risenhoover 1989), red squirrel (Brink and Dean 1966; Smith 1968), marten (Buskirk 1984; Slough 1989), and hare (Sinclair et al. 1988; Wolff 1978). White spruce also plays an important role in local timber production and fuel supply (Holsten et al. 1991; Viereck and Little 2007). However, maps of single tree species distribution are hard to find, and even the FIA database does not contain information beyond south-east and south-central Alaska.

Methods

Datasets

Our training dataset consisted of 108 confirmed white spruce presence datapoints, available for this species as georeferenced records (online at Arctos Multi-Institution, Multi-Collection Museum Database, University of Alaska Museum Herbarium). The points represent samples dating from 1900 to 2000, and 85% have a location uncertainty of c. 3,615 m (horizontal and vertical datums unknown). Thus, we used a buffer of 3,615 m radius for each point (equivalent pixel size: 6,407 m × 6,407 m). As the Arctos Database does not include absence data, we created 600 pseudo-absence points (Engler et al. 2004; Tsoar et al. 2007) using the publicly available Hawth’s tools random sample tool in ArcGIS 9.2 (see Table 1). More specifically, half the points were randomly distributed all over Alaska, and the other half only within non-forest vegetation types according to a digital vegetation cover map (Fleming 1997, available online from the AGDC), in order to obtain more absence points in areas where absence is more likely. In a multi-hypothesis fashion (sensu Burnham and Anderson 1998), we tested 24 environmental variables as well as latitude and longitude as potential predictors for the distribution of white spruce (see Table 2 for the complete list of predictors). For climate data we used the datasets on Alaska average monthly precipitation and monthly mean temperature, 1961–1990, by C. Daly (2 km × 2 km raster data, provided by PRISM). Elevation, aspect, and slope came from AGDC as 1 km × 1 km rasters. Aspect was used as a continuous variable, which ensures more transparency and accuracy than the traditional use of categories. We also used permafrost, soil, and surface geology (polygon data) from AGDC. In order to extrapolate the model results evenly to a large area, a regular grid with an even point spacing of 4 km was created in Hawth’s tools for the entire state of Alaska.
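The equivalent pixel size quoted for the buffer can be checked with a short calculation. The sketch below assumes that "equivalent" means a square with the same area as the circular buffer (side = radius × √π); this interpretation and the function name are illustrative assumptions, not taken from the original workflow.

```python
import math


def equivalent_pixel_side(radius_m: float) -> float:
    """Side length (m) of the square whose area equals a circle of the given radius.

    Assumption: 'equivalent pixel size' is read as an equal-area square.
    """
    return radius_m * math.sqrt(math.pi)


# Buffer radius used for the presence points (location uncertainty of c. 3,615 m).
side = equivalent_pixel_side(3615.0)
print(round(side))  # 6407
```

Under this reading, the 3,615 m buffer corresponds to a square of roughly 6,407 m on a side, which matches the value given in the text.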

Table 2 Environmental variables used for developing the models

Model

The modeling approach described in Fig. 1 was used to predict the distribution of white spruce, and is intended to serve as a role model for predicting the distribution of any tree species in remote areas anywhere in the world. The datasets were overlaid in ArcGIS 9.2 (step 1) and transformed to a consistent projection (Alaska Albers, geographic datum: NAD-83 Alaska). The values of each layer were extracted to the buffered presence points (mean values from raster datasets, prevailing class from polygon datasets) as well as to the 600 pseudo-absence points, resulting in a table with presence/absence as a response and climatic and bioclimatic variables as predictors. The environmental parameter values were also extracted to the regular grid (step 2).

Fig. 1

Flowchart illustrating the concept of model development used in this study and proposed as a role model for future models

For modeling the associations between the tree species and its environmental predictors, we favoured non-parsimonious (with ‘parsimonious’ referring to approaches based on few preselected determinant values) and non-linear modeling. Hence, we applied machine learning concepts, such as classification trees (Breiman et al. 1984; Breiman 2001) to obtain best possible predictions. These methods account for complex ecological and environmental interactions between variables (Guisan et al. 2006; Lawler et al. 2006), and even when using noisy data (Craig and Huettmann 2008) they show high performance with fine and coarse resolution datasets (Guisan et al. 2007). We used the boosting classification and regression tree software TreeNet (SalfordSystems, San Diego, CA, USA) to analyze the data and to build the model (Hastie et al. 2001; Friedman et al. 2000).

After initial testing, and for obtaining best results, we used TreeNet with the following settings: three nodes per tree, minimum number of six observations per terminal node, 100-fold cross validation (to ensure high model stability), and the option ‘balanced’ for equal weight of number of presence and absence points (Maggini et al. 2006). Here we followed the concept of using informed default settings, as promoted in Blackbox modeling for ease and convenience. It is known that this approach with its settings in most cases helps to achieve good modeling results in a fast and reliable manner (Craig and Huettmann 2008). Obtaining good but time-critical results is usually crucial for management-related applications as provided here. First, we ran nine basic models (Table 3, models 1–9) to compare effects of temperature/precipitation with those of soil characteristics. We then compared ROC values and percent of correctly predicted presences (misclassification threshold 0.5, hereafter referred to as %corr).
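The two comparison metrics can be illustrated with a small, self-contained sketch. It assumes that the ROC value is the area under the ROC curve and that %corr is the share of known presences whose predicted score reaches the 0.5 threshold; the function names, the use of ">=" at the threshold, and the toy data are hypothetical and do not reproduce TreeNet output.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    A presence/absence pair counts 1 if the presence scores higher, 0.5 if tied.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one presence and one absence")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def percent_correct_presences(labels, scores, threshold=0.5):
    """Percentage of presence records predicted at or above the threshold."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    if not pos:
        raise ValueError("need at least one presence")
    hits = sum(1 for s in pos if s >= threshold)
    return 100.0 * hits / len(pos)


# Toy data (hypothetical): 1 = presence, 0 = pseudo-absence.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
print(roc_auc(labels, scores))                    # 7 of 9 pairs ranked correctly
print(percent_correct_presences(labels, scores))  # 2 of 3 presences at or above 0.5
```

The pairwise formulation is quadratic in the number of points, which is unproblematic for a dataset of about 700 records.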

Table 3 ROC values and %correctly predicted presences (misclassification threshold 0.5)

From these exploratory runs, the models with ROC > 0.75 and %corr > 0.65 (models 3, 5, 6, 7) were kept and slightly modified by dropping climate variables that consistently fell into the lower half of the variable ranking lists of most of the models (models 3, 5, 7), or were considered not important by TreeNet variable ranking (models 8, 9). Thus, we derived nine more models to further improve ROC values and %corr, and finally, the four best-performing models were chosen (models 6, 12, 14, 15, step 3). These four models were applied within TreeNet in order to predict presence/absence to the regular grid (step 4). The predicted value of relative occurrence for each gridpoint was mapped and points were interpolated using the IDW (Inverse Distance Weighting) tool. Thus, four maps with the statewide index of relative occurrence of white spruce were obtained (step 5). The concept followed principles described by Huettmann and Linke (2002). Batch files of the TreeNet runs are available on request.
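The interpolation step can be sketched as follows. This is a minimal illustration of inverse distance weighting in general, not the ArcGIS tool itself; the power of 2, the use of all points as neighbours, and the sample coordinates are assumptions, since the tool settings are not stated here.

```python
import math


def idw(x, y, points, power=2.0):
    """Inverse-distance-weighted estimate at (x, y).

    points: iterable of (px, py, value) tuples, e.g. grid points with their
    predicted index of relative occurrence. The power parameter is an assumption.
    """
    num = 0.0
    den = 0.0
    for px, py, value in points:
        d = math.hypot(x - px, y - py)
        if d == 0.0:
            return value  # query location coincides with a sample point
        w = 1.0 / d ** power
        num += w * value
        den += w
    if den == 0.0:
        raise ValueError("no sample points supplied")
    return num / den


# Two hypothetical grid points 4 km apart with predicted indices 0.2 and 0.8.
grid = [(0.0, 0.0, 0.2), (4000.0, 0.0, 0.8)]
print(idw(2000.0, 0.0, grid))  # midway between the two points
print(idw(1000.0, 0.0, grid))  # closer to the first point, so nearer to 0.2
```

With a power of 2, a point three times farther away receives one ninth of the weight, which is why the second query stays close to the value of the nearer grid point.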

Accuracy assessment

Assessing the accuracy of a spatially explicit model means assessing prediction errors and spatial uncertainties. For a first comparison of models and their accuracies we interpreted ROC curves, derived from cost matrices (Bradley 1997; Fielding and Bell 1997). For a more detailed assessment, predicted map values (predicted index of relative occurrence) were compared to evaluation data points taken from four independent datasets with ‘presence only’ data (Fig. 1, step 6). As a measure of model performance, we found the Boyce index to be most suitable (Boyce et al. 2002), as it is independent of the prediction’s threshold between presence and absence and is based on presence-only evaluation data (Hirzel et al. 2006). The Boyce index F_i is an area-adjusted frequency index. Lower habitat suitability classes should have F_i values <1 (fewer evaluation points than with a random distribution) and high habitat suitability classes should have F_i values >1 (more evaluation points than with a random distribution). F_i was then plotted against the mean index of relative occurrence for each class, resulting in a curve that is monotonically increasing for a model with high accuracy, and monotonically decreasing for models with low accuracy. A (non-parametric) Spearman’s rank correlation of the Boyce indices of all classes against the mean index of relative occurrence of each class provides an estimate of how stable the prediction for a specific class is relative to the overall prediction accuracy of the model (step 7); the overall prediction accuracy itself can be assessed by the Spearman’s rank correlation coefficient r_s.
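The evaluation described above can be sketched in a few lines. The sketch assumes that F_i is the proportion of evaluation points falling in class i divided by the proportion of the mapped area in class i, and that r_s is computed on average ranks; the function names and the four-class toy data are hypothetical.

```python
def boyce_indices(eval_counts, area_cells):
    """Area-adjusted frequency F_i per class.

    eval_counts[i]: evaluation presences falling in class i.
    area_cells[i]: number of map cells predicted as class i.
    Returns None for classes that are not predicted anywhere on the map.
    """
    total_eval = sum(eval_counts)
    total_area = sum(area_cells)
    result = []
    for n, a in zip(eval_counts, area_cells):
        if a == 0:
            result.append(None)
        else:
            result.append((n / total_eval) / (a / total_area))
    return result


def average_ranks(values):
    """Ranks starting at 1, with ties receiving the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's r_s as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical example with four classes of relative occurrence.
class_means = [0.125, 0.375, 0.625, 0.875]
f = boyce_indices(eval_counts=[2, 8, 30, 60], area_cells=[400, 300, 200, 100])
print(f)                         # F_i below 1 for low classes, above 1 for high classes
print(spearman(class_means, f))  # monotonically increasing curve gives r_s = 1
```

Classes for which `boyce_indices` returns `None` (classes a model never predicts) would need to be dropped from both lists before calling `spearman`.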

Data management

For this study, we used OA tools and data as this has many advantages and implications for studies and project goals like ours. Most of these data proved to be of sufficient and reliable quality and carried high-quality metadata (all climate data and elevation dataset). Only some data came with very basic descriptions (species data, soil, permafrost, surface geology) and details had to be requested by email. We operated these data on a PC within Excel and ArcGIS 9.2 and with the help of additional free tools (Hawth’s tools). GIS data are presented in grids and shapefiles. Metadata were created within ArcCatalog and the freely available Metavist XML editor, and made globally available at the National Biological Information Infrastructure website (NBII, http://mercdev3.ornl.gov/nbii/). All data formats we used are supported by OpenGIS and OpenOffice.

Results

Model ranking with TreeNet

Model ranking was done by comparing ROC values and %correctly predicted presences (Table 3). Models 1–9 (exploratory runs) obtained relatively low ROC and %corr values (<0.8 and <0.7, respectively). Two exceptions were models 6 and 7, with model 6 (only elevation, aspect, and slope) reaching a slightly higher ROC value (0.806) and the highest value for %corr compared to all other 17 models (84.47). Model 7 (same as model 6 plus all temperature and precipitation variables) reached an even higher ROC value (0.869), but a lower %corr value (78.64). Model 10 showed only slightly improved values. The three models 11, 12, and 13 ranked highest according to the ROC values (all 0.875), with model 12 having the highest %corr value among them (79.61). Model 14 (improvement of model 11 by adding latitude and longitude) and model 15 (improvement of model 11 by adding permafrost) scored relatively high ROC (both 0.871) and %corr values (77.67 and 75.73, respectively). Adding soil (model 16), surface geology (model 17), or permafrost + surface geology (model 18) did not improve ROC or %corr values. However, it is worthwhile to point out that the best predictions were not achieved by the most parsimonious model, i.e. the one with the fewest predictors, giving further support for non-parsimonious non-linear model algorithms that can deal with highly complex data. This approach allowed us to identify interactions among variables and to determine systematically the variable combinations with the highest impact.

We chose (1) the model with the highest ROC value (model 12), (2) the one with the highest %corr value (model 6), (3) the best model including lat + long (model 14), and (4) the best one including at least one of the variables permafrost, soil or surface geology (model 15) as models for further consideration. For comparison of the ROC curves for the four most relevant models (6, 12, 14, and 15) see Supplemental Fig. 5a–d.

Variable importance

As an example, the variable importance obtained from TreeNet for model 12 (best-performing model) is shown in Table 4. Values represent absolute and relative importance, and thus aid in ranking the variables. The variable importance ranking shows a high contribution of aspect as a predictor variable, as well as total precipitation in August, followed by mean temperature in April. The precipitation sums of April and May and the total precipitation sum between May and September are of minor importance, as are mean temperatures in May, September, and June, as well as elevation. The temperature differences between the warmest and coldest months are least important. The partial contribution of the variable values to the model can be seen in the single variable plots in Fig. 2a–d. The occurrence of white spruce appears to be favored by warm aspects of 150°–250° (SSE to WSW), whereas cooler aspects of 300°–50° (WNW to NE) appear to inhibit its presence (Fig. 2a). White spruce is also more likely to occur in areas with a total sum of precipitation in August below 75 mm and April mean temperatures above −4°C (Fig. 2b, c). The results from evaluation of the influence of elevation on the distribution of white spruce in Alaska are less clear. While an elevation below 1,000 m has a positive influence on the presence of white spruce (Fig. 2d), an elevation above 1,000 m has a negative influence.
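Because aspect is treated as a continuous variable in degrees, ranges such as 300°–50° wrap through north and need slightly careful handling. The following sketch shows one way to label an aspect with a compass sector and to test membership in a wrapped range; the 16-point compass convention (22.5° sectors) and the function names are assumptions for illustration only.

```python
COMPASS_16 = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
              "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]


def aspect_to_compass(deg):
    """Nearest of the 16 compass points for an aspect in degrees (0 = north)."""
    idx = int((deg % 360) / 22.5 + 0.5) % 16
    return COMPASS_16[idx]


def in_aspect_range(deg, start, end):
    """True if deg lies in [start, end], where the range may wrap through 360/0."""
    deg %= 360
    if start <= end:
        return start <= deg <= end
    return deg >= start or deg <= end


for a in (150, 250, 300, 50):
    print(a, aspect_to_compass(a))

print(in_aspect_range(200, 150, 250))  # a south-facing slope in the warm range
print(in_aspect_range(350, 300, 50))   # a north-facing slope in the wrapped cool range
```

The wrap-around check matters only for ranges that cross north; for the warm range, a plain interval test is sufficient.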

Table 4 Variable importance (ranking) for model 12 according to TreeNet
Fig. 2

Single variable plots as obtained by model 12 (best model, as selected by highest accuracy metric): thresholds of the most important variables used for predicting the habitat of white spruce (relative index; positive partial dependence indicates preference, negative partial dependence indicates avoidance); a influence of aspect (degrees); b influence of August precipitation (mm); c influence of mean April temperature (°C × 10); d influence of elevation (m)

Mapping the predicted distribution

Maps showing the predicted distribution of white spruce in Alaska, as obtained by the four chosen models (6, 12, 14, and 15), showed broad-scale consistencies (for comparison see Supplemental Fig. 6). Visual comparison identified consistently low predicted values of relative occurrence for coastal regions, especially in the north and south, and higher values for Interior Alaska. Only some noise occurred on the small scale in the midlatitudes of Alaska, resulting in a ‘salt-and-pepper’-like pattern, which might indicate true mid-range values overall in the wider region. However, the north-east part of the Interior revealed consistently high values of relative occurrence for all models. The map in Fig. 3 shows the index of relative occurrence of white spruce as predicted by the best-performing model (model 12). Fully in agreement with the IPY Metadata & Data Policy, all maps as well as the corresponding metadata are made available, e.g. through the IPY data repository of the Global Change Master Directory (http://gcmd.nasa.gov).

Fig. 3

Mapped distribution of White spruce in Alaska as predicted by model 12 (best model, as selected by highest accuracy metric)

Model performance

The Boyce index (Fig. 4a) revealed similar patterns for models 12, 14, and 6 for low and middle classes, with all of F_1 to F_6 < 2, but differing patterns for higher classes (F_7 to F_10), with model 6 entirely omitting classes 9 and 10. In contrast to this broad pattern, only model 15 showed a more fluctuating curve for the Boyce indices. The Spearman’s rank correlation for models 12, 14, and 6 showed that predictions for classes of high and low relative occurrence were more stable than those for classes of mid-range relative occurrence (see trendline, Fig. 4b). This indicates that the models’ ability to predict low and high relative occurrence was better than their ability to precisely predict relative occurrence in the mid-range, overlapping gray zones on the pixel scale. However, models 12, 14, and 6 reached an r_s of over 0.9 (0.952, 0.905, and 0.907, respectively), whereas model 15 showed a large instability in predictive ability for the mid-range classes, resulting in an r_s of 0.649. Thus, the best correlation, i.e. the most consistent prediction, was achieved by model 12, which had the fewest departures from the trendline.

Fig. 4

Model prediction accuracy. Assessment of the four best-performing models with Boyce index and Spearman’s rank correlation; a Boyce index (area-adjusted frequency index) F_i for each class of each model; b Spearman’s rank correlation between rank of F_i value and class of relative occurrence i for each class of each model (classes along the trendline indicate high model reliability; classes deviating from the trendline indicate reduced model reliability); Spearman’s rank correlation coefficient r_s = 0.907 (model 6), 0.952 (model 12), 0.905 (model 14), and 0.649 (model 15), respectively

Discussion

This study quantitatively models, predicts and maps for the first time the distribution of a tree species in a large wilderness area, with a high accuracy, using free online tools and data. As we focus on high prediction accuracy, we will discuss our methods and results in the context of which factors might potentially influence accuracy.

Freely available species data (confirmed presence/absence)

Museum data generally prove to be very useful for SDM where other data on species locations are sparse (Stockwell and Peterson 2002; Graham et al. 2004). The authors argued that the limitation on high resolution often comes with the environmental variables used as predictors, rather than with the species data (Fig. 1, step 1 and 2). However, as is typical for wider parts of Alaska, 85% of the museum data we used came with an inherent location error of c. 3,615 m, whereas most of our predictors (elevation, aspect, slope with 1 km, climate data with 2 km cell size) were much more accurate. The effect of location error can be reduced by choosing an appropriate modeling technique (Fig. 1, step 3 and 4), such as TreeNet, as predictions with boosted regression trees are only slightly influenced by location errors (Graham et al. 2008). Thus, we suggest that using museum data with location errors is still an option for broad-scale SDMs and statewide predictions.

Museum data often tend to be unevenly distributed in space and time and to lack a relevant research design due to opportunistic sampling, also referred to as sampling bias (Stockwell and Peterson 2002; Graham et al. 2004). In our study, data were more abundant along the road system, and few data existed elsewhere in the Interior, where white spruce is assumed to have the center of its range (see also Kadmon et al. 2004). Sampling bias might have the largest influence on prediction accuracy, and it cannot yet be accounted for. However, additional information gained by using a multiparameter ecological approach and by considering interactions, as presented here, should help mitigate sampling bias.

Resolution and choice of grain size

Cell size (also referred to as grain size) influences the accuracy of a prediction. If the cell size is too small, a slightly wrong geographic species location will result in an association with the environmental variable value of a neighbouring cell, i.e. with a different habitat (Fig. 1, step 2). If the cell size is too coarse, environmental conditions might be averaged to values that carry no ecological meaning (Guisan et al. 2007). For making predictions for the entire state of Alaska (Fig. 1, step 4), the cell size used here (4 km × 4 km) is fine enough to keep as much information as possible, but coarse enough not to suggest a much higher accuracy than the original data (with location error) provide. It was also found that differences between species are often larger than those between techniques (Elith et al. 2006), suggesting that grain size might need to be adjusted to the average patch size and/or overall range of a species. This would require further research on patch sizes and spatial autocorrelation, which could be done using remotely sensed data of vegetation cover, or average or monthly NDVI. However, although remote sensing has proved capable of revealing information on patch sizes of vegetation types, it cannot yet do so for single species.

Predictor variables

Although this model is not meant to be a mechanistic biological model, some inferences about the ecological niche can be drawn from the single variable plots (Fig. 2) and variable ranking (Table 4). Often, climate parameters are chosen a priori and with a focus on a low number of variables, including only annual values (e.g. Thompson et al. 2006) or only values for the growing season (Calef et al. 2005). As a result, climate parameters rarely get tested against each other for their performance. We found it important not to exclude any predictors from the beginning, starting unbiased and virtually uninformed, and therefore first tested all of the 18 climate parameters, in various combinations. This approach is easily possible, as TreeNet handles large numbers of variables and interactions conveniently. All subsequent steps of dropping variables, which led to model 12 as the best-performing model (see also Table 3), indicate that the excluded variables are of minor importance for the distribution of white spruce.

Our results show that the most important variables are not necessarily those that are usually given higher priority by other investigators and in the literature, such as mean temperature of the growing season. We found, surprisingly, that taking aspect into account increases ROC and %corr values. The importance of aspect for the type of microenvironment, and thus for vegetation distribution, has already been stated elsewhere (Van Cleve et al. 1983; Calef et al. 2005; Huettmann and Diamond 2001 for wildlife applications). In contrast, using slope as a predictor lowers %corr values, although topographic slope is often regarded as important (Van Cleve et al. 1983; Calef et al. 2005). Latitude and longitude do not cause significant changes in ROC and %corr values, but help cluster predicted occurrences spatially (see also Supplemental Fig. 6 for comparison of maps). However, using latitude and longitude as predictors reinforces sampling bias, because it gives more weight to areas that were sampled thoroughly and less weight to areas with lower sampling effort. Table 3 shows that permafrost, soil, and surface geology do not help increase model performance values, which indicates either that they do not contribute to explaining the distribution of white spruce, or that these three datasets are not very suitable for the applied modeling approach. For example, the soil variable reduced the %corr value by more than 20%, which might be due to a mismatch of data, as this dataset includes 268 classes and is thus too specific for a species dataset with 108 presence points. This might result in a loss of generalization ability. The importance of permafrost, contrary to our findings, is supported by Van Cleve et al. (1983), and indirectly by Calef et al. (2005), who use drainage type as a predictor variable, which is highly correlated with the persistence of permafrost. However, both permafrost and drainage might be a function of elevation, aspect and slope, and thus are already included in the model.

Predictors found to be important (Table 4) can be used to “learn from the data”, because single variable plots (Fig. 2a–d) show the quantitative influence of each of the parameters on the distribution of white spruce (i.e. the partial dependence of white spruce on the specific parameter). Model 12 indicates the preference of white spruce for aspects ranging from 100° to 250° (Fig. 2a), which is consistent with (but more detailed than) the general idea of the typical white spruce habitat on south-facing slopes (Viereck and Little 2007). Little rainfall in August (total sum <80 mm) appears to favor the occurrence of white spruce (Fig. 2b), but we suggest that this variable is an indicator of distance to the coast rather than an actual climatic variable. Figure 2c shows the importance of the time of snowmelt, as mean April temperatures above −4°C (it might be several degrees above zero during the day) help melt snow and thus provide moisture right at the start of the growing season. In contrast, mean April temperatures below −10°C (probably only around zero during the day) inhibit snow melting and moisture supply, and thus delay the start of the growing season, making these sites an unsuitable habitat for white spruce. According to Fig. 2d, white spruce favors elevations from slightly above 0 m (mainly along the rivers, where flowing water keeps the soil free of permafrost) to about 1,000 m (highest occurrence of treeline, e.g. in the Alaska Range), which is, in a broad sense, consistent with the literature (Viereck and Little 2007).

Model performance

Model performance strongly depends on the choice of variables and the settings used. The model presented here should be seen as a first, conservative underestimate of model performance and accuracy, as there are many other settings we have not explored in concert, and thus, we could have missed the very best setting in TreeNet improving the model generalization and prediction accuracy further.

When model results, maps, and accuracy assessments for the best four models are compared, similarities and differences become evident. Most striking is that model 6 entirely omits classes 9 and 10 from its predictions, and that model 15 shows high instability in Boyce indices and a strongly fluctuating curve for the Spearman’s rank correlation (Fig. 4a, b). Models 12 and 14 clearly show the highest model stability (Fig. 4a, b). They only differ in patterns of distribution for different parts of the state, with model 14 tending to cluster indices of relative occurrence within the landscape (Supplemental Fig. 6), which is likely due to the inclusion of latitude and longitude as predictors. These variables emphasize the locations of the confirmed presence points in such a way that spatial sampling bias is reinforced. Thus, we propose the results of model 12 as the most reliable prediction, which is supported by a very high r_s.

The slight deviation of model stability for the mid-range classes would affect about 30% of the state-wide area, which might be due to the small patch size of many spruce stands. However, the most stable predictions are for the classes of high relative occurrence, suggesting that 138,192 km² of the state-wide area is covered with white spruce, which is in the same range as the value proposed by Labau and van Hees (1990), who suggest about 121,000 km².

Quantitative comparisons to other white spruce maps are difficult, because few maps are published on this topic and most of them do not provide information on the methods used (e.g. Pojar and Mackinnon 1994). There are some areas in Alaska that were predicted by several of our models to have high potential for white spruce occurrence, although white spruce has not yet been recorded there (e.g. some regions on the west coast, including offshore islands). They might simply have been undersampled, or, equally likely, range limitations such as competition, disturbance, local extinction, or barriers to dispersal prevent white spruce from occurring there (Graham et al. 2004; Barry and Elith 2006). Only on a small scale within the predicted range, e.g. in the western part of Interior Alaska, are values of relative occurrence highly variable, causing a salt-and-pepper-like distribution of values on the maps. One explanation we found for this pattern was that in the boreal and arctic, high variation in slope, aspect, drainage, postfire succession stage and vegetation cover results in larger changes in microenvironment over small distances than in humid midlatitudes (Van Cleve et al. 1983). It was found elsewhere that prediction errors might vary across the landscape, and a call for an advanced model with spatial weighting was expressed (Fielding and Bell 1997; Fielding 2002), but such a model is not yet technically available.

Furthermore, we could not yet consider climatic trends, as our occurrence data span a time period of c. 100 years, and both the temperature and the precipitation data are averages for the period 1961–1990. The same applies to soil and permafrost characteristics, as those data were compiled only once, in 1979 and 1965, respectively. However, according to Masek (2001), field investigations of tree stands at forest-tundra boundaries have so far shown little indication of stand response to warming. The boundaries were clearly mapped from satellite data, but no obvious change was apparent over the duration of the image time series (1970s–1990s), constraining recent geographical expansion rates to <200–300 m per century. This might indicate time lags between climate change and forest response, or it might reflect competition between trees and their surrounding vegetation (Masek 2001). The relevance of climate variability over time might also depend on the magnitude and spatial distribution of climate change. Given these facts, we decided to start our role model with a long-term stable condition, until data with higher temporal resolution become available.

So far, we have captured the white spruce distribution quantitatively as a single, transparent and repeatable formula and as small binary software code, ready for digital use and open for public assessment. As we worked with publicly available data, it is foreseeable that more and better data will help to improve our proposed and publicly available model even further. We would welcome such efforts.

Suggestions for further research

This study can be used as a baseline for decisions about where more sampling effort is needed in the future, as we have recognized undersampling to have the most severe impact on our model predictions. Model performance furthermore depends on variables such as fire history (fire intensity, extent and frequency; Rupp et al. 2001; Calef et al. 2005), which will be important for delineating deciduous versus coniferous forest and for distinguishing white spruce from black spruce (P. mariana) habitats (Calef et al. 2005). Knowledge about dynamic dispersal, e.g. life history, seed production and seed release, applied in a spatially explicit manner (as already done by Rupp et al. (2001) on a smaller scale), as well as information on tree pests such as the spruce budworm (Choristoneura occidentalis) to account for the probability of local extinction, is likely to improve model performance. Still, this algorithm will model the potential niche rather than the actual niche unless competition is included as a variable, e.g. by applying a plant community model that accounts for interactions between different species (Ferrier et al. 2002; Ferrier and Guisan 2006; Zimmermann and Kienast 1999).

Seeking a balance between habitat protection, conservation and recreation on the one hand and potential timber and fuel supply on the other can only be successful with detailed knowledge about the potential niche of a species and its spatial distribution within the landscape, as provided here. This model can also serve as a baseline for assessing land-use change or changes in species ranges due to climate change (Leathwick et al. 1996; Graham et al. 2004; Prasad et al. 2007-ongoing; Huettmann et al. unpublished), as it could potentially be run backward and forward in time. It would also be valuable to apply this model to other tree species and obtain maps for them. Furthermore, for forestry and timber volume prediction purposes, this model could be adjusted by using tree volume data instead of presence/absence for model calibration, or by linking the index of relative occurrence with timber volume. This would affect forest management decisions, especially where sustainable forest management is pursued.

In this study, we processed data primarily in ArcGIS. However, it is also possible to use free software exclusively, such as GRASS GIS, which supports geospatial data management and analysis, image processing, graphics and map production, spatial modeling, etc. Further exploration of these options for modeling would help to provide important tools for a broader research community. Without the availability of high-quality open-access data, accurate predictive modeling as presented here would not have been possible. Using these data and applying non-invasive methods helps to preserve wilderness areas without disturbing them. Making model results publicly available helps to connect scientists, resource managers, policy makers, and communities, and should enhance collaborative planning and management.