Introduction

Stem rust, caused by the fungus Puccinia graminis f. sp. tritici, has re-emerged as a major threat to wheat (Triticum aestivum L. and Triticum turgidum ssp. durum L.) following the appearance of new virulent races. Ug99 (TTKSK in North American race notation), a particularly virulent strain of stem rust, was found in Uganda in 1999 and has increased the vulnerability of global wheat yields (CIMMYT 2005; Vurro et al. 2010; Fehser et al. 2010). This virulent strain is spreading and has already taken its toll on wheat production in Sub-Saharan East Africa, Yemen and Iran, and is now threatening Central Asia and the Caucasus, an area that accounts for 37% of global wheat production (Vurro et al. 2010; Fehser et al. 2010). Because of its virulence and spread it has attracted attention from both the public and the research community worldwide, and global efforts to track and monitor its expansion are underway to counter its potential impact (Kolmer 2005; Vurro et al. 2010; Hodson and DePauw 2011). Plants usually react to virulent strains through the so-called R (resistance) genes (Eckardt 2001). In the case of wheat, there are about 45–50 genes, known as Sr genes, which confer resistance to different races of stem rust (McIntosh et al. 2008, 2010; Vurro et al. 2010). Deployment of new sources of resistance to stem rust has been made a top priority by the Borlaug Global Rust Initiative, established in 2005 (http://www.globalrust.org).

The appearance of new virulence for a crop disease is a recurring scenario in agricultural production that can lead to severe yield losses (Qualset 1975; Leonard and Szabo 2005; Vurro et al. 2010; Fehser et al. 2010). Utilizing novel disease resistance genes found in ex situ germplasm collections, or genebanks, can help to avert these losses (Qualset 1975). Genetic resources, such as crop landraces and wild relatives, represent potential sources of pest and disease resistance critical for the stability and sustainability of global production (Dinoor 1975; Bonman et al. 2005). However, such novel variation is often rare and may not be captured in representative or fixed collections of germplasm such as core collections (Brown and Spillane 1999; Polignano et al. 2001; Gepts 2006; Dwivedi et al. 2007; Pessoa-Filho et al. 2010; Xu 2010). The need to rationalize the search for rare adaptive variation has led to the use of alternative approaches, including the development of specific or thematic genetic resource collections (Gollin et al. 2000; Gepts 2006; Dwivedi et al. 2007; Pessoa-Filho et al. 2010; Xu 2010). Recent trait-based approaches to selecting germplasm from genebanks have shown that such selections are more likely to provide useful and novel genes (Street et al. 2008; Mackay and Street 2004; El-Bouhsini et al. 2009, 2010; Bhullar et al. 2009).

By using the eco-geographical data of a reference dataset of accessions carrying the sought-after adaptive trait, such as resistance to diseases or pests, the Focused Identification of Germplasm Strategy (FIGS) has successfully helped to identify a number of novel genes in germplasm from sites environmentally similar to those of the reference/template dataset (Mackay 1995; Mackay and Street 2004; El-Bouhsini et al. 2009, 2010; Bhullar et al. 2009). Relationships between adaptive traits and collection site environmental parameters have also been revealed by recent studies using multi-variate and multi-way models such as N-PLS (multi-linear Partial Least Squares) (Endresen 2010; Endresen et al. 2011). Modeling of stem rust resistance using geographical information system (GIS) approaches has also led to the detection of a relationship between geographical areas and incidence of resistance to stem rust (Bonman et al. 2007). Furthermore, some of the traits that have been found to carry strong climatic signals in wild species are being used to model the impact of climate change (Barboni et al. 2004; Webb et al. 2010). However, although there have been trait-environment studies in the past, they were generally limited to a single environmental variable or a small group of them (Pakeman et al. 2009).

FIGS is a trait-based and user-driven approach to select potentially useful germplasm for crop improvement. It searches for specific sought-after traits, using the environment as a surrogate, based on the hypothesis that the germplasm is likely to reflect the selection pressures of the environment from which it was originally sampled (Mackay 1990, 1995; Mackay and Street 2004). The FIGS approach addresses the lack of available evaluation data as well as the temporal issue of evaluation (the moment when the accession is evaluated) reported by Koo and Wright (2000). In a simulation of the economic impact of disease manifestation, Koo and Wright (2000) found that it was faster to develop an improved variety by incorporating novel resistance traits, provided the source of the resistance gene had already been identified. In terms of searching genetic resource collections for useful traits, Gollin et al. (2000) developed a theoretical model based on resistance to diseases and insects, including the Russian wheat aphid (RWA). The model highlighted that the search for a desirable trait is of equal importance to the process of transferring it into improved backgrounds. In their findings, a focused search approach, which they called a specialized knowledge case, contributes positively to expected net benefits through the increased probability of finding the desirable material and the associated cost savings. FIGS as a focused approach combines both the development of a priori information (a dataset template, or specialized knowledge as per Gollin et al. (2000)) based on the quantification of the trait-environment relationship, and the use of this information to define a subset of accessions with a higher probability of containing the sought-after genetic variation for adaptive traits (Mackay 1995; Mackay and Street 2004).

The distribution patterns of an adaptive trait might be, as is the case for taxonomic species distributions, the result of ecological and evolutionary factors, including, but not limited to, environmental factors, natural selection and local selection pressures that are hard to quantify, such as interactions with humans. However, according to Qualset (1975) and Dinoor (1975), disease resistance traits are more likely to be influenced by natural selection and thus have a restricted distribution. In this context, Hakes and Cronin (2011) assert that trait distribution patterns are not random and could be geographically and spatially structured. Compared to species distribution patterns, very few studies have been undertaken to identify key factors that influence specific trait distribution patterns (Chuine 2010). Furthermore, recent studies have shown that modeling the distribution of specific traits can improve the quality and predictive performance of plant species distribution models (Hanspach et al. 2010).

The objective of this research was to detect whether there is a link between stem rust resistance and climate, the results of which will be used to (1) develop a subset of germplasm accessions with an increased probability of finding resistance to stem rust and (2) develop algorithms to use in subsequent applications of FIGS for ‘trait mining’ of large germplasm collections. Five modeling techniques were tested to quantify the hypothesized link. These included both parametric and non-parametric approaches as well as machine learning methods. The overall assumption is that the novel genetic variation will be confined to areas with similar environmental profiles to sites where stem rust resistance has been previously found.

Methods

The R language (R Development Core Team 2011) was used as a platform for the preparation and analysis of data. The data consisted of stem rust scores for bread and durum wheat accessions (trait data) and environmental or site data (climate data) describing where the accessions were originally sampled.

The trait data

The stem rust trait data used in this paper to develop the FIGS a priori information was taken from the United States Department of Agriculture (USDA) National Genetic Resources Program (NGRP) GRIN database. The data is an accumulation of results over six different years (during 1988–1994) at two research stations in the USA: the University of Minnesota Agricultural Research Station (44°59′17′′ N, 93°10′48′′ W) and the Rosemount USDA Agricultural Research Station, also in Minnesota (44°43′01′′ N, 93°05′56′′ W). Dr Don V. McVey made all of the trait observations for both locations (Bonman et al. 2007; Endresen et al. 2011).

The accessions screened for stem rust originated from 2,013 collection sites, one site having been removed from the original data of 2,014 sites. Some of the sites lacking geographical coordinates were geo-referenced at ICARDA based on a description of the original collecting sites. The probability distribution of the disease scores shows that susceptible accessions outnumber resistant accessions. Cross-tabulation was carried out to assign each site i a trait attribute, using the expected frequency \( e_{ik} \) of each score k (0–9) multiplied by the actual count \( y_{ik} \) of that score at the site, across all sites. The sites were then compared based on the weighted frequency of resistant accessions (scores 0–4) versus susceptible accessions (scores 5–9):

$$ Y_{i} = \begin{cases} 1, & \sum\limits_{k = 0}^{4} e_{ik} y_{ik} \ge \sum\limits_{k = 5}^{9} e_{ik} y_{ik} \\ 0, & \text{otherwise} \end{cases} $$

This follows the results reported by Endresen et al. (2011), who reclassified the trait states from 9 groups into 2 and used single sites as observations, such that the most common score class is the one attributed to each site.
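A minimal sketch of this site classification in R, assuming the accession-level scores are held in a hypothetical data frame with columns site and score (0–9), one row per accession:

```r
# Classify each collection site as resistant (1) or susceptible (0) from
# per-site score counts, weighting observed counts by expected frequencies.
classify_sites <- function(scores) {
  tab      <- table(scores$site, scores$score)   # y_ik: counts per site and score
  expected <- chisq.test(tab)$expected           # e_ik: expected frequencies
  weighted <- expected * tab                     # e_ik * y_ik
  resistant   <- rowSums(weighted[, colnames(tab) %in% 0:4, drop = FALSE])
  susceptible <- rowSums(weighted[, colnames(tab) %in% 5:9, drop = FALSE])
  setNames(as.integer(resistant >= susceptible), rownames(tab))  # Y_i per site
}
```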

The climate data

The climate data was extracted from climatic maps generated from station data by co-splining (Hutchinson and Corbett 1995), a local interpolation method. The ‘thin-plate smoothing spline’ method of Hutchinson (1995), as implemented in the ANUSPLIN software (Hutchinson 2000), was used to convert the point climate data into climate surfaces. This is a smoothing interpolation technique in which the degree of smoothness of the fitted function is determined automatically from the data by minimizing a measure of the predictive error of the fitted surface, given by the generalized cross-validation (GCV). The GCV is calculated by removing each data point in turn and computing the residual between the omitted point and a surface fitted to all other data points using the same smoothing parameter value. The thin-plate smoothing spline method with GCV has proven effective, achieving an interpolation accuracy comparable to that of cokriging when an appropriate variogram is well selected (Wood 2000; Tait and Turner 2005; Wratt et al. 2006).

The generation of climatic maps was based on the use of terrain variables as auxiliary variables in the interpolation process, whereby these variables were first converted into digital elevation models (DEM) using GIS software. In contrast to the climatic target variables themselves, which are only known for a limited number of sample points, terrain variables have the advantage of being known for all locations in between. In addition, some climatic variables, such as temperature and precipitation, are highly correlated with elevation, which increases the precision of the interpolated values significantly (De Pauw et al. 2000).

The DEM used to generate the climate surfaces was GTOPO30, a global DEM with 30 arc-second (approximately 1 km) resolution (Gesch and Larson 1996). Parameter estimation was undertaken over a regular grid with the same dimensions and resolution as the user-provided DEM. The combination of point climatic data and terrain, in the form of a DEM, allows the generation of spatially or temporally linked derived variables, such as potential evapotranspiration (pet) and the aridity (ari) index (Table 1). From the climate maps a total of 60 climatic variables were extracted for the 2,013 georeferenced locations where the accessions scored for stem rust were originally sampled. These 60 variables represent monthly average minimum temperature (tmin), monthly average maximum temperature (tmax), monthly average precipitation (prec), monthly average potential evapotranspiration (pet) and monthly average aridity index (ari) (Table 1).

Table 1 The environment variables used in the study

Data preparation and data exploration

Prior to the analysis of the climate data each variable was examined individually. The climate variables were mostly right-skewed, and transformations were performed to tackle both the skewness and the different measurement scales, avoiding the discrepancy between large and small values. Normality was not necessarily a prerequisite for some of the modeling techniques explored. Among the transformations used was the Box–Cox transformation, a power transformation included in Tukey’s original family of transformations that reduces to a logarithmic transformation when the power parameter equals 0 (Tukey 1957; Osborne 2010). The Box–Cox transformation was applied individually to the aridity (ari), precipitation (prec) and evapo-transpiration (pet) monthly variables based on the equation:

$$ f_{\lambda}(x) = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \ne 0 \\ \log(x), & \lambda = 0 \end{cases} $$

where the power λ (of x) is chosen to reduce non-normality by maximizing the log-likelihood function l(λ):

$$ l(\lambda ) = - \frac{n}{2}\log_{e} \left[ {\frac{1}{n}\sum {\left( {x_{j}^{\lambda } - \bar{x}^{\lambda } } \right)^{2} } } \right] + (\lambda - 1)\sum\limits_{j = 1}^{n} {\log_{e} } (x_{j} ) $$

\( \bar{x}^{\lambda } \) is defined as the average of the newly transformed variables (Box and Cox 1964; Osborne 2010). Through this process the best option is selected from a range of transformations (Osborne 2010).

As a result of the Box–Cox transformation process the aridity variables were transformed with λ values ranging from 0.12 to 0.42, the precipitation variables with λ values ranging from 0.17 to 0.56 and the evapo-transpiration variables with λ values ranging from −0.40 to 0.3 (Table 1). Only the best value of λ, not tabulated here, for each of the 60 variables was used. Figure 1 demonstrates the effectiveness of this transformation. Mean annual temperatures (tmax and tmin) were not transformed since their distribution was approximately normal (Fig. 1).
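A minimal sketch of this step in R, assuming a hypothetical positive-valued monthly variable prec_jan; MASS::boxcox() profiles l(λ) over a grid as in the equation above:

```r
library(MASS)  # boxcox()

# Choose lambda for one climate variable by maximizing l(lambda), then apply
# the power transformation; values must be strictly positive.
best_lambda <- function(x) {
  bc <- boxcox(x ~ 1, lambda = seq(-1, 1, 0.01), plotit = FALSE)
  bc$x[which.max(bc$y)]                       # lambda maximizing l(lambda)
}
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

lambda  <- best_lambda(prec_jan)
prec_tr <- scale(box_cox(prec_jan, lambda))   # then standardize (mean 0, sd 1)
```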

Fig. 1
figure 1

Frequency histograms of climate (aridity/moisture index) variable before transformation (above) and after transformation with λ = 0.13 (below)

Fig. 2
figure 2

Model performance for both logistic regressions (PCLR and GPLS) versus the number of components, using AUC (left graph) and Kappa (right graph). GPLS reaches AUC = 0.7 with fewer components (LVs = 20) while PCLR needs 45 PCs. With fewer than 10 components GPLS performs better than PCLR. For both models, with the LVs and PCs that explain only 95% of the variance, performance is below accepted values. The PCLR model is represented by the continuous line and GPLS by the dashed line

After transformation, all variables were standardized to a mean of zero and a standard deviation of one, and a comparison was made between the transformed and non-transformed data. This data pre-processing was carried out systematically and automatically within the different models.

Uni-variate analysis (ANOVA) was carried out to detect the discriminating ability of each variable individually. In the ANOVA, the trait data (stem rust scores Y) was used as the independent (grouping) variable while the climate variables (X) were used as dependent variables. Most of the variables taken individually showed significant discrimination between the different stem rust trait groups. Collinearity among the variables (X) was expected, as indicated by an extremely high condition number, equal to the square root of the ratio of the largest to the smallest eigenvalue (Belsley 1991).
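A minimal sketch of this screening in R, assuming hypothetical objects X (the standardized climate matrix) and Y (the binary site classification):

```r
# One-way ANOVA of each climate variable against the trait groups, plus the
# condition number of the climate data as a collinearity diagnostic.
pvals <- apply(X, 2, function(v) anova(lm(v ~ factor(Y)))$`Pr(>F)`[1])
eig   <- eigen(cor(X), only.values = TRUE)$values
cond  <- sqrt(max(eig) / min(eig))   # condition number (Belsley 1991)
```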

Modeling framework

The modeling framework refers to the context as well as the processes, ranging from training and tuning to testing and assessing the modeling techniques for their predictive power. The models are based on the paradigm that the value of the stem rust resistance state (Y) depends on the climatic variables (X), where \( X = (x_{1}, \ldots, x_{n}) \). At first, the assumption is that Y is normally distributed with the mean being a linear function of X with a constant variance of \( \sigma_{\epsilon }^{2} \),

$$ {\text{Y}}_{\text{i}} \sim N\left( {\beta_{0} + \sum\limits_{i = 1}^{n} {\beta_{i} X_{i} } ,\sigma_{\epsilon }^{2} } \right) $$

where \( \beta_{i} \) are coefficients. When the trait state is considered as resistant or susceptible, Y can take 2 possible values, either 0 (susceptible) or 1 (resistant), and thus the distribution of \( Y_{i} \) becomes

$$ Y_{i} \sim \mathfrak{B}\left( 1, \Phi\left( \beta_{0} + \sum\limits_{i = 1}^{n} \beta_{i} X_{i} \right) \right) $$

The above equation describes a random Bernoulli variable (Gollin et al. 2000); using the standard normal distribution yields the Probit model, while using the logistic distribution yields the Logit model (Feelders 1999). Such relationships between the trait(s) and the environment can be modeled in a multiple regression framework, including logistic regression (Webb et al. 2010), where the response variable Y is adjusted to a response vector logit(p) with p = P(Y = 1). The logit is the log-odds (Pohlmann and Leitner 2003):

$$ \mathrm{logit}(p) = \ln\left( \frac{p}{1 - p} \right) = \beta_{0} + \sum\limits_{i = 1}^{n} \beta_{i} X_{i} $$

which in turn leads to the mathematical expression of

$$ p = \frac{{\exp \left( {\beta_{0} + \sum\nolimits_{i = 1}^{n} {\beta_{i} X_{i} } } \right)}}{{1 + \exp \left( {\beta_{0} + \sum\nolimits_{i = 1}^{n} {\beta_{i} X_{i} } } \right)}} $$

and this transformation assumes a linear relationship between the logit of the probability of Y = 1 and the climate variables. However, in trait-based approaches linear regression analysis may be more appropriate for exploratory data analysis (Webb et al. 2010). As a follow-up to the earlier work of Endresen et al. (2011), this study was extended to a non-parametric framework, using the original response Y, where the phenomena are expected to be non-linear. The non-linear framework refers here to learning-based techniques, which aim to overcome the problem of restrictive parametric paradigms on the one hand and prerequisite distribution assumptions on the other (Drake et al. 2006).
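For reference, the basic logit model above can be fitted in R as a minimal sketch, assuming a hypothetical data frame dat holding the binary response Y and the 60 climate variables, with dat_test the held-out sites:

```r
# Plain logistic regression: P(Y = 1) as a function of the climate variables.
fit   <- glm(Y ~ ., data = dat, family = binomial(link = "logit"))
p_hat <- predict(fit, newdata = dat_test, type = "response")  # estimated P(Y = 1)
```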

The accuracy (or predictive power) of the models was measured as the ability of a fitted model to correctly classify the response for each of the two descriptor states (resistant or susceptible). The modeling framework also includes a tuning process for optimal accuracy. The quality of the models was measured using cross-validation algorithms where the data is split into two subsets; the algorithms developed on the training set were used to predict the trait states in the test set.

Modeling techniques

The five modeling techniques that were assessed to quantify the hypothesized trait-environment relationship are described in the following sections (Table 2). In the first two techniques the response variable was adjusted to take into account that it is a binary variable, so that the trait-environment relationship is described in terms of the probability of Y = 1 instead of Y.

Table 2 Models used in the study

Multiple linear regression using principal component logistic regression (PCLR)

A PCLR analysis was performed on the transformed climatic variables and the stem rust trait prediction \( \hat{Y} \) (an estimate of the probability that Y_i = 1) was generated from the PCA component scores (Spca) instead of the original transformed climate variables (X). It is based on the PCLR equation logit(p) = SpcaB + E, where B consists of the maximum likelihood estimates of the logistic regression coefficients (Aguilera et al. 2006). The approach aims both to reduce the number of predictor variables (multi-collinearity) and to adjust the outcome or response variable. The prediction was initially carried out using the number of components (PCs) that account for 95% of the explained variance. The optimization of the PCLR model was based on the accuracy measures, which were examined by adding components stepwise according to their contribution to the overall explained variance, starting with those that explain a large amount of variance (Fig. 2). Components with high predictive value will increase these accuracy measures while those that are largely “noise” will decrease them.
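A minimal sketch of PCLR in R, assuming hypothetical objects X_train/X_test (transformed, standardized climate matrices) and Y_train (the binary trait):

```r
# PCA on the climate variables, then logistic regression on the first k scores.
pca    <- prcomp(X_train)
k      <- 45                                      # number of PCs (tuned stepwise)
scores <- as.data.frame(pca$x[, 1:k])
pclr   <- glm(Y_train ~ ., data = scores, family = binomial)

new_scores <- as.data.frame(predict(pca, X_test)[, 1:k])
p_hat      <- predict(pclr, newdata = new_scores, type = "response")
```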

Multiple linear regression using generalized partial least squares (GPLS)

PCLR, as an extension of the PCA approach, eliminates only the collinearity but might not identify the optimum subset of candidate predictor variables, since the decomposition is carried out independently of the trait dataset. Thus a GPLS regression, an extension of PLS, was used, since PLS not only retains the original structure but also involves a simultaneous decomposition of both the environmental dataset and the trait dataset into a product of factors and their loadings (regression coefficients) (Wold et al. 1984; Abdi 2010). GPLS, as an extension of PLS, retains the rationale of PLS (Bastien et al. 2005). While PCA maximizes the variance of the scores, PLS maximizes the covariance between the scores and the response variable. GPLS latent variables (LVs) should thus be more relevant for trait prediction, as was demonstrated by Arif et al. (2007) for PLS in assessing the correlation between morphology and climate variables. The optimization of the number of components selected in the prediction for this model followed the same procedure as that of the PCLR model.
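The study follows the GPLS algorithm of Bastien et al. (2005); as a rough, hedged stand-in, one can extract PLS latent variables against the binary response and fit a logistic regression on those scores, reusing the hypothetical X_train, Y_train and X_test objects from the PCLR sketch:

```r
library(pls)  # plsr()

# PLS decomposition against the response, then logistic regression on the LVs.
dat     <- data.frame(Y = Y_train, X = I(as.matrix(X_train)))
pls_fit <- plsr(Y ~ X, data = dat, ncomp = 20)    # 20 LVs (tuned stepwise)

lv <- data.frame(unclass(scores(pls_fit)))
colnames(lv) <- paste0("LV", seq_len(ncol(lv)))
gpls_fit <- glm(Y_train ~ ., data = lv, family = binomial)

lv_test <- data.frame(predict(pls_fit,
                              newdata = data.frame(X = I(as.matrix(X_test))),
                              type = "scores"))
colnames(lv_test) <- colnames(lv)
p_hat <- predict(gpls_fit, newdata = lv_test, type = "response")
```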

Random forest (RF)

RF is a type of recursive partitioning algorithm in which the data is recursively split into groups of observations with similar response values, a procedure that does not require normality assumptions and deals well with a large number of variables (Strobl et al. 2009). It differs from a standard tree classifier in that it “grows” many classification trees. An object from an input vector is classified by all trees in the forest; each tree gives a classification, and we say the tree “votes” for that class. The forest assigns to a given object the classification having the most votes over all the trees in the forest. This approach can yield higher classification accuracy than other classifiers (Breiman 2001).

The data in the RF module is split intrinsically into a “training set” as a result of bootstrap sampling with replacement; the data that is not sampled into the training set is referred to as the out-of-bag (OOB) set. The OOB set is used to test the predictive power of the RF module. A number, mtry, is specified, which is less than the number of input variables (in this case the climatic parameters), such that at each node of the tree mtry variables are selected at random from the original variable set and the best split on these randomly selected variables is used to split the node. Each tree in the “forest” is grown to the largest extent possible, without pruning, until there are ntree trees. The model is optimized by varying the mtry and ntree values, driven by monitoring the magnitude of the mean square prediction error (rate of classification error) observed on the OOB set; that is, the ability of each iteration to correctly classify a site as resistant or susceptible.
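A minimal sketch in R with the randomForest package, again assuming the hypothetical X_train and Y_train objects; tuneRF() searches mtry by OOB error:

```r
library(randomForest)  # randomForest(), tuneRF()

# Tune mtry via the OOB error, then grow the forest; Y must be a factor
# for classification.
y  <- factor(Y_train)
mt <- tuneRF(X_train, y, ntreeTry = 500, stepFactor = 2, improve = 0.01)
rf <- randomForest(X_train, y,
                   mtry  = mt[which.min(mt[, "OOBError"]), "mtry"],
                   ntree = 1000)
rf$err.rate[rf$ntree, "OOB"]   # final OOB classification error
```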

Neural network (NN)

In the NN model a neuron is described by its weight w_k and a transfer function f(x) that receives a set of numbers x_k as input, in this case the climate variables, and generates a number, y, as output, in this case the trait (Golden 1996; Bari et al. 2003). Similar to a nervous system, a NN consists of many processing elements (PE) linked in a many-to-many connection structure, where the computations are carried out in parallel by the units independently of each other. This parallelism and high connectivity are the characteristics of neural networks that help in overcoming the assumptions usually required in the linear case. Modeled on the human brain, they were originally designed to identify patterns, even in the presence of noise (Warner and Misra 1996).

If the trait-climate relationship falls within a General Linear Model (GLM) context, the NN weights w_k would correspond to the β_i coefficients. NNs do not require an assumption of linearity between the trait variable (dependent variable) and the climate variables (independent variables); it is the data that defines the functional form of this relationship (Warner and Misra 1996). After the climate data was transformed and scaled it was fed to the NN model [nnet model R library (Venables and Ripley 2002)]. Prior to the use of the NN model a tuning process was performed to define the number of neurons (NN size) and the decay parameter (ε) that measures the trade-offs between the weights (w_k) and the prediction error. The weights keep changing through the back-propagation iterative process until the reduction in error is optimized.
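A minimal sketch of this tuning with the nnet library named in the text, assuming the hypothetical X_train and Y_train objects; the grid values are illustrative, and in practice the error would be estimated by cross-validation rather than on the training data:

```r
library(nnet)  # nnet() (Venables and Ripley 2002)

# Grid-search the hidden-layer size and the weight-decay parameter, then fit
# the final single-hidden-layer network.
grid <- expand.grid(size = c(2, 4, 8, 16), decay = c(0.001, 0.01, 0.1))
err  <- apply(grid, 1, function(g) {
  fit <- nnet(X_train, Y_train, size = g["size"], decay = g["decay"],
              maxit = 500, trace = FALSE)
  mean((predict(fit, X_train) > 0.5) != Y_train)  # misclassification rate
})
best <- grid[which.min(err), ]
nn   <- nnet(X_train, Y_train, size = best$size, decay = best$decay,
             maxit = 500, trace = FALSE)
```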

Support vector machines (SVM)

The support vector machine (SVM) is also a learning-based technique; it maps input data to a high-dimensional space and then optimally separates it into the respective classes by isolating those inputs which fall close to the data boundaries (Cortes and Vapnik 1995; Principe et al. 2000). In this study SVM was used with a radial basis function (RBF) as the kernel function. RBF uses Gaussian transfer functions, the centres and widths of which are determined by unsupervised learning rules. RBF first carries out unsupervised clustering using a k-nearest neighbor algorithm, and then applies a supervised classification using the cluster number and width (radius, hence the name “radial”). Thus the sites are first split into k clusters and the size of each cluster is obtained from the structure of the input data. The centres of the clusters give the centres of the RBFs, while the distance between the clusters provides their widths (Silipo 1999; Bari et al. 2003).

SVM-RBF first assigns a “score” or label to the combination of input variables, in this case the site climatic variables, during the unsupervised procedure; the value of the label is then matched with the actual Y value, in this case the trait value. The results of this comparison are fed back to the system and adjustments are made to the labels until the error between the predicted and the actual values is minimized. Tuning the SVM involves adjusting the parameter (γ) of the kernel function (RBF) and the cost (C) value, chosen here to be 0.1 and 1, respectively. The SVM model was tuned by supplying ranges for both parameters, and these best values were chosen by minimizing the error on the training data set.
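A minimal sketch with the e1071 package, assuming the hypothetical X_train, Y_train and X_test objects; the grid ranges are illustrative, and the study reports the chosen values γ = 0.1 and C = 1:

```r
library(e1071)  # svm(), tune.svm()

# Tune gamma and cost over a grid by cross-validated error, then fit the
# RBF-kernel SVM with the best pair.
tuned <- tune.svm(X_train, factor(Y_train),
                  gamma = 10^(-3:1), cost = 10^(-1:2), kernel = "radial")
svm_fit <- svm(X_train, factor(Y_train), kernel = "radial",
               gamma = tuned$best.parameters$gamma,
               cost  = tuned$best.parameters$cost)
pred <- predict(svm_fit, X_test)
```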

Tuning the models and comparing outputs

Comparisons between models were made using metric parameters derived from a confusion-matrix table and the Area Under the Curve (AUC) of the Receiver Operating Characteristics (ROC) (Swets et al. 2000; Fawcett 2006). The confusion matrix parameters are derived from a 2 by 2 contingency table (Table 3). The comparison process draws, for each type of cross-validation, on two such parameters: sensitivity and specificity.

Table 3 Confusion matrix (2-by-2 contingency table)

Sensitivity, defined as a/(a + c), and specificity, defined as d/(b + d), are indicators of a model's ability to correctly classify observations as either susceptible or resistant. The higher the values of sensitivity and specificity, the lower the error and thus the better the discriminating power of the model. Errors occur when resistance (R = 1) is classified as susceptible and vice versa. The former is a conditional probability notated P(R* = 0|R = 1) while the latter is notated P(R* = 1|R = 0). Thus sensitivity can be defined as P(R* = 1|R = 1) and specificity as P(R* = 0|R = 0).

The Kappa statistic was used to assess improvement over chance; it measures the specific agreement in the confusion matrix table. A Kappa value below 0.4 indicates poor agreement and a value of 0.4 and above indicates good agreement (Landis and Koch 1977). Thus a high value indicates that the model's performance is adequate for prediction purposes (Scott et al. 2002). A 90% confidence interval for the Kappa statistic was also used, since Kappa is asymptotically normally distributed. Because Kappa can be a threshold-dependent metric, we also used the Area Under the Curve (AUC) of the Receiver Operating Characteristics (ROC) plots to measure the models' accuracy (Freeman and Moisen 2008). The AUC assesses improvement over randomness based on the ROC curve.

In the ROC curve, sensitivity, which is the conditional probability P(R* = 1|R = 1), is plotted against the conditional probability P(R* = 1|R = 0), the complement of specificity (1 − P(R* = 0|R = 0)). This is also known as the plot of true positive rate versus false positive rate, where the true positive rate is sensitivity and the false positive rate is 1 − specificity. A ROC curve that rises nearly vertically from the origin towards the upper left corner of the graph has a high true positive rate and a small false positive rate. Such a plot has high AUC values indicating favorable model performance (Freeman and Moisen 2008). Thus in this study, the higher the AUC values the better the discrimination between collection sites yielding resistant or susceptible classes. An AUC value of 0.5 represents randomness while 1 represents peak model performance (Fawcett 2006).
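A minimal sketch of these indicators in R, assuming hypothetical predicted probabilities p_hat for a held-out Y_test and the usual 2-by-2 layout of Table 3 (a = true positives, b = false positives, c = false negatives, d = true negatives):

```r
library(pROC)  # roc(), auc()

pred <- as.integer(p_hat >= 0.5)                  # threshold at 0.5
tab  <- table(factor(pred, 0:1), factor(Y_test, 0:1))
a <- tab["1", "1"]; b <- tab["1", "0"]; c <- tab["0", "1"]; d <- tab["0", "0"]

sensitivity <- a / (a + c)                        # P(R* = 1 | R = 1)
specificity <- d / (b + d)                        # P(R* = 0 | R = 0)

po <- (a + d) / sum(tab)                          # observed agreement
pe <- ((a + b) * (a + c) + (c + d) * (b + d)) / sum(tab)^2  # chance agreement
kappa_stat <- (po - pe) / (1 - pe)                # Cohen's Kappa

auc_val <- auc(roc(Y_test, as.numeric(p_hat), quiet = TRUE))
```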

Each model was tuned individually to define the most appropriate parameters for better prediction (Table 2). This involved the examination of the different errors (e.g. the root-mean-square error, RMSE), numerically or graphically, using optimization and tuning algorithms. These algorithms invoke internal cross-validation (10-fold cross-validation) using 10 random segments or folds, where 9 folds are used for learning and the remaining fold is used for validation.
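A minimal sketch of such repeated 10-fold cross-validation in R, written generically over a hypothetical model-fitting function fit_fun(X, Y) and prediction function pred_fun(fit, X):

```r
# Repeated k-fold cross-validation returning one mean AUC per repetition.
cv_auc <- function(X, Y, fit_fun, pred_fun, folds = 10, reps = 10) {
  replicate(reps, {
    id <- sample(rep(seq_len(folds), length.out = nrow(X)))  # random folds
    mean(sapply(seq_len(folds), function(f) {
      fit <- fit_fun(X[id != f, ], Y[id != f])               # learn on 9 folds
      p   <- pred_fun(fit, X[id == f, ])                     # predict the rest
      as.numeric(pROC::auc(pROC::roc(Y[id == f], p, quiet = TRUE)))
    }))
  })
}
```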

PCLR and GPLS optimization is based on the accuracy parameters. For RF, NN and SVM the parameters were tuned for optimal accuracy using separate tuning algorithms for each module. A range was provided for each parameter and the tuning algorithms selected the best value for each. The cross-validation was performed 10 times and the averages for the performance indicators are reported in Tables 4 and 5.

Table 4 Model accuracy for the PCLR and GPLS models using test data, for PCs/LVs representing 95% of the variance (PCs = 5 and LVs = 6) and for the cases where the PCA and PLS errors reached their minimum (PCs = 45 and LVs = 20) with high accuracy values
Table 5 Model accuracy using test data for the 5 optimized and tuned models

The models were then compared based on the AUC and Kappa indicators for best predictive performance. The comparisons were based on the test data, which represents one third of the data (671 sites), while two thirds of the data (1,342 sites) were used to develop the five models. The modeling algorithms were run 10 times on the test set and the averages with their confidence intervals were reported for AUC, Kappa and the overall correct classification of each model.

Results

The PCLR, GPLS, and RF models were all able to correctly classify sites that yield either resistant or susceptible genotypes with a 76% success rate; SVM and NN improved this prediction by 1–2% (Table 5). The accuracy of the models is illustrated by the ROC plots shown in Fig. 3. A straight diagonal line is expected when a model is no better than random. While all models yielded plots that demonstrate a better-than-random performance, the curves for the non-linear models (RF, SVM and NN) tended to be skewed more towards the left-hand side of the ROC plots, indicating that they classify the resistant trait relatively more correctly, with a lower false positive error, and thus will perform relatively better than the parametric models (Fawcett 2006).

Fig. 3
figure 3

ROC plots (left) and density plots of predictions for the resistant and susceptible classes (right) for the 5 models using the test set; the green curve (discontinuous line) indicates the probability density distribution for resistance and the red curve (continuous line) indicates susceptibility. Predictions fall out of the range [0, 1] as a result of linearity/interpolation in some of the models

These indications are supported by the Kappa and AUC statistics. The PCLR and GPLS approaches have relatively low and similar average Kappa values (0.41). Further, their AUC values are nearly equal to what can be expected of a robust prediction model (0.7 and above). The SVM and NN approaches had Kappa values of 0.44 and 0.45, respectively, and AUC values of 0.71 and 0.72, with confidence intervals indicating that SVM and NN are significantly different from the parametric approaches, in particular PCLR, for either of the accuracy parameters in Table 5.

The prediction density plots (Fig. 3) further support the inference drawn from the Kappa and AUC values. For the PCLR and GPLS plots there is a higher degree of overlap between the predictions for the two trait classes, while the non-parametric models yield plots that show a more pronounced separation between the two classes.

The results of the tuning process for the PCLR and GPLS models, summarized in Table 4 and shown in Fig. 2, are noteworthy. The accuracy (AUC) and the rate of agreement (Kappa) increase with the number of PCs and LVs until they reach a plateau, after which the two models converge. However, GPLS reaches higher values with fewer components than the PCLR model. When the models are run with 5 PCs and 6 LVs, which represent 95% of the total explained variance, GPLS performs better than PCLR, whilst the PCLR model, with an AUC value close to 0.5, is similar to a random outcome. The minimum errors for both models were reached at 45 PCs and 20 LVs, respectively. Kappa values follow a similar pattern to the AUC in relation to the number of PCs and LVs. Thus the GPLS model requires fewer components to reach Kappa = 0.4 and an AUC close to 0.7.

Discussion

This study allows us to confidently assert that resistance to stem rust in wheat landraces is not randomly distributed geographically but linked to agro-eco-climatic factors existing within collection site environments. This is supported by the findings of Bonman et al. (2007), who reported that the distribution of stem rust resistance in wheat landraces was linked to regions of origin. Thus, we can hypothesize that the emergence of resistance to other pathogens is also likely to be linked to long-term environmental trends acting upon host-pathogen systems. This is hardly surprising given that pathogens generally have optimal environmental conditions within which they thrive, thus placing selection pressure on in situ host populations for the emergence of resistance. The assertion here is that the environment strongly influences gene flow, natural selection and thus spatial/geographic differentiation for specific traits (Wu et al. 1975; Spieth 1979; Epperson 1990). In the case of powdery mildew, for example, Paillard et al. (2000) report that the populations of winter wheat with the highest level of resistance to powdery mildew originated from sites where powdery mildew pressure was high, due to environmental factors, while the reverse was true of those populations where the pressure was low. A practical application of this was demonstrated by Bhullar et al. (2009) who, after applying a FIGS approach to selecting wheat landraces for an eco-tilling exercise focused on the Pm3 region, found that forty percent of the collection sites chosen yielded genotypes resistant to the isolates used. The study went on to reveal 7 new resistance alleles for the Pm3 gene.

On the other hand, in a similar study aimed at investigating taxonomic and biogeographic predictivity of resistance to 32 pests and diseases in wild relatives of cultivated potato, Spooner et al. (2009) report that resistance to only six pests and diseases could be reliably predicted by environmental variables. They concluded that the more efficient strategy for mining genetic resource collections is to carefully screen core collections. They mentioned, however, a number of factors that could have affected their results, such as the scale of the climate grids from which the climate variables were extracted. A recent study by Endresen et al. (2011) indicated higher predictive performance when using finer resolutions for the climate data, with grid sizes of 1, 4.5, 9.3, and 18.5 km. The grain or grid size has been an issue in ecological research, where it is acknowledged that further studies are needed on the resolution (grid size)-dependency paradigm detailed in the MacArthur and Wilson (1967) bio-geographical “island theory”, which originally led to the inclusion of area size as a paradigm in ecology (Malanson and Armstrongy 1990; Mann and Benwell 1995). Further, it is suggested that trait-environment relationships may be influenced by the scale at which both independent and dependent variables are measured (Cushman and McGarigal 2004; Tautenhahn et al. 2008). In addition to the issue of scale, and as illustrated in Fig. 1, climatic variables also tend to be highly skewed and, in preliminary exploration of the modeling reported here, the modeling techniques performed better when the variables were transformed.

Notwithstanding the above, in preliminary work for this study it was clear that the predictability of the models decreased as the number of variables used decreased. This is supported by Stockwell (2007) who suggests that additional data layers are required to produce more efficient ecological models. Stockwell (2007) further asserts that some entities may not be modeled using restricted variables. In the study reported by Spooner et al. (2009), 38 variables were used including 12 each for rainfall, minimum and maximum temperatures. In this study 60 variables were used that include potential evapo-transpiration and aridity, both of which contain information about humidity, a factor which is widely reported to be of critical importance to the development of fungal pathogens such as stem rust.

In this context a discussion of the performance of the models used in this study is warranted. The results demonstrate that modeling techniques, such as those reported here, can provide a predictive framework to quantify trait-environment relationships and as such can be used to efficiently identify potentially valuable germplasm in genetic resource collections. The practical application of this has been clearly demonstrated by El-Bouhssini et al. (2009, 2010), who discovered multiple sources of resistance to the Syrian biotype of Russian wheat aphid and to Sunn pest in bread wheat for the first time using a FIGS approach, after having unsuccessfully screened thousands of genebank accessions. Thus, the availability of environmental data provides an opportunity to improve the use of germplasm through the quantification of such trait-environment relationships. In this context Webb et al. (2010) regard the type and quality of data as so important that they constitute non-trivial limitations to such trait-based approaches.

For the PCLR and GPLS based regression models, the predictions derived by using the number of PCs and LVs that account for 95% of the variance in the climate variables are not much different from a random selection. This was unexpected and has implications for setting up subsets of accessions for desirable traits using a PCLR approach to cluster accessions within the generally accepted levels of variance explanation. Adding more PCs or LVs improved the prediction with both models, GPLS requiring fewer components than PCLR, suggesting that some of the environmental variables that correlate with the trait response (resistant or susceptible) were not captured by PCLR. The pattern whereby PLS-type models need fewer components than PCA-based models is common and expected, as found, for example, by Wu et al. (2008). Moreover, PLS logistic regression is more likely to lead to a coherent model (Bastien et al. 2005). The fact that NN and SVM perform better than the PCA and PLS based models indicates, however, that a large amount of the variance in the climate data is non-linear (Wu et al. 2008). The out-of-range predictions (negative values below 0 and values greater than 1) indicate that the trait-environment relationship is more likely to be non-linear (Fig. 3).

The performance of the different models, based on their discriminatory ability as indicated by the AUC and Kappa values, is also related to the decomposition of the variables. When both trait and environmental variables are decomposed together using GPLS, the AUC and Kappa values are relatively higher, indicating that the model retains relevant structure and information and thus allows a better discrimination. Similarly, when the prediction is applied directly to the data using machine learning techniques, the AUC and Kappa values are higher, indicating an improved potential for discriminating between the two groups.

Although both the AUC and Kappa values of the test datasets are, as expected, lower than those of the training data, the results show acceptable values. Confidence intervals of the SVM and NN models in particular do not overlap with those of PCLR. Overall, non-linear models perform better than linear models even when all the data structure is retained for both types of models.

Machine-learning based techniques such as RF, NN and SVM, in combination with fuzzy based approaches, have the potential to yield better predictions (Kampichler et al. 2010). Such techniques are also more suited to data with a large number of variables. The only limitation, as in the case of SVM, may be due to the range of environmental variables (Drake et al. 2006). RF has the potential to yield better results; however, it is computationally expensive, requiring large memory and significantly more run time. NNs are powerful predictive tools although difficult to interpret (Jeschke and Strayer 2008). SVM is more rapid and tends to distinguish optimally between groups and predict entities while minimizing the loss of information (Guo et al. 2004; Karatzoglou et al. 2006). The advantage of learning-based techniques is that they need fewer assumptions and are more suitable when highly complex non-linear relationships are expected among the input variables (Tirelli et al. 2009).

The results also suggest a number of other issues that may improve predictive performance. The data was composed of several taxa of wheat. Partitioning into sub-populations or genetic backgrounds/lineages may lead to an increase in model accuracy, since compact or clumped population distributions are easier to model than widespread and scattered ones (Hernandez et al. 2006; Hanspach et al. 2010). Reclassification of trait states has been shown in previous work by Endresen et al. (2011) to improve predictions. Both the assessment of trait variation and the effect of the probability distribution of the trait are to be investigated further. Trait distributions vary across any set of accessions and affect the optimal search process (Gollin et al. 2000). They may also vary depending on the type and degree of virulence of races or, in the case of insects, biotypes.

When optimally tuned, all the models reported here displayed predictive powers superior to a random selection and thus are adequate to explore genetic resource collections for novel sources of resistance to stem rust and, it is argued, to other pathogens. However, the indications are that the SVM and NN algorithms have a higher discriminating power and are more robust, and thus they will be used in further explorations of this kind. In fact, it is difficult to envisage further predictive gains being made through the use of alternate models. Rather, it is more likely that gains will come from an appropriate mixture of independent variables that are known to influence the selection pressure in question. For example, this study and that of Spooner et al. (2009) used maximum and minimum monthly temperatures. However, since pathogens are known to respond more to diurnal temperature variation or average temperatures than to absolute maximum or minimum temperatures, it is possible that greater resolution would be gained by expressing temperature in these terms.

Another line of research being explored by the authors is to consider climatic variables within the context of site-specific growing seasons. For example, the minimum temperature in a given month may be important at one location but at another it may have no relevance because it falls outside that site's growing season. It is proposed here that instead of using long-term climatic averages expressed as monthly values, they could be expressed as averages for stages in a crop's development. Thus, in the modeling process, the noise created by differences in phenology between sites would be eliminated, facilitating higher resolution in detecting environment-trait linkages. Further variables could be created by counting the number of days in a season that meet certain climatic conditions known to be favorable to the pathogen.

Conclusion

It is concluded here that the FIGS approach will improve the use of germplasm as a priori information becomes more available through improved modeling techniques or other approaches. Gollin et al. (2000) stated that such a priori information is extraordinarily valuable, insofar as it constitutes specialized knowledge, and those authors expected that technology would gradually provide further substitutes for such valuable information. This study demonstrated that modeling techniques such as those explored here provide a predictive framework to quantify the trait-environment relationship that will help in using genetic resources more effectively. The availability of environmental data is providing the opportunity to improve the use of germplasm through the quantification of such trait-environment relationships. As Webb et al. (2010) point out, the type and quality of data are so important that they must be considered non-trivial limitations of such trait-based approaches.