Introduction

Crossover interaction refers to changes in the ranking of genotypes caused by the lack of genotypic correlation and negative correlations between environments, which is the most critical source of genotype-by-environment interaction (GEI) for plant breeders (Cooper and Delacy 1994; Crossa et al. 2004). Cultivar development programs for crops evaluate experimental genotypes (i.e., prior to release) in multi-environmental trials (MET) to (i) depict GEI patterns for future cultivar placement and (ii) increase the accuracy of selection. Therefore, analytical methods that fully explore the GEI patterns from MET are needed for decision-making (Malosetti et al. 2013; van Eeuwijk et al. 2016; Dias et al. 2022; Tolhurst et al. 2022).

The first attempt to consider the GEI in plant breeding was proposed by Yates and Cochran (1938), who decomposed the part due to the interaction from the total phenotypic variation. Later, Finlay and Wilkinson (1963) used marginal environmental means as independent variables in the regression analysis to depict GEI, and several approaches were developed within that framework (Eberhart and Russell 1966; Li et al. 2018). Multivariate techniques such as additive main effects and multiplicative interaction (AMMI) (Gauch Jr and Zobel 1997) and the genotype plus GEI (GGE) biplot (Yan et al. 2000) have also been extensively used (Yan et al. 2007; Balestre et al. 2009; Silva et al. 2021). Further model expansions were made possible by the development of the linear mixed model equations (Henderson 1949, 1950), which allowed for the incorporation of covariance between relatives and environments and the relaxation of assumptions such as homogeneous residual variances (Piepho et al. 2008). Factor-analytic (FA) mixed models (Piepho 1997; Smith et al. 2001) can be employed to explore the covariance between environments. These models offer the flexibility to account for heterogeneous genotypic (or genetic) covariances between environments using a few latent variables known as factors (K). In addition to the overall (i.e., across environments) and conditional (i.e., within environments) performance, metrics such as stability and sensitivity can also be computed from FA models to facilitate the decision-making process (Stefanova and Buirchell 2010; Cullis et al. 2014; Dias et al. 2018; Smith and Cullis 2018; Smith et al. 2021).

An extension to statistical models that address GEI involves incorporating environmental information, such as physical and chemical soil properties, as well as environmental features like temperature and rainfall (Tolhurst et al. 2022). The advantages of integrating environmental features into a prediction model include (i) the capability to untangle the environmental determinants and main drivers of crossover GEI and (ii) the ability to predict phenotypic performance in yet-to-be-seen environments (Sae-Lim et al. 2014; Oliveira et al. 2020; Tolhurst et al. 2022). Furthermore, categorizing similar environments into homogeneous groups facilitates resource optimization and the identification of mega-environments (Wood 1976; Denis 1988; Van Eeuwijk and Elgersma 1993; Millet et al. 2019; Costa-Neto et al. 2021c; Krause et al. 2022). Therefore, advances in computational resources, along with the development of geographic information systems (GIS) techniques, are essential for designing novel prediction strategies in MET (Cooper and Messina 2021; Rogers et al. 2021; Cooper et al. 2022; Diepenbrock et al. 2022).

GIS techniques have been defined as computer-based systems used for analyzing and interpreting spatially referenced information and are powerful tools in the integration of genetics and environmental information (Beebe et al. 1997; Guarino et al. 2002; Jarquín et al. 2014; Hernández et al. 2019; Costa-Neto and Fritsche-Neto 2021). For example, Annicchiarico et al. (2006) identified consistent genotype-by-location interactions using GIS-based models to recommend cultivars for durum wheat in Algeria. Costa-Neto et al. (2020) applied a GIS-based tool with factorial regression to analyze spatial trends and create thematic maps of yield performance for upland rice in Brazil. In addition, Costa-Neto et al. (2021b) integrated GIS techniques with nonlinear kernels to model additive, dominance, and GEI effects. All the mentioned techniques fall under the umbrella of “envirotypic-assisted selection,” which integrates genomic and environmental data to improve the accuracy of selection in plant breeding programs (Resende et al. 2021).

The combination of statistics, quantitative genetics, and GIS techniques enabled the introduction of the field of enviromics in the plant breeding community (Cooper et al. 2014; Xu 2016; Costa-Neto and Fritsche-Neto 2021). Coupled with knowledge from plant ecophysiology, this field aims to describe how the environment impacts plant development and the phenotypic plasticity of important agronomic traits (Costa-Neto and Fritsche-Neto 2021). Accordingly, envirotypes are all sources of environmental variation related to plant development that can act as environmental markers in statistical genetics models to predict genotypic effects in non-evaluated environments (Xu 2016; Resende et al. 2021). However, integrating phenotypic and genomic data with environmental features can generate two statistical problems: high correlation among predictors, resulting in multicollinearity, and the curse of dimensionality when the number of observations is smaller than the number of predictors. In these situations, methods such as partial least squares (PLS), which combine features from principal components analysis and multiple regression (Wold et al. 2001), and Bayesian factor analytic models (Nuvunga et al. 2019), can be applied to identify linear combinations of predictors that capture the underlying structure of the data (Montesinos-López et al. 2022a,b).

Here, we present a novel predictive breeding approach called GIS-FA that combines FA, PLS, and enviromics to predict the phenotypic performance of experimental genotypes in untested environments. The GIS-FA uses environmental information collected from GIS tools to predict the factor loadings of untested environments via PLS, where the estimated factor loadings from the observed environments are used as the training set. The empirical best linear unbiased predicted values (eBLUPs) of genotypic means in untested environments are then calculated as the linear combination of the predicted loadings via PLS and genotypic scores from the FA models. We hypothesize that the GIS-FA model has higher prediction accuracy compared to a PLS model trained with eBLUPs within observed environments (henceforth called GIS-GGE). We tested this hypothesis using two MET datasets from Brazil: rice trials located in the Brazilian Savanna (Cerrado) and the Amazon rainforest, as well as soybean trials located in the state of Mato Grosso do Sul. Thus, this study aims to: (i) propose the GIS-FA methodology for predicting genotypes’ performance in untested environments and compare its predictive ability with the GIS-GGE methodology; (ii) apply GIS-FA to select the best-ranking genotypes based on their overall performance (OP) and stability using the FA selection tools; and (iii) create thematic maps that illustrate the genotypes’ performance across environments in the breeding zone.

Material and methods

Phenotypic data

We exemplify the GIS-FA model using two datasets from MET covering tropical areas in Brazil. These trials have been used to make decisions regarding the release of cultivars by both public and proprietary breeding organizations. The soybean dataset contains three years of field trials conducted in the state of Mato Grosso do Sul (represented by triangles in Fig. 1), whereas the rice dataset includes two years of field trials conducted across eight states (represented by circles in Fig. 1). It is important to note that elevation varies across the studied area (Fig. 1b). This factor, along with latitude and longitude, influences changes in both weather and soil conditions, as indicated by the Köppen–Geiger classification (Alvares et al. 2013) in Fig. 1c and the Brazilian Soil Classification System (Santos 2018) in Fig. 1d. Both datasets include field trials planted in the same location and year but during different planting seasons. Thus, henceforth, the term “environment” refers to the combination of location, year, and planting season. Another characteristic shared by both datasets is that not all genotypes were evaluated in all environments (Supplementary Figure 1). There are three main reasons for this: (i) seed availability, (ii) discarding low-performing lines at the end of each agricultural year, and (iii) including cultivars/genotypes from partner breeding programs for evaluation in the target population of environments (TPE). It is expected that the inclusion/exclusion of selected candidates in the MET does not yield relevant bias in the variance component estimates (Piepho and Möhring 2006; Hartung and Piepho 2021).

Fig. 1 Maps of the studied area. a Shows the map of Brazil, highlighting the states where the rice (circles) and soybean (triangles) trials were conducted. We subset these states in b–d. b Depicts the elevation in meters, c displays the Köppen–Geiger classification (Alvares et al. 2013), and d highlights the Brazilian soil classification adapted to FAO classification (Santos 2018; FAO 2014)

Rice dataset

The rice dataset is composed of 80 pure lines developed by the Brazilian Agricultural Research Corporation (Embrapa Rice and Beans). These pure lines plus three commercial cultivars were evaluated for their value of cultivation and use (VCU) in 21 environments during the cropping seasons of 2009/2010 and 2010/2011. Candidate cultivars that demonstrate high yield and agronomic stability in the TPE will be registered for commercial use. The TPE of the Upland Rice Breeding Program is located within the geographical coordinates of 1\(^{\circ }\) North to 17\(^{\circ }\) South and 42\(^{\circ }\) West to 70\(^{\circ }\) West. It includes eight states from the Central-West (Mato Grosso and Goiás), the Northeast (Maranhão and Piauí), and the North (Pará, Rondônia, Roraima, and Tocantins). Further details are presented in Supplementary Table 1. Eighteen locations were sampled in the TPE (Fig. 1), where trials were arranged in randomized complete blocks with four replications. Experimental plots consisted of four 5 m rows spaced 0.3 m apart, totaling an area of 6 m\(^2\), with 60 seeds sown per meter. Seed yield (kg ha\(^{-1}\)) was measured in the two central rows. Management practices in these regions followed the technical recommendations adopted for upland rice.

Soybean dataset

The soybean dataset comprises 195 pure lines that were evaluated over three cropping seasons (2019/2020, 2020/2021, and 2021/2022) at 13 locations in the state of Mato Grosso do Sul, in the Central-West region of Brazil (Fig. 1). Trials were conducted under rainfed conditions and overseen by the Mato Grosso do Sul Foundation (Fundação MS) in 49 different environments. The experimental design was randomized complete blocks with three replications. The plots consisted of five 12 m-long rows spaced 0.5 m apart, with a total area of 30 m\(^2\). Seed yield (kg ha\(^{-1}\)) was measured in the three central rows and corrected to 13% moisture. Weed and pest control were carried out following the recommendations for the region.

GIS-FA workflow

Here, we will summarize the procedures for applying the GIS-FA methodology. The method was created to evaluate the OP and stability of genotypes in untested environments and to plot the spatial prediction on thematic maps. This enables breeders to define strategies for recommending adaptable cultivars, prospect new target environments that maximize genetic gain through selection, and define breeding zones based on the pattern of environmental features. The procedures to apply the GIS-FA are:

  • Step 1—Geographic data collection from tested and untested environments: To implement the GIS-FA method, it is imperative to acquire geographic information. This includes, but is not limited to, latitude and longitude. For the tested environments, such data can be obtained in situ in the experimental area or via GIS tools. For the untested environments, one can sample pixels (coordinates) of the breeding region (or the area under consideration for prediction). These pixels must be representative of the different environmental conditions found in the breeding region. We detail the sampling process adopted in this study in section “Environmental information.”

  • Step 2—Environmental data collection: This step requires information on the sowing and harvest times for each trial. More detailed results can be achieved by using genotype-specific harvest dates. The process of envirotyping (data collection and processing) is crucial for understanding the environmental factors that drive the G \(\times\) E interaction and shape the development of the plant (Cooper et al. 2014; Xu 2016; Costa-Neto et al. 2021a). Environmental features can be obtained in the form of in situ data (e.g., from sensors attached to drones or high-throughput phenotyping stations) or in raster format (e.g., historical series for a given geographic point stored on online platforms as rasters). Other methods of obtaining these data include meteorological stations, the National Centers for Environmental Information (NCEI) (NOAA 2023), the Climate Forecast System Reanalysis (CFSR) (CFSR 2018), the European Centre for Medium-Range Weather Forecasts (ECMWF) (ECMWF 2023), the Global Historical Climatology Network (GHCN) (GHCNd 2023), the NASA Earth Observing System Data and Information System (EOSDIS) (EOSDIS 2023), WorldClim (Fick and Hijmans 2017), Climatologies at High Resolution for the Earth’s Land Surface Areas (CHELSA) (CHELSA 2023), and the Climate Research Unit Time-Series (CRU TS). Soil data can be collected through analysis conducted in the experiment itself or obtained from databases such as SoilGrids (SoilGrids 2022). We detail the collection of environmental features in both datasets analyzed in section Environmental information. The incorporation of environmental features in statistical-genetic models is based on Shelford’s Law (Shelford 1911), which states that the growth of a species is regulated by environmental factors (within a range of maximum and minimum values). The environmental features can serve as environmental markers, enabling a deeper understanding of phenotypic expression. 
This concept was introduced in the context of G \(\times\) E analysis for plant breeding by Costa-Neto et al. (2021a), who provide further details of its theoretical application. In this case, there is an association between the environmental marker and the evaluated genotype. Environmental features can also be used to characterize both tested and untested environments, allowing for the determination of the similarity of the sampled points to the TPE (see section Environmental similarity and interpolation grid for details).

  • Step 3—Phenotypic data analysis: In this step, we fit FA models with different numbers of factors and choose one based on parsimony and/or explanatory ability (as detailed in section FA model selection). After choosing the model, we use the FA selection tools (Stefanova and Buirchell 2010; Smith and Cullis 2018) to build a selection index and select the best-ranking genotypes across different environments (further details in section Selection tools for overall performance and stability).

  • Step 4—Prediction for the untested environments: The matrix of rotated loadings of the chosen FA model is used to train a PLS regression model with the gathered environmental features. The goal is to predict the factor loading of untested environments only by providing the model with environmental information about these locations. Once the loadings are predicted, they are used in linear combinations with the experimental genotypes’ factor scores to predict the eBLUPs in untested environments. This process is thoroughly detailed in section Spatial predictions in the breeding zone.

  • Step 5—Map-based recommendation: The prediction phase provides the performance of each genotype in the new locations that were sampled in the first step. To extrapolate to the whole breeding region, an interpolation process is required (detailed in section Environmental similarity and interpolation grid). We proposed three types of thematic maps, considering interpolation: (i) adaptation zones, which allow for the identification of adaptation areas for each genotype, i.e., areas where genotypes are expected to have better responses to the local environmental effects; (ii) pairwise comparisons, which compare the performance of two genotypes (or a genotype and a commercial check) in untested environments; and (iii) which-won-where, used to identify the most promising experimental genotypes in the breeding region. At (i) and (ii), one can make a pre-selection of which genotypes to evaluate using the FA selection tools and perform a detailed study about these selection candidates’ adaptation throughout the breeding region.
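Steps 4 and 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the loadings of the untested environments are taken as already predicted (the PLS step itself is not shown), the lack-of-fit term is ignored, and the function names `predict_eblups` and `which_won_where` as well as all numeric values are hypothetical.

```python
def predict_eblups(loadings, scores):
    """Combine loadings (environments x K) with scores (genotypes x K).

    Returns a genotypes x environments table of predicted genotypic
    effects, ignoring the lack-of-fit term.
    """
    return [
        [sum(l_k * f_k for l_k, f_k in zip(env_loadings, geno_scores))
         for env_loadings in loadings]
        for geno_scores in scores
    ]


def which_won_where(eblups, genotype_names):
    """Return, for each environment, the name of the top-ranked genotype."""
    n_env = len(eblups[0])
    winners = []
    for u in range(n_env):
        best = max(range(len(eblups)), key=lambda v: eblups[v][u])
        winners.append(genotype_names[best])
    return winners


# Hypothetical values: 3 untested environments, 2 genotypes, K = 2 factors.
predicted_loadings = [[0.5, 0.2], [0.4, -0.3], [0.6, 0.0]]
rotated_scores = [[10.0, 5.0], [8.0, -10.0]]
eblups = predict_eblups(predicted_loadings, rotated_scores)
winners = which_won_where(eblups, ["G1", "G2"])
```

In this toy example the second genotype wins only in the environment with a negative loading on the second factor, which is the kind of crossover pattern the which-won-where map is meant to reveal.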

Environmental information

We used 32 environmental features in this study, including three geographical coordinates (altitude, latitude, and longitude), 16 related to weather conditions, and 13 soil traits (Table 1). The weather variables for each environment were obtained as daily averages for the growing season (i.e., between sowing and harvest dates) and processed using the R (version 4.2.3, R Core Team 2023) package EnvRtype (Costa-Neto et al. 2021c), which retrieves raw data from the NASA database (Sparks 2018; NasaPower 2022). Most of the soil variables for each location (i.e., latitude/longitude combination) were acquired using the geodata package (Hijmans et al. 2023), which downloads rasters from the SoilGrids platform (SoilGrids 2022). Only the raster data for soil temperature, isothermality, temperature seasonality, and mean diurnal range were manually downloaded from the platform of Lembrechts et al. (2022). Soil rasters were downloaded for a depth interval of 5–15 cm with a resolution of 30 arcseconds. Each pixel represents an area of approximately 1 km\(^2\) and was processed using the raster package (Hijmans 2020).

In this study, we performed spatial predictions using environmental information in a three-step procedure: (i) defining the scope of the prediction area based on the political borders of the Brazilian states where trials were conducted; (ii) sampling a cloud of geographical points (latitude/longitude) at which environmental data were collected, with fifty points sampled from each municipality within the states to ensure an unbiased sampling of possible environmental conditions; and (iii) using the data from (ii) to perform a spatial interpolation covering the entire area of the state(s) and to compute the spatial predictions. In (ii), the soil-related environmental features were obtained as previously described for the tested environments. Monthly averages for the weather-related environmental features were obtained from 2000 to 2021. Further details are provided in the following sections.

Table 1 Summary statistics of the 32 environmental features classified into three groups: geographical, climatic, and soil-related

Environmental similarity and interpolation grid

The package pdist (Wong 2022) was used to quantify the environmental similarity by calculating the Euclidean distances between the observed and unobserved (i.e., sampled points) environments. Let \(\textbf{W}\) be a \(J \times P\) matrix of scaled values representing P environmental features in J observed environments, and let \(\boldsymbol{\Omega }\) be a \(U \times P\) matrix containing the same information for U unobserved environments. The environmental features were scaled to unit variance. Then, the Euclidean distance between an observed environment j and an unobserved environment u (\(D_{ju}\)) is the distance between the rows of \(\textbf{W}\) and \(\boldsymbol{\Omega }\) that correspond to j and u, respectively:

$$\begin{aligned} D_{ju} = \sqrt{\sum _{p=1}^P (w_{jp} - \omega _{up})^2} \end{aligned}$$
(1)

where \(w_{jp}\) and \(\omega _{up}\) are entries of \(\textbf{W}\) and \(\boldsymbol{\Omega }\) that represent the value of the pth environmental feature for the jth tested environment and the uth untested environment, respectively.
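Eq. (1) can be illustrated with a minimal Python sketch. It assumes that the rows of both matrices are already scaled; the function name `euclidean_distances` and the toy values are hypothetical and are not taken from the pdist package.

```python
import math


def euclidean_distances(W, Omega):
    """Return the J x U table of distances between rows of W and Omega.

    W holds the scaled features of the J tested environments and Omega
    those of the U untested environments (same P columns in both).
    """
    return [
        [math.sqrt(sum((w - o) ** 2 for w, o in zip(w_row, o_row)))
         for o_row in Omega]
        for w_row in W
    ]


# Toy example with J = 2 tested and U = 2 untested environments, P = 2.
W = [[0.0, 0.0], [3.0, 4.0]]
Omega = [[0.0, 0.0], [6.0, 8.0]]
D = euclidean_distances(W, Omega)
```

Here the first untested environment is identical to the first tested one (distance 0), whereas the second untested environment lies 10 and 5 units away from the two tested environments.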

After calculating the distances between all J and U environments, we extended these results to all possible environments within the delimited prediction area using the inverse distance weighting (IDW) interpolation method, implemented in the spatstat package (Baddeley et al. 2015). Let \(u^\star\) represent an untested and unsampled environment (\(u^\star = 1, 2, \ldots , U^\star\), with \(U^\star \gg U\)). The interpolated distance between a given j and \(u^\star\) is defined as:

$$\begin{aligned} D_{u^\star j} = \frac{\sum _{u=1}^U \frac{1}{||u^\star -x_u||^\tau }D_{uj}}{\sum _{u=1}^U \frac{1}{||u^\star -x_u||^\tau }} \end{aligned}$$
(2)

where \(||u^\star -x_u||\) represents the spatial Euclidean distance between \(u^\star\) and a given sampled point \(x_u\) within the observation window, and \(\tau\) is the power parameter of the distance weights, determined through cross-validation (CV). Values of \(\tau\) ranging from 0.1 to 5.0, in increments of 0.1, were tested in the CV, and the value that yielded the lowest mean squared error between the predicted and observed values at the sampled points was selected.
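The interpolation in Eq. (2) and the grid search for \(\tau\) can be sketched as follows. This is not the spatstat implementation: it assumes planar coordinates and a leave-one-out CV scheme (the text does not state which CV scheme was used), and the function names and toy values are hypothetical.

```python
import math


def idw(target, points, values, tau):
    """Inverse distance weighted mean of `values` observed at `points`."""
    num = den = 0.0
    for point, value in zip(points, values):
        dist = math.dist(target, point)
        if dist == 0.0:  # target coincides with a sampled point
            return value
        weight = 1.0 / dist ** tau
        num += weight * value
        den += weight
    return num / den


def loo_mse(points, values, tau):
    """Leave-one-out mean squared error of IDW at the sampled points."""
    errors = []
    for i in range(len(points)):
        rest_points = points[:i] + points[i + 1:]
        rest_values = values[:i] + values[i + 1:]
        predicted = idw(points[i], rest_points, rest_values, tau)
        errors.append((predicted - values[i]) ** 2)
    return sum(errors) / len(errors)


def choose_tau(points, values):
    """Grid search over tau = 0.1, 0.2, ..., 5.0 (lowest LOO error wins)."""
    grid = [k / 10 for k in range(1, 51)]
    return min(grid, key=lambda tau: loo_mse(points, values, tau))


# Toy example: distances D_uj of five sampled points to one tested environment.
sampled_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
distances_to_j = [1.0, 2.0, 2.0, 3.0, 2.0]
tau = choose_tau(sampled_points, distances_to_j)
d_new = idw((0.25, 0.75), sampled_points, distances_to_j, tau)
```

Because the IDW prediction is a weighted average with positive weights, the interpolated value always lies between the smallest and largest sampled values.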

Once we have performed the interpolation and obtained the distances between all tested and untested environments, we define the environmental similarity between the uth (or \(u^\star\)th) untested environment and the observed environments of the TPE as the minimum distance of u (or \(u^\star\)) to any j, so that smaller values indicate greater similarity:

$$\begin{aligned} S_u = \min _{j}(D_{uj}) \quad \text {and} \quad S_{u^\star } = \min _{j}(D_{u^\star j}) \end{aligned}$$
(3)

Phenotypic analysis

The phenotypic analyses across environments for both datasets were performed in the ASReml-R package (version 4.1.2, The VSNi Team 2023), with variance components estimated using residual maximum likelihood (Patterson and Thompson 1971). The following linear mixed model (Henderson 1949, 1950) was fitted:

$$\begin{aligned} \textbf{y} = \mu \textbf{1} + \textbf{X}_1 \textbf{s} + \textbf{X}_2 \textbf{r} + \textbf{Z}_1 \textbf{g} + \boldsymbol{\epsilon } \end{aligned}$$
(4)

where \(\textbf{y}\) is the vector of phenotypic records, \(\mu \textbf{1}\) is the intercept, \(\textbf{s}\) is the vector of fixed effects of environments with design matrix \(\textbf{X}_1\), \(\textbf{r}\) is the vector of fixed within-environment block effects with design matrix \(\textbf{X}_2\), \(\textbf{g}\) is the vector of random genotypic effects nested within environments with incidence matrix \(\textbf{Z}_1\), and \(\boldsymbol{\epsilon }\) is the residual term. The distributional assumptions for \(\textbf{g}\) and \(\boldsymbol{\epsilon }\) are detailed below.

Using the available information on the coordinates (row and column) of each plot in the soybean dataset, we implemented a strategy to control the spatial trends in a single step, following the approach proposed by Gogel et al. (2018). In summary, we conducted model testing in each environment, considering spatial analysis. These adjustments included incorporating autoregressive processes in the error term as well as linear and nonlinear effects as fixed or random terms, as previously demonstrated by Gilmour et al. (1997). We identified the best-fitting model for each specific environment. Once we determined the optimal model for each environment, we incorporated the factors from these models into Eq. (4). Each additional factor followed a block diagonal covariance structure, with non-nil effects only for environments where these factors were present in the best within-environment model. Detailed information about this procedure can be found in Supplementary Table 2. For spatially adjusted trials, the residual effects are distributed as \(\epsilon \sim {MVN}(\textbf{0}, \, \oplus ^J_{j=1} \sigma ^2_{\epsilon _j} [\boldsymbol{\Gamma }_{C_j} \otimes \boldsymbol{\Gamma }_{R_j}])\), where \(\boldsymbol{\Gamma }_{C_j}\) and \(\boldsymbol{\Gamma }_{R_j}\) are autocorrelation matrices of dimensions \(C_j \times C_j\) and \(R_j \times R_j\), respectively. Here, \(C_j\) represents the number of columns, and \(R_j\) represents the number of rows in the jth trial. These matrices have a value of 1 on the diagonal, and the off-diagonal elements represent the autocorrelation coefficients that quantify the spatial trends in the column or row directions. For environments where no spatial adjustment was necessary, \(\epsilon \sim {MVN}(\textbf{0}, \, \oplus ^J_{j=1} \sigma ^2_{\epsilon _j} \textbf{I}_{N_j})\), where \(\textbf{I}_{N_j}\) is an identity matrix of order \(N_j\), which corresponds to the number of phenotypic records per environment. 
Here, \(\oplus\) represents the direct sum, which generates a block diagonal matrix, and \(\otimes\) denotes the Kronecker product. For the rice dataset, since we did not have access to spatial information, \(\epsilon \sim {MVN}(\textbf{0}, \, \oplus ^J_{j=1} \sigma ^2_{\epsilon _j} \textbf{I}_{N_j})\).

Genotypic effects were modeled using the FA covariance structure (Piepho 1997; Smith et al. 2001):

$$\begin{aligned} \textbf{g} = (\hat{\boldsymbol{\Lambda }} \otimes \textbf{I}_V) \tilde{\textbf{f}} + \tilde{\boldsymbol{\delta }} \end{aligned}$$
(5)

where \(\hat{\boldsymbol{\Lambda }}\) is the \(J \times K\) matrix of K loadings for the J environments (\(\hat{\boldsymbol{\Lambda }} = \{ \hat{\lambda }_{k_j} \}\)), \(\tilde{\textbf{f}}\) is a vector of K scores for the V genotypes (\(\tilde{\textbf{f}} = \{ f_{k_v}\}\)), and \(\tilde{\boldsymbol{\delta }}\) is the vector of the VJ lack of fit effects (\(\tilde{\boldsymbol{\delta }} = \{ \hat{\delta }_{v_j}\}\)). \(\textbf{I}_V\) is an identity matrix of order V. \(\tilde{\textbf{f}}\) and \(\tilde{\boldsymbol{\delta }}\) are independent and distributed as multivariate Gaussian with zero means and variances given by \(\textbf{D} \otimes \textbf{I}_V\) and \(\boldsymbol{\Psi } \otimes \textbf{I}_V\), respectively. \(\textbf{D}\) is a \(K \times K\) symmetric positive (semi)-definite factor score variance matrix, and \(\boldsymbol{\Psi }\) is a \(J \times J\) diagonal matrix of environment-wise variances that were not captured by any factor (\(\hat{\boldsymbol{\Psi }} = \{ \hat{\psi }_j \}\)). For more information about the estimation process of \(\hat{\boldsymbol{\Lambda }}\), \(\tilde{\textbf{f}}\), and \(\tilde{\boldsymbol{\delta }}\), refer to Smith et al. (2001), Thompson et al. (2003), and Tolhurst et al. (2022).

Rotation

We followed the rotation process recommended by Smith et al. (2021), where two constraints are imposed for the sake of interpretability: \(\textbf{D}\) is a diagonal matrix with elements arranged in decreasing order, and \(\boldsymbol{\Lambda }^\prime \boldsymbol{\Lambda }\) is an identity matrix of order K, i.e., \(\boldsymbol{\Lambda }\) is composed of orthonormal columns. To meet these conditions, we performed the singular value decomposition of \(\hat{\boldsymbol{\Lambda }}\):

$$\begin{aligned} \hat{\boldsymbol{\Lambda }} = \textbf{U} \textbf{L}^{\frac{1}{2}} \textbf{V}^\prime \end{aligned}$$
(6)

where \(\textbf{U}\) is a \(J \times K\) orthonormal matrix whose columns are the eigenvectors of \(\hat{\boldsymbol{\Lambda }} \hat{\boldsymbol{\Lambda }}^\prime\), \(\textbf{L}\) is a \(K \times K\) diagonal matrix with elements given by the eigenvalues of \(\hat{\boldsymbol{\Lambda }} \hat{\boldsymbol{\Lambda }}^\prime\) in decreasing order, and \(\textbf{V}\) is a \(K \times K\) orthonormal matrix whose columns are the eigenvectors of \(\hat{\boldsymbol{\Lambda }}^\prime \hat{\boldsymbol{\Lambda }}\). Note that \(\textbf{U}\) meets the conditions of the second constraint, so \(\hat{\boldsymbol{\Lambda }}^\star = \textbf{U}\), in which \(\hat{\boldsymbol{\Lambda }}^\star\) is the matrix of rotated loadings. By considering \(\textbf{D} = \textbf{L}\), we fulfill the condition of the first constraint. The rotated scores were obtained as \(\tilde{\textbf{f}}^\star = (\textbf{D}^{\frac{1}{2}} \textbf{V}^\prime \otimes \textbf{I}_V) \tilde{\textbf{f}}\), where \(\tilde{\textbf{f}}^\star\) is the vector of rotated scores, so that \((\hat{\boldsymbol{\Lambda }}^\star \otimes \textbf{I}_V) \tilde{\textbf{f}}^\star = (\hat{\boldsymbol{\Lambda }} \otimes \textbf{I}_V) \tilde{\textbf{f}}\). After rotation, the conditional distribution of the genotypic effects is \(\textbf{g} \sim {MVN}[\textbf{0}, \, (\hat{\boldsymbol{\Lambda }}^\star \textbf{D} \hat{\boldsymbol{\Lambda }}^{\star ^\prime } + \hat{\boldsymbol{\Psi }}) \otimes \textbf{I}_V]\).
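A short check shows why the rotation leaves the fitted between-environment covariance unchanged. It assumes, for illustration only, that the unrotated scores have identity variance, so that the factor part of the genetic covariance before rotation is \(\hat{\boldsymbol{\Lambda }} \hat{\boldsymbol{\Lambda }}^\prime\). Substituting Eq. (6) and using \(\textbf{V}^\prime \textbf{V} = \textbf{I}_K\) gives:

$$\begin{aligned} \hat{\boldsymbol{\Lambda }} \hat{\boldsymbol{\Lambda }}^\prime = \textbf{U} \textbf{L}^{\frac{1}{2}} \textbf{V}^\prime \textbf{V} \textbf{L}^{\frac{1}{2}} \textbf{U}^\prime = \textbf{U} \textbf{L} \textbf{U}^\prime = \hat{\boldsymbol{\Lambda }}^\star \textbf{D} \hat{\boldsymbol{\Lambda }}^{\star ^\prime } \end{aligned}$$

Under that assumption, the rotation changes only how the factors are described (orthonormal loadings, factor variances in decreasing order), not the covariance implied by the model.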

FA model selection

FA models with different numbers of factors were fitted and compared in terms of their explanatory ability. We used the average semivariance ratio (ASR, Piepho 2019; Chaves et al. 2023a) as a selection criterion. By calculating the ratio between the average semivariance of \(\hat{\boldsymbol{\Lambda }}^\star \textbf{D} \hat{\boldsymbol{\Lambda }}^{\star \prime }\) and the average semivariance of \(\hat{\boldsymbol{\Lambda }}^\star \textbf{D} \hat{\boldsymbol{\Lambda }}^{\star \prime } + \boldsymbol{\Psi }\), it is possible to investigate the amount of total covariance that is being captured by the factors of the FA model. The ASR is given as follows:

$$\begin{aligned} \textrm{ASR} = \frac{\frac{2}{J(J-1)} \sum _{j=1}^{J-1} \sum _{{j^{\prime }}=j+1}^J{\frac{1}{2} \left( \sum _{k=1}^K{\hat{\lambda }^{\star ^2}_{k_j} d_k} + \sum _{k=1}^K{\hat{\lambda }^{\star ^2}_{k_{j^\prime }} d_k}\right) - \sum _{k=1}^K{\hat{\lambda }^\star _{k_{j}}\hat{\lambda }^\star _{k_{j^\prime }}d_k}}}{\frac{2}{J(J-1)} \sum _{j=1}^{J-1} \sum _{j^{\prime }=j+1}^J{\frac{1}{2} \left[ \left( \sum _{k=1}^K{\hat{\lambda }^{\star ^2}_{k_{j}}d_k} + \hat{\psi }_j\right) + \left( \sum _{k=1}^K{\hat{\lambda }^{\star ^2}_{k_{j^\prime }}d_k} + \hat{\psi }_{j^\prime }\right) \right] - \sum _{k=1}^K{\hat{\lambda }^\star _{k_{j}}\hat{\lambda }^\star _{k_{j^\prime }}d_k}}} \times 100 \end{aligned}$$
(7)

where \(d_k\) is the kth element of the diagonal of \(\textbf{D}\).
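The ASR in Eq. (7) is the ratio of two average semivariances, which can be sketched in Python as below. The sketch assumes that the rotated loadings are supplied as a \(J \times K\) nested list, the diagonal of \(\textbf{D}\) as a list of K values, and the specific variances as a list of J values; all function names and numbers are hypothetical.

```python
def factor_covariance(loadings, d):
    """Return Lambda* D Lambda*' for J x K loadings and K factor variances."""
    return [
        [sum(lj[k] * lm[k] * d[k] for k in range(len(d))) for lm in loadings]
        for lj in loadings
    ]


def average_semivariance(G):
    """Average over all pairs j < j' of (G_jj + G_j'j') / 2 - G_jj'."""
    J = len(G)
    total = 0.0
    for j in range(J - 1):
        for m in range(j + 1, J):
            total += 0.5 * (G[j][j] + G[m][m]) - G[j][m]
    return 2.0 * total / (J * (J - 1))


def asr(loadings, d, psi):
    """Average semivariance ratio (%) captured by the factors."""
    J = len(loadings)
    G_fa = factor_covariance(loadings, d)
    # Add the specific variances psi_j to the diagonal only.
    G_full = [
        [G_fa[j][m] + (psi[j] if j == m else 0.0) for m in range(J)]
        for j in range(J)
    ]
    return 100.0 * average_semivariance(G_fa) / average_semivariance(G_full)


# Toy example: J = 2 environments, K = 1 factor.
ratio = asr(loadings=[[1.0], [0.0]], d=[2.0], psi=[1.0, 1.0])
```

In this toy case the factor accounts for half of the average semivariance; with all specific variances equal to zero the ratio would be 100.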

We defined an ad hoc threshold of 75% for explanatory ability. As complementary information, we also estimated the proportion of genetic variance explained by the kth factor in the jth environment (\(v_{k_j}\), Smith et al. 2015):

$$\begin{aligned} v_{k_j} = \frac{{\hat{\lambda }^{\star ^2}_{k_j}d_k}}{\sum ^K_{k^\prime =1}{\hat{\lambda }^{\star ^2}_{k^\prime _j}d_{k^\prime }} + \hat{\psi }_j} \times 100 \end{aligned}$$
(8)
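Eq. (8) can be sketched as follows, reading the denominator as the total genetic variance of environment j (the sum over all factors plus the specific variance \(\hat{\psi }_j\)). The function name and input values are hypothetical.

```python
def variance_explained(loadings, d, psi):
    """Return a J x K table: % of genetic variance of environment j
    explained by factor k."""
    table = []
    for lam_j, psi_j in zip(loadings, psi):
        parts = [lam ** 2 * d_k for lam, d_k in zip(lam_j, d)]
        total = sum(parts) + psi_j  # total genetic variance in environment j
        table.append([100.0 * part / total for part in parts])
    return table


# Toy example: one environment, K = 2 factors.
v = variance_explained(loadings=[[1.0, 0.5]], d=[2.0, 4.0], psi=[1.0])
```

For each environment, the percentages across factors sum to less than 100 whenever the specific variance is positive; the remainder is the share not captured by any factor.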

From the best-fit model, we estimated some useful parameters to investigate the experimental precision, such as the environment-wise generalized heritabilities (Cullis et al. 2006) and coefficients of experimental variation (CV), which are given by the following equations, respectively:

$$\begin{aligned} H^2_j&=1-\left( \frac{\overline{v}^{\textrm{BLUP}}_{\Delta }}{2\sigma ^2_{g_j}}\right) \end{aligned}$$
(9)
$$\begin{aligned} \textrm{CV}_j&= \frac{\sigma _{e_j}}{\mu _j} \end{aligned}$$
(10)

where \(\overline{v}^{\textrm{BLUP}}_{\Delta }\) is the average pairwise prediction error variance, \(\sigma ^2_{g_j}\) is the genotypic variance for the jth environment, taken from the diagonal elements of \(\hat{\boldsymbol{\Lambda }}^\star \textbf{D} \hat{\boldsymbol{\Lambda }}^{\star ^\prime } + \hat{\boldsymbol{\Psi }}\); \(\sigma _{e_j}\) is the estimated residual standard deviation for the jth environment, and \(\mu _j\) is the mean of the trait for the jth environment.
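Eqs. (9) and (10) reduce to simple ratios once the components are available. The sketch below assumes the average pairwise prediction error variance, the genotypic variance, the residual standard deviation, and the trial mean have already been extracted from the fitted model; the function names and numbers are hypothetical.

```python
def generalized_heritability(avg_pairwise_pev, genotypic_variance):
    """Eq. (9): H2_j = 1 - v_bar / (2 * sigma2_g_j)."""
    return 1.0 - avg_pairwise_pev / (2.0 * genotypic_variance)


def coefficient_of_variation(residual_sd, trait_mean):
    """Eq. (10): CV_j = sigma_e_j / mu_j (as a proportion, not a %)."""
    return residual_sd / trait_mean


# Hypothetical values for one environment (yield in kg per hectare).
h2 = generalized_heritability(avg_pairwise_pev=0.5, genotypic_variance=1.0)
cv = coefficient_of_variation(residual_sd=300.0, trait_mean=3000.0)
```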

Genotype-by-environment interaction investigation tools

We investigated the GEI dynamics in the datasets by examining the pairwise genetic correlations between environments and the partitioning of GEI variance into crossover and non-crossover patterns. The pairwise genetic correlation between environments (\(\rho _{jj^\prime }\)) is given as follows (Cullis et al. 2010):

$$\begin{aligned} \boldsymbol{\Upsilon } = \boldsymbol{\Delta }(\hat{\boldsymbol{\Lambda }}^\star \textbf{D} \hat{\boldsymbol{\Lambda }}^{\star ^\prime } + \hat{\boldsymbol{\Psi }})\boldsymbol{\Delta } \end{aligned}$$
(11)

where \(\boldsymbol{\Upsilon }\) is a \(J \times J\) matrix of genetic correlations, and \(\boldsymbol{\Delta }\) is a diagonal matrix whose elements are the inverse of the square roots of the diagonal values of \(\hat{\boldsymbol{\Lambda }}^\star \textbf{D} \hat{\boldsymbol{\Lambda }}^{\star ^\prime } + \hat{\boldsymbol{\Psi }}\).
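
Equation (11) amounts to rescaling the FA covariance matrix by the inverse standard deviations. A minimal Python sketch is given below, with a hypothetical function name and toy inputs.

```python
import math


def fa_genetic_correlations(loadings, d, psi):
    """Genetic correlation matrix (J x J) implied by an FA model with
    rotated loadings (J x K), diagonal of D (K) and specific variances (J)."""
    n_env, n_fac = len(loadings), len(d)
    # Genetic covariance matrix: Lambda D Lambda' + Psi
    cov = [[sum(loadings[j][k] * d[k] * loadings[jp][k] for k in range(n_fac))
            + (psi[j] if j == jp else 0.0)
            for jp in range(n_env)]
           for j in range(n_env)]
    # Diagonal of Delta: inverse square roots of the genetic variances
    delta = [1.0 / math.sqrt(cov[j][j]) for j in range(n_env)]
    return [[delta[j] * cov[j][jp] * delta[jp] for jp in range(n_env)]
            for j in range(n_env)]


# Toy example (hypothetical values): two environments, one factor
corr = fa_genetic_correlations([[1.0], [1.0]], d=[1.0], psi=[1.0, 1.0])
```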

The decomposition of the GEI variance was performed using the following equation, adapted from Cooper and Delacy (1994):

$$\begin{aligned} \sigma _{\textrm{ge}_{\textrm{rank}}}^2= 1 - \frac{\textrm{Var}\left( \sqrt{\sigma ^2_{g_j}}\right) }{\sigma ^2_{ge}} \end{aligned}$$
(12)

where \(\sigma ^2_{\textrm{ge}}\) is the variance attributed to the GEI, which is determined by fitting a compound symmetry model. This model has the same structure as Eq. (4), but the variance–covariance matrix of genetic effects has the form \(\sigma ^2_g \textbf{J} + \sigma ^2_{\textrm{ge}} \textbf{I}_J\), where \(\textbf{J}\) is a \(J \times J\) matrix of ones.
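
Equation (12) can be sketched in Python as follows. The text does not state which variance estimator is used for \(\textrm{Var}(\cdot )\); the sample variance (divisor \(J-1\)) is assumed here, and the function name and numbers are hypothetical.

```python
import math
import statistics


def crossover_proportion(genotypic_variances, sigma2_ge):
    """Proportion of the GEI variance attributed to crossover (rank-change)
    interaction, given environment-wise genotypic variances and the GEI
    variance from the compound symmetry model."""
    # Heterogeneity of genotypic standard deviations (non-crossover part)
    sds = [math.sqrt(v) for v in genotypic_variances]
    scale_variance = statistics.variance(sds)  # assumption: sample variance
    return 1.0 - scale_variance / sigma2_ge


# Toy example (hypothetical values)
prop = crossover_proportion([1.0, 4.0, 9.0], sigma2_ge=4.0)
```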

Selection tools for overall performance and stability

The target features of most breeding programs are to achieve high performance and stability across the TPE. Using the best-fit FA model, we estimated metrics to assess the performance and stability of genotypes. The performance was measured using the OP metric (\(\textrm{OP}_v\)), which was obtained as follows (Stefanova and Buirchell 2010; Smith and Cullis 2018):

$$\begin{aligned} \textrm{OP}_v = \frac{1}{J}\sum _{j=1}^J \hat{\lambda }^\star _{1_j} {f}^\star _{1_v} \end{aligned}$$
(13)

Note that only the first factor is used to compute the \(\textrm{OP}_v\). This factor captures the largest portion of the total variance. Thus, it provides a generalized measure of the genetic main effects (Supplementary Figure 2; Stefanova and Buirchell 2010). According to empirical observations by Smith and Cullis (2018), this is valid when the majority of loadings in the first factor are positive, indicating the absence (or insignificance) of crossover GEI in the first factor. Using this principle, the other factors are used to represent stability. Considering that the genetic effect of a given genotype v at the jth environment, disregarding the lack of fit effect, is \(g_{vj} = \hat{\lambda }^\star _{1_j} {f}^\star _{1_v} + \hat{\lambda }^\star _{2_j} {f}^\star _{2_v} + \cdots + \hat{\lambda }^\star _{K_j} {f}^\star _{K_v}\), which is equivalent to \(g_{vj} = \hat{\lambda }^\star _{1_j} {f}^\star _{1_v} + \epsilon _{vj}\), the stability of v is given by:

$$\begin{aligned} \textrm{RMSD}_v = \sqrt{\frac{1}{J}\sum _{j=1}^J \epsilon _{vj}^2} \end{aligned}$$
(14)

in which \(\textrm{RMSD}_v\) is the root-mean-square deviation of v, representing the distance between the point and the slope in a latent regression given by \(g_{vj} = \hat{\lambda }^\star _{1_j} f^\star _{1_v} + \epsilon _{vj}\) (Smith and Cullis 2018).
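
The two metrics in Eqs. (13) and (14) can be computed jointly, as in the Python sketch below. The function name and the toy loadings and scores are hypothetical. Following the text, the first factor defines performance and the remaining factors define the deviations \(\epsilon _{vj}\), disregarding the lack-of-fit effect.

```python
import math


def op_and_rmsd(loadings, scores):
    """Return a list of (OP, RMSD) tuples, one per genotype.

    loadings: J x K rotated loadings (environments by factors)
    scores:   V x K rotated scores (genotypes by factors)
    """
    n_env, n_fac = len(loadings), len(loadings[0])
    results = []
    for f in scores:
        # Overall performance: average first-factor contribution
        op = sum(loadings[j][0] * f[0] for j in range(n_env)) / n_env
        # Deviations from the first-factor regression (factors 2..K)
        eps = [sum(loadings[j][k] * f[k] for k in range(1, n_fac))
               for j in range(n_env)]
        rmsd = math.sqrt(sum(e * e for e in eps) / n_env)
        results.append((op, rmsd))
    return results


# Toy example (hypothetical values): two environments, two factors
metrics = op_and_rmsd([[1.0, 0.5], [3.0, -0.5]], [[2.0, 0.0], [1.0, 2.0]])
```

In this toy case the first genotype has a zero score on the second factor, so its RMSD is zero, that is, it is perfectly stable around the first-factor regression.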

A desirable genotype v has a high \(\textrm{OP}_v\) and a low \(\textrm{RMSD}_v\). Following these principles, we applied a selection index (\(\textrm{SI}_v\)) with these metrics (Chaves et al. 2023b; Cowling et al. 2023), given as follows:

$$\begin{aligned} \textrm{SI}_v = 2 \times \frac{\textrm{OP}_v - \overline{\textrm{OP}}}{\sqrt{V(\textrm{OP})}} - \frac{\textrm{RMSD}_v - \overline{\textrm{RMSD}}}{\sqrt{V(\textrm{RMSD})}} \end{aligned}$$
(15)
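
A minimal Python sketch of Eq. (15) is shown below. It assumes that \(V(\cdot )\) is the sample variance across genotypes, which the text does not specify. The function name, the `weight_op` argument, and the toy values are hypothetical.

```python
import statistics


def selection_index(op, rmsd, weight_op=2.0):
    """Selection index per genotype: standardized OP (weighted) minus
    standardized RMSD, so that high OP and low RMSD are both rewarded."""
    op_mean, op_sd = statistics.mean(op), statistics.stdev(op)
    rm_mean, rm_sd = statistics.mean(rmsd), statistics.stdev(rmsd)
    return [weight_op * (o - op_mean) / op_sd - (r - rm_mean) / rm_sd
            for o, r in zip(op, rmsd)]


# Toy example (hypothetical values): the third genotype is both the most
# productive and the most stable, so it receives the highest index.
si = selection_index([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
ranking = sorted(range(len(si)), key=lambda v: si[v], reverse=True)
```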

In addition to the selection index, the reliability of the vth genotype (Mrode 2014) was calculated as follows:

$$\begin{aligned} r_v = 1-\frac{\textrm{PEV}_v}{\overline{\sigma ^2_g}} \end{aligned}$$
(16)

where \(\textrm{PEV}_v\) represents the prediction error variance of the vth genotype, and \(\overline{\sigma ^2_g}\) is the average genotypic variance across environments. The reliability metric associated with the selection index is useful for improving the accuracy of selection, especially when dealing with unbalanced data sets. We adopted a selection intensity of 15% for both datasets.

Spatial predictions in the breeding zone

In this study, GIS tools were used to: (1) collect georeferenced data from the evaluated trials, (2) build environmental markers, and (3) perform spatial predictions for a larger area. Here, we used PLS regression (Wold 1966; Aastveit and Martens 1986) to make the predictions. This method is useful when the number of predictors is much larger than the number of observations and when these predictors are correlated. When PLS is used to predict genotypic performances in untested environments, the response variable is the genotypic performance in the testing set. In this situation, the response variable is a \(J \times 1\) vector (\(\textbf{y}\)) of phenotypic records if a genotype-wise PLS model is fitted or a \(J \times V\) matrix (\(\textbf{Y}\)) when a multivariate PLS model is fitted considering all genotypes at once (Monteverde et al. 2019; Costa-Neto et al. 2022). We refer to the multivariate model as GIS-GGE.

We modified GIS-GGE by using the rotated loadings of the tested environments (\(\hat{\lambda }^{\star }_{k_j}\)) as response variables instead of the within-environment phenotypic records of the genotypes. We call this procedure GIS-FA. We obtained these loadings from the previously chosen FA model (section FA model selection). With the predicted loadings and the previously estimated scores for each genotype from the FA model, we can predict the empirical BLUPs of the genotypes in untested environments. The PLS regression model was trained using the rotated loadings and environmental features of the tested environments:

$$\begin{aligned} \hat{\boldsymbol{\Lambda }}^\star = \textbf{W} \textbf{B}^\star + \textbf{E} \end{aligned}$$
(17)

where \(\textbf{B}^\star\) is a \(P \times K\) matrix of coefficients, \(\textbf{E}\) is a \(J \times K\) matrix of lack-of-fit effects, and \(\hat{\boldsymbol{\Lambda }}^\star\) and \(\textbf{W}\) are previously described in sections Phenotypic analysis and Environmental similarity and interpolation grid, respectively. We obtained \(\textbf{B}^\star\) using a kernel PLS algorithm (Lindgren et al. 1993; Dayal and MacGregor 1997) implemented in the pls package (Liland et al. 2022). This algorithm is detailed in Appendix A.

After training the model, we substituted \(\textbf{W}\) with \(\boldsymbol{\Omega }\) to predict the K loadings of the U untested environments:

$$\begin{aligned} \hat{\boldsymbol{\Lambda }}^\star _U = \boldsymbol{\Omega } \textbf{B}^\star + \textbf{E} \end{aligned}$$
(18)

Recall from section Environmental information that \(\boldsymbol{\Omega }\) was built using historical weather data from 2000 to 2021, as well as soil environmental features. Once we predicted the loadings of untested environments, we used them in linear combinations with the previously predicted scores of each genotype (see section Phenotypic analysis) to estimate their eBLUPs within untested environments:

$$\begin{aligned} \textbf{g}_U = (\hat{\boldsymbol{\Lambda }}^\star _U \otimes \textbf{I}_V) \tilde{\textbf{f}}^\star \end{aligned}$$
(19)
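
Written element-wise, Eq. (19) states that the eBLUP of genotype v in untested environment u is \(\sum _{k=1}^K \hat{\lambda }^\star _{k_u} f^\star _{k_v}\). The Python sketch below illustrates this linear combination. It returns a \(U \times V\) table rather than the stacked vector \(\textbf{g}_U\). The function name and the toy values are hypothetical.

```python
def predict_eblups(loadings_untested, scores):
    """Combine predicted loadings (U x K) with genotype scores (V x K).

    Returns a U x V nested list whose (u, v) entry is the predicted eBLUP
    of genotype v in untested environment u.
    """
    return [[sum(lam_k * f_k for lam_k, f_k in zip(lam, f))
             for f in scores]
            for lam in loadings_untested]


# Toy example (hypothetical values): two environments, two genotypes
g_u = predict_eblups([[1.0, 0.0], [0.5, 2.0]], [[2.0, 1.0], [4.0, -1.0]])
```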

Note that we use the same scores to predict the eBLUPs of both tested and untested environments. Nevertheless, the scores are predicted based solely on the data collected from the tested environments. In other words, the environments in the data set must accurately reflect the TPE so that the loadings of the untested environments closely match the loadings of the tested ones.

A cross-validation process is required to obtain \(\textbf{B}^\star\). We employed a leave-one-out scheme, where data from a single environment were removed (the testing set), and predictions were made using the information provided by the remaining environments (the training set). The predicted eBLUPs were then correlated with the actual eBLUPs and eBLUEs of each environment to determine the predictive ability of the PLS regression model. The model with the highest number of components demonstrating predictive ability was chosen. We leveraged the same cross-validation scheme to compare the predictive ability of GIS-FA and GIS-GGE. In this study, the PLS regression of GIS-GGE was trained with the within-environment eBLUPs of each genotype as response variables.
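
The leave-one-out scheme can be sketched in Python as follows. To keep the example self-contained, a single environmental feature and an ordinary least-squares line replace the kernel PLS regression used in the study; this substitution is an assumption made purely for illustration. All function names and toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares intercept and slope (simple stand-in for PLS)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def leave_one_out(feature, loadings, scores, observed):
    """Predictive ability per environment.

    feature:  J values of one environmental feature
    loadings: J x K rotated loadings; scores: V x K genotype scores
    observed: J x V eBLUPs used as the validation target
    """
    n_env, n_fac = len(loadings), len(loadings[0])
    abilities = []
    for test in range(n_env):
        train = [j for j in range(n_env) if j != test]
        # Predict each loading of the left-out environment
        pred_loading = []
        for k in range(n_fac):
            a, b = fit_line([feature[j] for j in train],
                            [loadings[j][k] for j in train])
            pred_loading.append(a + b * feature[test])
        # Combine predicted loadings with the scores, then validate
        pred = [sum(l * f for l, f in zip(pred_loading, fv)) for fv in scores]
        abilities.append(pearson(pred, observed[test]))
    return abilities


# Toy data (hypothetical): loadings are exactly linear in the feature
feature = [1.0, 2.0, 3.0, 4.0]
loadings = [[2.0 * x, 1.0 - x] for x in feature]
scores = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
observed = [[sum(l * f for l, f in zip(lam, fv)) for fv in scores]
            for lam in loadings]
abilities = leave_one_out(feature, loadings, scores, observed)
```

Because the toy loadings depend linearly on the feature, every left-out environment is recovered almost exactly. With real data, the correlations would fall below one and would guide the choice of the number of PLS components.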

Thematic maps

Thematic maps combine cartographic principles and GIS tools to represent and analyze spatial and geographic phenomena. The incorporation of spatial interpolation methods enables the estimation of values in untested locations, resulting in a seamless representation of the phenomenon. This facilitates the identification of patterns and trends, aiding decision-making across various fields of study (Costa-Neto et al. 2020).

Recall that \(\boldsymbol{\Omega }\) has U rows, and the predictions must be extrapolated to all \(U^\star\) untested environments within the targeted area. For this purpose, we used an interpolation process similar to the one described in section Environmental similarity and interpolation grid. The difference is that for the environmental similarity maps, we interpolated Euclidean distances, while for the thematic maps described in this section, we interpolated eBLUPs. Once the spatial prediction was interpolated across the whole breeding region, we built thematic maps to aid in the visualization and interpretation of the results. We created maps with three themes:

  • Adaptation zones: These maps depict the expected spatial prediction of each selection candidate across the breeding zone. The adaptation of a genotype to an environment is assessed by the expected response of that genotype when it is planted in that environment. Thus, in this context, “adaptation” is used as a synonym for specific performance. For improved visualization, we divided the predicted eBLUPs into eight categories (from expected yield lower than 2500 kg ha\(^{-1}\) to expected yield higher than 4000 kg ha\(^{-1}\)), and each category was then assigned a specific color.

  • Pairwise comparisons: These maps allow for a direct comparison of the expected responses of different genotypes in specific environments. Two distinct colors, one for each candidate, were used to indicate which selection candidate was superior in each location on the map. This visual representation helps to quickly identify which selection candidate outperforms the other in each pixel, facilitating the interpretation of competitive advantages among genotypes in specific environments.

  • Which-won-where: The genotype that achieved the best performance in each location on the map is highlighted. This map provides a clear depiction of the winning genotype for each specific location, enabling a comprehensive understanding of the distribution of high-performing genotypes across the breeding zone.
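
The logic behind the pairwise comparison and which-won-where themes reduces to comparing predicted eBLUPs pixel by pixel. The Python sketch below illustrates this step before any plotting. The function names and predicted yields are hypothetical, and ties are resolved in favor of the first genotype, which is an assumption.

```python
def which_won_where(predictions, names):
    """Name of the best-performing genotype in each pixel.

    predictions: U x V predicted eBLUPs (pixels by genotypes)
    names:       V genotype labels
    """
    return [names[max(range(len(row)), key=row.__getitem__)]
            for row in predictions]


def pairwise_winner(predictions, names, first, second):
    """Winner between two candidates in each pixel."""
    i, j = names.index(first), names.index(second)
    return [first if row[i] >= row[j] else second for row in predictions]


# Toy example (hypothetical yields in kg per hectare for two pixels)
names = ["G10", "G19", "G27"]
predictions = [[3000.0, 3200.0, 2900.0],
               [3500.0, 3100.0, 3600.0]]
winners = which_won_where(predictions, names)
duel = pairwise_winner(predictions, names, "G10", "G19")
```

The resulting labels can then be joined to the pixel coordinates and mapped with one color per genotype.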

These maps, like all the other plots, were built using the ggplot2 package (Wickham 2016), with the addition of the ggspatial (Dunnington 2023) and sf (Pebesma and Bivand 2023) packages. The shapefiles we used are freely available at the Brazilian Institute of Geography and Statistics (IBGE in the Portuguese acronym) website (https://www.ibge.gov.br/geociencias/organizacao-do-territorio/malhas-territoriais/15774-malhas.html), or they can be downloaded using the geodata package. The Supplementary Material has the commented R scripts used to perform GIS-FA, and users can reproduce it using the soybean dataset, freely available at https://github.com/Kaio-Olimpio/GIS-FA/tree/main.

Results

Experimental accuracy

In the rice dataset, \(\textrm{CV}_j\) ranged from 0.11 (E20) to 0.34 (E13), and \(H_j^2\) ranged from 0.31 (E08) to 0.78 (E18) (Fig. 2a). In the soybean dataset, \(\textrm{CV}_j\) ranged from 0.04 (E31) to 0.17 (E42), and \(H_j^2\) ranged from 0.31 (E18) to 0.77 (E31) (Fig. 2b). Spatial trends were modeled in 37 out of 49 soybean trials (Supplementary Table 2).

Fig. 2

Scatter plot representing the experimental coefficient of variation (CV, on a decimal scale) in the y-axis and the generalized heritability in the x-axis for grain yield (kg ha\(^{-1}\)) of rice (a) and seed yield (kg ha\(^{-1}\)) of soybean (b) trials

Genotype recommendations for tested environments

The FA model with four factors (FA4) met our criteria for both datasets. It explained more than 75% of the variance (Table 2). This model captured most of the within-environment variance in both datasets (Supplementary Figure 3).

Table 2 Fitted factor-analytic mixed models for each dataset (rice and soybean) and their respective logarithm of the likelihood function (LogL), number of parameters (no. par.), and average semivariance ratio (ASR)

The genotypic correlations ranged from \(-\) 0.0031 (E07 vs. E19) to 0.8936 (E13 vs. E19) for the rice dataset (Fig. 3a) and from \(-\) 0.0010 (E07 vs. E41) to 0.9753 (E31 vs. E32) for the soybean dataset (Fig. 3b). In the rice dataset, environments E17 and E18 exhibited the most contrasting patterns compared to the other environments. Their correlations with the remaining environments were predominantly negative or close to zero. Similarly, in the soybean dataset, negative or negligible correlations were observed for contrasts involving environments E18, E33, E34, E43, E46, and E47. These findings indicate substantial differences between these specific environments and the rest of the dataset. The wide range of correlation magnitudes is reflected in the percentage of crossover GEI in the datasets: 76 and 81% of the total GEI were due to crossover interactions in the rice and soybean datasets, respectively.

Fig. 3

Heatmap representing the genetic correlation between pairs of environments in the rice (a) and soybean (b) datasets. The color gradient depicts the direction of the correlation: Red designates a negative correlation, whereas green represents a positive correlation

The selected candidates based on the selection index are highlighted in Fig. 4. Despite the low reliability of the rice dataset, genotypes G23, G18, G29, G31, and G26 stand out for their high stability. Genotypes G10, G09, G03, and G01 presented high \(\textrm{OP}\) and reliability. The check treatment (C83) had the highest \(\textrm{OP}\), but it exhibited low stability and reliability compared to the other selected genotypes (Fig. 4a). Among the soybean genotypes, G178, G031, G101, G052, and G035 exhibited the highest stability. On the other hand, G177, G100, G144, G088, and G016 were notable for their high OP. Genotype G016 showed high OP, stability, and reliability (Fig. 4b). The reliability of the selected candidates was higher in the soybean dataset.

Fig. 4

Overall performance (y-axis) and root-mean-square deviation (x-axis) of the experimental genotypes in the rice (a) and soybean (b) datasets. The most productive genotypes are located toward the upper part of the y-axis, and the most stable ones toward the left of the x-axis

Predictions using environmental markers in untested environments

Environmental similarity

The rice trials are spread throughout the breeding region and effectively capture the environmental conditions of the area being studied (Fig. 5a). On the other hand, the trials in the soybean dataset are concentrated in the central part of the state, while there is a region to the west that exhibits low similarity. This area corresponds exactly to the Pantanal biome, which is a protected area with legal restrictions on soybean planting (Fig. 5b). This is probably the reason why there is no trial in this region.

Fig. 5

Environmental similarity between tested and untested environments in the target population of environments in the rice (a) dataset and in the soybean (b) dataset. The warmer the color, the higher the similarity, and consequently, the higher the prediction reliability. Colored circles represent the trials’ locations

GIS-FA validation

In comparison with GIS-GGE, our proposal yields a higher prediction accuracy (as measured by the simple correlation between predicted and observed values) for both datasets (Table 3). For predicting eBLUEs, GIS-FA is 10 and 1% better than GIS-GGE in the rice and soybean datasets, respectively. For predicting eBLUPs, GIS-FA is 9 and 5% more effective than GIS-GGE. A second way to assess the predictive ability of the methods is to check the coincidence between the top 10% of observed and predicted values (Fig. 6). GIS-FA provides more accurate results (Fig. 6a, b) than GIS-GGE (Fig. 6c, d). In other words, when recommending elite candidates based on predicted values, it is more probable that the true top performers will be recommended using GIS-FA than using GIS-GGE. In the rice dataset (Fig. 6a, c), the coincidence of GIS-FA is 13.15 percentage points higher than that of GIS-GGE. In the soybean dataset (Fig. 6b, d), it is 21.19 percentage points higher.
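
The coincidence between the top 10% of observed and predicted values can be illustrated with the Python sketch below. The exact rule used in the study to round the size of the top group is not stated; rounding to the nearest integer (with at least one candidate) is assumed here. The function name and toy values are hypothetical.

```python
def top_coincidence(observed, predicted, fraction=0.10):
    """Percentage of the observed top performers that are also among the
    top performers according to the predicted values."""
    n_top = max(1, round(len(observed) * fraction))  # assumption: rounding

    def top(values):
        order = sorted(range(len(values)), key=lambda i: values[i],
                       reverse=True)
        return set(order[:n_top])

    shared = top(observed) & top(predicted)
    return 100.0 * len(shared) / n_top


# Toy example (hypothetical values): 20 records, so the top 10% holds two
observed = [float(i) for i in range(20)]
predicted = list(observed)
predicted[18] = -1.0  # one true top performer is badly predicted
coincidence = top_coincidence(observed, predicted)
```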

Table 3 Prediction accuracy of eBLUEs and eBLUPs using the proposed method GIS-FA and the conventional method GIS-GGE
Fig. 6

Scatter plot of all predicted values (x-axis) in the leave-one-out cross-validation scheme against observed values (y-axis). The dashed lines represent the empirical percentiles (20, 50, 75, and 90%) associated with the trait value. The colored dots represent the coincident selection candidates when selecting the top 10% performers using observed and predicted values. Each color represents a different genotype. “Coincidence” in the lower left corner of each plot depicts the accuracy of selecting the top 10% using the predicted values. a, b Illustrate the results for the GIS-FA method in the rice and soybean datasets, respectively. c, d Represent the results for the GIS-GGE method in the rice and soybean datasets, respectively

Thematic maps of adaptation zones

The spatial prediction done by GIS-FA was useful in assessing the expected performance of the experimental genotypes in untested environments. This helps to define adaptation zones for each genotype, which are the theme of the maps in Fig. 7. For example, G16 of the rice dataset, shown in Fig. 7a, seems to be well adapted only in a small portion of Goiás State (green region), and it responds poorly to the environmental effects of other locations within the breeding region. Conversely, G27 of the rice dataset, shown in Fig. 7b, exhibits a broader spectrum in terms of adaptation in the breeding region. The same interpretation applies to the genotypes in the soybean dataset. G064 (Fig. 7c) is an unstable candidate, with a very restricted area where it is better adapted (in the northern part of the breeding region). On the other hand, G088 (Fig. 7d) is a stable genotype, meaning it possesses alleles that respond favorably to the environmental effects of different locations across the state. In each map, we provide the OP and RMSD of the corresponding genotype. We have deliberately chosen two promising candidates (Rice’s G27 and soybean’s G088, which are among those selected in Fig. 4), as well as two low-yielding genotypes (Rice’s G16 and soybean’s G064), to be included in Fig. 7. Nevertheless, we recommend using OP and RMSD as criteria to choose the genotype for which an adaptation map should be created.

Fig. 7

Genotype-wise adaptation map showing the adaptation zones of the genotypes G16 (rice dataset, a), G27 (rice dataset, b), G064 (soybean dataset, c), and G088 (soybean dataset, d). The color scale represents the expected yield classes, from non-adapted (intense red) to more than 4000 kg ha\(^{-1}\) (intense green). The white contour delimits the Pantanal biome. On the upper right of each map, we provide the overall performance (OP) and root-mean-square deviation (RMSD) of each genotype

Thematic maps of pairwise comparison

To support the decision-making process, we developed a second thematic map: the pairwise comparison maps (Fig. 8), which facilitate the comparison of two candidates. Take, for example, G10 and G19 in Fig. 4a and G100 and G177 in Fig. 4b. These candidates have somewhat similar performances, according to their OP and RMSD. However, they are clearly adapted to different zones within the breeding region. G10 shows better responses at lower latitudes, while G19 is more suitable for higher latitudes (Fig. 8b). G100 is better adapted to the central region of the soybean’s breeding region, and G177 is more compatible with the environmental conditions at the breeding region’s horizontal extremes (Fig. 8d).

Fig. 8

Pairwise comparison map showing the regions within the rice (a, b) and soybean (c, d) target populations of environments where a selection candidate outperforms a given peer. The colors across the map represent the winning genotype. a, c Are examples of pairwise comparisons between an experimental genotype and a commercial check, while b, d contrast the performance of two promising experimental genotypes along the breeding region. The white contour in c and d delimits the Pantanal biome

Thematic maps of which-won-where

The which-won-where map (Fig. 9) shows the experimental genotype that is most suitable for a specific environment within the breeding zone. In the rice dataset (Fig. 9a), G10 emerges as the most promising experimental genotype in almost all environments in the central and northern portions of the breeding zone, while G19 prevails in the southern and eastern regions. G09, G16, G17, and G20 are the most suitable for specific environments. The breeding region of the soybean dataset is more diverse, with G177, G100, G170, and G088 being the most important experimental genotypes, as they have emerged as the winners in the widest area. The other selection candidates, including a cultivar check (C054), are the top performers in only a few restricted environments (Fig. 9b).

Fig. 9

Which-won-where map depicting the most promising genotype at each location across the target population of environments of the rice dataset (a) and the soybean dataset (b). Each color represents the experimental genotype that wins in a specific environment within the breeding region. The white contour in b delimits the Pantanal biome

Discussion

The GIS-FA method represents the integration of modern statistical genetics with GIS principles. We showed how GIS-FA can aid plant breeders in making decisions by considering the observed performance in tested environments and spatial predictions in untested environments. For observed environments, GIS-FA leverages the resources of FA models to provide useful inferences about the dynamics of the GEI and to select candidates with high performance and stability using customized selection tools (Stefanova and Buirchell 2010; Smith and Cullis 2018). In untested environments, GIS-FA allows the recommendation of cultivars based on spatial predictions derived from soil characteristics, climatic conditions, and empirical data parameters (i.e., factor loadings for genotypes). The GIS-FA method allows for data-driven decision-making with the aid of graphical tools such as thematic maps. These maps include (i) adaptation zone maps, which depict the expected spatial prediction of each genotype within the entire breeding zone; (ii) pairwise comparison maps, which facilitate the comparison of performance between two selection candidates (or a candidate and a commercial check); and (iii) which-won-where maps, which show the most promising experimental genotype (the winner) in each location within the breeding zone.

Genotype-by-environment interaction and selection in tested environments

Increasing crop yield and adapting to different growing conditions are important goals in plant breeding. These traits are the outcomes of a plethora of small quantitative trait loci (QTLs) effects that are highly influenced by the environment (Lynch and Walsh 1998; Crossa 2012). In terms of cultivar recommendation in the TPE, the most concerning source of the GEI is the lack of genotypic correlation between environments (Cooper and Delacy 1994), as observed in both data sets (Fig. 3). As a consequence, it is unlikely that the same set of experimental genotypes will exhibit similar performance across uncorrelated environments. In this case, if a global (i.e., across environments) recommendation is needed, metrics such as the selection index, which combines performance and stability, might be employed. The weight of each metric in the selection index is determined by the breeder (Chaves et al. 2023b). Here, we prioritized performance over stability.

In the GIS-FA method, we leverage the resources of FA mixed models (Piepho 1997; Smith et al. 2001) that explore the complexity of the GEI while handling highly unbalanced data sets. Furthermore, FA models allow for a parsimonious estimation of environment-wise genotypic variances and pairwise covariances. These covariances can be used to investigate the dynamics of GEI, as fully described in this study. The efficiency of the GIS-FA method depends on the number of factors chosen for the FA model, i.e., a poor choice will provide erroneous results. In GIS-FA, it is important to note that the factor loadings of observed environments are used as the training set. This allows for the prediction of the loadings of untested environments in the testing set. Thus, when selecting the best-fit FA model, selection criteria such as the ASR should always be considered. Naturally, using more factors will provide greater explanatory ability. Nevertheless, it will hinder parsimony and computational efficiency, especially in large data sets.

Assuming that the observed environments accurately represent the expected environmental conditions throughout the breeding zone, the most promising genotypes in the tested environments are probably the best ones in the untested environments. Thereby, the idea is to prioritize selected experimental genotypes when drawing the thematic maps “genotype-wise adaptation” and “pairwise comparisons.”

Spatial interpolations in untested environments

Like molecular markers, environmental feature similarity can be used for both inference and prediction purposes. Inference models aim to determine the effect of each environmental feature on phenotypic expression and the GEI, which is analogous to QTL mapping models (Denis 1988; Van Eeuwijk and Elgersma 1993; Crossa et al. 1999; Costa-Neto et al. 2021c; Heinemann et al. 2022). In this work, we focused on environment-wide predictions, regardless of the particular effect of each EF on phenotypic expression and GEI. As polygenic models are used to perform whole-genome regressions (Meuwissen et al. 2001), we assumed that the core of ecophysiological effects captured by the environmental features could be sufficient to generate genotype-wise predictions across the spatial grid. Incorporating environmental features into predictive breeding is advantageous in most cases, whether integrated with genomic information or not (de los Campos et al. 2020; Buntaran et al. 2021; Jarquín et al. 2021; Costa-Neto et al. 2022). However, recent work from Crossa et al. (2023) demonstrated that the inclusion of environmental covariates could either increase or decrease prediction accuracy, depending on the specific case. Techniques such as feature selection (Crossa et al. 2023) and exhaustive search (Li et al. 2018) can be considered when selecting environmental features.

Environmental similarity

Environmental similarity maps revealed a need to perform an adequate sampling of the different environmental types within a given target breeding region (Fig. 5). This entails including samples from various climatic conditions and soil traits that may be encountered in future predictive environments. Essentially, these maps provide a measure of reliability for spatial predictions by benchmarking the similarity between tested and untested environments: the more similar an untested environment is to a tested environment, the higher the chances of making an accurate prediction. The results depicted in the maps of Fig. 5 can be attributed to the geographical distribution of trials in relation to the Brazilian biomes [refer to Figure 1 of Chaves et al. 2023b for a map with the Brazilian biomes]. The soybean breeding region comprises two biomes, namely the Pantanal (wet lowlands) and the Cerrado (highland savanna conditions). All trials were conducted in the Cerrado, which explains the lack of similarity between the TPE and the environments in the Pantanal biome. Consequently, the prediction for this particular region is likely to be compromised. The rice breeding region also includes two biomes: Amazonia (a wet tropical rainforest) and Cerrado. Unlike the soybean dataset, there are representative trials from both biomes, providing comprehensive coverage of the relevant environmental conditions.

Predicting using partial least squares regression

The association between PLS regression, GEI, and environmental features was introduced by Aastveit and Martens (1986) for inference purposes. Their aim was to address challenges related to the curse of dimensionality and multicollinearity in explaining the dynamics of GEI using two datasets. Their model was later expanded to include information on molecular markers to investigate QTL-by-environment interactions (Crossa et al. 1999; Vargas et al. 2006). Nevertheless, employing environmental features in statistical models to explain and predict GEI has not gained significant popularity among plant breeders (Vargas et al. 2001; Ortiz et al. 2007; Ramburan et al. 2012; Porker et al. 2020). With the advancement of computational technology and the democratization of “enviromics” resources, PLS has emerged as a suitable method for exploring big data and performing spatial predictions of experimental genotypes in new environments (Monteverde et al. 2019; Rincent et al. 2019; Guo et al. 2021; Costa-Neto et al. 2022). In fact, PLS has emerged as a relevant alternative for prediction purposes, even when breeders do not specifically incorporate environmental data into the model (Ortiz et al. 2023).

In most studies that employed PLS regression for prediction purposes, the training set typically consisted of the performance per se of genotypes and environmental features from the tested environments (Monteverde et al. 2019; Costa-Neto et al. 2022). Our study demonstrated that associating environmental features with the rotated factor loadings of the tested environment yields superior results. Through GIS-FA, we achieved higher prediction accuracy (Table 3) and enhanced the ability to distinguish high-performance experimental genotypes when relying solely on predicted values (Fig. 6). By predicting the factor loadings for untested environments, we establish a connection between the observed environmental feature values and the underlying causes of GEI, as well as the genetic covariance that exists between environments. A prior study by Rincent et al. (2019) also utilized PLS models to predict latent factors of the AMMI components for untested environments. This approach enabled them to construct an appropriate covariance structure that improved the accuracy of their predictions. The findings of Rincent et al. (2019) and the results of this work provide evidence of the potential of using PLS models to indirectly perform spatial predictions by initially predicting the latent elements that contribute to a particular performance. A similar strategy was proposed in a single-step model by Tolhurst et al. (2022), who demonstrated the efficiency of combining known and latent environmental features to predict both tested and untested environments.

Thematic maps

An important feature of GIS-FA is the illustration of the spatial predictions from selection candidates using thematic maps (Figs. 7, 8, and 9). Figure 7 offers information on the areas within the breeding zone where the experimental genotypes are expected to thrive. Figure 8 allows the evaluation of the merit of a certain candidate cultivar based on its ability to outperform a commercial cultivar used as a reference or another promising experimental genotype. Figure 9 provides a straightforward solution for genotype recommendations across the breeding region, indicating which candidate is more suitable for a specific environment within the breeding zone. Thematic maps serve as valuable tools in decision-making, assisting in the allocation of genotypes in the breeding region (Costa-Neto et al. 2020; Bustos-Korts et al. 2022). In addition, the thematic maps provide information on the genotypes’ stability and adaptation from a geographic perspective. Costa-Neto et al. (2020) suggested that, in a GIS context, “stability” means lower variability in spatial patterns, while “adaptation” refers to the expected performance in a specific environment in the breeding region.

One advantage of this approach is the possibility of integrating high-quality satellite images from diverse platforms. Here, we used freely available geographic databases on online platforms to achieve an efficient prediction method without incurring any additional costs. Furthermore, implementing partial geographic visualizations can optimize resource allocation when defining the experimental network of trials. The higher resolution of the satellite-based data could enable the delivery of spatial predictions at the farmer’s level. This could benefit the product development and placement stages by extending this methodology to accommodate satellite-based enviromics while also accounting for historical agronomic records.

Future directions

The statistical models of GIS-FA can be improved by integrating molecular information to leverage the covariance between relatives and employing more informative environmental features in the PLS model (Dias et al. 2018; Monteverde et al. 2019; Crossa et al. 2023). The utilization of ecophysiological environmental features in crop growth models could enhance our understanding of the link between phenotypic expression and environmental factors (Rincent et al. 2019; Costa-Neto et al. 2021a). GIS-FA can also be benchmarked with other enviromic-based approaches fit for predicting genotypes in untested environments (Jarquín et al. 2014; Tolhurst et al. 2022; Costa-Neto et al. 2020). Other statistical resources and even artificial intelligence methods can replace the PLS in the prediction step (Guo et al. 2021; Heinemann et al. 2022). Finally, future research can explore the potential risks associated with assigning genotypes to specific environments using GIS-FA. This can be done through the application of probabilistic methods (Dias et al. 2022).