Introduction

The climate change scenario challenges agricultural research to provide intelligent solutions quickly and economically (Tigchelaar et al. 2018). Characterizing the conditions of crop growth is crucial to achieving this purpose (Xu 2016), as it allows a deeper understanding of how the environment shapes phenotypic variation (for example, Costa-Neto et al. 2021a; de los Campos et al. 2020; Heinemann et al. 2019; Ramirez-Villegas et al. 2018). Since the 1960s, several researchers have suggested using environmental information to explain the differences among cultivars caused by the genotype × environment interaction (GxE) (Perkins and Jinks 1968; Crossa et al. 1999; Vargas et al. 1999). The environmental information used in genomic selection models usually consists of variables such as temperature, rainfall, and solar radiation, which enter the models as co-variables (Jarquín et al. 2014).

For research in plant breeding, especially for the selection of better-evaluated soybean genotypes for a target region, this approach is proven to be advantageous to discriminate genetic and non-genetic sources of culture adaptation (Costa-Neto et al. 2021c). In this context, new technologies available such as the historical description of the environment (Enviromics) (Costa-Neto et al. 2021b, c; Resende et al. 2021; Rogers et al. 2021) are crucial to improving conventional models, but bring the challenges of changing the already established systems. The integration of this new technology allied to the already established models allows the selection of cultivars with high yields in the face of the environmental conditions caused by climate changes and the consequent increase of the occurrence of abiotic stresses (Crossa et al. 2021).

The indication of genotypes may vary according to the macro-environment, climate and soil changes, different latitudes and longitudes, and years (Bourret et al. 2015; Gray et al. 2016), and it may also vary with changes within a micro-environment (Resende et al. 2016; Soares et al. 2016). Thus, the concept of envirotyping emerges to establish the quality of a certain environment (Cooper et al. 2014; Xu 2016); it uses multiple techniques to collect, process and integrate environmental information in genetic and genomic studies (Costa-Neto et al. 2021b), in addition to fostering breeding strategies to understand and deal with future scenarios of climate changes (de los Campos et al. 2020; Gillberg et al. 2019).

This information can be difficult to analyze for several reasons: the large amount of data, its complex and non-linear structure, and the presence of redundancies and outliers (Gianola et al. 2011). Thus, non-linear methodologies are preferable for dealing with complex environmental data sets (Calus et al. 2004; Gianola et al. 2011). The use of these non-linear methodologies with environmental data has become increasingly popular in recent years (for example, Friedel 2012; Liukkonen et al. 2013; Strebel et al. 2013). In this context, new technologies such as envirotyping and neural networks associated with georeferenced data are crucial to improving conventional models and selecting high-yielding soybean cultivars in the face of environmental changes caused by climate change and abiotic stresses.

This study presents the use of the technique of neural networks associated with georeferenced data to implement the processing of climate and soil data, to describe and categorize this information with basis on the dissimilarity caused by the environmental variables, and subsequently to apply this in models of reaction norms in order to study and quantify genetic behaviors and genotype × environment interactions in soybean genotypes.

Materials and methods

Environmental data collection

This study used climate and soil information of 32 municipalities located within the Brazilian macro-region of soybean culture called MRS 3, in the state of Goiás. The municipalities chosen are part of the network of trials of value for cultivation and use (VCU) of the GDM Genética do Brasil S.A. company (GDM). The daily meteorological information that was given to this work is part of the collection of the Agrymet company. All data was kindly provided by GDM.

To characterize the sites being analyzed, a historical series of climate characteristics was used (Table 1), evaluated from November to February between 2018 and 2020. This time series was defined to capture the climate variation throughout the whole development of the soybean crop in the region.

Table 1 List of the environmental variables considered in the study, obtained by the Agrymet company

Soybean data

This study uses a large set of yield data from VCU trials of soybean varieties, kindly provided by GDM Genética do Brasil S.A. The phenotypic data were records of grain yield (kg ha−1). The trials were carried out in multi-environment trials (MET) from 2018 to 2020 and standardized by GDM, with each trial composed of 17 genotypes. The trial was formed by four randomized blocks with three replications. Each plot consisted of a 4 m row, with 19 seeds per meter.

Definition of the target population of environments (TPE)

The methodology of self-organized maps (SOM) of Kohonen (Kohonen 2013) was used to characterize the patterns of the spatial distribution of the environmental variables. A SOM is formed by two layers: the input layer, which transfers the data to the map, and the output layer, in which neurons undergo competitive learning and thereby form a topological structure (Chen et al. 2019). During the learning process of the network, the climate variables were supplied as the input vector. Each input vector is assigned to an output neuron, and the weights of that neuron are updated according to the input information. Based on these weights, the distance between neurons was calculated. Processing used the standardized mean Euclidean distance and 1000 iterations. The networks were built with the package “kohonen” (Wehrens and Kruisselbrink 2018).
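The competitive learning step described above can be illustrated with a minimal sketch. It is written in plain Python for illustration only and is not the authors' R workflow with the “kohonen” package; the function names, the linear decay of the learning rate and neighbourhood radius, and the toy data are all assumptions.

```python
import math
import random

def standardize(data):
    """Scale each variable (column) to zero mean and unit standard deviation."""
    cols = list(zip(*data))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1)) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in data]

def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest (Euclidean) to x."""
    return min(range(len(weights)), key=lambda k: math.dist(weights[k], x))

def train_som(data, rows, cols, n_iter=1000, lr0=0.5, seed=0):
    """Online SOM training on a rectangular grid (hypothetical settings)."""
    rng = random.Random(seed)
    weights = [list(rng.choice(data)) for _ in range(rows * cols)]
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    radius0 = max(rows, cols) / 2.0
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                     # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)   # shrinking neighbourhood
        x = rng.choice(data)
        bmu = best_matching_unit(weights, x)
        for k, w in enumerate(weights):
            # Update the winner and its grid neighbours towards the input.
            if math.dist(coords[k], coords[bmu]) <= radius:
                for d in range(len(x)):
                    w[d] += lr * (x[d] - w[d])
    return weights

# Toy example: two clearly different "environments" (made-up values).
toy = standardize([[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
                   [10.0, 10.1], [10.1, 10.0], [9.9, 10.0]])
som_weights = train_som(toy, rows=2, cols=1, n_iter=1000)
som_labels = [best_matching_unit(som_weights, x) for x in toy]
```

In this toy case the two groups of observations end up mapped to different neurons; with real data the map would be larger (the study reports 90 neurons) and the inputs would be the standardized climate and soil variables.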

After the SOM learning process, the classification of the target population of environments (TPE) was carried out by using the procedures of discriminant analysis (DA) and the principal component analysis (PCA). Successive K-means were used for an interval of K-neurons, and the values of Bayesian Information Criterion (BIC) of the corresponding models and the coefficients of variation were calculated until the ideal number of clusters was found. The DA and PCA functions were implemented by using packages “ade4” (Dray and Dufour 2007) and “MASS” (Ripley et al. 2018). All the analyses were carried out on Software R version 4.2.1 (R Development Core Team 2022).
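The search over successive values of K can be sketched as follows. This is an illustrative plain-Python version, not the code used in the study: the BIC-like score n·ln(WSS/n) + k·ln(n), where WSS is the within-cluster sum of squares, is an assumed simplification, and the helper names and toy neuron weights are hypothetical.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centres, labels)."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centres[c]))
                      for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break
        labels = new_labels
    return centres, labels

def bic_like(points, k):
    """Assumed criterion: n * ln(WSS / n) + k * ln(n); lower is better."""
    n = len(points)
    centres, labels = kmeans(points, k)
    wss = sum(math.dist(p, centres[lab]) ** 2 for p, lab in zip(points, labels))
    return n * math.log(max(wss, 1e-12) / n) + k * math.log(n)

# Hypothetical neuron weight vectors forming two groups.
neurons = [[0.0, 0.2], [0.3, 0.0], [0.1, 0.4], [0.2, 0.1],
           [5.0, 5.2], [5.3, 5.0], [5.1, 5.4], [5.2, 5.1]]
scores = {k: bic_like(neurons, k) for k in (1, 2, 3)}
```

The scores are then inspected together with the coefficient of variation of the between-cluster distances, as in the text, rather than taken mechanically from the minimum.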

Components of variance

The components of variance were estimated according to the residual maximum likelihood method (Patterson and Thompson 1971), and the genetic values were predicted through the best linear unbiased predictor (Henderson 1975), according to (Gilmour et al. 2015). Random regression models were adjusted through the Legendre polynomials, considering all the possible levels of adjustment for each random effect, by using the following model:

$${Y}_{ijk} ={R}_{k}+ {b}_{M}{\phi }_{ijM}+\sum _{m=0}^{M}{g}_{ikm}{\phi }_{ijm} +{\varepsilon }_{ijk}$$

where \({Y}_{ijk}\) is the observation of the ith individual (i = 1, 2,…, n) in the jth cluster (j = 1, 2,…, 7) in the kth replication (k = 1, 2,…,10); \({R}_{k}\) is the fixed effect of the replication; \({b}_{M}\) is the fixed coefficient of regression adjusted through the sixth degree of the Legendre polynomial for the common average trajectory of genotypes. The random effect \({g}_{ikm}\) is the regression coefficient for the Legendre polynomial of degree m for the genetic value; \({\phi }_{ijm}\) is the mth Legendre polynomial for the jth cluster of the ith individual; m is the degree of the Legendre polynomial fitted for the genetic and environmental effects, varying from 0 to 6; and \({\varepsilon }_{ijk}\) is the residual random effect associated with \({Y}_{ijk}\).
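As an illustration of how the Legendre covariates \({\phi }_{ijm}\) can be built, the sketch below maps the cluster index onto the interval [−1, 1] and evaluates the polynomials with Bonnet's recursion. The standardization of the gradient and the normalization factor \(\sqrt{(2m+1)/2}\) are common choices in random regression and are assumed here; the source does not state which convention was used, and the function names are hypothetical.

```python
import math

def standardize_gradient(j, j_min, j_max):
    """Map a cluster index j in [j_min, j_max] onto [-1, 1]."""
    return -1.0 + 2.0 * (j - j_min) / (j_max - j_min)

def legendre_basis(x, max_degree, normalized=True):
    """Return [P_0(x), ..., P_M(x)] using Bonnet's recursion:
    (m + 1) P_{m+1}(x) = (2m + 1) x P_m(x) - m P_{m-1}(x)."""
    values = [1.0, x][: max_degree + 1]
    for m in range(1, max_degree):
        values.append(((2 * m + 1) * x * values[m] - m * values[m - 1]) / (m + 1))
    if normalized:  # assumed normalization, not confirmed by the source
        values = [math.sqrt((2 * m + 1) / 2.0) * p for m, p in enumerate(values)]
    return values

# Covariates for five clusters and a second-degree polynomial (example only).
phi = [legendre_basis(standardize_gradient(j, 1, 5), 2) for j in range(1, 6)]
```

Each row of `phi` would then multiply the random regression coefficients of a genotype to give its value in the corresponding cluster.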

In the matrix notation, the model above is described as follows:

$$y=X\tau + Zg + e$$

where y is the vector of phenotypic observations; \(\varvec{\tau }\) is the vector of replication effects (assumed as fixed); g is the vector of genetic effects (assumed as random); and e is the error vector (random). X and Z are the incidence matrices for these effects.

In this model, g ~ N (0, Kg\(\otimes\)I) and e ~ N (0, R); where Kg is the matrix of co-variance for genetic effects; \(\otimes\) denotes the Kronecker product; I is an identity matrix with an appropriate order for the respective random effect; and R refers to the matrix of residual co-variances. Different structures of residual co-variance (homogeneous, diagonal and unstructured) were tested.

The polynomial order in models of random regression was selected by using the Akaike information criterion (AIC) (Schwarz 1978), as follows:

$$AIC=-2LogL+2p$$

where LogL is the logarithm of the maximum value of the likelihood function (L), and p is the number of estimated parameters.
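A small worked example of this criterion follows. The log-likelihoods and parameter counts are invented for illustration and are not the values in Table 3; only the formula comes from the text.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: AIC = -2 LogL + 2 p."""
    return -2.0 * log_likelihood + 2.0 * n_params

# Hypothetical candidate models: (name, LogL, number of parameters).
candidates = [("model A", -2740.0, 4), ("model B", -2728.5, 8), ("model C", -2727.9, 13)]
aic_values = {name: aic(logl, p) for name, logl, p in candidates}
best_model = min(aic_values, key=aic_values.get)
```

The model with the lowest AIC is retained, so a gain in likelihood must outweigh the penalty of two units per additional parameter.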

The estimates of the components of variance (\({\sigma }_{g}^{2}\)) and the predicted genetic values (\({ \tilde{g}}_{{ij}}\)), in the original scale, were obtained through the following expressions (Kirkpatrick et al. 1990):

$${\sigma }_{g}^{2}={\phi }_{ijm}{K}_{g}{\phi }_{ijm}^{\prime }$$

$${ \tilde{g}}_{ij}=\sum _{m=0}^{M}{\alpha }_{im}{\phi }_{ijm}$$
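The two expressions can be read as a quadratic form and a dot product. The sketch below assumes that the Legendre covariates of a cluster are stored as a row vector and that the covariance matrix of the random regression coefficients is a nested list; all numeric values and function names are hypothetical.

```python
def bilinear(phi_a, K, phi_b):
    """Compute phi_a K phi_b' for plain nested-list matrices."""
    return sum(phi_a[r] * K[r][c] * phi_b[c]
               for r in range(len(phi_a)) for c in range(len(phi_b)))

def genetic_variance(phi, K):
    """Genetic variance at one point of the gradient: phi K phi'."""
    return bilinear(phi, K, phi)

def predicted_genetic_value(alpha, phi):
    """Sum over degrees m of alpha_im * phi_jm for one genotype and one cluster."""
    return sum(a * p for a, p in zip(alpha, phi))

K_example = [[4.0, 1.0], [1.0, 2.0]]   # covariance of intercept and slope (made up)
phi_example = [1.0, 0.5]               # Legendre covariates of one cluster (made up)
alpha_example = [10.0, -2.0]           # predicted coefficients of one genotype (made up)
var_example = genetic_variance(phi_example, K_example)
value_example = predicted_genetic_value(alpha_example, phi_example)
```

Evaluating these functions for every cluster gives the variance profile and the reaction norm of each genotype on the original scale.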

The genetic correlations (\({\rho }_{g}\)) between each pair of environmental clusters were obtained through the following expression:

$${\rho }_{g}=\frac{{\widehat{\sigma }}_{g\left(ij\right)}}{\sqrt{{\widehat{\sigma }}_{g\left(i\right)}^{2}{\widehat{\sigma }}_{g\left(j\right)}^{2}}}$$

where \({\widehat{\sigma }}_{g\left(ij\right)}\) is the genetic co-variance between environmental clusters i and j; \({\widehat{\sigma }}_{g\left(i\right)}^{2}\) and \({\widehat{\sigma }}_{g\left(j\right)}^{2}\) are the genetic variances in environmental clusters i and j, respectively. The statistical analyses were carried out by using the software ASReml 4.1 (Gilmour et al. 2015) and R (R Development Core Team 2022).
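To make the correlation formula concrete, the sketch below converts a cluster-by-cluster genetic co-variance matrix into a correlation matrix. The matrix shown is invented for illustration and does not reproduce the estimates of the study; the function name is hypothetical.

```python
import math

def cov_to_corr(G):
    """rho_ij = sigma_ij / sqrt(sigma_i^2 * sigma_j^2) for every pair of clusters."""
    n = len(G)
    return [[G[i][j] / math.sqrt(G[i][i] * G[j][j]) for j in range(n)]
            for i in range(n)]

# Hypothetical genetic co-variance matrix for three environmental clusters.
G_example = [[4.0, 2.0, -1.0],
             [2.0, 9.0, 0.0],
             [-1.0, 0.0, 1.0]]
rho_example = cov_to_corr(G_example)
```

Values close to 1 indicate that genotypes keep their ranking between two clusters, while low or negative values indicate reordering.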

Results

The topological formation of the SOM is represented in Fig. 1A. The color scale represents the synaptic weights of each variable in the 90 neurons of the map, from blue (lower weights) to yellow (greater weights). The weight distribution in the network shows the effect of the variables on the different neurons, as well as the similarity among them. At this stage, the neurons have not yet been divided into environmental clusters. In short, the network training was efficient, since it brought neurons with similar weights closer together, despite the use of climate variables with different behaviors. At first, solar radiation presented a greater differentiation among neurons, and the first well-structured clusters were formed with altitude and latitude, which presented the greatest synaptic weights in the network. Rainfall, wind speed, and relative humidity behaved similarly in the network, as did temperature and AWC.

Fig. 1 A Graphic representation of the variation of the synaptic weights in the 90 neurons formed by the methodology of self-organized maps of Kohonen for each climate and soil variable. B Mean coefficient of variation (CV) of the mean Euclidean distance and Bayesian Information Criterion (BIC) estimated for a growing number of TPE.

Figure 1B shows the BIC values and the coefficients of variation of the distances among the clusters for the growing values of k TPE. A clear decrease of BIC is seen up to value k = 5, after which the BIC value increases, clearly indicating that the best number of clusters is equal to five. The same can be observed in the trajectory of the CV values, which, when reaching the values of five clusters, shows no significant reduction of the coefficient of variation of the distances among the clusters with the increase of the k TPE value.

Figure 2A shows the geographic classification of the municipalities in the state of Goiás, in which the municipalities used in the analysis are colored according to their cluster. In the principal component analysis, based on the estimates of the mean Euclidean distance among the 90 neurons, a single discriminant function was enough to explain 99% of the variance, separating them into five TPEs (Fig. 2B). Among all the climate and soil variables, Altitude presented the greatest linear dependence (LD%) in the formation of the TPEs (97.43%), while the lowest values were for Rain (0.41%), SR (0.40%), and RH (1.70%) (Table 2). Figure 2C shows the membership probability of each evaluated municipality, in the three years, in its respective cluster. In general, all the evaluated municipalities had a membership probability above 80%, even though TPE 1, TPE 3, and TPE 4 are geographically close and presented little chance of belonging to another cluster. Only TPE 1 presented a lower mean of membership (70%), which was on average 30% similar to TPE 2.

Fig. 2 A Geographic distribution of the 17 municipalities of Goiás (GO) belonging to the five clusters formed by the method of discriminant analysis of principal components. B Graphic dispersion of the density of the first discriminant function for the five clusters formed. C Graphic representation of the membership probability of the 17 municipalities in the 3 years of evaluation for the five clusters formed. Axis x represents the observations of the municipalities in the 3 years and axis y represents the membership probability in their respective clusters.

Table 2 Mean of the water and soil variables for each Target-Population of Environments (TPE) and Linear Dependence (LD) in percentage of the participation of each variable in the formation of the TPEs.

The average temperatures of the TPEs ranged from 25.3 °C (TPE 2) to 26.4 °C (TPE 4) (Table 2), and relative humidity followed the same pattern, from 76.18% (TPE 2) to 77.5% (TPE 4). Rainfall was lowest in TPE 3 (8.22 mm day−1) and greatest in TPE 5 (9.02 mm day−1). Solar radiation ranged from 18.4 W m−2 day−1 (TPE 5) to 19.0 W m−2 day−1 (TPE 2). Wind speed (WS) did not exceed 1.5 m s−1 in any of the five TPEs. Available water capacity (AWC) was well balanced among the TPEs, with TPE 5 presenting the lowest value (0.82% day−1). For Altitude and Latitude, the TPEs followed a similar order (TPE 4 < TPE 3 < TPE 1 < TPE 2 < TPE 5).

The Legendre polynomial was chosen according to the Akaike information criterion (AIC), in which model 2 had the best result (lowest value) of 5469.9 (Table 3). This model presents a heterogeneous residual structure, that is, it estimates a component of the residual variance for each TPE. Thus, this model was adopted to estimate the components of variance and to predict the genetic values for the tested soybean cultivars.

Table 3 Convergence of the different regression models tested through the Akaike information criterion (AIC) for the genotypes tested in the different environmental clusters

The behavior of the 17 cultivars across the TPEs is described in Fig. 3. Figure 3A describes the average behavior of the phenotypes across the TPEs. Overall, yield peaked at 5500 kg ha−1 in TPE 2 and reached a minimum of 3000 kg ha−1 in TPE 3. The means of the phenotypes varied in the same way across the environments, but the ranking of the cultivars changed over the TPEs. The trajectories of the genetic effects (Fig. 3B) show a linear relationship with the complex genetic × environmental interaction, in which TPE 1 had the greatest variation in genetic values, which decreased until TPE 5. Among the 17 genotypes evaluated through the model of random regression, genotype G39 ranked first in all the TPEs (Fig. 3B). In terms of phenotype, G39 and G44 behaved similarly, but G44 had a medium genetic effect across the TPEs. The phenotype of G16 had the worst classification in only two TPEs, but its genetic value was the lowest in almost all TPEs; only in TPE 5 was it not classified as the lowest genetic effect. Thus, the genotype ranking along the environmental gradient differed markedly from the ranking based on phenotypes.

Fig. 3 Curves of the behaviors of the cultivars in the different TPEs. A Average behavior of phenotypes in the TPEs. B Reaction norms of the model of random regression of the 17 cultivars in the different TPEs. The colors highlight the cultivars that had the greatest genetic effect (G39) in blue, medium (G44) in yellow, and the smallest one (G16) in red, in the TPEs.

Across the environments, trait heritability varied between 0.25 (TPE 1) and 0.02 (TPE 4) (Fig. 4A), decreasing from the first TPE to the fourth and increasing again in the fifth. The genetic variances followed the behavior of heritability, with the greatest value in TPE 1 (51,204) and the smallest in TPE 4 (4476). The greatest phenotypic variance was in TPE 3 (372,639) and the smallest in TPE 1 (208,303). The same pattern was found for the residual variance, with the greatest value in TPE 3 (361,759) and the smallest in TPE 1 (157,099).
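The reported values for TPE 1 can be checked with a short calculation. It assumes that heritability is the ratio of genetic to phenotypic variance and that the phenotypic variance is the sum of the genetic and residual variances; the text does not state this definition explicitly, but the TPE 1 figures are consistent with it.

```python
def heritability(genetic_var, residual_var):
    """h2 = sigma_g^2 / (sigma_g^2 + sigma_e^2), assuming no other variance terms."""
    return genetic_var / (genetic_var + residual_var)

# Variances reported in the text for TPE 1.
tpe1_genetic = 51204.0
tpe1_residual = 157099.0
tpe1_phenotypic = tpe1_genetic + tpe1_residual   # matches the reported 208,303
tpe1_h2 = heritability(tpe1_genetic, tpe1_residual)
```

Under these assumptions the heritability for TPE 1 is close to 0.246, which rounds to the reported 0.25.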

Fig. 4 Genetic parameters through the TPEs. A Estimate of heritability, genetic variance, phenotypic variance, and residual variance across the TPEs. B Heat map representing the environmental genetic correlation among the TPEs.

The greatest genetic correlation (\({\rho }_{g}\)) was between TPE 1 and TPE 2, with a value of 0.99, suggesting little reordering of the genotypes between these environments (Fig. 4B). The smallest genetic correlation occurred between the extreme environments, TPE 1 and TPE 5 (\({\rho }_{g}\) = −0.39). The greatest correlations were among TPE 1, TPE 2, and TPE 3, which had the greatest genetic yield potential, and the smallest were between these TPEs and TPE 5, indicating a reordering of the genotype classification in that TPE. TPE 4 presented medium genetic correlations with the other TPEs, ranging from 0.48 with TPE 1 to 0.77 with TPE 3, and is thus a transition environment between TPEs 1, 2, and 3 and TPE 5.

Discussion

Knowledge of the climate and soil conditions of a region is of major importance for soybean breeding, since certain genotypes are more stable across different environments; these materials are selected because they do not present undesirable changes in yield and are more resilient to local climate changes (Eberhart and Russell 1966). In addition, some genotypes are more adaptable, responding positively to improved environmental conditions (Brawner et al. 2014).

This study sought to classify and analyze, through the methodology of self-organized maps, a time series of data under the scenario of a dynamic change of the climate in the Brazilian macro-region 3 of the soybean culture. The use of artificial neural networks (ANNs) proved to be highly efficient to interpret the climate dynamics in the region, where, after the formation of the TPEs, the discriminant analysis was able to explain 99% of the variation of the synaptic weights of the network. The model of self-organized maps is efficient to analyze climate and soil data since the way the information is dealt with by the network creates the possibility of a better performance if compared to conventional models (Bustos-Korts et al. 2022). In contrast with conventional approaches, this study sought a sensitive approach to the dynamic environment. Therefore, in the interpretation of environmental data, the information on topography, such as altitude and latitude, and the information on solar radiation are important for an initial interpretation of the network, since they presented greater synaptic weights (Fig. 1A), while the most sensitive changes in the network are caused by the dynamics of continuous climate variables (temperature, relative humidity, AWC, wind speed, and rainfall) over time.

Although there are different approaches to the study of the G × E interaction, there are still few studies in the literature that describe a recommendation according to continuous environmental change in the soybean crop. Environmental variables are usually treated as discrete phenomena, generating clusters with similar environmental traits, so that the environments are treated as levels of categorical variables (Heinemann et al. 2022). Modeling spatial variation and temporal dynamics is a challenge for interaction studies in the soybean crop. Here, the use of ANN as an environmental descriptor ensured that the quality of the environmental clusters was balanced despite the complexity of the climate information. This study therefore contributes to a better understanding of the G × E interaction in soybean crops, allowing a more accurate recommendation of cultivars according to continuous environmental changes. Furthermore, the approach used in this study, with artificial neural networks as an environmental descriptor, can be applied to other crops and to climate change scenarios, providing valuable information for the selection of more adapted and resistant genotypes.

In soybean breeding, random regression is very useful, since it allows the prediction of genetic values of individuals evaluated in different years, sites, and common environments, with an effect of ordering and selection (Schaeffer 2004). The functions of co-variance can express, in a more realistic way, the phenomena associated with longitudinal data, being superior to models of repeatability and multi-traits (Meyer 1998). In addition, Legendre polynomials have been used to model curves of the behavior of perennial plants (Li et al. 2017).

The genetic trajectories of the reaction norms reinforce the presence of the genotype x environment interaction since their trajectories are non-linear and cross with each other, which implies a different classification for each environment. Besides that, the trajectories can also be interpreted as genetic variability. The more distant trajectories are from each other, the more genetically distinct the genotypes (Gomulkiewicz and Kirkpatrick 1992). The advantage of this strategy is that the response of selection can be predicted, not only in the expression of the genotype submitted to any environment, but also in the quantification of the environmental sensitivity through the genetic trajectories, that is, based on the capacity of response to the changes of the environment (Alves et al. 2020).

In addition, reaction norms describe the genetic values of each cultivar across the environmental gradient. The model of random regression can predict the genetic value of any cultivar in any environmental cluster (between the first and the last TPE). The trajectories demonstrated that the cultivars had similar performances from TPE 1 to TPE 3 (Fig. 3B), which reveals that the recommendation of cultivars for these regions can be similar. Genetic correlations reinforce the efficiency of the recommendation (Fig. 4B). Although these three environmental clusters have a high environmental genetic correlation, only TPE 1 presented a greater value of heritability, which makes it a more favorable environment for the selection of cultivars. The high correlation among these TPEs can support the idea of grouping them in the same region, as done by Resende et al. (2021); however, even if there is no difference in the ranking of the genotypes for these environments, the genetic variance was greater in TPE 1 (Fig. 3B), corroborating the idea that selection in this environment will lead to greater genetic gains.

Even though the number of sites in this study does not provide full coverage of the Brazilian macro-region M3 of the soybean culture, the study allowed the identification of well-defined TPEs. The results of this study indicate that although altitude is the main descriptive variable, the climate dynamics caused by continuous variables play an important role in the formation of environmental clusters. When the focus is selecting genotypes for specific environments, this model can benefit by predicting genotype performance for the site, taking into consideration the behavior of an average environment, as long as there is enough climate information for the categorization, as seen by Chenu et al. (2013). This approach can also be used in a scenario of climate change, in which the frequency of hot and dry climates is expected to increase in the future (Rattis et al. 2021).

Conclusion

The use of artificial neural networks (ANNs) proved to be highly efficient to interpret the climate dynamics in the region, where it was possible to discriminate and classify these environments into well-defined TPEs by using dynamic information on the climate. With the classification of the TPEs, it was possible to study the GxE interaction and visualize the soybean genetic behavior for this macro-region in the form of reaction norms. The genetic trajectories reinforce the presence of the GxE interaction and allowed us to quantify the response of the genotypes to changes in climate. This methodology can be useful to optimize time and resources in soybean breeding programs, since the choice of the most adequate genotypes can be made based on sensitive changes in the environment.