Introduction

Genomic selection (GS) is a well-known innovation in plant breeding, allowing genotypic values to be predicted without the need to grow and evaluate crops in the field. It is particularly successful in recurrent selection programs for allogamous species, which are characterized by high genomic heterozygosity and, in some cases, long selection cycles (Zhang et al. 2017a; Grattapaglia 2022). The efficacy of GS is closely tied to the Linkage Disequilibrium (LD) phenomenon, a pattern of non-random association among loci across the genome; this association means that loci are not always inherited according to independent Mendelian segregation across generations (Liu et al. 2015; Skelly et al. 2016). Thus, even a locus not directly involved in the expression of a phenotypic trait can still be informative about it. This fundamental feature of segregating populations ushered in a new era of plant breeding based on genomic analysis, enabling strides toward more productive crop varieties.

GS relies heavily on the genetic relationships among individuals. When a model developed for a specific population is applied to predict genotypic values in an unrelated population, the likelihood of success is expected to be greatly reduced (Ramstetter et al. 2017; Merrick et al. 2022). The base population is the initial group of plants from which breeding efforts originate. Therefore, thoughtfully choosing the base population is essential when applying GS to segregating allogamous populations (Labroo et al. 2021; Grattapaglia 2022). This careful choice is the cornerstone of the accuracy and efficacy of genomic prediction along the breeding cycles of the program, leading to more pertinent outcomes in the context of genetic improvement.

It is common in the literature to find studies that train and validate GS models within the same population (as will be discussed in the present review and as also discussed by Taylor 2014, Merrick et al. 2022, and Berro et al. 2019). While this may seem a reasonable approach, given the optimization of models in a fixed founder population, many of these studies confine the training and validation populations to one or a few specific experiments, often using samples from a genetic panel of the same generation (see Ferrão et al. 2017, Simiqueli & Resende 2020, and Simiqueli et al. 2023 for more details). In this context, it is important to consider the relevance of training and validating models across different generations, such as between parents and offspring, or other combinations of relatedness (Simiqueli et al. 2023), and also between different environments and types/phases of breeding populations (Resende et al. 2021). This practice not only ensures a more authentic validation but also strengthens the robustness and reliability of genomic predictions.

We are immersed in an era of advanced information systems and big data analysis, in which the integration of information and the use of Artificial Intelligence (AI) prevail (Harfouche et al. 2019; Xu et al. 2022; Montesinos-López et al. 2021). However, despite the abundance of technologies and automation, certain processes have been neglected in the pursuit of quick results. Indeed, GS cannot be considered a low-cost technique, and it depends on solid quantitative genetics skills. Calibrating the models requires reliable phenotypic data and SNP markers. This may explain the hesitancy of some sectors in adopting GS in plant breeding programs, whether because of its potential cost or because the model does not fit operational reality (Wartha and Lorenz 2021). Therefore, progress is needed to seize adoption opportunities in both the short and long terms, with considerations about cost-based feasibility, disruption of current practices, and associated risks (Bernardo 2021).

Genomics is highly efficient for the genetic characterization of individuals and populations (Huisman 2017; Dwiningsih et al. 2020). On the other hand, establishing DNA causality in the expression of complex phenotypes is a demanding field of study, prone to producing false positives in statistical models if not approached with scrupulous precision (Wu et al. 2018). The secret to the effective functioning of GS lies in its ability to unravel genetic relatedness among evaluated individuals through LD. It is worth noting that, in many cases, information about the pedigree of these individuals is already available. If reliable pedigree information is accessible for the target breeding population, the traditional approach of phenotypic selection not only proves equally effective but, in most cases, surpasses GS (Henryon et al. 2019; Michel et al. 2020). Selection based exclusively on pedigree then becomes a choice that can be adopted without hesitation. However, in scenarios where genealogical information is unavailable, or when seeking to maximize selection gains by combining pedigree and SNP data, the application of genomic selection techniques is highly valid.

This brief review article offers planning insights into the field of Genomic Selection (GS), aiming to optimize time and resource investments during its implementation. It consistently centers on the driving force of the technique, Linkage Disequilibrium (LD). The topics encompass high-throughput SNP-based genotyping of breeding samples, GS design regarding sample sizes (phenotyped-and-genotyped individuals), and predictive accuracies. The primary emphasis is on allogamous crops such as maize (an annual crop) and eucalyptus (a perennial crop). The objective is not to compare predictive models or propose statistical protocols, but rather to aggregate strategies for the effective application of the technique and to explore projections of predictive abilities based on available data.

SNP-based genotyping of the breeding samples

In the field of molecular genetics, the choice of the most suitable genetic marker is pivotal for precise and effective analysis. Studies indicate that single nucleotide polymorphisms (SNPs) provide the most informative estimates of genetic differentiation and structure (Dwiningsih et al. 2020). SNPs, the most common form of DNA variation, allow the simultaneous genotyping of hundreds to thousands of loci. Advances in sequencing technologies have contributed to the creation of extensive SNP datasets, expanding the viability of this approach. Today, it is easy to access public phenotypic and genomic data, for example in wheat (an autogamous crop) and eucalyptus (Scheben et al. 2019; Resende et al. 2017), in part because some high-impact scientific journals ask authors to share phenotypic and genomic data as part of publication.

In any genomic analysis, especially in the context of allogamous plant breeding, Linkage Disequilibrium (LD) plays the main role in relating DNA markers to agronomically relevant traits. LD is the non-random association between two or more loci along the genome, resulting in their dependent segregation (Skelly et al. 2016). Therefore, even a locus that does not directly control the trait of interest can provide relevant information if it is in LD with the target gene. This phenomenon is of great importance: without LD, genomic technologies enabling rapid haplotype identification and SNP marker detection would be severely compromised, and associating DNA with phenotypic traits would be akin to looking for a needle in a haystack. Establishing the magnitude of LD is fundamental for marker-assisted and genomic selection studies, although identifying the causal variants underlying phenotypic variation remains a considerable challenge.

It is not surprising that next-generation sequencing (NGS) technologies represent transformative genomic tools (Kim et al. 2020). Initiated between 2004 and 2005 with the Roche GS20 model, NGS platforms continue to evolve, enabling the generation of data with billions of base pairs and high nucleotide-level accuracy (Kchouk et al. 2017). The practical application of this process involves the fragmentation and preparation of DNA from plant samples, followed by the construction of genomic libraries. The NGS sequencer reads DNA sequences, and the resulting data are aligned to a species’ reference genome, allowing for the identification of variants such as SNPs and indels.

Genotyping arrays, such as SNP chips, are widely used in plant breeding. Operationally, these chips consist of glass substrates carrying thousands of oligonucleotides representing specific SNPs, enabling high-throughput genotyping. With high reproducibility and fast result analysis, SNP chips integrate molecular data from different studies and have become a routine research approach (Rasheed et al. 2017). Figure 1A, adapted from Resende et al. 2017, illustrates the distribution of genotyped SNP markers on an Illumina BeadChip platform (Silva-Junior et al. 2015) for a hybrid population of Eucalyptus grandis × Eucalyptus urophylla. Figure 1B examines the patterns of allogamous heterozygosity across the genome, represented by \(2pq = 2p(1-p)\), where ‘\(p\)’ is the frequency of one of the two alleles and ‘\(q\)’ is the frequency of the other, or simply \(1-p\). This illustrates the applicability of SNP chips in species with relatively complex genomes, such as eucalyptus, a genus with several species native to Oceania. For instance, E. grandis has a genome of approximately 697 Mb, while that of E. urophylla is approximately 626 Mb.
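For readers who want to reproduce this kind of summary, the expected heterozygosity behind Fig. 1B can be computed directly from a biallelic SNP matrix coded as 0/1/2. The base-R sketch below uses simulated genotypes only to illustrate the calculation; it is not tied to the BeadChip data themselves.

```r
# Simulated SNP matrix: 200 individuals x 1000 markers, coded 0/1/2
# (count of the alternative allele). Real data would come from the chip.
set.seed(42)
p_true <- runif(1000, 0.05, 0.95)                    # allele frequencies used for simulation
M <- sapply(p_true, function(p) rbinom(200, 2, p))   # 200 x 1000 genotype matrix

p   <- colMeans(M) / 2       # estimated frequency of one allele per SNP
q   <- 1 - p                 # frequency of the other allele
het <- 2 * p * q             # expected heterozygosity (2pq) per SNP

summary(het)                 # distribution of per-marker heterozygosity
mean(het)                    # genome-wide average, as summarized in Fig. 1B
```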

Fig. 1

Adapted from Resende et al. (2017)

Distribution of 24,806 polymorphic SNP markers along the 11 chromosomes of a eucalyptus population. Part “A” of the figure displays the concentration of SNPs per 1 Mb window. Part “B” of the figure shows the average heterozygosity of SNPs in a 100 kb window (in terms of the allele frequency ‘\(p\)’ and ‘\(q=1-p\)’).

To ensure high-quality genomic data, it is necessary to conduct data mining and cleaning. Genotyping platforms typically offer tens or hundreds of thousands of markers. However, some of these markers may not be polymorphic in the sample used to calibrate the GS models. This implies that, while the genotyping library identifies thousands of SNPs, only a portion of them may exhibit polymorphism in the sample. Therefore, markers with a low or zero minor allele frequency (MAF) should be excluded and a high call rate (indicative of high-quality data) should be ensured, among other quality-control measures. Note that the sample in Fig. 1 has ~ 25 K markers, while the chip has 60 K (Silva-Junior et al. 2015). In addition, markers should be parameterized to capture both additive and non-additive genetic effects (Vitezica et al. 2013; Muñoz et al. 2014). Detailing these issues is not the focus of this text; for this purpose, R packages such as {snpReady} (Granato et al. 2018) and {AGHmatrix} (Amadeu et al. 2023) are very good options.
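As a minimal sketch of these quality-control steps, assuming a 0/1/2-coded SNP matrix with missing calls stored as NA, the base-R code below filters markers by call rate and MAF; the thresholds are illustrative, and the commented call to Gmatrix() from {AGHmatrix} is shown only as one possible next step for building the genomic relationship matrix.

```r
# M: individuals x markers matrix coded 0/1/2, with NA for failed calls (simulated here)
set.seed(1)
M <- matrix(sample(c(0, 1, 2, NA), 500 * 2000, replace = TRUE,
                   prob = c(0.45, 0.30, 0.20, 0.05)),
            nrow = 500, ncol = 2000)

call_rate <- colMeans(!is.na(M))              # proportion of non-missing calls per marker
p   <- colMeans(M, na.rm = TRUE) / 2          # allele frequency per marker
maf <- pmin(p, 1 - p)                         # minor allele frequency

keep <- call_rate >= 0.95 & maf >= 0.05       # illustrative thresholds, not a prescription
M_qc <- M[, keep]
dim(M_qc)                                     # markers retained after quality control

# One possible next step: an additive genomic relationship matrix with {AGHmatrix}
# library(AGHmatrix)
# G <- Gmatrix(SNPmatrix = M_qc, method = "VanRaden", missingValue = NA)
```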

GS efforts planning: sample sizes and the predictive accuracy

Through high-throughput genotyping techniques, the availability of large quantities of SNPs per sample allows us to fully explore the genetic variance within breeding populations (Yang et al. 2017; Grattapaglia 2022). As previously mentioned, genomic selection is made possible by Linkage Disequilibrium (LD) among markers, which, put more directly, is the correlation that exists between markers in a population genotyped with SNPs. This means that, even if a marker is not directly linked to a QTL, many markers adjacent to the QTL can provide information about the segregation of genes related to the expression of the trait of interest (e.g., crop yield, growth time, and biotic/abiotic stress tolerances).

By applying genome-wide selection, or simply genomic selection (GS), it is possible to predict the “phenotypes” (i.e., genotypic values) of experimental breeding trials even before these trials are conducted, that is, based solely on the DNA of individuals that would hypothetically be planted in the field. Although this may initially seem unrealistic, it makes sense when we remember that a significant portion of the information leading to the final phenotype originates from genes. By appropriately managing and/or correcting the environmental component of the phenotype (as is well known, Phenotype = Genotype + Environment), GS models can demonstrate strong predictive abilities (Montesinos-López et al. 2018).

It is important to note that, in terms of predictive ability (i.e., accurately ranking the best genetic materials), GS does not necessarily surpass the phenotypic selection carried out in the field (Heffner et al. 2011). This is because GS is generally applied as an “extremely early, indirect” selection method that approximates the direct selection (the “benchmark”) conducted in the field (see Fig. 2A). However, GS can indeed be more advantageous than direct field selection, primarily for five reasons: (i) time savings, as early genomic selection can be conducted from initial plant propagules; (ii) resource savings, including labor and inputs that would be expended in the entire process of establishment, phenotypic measurement, and harvest/transportation in experimental breeding trials; (iii) the ability to evaluate a greater number of genetic materials that might not otherwise go to the field and therefore would not be tested—for example, instead of taking 2000 genetic materials for field evaluation, 4000 could be genotyped and their phenotypes predicted genomically; (iv) the prediction of traits that are difficult to measure, such as root volume in cassava and wood volume in highly branched trees; and (v) the correction of potential pedigree errors in the construction of the relationship matrix (\(A\)), with this information being recovered through the genomic matrix (\(G\)), or even the combination of matrices \(A\) and \(G\) into a “super” matrix called “\(H\)” (Legarra et al. 2014).
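Concerning reason (v), the combination of \(A\) and \(G\) in single-step approaches is usually carried out through the inverse of \(H\) (Legarra et al. 2014). The base-R sketch below applies that standard combination to a toy pedigree with two genotyped full sibs; the numeric values of \(G\) are invented for illustration, and {AGHmatrix}, cited above, also offers functions for this type of construction.

```r
# Toy numerator relationship matrix A for four individuals:
# 1 and 2 are unrelated parents; 3 and 4 are their full-sib offspring.
A <- matrix(c(1.0, 0.0, 0.5, 0.5,
              0.0, 1.0, 0.5, 0.5,
              0.5, 0.5, 1.0, 0.5,
              0.5, 0.5, 0.5, 1.0), nrow = 4, byrow = TRUE)

geno <- c(3, 4)                        # only individuals 3 and 4 are genotyped
G <- matrix(c(1.02, 0.55,
              0.55, 0.98), nrow = 2)   # invented genomic relationships for the example

# Single-step combination (Legarra et al. 2014), with tau = omega = 1:
# the genotyped block of A^-1 is updated with G^-1 minus A22^-1
Hinv <- solve(A)
Hinv[geno, geno] <- Hinv[geno, geno] + solve(G) - solve(A[geno, geno])
round(Hinv, 3)
```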

Fig. 2

Relationship between observed phenotypic values (averages or BLUPs of one hundred genetic materials) in the experiments and genomically predicted values. Part “A” shows the “true” genotypic ranking based on experimental field-measured values. Part “B” displays only the genomic predictions, ordered analogously to the ranking of genetic materials seen in “A.” Part “C” illustrates the relationship between parts “A” and “B.” The dotted blue line represents the GS selection sieve, with selected genetic materials on the right and discarded ones on the left. The gray dotted line in part “C” provides an indication of the predictive ability of the GS model. Adapted from Vianello, Resende & Brondani (2023)

The performance of a GS model can be understood with a didactic example involving one hundred genetic materials and a predictive ability of approximately 50% (Fig. 2). Even when some phenotypically good individuals are left out of the genomic selection sieve, the model can still show satisfactory predictive ability. Figure 2B shows a shuffling of the best and worst phenotypes when predicted genomically, while Fig. 2C illustrates the relationship between the observed phenotypic values (or genetic values estimated from means or Best Linear Unbiased Prediction—BLUP—of an experiment) and the genomically predicted values. Note that “selection” in GS can be confused with genomic “exclusion,” which is what happens in most cases: GS is used to eliminate the worst individuals rather than to effectively select the best.

Much is said about the appropriate number of markers to be used in a genomic selection process, as well as the ideal number of individuals/genetic materials to be phenotyped and genotyped (Merrick et al. 2022; Werner et al. 2020). The truth is that there is no one-size-fits-all answer. It depends on various factors inherent to the crop species, the population to be improved, the breeding objective (such as developing lines, hybrids, or clones), and the phenotypic trait targeted for breeding (Silva et al. 2021). Among the equations used to plan a breeding program coupled with genomic selection, Resende (2008) proposed:

$${\widehat{{\varvec{r}}}}_{{\varvec{g}}\widehat{{\varvec{g}}}}=\sqrt{{\left\{1+\frac{4{{\varvec{N}}}_{{\varvec{e}}}{\varvec{L}}}{{{\varvec{n}}}_{{\varvec{m}}}}+\frac{2{{\varvec{N}}}_{{\varvec{e}}}{\varvec{L}}{(4{{\varvec{N}}}_{{\varvec{e}}}{\varvec{L}}+{{\varvec{n}}}_{{\varvec{m}}})}^{2}}{{\varvec{N}} {{\varvec{h}}}^{2}[\mathbf{ln}\left(2{{\varvec{N}}}_{{\varvec{e}}}\right)]{{\varvec{n}}}_{{\varvec{m}}}^{2}}\right\}}^{-1}}$$

where \({\widehat{{\varvec{r}}}}_{{\varvec{g}}\widehat{{\varvec{g}}}}\) is the projected, or theoretical, predictive accuracy (attempting to anticipate the predictive ability achieved in GS practice); \({\varvec{L}}\) is the size of the species’ genome (in morgans, M); \({{\varvec{n}}}_{{\varvec{m}}}\) is the number of SNP markers; \({{\varvec{h}}}^{2}\) is the heritability of the phenotypic trait (which can be broad-sense or narrow-sense heritability, depending on the breeding phase; we will return to this shortly); \({\varvec{N}}\) is the actual size of the population (i.e., the number of sampled individuals, assuming an equal number of individuals per family); and \({{\varvec{N}}}_{{\varvec{e}}}\) is the effective population size, representing the number of individuals genetically contributing to future generations; in smaller populations, inbreeding (mating among close relatives) lowers genetic diversity and reduces \({{\varvec{N}}}_{{\varvec{e}}}\). It is advisable to plan and understand the possible outcomes before starting any genomic selection process, as failing to do so may lead to significant losses of resources and time. Alternatives to this equation can be found in Resende et al. (2012), pages 139–146.
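For planning purposes, the equation above can be written as a small R function; the example call uses eucalyptus-like values (see Fig. 3) that are purely illustrative.

```r
# Projected GS accuracy (Resende 2008), as given in the equation above.
# Ne: effective population size; L: genome size in Morgans; nm: number of SNPs;
# N: phenotyped-and-genotyped individuals; h2: trait heritability.
gs_accuracy <- function(Ne, L, nm, N, h2) {
  term <- 1 + (4 * Ne * L) / nm +
    (2 * Ne * L * (4 * Ne * L + nm)^2) / (N * h2 * log(2 * Ne) * nm^2)
  sqrt(1 / term)
}

# Illustrative call with eucalyptus-like values (L ~ 10 M, Ne = 50),
# 30 K SNPs, 2000 individuals, and a moderately heritable trait:
gs_accuracy(Ne = 50, L = 10, nm = 30000, N = 2000, h2 = 0.30)
```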

Therefore, GS requires careful planning, with meticulous accounting of all resources to ensure its effectiveness. Figure 3 illustrates the application of the \({\widehat{{\varvec{r}}}}_{{\varvec{g}}\widehat{{\varvec{g}}}}\) equation proposed by Resende (2008). It considers a genus with a relatively compact genome (Eucalyptus spp., see Fig. 1, with \({\varvec{L}}\) ≈ 10 M and approx. 700 Mb) (Bartholomé et al. 2015; Silva-Junior & Grattapaglia 2015) and a species with a substantially larger genome (Zea mays, \({\varvec{L}}\) = 19.96 M and approx. 2,300 Mb) (Dell’Acqua et al. 2015). In both cases, an effective population size (\({{\varvec{N}}}_{{\varvec{e}}}\)) of 50, feasible for crop improvement programs, was considered.

Fig. 3

Genomic selection scenarios projected with the \({\widehat{{\varvec{r}}}}_{{\varvec{g}}\widehat{{\varvec{g}}}}\) equation (Resende et al. 2012). Genome sizes of Eucalyptus spp. (\({\varvec{L}}\) ≈ 10 Morgans, M) and Zea mays (\({\varvec{L}}\) = 19.96 Morgans, M) are addressed, with the effective population size (\({{\varvec{N}}}_{{\varvec{e}}}\)) fixed at 50 for both. Predictive accuracies (\({\widehat{{\varvec{r}}}}_{{\varvec{g}}\widehat{{\varvec{g}}}}\)) consider 1 to 100 K SNPs. Three scenarios of trait heritability (\({{\varvec{h}}}^{2}\) = {Low, Moderate, High}) and of phenotyped-and-genotyped individuals (\({\varvec{N}}\) = {1000, 2000, 5000}) were evaluated

The \({\widehat{{\varvec{r}}}}_{{\varvec{g}}\widehat{{\varvec{g}}}}\) values presented in Fig. 3 assume that all events unfold as planned, with any unforeseen contingencies not already accounted for in the phenotypic trait heritability (\({{\varvec{h}}}^{2}\)) minimized. Currently available genotyping technologies offer from 1 K to over 600 K SNP markers for various crop species (Rasheed et al. 2017). Note that the number of SNPs used (\({{\varvec{n}}}_{{\varvec{m}}}\)) plays a significant role in predictive capacity. It is also worth noting that, in general, GS is effective for traits with both high and low \({{\varvec{h}}}^{2}\), with predictions being more accurate for traits with higher \({{\varvec{h}}}^{2}\). For traits with high \({{\varvec{h}}}^{2}\), the number of phenotyped-and-genotyped individuals (\({\varvec{N}}\)) is not a major limiting factor. Conversely, for traits with low \({{\varvec{h}}}^{2}\), \({\varvec{N}}\) becomes limiting, and even populations of 5000 individuals may not provide substantial predictive accuracy.

At this point, it is important to emphasize that statistical validation of GS models is necessary to assess their predictive ability. Among the available methods, the simple correlation between observed phenotypic values and values predicted by the model is a direct and trustworthy measure of the model’s predictive performance. Accuracy can also be calculated by weighting this correlation by the heritability of the trait, with the premise of correcting for potential prediction shrinkage effects (Müller et al. 2015). However, caution must be exercised when including these quantities in accuracy equations, as this may lead to an overestimation of GS accuracy for traits with low heritability or an underestimation of predictive ability for traits with high heritability. A reasonable option is therefore to report Pearson’s correlation, or the Spearman correlation when genotypic rankings are of interest.
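In practice, these validation statistics amount to a few lines of R. In the sketch below, obs and pred stand for the observed (or BLUP-adjusted) and genomically predicted values of a validation set; the data are simulated, and the division by the square root of heritability is the shrinkage correction discussed above, to be used with the caution just mentioned.

```r
# obs:  observed phenotypes (or BLUPs) of the validation individuals
# pred: genomic predictions (GEBVs) for the same individuals
set.seed(7)
obs  <- rnorm(100)
pred <- 0.5 * obs + rnorm(100, sd = 0.8)            # simulated predictions for the example

r_pearson  <- cor(obs, pred)                        # predictive ability
r_spearman <- cor(obs, pred, method = "spearman")   # agreement of genotypic rankings

h2 <- 0.30                                          # assumed trait heritability
accuracy <- r_pearson / sqrt(h2)                    # heritability-adjusted accuracy
c(r_pearson = r_pearson, r_spearman = r_spearman, accuracy = accuracy)
```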

Regarding the sample size for GS, experience has shown that well over the commonly cited 1000 individuals are generally needed to fit good GS models (see Fig. 3). Furthermore, across many efforts to fit various types of predictive models (such as Bayesian approaches—Bayes A, B, Cπ, LASSO—and those based on Artificial Intelligence or Machine Learning), there is often little incremental gain in predictive ability over the GBLUP or RRBLUP methods initially described by Meuwissen et al. (2001). Admittedly, there are situations where better fits can be achieved with more elaborate methods than the classic GBLUP/RRBLUP (Montesinos-López et al. 2021). However, it is important to highlight that other efforts, such as managing breeding populations and defining strategies for feeding and validating the models, are vital to the success of GS. In addition, properly exploring the additive and non-additive components of the target phenotypic traits, as well as adapting GS to the specific phase of the breeding program, generally provides greater benefits than striving for marginal gains in predictive ability among Frequentist × Bayesian × AI methods.

Practical inferences on genomic selection in allogamous plant breeding

Numerous modeling approaches could be applied in genomic prediction, and Fig. 4 presents the basic mixed linear model \({\varvec{y}}={\varvec{X}}{\varvec{\beta}}+{\varvec{Z}}{\varvec{g}}+{\varvec{e}}\) as a single illustration. In this case, \({\varvec{y}}\) is the vector of phenotypic data, \({\varvec{\beta}}\) is the vector of fixed effects (such as: experimental replicates/blocks, locations, repeated measurements over time, among others); \({\varvec{g}}\) is the vector of random genetic effects (the genetic materials, which can be lines, hybrids, among others); and \({\varvec{e}}\) is the random vector of residuals; \({\varvec{X}}\) and \({\varvec{Z}}\) are incidence matrices on the fixed and random effects, respectively. It is not the focus of this article to discuss when to assign certain effects as fixed or random in nature, but it is a consensus that genetic materials should be considered as random in order to enable the execution of mixed models for genomic selection (Resende et al. 2008).
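As one concrete, deliberately simplified way of fitting this model in the GBLUP form, the sketch below simulates a small training set, builds a VanRaden-type \({\varvec{G}}\) matrix, and solves the model with mixed.solve() from the {rrBLUP} package; the package choice and the simulated data are assumptions made only for illustration.

```r
# Minimal GBLUP sketch: y = X*beta + Z*g + e, with g ~ N(0, G * sigma2_g)
library(rrBLUP)

set.seed(123)
n <- 300; m <- 1000
p <- runif(m, 0.1, 0.9)
M <- sapply(p, function(pj) rbinom(n, 2, pj))   # n x m SNP matrix coded 0/1/2
u <- rnorm(m, 0, 0.05)                          # simulated marker effects
g <- as.numeric(scale(M %*% u))                 # simulated "true" genotypic values
y <- 10 + g + rnorm(n, sd = 1)                  # phenotypes, heritability around 0.5

# VanRaden-type genomic relationship matrix
p_hat <- colMeans(M) / 2
W <- scale(M, center = TRUE, scale = FALSE)
G <- tcrossprod(W) / (2 * sum(p_hat * (1 - p_hat)))

fit  <- mixed.solve(y = y, K = G)               # REML fit of the GBLUP model
gebv <- fit$u                                   # predicted genotypic values (GEBVs)
cor(g, gebv)                                    # accuracy against the simulated true values
```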

Fig. 4

Simplified diagram of a genomic selection (GS) process, using corn as an example. Part ‘A’ of the figure illustrates the process of fitting and validating predictive genomic models. Part ‘B’ of the figure depicts possible schemes for use, both for the prediction of inbred lines and for the prediction of hybrids based on the best inbred lines or validated models with hybrid information. Adapted from Vianello, Resende & Brondani (2023)

In the example procedure illustrated in Fig. 4, GS is employed for two purposes: first, in the development of improved inbred lines (L1, L2, …), and second, in the creation of improved heterotic hybrids (H1×2, …). These genetic materials may eventually become registered cultivars, following appropriate field testing, such as in “Value for Cultivation and Use” (VCU) trials. It is important to note that a genomic predictive model will predict based on the data it has been trained on: if fed data from progeny tests, it will yield predicted values for maize progenies, just as it will for eucalyptus clones if provided with clonal test data. Therefore, great care must be taken in selecting the base population for model training (Resende et al. 2022b).

The requirement for complete or partial genetic relatedness between training, validation, and application populations is a drawback of genomic selection. This is because LD, the driving force behind genomics, is easily lost or reshaped between distinct populations, or even after several generations within the same population (Liu et al. 2015; Simiqueli et al. 2023). Hence, it is advisable to manage the model training data according to the program's objectives. Initially, it is necessary to comprehensively map the genetic base of the program (i.e., the germplasm bank), including the addition of new materials and, importantly, the removal of less desirable materials to eliminate unnecessary noise from the subsequent analyses (Resende et al. 2022b). It is worth noting that genetic materials developed using GS will always be related to the initial genetic base (Grattapaglia 2022). This is reasonable considering that companies typically have well-defined genetic bases. A good GS model is unlikely to predict the performance of genetic materials from other genetic bases (such as those from different companies, countries, or regions).

The same applies when using only one or a few environments in the genomic predictive model, as it will only be capable of predicting the performance of genetic materials for those few environments it has been trained on. The lack of representativeness in the input data for GS models, coupled with validation using a subset of the same data, can falsely inflate the perceived model performance, as the validation population will be entirely related to the training population. In addition, if the phenotypic data is collected in only one or a few environments, the model may not perform well in unobserved environments. Models that account for genotype–environment interactions (G × E) have been shown to be more effective than the traditional GBLUP model in terms of prediction accuracy (Montesinos-López et al. 2018). Two strategies can be employed to address this issue: (i) develop a multi-environment genomic model capable of providing predictions with high stability, meaning the genetic material's expected value is good regardless of the environment; (ii) use models that incorporate Genotype × Environment × Management interactions (G × E × M)—notable among these are models within the scope of Enviromics (Resende et al. 2021, 2022a; Costa-Neto et al. 2023), which can provide predictions for individuals with high stability and adaptability across different locations. These models can predict improved materials on a site-specific scale.
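One common way of implementing such multi-environment models is the reaction-norm formulation, in which genotype, environment, and interaction effects enter the model as covariance kernels. The sketch below follows that general idea with the {BGLR} package and simulated data; all dimensions, kernels, and settings are illustrative assumptions rather than a recommended configuration.

```r
# Reaction-norm style G x E model: genotype, environment, and interaction kernels
library(BGLR)

set.seed(2024)
n_g <- 100; n_env <- 4
geno <- factor(rep(1:n_g, times = n_env))      # 100 genotypes observed in 4 environments
env  <- factor(rep(1:n_env, each = n_g))

# Placeholder genomic relationship matrix among the 100 genotypes
W <- matrix(rnorm(n_g * 500), n_g)
G <- tcrossprod(scale(W)) / 500

Zg  <- model.matrix(~ geno - 1)                # incidence of genotypes on records
Ze  <- model.matrix(~ env  - 1)                # incidence of environments on records
Kg  <- Zg %*% G %*% t(Zg)                      # genotype kernel expanded to records
Ke  <- tcrossprod(Ze)                          # environment kernel
Kge <- Kg * Ke                                 # Hadamard product: G x E kernel

y <- rnorm(n_g * n_env)                        # placeholder phenotypes

fit <- BGLR(y = y,
            ETA = list(list(K = Kg,  model = "RKHS"),
                       list(K = Ke,  model = "RKHS"),
                       list(K = Kge, model = "RKHS")),
            nIter = 2000, burnIn = 500, verbose = FALSE)
str(fit$yHat)                                  # fitted values across environments
```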

In this context, it is essential to select the key traits that will feed the operational or industrial model. Some traits are easier to measure than others, but an easy-to-measure trait can have a low genetic correlation with the actual trait of interest, a serious problem that is often overlooked. For example, phenotypic traits from genetic breeding tests (e.g., progenies, hybrids, clones, among others) may not correlate well with actual field performance. While it is important to feed the GS model with operational data, there is often limited data available for commercial genotypes. In such cases, integrating test and commercial data makes it possible to estimate the genetic correlation between the two types of data. This approach can effectively address the issue and yield better results in grain production, forestry, horticulture, fruit farming, and other sectors (Resende et al. 2021).

Genomic selection for multiple traits is an approach in which plant breeders make selection decisions considering several traits simultaneously, such as yield, plant height, flowering time, and disease resistance, aiming to optimize genetic gain over multiple generations. However, while index selection is a common practice, it presents challenges in optimizing non-linear breeding objectives and in determining the ideal weights for each trait (Moeinizade et al. 2020). Thus, in any breeding program incorporating genomic selection, various phenotypic traits will be improved, preferably simultaneously. Large-scale phenotyping data can also be included in GS models, such as data collected by sensors on drones or predicted via Near Infrared Spectroscopy (NIRS) (Robert et al. 2022). Each phenotypic trait has its own genetic nature and mode of inheritance, and the traits to be improved are directly related to the type of genetic material being worked on and to the program's objectives. However, one thing is certain: the higher the heritability of the trait (whether additive in the narrow sense, \({h}_{a}^{2}\), or broad-sense, \({h}^{2}\)), the better the GS model will work and deliver good results (see Fig. 3, where the theoretical results are approximately valid for both \({h}^{2}\) and \({h}_{a}^{2}\)). In a recurrent selection program, genetic variability tends to be greater in the initial phases and, even though many materials are evaluated with low replication, heritabilities tend to be higher (Zhang et al. 2017b). The paradox is that later in the program, although the genetic base may narrow after several cycles of selection, the quantity of genetic materials is much greater and, therefore, they are more experimentally replicated.
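As a toy illustration of the weighting issue mentioned above, a simple linear index over standardized genomic predictions could look like the base-R sketch below; the traits, weights, and values are invented and carry no recommendation.

```r
# Predicted genotypic values (GEBVs) for three traits of five candidates (invented)
gebv <- data.frame(
  yield   = c( 1.2, 0.4, -0.3, 0.9, -1.1),
  height  = c(-0.5, 0.2,  0.8, 0.1,  0.6),
  disease = c( 0.7, 1.0, -0.2, 0.3, -0.8)    # higher = more resistant
)

w <- c(yield = 0.6, height = 0.1, disease = 0.3)   # illustrative economic weights

# Standardize each trait, then combine into a single selection index
index <- as.matrix(scale(gebv)) %*% w
rank(-index)                                       # candidate ranking by the index
```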

Therefore, GS models should be managed so that heritabilities are maximized as much as possible to optimize selection efficiency. This is generally achieved in two ways: (i) by increasing genetic variation; and/or (ii) by reducing environmental variance through better residual control or a larger number of replications. In the early stages of a program, whether it is autogamous or allogamous, annual or perennial, additive effects (α) are most valued, as these stages involve crossing, recombination, and selection (non-additive effects tend to be less influential here). However, in the later stages of the program, non-additive effects are also desired, since the generated cultivars are usually hybrids with some degree of heterosis (a dominance phenomenon) and genetically more uniform materials (Labroo et al. 2021). While inbreeding is a desirable, even indispensable, process in the stages of developing pure lines, if not well managed, the consequent inbreeding depression can pose a significant problem during the hybridization stages, particularly in F1-segregating (outcrossing) crosses in perennial species, such as forest and fruit-bearing species. Its critical impacts include decreased genetic diversity and vigor (the opposite of heterosis), increased susceptibility to diseases, and reduced reproductive success and overall population fitness.
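The effect of option (ii) can be visualized with the usual entry-mean heritability expression, \({h}^{2}={\sigma }_{g}^{2}/({\sigma }_{g}^{2}+{\sigma }_{e}^{2}/r)\), where \(r\) is the number of replications; the variance components in the short sketch below are illustrative.

```r
# Entry-mean heritability as a function of the number of replications
var_g <- 1.0          # genetic variance (illustrative)
var_e <- 4.0          # residual variance (illustrative)
reps  <- 1:10

h2_mean <- var_g / (var_g + var_e / reps)
round(data.frame(reps, h2_mean), 2)   # heritability rises as replication increases
```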

In this regard, using Fig. 4B as a reference, one can employ a model for additive genomic prediction, which will be effective in predicting segregating individuals in the early stages of the program, even if the goal is to obtain pure lines. It is also possible to use models that incorporate additivity and dominance when considering only the materials at the end of the program. A modern approach involves integrating data from all stages, maximizing and interconnecting the entire selection process. Caution is advised, however, when adopting AI models in GS to capture purely additive effects: AI-based GS models may inherently capture non-additive (i.e., non-linear) effects unless their architectures are ultra-simplified, in which case traditional linear models can provide similar results.
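As a sketch of how the additive and dominance components could be parameterized side by side, the code below builds both relationship matrices with Gmatrix() from {AGHmatrix}, following the VanRaden (additive) and Vitezica (dominance) parameterizations cited earlier; the toy data and the idea of fitting the two kernels jointly are assumptions for illustration only.

```r
# Additive and dominance genomic relationship matrices from the same SNP data
library(AGHmatrix)

set.seed(99)
M <- matrix(sample(0:2, 200 * 1000, replace = TRUE), nrow = 200)  # toy 0/1/2 matrix

Ga <- Gmatrix(SNPmatrix = M, method = "VanRaden")   # additive relationships
Gd <- Gmatrix(SNPmatrix = M, method = "Vitezica")   # dominance relationships

# Both kernels can then enter the mixed model as separate random effects
# (e.g., two covariance matrices in a REML or Bayesian solver).
```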

The combination of different sources of phenotypic data offers several advantages, starting with the use of various environments, which allows for predicting behavior in terms of stability and adaptability of genotypes. Furthermore, since it involves the same genetic base, the effective population size (\({{\varvec{N}}}_{{\varvec{e}}}\)) usually does not change drastically, but the total population size (\({\varvec{N}}\)) increases, allowing for the best of both worlds: high variability in the initial populations of the program and a greater number of experimental repetitions in the final stages. This has a direct impact on increasing the predictive capacity of the model.

The incorporation of multi-omic approaches, often combined with genomics, is also a powerful tool in the genetic prediction of plants. These approaches can integrate transcriptomic, proteomic, metabolomic, phenomic, genomic, and other omic data to predict the phenotypes of the plant individuals under study. Aggregating information at a level closer to the final phenotype can improve the predictive capacity of the models, as demonstrated by the use of exomic markers, which lie in regions effectively translated into proteins (Hashmi et al. 2015). With other molecular structures, the logic is similar. For example, the prediction of flavor compounds (sugars, acids, and volatiles) in blueberries and tomatoes based on metabolomics shows very promising results (Colantonio et al. 2022), as does genomics for coffee taste (Ferrão et al. 2023). Approaches combining transcriptomics, proteomics, metabolomics, and functional genomics have also been used to study abiotic stress in vegetables (Zhuang et al. 2014). Performing “genomic” prediction with phenomics data (using NIR) is also remarkable (Robert et al. 2022). The joint analysis of these different layers of information increases the predictive accuracy of the models and enables the prediction of traits highly influenced by the environment, or even subjective ones, such as the taste of agricultural products.

Final considerations

Advancements in allogamous plant genomics are characterized by the ongoing evolution of molecular tools and the gradual yet significant reduction in genotyping costs. This trend widens the applicability of these technologies to an even broader range of samples within genetic improvement programs. While there is a leaning towards using an increasing number of genomic markers in breeding analyses, potentially involving hundreds of thousands of SNPs, some researchers advocate frameworks with a lower number of markers supplemented with advanced data imputation techniques.

Genomics also plays a role in establishing more sustainable and cost-effective production processes, especially through the development of cultivars with increased disease resistance, particularly when these diseases follow an oligogenic pattern of expression. In addition, the growing integration of artificial intelligence and machine learning techniques in genomic analysis aims to automate and optimize the interpretation of extensive datasets.

There is also a growing trend of incorporating large-scale phenotyping data, such as those obtained through drones and NIRS, into genomic selection models. Alternatively, models analogous to genomic prediction, but fed with data from other omics that characterize similarities between samples, are being considered for predictive purposes. Furthermore, the integration of envirotypic, or enviromic, data is also emerging as a rising practice to better address variations stemming from genotype–environment interactions.

GS is a statistical technique for crop breeding. Its application requires careful planning and pragmatic evaluation, especially regarding validation across generations and the management of the base population used to train the models. Selection strategies should take into account the program phase and model complexity. In some cases, a pedigree-based model can be as effective as GS, but genomics excels when kinship information is limited. Undeniably, genomic models represent an elegant and advanced tool in crop breeding.