Introduction

The steady growth of the world population, expected to reach 9–11 billion by 2050, along with climate change and soil deterioration, are major challenges to achieving world food security (Kopittke et al. 2019; Röös et al. 2017). Biotic and abiotic stresses caused by pathogens, animals, weeds, drought, extreme temperatures, flooding, salinity, acidic conditions, and nutrient starvation all reduce global agricultural productivity (Tyczewska et al. 2018). Plant breeding represents one of the main ways to alleviate these problems and improve both crop production and productivity (Bhat et al. 2016). Plant breeding uses two main approaches, conventional and molecular breeding. Conventional breeding mainly uses phenotypic data (Borrelli et al. 2015) and has several limitations, including the long time (> 10 years) needed to release a new variety, confounding environmental effects leading to low heritability for many traits of interest, particularly the most complex ones, like yield. Molecular plant breeding using DNA markers includes quantitative trait loci (QTL)-based marker-assisted selection (MAS) that can greatly increase the speed, efficiency, and precision of breeding compared to conventional methods (Gupta et al. 2010). However, QTL-based MAS is efficient only for traits controlled by a few QTLs that have a major effect on trait expression, whereas for complex quantitative traits governed by a large number of minor QTLs, such as yield, it may be less efficient than conventional phenotypic selection (Bhat et al. 2016). For complex traits, the most efficient molecular breeding strategy available today is genomic selection (GS) (Hickey et al. 2019). GS is a form of MAS in which genetic markers covering the whole genome are used so that all QTL are in linkage disequilibrium (LD) with at least one marker (Goddard and Hayes 2007; Heffner et al. 2009; Isik 2014; Meuwissen et al. 2001). 
GS has emerged as one of the most promising selection strategies to enhance genetic gain per unit time and/or unit cost for both plant and animal breeding programs (Fugeray-Scarbel et al. 2021; Merrick et al. 2022; Mrode et al. 2019; Voss-Fels et al. 2019; Wartha and Lorenz 2021; Xu et al. 2020). In dairy cattle, GS doubled the rate of genetic progress (Wiggans et al. 2017). In plants, GS is progressively integrated into breeding schemes and is now routinely used for major crops, in particular in the private sector (Merrick et al. 2022; Varshney et al. 2017; Voss-Fels et al. 2019). For instance, GS played a key role in the development of drought-tolerant maize hybrids that gave higher yields under both favorable and water stress conditions in the western US Corn Belt (Merrick et al. 2022; Voss-Fels et al. 2019). GS has also been applied on a large scale at the International Maize and Wheat Improvement Center since 2010, where it is used in spring wheat to discard low-performing lines (Merrick et al. 2022).

The first step in GS is creating a training set (or training population). The training set is genotyped and phenotyped for the targeted traits, and a prediction model is then built using these genotypic and phenotypic data. Several high-throughput genotyping platforms made possible by next-generation sequencing (NGS), such as SNP arrays (LaFramboise 2009; Wang et al. 1998), genotyping-by-sequencing (Elshire et al. 2011), and whole-genome sequencing (Ni et al. 2017), have facilitated the production of large numbers of single nucleotide polymorphism (SNP) markers for use in GS at an affordable cost. The target population is also genotyped but not phenotyped, and the prediction model calculates the genomic estimated breeding values (GEBVs) or, when non-additive effects are taken into account, the total genomic estimated genotypic values (GEGVs) of the selection candidates (Grattapaglia et al. 2018). The efficiency of GS is determined, in particular, by its accuracy, which is defined as the correlation between the predicted and the true (unknown) genetic value of the selection candidates (Lorenz et al. 2011). GS accuracy is affected by the effective size of the population, marker density and type, the size and structure of the training population, the genetic architecture of the traits, relatedness between the training and target population, LD between markers and QTLs, trait heritability, imputation method, etc. (Grattapaglia and Resende 2011; Isik 2014; Robertsen et al. 2019).

Tropical perennial crops and plantation trees are of huge importance for the human population, in particular for use as food, timber, pulp, and stimulant crops (Jamnadass et al. 2016). However, their productivity is generally well below their potential, in particular, due to biotic and abiotic constraints, as shown, for example, in Eucalyptus (Elli et al. 2019), oil palm (Pirker et al. 2016; Woittiez et al. 2017), coffee (Wang et al. 2015), and cocoa (Aneani and Ofori-Frimpong 2013). Applying more efficient breeding approaches to these species will help fill production gaps. Genomic selection is particularly attractive for perennial plant species as they have long generation intervals and low selection intensity. Isik (2014) showed that the impact of GS could be much greater in perennial forest trees than in any other crop or livestock breeding program. A significant number of articles on GS have already been published on a variety of traits of interest in several tropical perennial crops and plantation trees, for instance, yield in oil palm (Cros et al. 2017, 2015), rubber tree (Cros et al. 2019) and guava (Silva et al. 2021), growth in eucalyptus (Bouvet et al. 2016; Denis et al. 2012; Resende et al. 2012) and rubber tree (Souza et al. 2019), fruit quality in citrus (Minamikawa et al. 2017), resistance to diseases in cocoa (McElroy et al. 2018; Romero Navarro et al. 2017), etc. (Supplementary Table S1). However, a review of GS in these species is lacking. The objective of the present article is therefore to review the results of GS research in tropical perennial crops and plantation trees, to discuss the main factors affecting GS accuracy and to highlight the genetic gains expected in these species using this approach. We focus on perennial crops defined as such according to the FAO indicative crop classification (FAO 2015) and on plantation trees both grown in the tropics. 
The products of the corresponding species include fruit, timber, pulp, latex, oil, nuts, and stimulants. To our knowledge, the species covered by published articles on GS so far are banana, guava, citrus, Eucalyptus species (E. urophylla, E. grandis, E. benthamii, E. pellita, and E. robusta), rubber tree, oil palm, jatropha, cacao, and coffee.

Factors affecting the accuracy of genomic selection

The correlation between the GEBVs and true breeding values is known as GS accuracy (\(r_{GS}\)), and it is a key parameter for breeders because annual genetic gain \(R_y\) is linearly related to selection accuracy (Eq. (1)) (Grattapaglia et al. 2018):

$$R_{y}=\frac{i\times r\times \delta_{A}}{y}$$
(1)

where i is selection intensity, r is selection accuracy, δA is the additive genetic standard deviation, and y is the generation interval in years.
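
As a minimal sketch of the arithmetic in Eq. (1), the function below computes the annual genetic gain from its four components. The function name and the numerical values are hypothetical and serve only to illustrate the calculation; they are not taken from any of the studies cited here.

```python
def annual_genetic_gain(i, r, delta_a, y):
    """Annual genetic gain R_y = (i * r * delta_A) / y, as in Eq. (1).

    i       : selection intensity
    r       : selection accuracy
    delta_a : additive genetic standard deviation (trait units)
    y       : generation interval in years
    """
    if y <= 0:
        raise ValueError("generation interval y must be positive")
    return i * r * delta_a / y


# Hypothetical values, chosen only to illustrate the arithmetic.
gain = annual_genetic_gain(i=1.75, r=0.5, delta_a=2.0, y=5.0)
print(round(gain, 3))
```

Because \(R_y\) is proportional to r and inversely proportional to y, halving the generation interval doubles the annual gain at constant accuracy.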

GS accuracy is usually obtained by k-fold cross-validation within a single experimental design (with each fold repeatedly used as a validation set and the remaining folds as the training set) or between experimental designs (with one site used for training and the other for validation), the latter being preferable as cross-validations may overestimate accuracy (Lorenz et al. 2011).
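
The within-design k-fold procedure can be sketched as follows, assuming NumPy is available. Everything here is illustrative: the data are simulated, the ridge penalty `lam` is an arbitrary value, and the function names are hypothetical. With real data, the correlation is computed against observed phenotypes (often called predictive ability); the `target` argument is used here only because the simulation also provides the true genetic values. The between-design validation described above is not covered by this sketch.

```python
import numpy as np


def ridge_marker_effects(Z, y, lam):
    """Ridge estimates of marker effects: (Z'Z + lam * I)^-1 Z'y."""
    k = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(k), Z.T @ y)


def kfold_accuracy(Z, y, target=None, k=5, lam=100.0, seed=0):
    """Mean correlation between predictions and `target` over k folds.

    Each fold is used once as the validation set while the remaining
    folds form the training set. `target` defaults to the phenotypes y.
    """
    target = y if target is None else target
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    accs = []
    for val in folds:
        train = np.setdiff1d(np.arange(n), val)
        # Centre with training-set statistics only, to avoid information leakage.
        mu, z_mean = y[train].mean(), Z[train].mean(axis=0)
        m = ridge_marker_effects(Z[train] - z_mean, y[train] - mu, lam)
        pred = mu + (Z[val] - z_mean) @ m
        accs.append(np.corrcoef(pred, target[val])[0, 1])
    return float(np.mean(accs))


# Simulated data (assumption): 300 individuals, 200 SNPs coded 0/1/2, 20 QTLs.
rng = np.random.default_rng(42)
n, p, n_qtl = 300, 200, 20
Z = rng.binomial(2, 0.5, size=(n, p)).astype(float)
effects = np.zeros(p)
effects[rng.choice(p, n_qtl, replace=False)] = rng.normal(size=n_qtl)
g = Z @ effects                               # true genetic values
y = g + rng.normal(scale=g.std(), size=n)     # heritability close to 0.5

print(kfold_accuracy(Z, y))             # correlation with phenotypes
print(kfold_accuracy(Z, y, target=g))   # correlation with true genetic values
```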

Below, we present sequentially the major factors that affect the accuracy of genomic predictions, although most factors are interconnected and their effects are not independent.

Statistical models for genomic prediction and trait genetic architecture

The whole-genome regression models used for genomic predictions must deal with the “large p, small n” problem: in GS, the number of markers (p) usually far exceeds the number of data records (n). Ordinary multiple linear regression cannot be used in this setting without variable selection, which conflicts with the original goal of GS, i.e., avoiding marker selection and overfitting. With more marker effects than records, multiple linear regression lacks the degrees of freedom needed to estimate all marker effects simultaneously, which leads to poor predictions, and the problem is exacerbated by multicollinearity among markers. A wide range of statistical methods has been developed for GS to alleviate this constraint (Campos et al. 2013; Jannink et al. 2010; Montesinos-López et al. 2021; Morota and Gianola 2014; Tong and Nikoloski 2021; Wang et al. 2018). They represent two broad categories: (i) parametric approaches, which mainly include methods that rely on the best linear unbiased prediction methodology (genomic BLUP [GBLUP] and random regression BLUP [RRBLUP]) and various Bayesian methods (Bayesian LASSO, BayesA, BayesB, etc.), and (ii) semi- and non-parametric approaches that fall into the machine learning category (reproducing kernel Hilbert spaces [RKHS], artificial neural networks, etc.). 
These methods differ in several ways: in terms of genetic assumptions and modeling of the genetic architecture of the traits (e.g., purely additive models, models that explicitly model dominance and/or epistatic effects, models with marker effects sampled from a common statistical distribution [RRBLUP, GBLUP], models with marker effects sampled from specific distributions [Bayesian LASSO, BayesB, etc.], models that implicitly model non-additive effects [e.g., RKHS]), in terms of computational approach (relationship-based methods and marker effect-based methods, single trait and multi-trait models, etc.), and in terms of the genomic information used in the model (type of polymorphisms, use of a priori information on markers, a combination of omics data, etc.).

The most widely used statistical approach for GS is GBLUP (Heslot et al. 2015; Montesinos-López et al. 2021), which combines linear mixed model analysis and genomic relationships. GBLUP derives from the first BLUP analyses applied in animal breeding to implement selection based on phenotypes and pedigree and that estimated the breeding values of individuals using the pedigree-based relationship matrix (A) (Henderson 1975), with a model of the form:

$$Y=X\beta+Zu+e$$
(2)

where \(Y\) is an n × 1 vector of data records, X is an n × p incidence matrix relating data records with fixed effects, β is a p × 1 vector of fixed effects, and Z is an n × q incidence matrix. u is a q × 1 vector of random effects (i.e., breeding values), associated with A, and e is an n × 1 vector of residual effects. This initial approach, which we term pedigree-based BLUP (PBLUP), paved the way for GBLUP, which uses the genomic relationship matrix (G), thus capturing realized relationships among individuals rather than expected relationships (Bernardo 1994; VanRaden 2007). An alternative approach to GBLUP is RRBLUP (Meuwissen et al. 2001), which yields GEBVs by estimating marker effects. GBLUP and RRBLUP are equivalent when there are many QTLs, when there is no major QTL, or when the QTLs are evenly distributed along the genome (Bernardo 2020). RRBLUP uses a model of the form:

$$Y=X\beta+Z'm+e$$
(3)

where Z’ is an n × k incidence matrix giving the genotypes at k SNPs and m a k × 1 vector of random SNP effects.
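
The link between Eqs. (2) and (3) can be illustrated numerically. The sketch below, which assumes NumPy and simulated genotypes, computes GEBVs once through marker effects (RRBLUP) and once through a genomic relationship matrix (GBLUP), and checks that the two sets coincide when both models use the same centred marker matrix and the same ratio of residual to marker-effect variance. Several simplifications are assumptions of this sketch: the only fixed effect is the mean (removed by centring), every record corresponds to one genotyped individual (so Z in Eq. (2) is an identity matrix), the variance ratio `lam` is an arbitrary value rather than an estimate, and G is built with a VanRaden-style scaling.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 100                                   # individuals, SNPs (hypothetical)
freqs = rng.uniform(0.1, 0.9, size=k)
geno = rng.binomial(2, freqs, size=(n, k)).astype(float)   # genotypes coded 0/1/2

Zp = geno - geno.mean(axis=0)                    # centred marker matrix (Z' in Eq. 3)
y = rng.normal(size=n)
y = y - y.mean()                                 # mean treated as the only fixed effect
lam = 50.0                                       # residual variance / marker variance (assumed)

# RRBLUP: estimate the k marker effects, then sum them up for each individual.
m_hat = np.linalg.solve(Zp.T @ Zp + lam * np.eye(k), Zp.T @ y)
gebv_rrblup = Zp @ m_hat

# GBLUP: work directly on the n x n genomic relationship matrix.
p_hat = geno.mean(axis=0) / 2.0
c = 2.0 * np.sum(p_hat * (1.0 - p_hat))          # VanRaden-style scaling constant
G = Zp @ Zp.T / c
# Genetic variance is c times the marker variance, so the variance ratio becomes lam / c.
gebv_gblup = G @ np.linalg.solve(G + (lam / c) * np.eye(n), y)

print(np.allclose(gebv_rrblup, gebv_gblup))
```

The practical difference lies in the size of the system to solve: k × k for RRBLUP versus n × n for GBLUP, which is why relationship-based formulations are convenient when markers greatly outnumber individuals.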

The relative performance of the different statistical methods is expected to vary depending on the genetic architecture of the trait considered (Lebedev et al. 2020). Genetic architecture corresponds to the genetic characteristics that determine the genotype–phenotype relationship, in particular, the number of genes that control the trait, the number of alleles per gene, the distribution of the genes along the genome, the distribution of the gene effects, and the mode of gene action (additive, dominant, epistatic) (Momen et al. 2018). Thus, methods in which marker effects are sampled from distributions where variance is the same for all markers (e.g., GBLUP, RRBLUP, Bayesian random regression) are expected to be more suitable for traits following the infinitesimal model, while methods with marker-specific variances (e.g., Bayesian LASSO, BayesB) are expected to be more suitable for traits whose genetic architecture includes major QTLs. Consequently, many GS studies, including those on tropical perennial fruit crops and plantation trees, use a range of statistical prediction methods to identify the most appropriate one for a specific trait. Overall, little variation has been found among statistical approaches, for example, in oil palm yield components (Cros et al. 2015; Kwong et al. 2017a), in eucalyptus growth (Durán et al. 2017; Müller et al. 2017), and in rubber tree latex yield (Cros et al. 2019). This confirms results obtained in empirical evaluations in other species, in which GS statistical methods were seen to perform similarly (Heslot et al. 2015); however, in some cases, differences were found: e.g., BayesB performed best for several traits including vegetative growth, production, and disease resistance in banana (Nyine et al. 2018) and vegetative growth and oil yield in oil palm (Ithnin et al. 2017). This could mean that, in the populations considered, QTLs with large effects were segregating for these traits.

Similarly, when non-additive effects play a significant role in genetic variation, models that account for non-additive effects are expected to increase GS accuracy. In a simulation study, Denis and Bouvet (2013) showed that modeling dominance for the genomic predictions of the genetic value of eucalyptus clones improved accuracy when dominance effects were preeminent (ratio of dominance to the additive variance of 1.0) and heritability was high (H2 = 0.60). With empirical data, also in eucalyptus, Resende et al. (2017), Tan et al. (2018), and Paludeto et al. (2021) showed that the use of GS models that account for dominance increased the accuracy of prediction for growth traits, which had high levels of dominance variance, whereas this was not the case for wood traits. In citrus, Minamikawa et al. (2017) showed that considering both additive and dominance effects improved prediction accuracy for acidity and juiciness.

When considering traits correlated with a sufficient magnitude but with contrasting levels of heritability, the use of multi-trait models can increase prediction accuracy for low heritability traits (Tong and Nikoloski 2021). In tropical perennial crops and plantation trees, the results obtained in oil palm (Marchal et al. 2016) and Eucalyptus robusta (Rambolarimanana et al. 2018) agreed with this principle. Multivariate models thus offer the opportunity to improve prediction accuracy at no extra cost (apart from increased computational resources), and they should therefore be systematically evaluated when correlations exist among the traits of interest, or between the traits of interest and secondary traits.

Machine learning methods are complex black-box approaches that are of growing interest for genomic predictions as they have several desirable features. They avoid the use of assumptions that are often violated and cannot be verified (Gianola and Van Kaam 2008), and they are particularly suitable to account for non-additive effects in particular in polyploids (Bayer et al. 2021) and to integrate data from different biological sources for multi-omics predictions (Montesinos-López et al. 2021; Tong and Nikoloski 2021). RKHS is the most often evaluated machine learning approach for GS in tropical perennial crops and plantation trees. In bananas, RKHS was slightly more accurate than parametric approaches for a few traits (Nyine et al. 2018). In a study analyzing eight traits in E. urophylla × E. grandis eucalyptus hybrids, RKHS proved to be slightly more accurate in predicting low-heritability traits but less accurate in predicting pulp yield (Tan et al. 2017) and performed similarly to GBLUP for three traits in E. grandis (Rambolarimanana et al. 2018). A few other machine learning methods have been implemented in tropical perennial crops and plantation trees. Maldonado et al. (2020) compared several parametric prediction models, RKHS and two artificial neural network approaches, deep learning and Bayesian regularized neural networks, in E. globulus and maize, and found that predictions made with deep learning methods were significantly more accurate for all the traits considered. Sousa et al. (2020) compared several machine learning approaches and a parametric model to predict resistance to leaf rust in Coffea arabica and obtained the best accuracy with artificial neural networks. Several authors used random forest in oil palm and citrus and found that, on average over several traits, random forest performed no better than parametric approaches (Kwong et al. 2017b; Minamikawa et al. 2017). 
In oil palm, the support vector machine was found to be slightly better on average than other methods (Kwong et al. 2017b). Despite these uneven results in tropical perennial crops and plantation trees, machine learning should be further investigated, in particular as the training populations used so far were possibly not large enough for the optimal training of this type of approach (Montesinos-López et al. 2021). Particular attention should also be paid to artificial neural networks, which have produced promising results.

One limitation of the differences among statistical methods and models reported so far in perennial fruit and tree crops is that they were not always supported by a statistical test indicating whether they were significant. Such a test can be performed, for example, using the Hotelling-Williams t-test (Steiger 1980).
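
As an illustration, the statistic of this test for two dependent correlations that share one variable can be computed as sketched below. The indexing is an assumption of this sketch: variable 1 is the reference value (e.g., the observed phenotype or estimated genetic value in the validation set), and variables 2 and 3 are the predictions of the two models being compared, so that r12 and r13 are the two accuracies and r23 is the correlation between the two sets of predictions. The formula is the one commonly attributed to Williams as presented by Steiger (1980); the function name and the example values are hypothetical.

```python
import math


def williams_t(r12, r13, r23, n):
    """t statistic and degrees of freedom for H0: rho12 == rho13.

    r12, r13 : correlations of the reference variable with each prediction
    r23      : correlation between the two sets of predictions
    n        : number of validation individuals (must exceed 3)
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    # Determinant of the 3 x 3 correlation matrix.
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    r_bar = (r12 + r13) / 2
    denom = 2 * ((n - 1) / (n - 3)) * det + r_bar**2 * (1 - r23) ** 3
    t = (r12 - r13) * math.sqrt((n - 1) * (1 + r23) / denom)
    return t, n - 3


# Hypothetical accuracies of two models evaluated on the same 100 individuals.
t_stat, df = williams_t(r12=0.60, r13=0.50, r23=0.90, n=100)
print(round(t_stat, 2), df)
```

The statistic is compared to a Student t distribution with n − 3 degrees of freedom. Because the two models are evaluated on the same individuals, their predictions are usually highly correlated (large r23), which is precisely why a test for dependent correlations is needed.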

Linkage disequilibrium and effective size

Linkage disequilibrium (LD) between markers and QTLs and effective population size (Ne) have interrelated effects that strongly influence GS accuracy (Heffner et al. 2009; Isik 2014; Lebedev et al. 2020). LD is defined as the non-random association of alleles at two or more loci in haplotypes (Slatkin 2008; Weir 1979). LD between two loci is measured based on the frequency of alleles, using indexes like D, D’, and r² (Collins 2007). A key assumption in GS is that there is LD between QTLs and markers, such that, with dense genome marker coverage, every QTL controlling the phenotype of interest would be in LD with at least one marker. Good knowledge of this parameter in the target population is therefore of particular interest to define the marker density required for GS. It is thus useful to explore historical events, such as bottlenecks, genetic drift, and natural and artificial selection, that may have shaped the LD profile in the target population (Flint-Garcia et al. 2003; Gupta et al. 2005; Mackay and Powell 2007; Slatkin 2008). The LD profile is largely determined by the past Ne, which can be described as the number of randomly mating individuals in a population that would give rise to the observed rate of inbreeding (Falconer and Mackay 1996). There is an inverse relationship between Ne and LD, with high rates of genetic drift and inbreeding in low Ne populations leading to strong LD between markers and QTLs compared to high Ne populations (Grattapaglia 2014; Lin et al. 2014; Thistlethwaite et al. 2020). As Ne decreases and LD increases, pairs of individuals within the population tend to share longer haplotypes, enabling good genomic prediction accuracy (Clark et al. 2012; Heffner et al. 2009; Isik 2014; Lebedev et al. 2020). For a given marker density, training population size, and trait, LD and GS prediction accuracy are higher in populations with low Ne than in populations with high Ne (Grattapaglia 2014; Lin et al. 2014; Solberg et al. 2008).
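
For two biallelic loci, the three LD indexes mentioned above can be computed from phased haplotypes as in the following sketch. The function name, the 0/1 allele coding, and the toy haplotypes are assumptions made for illustration. D’ is returned here as an absolute value, a common convention.

```python
def ld_stats(haplotypes):
    """Return (D, |D'|, r^2) for two biallelic loci.

    haplotypes : list of (a, b) tuples, one per phased haplotype,
                 with alleles coded 0/1 at locus A and locus B.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n            # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n            # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    d = p_ab - p_a * p_b                               # raw disequilibrium coefficient
    # Largest value |D| can take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Complete LD: allele 1 at locus A is always found with allele 1 at locus B.
print(ld_stats([(1, 1)] * 5 + [(0, 0)] * 5))
# Linkage equilibrium: the four haplotypes are equally frequent.
print(ld_stats([(1, 1), (1, 0), (0, 1), (0, 0)]))
```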

The crucial role of LD and Ne in GS accuracy has also been underlined in studies on tropical perennial crops and plantation trees. Several studies investigated the LD profile to evaluate whether the marker density was high enough in citrus (Gois et al. 2016; Minamikawa et al. 2017), cocoa (McElroy et al. 2018), eucalyptus (Denis and Bouvet 2013; Durán et al. 2017; Müller et al. 2017), and oil palm (Kwong et al. 2017a). Many studies in tropical perennial crops and plantation trees also investigated the efficiency of GS in populations with high LD/low Ne. This was possible using populations obtained through specific mating designs among a reduced number of parents (Denis and Bouvet 2013; Resende et al. 2012). In this way, Resende et al. (2012) found that in a population of eucalyptus where Ne = 11 was obtained with an incomplete diallel, GS accuracy was higher for the four growth and wood quality traits studied than in the population where Ne = 51, despite a slightly larger number of training individuals in the latter population. In other studies, high LD/low Ne was obtained in full-sib families GS (Cros et al. 2017; de Souza et al. 2018; Gois et al. 2016; Kwong et al. 2017b). This strategy is also applied in other crops as it maximizes GS accuracy, although at the cost of only applying to families comprising the training population (Crossa et al. 2017; Lebedev et al. 2020; Lin et al. 2014).

The fact that GS accuracy reaches a plateau when marker density reaches a certain level (see below) suggests that an appropriate strategy to filter the markers would increase the cost-efficiency of GS. Filtering SNPs on LD has been investigated in several studies, as the SNPs that show very high LD values provide redundant information. In oil palm, Kwong et al. (2017a) evaluated the impact of marker density reduction by LD filtering and noted that, for some traits, it was possible to reach the same GS accuracy as using all the SNPs.
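
One simple way to filter SNPs on LD is a greedy pruning in which a SNP is retained only if its r² with every SNP already retained stays below a threshold. The sketch below, which assumes NumPy, is a generic illustration and not the procedure used in the studies cited above. It makes several simplifying assumptions: r² is computed from genotype dosages (0/1/2) rather than from phased haplotypes, all pairs of SNPs are compared whereas practical tools usually work within sliding windows along the genome, monomorphic SNPs are assumed to have been removed beforehand, and the threshold `r2_max` is an arbitrary value.

```python
import numpy as np


def ld_prune(geno, r2_max=0.8):
    """Greedy LD filter on an (individuals x SNPs) dosage matrix.

    Returns the column indices of the SNPs kept. A SNP is kept only if its
    squared correlation with every SNP already kept is <= r2_max.
    """
    r2 = np.corrcoef(geno, rowvar=False) ** 2     # pairwise r^2 between SNP columns
    kept = []
    for j in range(geno.shape[1]):
        if all(r2[j, i] <= r2_max for i in kept):
            kept.append(j)
    return kept


# Toy genotypes: SNP 1 duplicates SNP 0, SNP 2 carries different information.
geno = np.array([
    [0, 0, 2],
    [1, 1, 0],
    [2, 2, 1],
    [0, 0, 1],
    [1, 1, 2],
    [2, 2, 0],
], dtype=float)
print(ld_prune(geno, r2_max=0.8))
```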

Marker density and marker type

As marker density determines how much of the LD between markers and QTLs is captured, it also plays a major role in GS accuracy. In GS studies of both plants and animals, increasing the number of markers was shown to improve prediction accuracy until a plateau was reached (Isik 2014; Lin et al. 2014; Meuwissen et al. 2001; Robertsen et al. 2019; Solberg et al. 2008). The same trend was observed in tropical perennial crops and plantation trees, where the density of markers required to reach maximum prediction accuracy depends in particular on the type of population, trait, and marker. Romero Navarro et al. (2017) found increasing prediction accuracy for yield and disease traits in cocoa with increasing marker density before a plateau was reached at around 1000 markers. In the rubber tree, the prediction accuracy for rubber yield plateaued at around 300 SSRs (Cros et al. 2019). In eucalyptus, the prediction accuracy among five growth and wood property traits reached a plateau between 5000 and 20,000 SNPs (Tan et al. 2017). Among seven production traits in oil palm hybrids, the plateau was reached with 500 to 2000 SNPs (Cros et al. 2017).

GS accuracy is also affected by the type of marker. Thus, in oil palm, GS accuracy for bunch number and average bunch weight plateaued at 160 SSRs in heterotic group A and at 90 SSRs in group B (Marchal et al. 2016) versus 3000 SNPs in group A and 350 SNPs in group B (Cros et al. 2017). This likely resulted from the fact that, as SNPs are biallelic, they are less informative than SSRs. However, in practice, SSRs cannot be used for genomic predictions, as GS relies on dense genotyping of large populations of selection candidates and therefore requires high throughput genotyping approaches at a reasonable cost. If marker density is constrained by the genotyping approach, the GS accuracy may be reduced. Thus, Kwong et al. (2017b) obtained mean GS prediction accuracies of 0.21 over palm oil yield components using 135 SSRs, versus 0.31 with 200 K SNPs.

Two primary options are available to reach the high marker density required for GS: methods that reduce genome complexity and SNP arrays (Edwards et al. 2013; Wiggans et al. 2017). They were made possible by the development of NGS technologies, which became available between 2004 and 2006 (Hu et al. 2021). Less expensive and with much higher throughput than the Sanger method (Sanger and Coulson 1975; Sanger et al. 1977), NGS methods have made it possible to carry out high-density and high-throughput genotyping, i.e., with good genome coverage in large populations, at an affordable cost. SNP arrays have been developed in several tropical perennial crops and plantation trees, with, for example, a 200 K array in oil palm (Kwong et al. 2016), a 60 K array in eucalyptus (Silva-Junior et al. 2015), and a 15 K array in cacao (McElroy et al. 2018). Most SNP genotyping methods based on reducing genome complexity consist of restriction enzyme-based approaches and sequence capture (Uitdewilligen et al. 2013; Zhou and Holliday 2012). These methods do not require specific preliminary investment and can be applied directly to any population. Given their relative simplicity and lower cost compared to SNP arrays, they became widely used, in particular for introgression breeding, genome-wide association mapping (GWAS), and QTL mapping (see, e.g., Kitony et al. (2021) and Reyes et al. (2021) in rice, Pootakham et al. (2015) in oil palm, or Chia Wong et al. (2022) in cacao). However, they are associated with a higher rate of missing data and genotyping errors than SNP arrays. Despite these differences, it seems that the choice between these two types of approaches has no impact on GS accuracy: The accuracy of genomic prediction of 13 wood quality and growth traits in eucalyptus using SNP genotypes obtained with sequence capture and a 60 K SNP array was similar (de Moraes et al. 2018).

Training and validation population relatedness

The accuracy of GS is positively correlated with the relatedness between the training and test population (Daetwyler et al. 2013; Isidro y Sánchez and Akdemir 2021; Pszczola et al. 2012; Wientjes et al. 2013). This is because when pairs of genotypes are closely related, they tend to share long haplotype blocks in the same linkage phase. To limit allele duplication and redundancy, relationships within the training population should be minimized (Isidro y Sánchez and Akdemir 2021). The accuracy of GS in tropical perennial crops and plantation trees was also found to be affected by the relatedness between the training and test population. In two eucalyptus species, E. benthamii and E. pellita, Müller et al. (2017) found that prediction accuracy declined strongly for three growth traits when individuals were assigned to the training and validation populations using a principal component analysis to minimize relatedness between the two populations, compared to when they were assigned randomly. Similarly, considering eight wood growth and quality traits in Eucalyptus urophylla × E. grandis, Tan et al. (2017) obtained the worst prediction accuracies when minimizing the relatedness between the training and validation populations using k-means clustering. In another study, a significant positive correlation was found between GS accuracy and the relationship between training and validation populations for various production traits in oil palm (Cros et al. 2015).

Size and design of the training population

The size of the training population is one of the most important factors that determine GS accuracy. Several GS studies have reported that increasing the size of the training population improves GS accuracy (Calleja-Rodriguez et al. 2020; Cericola et al. 2018; Combs and Bernardo 2013; Isidro et al. 2015; Liu et al. 2018; Nielsen et al. 2016; Tan et al. 2017). In a family of full-sibs of Hevea brasiliensis, Cros et al. (2019) reported an increase in the accuracy of GS for rubber yield with an increase in the size of the training population up to a plateau of 200 individuals. In Eucalyptus, Denis and Bouvet (2013) also reported an increase in GS accuracy as a result of increasing the size of the training population, and Tan et al. (2017) reported an increase in GS accuracy that followed a diminishing return trend with increasing size of the training population.

The possibility of assembling large training populations varies considerably among tropical perennial crops and plantation trees. Thus, training populations comprising more than 1000 individuals were used in eucalyptus (Mphahlele et al. 2021), cacao (McElroy et al. 2018), and oil palm (Kwong et al. 2017a), whereas only small populations (< 600 individuals) have been used so far in banana (Nyine et al. 2018), rubber tree (Cros et al. 2019; Munyengwa et al. 2021; Souza et al. 2019), coffee (Fanelli Carvalho et al. 2020; Ferrão et al. 2019; Sousa et al. 2019, 2020), jatropha (Peixoto et al. 2017), and guava (Silva et al. 2021). However, the size of the training population must be considered in relation to the relatedness between training and validation populations. Thus, for GS predictions in a biparental cross, it is better to use a relatively small but highly related training population of full-sibs or half-sibs than a large training population comprising distantly related or unrelated individuals (Brandariz and Bernardo 2019a; Brauner et al. 2020).

For some of the species considered here, breeding relies on a large number of phenotyped individuals, e.g., thousands of individuals for yield components and tolerance to ganoderma disease in oil palm (Cros et al. 2017; Daval et al. 2021) and thousands of individuals for tolerance to pests and diseases in Eucalyptus grandis (Mphahlele et al. 2021). In this case, genotyping a sample of the phenotyped population and making the genomic predictions using the single-step GBLUP approach (Lourenco et al. 2020), i.e., using a training population combining the genomic data of the genotyped individuals and the genealogical data of the others, is an efficient way to maximize the cost-efficiency of GS; see Mphahlele et al. (2021) in E. grandis, Cappa et al. (2019) in a complex eucalyptus population, and Imai et al. (2019) in citrus.

The cost of phenotyping is a major constraint in GS, especially now that sequencing costs have dramatically decreased thanks to next-generation sequencing (Akdemir and Isidro-Sánchez 2019). This financial constraint is particularly applicable to perennial crops, as their phenotypic evaluation requires large surface areas over several years. Thus, training populations need to be optimized to improve the cost-effectiveness of GS in these species. Training population optimization is the process of selecting, within a pool of individuals that could be used to train the GS model, a sample of individuals that will best predict the genetic value of the selection candidates (Isidro y Sánchez and Akdemir 2021). Several methods have been developed to optimize the training population, including CD-mean, PEV-mean, stratified sampling, and EthAcc (Isidro y Sánchez and Akdemir 2021). This aspect has received little attention in tropical perennial crops and plantation trees, although in oil palm, Cros et al. (2015) confirmed the efficiency of training population optimization to improve GS accuracy.

Trait heritability

The broad-sense heritability of a trait (H2) is defined as the proportion of the phenotypic variance that is genetically controlled. Narrow-sense heritability (h2) considers only variations due to additive gene action and ignores non-additive (dominance and epistasis) genetic effects (Falconer and Mackay 1996). In GS studies, the heritability of the trait affects the accuracy of GEBV, with higher h2 leading to greater GS accuracy (Hayes et al. 2009; Lin et al. 2014; Meuwissen et al. 2001). This was illustrated by studies in tropical perennial crops and plantation trees where positive correlations were found between h2 and GS prediction accuracy for a set of disease resistance and yield traits in cacao (Romero Navarro et al. 2017), eight palm oil production traits in the B heterotic group used in oil palm breeding (Cros et al. 2015), 18 Arabica coffee agronomic traits (Sousa et al. 2019), and 15 vegetative growth, disease resistance, and fruit production traits in banana (Nyine et al. 2018). When simulating GS in eucalyptus, Denis and Bouvet (2013) noted that the prediction accuracy was higher with H2 = 0.6 than with H2 = 0.1, regardless of the ratio of dominance to additive variance, whether dominance was modeled or not, or the breeding cycle. However, some studies detected no effect of trait heritability on GS prediction accuracy. In these cases, the effect may have been masked by other factors with stronger effects on prediction accuracy than heritability, in particular variation among traits in the size of the training population, as in Durán et al. (2017).

Genetic gain from genomic selection

Genetic gain from selection is defined as the improvement in the average genetic value of a population under the effect of selection over breeding cycles (Hazel and Lush 1942). GS has substantially increased genetic gain in animal breeding and plays a central role in many commercial plant breeding programs (Fugeray-Scarbel et al. 2021; Voss-Fels et al. 2019; Wartha and Lorenz 2021; Xu et al. 2020). The main advantages of GS over conventional phenotypic selection are its ability to (i) increase selection intensity and/or shorten the generation interval by replacing all or part of the phenotyping activities with genotyping in selected breeding cycles and (ii) increase accuracy for traits that are difficult to phenotype (Fugeray-Scarbel et al. 2021; Wartha and Lorenz 2021).

When GS is used to increase selection intensity or to shorten the breeding cycle, an increase in annual genetic gain can be obtained even though GS is less accurate than conventional phenotypic evaluation. This has been illustrated in studies of tropical perennial crops and plantation trees, which are promising for GS due to their long generation intervals and challenging phenotypic evaluations. Based on the relative accuracy of GS and phenotypic selection, Resende et al. (2012, 2017) demonstrated that GS could significantly increase annual genetic gain for growth and wood quality traits in eucalyptus, by 50% to 300%, because GS can be implemented at the seedling stage (< 1 year), i.e., much earlier than phenotypic selection, which cannot be carried out before the trees are at least three years old. Additionally, the possibility of increasing selection intensity by using a bigger population of selection candidates should further increase the advantage of GS over conventional selection. Based on 17 years of E. grandis breeding, Mphahlele et al. (2021) reported that the accumulated genetic gain with GS would be from 1.53 to 3.35 times higher than with conventional phenotypic selection, depending on the trait, because GS allows three breeding cycles in a 17-year period versus two with phenotypic selection. In coffee, it was also shown that with GS, 3-year breeding cycles would lead to a higher annual genetic gain in traits for growth, production, and tolerance to biotic stresses than the conventional 6-year phenotypic breeding cycles in Coffea arabica (Sousa et al. 2019) and in Coffea canephora (Alkimim et al. 2020). Similarly, an increase in annual genetic gain through a reduction in the generation interval with GS has been reported in citrus (Gois et al. 2016) and in rubber tree (Souza et al. 2019).
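The trade-off between accuracy and cycle length can be illustrated with the classical breeder's equation, in which the expected gain per year is i × r × σ_A / L (selection intensity, accuracy, additive genetic standard deviation, and cycle length in years). The sketch below is only illustrative: the function name and all numerical values are hypothetical and are not taken from the studies cited above.

```python
def annual_genetic_gain(intensity, accuracy, sigma_a, cycle_years):
    """Expected genetic gain per year: i * r * sigma_A / L."""
    if cycle_years <= 0:
        raise ValueError("cycle_years must be positive")
    return intensity * accuracy * sigma_a / cycle_years

# Hypothetical values chosen only for illustration
# (an intensity of 1.755 corresponds to selecting roughly the top 10%)
ps = annual_genetic_gain(intensity=1.755, accuracy=0.80, sigma_a=1.0, cycle_years=6)
gs = annual_genetic_gain(intensity=1.755, accuracy=0.60, sigma_a=1.0, cycle_years=3)

# GS is less accurate here, but halving the cycle gives 1.5 times the annual gain
relative_gain = gs / ps
```

With equal selection intensity, GS is advantageous in this setting whenever its accuracy relative to phenotypic selection exceeds the ratio of the GS cycle length to the phenotypic cycle length.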

However, in many cases, GS did not outperform phenotypic selection in terms of genetic gain for all the traits of interest. The value of GS then lies in its ability to increase selection intensity, which leads to a two-stage breeding scheme that starts with genomic selection and continues with phenotypic selection. In such schemes, the limiting factor for GS is the number of selection candidates that can be genotyped. In oil palm, using GS for bunch production before conventional phenotypic progeny tests was estimated to improve the performance of the selected A × B hybrids by more than 10% when 4000 A and 4000 B individuals were genotyped (Cros et al. 2017). Similarly, in a full-sib rubber tree family, applying GS to 3000 individuals before clonal trials would have increased the selection response for rubber production by around 10% (Cros et al. 2019).

Some studies on tropical perennial crops and plantation trees also compared GS with QTL-based approaches, in particular in terms of expected genetic gain. For instance, in cacao, McElroy et al. (2018) found that GS largely outperformed GWAS in genetic gain for most of the disease resistance traits considered. In breeding populations of eucalyptus under selection, Müller et al. (2017) showed that GS outperformed GWAS for growth traits, as GS accounted for large proportions of the heritability, whereas GWAS captured very few significant associations. In a study simulating several cycles of within-family oil palm breeding, Wong and Bernardo (2008) found that GS enabled higher annual genetic gains than marker-assisted recurrent selection for all the family sizes, numbers of QTLs, and heritabilities considered.

Future prospects for genomic selection in perennial tropical crops and plantation trees

Promising results have already been obtained with GS in tropical perennial crops and plantation trees. However, several aspects require further investigation to take full advantage of the approach. As mentioned above, statistical approaches for predictions still require attention; in particular, single-step GBLUP and multivariate models need to be more widely used and artificial neural networks need to be investigated in greater detail. Training populations also need optimization. Other promising aspects have received little or no attention so far for use with GS in tropical perennial crops and plantation trees, and these aspects are discussed below.

High-throughput phenotyping

High-throughput phenotyping (HTP) platforms allow faster phenotyping and reduced labor costs compared to conventional methods (Persa et al. 2021). HTP allows analyses at the field scale with outdoor platforms that use remote sensing and imaging, mostly based on visible/near-infrared and far-infrared spectroscopy, and analyses of the harvestable part of the crop using near-infrared reflectance spectroscopy (NIRS). The use of HTP has already led to significant results in model species such as rice, maize, and wheat, for a wide range of traits, like adaptation, quality, and vegetative growth (Asaari et al. 2019; Blancon et al. 2019; Chattopadhyay et al. 2019; Juliana et al. 2019; Sun et al. 2019; Wu et al. 2019). For GS, HTP is an efficient way to characterize large training populations (Wartha and Lorenz 2021). This is particularly useful for perennial species that require phenotyping over extended periods of time. HTP has already been used in different tropical perennial crops and plantation trees. For instance, multispectral data collected from an unmanned aerial vehicle were used to estimate the height and diameter at breast height of eucalyptus trees (Borges et al. 2021). NIRS has also been used for rapid quantification of flavor-related components of cocoa and beverage quality components of Arabica coffee (e.g., Álvarez et al. 2012; dos Santos Scholz et al. 2014). In eucalyptus populations used for GS, NIRS was used to measure chemical and physical wood quality traits (de Moraes et al. 2018; Durán et al. 2017; Rambolarimanana et al. 2018).

In addition to enabling the phenotyping of large populations, HTP data can be used in GS models as covariates associated with the trait of interest to increase prediction accuracy (Persa et al. 2021). To our knowledge, this aspect has not been investigated so far in GS studies on tropical perennial crops and plantation trees, but such studies would be of interest.

Phenomic selection is another approach that relies on spectral data that are usually obtained by NIRS (Rincent et al. 2018). In this case, the prediction of the genetic values is based on spectral data instead of molecular markers, meaning genomic data could no longer be needed. Phenomic selection has been investigated in a few crops, particularly in two temperate perennial species, poplar and grapevine. In poplar, the expected genetic gain using phenomic selection was higher than or the same as using genomic selection, depending on the trait (Rincent et al. 2018). In grapevine, phenomic predictions were reported to be a possible alternative to genomic predictions (Brault et al. 2022).

Longitudinal traits

Longitudinal traits are traits recorded repeatedly over the period of interest in the lifetime of individuals. This is a common case in perennial species. In tropical perennial crops and plantation trees, longitudinal traits are, for instance, growth and production, which are evaluated on each plant at different ages. The random regression model, a standard approach used for the genetic analysis of such traits (Oliveira et al. 2019), is a mixed model that makes it possible to model individual genetic values as a continuous function of time (or environmental covariates, see below), which can lead to more accurate estimates of the genetic values and facilitate the selection of genotypes with an optimal profile over the period of interest. Random regression can link genetic effects and time with complex functions, including nonlinear patterns, without making assumptions about the shape of the curve (Mrode 2014; Oliveira et al. 2019). The parameters that characterize these functions (e.g., slopes and intercepts for linear functions) are treated as random effects, and the analysis yields genotype-specific parameters. Random regression has already been used for genomic predictions of longitudinal traits in different species, in particular in animals (Oliveira et al. 2019). Surprisingly, even though many traits in tropical perennial crops and plantation trees are longitudinal, random regression has rarely been used in these species. One example is Jatropha curcas, where random regression was used to analyze grain yield over the years (Peixoto et al. 2020). However, to our knowledge, this approach has not been used in the context of GS in tropical perennial crops and plantation trees so far.
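How genotype-specific random regression coefficients translate into genetic values over time can be sketched with Legendre polynomials, a common (but not the only) choice of basis function. The example below only evaluates trajectories from given coefficients; it does not fit the mixed model. The coefficients, ages, and function name are hypothetical, and unnormalized Legendre polynomials are used for simplicity.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_covariates(ages, order):
    """Legendre polynomial covariates for ages rescaled to [-1, 1]."""
    ages = np.asarray(ages, dtype=float)
    t = 2.0 * (ages - ages.min()) / (ages.max() - ages.min()) - 1.0
    return legendre.legvander(t, order)      # shape (n_ages, order + 1)

# Hypothetical random regression coefficients for three genotypes
# (rows: genotypes; columns: intercept, linear, and quadratic terms)
coef = np.array([[1.0,  0.5, 0.0],
                 [0.8,  0.0, 0.3],
                 [1.2, -0.4, 0.1]])

ages = [3, 4, 5, 6, 7, 8]                     # ages at measurement, in years
Phi = legendre_covariates(ages, order=2)

# Genetic value of each genotype (rows) at each age (columns)
trajectories = coef @ Phi.T

# Index of the best genotype at each age; rankings can change over time
best_at_each_age = trajectories.argmax(axis=0)
```

In this toy example, the third genotype ranks first at the youngest age and the first genotype ranks first at the oldest age, which is the kind of profile information that a single-point analysis would miss.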

Leveraging multi-environment trials

Multi-environment trials and GS models that account for environmental effects make it possible to predict the genetic value of new genotypes in known environments, known genotypes in new environments, and new genotypes in new environments (Bustos-Korts et al. 2016; Malosetti et al. 2016). The ability to predict the performances in new environments is of major interest in the context of climate change, in particular for perennial crops where breeding suffers from inertia due to the length of the breeding cycles. Analysis of genotype-by-environment interactions (GEI) helps select genotypes that are stable across environments and can identify the best genotypes for specific target environments. In particular, this has been extensively studied in cereals (Crossa et al. 2017). Considering GEI in GS models can significantly increase prediction accuracy when data from multi-environment trials are available (Tong and Nikoloski 2021; Xu et al. 2020). A variety of approaches have been developed to incorporate environmental data in GS models (Bustos-Korts et al. 2016; Crossa et al. 2017; Malosetti et al. 2016; Tong and Nikoloski 2021; Xu et al. 2020). The most attractive methods enable predictions in new environments using reaction norms (Costa-Neto et al. 2021; Costa-Neto and Fritsche-Neto 2021; Crossa et al. 2021) or crop growth models (CGM) (Crossa et al. 2021; Van Eeuwijk et al. 2019; Xu et al. 2020).

Reaction norms are linear or nonlinear functions that describe the phenotypes produced by a single genotype across an environmental gradient (Li et al. 2017). They can be incorporated into genetic analyses using random regression (Marchal et al. 2019; Mrode 2014; Oliveira et al. 2019), leading to genotype-specific coefficients that characterize the reaction norms for each environmental covariate. Equivalently, the environmental covariates can be used to build an environmental relationship matrix that identifies putative similarities among the environments considered (Costa-Neto et al. 2021), in the same way that SNPs are used to build the genomic relationship matrix.
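The construction of an environmental relationship matrix can be sketched by analogy with a marker-based relationship matrix. The example assumes that each covariate is standardized and that the matrix is the cross-product of standardized covariates divided by their number; the cited studies may use other scalings or kernels. The covariate values and the function name are hypothetical.

```python
import numpy as np

def environmental_relationship(W):
    """Relationship matrix among environments (rows of W) from covariates."""
    W = np.asarray(W, dtype=float)
    Ws = (W - W.mean(axis=0)) / W.std(axis=0)   # standardize each covariate
    return Ws @ Ws.T / W.shape[1]

# Hypothetical covariates for four environments:
# mean temperature (deg C), annual rainfall (mm), solar radiation (MJ/m2/day)
W = np.array([[26.0, 1800.0, 17.0],
              [27.5, 1500.0, 18.5],
              [24.0, 2600.0, 15.0],
              [26.5, 1700.0, 17.5]])

Omega = environmental_relationship(W)
```

Here environments 1 and 4 have similar covariates and therefore a higher relationship coefficient than environments 1 and 3.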

CGMs rely on plant physiology, soil science, and climatology principles to model plant development. CGMs use equations involving genetic parameters that are specific to the genotypes under consideration and are assumed to be independent of the environment, together with environmental variables (Boote et al. 2013). Several methods have been developed to incorporate CGMs in the context of GS (Crossa et al. 2021; Rincent et al. 2017). A CGM can be used to predict developmental stages that, along with daily weather data, are then used to compute climate stress covariates according to the plant development stage. A CGM can also be used to compute environmental stress covariates that include the response of the crop to environmental conditions. These environmental covariates can then be incorporated in the GS model using, for example, random regression. Alternatively, the genetic parameters of the CGM can be estimated for the genotypes that comprise the training set and the genetic parameters of the selection candidates predicted by a GS model. Using the CGM and environmental covariates then makes it possible to predict the phenotype of the selection candidates in the target environment. This approach has been termed gene-based modeling. Another method consists of incorporating a CGM in the GS prediction framework for the joint estimation of marker effects and CGM genetic parameters. This is referred to as CGM-WGP (whole-genome prediction) and relies on the use of approximate Bayesian computation or Bayesian generalized linear hierarchical models.

Ideally, the use of reaction norms or CGM requires the identification of all the environmental covariates that affect the trait of interest and the availability of environmental data at the plant level. This refers to the concept of envirotyping (Xu 2016) and to its extension to a large scale across time and space, enviromics (Resende et al. 2021). To our knowledge, only two GS studies have considered multi-environment trials in tropical perennial crops and plantation trees so far. Souza et al. (2019) made genomic predictions for rubber trees grown in two environmental conditions, using multi-environment data and models that included environmental effects and GEI. These authors showed that multi-environment models captured a larger proportion of the genetic variance than single-environment approaches. In Coffea canephora, Ferrão et al. (2019) used multiplicative models in which genetic and environmental effects were handled in a common random effect associated with a variance–covariance matrix obtained by the Kronecker product of genetic and environmental variance–covariance matrices. These authors showed that this approach resulted in more accurate GS than traditional GBLUP, as the latter did not account for environmental information. This area of GS needs further study in tropical perennial crops and plantation trees, and particular attention should be paid to the use of CGM, reaction norms, and enviromics. This could leverage tools and skills that are already available in these species. For example, crop growth models have already been developed in cocoa (Zuidema et al. 2005), oil palm (Huth et al. 2014), and eucalyptus (de Freitas et al. 2020), and reaction norms were constructed in Arabica coffee (Bertrand et al. 2015) and used with random regression for GEI analysis in conventional eucalyptus breeding (Alves et al. 2020).
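The Kronecker structure used in such multiplicative models can be sketched as follows. The matrices are hypothetical toy values, and the ordering of records (all genotypes within the first environment, then all genotypes within the second) is an assumption of this example, not a description of the cited analysis.

```python
import numpy as np

# Hypothetical genomic relationship matrix for three genotypes
G = np.array([[1.0, 0.5, 0.1],
              [0.5, 1.0, 0.2],
              [0.1, 0.2, 1.0]])

# Hypothetical covariance matrix between two environments
E = np.array([[1.0, 0.6],
              [0.6, 1.0]])

# Covariance of the genotype-in-environment effects, with records ordered
# by environment and then by genotype within environment
V = np.kron(E, G)

n_g = G.shape[0]
# Covariance between genotype 2 in environment 1 and genotype 3 in environment 2
cov_example = V[0 * n_g + 1, 1 * n_g + 2]     # equals E[0, 1] * G[1, 2]
```

Each block of V is the genomic relationship matrix scaled by the covariance between the corresponding pair of environments, which is how information is shared across environments for related genotypes.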

Beyond single-locus genotype data

Different types of molecular information can now be exploited by the GS model, which could lead to an increase in the accuracy of predictions by better modeling the genotype–phenotype relationship (Fig. 1).

Fig. 1 Overview of possible molecular information for optimizing GS models. Genomic features can be defined in various ways: location in QTL, functional and structural annotations, etc. (Sørensen et al. 2013). Two genomic features were considered here for illustration.

The use of haploblocks made of two or more adjacent SNPs instead of single SNPs was investigated for genomic predictions, as it could increase GS accuracy by better capturing identity-by-descent between individuals, giving higher LD between QTLs and haploblock alleles, or capturing epistatic effects between SNPs in the same haploblock (Bhat et al. 2021; Goddard and Hayes 2007; Hess et al. 2017). Ballesta et al. (2019) explored the advantages of using haplotypic data for GS in Eucalyptus globulus and showed that prediction accuracy was significantly higher for traits with low heritability when haploblocks were used instead of single SNPs. However, the relative efficiency of using haploblocks or single SNPs for genomic predictions is affected by many parameters, in particular the size of the training population, the level of LD, the method used to define the haploblocks, and the phasing accuracy (Bhat et al. 2021; Goddard and Hayes 2007; Hess et al. 2017). This aspect requires further investigation in tropical perennial crops and plantation trees.
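The recoding of phased SNP data into haploblock alleles can be sketched as follows. The example assumes diploid individuals, phased 0/1 alleles, and blocks of a fixed number of adjacent SNPs. In practice, blocks are usually defined from LD or physical distance. The function name and the toy haplotypes are hypothetical.

```python
import numpy as np

def haploblock_design(haplotypes, block_size):
    """Recode phased SNP haplotypes into haploblock allele counts.

    haplotypes: array of shape (n_individuals, 2, n_snps) with 0/1 alleles.
    Returns (X, labels), where X[i, k] is the number of copies (0, 1 or 2)
    of haploblock allele k carried by individual i, and labels[k] is
    (block index, allele as a tuple of SNP alleles).
    """
    haplotypes = np.asarray(haplotypes)
    n_snps = haplotypes.shape[2]
    columns, labels = [], []
    for b, start in enumerate(range(0, n_snps, block_size)):
        block = haplotypes[:, :, start:start + block_size]
        # Distinct haplotype alleles observed in this block
        alleles = sorted({tuple(h.tolist()) for ind in block for h in ind})
        for allele in alleles:
            matches = (block == np.array(allele)).all(axis=2)   # (n, 2)
            columns.append(matches.sum(axis=1))
            labels.append((b, allele))
    return np.column_stack(columns), labels

# Three hypothetical individuals, four phased SNPs, blocks of two SNPs
haps = np.array([[[0, 0, 1, 1], [0, 1, 1, 1]],
                 [[0, 0, 1, 1], [0, 0, 0, 0]],
                 [[1, 1, 0, 0], [0, 1, 1, 1]]])
X, labels = haploblock_design(haps, block_size=2)
```

The resulting matrix X can replace the SNP matrix in a prediction model. Its number of columns depends on the number of distinct haploblock alleles, which grows with block size and with phasing errors.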

The use of pangenomes is another possible avenue of GS research. Progress in sequencing techniques has enabled the comparison of individual genomes within species and shown that structural variations (SV) represent a significant proportion of polymorphism (Yuan et al. 2021). SVs consist of deletions, insertions, copy number variations, inversions, or translocations, with size > 50 bp. In particular, SVs include variations in gene presence/absence, with core genes that are found in all individuals and variable genes that are absent in some individuals. SVs cannot be represented by single reference genomes, and pangenomes are thus required to harness the whole genetic diversity of the breeding population (Bayer et al. 2021; Scossa et al. 2021). So far, very few studies have considered using structural variations for genomic predictions. In wheat, Würschum et al. (2017) obtained a slight increase in GS accuracy when markers specifically targeting a CNV contributing to the genetic control of the target trait were included in the model. Similarly, in maize and cattle, the use of CNV information in the GS model increased prediction accuracy in some cases (El Hamidi et al. 2018; Lyra et al. 2019). The use of SV information for genomic predictions deserves greater attention, and this will be greatly facilitated by pangenomes. Several reference genomes are already available for certain tropical perennial crops and plantation trees (e.g., cocoa and oil palm), and the next step should be the construction of pangenomes. The biggest impact could be on polyploid crops, such as bananas, as SV may represent an even higher proportion of polymorphisms in polyploids (Schiessl et al. 2019).

Another way of improving GS accuracy is to incorporate existing information concerning polymorphisms, particularly that obtained from QTL detection studies, in the prediction model (Xu et al. 2020). Different modeling approaches have been developed for this purpose, and their efficiency has been demonstrated in animal and plant studies, including temperate perennial fruit trees (Nsibi et al. 2020). However, very few studies have investigated this aspect in tropical perennial crops and plantation trees so far. In oil palm, Kwong et al. (2017a) applied RRBLUP using only the SNPs with the highest GWAS association scores, which made it possible to reduce marker density while achieving the same or better accuracy than with all the SNPs. A similar result was obtained in eucalyptus (Tan and Ingvarsson 2019). However, these approaches depend on a careful definition of the training and application populations. For instance, in cocoa, the inclusion of the SNPs detected by GWAS as fixed effects in the GS model did not improve prediction accuracies, which likely resulted from excessive genetic differentiation between the training and application populations, making the detected SNPs irrelevant (McElroy et al. 2018).
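The general principle of predicting with a subset of markers preselected by association can be sketched as follows. This is not the procedure of the cited studies: the absolute marginal correlation between each SNP and the phenotype is used here as a simple stand-in for a GWAS score, the ridge penalty lam is fixed arbitrarily instead of being estimated, and all data are simulated. Marker preselection is done within the training set only, to avoid an optimistic bias in the accuracy estimate.

```python
import numpy as np

def rrblup_effects(Z, y, lam):
    """Ridge-regression marker effects: (Z'Z + lam * I)^-1 Z'(y - mean)."""
    Zc = Z - Z.mean(axis=0)
    A = Zc.T @ Zc + lam * np.eye(Z.shape[1])
    return np.linalg.solve(A, Zc.T @ (y - y.mean()))

def top_snps_by_association(Z, y, n_keep):
    """Indices of the SNPs with the highest absolute correlation with y
    (a simple stand-in for a GWAS association score)."""
    Zc = Z - Z.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Zc ** 2).sum(axis=0) * (yc ** 2).sum())
    score = np.abs(Zc.T @ yc) / np.where(denom > 0, denom, np.inf)
    return np.argsort(score)[::-1][:n_keep]

# Simulated data (hypothetical): 120 individuals, 300 SNPs, 10 causal SNPs
rng = np.random.default_rng(1)
n, m = 120, 300
Z = rng.integers(0, 3, size=(n, m)).astype(float)
true_effects = np.zeros(m)
true_effects[:10] = rng.normal(size=10)
y = Z @ true_effects + rng.normal(scale=1.0, size=n)

train, test = np.arange(0, 90), np.arange(90, n)

# Marker preselection and model training use the training set only
keep = top_snps_by_association(Z[train], y[train], n_keep=50)
beta = rrblup_effects(Z[train][:, keep], y[train], lam=10.0)

# Predict the test individuals, centering with the training-set means
centered_test = Z[test][:, keep] - Z[train][:, keep].mean(axis=0)
pred = y[train].mean() + centered_test @ beta
accuracy = np.corrcoef(pred, y[test])[0, 1]
```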

Incorporating endophenotypes, or intermediate phenotypes, in prediction models is another promising avenue of GS research. Endophenotypes, and in particular transcriptomic and metabolomic data, have been used jointly with genomic data in a few crops (Scossa et al. 2021; Tong and Nikoloski 2021; Xu et al. 2020). These multi-omics prediction approaches are expected to better capture minor and non-additive effects and to better model the relationship between genotypes and phenotypes. Multi-omics predictions produced promising results in rice and maize, where they outperformed single-omics predictions. This requires specific statistical approaches, like machine learning (Montesinos-López et al. 2021; Tong and Nikoloski 2021). Investigating these aspects would be of interest for tropical perennial crops and plantation trees.

GS-aided re-domestication and introgression breeding

Some perennial tropical crops have breeding populations with narrow genetic bases, and hence, only a fraction of the genetic diversity of the species is exploited, for instance, in Coffea arabica (Tran et al. 2016), cacao (Lanaud et al. 2001; Zhang and Motilal 2016), and rubber (Priyadarshan 2011). This usually resulted from choices and constraints dating back to the beginning of the breeding of these crops, or even before. In addition, the criteria originally used to select individuals might differ from the criteria that are of interest today, and current breeding populations may no longer correspond to current needs in terms of diversity. For example, in oil palm, the Deli breeding population, which today is used as one of the two heterotic populations mated to produce the vast majority of the oil palm cultivars, originated from four individuals collected in Africa and planted in Indonesia in 1848, decades before the establishment of the first commercial plantations (Corley and Tinker 2016). The other oil palm breeding populations were derived from a small number of founders selected among individuals collected in restricted regions during prospections, usually in the first half of the twentieth century. Although this led to reduced effective sizes (Cros et al. 2014), which is advantageous for GS accuracy, it constrains the long-term genetic gain. Also, for the La Mé oil palm breeding population, the founder individuals were selected in the 1920s, giving less importance to the proportion of pulp in the fruits than breeders do today (Cochard 2008).
Although this has not prevented significant genetic progress (e.g., in oil palm, genetic progress is considered to be 1–1.5% per year (Rival and Levang 2014), and in rubber tree, yield increased from 500 kg ha−1 in primary clones developed in the 1930–1960 period to 2500 kg ha−1 in the best clones today (Priyadarshan 2011)), broader genetic diversity of the crops concerned would help maintain the rate of genetic progress and likely increase it. This could be achieved through the re-domestication of existing crops (Tian et al. 2021), which consists of initiating breeding afresh from a renewed and broader diversity comprising ancestors and/or natural populations of existing crops. Introgression breeding could also play an important role in increasing genetic diversity by transferring exotic alleles from species related to cultivated crops (Gramazio et al. 2021). GS is an attractive way of implementing these processes efficiently (Crossa et al. 2017). Indeed, re-domestication or introgression breeding of perennial tropical crops and plantation trees would normally require many decades of phenotypic selection, making GS a particularly attractive option. One example is already available in a temperate perennial fruit tree, apple (Kumar et al. 2020). This study considered the introgression of monogenic traits into superior germplasm by backcrosses or pseudo-backcrosses and suggested that GS would be efficient for background selection among the individuals that inherited the trait of interest from the exotic donor germplasm, as it would eliminate the unwanted donor alleles faster than conventional phenotypic background selection. The use of GS for this purpose should be considered in perennial tropical crops and plantation trees where introgression breeding from wild species has already been shown to be of interest, including citrus, banana, and cacao (Scossa et al. 2016).

Combining profiles of predicted marker effects and targeted recombination

As mentioned above, one limiting factor in breeding perennial crops is the constrained size of the population of selection candidates, as the larger the population, the more exhaustive the search for elite individuals within the diversity generated by meiosis. GS makes it possible to increase the population of selection candidates by replacing phenotyping with genotyping. Controlling the gametes generated at meiosis could further increase the efficiency of the breeding scheme. This could be made possible by combining genome-wide profiles of marker effects estimated using GS models and targeted recombination (Bernardo 2017). The profiles of marker effects along the chromosomes of heterozygous individuals could be used to identify sites in the genome where recombinations would maximize the genetic value of their gametes by aggregating blocks of favorable alleles. Recombinations could be obtained at these sites through genome editing, and the progenies of the regenerated edited individuals would then be screened to identify the best ones. This approach has great potential to increase genetic progress (Bernardo 2017; Brandariz and Bernardo 2019b). Genome editing tools are under active development in perennial tropical crops and plantation trees, for example, in cacao (Fister et al. 2018) and oil palm (Yeap et al. 2021). However, further studies are required in these species to develop efficient targeted recombination approaches and to evaluate the relative efficiency of breeding schemes involving targeted recombinations and conventional schemes.
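The search for a favorable recombination site can be sketched for the simplest case: one chromosome, two phased parental haplotypes, additive marker effects, and a single crossover. These simplifications, the function name, and the numerical values are assumptions made for illustration and do not reproduce the method of the cited studies.

```python
import numpy as np

def best_single_crossover(hap_a, hap_b, effects):
    """Best single recombination point on one chromosome.

    hap_a, hap_b: 0/1 alleles of the two parental haplotypes (ordered markers).
    effects: estimated effect of allele 1 at each marker.
    Returns (value, k, first): the gamete carries haplotype `first` for
    markers [0, k) and the other haplotype for markers [k, m).
    k = 0 or k = m corresponds to a non-recombinant gamete.
    """
    effects = np.asarray(effects, dtype=float)
    va = np.asarray(hap_a) * effects
    vb = np.asarray(hap_b) * effects
    # prefix[k]: value of markers [0, k); suffix[k]: value of markers [k, m)
    prefix_a = np.concatenate(([0.0], np.cumsum(va)))
    prefix_b = np.concatenate(([0.0], np.cumsum(vb)))
    suffix_a = prefix_a[-1] - prefix_a
    suffix_b = prefix_b[-1] - prefix_b
    ab = prefix_a + suffix_b          # haplotype a first, then haplotype b
    ba = prefix_b + suffix_a          # haplotype b first, then haplotype a
    k_ab, k_ba = int(ab.argmax()), int(ba.argmax())
    if ab[k_ab] >= ba[k_ba]:
        return float(ab[k_ab]), k_ab, "a"
    return float(ba[k_ba]), k_ba, "b"

# Hypothetical marker effects and parental haplotypes on one chromosome
effects = [0.5, 0.2, -0.3, 0.4, 0.6, -0.1]
hap_a = [1, 1, 0, 0, 0, 1]
hap_b = [0, 0, 1, 1, 1, 0]
value, k, first = best_single_crossover(hap_a, hap_b, effects)

# Value of the best non-recombinant gamete, for comparison
best_parental = max(float(np.dot(hap_a, effects)), float(np.dot(hap_b, effects)))
```

In this toy example, a crossover between the third and fourth markers combines the favorable alleles at the start of the first haplotype with those in the middle of the second, giving a gamete value well above that of either parental haplotype.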

GS-based breeding consortia

Breeding for perennial crops is highly complex and very costly, and only limited resources are available for breeding many tropical perennials. Furthermore, as we have seen throughout this review, using GS requires expertise in a range of scientific and technical fields, including quantitative genetics, biostatistics, bioinformatics, genomics, computer programming, and, in particular, with the growing interest in machine learning, mathematics. GS also often requires a large training population which, in the context of climate change, will need to be evaluated in multiple environments. This puts tropical perennial crops in a completely different situation from many other crops, including temperate cereals and legumes, that can rely on a dynamic private sector to bring together the required human resources, phenotyping and genotyping capacities, etc. and to make rapid progress in innovative methods, resulting in the release of cultivars that have benefited from these methods. One possible solution for tropical perennial crops would be to strengthen international collaboration by sharing the efforts required for the practical implementation of GS, i.e., multi-environment phenotyping, high-throughput genotyping, and statistical analyses for genomic predictions. Sneller et al. (2021) called for the construction of GS-based breeding consortia, which would allow each member of a consortium to share the overall GS costs while predicting the genetic value of its selection candidates using a large training population comprising genetic material from all the consortium partners. Another advantage of such consortia would be the possibility to evaluate genetic material in different environments through the exchange of plant material among the consortium partners. Even so, there would have to be some relatedness between the plant material shared by the members of the consortium, and sufficient genotypes would have to be evaluated in different partners' environments (Sneller et al. 2021). Such a consortium is a possible solution for the implementation of GS for tropical perennial species on which, to our knowledge, no GS studies have been published so far, including coconut, papaya, avocado, mango, and teak, despite their major economic importance. Projects of this kind are currently being set up for some perennial tropical crops and plantation trees, like coffee (World Coffee Research 2022), while others could emerge by building on existing networks, like MusaNet (https://musanet.org/) and CacaoNet (https://www.cacaonet.org/).

Conclusion

Genomic selection (GS) should revolutionize the breeding of perennial tropical crops and plantation trees, as it has already produced promising results in terms of an increase in the rate of genetic progress. GS will (i) enable increased selection intensity and/or a shorter generation interval by replacing all or some phenotyping with genotyping in selected breeding cycles and (ii) increase accuracy for traits that are difficult to phenotype. Overall, the main factors that affect GS accuracy have been well studied in perennial tropical crops and plantation trees. However, the extent of GS research varies among species: some, like eucalyptus and oil palm, can be considered as models for GS, including an in-depth assessment of its practical potential; in others, like banana and guava, GS studies were recently initiated; while in other species, like coconut, papaya, avocado, mango, and teak, despite their economic importance, no GS study has been conducted so far.

The results obtained in the plant and animal species where GS has been investigated to date suggest that optimal GS predictions could be achieved through joint analysis of all available information concerning genotype-to-phenotype relations, possibly including multiple omics and phenotypic data on multiple traits in several well-characterized environments, using prior information available on markers and all types of polymorphisms present in the populations concerned. For perennial crops, in which phenotyping is particularly complex and resource-consuming, there is an urgent need for increased international cooperation in the form of GS-based consortia to be able to gather such large datasets at a reasonable cost. The optimal implementation of GS will also require going beyond the standard GS technologies and methodologies used today. In particular, high-throughput phenotyping is a key approach to gathering the required amount of phenotypic data on such large populations at a reasonable rate and cost. Statistical methodologies able to handle large multidimensional heterogeneous datasets are also required, and machine learning approaches are crucial, particularly artificial neural networks.

Future GS research in tropical perennial crops and plantation trees should systematically consider the use of single-step GBLUP when phenotypic data are available on ungenotyped individuals, the use of multivariate models when the traits of interest comprise correlated traits with contrasting levels of heritability, and random regression models for longitudinal traits. Training population optimization should also be undertaken. Targeted recombinations on sites identified based on the profiles of predicted marker effects should be investigated. Furthermore, GS has the potential to make re-domestication possible as well as to boost introgression breeding.