1 Introduction

Tree breeding programs are resource- and time-dependent endeavours. The selection and testing phases are often conducted over vast geographic areas with large trials, requiring frequent and long-term monitoring and assessment. The lowest-intensity approach to tree improvement is a reciprocal transplanting-like approach known as provenance testing (Callaham 1964) for the identification of superior seed sources for reforestation. Provenance testing allowed the evaluation of several seed sources originating from multiple locations within the species’ natural range through field-testing over potential target planting areas. This process aided in identifying superior seed sources and their adaptability for the safe transfer of their seed to new planting sites (Rehfeldt 1983). Provenance testing focused on acquiring precise knowledge of the seed sources and their performance over testing sites (Konig 2005). This process is a simple population improvement method, as the pedigree or genealogy of the tested material is often unknown. The main achievement of provenance testing is the delineation of areas for safe seed transfer, known as seed zones (Campbell 1986).

The first and simplest pedigree-known testing utilized wind-pollinated/open-pollinated families (also known as half-sib families because the offspring within each family share a common seed donor). Wind-pollinated testing, as a partial pedigree method, permits within- and among-family selection; thus, it is expected to yield greater gains than provenance testing. The New Zealand radiata pine tree improvement program is the most notable program for adopting this approach (Burdon and Shelbourne 1971). The main attractive feature of this method is its simplicity and suitability for testing a large number of families; however, it is often considered a springboard to full pedigree testing (Jayawickrama and Carson 2000). It should be stated that wind-pollinated testing is fraught with assumptions that can be neither tested nor fulfilled, and it often leads to inaccuracies in estimates of individual breeding values (Namkoong 1966).

The utilization of a full pedigree (i.e., individuals with known genealogy) is the most common testing mode in tree breeding programs (White et al. 2007). The formation of a structured pedigree, created through the implementation of a mating design of controlled pollinations, provides greater control of the genealogy and, eventually, accurate estimation of genetic parameters such as trait heritabilities and parent and offspring breeding values (Namkoong et al. 1988). It should be stated that the successful completion of a structured pedigree is an elaborate process requiring time and substantial painstaking effort. Recurrent selection is the most common breeding framework when a full pedigree is used (Allard 1960).

2 Pedigree Reconstruction

Structured pedigree designs (full- and half-sib families) constitute the backbone of most tree breeding programs, resulting in impressive gains and better management of inbreeding and genetic diversity (White et al. 2007). Lambeth et al. (2001) introduced the idea of polymix breeding and pedigree reconstruction. El-Kassaby and Lstibůrek (2009) further implemented this idea via the posterior analysis of naturally occurring crosses among a group of parents. They coined the term “Breeding without Breeding (BwB)” and proposed the utilization of molecular markers, SSRs in this case, and pedigree reconstruction models (see Jones and Ardren 2003 for review) to bypass the costly and time-consuming breeding phase. The disconnected partial diallel mating scheme is often employed to create the structured pedigree for generating the offspring needed for testing (Namkoong et al. 1988). The BwB concept is illustrated using a bulk seed sample from a 63-parent lodgepole pine seed orchard (El-Kassaby, unpublished), and can be compared with the disconnected partial diallel design. With this number of parents and the implementation of a six-parent scheme, 153 full-sib families are expected to be generated (seven 6-parent and three 7-parent partial diallel units). However, when pedigree reconstruction was implemented, a total of 446 full-sib families were assembled without making any controlled crosses (Fig. 1). The resulting mating scheme is far more efficient, as many more crosses were created than with the classical disconnected partial diallel.

Fig. 1. Distribution of posteriorly assembled naturally-occurred crosses among 63-parent lodgepole pine seed orchard revealed by full pedigree reconstruction of bulk offspring (i.e., unknown maternal and paternal parentage) using DNA microsatellite markers (nine nuclear and six chloroplast loci) and pedigree reconstruction (El-Kassaby, unpublished)
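The simplest form of pedigree reconstruction, parentage assignment by genotype exclusion, can be sketched as follows. This is only an illustration under strong assumptions: codominant marker genotypes are error-free, all candidate parents are genotyped, and the function names and toy allele sizes are hypothetical rather than taken from the studies cited above. Practical methods typically also account for genotyping error and missing data.

```python
from itertools import combinations_with_replacement

def compatible(offspring, parent_a, parent_b):
    """True if, at every locus, the offspring genotype can be formed by
    taking one allele from each of the two candidate parents."""
    for (o1, o2), a, b in zip(offspring, parent_a, parent_b):
        if not ((o1 in a and o2 in b) or (o2 in a and o1 in b)):
            return False
    return True

def assign_parents(offspring, candidates):
    """Return every unordered candidate pair (selfing included) that is
    not excluded by the offspring's multilocus genotype."""
    names = sorted(candidates)
    return [(a, b) for a, b in combinations_with_replacement(names, 2)
            if compatible(offspring, candidates[a], candidates[b])]

# Hypothetical genotypes at three loci (allele sizes in base pairs).
candidates = {
    "P1": [(100, 104), (200, 200), (150, 154)],
    "P2": [(102, 106), (202, 204), (152, 152)],
    "P3": [(100, 100), (206, 208), (156, 158)],
}
offspring = [(104, 106), (200, 204), (152, 154)]

print(assign_parents(offspring, candidates))  # [('P1', 'P2')]
```

With more parents and few loci, several pairs may remain unexcluded, which is why highly polymorphic markers are needed to resolve a bulk seed lot into full-sib families.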

Furthermore, El-Kassaby et al. (2011) extended the BwB concept and increased the method’s efficiency through the application of two distinct steps: (1) the use of simplified half-sib progeny testing with large sample size per parent and (2) restricting offspring sampling for DNA fingerprinting and pedigree reconstruction to a random sample of offspring from a subset of parents rather than the entire parental population. The use of half-sib families in testing is expected to simplify the progeny test design as compared to multiple full-sib families. A random sample of offspring from a subset of seed parents is expected to capture most of the un-sampled parents as fathers (i.e., paternal half- and full-sib families) and therefore their breeding values can be estimated. Finally, the inclusion of all the offspring phenotypic information from both full- and half-sib families is expected to increase the estimated genetic parameters’ precision; however, it should be stated that the breeding value of the half-sib individuals will be estimated with lesser precision as compared to those of full-sib families. El-Kassaby et al. (2011) empirically tested this concept and assessed offspring generated from only 15 seed-donors (i.e., half-sib families) out of a 41-parent western larch seed orchard. In this experiment, each half-sib family was represented by 400 seedlings bringing the total experiment sample size to N ≈ 6,000. They randomly sampled 1,500 individuals, irrespective of their half-sib family designation, for DNA fingerprinting and pedigree reconstruction. As expected, an unbalanced mating structure was produced reflecting variation in parental reproductive output (Fig. 2).

Fig. 2. Pedigree reconstruction from natural mating produced from seed collected from 15 seed donors growing in a 41-parent western larch seed orchard showing the formation of full-sib families nested within the maternal and paternal half-sib families with selfing presented as black bars (After El-Kassaby et al. (2011))

It is interesting to note that the assembled matings produced offspring sired by all 41 parents in the orchard, indicating that the pedigree reconstruction successfully captured the un-sampled parents as pollen donors even when the offspring sampling was restricted to 15 seed-donors only. The most interesting observation from the data analyses is the congruence between height breeding values from the combined analysis (1,500 FS + 4,500 HS) and those based on the conventional full-sib families alone (1,500 individuals). This was observed for both parents and offspring (Fig. 3). The great advantage of the combined FS and HS analysis is the role played by the 1,500 FS individuals in linking the remaining 4,500 HS individuals to the paternal and maternal parents and their half- and full-sib families (Fig. 3). Furthermore, El-Kassaby et al. (2011) demonstrated that the precision of individuals’ breeding values did not change drastically if the random sample of individuals used for fingerprinting and pedigree reconstruction was reduced to approximately one third (i.e., less fingerprinting effort).

Fig. 3. Scatter plot of predicted breeding values for parents (left) and offspring (right) from the incomplete (combined HS + FS) and complete (FS) pedigree models. Pearson correlation (r) is in the left corner of each graph (After El-Kassaby et al. (2011))

Pedigree reconstruction is an effective method in situations where the posterior determination of offspring genealogy is needed or for species that do not lend themselves to controlled pollination. Using pedigree reconstruction for trees from plantation blocks that originated from seed orchards or breeding arboreta can instantaneously convert them to progeny test trials (Hansen and McKinney 2010). This approach requires good GIS tracking of plantation polygons over the landscape (see Ding et al. 2012) as well as rigorous spatial analysis to account for site heterogeneity (see Cappa et al. 2011).

3 Pedigree-Free Models

Fundamentally, Breeding without Breeding is anchored to the utilization of pedigree reconstruction to assemble the half- and full-sib families needed for conducting standard intra-class correlation analyses for estimating quantitative genetic parameters such as traits’ heritabilities and parental and offspring breeding values (Falconer and Mackay 1996). In situations where pedigree reconstruction is not feasible, molecular genetic markers offer an alternative approach for estimating quantitative genetic parameters. Molecular markers can be used to estimate “marker-based pairwise relationships” among any group of individuals irrespective of their genealogy, based on the assumption that markers identical by state are also identical by descent (Li et al. 1993; Queller and Goodnight 1989; Lynch and Ritland 1999; Wang 2002). The use of “marker-based pairwise relationships” created an opportunity to study domesticated and undomesticated species in experimental or natural settings with or without an available pedigree, thus permitting the estimation of genetic parameters in an unstructured population. Efficient methods have been developed for using high-density marker information for a group of individuals to estimate their realized relationship matrix (VanRaden 2008). This matrix is used in place of the classical pedigree-based numerator relationship matrix required in quantitative genetics analyses. This approach allows estimating quantitative genetic parameters such as narrow-sense heritability and breeding values using the genomic best linear unbiased prediction method, as described in more detail below (Zapata-Valenzuela et al. 2011; El-Kassaby et al. 2012; Porth et al. 2012).
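A realized relationship matrix of the kind attributed to VanRaden (2008) can be sketched in a few lines. The sketch assumes biallelic SNP genotypes coded as 0, 1 or 2 copies of one allele, no missing data, and allele frequencies estimated from the genotyped sample itself (a simplification; base-population frequencies are often preferred). The function name and the simulated genotypes are hypothetical.

```python
import numpy as np

def realized_relationship(genotypes):
    """Realized relationship matrix G = Z Z' / (2 * sum(p * (1 - p))).

    genotypes: (n_individuals, n_markers) array of allele counts (0, 1, 2).
    """
    M = np.asarray(genotypes, dtype=float)
    p = M.mean(axis=0) / 2.0                 # sample allele frequency per marker
    Z = M - 2.0 * p                          # centre each marker column
    scale = 2.0 * np.sum(p * (1.0 - p))      # expected variance summed over markers
    return Z @ Z.T / scale

# Simulated genotypes for 5 individuals at 50 SNPs (illustration only).
rng = np.random.default_rng(1)
genotypes = rng.binomial(2, 0.3, size=(5, 50))
G = realized_relationship(genotypes)
print(G.shape)  # (5, 5)
```

Because the columns are centred on sample frequencies, each row of G sums to zero; off-diagonal elements are therefore relationships relative to the average of the genotyped group, not absolute coancestries.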

The realized relationship matrix was successfully used to estimate narrow-sense heritability, breeding values, and genetic and phenotypic correlations in an unstructured black cottonwood population (El-Kassaby et al. 2012; Porth et al. 2012). More interesting is the study of Klápště et al. (2013), in which a pedigree-based model was compared to marker-based pairwise relationship models. Surprisingly, Pearson’s product-moment and Spearman’s rank correlations between western larch offspring breeding values produced from the two approaches were highly significant, indicating that the generated DNA-based pairwise relationship matrix is indeed a valid substitute for the classical pedigree matrix (Fig. 4). This approach was further extended by Korecký et al. (2013) to accommodate a mixture of information generated from both genetic markers and conventional pedigree. The extension is unique in that the combination of historical and contemporary co-ancestry, generated by the genetic markers and the pedigree, respectively, could not be attained by either approach individually. Thus, combining both data sets is expected to improve the accuracy of the estimated genetic parameters, as the often-ignored Mendelian sampling term in a structured pedigree is accounted for when molecular markers are used.

Fig. 4. Correlations of individuals’ breeding values produced from pedigree-based full-sib (FS) and four molecular genetic markers-based pairwise relationship estimation methods (W: Wang (2002); LR: Lynch and Ritland (1999); L: Li et al. (1993); QG: Queller and Goodnight (1989)) (After Klápště et al. (2013))
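The two agreement statistics used in such comparisons, Pearson’s product-moment and Spearman’s rank correlation, can be computed as sketched below. The breeding values are invented for illustration and are not data from Klápště et al. (2013); the function names are hypothetical.

```python
def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical breeding values for five offspring from two models.
ebv_pedigree = [1.2, -0.4, 0.8, 2.1, -1.5]
ebv_marker = [1.0, -0.1, 0.9, 1.7, -1.9]

print(round(pearson(ebv_pedigree, ebv_marker), 3))
print(spearman(ebv_pedigree, ebv_marker))  # 1.0, the rankings are identical
```

For selection purposes the rank correlation is often the more relevant of the two, because it is the ranking of candidates, rather than the scale of their predicted values, that determines which individuals are selected.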

The availability of molecular markers is expected to increase breeding efficiency. The use of dense, well-dispersed SNP data to estimate the realized relationships among individuals is expected to result in greater kinship resolution and offers an opportunity to improve classical breeding efforts.

4 Marker-Trait Association

The availability of cost-effective molecular genetic marker systems opens the door to analysis of the genetic basis of phenotypic traits measured in breeding populations. Classical quantitative genetics approaches, whether based on provenance, pedigree, or realized relationship matrices, are based on the ‘infinitesimal model’ proposed by Fisher (1918). Fisher’s model reconciled the disparate views of geneticists who studied quantitative traits that show continuous variation, and geneticists who studied discrete characters controlled by single genes, by hypothesizing that continuous variation is the cumulative effect of many different genes, each with a small and approximately equal additive effect on the phenotype. This model has been extremely useful for close to a century, and recent publications have reviewed the substantial body of evidence supporting the main features of the model (Hill et al. 2008; Stranger et al. 2011). This model has important implications for efforts to understand the molecular genetic mechanisms that underlie phenotypic variation in forest tree breeding programs, and for breeders interested in accurately predicting genetic merit of individuals based on genotype information.

The analytical approach called “association genetics” was described over 15 years ago (Lander and Schork 1994; Risch and Merikangas 1996) as an alternative to family-based linkage mapping approaches to characterize the genetic basis of human disorders. Much more work has been done using association genetics in the field of human biomedical genetics than in any other area, and much has been learned about the strengths and weaknesses of the approach (reviewed by Stranger et al. 2011; Rowe and Tenesa 2012). Neale and Savolainen (2004) reviewed key requirements for association genetics, and proposed that populations of conifers (and by extension, other wind-pollinated forest tree species) would be suitable experimental materials for association genetics. Applications of association genetics in tree breeding were described by White et al. (2007, pp. 543–547) and Wilcox et al. (2007); a brief overview will be given to set the stage for discussion of the current status.

The fundamental concept in association genetics is to test for a statistical association between the allelic state at a genetic marker locus in an individual and the phenotype of that individual, for many individuals in a population. The value of such associations is that they can help to identify the molecular basis for phenotypic variation, which in turn may provide molecular markers useful for marker-assisted breeding (Neale and Savolainen 2004). The power to detect associations is a function of several parameters, including the presence of population structure (Neale and Savolainen 2004), the extent of linkage disequilibrium in the test population, the size of the test population, and the proportion of phenotypic variation accounted for by each causative genetic variant involved in the phenotype of interest. The genetic variants tested for association with phenotype may be in known genes that are believed to play a role in controlling the phenotype under study (the ‘candidate gene’ approach), or they may be chosen on the basis of the allele frequencies in the population and distribution in the genome (the ‘genome-wide’ approach). As with any statistical testing procedure, if multiple tests of the same hypothesis are conducted, false positive (Type I) errors are likely unless the significance threshold is corrected for the number of tests made. Risch and Merikangas (1996) proposed a threshold of \( 5 \times 10^{-8} \) for genome-wide significance in an experiment testing associations of one million single-nucleotide polymorphism (SNP) loci in the human genome; more recent publications have refined this estimate slightly for different sets of human SNP loci (Li et al. 2012). Linkage disequilibrium (LD), the non-random association between allelic states at different loci, affects the independence of multiple tests, and so correction for multiple testing should take into account patterns of LD among the loci analyzed.
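The multiple-testing correction described above can be illustrated with a Bonferroni adjustment, which divides the desired experiment-wide error rate by the number of tests. The sketch below assumes an experiment-wide rate of 0.05 and treats all tests as independent; both are assumptions made for illustration, and the function names and p-values are hypothetical. Under those assumptions, one million tests give the genome-wide threshold quoted in the text.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the experiment-wide
    Type I error rate at or below alpha, assuming independent tests."""
    return alpha / n_tests

def significant_loci(p_values, alpha=0.05):
    """Indices of loci whose p-value passes the Bonferroni threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [i for i, p in enumerate(p_values) if p < threshold]

print(bonferroni_threshold(0.05, 1_000_000))  # 5e-08

# Four hypothetical single-locus tests: the threshold is 0.05 / 4 = 0.0125.
print(significant_loci([0.20, 0.011, 0.03, 0.0004]))  # [1, 3]
```

Because loci in linkage disequilibrium are not independent, the effective number of tests is smaller than the number of loci, and a plain Bonferroni correction is then conservative.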

An early study of linkage disequilibrium in Douglas fir, based on a relatively small sample of 18 genes from 32 haploid megagametophyte samples, concluded that each gene contained 2–3 independent “haploblocks” of genetic variation, and 4–5 SNP loci per gene would be required to adequately sample the genetic variation in each gene (Krutovsky and Neale 2005). This study focused on transcribed regions, because relatively few resources were available at the time for analysis of non-transcribed regions of genomic DNA in any conifer species. The majority of SNPs identified as significantly associated with target traits in human GWA studies are in non-coding sequences (45 % in introns and 43 % in intergenic regions; Hindorff et al. 2009), suggesting that efforts to model the genetic variation underlying phenotypic variation must include analysis of non-coding genomic DNA sequences. Fortunately, reference genome sequencing projects are now underway for loblolly pine, white spruce, and Norway spruce (searchable abstracts available on-line at https://pag.confex.com/pag/xx/webprogram/start.html), and reference genome sequences are already available for poplar (Tuskan et al. 2006) and eucalyptus (available on-line at http://phytozome.net/), so genomic sequence information will be more readily available for future efforts to model genetic variation.
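Pairwise linkage disequilibrium of the kind estimated in such studies is commonly summarized by the squared allelic correlation, r². The sketch below assumes phase-known haploid data, as obtained from conifer megagametophytes, with alleles at each biallelic locus coded 0 or 1; the function name and the toy haplotypes are hypothetical.

```python
def ld_r2(locus_a, locus_b):
    """Squared allelic correlation r^2 between two biallelic loci.

    locus_a, locus_b: 0/1 alleles observed in the same haploid samples.
    """
    n = len(locus_a)
    p_a = sum(locus_a) / n
    p_b = sum(locus_b) / n
    p_ab = sum(1 for a, b in zip(locus_a, locus_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic in the sample")
    return d * d / denom

# Two SNPs that always occur together: complete LD.
print(ld_r2([0, 0, 1, 1, 0, 1, 0, 1], [0, 0, 1, 1, 0, 1, 0, 1]))  # 1.0

# Two SNPs whose alleles combine at random: no LD.
print(ld_r2([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0
```

SNPs with r² near 1 carry largely redundant information, which is the reasoning behind needing only a few well-chosen SNPs to represent each block of variation within a gene.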

Determination of the appropriate sample size and number of genetic loci to test in order to achieve a specific level of power in an association study requires evaluation of several population parameters that affect power (Ball 2005; Spencer et al. 2009). The magnitude of the genetic effect of a locus, the frequency in the population of the allele that causes an effect, and the extent of LD between the causative allele and nearby genetic markers (e.g. SNPs) are some of these parameters. Association studies in humans primarily focus on disease-related phenotypes, and the magnitude of the genetic effect is often expressed as a ratio of the likelihood of disease occurrence in a heterozygous individual to the likelihood of disease in an individual homozygous for the most common allele (genotypic risk ratio, Risch and Merikangas 1996, or relative risk per allele, Spencer et al. 2009). The structure of linkage disequilibrium in the human genome is complex enough that simulation is the most general approach to modeling the dependence of experimental power on sample size, relative risk, and allele frequency (Spencer et al. 2009). Such simulations indicate that power is lower for lower risk allele frequencies, for lower risk per allele, and for lower numbers of genetic variant loci tested; for a relative risk per allele of 1.5, an array that assays one million SNP loci provides only about 50 % power in a sample size of 5,000 when the risk allele frequency is less than 10 % (Spencer et al. 2009). A relative risk per allele of 1.5 is roughly equivalent to accounting for 5 % of phenotypic variation, although that equivalence is affected by allele frequency in the population; relatively few loci detected to date in human genome-wide association studies have effects that large (Stranger et al. 2011). 
This suggests that association genetics studies will not be powerful enough to detect individual genes that account for a significant proportion of phenotypic variation in complex traits in forest trees, if the infinitesimal model is accurate. Some traits of interest to tree breeding programs, such as resistance to fusiform rust disease in Pinus taeda, are controlled by individual genes with major effects (Wilcox et al. 1996); association genetics approaches are well-suited to analysis of such traits.
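For a quantitative trait, the dependence of detectability on allele frequency and effect size can be made concrete with the standard expression for the additive variance contributed by one biallelic locus, 2p(1 − p)a², where p is the allele frequency and a the allele substitution effect. The sketch assumes Hardy-Weinberg proportions and purely additive gene action; it is a quantitative-trait analogue of the relative-risk argument above, not a reproduction of the cited simulations, and the function name and numbers are hypothetical.

```python
def variance_explained(allele_freq, effect, phenotypic_sd=1.0):
    """Fraction of phenotypic variance explained by one additive locus.

    allele_freq: frequency p of the allele of interest.
    effect: allele substitution effect a, in trait units.
    phenotypic_sd: phenotypic standard deviation, in the same units.
    """
    additive_variance = 2.0 * allele_freq * (1.0 - allele_freq) * effect ** 2
    return additive_variance / phenotypic_sd ** 2

# The same effect (half a phenotypic SD per allele) at two frequencies.
common = variance_explained(0.5, 0.5)   # about 0.125
rare = variance_explained(0.1, 0.5)     # about 0.045
print(round(common, 3), round(rare, 3))
```

Even an allele with a large effect therefore explains little variance when it is rare, and under the infinitesimal model, where effects are small as well, the variance explained by any single locus falls well below the level that typical sample sizes can detect.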

Height growth is an important phenotype in many tree breeding programs, so results of association genetics analysis of height in humans are of interest. Yang et al. (2010) reported that joint analysis of all SNPs as random effects in a mixed linear model that incorporated relationship information derived from marker genotypes explained almost half the genetic variation in height in a sample human population of less than 4,000 individuals, although all 180 loci identified by meta-analysis of association studies in a combined population of 183,727 individuals (Lango Allen et al. 2010) together explained about 14 % of the genetic variation in height. The difference between the analytical approaches taken by these two groups is that Yang et al. focused their attention on creating a predictive model, without concern for identifying specific loci, while Lango Allen et al. followed a more classical association approach using rigorous statistical methods to reduce the likelihood of false positive results and identify loci and pathways mechanistically related to height growth. Many of the loci identified by Lango Allen et al. can be grouped into biological pathways with recognized effects on growth and development, and in many cases, multiple genetic variants were identified per gene (Lango Allen et al. 2010). This phenomenon, referred to as allelic heterogeneity, reduces power in association analyses, because the same phenotype can be due to multiple different genetic variants, even at the same functional gene. Occurrence of multiple genetic variants within genes that affect the same phenotype creates the possibility for epistatic interactions; epistatic interactions within genes or between tightly-linked genes can result in differences between the heritability estimated from closely-related individuals versus distantly-related individuals (Haig 2011; Würschum et al. 2012; Zuk et al. 2012). 
The approach of analyzing association genetics data by grouping variants into functional genes, organizing genes into pathways, and integrating genetic pathways with gene expression data may provide additional power for understanding phenotypic variation, if modeling approaches that can take pathway structure and gene expression patterns into account can be developed (Cookson et al. 2009; Bennett et al. 2012; Kreimer et al. 2012; O’Hagan et al. 2012). Another approach, similar to that used by Yang et al. (2010), is to incorporate all SNP loci as random effects in the association analysis; this approach has been reported to overcome disadvantages of both traditional linkage analysis and association analysis methods in livestock (Kemper et al. 2012). This type of analysis has much in common with genomic selection, discussed later in the chapter.

The frequency of the minor allele at biallelic SNP loci has a major impact on the power of association genetics studies (Spencer et al. 2009; Stranger et al. 2011). Most SNP loci in a sample of over 3,000 SNPs assayed in over 900 loblolly pine trees had minor allele frequencies of less than 15 % (Eckert et al. 2010). Such low minor allele frequencies in samples of unrelated populations contribute to a requirement for extremely large sample sizes to achieve significance in traditional association genetics studies; only alleles with relatively large effects can be detected unless sample sizes exceed 5,000 and marker allele frequency is close to causative variant allele frequency (Ball 2005; Stranger et al. 2011). Structured populations descended from a smaller number of parents can reduce this problem by increasing the frequency of rare alleles that occur in that sample of parents. This strategy has been used to develop the maize Nested Association Mapping (NAM) population (Yu et al. 2008; McMullen et al. 2009), and methods to deal with the population structure that arises in populations produced from mating designs have also been developed (Yu et al. 2006). The combined use of the NAM population and a more typical association population of 282 inbred lines allowed identification of several SNPs that affect maize kernel composition (Cook et al. 2012). Similar strategies may become feasible in forest tree breeding programs, once reference genome sequences are available and haplotype information can be readily developed for the parents of elite breeding populations.

Understanding the molecular mechanisms underlying phenotypic variation is not the primary objective of breeding programs; instead, the objective is to create models of genetic variation in breeding populations that have predictive power to identify individuals of high genetic merit. Studies that increase understanding of molecular mechanisms can contribute to the development of predictive genetic models in the long term, while studies that focus on developing models of inheritance of complex traits in breeding populations have more immediate value in the short term. Understanding molecular mechanisms can be challenging in human biomedical genetics (Peters and Musunuru 2012), and will be even more challenging for most trees of interest to breeding programs. The association genetics approach can contribute fundamental understanding of mechanisms underlying traits controlled by relatively small numbers of genes, but traits controlled by many genes of equal and small effects will be very expensive to analyze using this method.

5 Genomic Selection

5.1 Background

Many traits of interest to breeders are polygenic, being controlled by many genes each with a small effect (Hill et al. 2008). These small-effect genes are crucial for the success of complex trait improvement (Crosbie et al. 2003). For many decades, plant and animal breeders relied on phenotype and resemblance among relatives to capture the genetic variance explained by these small-effect genes. The methods used to improve complex traits were a ‘black box’, as breeders did not know the underlying genetic architecture of complex traits, such as the number of genes controlling the trait and their location in the genome. Tree breeders have adopted these methods since the 1950s. The success in improving tree characteristics has been relatively modest because breeding-testing-selection cycles for forest trees take many years to complete and tree breeding is logistically complex. Breeders have long looked to molecular markers to overcome these challenges and improve the efficiency of selection (Neale and Savolainen 2004).

Beginning in the late 1970s, quantitative trait loci (QTL) mapping and, later, candidate gene approaches have been explored as tools to explain the genetic architecture of complex traits. The idea was that if alleles with large effects on the trait could be traced with markers (oligogenic model), they could be used for selection of superior genotypes in breeding populations. This concept is called marker-aided selection (MAS). However, QTL mapping and candidate gene approaches have had limited use in improving quantitative traits in most plant and animal breeding programs. Major reasons include the cost of producing large numbers of markers, and the observation that most quantitative traits are controlled by many QTLs, each with a small effect, as predicted by the infinitesimal model. Individual QTLs often explained only a small percentage (<5 %) of total variance, and marker-trait associations discovered in individual families were not repeatable across the population (Goddard and Hayes 2009; Neale 2007).

QTL mapping experiments have been useful in discovering the genetic architecture of quantitative traits important in agriculture and forestry, but their focus is on identifying genetic loci associated with phenotypes. In breeding, by contrast, the emphasis is on predicting the genetic merit of individuals or lines rather than on discovering individual genes. A good predictor of genetic merit does not have to identify the underlying genes (Goddard and Hayes 2009). What is needed is a large number of markers to populate the genome and to exploit the LD between these markers and the many QTL with small effects. This approach is called genomic selection (GS) or genome-wide selection. Since the introduction of the concept by Meuwissen et al. (2001), GS has shifted the paradigm, driven by the increased efficiency of DNA sequencing technologies and computing power.

GS contrasts greatly with traditional MAS, because in GS there is no defined subset of significant markers used for selection. Instead, GS jointly analyzes all markers in a population, attempting to explain the total genetic variance with dense genome-wide marker coverage by summing marker effects to predict the breeding values of individuals (Meuwissen et al. 2001). The idea is that if we populate the genome with high-density markers, we can capture the LD between markers or marker haplotypes and causal polymorphisms. Such associations would be consistent across different families (Meuwissen et al. 2001). With advances in DNA sequencing technologies and genotyping efficiency, GS has become a reality in dairy cattle breeding (Goddard and Hayes 2009). Many livestock breeding programs now routinely apply GS to market bulls (Hayes et al. 2009). Genomic selection starts from a training population that has been both genotyped and phenotyped and is used to fit the prediction model. Candidates for the next cycle of breeding are then selected on the basis of their genomic predictions. The training can be repeated iteratively as new phenotype and marker data accumulate (Heffner et al. 2011).
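The workflow just described (fit marker effects in a training population, sum them to predict candidates, then select) can be sketched with simulated data. Everything in the sketch is an assumption made for illustration: the population sizes, the simulated marker effects, the use of ridge regression as the prediction model, and a shrinkage parameter computed from variances that are known only because the data are simulated. It is not the procedure of any study cited here.

```python
import numpy as np

def ridge_marker_effects(Z, y, lam):
    """Estimate all marker effects jointly by solving
    (Z'Z + lam * I) a = Z'(y - mean(y))."""
    n_markers = Z.shape[1]
    lhs = Z.T @ Z + lam * np.eye(n_markers)
    return np.linalg.solve(lhs, Z.T @ (y - y.mean()))

rng = np.random.default_rng(42)
n_train, n_cand, n_markers = 200, 50, 500

# Simulated SNP genotypes (0/1/2), centred on training-set means.
freq = rng.uniform(0.1, 0.9, n_markers)
geno = rng.binomial(2, freq, size=(n_train + n_cand, n_markers)).astype(float)
Z = geno - geno[:n_train].mean(axis=0)

# Simulated true marker effects and phenotypes (heritability near 0.5).
marker_sd = 0.1
true_bv = Z @ rng.normal(0.0, marker_sd, n_markers)
residual_var = true_bv.var()
y_train = true_bv[:n_train] + rng.normal(0.0, np.sqrt(residual_var), n_train)

# Step 1: train on individuals with both genotypes and phenotypes.
lam = residual_var / marker_sd ** 2
effects = ridge_marker_effects(Z[:n_train], y_train, lam)

# Step 2: genomic estimated breeding values of unphenotyped candidates.
gebv = Z[n_train:] @ effects

# Step 3: select the ten highest-ranking candidates.
selected = np.argsort(gebv)[::-1][:10]

accuracy = np.corrcoef(gebv, true_bv[n_train:])[0, 1]
print(selected, round(float(accuracy), 2))
```

The correlation between predicted and true breeding values printed at the end is what the genomic selection literature calls accuracy; it can be computed here only because the true values were simulated.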

5.2 Empirical Examples from Forest Trees

Forest tree breeding programs are still at the first stage of breeding-testing and selection cycles with little genetic difference from natural populations. If successful, the impact of genomic selection on forest tree breeding could be far greater than for other crops or animal breeding programs. A few early empirical studies on genomic selection in forest trees are encouraging. For example, in a cloned loblolly pine breeding population, accuracies of GS varied between 0.55 and 0.88, matching those achieved by conventional phenotypic selection (Resende et al. 2012). Similarly in the same species, Isik et al. (2011) reported genomic estimated breeding values with reliability as high as breeding values based on resemblance among relatives and phenotypic data. These studies estimated the individual marker effect and summed up the coefficients to estimate genomic estimated breeding values of trees.

Alternatively a smaller subset of markers can be used to estimate realized genomic relationships using frequency of alleles shared by individuals (Legarra and Misztal 2008). Then, the additive genetic relationship matrix derived from pedigree is substituted by the genomic relationship matrix to predict genomic estimated breeding values. Genomic BLUP (GBLUP) could be a powerful tool for forest tree breeding programs. Such models can capture the Mendelian segregation effect in full-sib families, which was not the case using the average additive genetic relationships. For example, Zapata-Valenzuela et al. (2011) showed that accuracies of genomic estimated breeding values using GBLUP were comparable to traditional pedigree-based BLUP methods. In the same study, breeding values of a training population were estimated using GBLUP and classical BLUP (Henderson 1984). In the absence of phenotype, sibs from a cross had the same mid-parent breeding values when classical BLUP was used (Fig. 5). However, genomic relationship matrices based on SNP markers allowed prediction of different genetic values for sibs from a single cross.

Fig. 5

Predicted breeding values of loblolly pine clones based on pedigree (y-axis) and genomic BLUP (x-axis) for eight crosses. Each cross is designated with a different color. In the absence of phenotype, the expected breeding value of sibs would be the same, which is the mid-parent value (ABLUP). However, DNA markers can capture Mendelian sampling effect within each cross as shown here, and thus, sibs can be ranked and selected without progeny testing (Zapata-Valenzuela et al. 2011)
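How a realized relationship matrix separates full-sibs can be sketched in a few lines of Python. This is a minimal illustration and not the procedure of the studies cited above: it uses the widely used VanRaden-style scaling of centred allele counts, simulated genotypes, unlinked markers, and known base-population allele frequencies, all of which are assumptions. The function and variable names are hypothetical.

```python
import numpy as np

def genomic_relationship(genotypes, freqs=None):
    """VanRaden-style G from an (individuals x markers) matrix of 0/1/2 allele counts.

    freqs: allele frequencies used for centring; estimated from the data if omitted.
    """
    M = np.asarray(genotypes, dtype=float)
    p = M.mean(axis=0) / 2.0 if freqs is None else np.asarray(freqs, dtype=float)
    Z = M - 2.0 * p                                   # centre each marker column
    return Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))    # scale to the pedigree-matrix scale

rng = np.random.default_rng(1)
n_markers, n_sibs = 2000, 5

# Assumed base-population allele frequencies and two unrelated parents,
# each carrying two haplotypes (rows) of 0/1 alleles.
p_base = rng.uniform(0.2, 0.8, n_markers)
mother = rng.binomial(1, p_base, size=(2, n_markers))
father = rng.binomial(1, p_base, size=(2, n_markers))

def gamete(parent):
    # Pick one of the two parental alleles at each marker (markers treated as unlinked).
    pick = rng.integers(0, 2, n_markers)
    return parent[pick, np.arange(n_markers)]

# Full-sib genotypes: one gamete from each parent.
sibs = np.array([gamete(mother) + gamete(father) for _ in range(n_sibs)])

G = genomic_relationship(sibs, freqs=p_base)
pairs = G[np.triu_indices(n_sibs, k=1)]   # realized relationships for all sib pairs
print(np.round(pairs, 3))
```

The pedigree-based relationship is 0.5 for every full-sib pair, whereas the printed genomic values scatter around 0.5. That scatter reflects Mendelian sampling, which is what allows GBLUP to rank sibs within a cross.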

5.3 Statistical Machinery

Classical linear mixed models handle large numbers of markers as predictors inefficiently, because the number of predictors (p) exceeds the number of data points (n) available to explain variance in the phenotype. This large-p, small-n problem leaves too few degrees of freedom to estimate all marker effects as fixed effects. Statistical analysis of large numbers of markers has been a very active area of research in recent years, and many statistical methods have been proposed in the literature (Gianola et al. 2009). The effects of markers or haplotypes can be estimated by including all markers in a model simultaneously, but the challenge is to estimate the variances of marker effects. The best linear unbiased prediction (BLUP) method and ridge regression approaches have been proposed to estimate individual marker effects (Meuwissen et al. 2001; Whittaker et al. 2000). These methods assume that marker effects are sampled from a common normal distribution, \( N\left(0,{\sigma}_{g}^{2}\right) \), and that each marker explains the same amount (\( {\sigma }_{g}^{2}/n \)) of genetic variance. Rather than categorizing markers as either significant or as having no effect, ridge regression and BLUP shrink all marker effects toward zero (Meuwissen et al. 2001). The equal-variance assumption is not realistic, because all markers are shrunk toward zero to the same degree regardless of their association with the trait loci.

Bayesian methods have a natural way of taking into account uncertainty about all unknowns in a model (e.g., Gianola et al. 2009) and, when coupled with the power and flexibility of Markov chain Monte Carlo, can be applied to almost any parametric statistical model. Meuwissen et al. (2001) introduced BayesA and BayesB and compared them with the BLUP method in their original paper on GS. In BayesA, all markers explain a fraction of the genetic variance, and the variance explained by each marker can vary, with a scaled inverted chi-square distribution as the prior. BayesB addresses a shortcoming of BayesA by setting the effects of a high proportion (π) of markers to zero. BayesC, BayesCπ, BayesD, and BayesDπ were introduced to address the undesirable effect of priors on the estimates observed for BayesA and BayesB. Habier et al. (2011) concluded that accuracies of the alternative Bayesian methods were similar and that none of them outperformed all others across all traits and training data sizes. The choice of statistical method for GS is sometimes a matter of practicality, time, and ease of application. Examples from empirical and simulated data suggest that Bayesian approaches can increase the accuracy of predictions, but the increase is usually minimal unless a large fraction of the genetic variance in the trait in question is controlled by a few loci.
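The ridge regression approach described above can be sketched as follows. This is an illustrative Python example on simulated data and not the implementation used in any of the cited studies. The shrinkage parameter `lam` is supplied by hand here as an assumption; in RR-BLUP it corresponds to the ratio of residual variance to marker-effect variance and would be estimated from the data. All names are hypothetical.

```python
import numpy as np

def ridge_marker_effects(Z, y, lam):
    """Ridge estimates of marker effects when markers (p) outnumber records (n).

    Solves (Z'Z + lam*I) b = Z'(y - mean) through the equivalent n x n system
    b = Z'(ZZ' + lam*I)^-1 (y - mean), which is cheaper when p >> n.
    """
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y, dtype=float)
    mu = y.mean()
    Zc = Z - Z.mean(axis=0)                        # centre marker columns
    n = Zc.shape[0]
    alpha = np.linalg.solve(Zc @ Zc.T + lam * np.eye(n), y - mu)
    return mu, Zc.T @ alpha                        # intercept, marker effects

# Simulated example: 100 trees, 1,000 markers, 20 of which affect the trait.
rng = np.random.default_rng(0)
n, p = 100, 1000
Z = rng.binomial(2, 0.5, size=(n, p))
true_b = np.zeros(p)
true_b[:20] = rng.normal(0.0, 1.0, 20)
y = Z @ true_b + rng.normal(0.0, 1.0, n)

mu, b_mild = ridge_marker_effects(Z, y, lam=10.0)
_, b_strong = ridge_marker_effects(Z, y, lam=1000.0)

# A larger penalty shrinks every marker effect further toward zero.
print(np.linalg.norm(b_mild), np.linalg.norm(b_strong))
```

Every marker receives the same penalty, so markers with no association to the trait and markers close to trait loci are shrunk alike. The Bayesian alternatives relax this by allowing marker-specific variances.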

5.4 Challenges of GS in Forest Tree Breeding

Despite advances in the efficiency of genotyping technologies, genotyping is still costly for forest trees. For example, the Illumina SNP genotyping platform cost about $150 per sample for loblolly pine as of 2012, though the cost is decreasing. Several labs in the USA and other countries are working on alternative genotyping technologies, such as genotyping by sequencing (Baird et al. 2008; Elshire et al. 2011; Peterson et al. 2012; Poland et al. 2012; Truong et al. 2012), and we expect that the cost of genotyping could fall below $50 per sample by 2013.

GS has been successful in cattle breeding because the number of founders in these populations is relatively small (<30) and the LD between markers and trait loci is high, thanks to the deep pedigrees and small effective population sizes of these populations. Tree breeding populations are still in their infancy. Their pedigree structures are still shallow, with very low linkage disequilibrium (Neale and Savolainen 2004). A marker-trait phase detected in one generation may not hold in a subsequent generation because of meiotic recombination. For GS to be successful, well-structured populations (small effective population size, multiple generations) are needed.

Conifers are major targets of breeding programs in the northern hemisphere, and they have large and complex genomes. GS requires dense coverage of the whole genome to trace the many QTLs associated with a phenotype, so many more markers might be needed to cover conifer genomes. Grattapaglia and Resende (2011) suggested that 20 markers/cM are needed for an effective population size greater than 30.

Forest trees also have some advantages for the implementation of GS. A large population can be assembled easily. Each family can be represented by a large number of progeny (several hundred) with little investment and time. Phenotyping can be quite accurate thanks to efficient experimental designs and cloning of individuals.

An example GS plan has been proposed for a loblolly pine breeding population within the North Carolina State University Tree Improvement Program in the USA (Fig. 6). In the diagram given in Fig. 6, the process starts with creating a training population with an effective population size (Ne) smaller than 50 parents. In this example, 20 parents are used. Relatedness among the 20 founders is desirable, because it makes the marker-based model more powerful for predicting GEBV by tracing historical LD in the population. From full-sib crosses of the 20 parents, about 1,000 individuals can be genotyped. This progeny population is field-tested and breeding values are obtained. Deregressed breeding values of the 1,000 individuals, or phenotypic values adjusted for fixed effects, can then be used as the new ‘phenotype’ for development of a marker-based model (M1).

Fig. 6

Genomic selection process for an elite breeding population. A marker-based prediction model is retrained across multiple generations. Such a process would make the model more powerful for predicting genomic estimated breeding values by tracing LD between markers and QTLs

There are different methods to validate the predictive ability of markers. An additional 500 progeny from the same crosses (with known phenotypes and genotypes) can be used as a validation population. Alternatively, a small random subset of progeny, or a subset of progeny selected within each full-sib family, can be used to validate model M1. This step is a proof of concept to show that the model has predictive power, and is not necessarily an application of GS. To realize the benefit of GS, we need to breed the selected individuals from the training population, obtain seeds, and use M1 to make selection decisions among their progeny. This can be called ‘across generation’ GS application. The M1 model can be retrained when more genotypic and phenotypic data become available as breeding progresses (M2). GS training models would become more reliable as new data are included and can be used for multiple generations.
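The validation step can be sketched in Python as a train-and-validate split. The sample sizes (1,000 training and 500 validation individuals) follow the plan described above; everything else is assumed for illustration. In particular, the genotypes are simulated as independent individuals with unlinked markers instead of structured full-sib families, the trait has an assumed heritability of about 0.5, and the model is a simple ridge regression with a hand-set penalty `lam`.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train, n_valid, n_markers = 1000, 500, 2000

# Simulated 0/1/2 genotypes and a polygenic trait (all values are assumptions).
Z = rng.binomial(2, 0.5, size=(n_train + n_valid, n_markers)).astype(float)
true_effects = rng.normal(0.0, 0.05, n_markers)
g = Z @ true_effects                          # true genetic values
y = g + rng.normal(0.0, g.std(), g.size)      # noise variance equal to genetic variance

Z_tr, Z_va = Z[:n_train], Z[n_train:]
y_tr, y_va = y[:n_train], y[n_train:]

# Train the marker model ("M1") on the training set only.
col_means = Z_tr.mean(axis=0)
Zc = Z_tr - col_means
lam = 0.5 * n_markers                         # hypothetical shrinkage parameter
alpha = np.linalg.solve(Zc @ Zc.T + lam * np.eye(n_train), y_tr - y_tr.mean())
marker_effects = Zc.T @ alpha

# Predict the validation individuals from their genotypes alone.
gebv_va = y_tr.mean() + (Z_va - col_means) @ marker_effects

# Predictive ability: correlation between predictions and observed phenotypes.
predictive_ability = np.corrcoef(gebv_va, y_va)[0, 1]
print(round(predictive_ability, 3))
```

The validation phenotypes are never used for training, so the correlation measures how well the model generalizes to genotyped but unphenotyped candidates. Retraining with additional records (M2) amounts to repeating the training step on the enlarged data set.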

6 Conclusions

The availability of cost-effective genetic markers in forest tree species is expanding rapidly due to advances in DNA sequencing technology and investment in determining the reference genome sequences for several commercially-important species of forest trees. These resources are likely to fundamentally change the way tree breeding programs characterize genetic variation in their breeding populations, and several research groups are actively working to develop methods for applications of these tools in practical breeding programs. Molecular markers are already useful tools for population management applications such as validation of crosses, pedigree reconstruction, and unambiguous identification of clones. Association genetics results have already been reported for several traits in various species of forest trees, and application of these results in practical breeding programs may follow soon. Development of more sophisticated analytical methods capable of integrating the analysis of genetic variation detected by SNP assays with variation in gene expression patterns, metabolite levels, and phenotypic measurements may provide new tools capable of more accurate prediction of genetic value based on molecular assays. Predictive modeling of genetic value is the central objective of genomic selection methods, which have shown considerable promise in livestock and crop species that have appropriate patterns of LD in breeding populations. Forest tree breeding populations are likely to have very different patterns of LD than livestock or crop species, and new approaches to genomic selection may be required in order for this method to reach its full potential in applied tree breeding programs.