Introduction

Domestication and tree breeding of eucalypt species typically commences with first-generation testing of families from a range of subpopulations (or provenances) in provenance–progeny trials established across a range of sites in the target planting region. From quantitative genetic analysis of phenotypic data, subpopulations, families and individuals with superior properties for inclusion in the second generation are identified. In some cases, first-generation families will have been selected only from provenances identified as better-performing in previous provenance tests, though inclusion of untested provenances and selections from land races is also common. At the time of breeding program commencement, questions of subpopulation structure, such as whether ‘provenances’ from which seed was collected are genetically discrete entities, genetic origins of land-race selections and breeding system variables, e.g. outcrossing rates, are unresolved. Ideally, these questions should be answered before quantitative analysis commences, as assumptions concerning population structure and relatedness within and among families will affect the accuracy of selection and genetic gain in future generations. Molecular markers are a potentially powerful tool, complementary to quantitative analysis of phenotypic data, for managing breeding populations. Markers can be used to quantify within-population genetic diversity and manage the genetic base, identify population structure and hierarchy and elucidate pedigree to help manage inbreeding (Burdon and Wilcox 2007).

As eucalypts have a mixed mating system, there is potentially a mix of half-sib (HS), full-sib (FS), self-full-sib and self-half-sib relationships within families. In estimating genetic parameters related to future genetic gain, such as narrow-sense heritability (\( h^2 \)), it is usual to accommodate the effects of mixed mating by dividing the among-family variance component by a coefficient of relationship (ρ) to obtain the additive variance: in eucalypts, a value of 1/2.5, rather than 1/4 (the value for true HS), is commonly used (e.g. Eldridge et al. 1993). However, as the necessary paternal pedigree information does not exist, adjustments for heterogeneity of relatedness among subpopulations, families and individuals within families are not made, and the ranking of individual trees and families is equivalent to that made under the assumption of HS families.
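The effect of the assumed coefficient of relationship on heritability can be illustrated with a minimal sketch. It assumes the usual open-pollinated family model, in which additive variance is estimated as the among-family variance component divided by ρ. The function name and the variance components are hypothetical and are not estimates from this study.

```python
def narrow_sense_heritability(var_family, var_phenotypic, rho):
    """Return h^2 = (var_family / rho) / var_phenotypic.

    rho is the assumed coefficient of relationship within
    open-pollinated families (1/4 for true half-sibs).
    """
    if not 0 < rho <= 1:
        raise ValueError("rho must lie in (0, 1]")
    var_additive = var_family / rho
    return var_additive / var_phenotypic


# Hypothetical variance components, for illustration only
VAR_FAMILY = 2.0
VAR_PHENOTYPIC = 40.0

h2_half_sib = narrow_sense_heritability(VAR_FAMILY, VAR_PHENOTYPIC, rho=1 / 4)
h2_mixed = narrow_sense_heritability(VAR_FAMILY, VAR_PHENOTYPIC, rho=1 / 2.5)

print(f"h2 with rho = 1/4 (true HS):        {h2_half_sib:.3f}")
print(f"h2 with rho = 1/2.5 (mixed mating): {h2_mixed:.3f}")
```

Because a single ρ rescales every family by the same factor, the heritability estimate changes but the ranking of families does not, which is the point made in the paragraph above.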

Marker-based assays of young eucalypt germinant cohorts usually demonstrate a deficiency of heterozygotes relative to Hardy–Weinberg expectation (HWE) and positive values of the coefficient of inbreeding (f) (Potts and Wiltshire 1997). This is indicative of the effects of selfing and/or other forms of inbreeding associated with mixed mating and is more marked in some eucalypt species than others. For example, McDonald et al. (2003) showed that wild Eucalyptus cladocalyx is less outcrossed and more inbred than most widely planted eucalypt species. They also showed marked heterogeneity in the degree of heterozygosity and inbreeding among populations and families. However, for a number of eucalypts, inbreeding has resulted in marked inbreeding depression (ID) of fitness-related traits (e.g. vigour, survival), resulting in strong selection against inbred individuals with nearly complete mortality of selfed individuals before sexual maturity (e.g. Costa e Silva et al. 2010b; Griffin and Cotterill 1988; Hardner and Tibbits 1998; Hardner and Potts 1997). Observed heterozygosity (\( H_{\mathrm{O}} \)) in adult populations of wild eucalypts is, therefore, usually slightly higher than expected heterozygosity (\( H_{\mathrm{E}} \)) (Potts and Wiltshire 1997). Low vigour and/or rapid mortality in progeny trials involving control-pollinated selfs and open-pollinated progeny has been demonstrated in Eucalyptus regnans (Griffin and Cotterill 1988; Hardner and Potts 1997) and Eucalyptus globulus (Costa e Silva et al. 2010b; Hardner et al. 1996), though ID of vigour and survival was not evident in a small number of E. cladocalyx families estimated to be completely selfed using isozyme markers at age 5 years (Bush et al. 2011a).

It is usually assumed that the mothers of progeny within a first-generation breeding population are unrelated. Wild eucalypt mothers should ideally be sampled at sufficient spacing, twice the height of the trees (Gunn 2001) or approximately 100 m apart, to minimise the selection of related individuals within the family clusters typical of eucalypts (e.g. Skabo et al. 1998), which disperse most of their seed over a short distance (Potts and Wiltshire 1997). However, mating and relatedness between quite distant trees are still possible, especially if stand flowering is sparse, some individuals flower asynchronously and/or pollinators are foraging over relatively long distances. Another common practice in first-generation breeding populations is the integration of land-race selections, the genetic base of which is typically unknown. If established from a very narrow genetic base, trees sampled from distant parts of a plantation or even region may well be close relatives (e.g. Eldridge 1978).

In this study, we investigate the use of single-nucleotide polymorphism (SNP) markers to test assumptions that underpin orchard design and breeding program management of an E. cladocalyx F. Muell breeding population situated in southeast Australia. E. cladocalyx occurs in three disjunct regions of South Australia: Kangaroo Island (KI), Eyre Peninsula and the South Flinders Ranges (SFR) between latitudes 32–36° S (for grid references of specific subpopulations, see Online Resource 1). In the SFR, E. cladocalyx typically forms extensive open forests, as either pure or mixed eucalypt species stands. Though the total extent of the forest there has been reduced by clearing in the last 140 years, reserves now protect large populations. Stands in the Flinders Chase National Park and at Cygnet River on KI are generally small and isolated within creek lines, where they often form a dominant overstorey. Another KI population at American River forms a small coastal strip on the foreshore of the town, the existing trees probably a remnant of a more extensive stand that has been removed in the last two centuries. On the Eyre Peninsula, E. cladocalyx occurs as low woodlands, the trees often having a stunted appearance. The populations on Eyre Peninsula are very fragmented due to clearing for agriculture, occurring in two disjunct, upland localities. Mean annual rainfall in the natural range is between 350 and 650 mm and mean annual temperature is 16–17.5 °C (Commonwealth of Australia Bureau of Meteorology records). The species is predominantly insect-pollinated and has a mixed mating system, with individual trees ranging from self-compatible to self-incompatible (Ellis and Sedgley 1992). It has been widely planted since the 1870s in southern Australia on farms for naturally durable utility timber, fuelwood and windbreaks. It is the subject of tree breeding programs for low-rainfall planting environments in Australia (Harwood et al. 2007) and Chile (Mora et al. 2009). 
Population genetic diversity, structure, relatedness and inbreeding parameters within and among subpopulations and families in the breeding population are estimated. The implications of the findings in the context of domestication and breeding program management are discussed.

Materials and methods

Breeding populations and sampling

The E. cladocalyx breeding population was established at Lismore, Victoria (37°51′ S, 143°51′ E), as 1 of 11 sites established throughout southeast Australia in year 2001 by the Australian Low Rainfall Tree Improvement Group (ALRTIG) (see Bush et al. 2009). Samples were drawn from 119 families within 13 subpopulations (Table 1; for subpopulation location data, see Online Resource 1). Studies of range-wide selections by McDonald et al. (2003), using isozyme markers, and Bush et al. (2011b), who examined quantitative growth and wood trait data, have demonstrated that wild E. cladocalyx subpopulations form three groups, of which two are represented in the breeding population sampled in this study (KI and SFR). Families from the third, the Eyre Peninsula group, were not included in the ALRTIG breeding population as prior provenance tests had demonstrated the group’s markedly inferior growth and form (Harwood and Bulman 2001). The breeding population also contains families from cultivated stands including land-race selections from Western Victoria and a planted stand and multi-provenance seed production area (SPA) in South Australia. Genomic DNA from an average of 10 random selections per family was sampled at age 5 years from the 20 available trees per family in the trial. Height and diameter at breast height (DBH) from all survivors (total of 2,185 trees) of the 20 trees per family originally planted were measured at 5 and 7.8 years.

Table 1 MAF, average observed (\( {{\widehat{H}}_{\mathrm{O}}} \)) and expected (\( {{\widehat{H}}_{\mathrm{E}}} \)) heterozygosity, percentage of polymorphic loci (%P), relatedness (\( \overline{{2\theta }} \)) and inbreeding coefficient (\( \overline{f} \)) calculated by the Wang triadic method

Molecular markers

SNP markers, though individually less informative than highly multi-allelic markers such as microsatellites (simple sequence repeats [SSR]), can provide useful information if employed in sufficient numbers. Wang and Santure (2009) estimate that, for the purposes of co-ancestry estimation and pedigree reconstruction, 10 SNPs give similar information to 1 SSR. SNP discovery was carried out by creating DNA bulks from seedling tissue of three individuals from each of 149 open-pollinated family seed lots drawn from the SFR, KI and Eyre Peninsula subpopulation groups selected to be representative of the E. cladocalyx natural range. Amplicons of 39 genes (predominantly known and putative genes related to aspects of wood, vascular tissue development and some other traits identified in Eucalyptus nitens) that had been previously sequenced in E. nitens and other eucalypts, and for which existing primers were available, were selected for this study (see Online Resource 2 which gives the GenBank accession numbers and places the sequences within the Eucalyptus grandis genome reference assembly; Myburg et al. 2011). Sequencing was performed using the Roche/454 pyrosequencing technology, producing ∼200 bp reads. Sequencing reads were aligned using CLC Genomics Workbench Version 4 (CLC bio, Aarhus, Denmark) to identify SNPs. SNPs were selected primarily on the basis of read depth and sequence quality score and absence of proximal SNPs (within approximately 10 bp of the target SNP) that can interfere with primer design. This resulted in a minimum minor allele frequency (MAF) >0.1. Target spacing of SNPs within genes was >500 bp. A total of 75 selected SNPs were genotyped using the Sequenom platform at the Australian Genome Research Facility. In many cases, multiple SNPs were selected from single genes (Online Resource 2), allowing exploration of intra-gene linkage disequilibrium (LD) breakdown which is normally rapid in eucalypts (Grattapaglia and Kirst 2008).
SNP identity was established by aligning SNP flanking sequence (200–300 bp) to the E. grandis genome reference assembly using the Phytozome Version 8 (Goodstein et al. 2012) online basic local alignment search tool (BLAST).

Allele frequency-dependent measures

Parameter estimates of observed and expected heterozygosity and tests for HWE and F statistics (\( F_{\mathrm{ST}} \), allelic diversity within subpopulations relative to the entire population; \( F_{\mathrm{IS}} \), inbreeding coefficient of an individual relative to the subpopulation) were calculated using Arlequin (Excoffier and Lischer 2010). These allele frequency parameters were used to study genetic diversity, population sub-structure and inbreeding. Genetic distance matrices were constructed and used as the basis for analysis of molecular variance (AMOVA; Arlequin) and exact tests of population differentiation (Arlequin). Unweighted pair group method with arithmetic mean (UPGMA) cluster analysis (POPTREE; Takezaki et al. 2010) was undertaken to examine population differentiation.
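For a single biallelic SNP, the quantities described above can be sketched directly from genotype counts. This is a simplified illustration only: the function name and the counts are hypothetical, no sample-size correction is applied, and Arlequin's estimators may differ in detail.

```python
def locus_summary(n_aa, n_ab, n_bb):
    """Allele frequency, observed and expected heterozygosity and F_IS
    for one biallelic locus, given counts of the three genotypes."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotyped individuals")
    p = (2 * n_aa + n_ab) / (2 * n)      # frequency of allele A
    h_obs = n_ab / n                     # observed heterozygosity
    h_exp = 2 * p * (1 - p)              # Hardy-Weinberg expectation
    # F_IS is undefined for a monomorphic locus (h_exp == 0)
    f_is = 1 - h_obs / h_exp if h_exp > 0 else float("nan")
    return p, h_obs, h_exp, f_is


# Hypothetical genotype counts showing a heterozygote deficit
p, h_obs, h_exp, f_is = locus_summary(n_aa=50, n_ab=20, n_bb=30)
print(f"p = {p:.2f}, H_O = {h_obs:.2f}, H_E = {h_exp:.2f}, F_IS = {f_is:.2f}")
```

A positive F_IS, as in this toy example, corresponds to fewer heterozygotes than expected under Hardy–Weinberg proportions.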

Marker neutrality

Marker neutrality is an important consideration for population genetics studies because markers under selection are typically outliers with respect to at least some of their properties, are likely to bias estimates of certain parameters and may produce misleading results. Their removal is, therefore, desirable (Luikart et al. 2003). Beaumont and Nichols (1996) have shown that loci likely to be under selection will show unusually high levels of genetic differentiation (\( F_{\mathrm{ST}} \)), and they and others have modelled selective neutrality by comparing \( F_{\mathrm{ST}} \) and \( H_{\mathrm{O}} \) against confidence limits simulated under various assumptions about population structure. Excoffier et al. (2009), in a refinement of the Beaumont and Nichols method, showed that false-positive identification of SNPs under selection may result where regional structuring of subpopulations exists. As E. cladocalyx is known to form three groups of subpopulations, confidence limits were simulated for 3 regional groups and 14 subpopulations. Calculations were implemented in Arlequin.
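The two per-locus quantities that are compared in an outlier scan, differentiation and heterozygosity, can be sketched as follows. This is a deliberately simple, unweighted heterozygosity-based form of \( F_{\mathrm{ST}} \), i.e. \( (H_T - H_S)/H_T \); it is not Arlequin's AMOVA-based estimator and does not include the simulation of confidence limits. The function names and allele frequencies are hypothetical.

```python
def heterozygosity(p):
    """Expected heterozygosity of a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def locus_fst(subpop_freqs):
    """Unweighted (H_T - H_S) / H_T for one biallelic locus.

    subpop_freqs holds the frequency of the same allele in each subpopulation.
    """
    if not subpop_freqs:
        raise ValueError("at least one subpopulation is required")
    k = len(subpop_freqs)
    h_s = sum(heterozygosity(p) for p in subpop_freqs) / k  # mean within-subpopulation
    h_t = heterozygosity(sum(subpop_freqs) / k)             # pooled
    if h_t == 0:
        return float("nan")  # locus monomorphic everywhere
    return (h_t - h_s) / h_t


# Hypothetical allele frequencies in two subpopulations
fst_divergent = locus_fst([0.1, 0.9])   # strongly differentiated locus
fst_uniform = locus_fst([0.3, 0.3])     # no differentiation
print(f"{fst_divergent:.2f} {fst_uniform:.2f}")
```

In an outlier test, a locus would be flagged only if its differentiation fell outside the simulated confidence limits for its heterozygosity; that comparison is not reproduced here.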

Linkage disequilibrium

SNPs from all genes were combined to estimate LD. Within-gene SNP positions obtained from BLAST-searching SNP flanking sequences against the preliminary E. grandis reference genome were used for plotting LD decay. LD measure \( R^2 \) was calculated using the Haploview program (Barrett et al. 2005). LD decay was investigated by plotting \( R^2 \) values against the relative positions of SNPs.
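As a rough illustration of the LD measure, the sketch below computes the squared correlation between genotype dosages (0, 1 or 2 copies of one allele) at two SNPs. This is a swapped-in approximation that needs no phasing; it is not the haplotype-based calculation performed by Haploview. The function name and the genotype vectors are hypothetical.

```python
def genotype_r2(dosage_a, dosage_b):
    """Squared Pearson correlation between allele dosages at two SNPs."""
    if len(dosage_a) != len(dosage_b):
        raise ValueError("both SNPs must be scored on the same individuals")
    n = len(dosage_a)
    mean_a = sum(dosage_a) / n
    mean_b = sum(dosage_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(dosage_a, dosage_b))
    var_a = sum((a - mean_a) ** 2 for a in dosage_a)
    var_b = sum((b - mean_b) ** 2 for b in dosage_b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # undefined when either SNP is monomorphic
    return cov * cov / (var_a * var_b)


# Hypothetical dosages for six and four trees, respectively
r2_complete = genotype_r2([0, 1, 2, 0, 1, 2], [0, 1, 2, 0, 1, 2])
r2_none = genotype_r2([0, 0, 2, 2], [0, 2, 0, 2])
print(r2_complete, r2_none)
```

The measure is undefined when either SNP is monomorphic, and it rests on very few informative individuals when loci are near-monomorphic, so values from such locus pairs need cautious interpretation.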

Relatedness and inbreeding

Measures of relatedness between individual trees, within and among families, subpopulations and whole populations are important but often unknown in first-generation eucalypt breeding populations derived from mixed mating in the wild and land-race selections. We used Coancestry Version 1.0 (Wang 2010) to estimate relatedness of pairs of trees (dyads) within each of the breeding populations. This software estimates the coefficient of relatedness (2θ), equivalent to twice the coefficient of co-ancestry (θ), for individual dyads. The coefficient of co-ancestry of two individuals reflects the probability that two gametes taken at random, one from each, carry alleles that are identical by descent (IBD) and is equivalent to the inbreeding coefficient (f) of their progeny should they be mated together (Lynch and Walsh 1998). The coefficient of relatedness and the coefficient of relationship (ρ), the latter often applied uniformly to whole populations to scale variance components in quantitative analysis, are related as follows:

For individuals i and j:

$$ {\rho_{ij}}=\frac{2{\theta_{ij}}}{\sqrt{(1+{f_i})(1+{f_j})}} \tag{1} $$

where \( f_i \) and \( f_j \) are their coefficients of inbreeding and \( \theta_{ij} \) is the coefficient of co-ancestry. Assuming no ancestral inbreeding: for selfs, \( f_i = f_j = 1/2 \), \( 2\theta_{ij} = 1 \) and \( \rho_{ij} = 2/3 \); and for outcrossed relationships, 2θ and ρ will be equivalent, since f is nil, e.g. for HS, \( f_i = f_j = 0 \) and \( 2\theta_{ij} = \rho_{ij} = 1/4 \).
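The two worked cases given for Eq. 1 can be checked numerically with a short sketch; the function name is an assumption made for illustration.

```python
from math import sqrt


def coefficient_of_relationship(two_theta, f_i, f_j):
    """Eq. 1: rho_ij = 2*theta_ij / sqrt((1 + f_i) * (1 + f_j))."""
    return two_theta / sqrt((1.0 + f_i) * (1.0 + f_j))


# Selfed sibs with no ancestral inbreeding: f_i = f_j = 1/2, 2*theta = 1
rho_selfs = coefficient_of_relationship(1.0, 0.5, 0.5)

# Outcrossed half-sibs: f_i = f_j = 0, 2*theta = 1/4
rho_half_sibs = coefficient_of_relationship(0.25, 0.0, 0.0)

print(f"selfs: {rho_selfs:.4f}, half-sibs: {rho_half_sibs:.4f}")
```

The first case evaluates to 2/3 and the second to 1/4, matching the values stated in the text.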

The package Coancestry estimates 2θ by various methods. We selected the Wang triadic expectation maximisation strategy which has been shown to perform well when family structure is present and which also accounts for inbreeding (Wang 2007), an important consideration given previous findings of elevated selfing in E. cladocalyx (Bush et al. 2011a; McDonald et al. 2003).

Wang (2010) recommends that a critical first step in undertaking estimation of relatedness is determination of bias and precision of the selected method using data simulated to emulate the characteristics of the empirical marker system and data. Depending on the specific characteristics of the markers used, systematic under-estimation or over-estimation of relatedness parameters such as f and 2θ is likely to occur. The likely direction and magnitude of these biases can be quantified using marker data that are simulated for specific, known relationship types and that emulate the allele frequency characteristics of the markers used in the empirical study. This then allows an appropriate correction to be applied. Simulated biallelic data were generated and analysed using Coancestry’s in-built functionality. Allele frequencies from the empirical marker set, as well as those drawn from a Dirichlet distribution, were used to simulate datasets with specified relationships among sets of dyads of varying sizes. Coancestry estimates confidence intervals (CI) by bootstrapping over loci. The effect on bias and precision of averaging over 45 dyads, representing the average E. cladocalyx family size, giving \( {{\overline{{2\theta}}}_{\mathrm{f}}} \) and \( {{\overline{f}}_{\mathrm{f}}} \), was investigated, with CI determined by bootstrapping over sets of within-family dyad estimates.
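The family-average step can be sketched as follows: a family of 10 sampled trees yields 45 within-family dyads, and a percentile bootstrap over those dyad estimates gives a CI for the family mean. This is a generic illustration, not Coancestry's implementation; the function names, the number of bootstrap replicates and the dyad estimates are all hypothetical.

```python
import random
from itertools import combinations
from statistics import mean


def family_dyads(tree_ids):
    """All unordered pairs of trees within one family."""
    return list(combinations(tree_ids, 2))


def bootstrap_mean_ci(values, n_boot=1000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for the mean, resampling dyad estimates."""
    rng = random.Random(seed)
    n = len(values)
    boot_means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


dyads = family_dyads(range(10))          # 10 trees sampled from one family
# Hypothetical pairwise 2*theta estimates, one per dyad
dyad_estimates = [0.20 + 0.004 * k for k in range(len(dyads))]

family_mean = mean(dyad_estimates)
ci_lower, ci_upper = bootstrap_mean_ci(dyad_estimates)
print(len(dyads), round(family_mean, 3), round(ci_lower, 3), round(ci_upper, 3))
```

Dyads that share a tree are not independent, so an interval obtained this way should be read as approximate.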

Other measures commonly used to describe mixed-mating systems are the proportion of outcrossed progeny, t, and its complement of selfed progeny, s. The relationship between s, f and 2θ for families resulting from mixed mating, where FS relationships are absent, is given by Bush et al. (2011a):

$$ {{\overline{{2\theta}}}_{ij}}=\frac{1}{4}\left( 1+\overline{f}+2s+{s^2}+2\overline{f}s+\overline{f}{s^2} \right) \tag{2} $$
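The right-hand side of Eq. 2 factorises as \( \frac{1}{4}(1+\overline{f})(1+s)^2 \), which makes it straightforward to solve for s when \( \overline{2\theta} \) and \( \overline{f} \) are known. The sketch below evaluates Eq. 2 and its inverse; the function names are assumptions, and the inverse is only meaningful where the conditions of Eq. 2 hold (no FS relationships) and the inputs are free of estimation bias.

```python
from math import sqrt


def family_relatedness(f_bar, s):
    """Eq. 2: expected family-average 2*theta for selfing rate s."""
    return 0.25 * (1 + f_bar + 2 * s + s ** 2 + 2 * f_bar * s + f_bar * s ** 2)


def selfing_rate(two_theta_bar, f_bar):
    """Invert Eq. 2 using its factorised form (1 + f)(1 + s)^2 / 4."""
    return sqrt(4.0 * two_theta_bar / (1.0 + f_bar)) - 1.0


# Limiting cases with f_bar = 0
half_sib_value = family_relatedness(0.0, 0.0)   # complete outcrossing
selfed_value = family_relatedness(0.0, 1.0)     # complete selfing

# Round trip with arbitrary illustrative inputs
s_recovered = selfing_rate(family_relatedness(0.1, 0.3), 0.1)
print(half_sib_value, selfed_value, round(s_recovered, 6))
```

With \( \overline{f} = 0 \), the limiting cases give 1/4 for s = 0 and 1 for s = 1, in agreement with the HS and selfed values of 2θ quoted for Eq. 1. Values of s outside the range 0 to 1 returned by the inverse would indicate that the assumptions are not met.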

Relationship between fitness and inbreeding

The relationship between family-average relatedness estimates (\( {{\overline{{2\theta}}}_{\mathrm{f}}} \) and \( {{\overline{f}}_{\mathrm{f}}} \)) and fitness-related traits (tree height, DBH and survival) was explored by multiple linear regression (with and without subpopulation and subpopulation group terms) implemented in GenStat 14 (VSN International, Hemel Hempstead, UK).
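A much-reduced version of this analysis, a simple least-squares regression of a family-mean trait on family-average inbreeding within a single subpopulation group, can be sketched as follows. The study itself used multiple linear regression in GenStat; the function name and the data here are hypothetical toy values.

```python
def simple_linear_regression(x, y):
    """Ordinary least-squares fit of y on x; returns slope, intercept, R^2."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("x and y must have equal length of at least 3")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return slope, intercept, 1.0 - ss_res / ss_tot


# Toy family means: inbreeding coefficient and DBH (cm), illustrative only
family_f = [0.0, 0.1, 0.2, 0.3, 0.4]
family_dbh = [20.0, 19.2, 17.9, 17.1, 16.0]

slope, intercept, r_squared = simple_linear_regression(family_f, family_dbh)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R^2 = {r_squared:.3f}")
```

A negative slope in such a fit would be consistent with inbreeding depression of the trait, whereas a slope near zero would indicate no detectable response.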

Results

SNP marker properties

Genomic location of SNPs

BLAST searches were used to position SNPs within 10 of the 11 E. grandis genome large chromosome scaffolds (Myburg et al. 2011) (Online Resource 2). SNPs were located on the basis of highly homologous (typically >90 %) alignment of SNP primer sequence (∼200–300 bp). SNPs, including several of those from the novel genes, were further annotated as 5′ and 3′ untranslated regions, promoter regions and intron and exon regions of E. grandis genes (Table 2). Of all SNPs, only four were non-synonymous (i.e. result in amino acid changes).

Table 2 Classification of SNP markers used to characterise the E. cladocalyx breeding population at Lismore, Victoria

Average spacing of SNPs within genes was 818 bp. Using the method of Excoffier et al. (2009), one non-synonymous SNP from the EcxCBS gene was identified as an outlier. A plot of subpopulation differentiation (F ST) versus heterozygosity is included in Online Resource 3.

Figure 1 shows the distributions of MAF of the marker sets in each of the subpopulation groups compared with those of a simulated biallelic marker set with frequencies drawn from a Dirichlet distribution. The cultivated and SFR groups of subpopulations had similar MAF (median MAF = 0.3 and 0.32, respectively). Most notably divergent from the Dirichlet allele frequency distribution was the KI group of subpopulations, which had a markedly lower MAF distribution.

Fig. 1

Box and whisker diagram showing MAF from a simulated Dirichlet distribution, the South Flinders Ranges (SFR), Kangaroo Island (KI) and cultivated groups of E. cladocalyx subpopulations. Whiskers indicate the full range of the data, boxes indicate the inter-quartile range and bars within boxes indicate the median

SNP polymorphism was variable among subpopulations. For the KI group of subpopulations, only one marker from the EcxHPA gene was monomorphic in all subpopulations, though the percentage of polymorphism (%P) ranged between 60 % for the small Cygnet River subpopulation and 99 % for the larger Flinders Chase 2 population. SNPs from the EcxMYB83 and EcxNOD genes, which had an MAF <0.03 in the KI subpopulations, were monomorphic in all SFR subpopulations, though generally the SFR %P was high (95–97 %). All markers were polymorphic in the cultivated-stand group of subpopulations taken as a whole, though the aforementioned SNPs from EcxMYB83 and EcxNOD were present only in the Kersbrook SPA families. All except one of the Kersbrook families are of SFR maternal origin, but KI pollen parents were present in this mixed-provenance stand.

Linkage disequilibrium

LD decayed rapidly (Fig. 2). No evidence of LD (\( R^2 \) > 0.3) between inter-gene (for gene-within-chromosome scaffold) SNP pairs was found involving SNPs with MAF >0.05. On average, inter-gene LD (\( R^2 \)) was 0.01. Where pairs of loci were near-monomorphic, as is the case in some of the subpopulations, some inter-gene SNP estimates were 0 < \( R^2 \) < 0.3, though this is caused by correlation of numerous pairs of monomorphic individuals and is probably not indicative of LD. Patterns of LD decay differed among the subpopulations. High levels of LD (\( R^2 \) values of 1) were more frequent, and LD persisted for longer distances in the KI subpopulations, whereas LD decayed more rapidly in the SFR and cultivated subpopulation groups. However, by 2,500 bp, LD had decayed significantly (\( R^2 \) < 0.1) in all subpopulations.

Fig. 2

Intra-gene LD (\( R^2 \)), from SNP pairs within the 39 genes, for wild (Cygnet River) and cultivated (Kersbrook) E. cladocalyx subpopulations. Logarithmic trend lines indicate rapid decay of LD

Fig. 3

UPGMA phenogram for the E. cladocalyx breeding population at Lismore, Victoria based on \( {{\hat{F}}_{\mathrm{ST}}} \) uncorrected genetic distances (Takezaki et al. 2010); 10,000 bootstraps over loci were performed and the percentage of replications in which a branch bifurcated is given at nodal bases

Measures of heterozygosity and diversity

Heterozygosity

Heterozygosity (\( {{\widehat{H}}_{\mathrm{E}}} \)) averaged over subpopulations was 0.28 (standard error [SE], 0.01) (Table 1), though in the KI population it was 0.19, while in the SFR population and planted-stand groups \( {{\widehat{H}}_{\mathrm{E}}} \) averaged 0.37 and 0.35, respectively (Table 1). Observed heterozygosity (\( {{\widehat{H}}_{\mathrm{O}}} \)) was much lower than expected in trees from wild subpopulations, especially KI (only 52 % of loci met HWE, p < 0.05), but was very similar to expected in the cultivated stands (92 % of loci met HWE).

Partitioning of genetic diversity

Strong differentiation among subpopulations was indicated, with \( {{\widehat{F}}_{\mathrm{ST}}} \) (taken as an average across individual-locus AMOVAs) of 13.50 % among the three groups of subpopulations (SFR, KI and the planted stands) and 5.34 % among subpopulations within the groups. Significant inbreeding (p < 0.001) was indicated for all wild subpopulations, with \( {{\widehat{F}}_{\mathrm{IS}}} \) averaging 0.26 in KI and 0.07 in the SFR subpopulations. \( {{\widehat{F}}_{\mathrm{IS}}} \) was not significantly different from zero for any of the planted-stand subpopulations.

The UPGMA phenogram (Fig. 3) shows wild subpopulations segregating strongly into two regions of provenance (KI and SFR), and exact tests of population differentiation showed this division to be significant (p < 0.001). Most cultivated stands clustered together with Wirrabara, while Wilmington, South Wilmington and Mount Remarkable formed another group with Mount Burr (cultivated) being an outlier to the other SFR populations.

Relatedness

Bias and precision of estimates of relatedness and inbreeding—simulation study

Simulations based on sets of 45 and 500 dyads showed that there are consistent biases in estimated relatedness (2θ) and inbreeding (f), with very similar results for simulations based on allele frequencies from the empirical dataset (Table 3) and Dirichlet-distributed allele frequencies (full results not shown, though compare with the 75-loci entry in Table 4). The magnitude and direction of bias for each relationship type are similar, decreasing only slightly as dyad count increases. The magnitude of bias in 2θ decreased as relatedness increased and was positive for non-inbred relationships and negative for inbred ones. The results and their general stability across small and large sets of dyads indicate that correction of bias in estimation from empirical data would be possible and desirable; for example, estimates of 2θ should be reduced by around 0.07 and 0.06 for unrelated and HS dyads, respectively.

Table 3 Bias estimates from triadic estimation of relatedness using 75 simulated SNP loci with allele frequencies drawn from the E. cladocalyx empirical dataset
Table 4 From 500 simulated dyads (allele frequency distribution drawn from Dirichlet distribution): (1) individual-dyad bias of 2θ and single-tree bias of f (deviation from expected values, E[2θ] and E[f]) and 95 % CI obtained from 100 bootstraps (over varying numbers of loci) for different relationships and (2) bias and CI of family-average relatedness (\( {{\overline{{2\theta}}}_{\mathrm{f}}} \)) and inbreeding coefficient (\( {{\overline{f}}_{\mathrm{f}}} \)) obtained from bootstraps over a family of 10 siblings (45 dyads)

Dirichlet frequency-based simulations indicate that the magnitude of bias and CI for 2θ would decrease significantly if a considerably greater number of biallelic markers (500–1,000) were employed (Table 4). Indications are that 75 SNP markers give only modest pairwise precision of estimation of 2θ, especially for unrelated or relatively distantly related dyads (95 % CI of 0.01–0.31 for unrelated dyads and 0.11–0.57 for HS dyads). With 500 loci, the 95 % CI narrow substantially to 0.00–0.10 and 0.18–0.37 for unrelated and half‐sib (HS) relationships, respectively. Averaging across a number of dyads will also give satisfactory estimates from 75 loci. For example, with a 10-individual, 45-dyad HS family (the average size in the breeding population sample), \( {{\overline{{2\theta}}}_{\mathrm{f}}} \) is 0.29 (95 % CI 0.26–0.33), which is close to expectation once a bias of ∼0.04 is subtracted from the estimate.

Relatedness within subpopulations

Trees from KI were closely related within families, with high proportions of selfing at the last mating and/or ancestral inbreeding required to explain subpopulation average \( {{\overline{{2\theta}}}_{\mathrm{p}}} \) values above 0.5 and up to 0.7, well above those expected for FS. Only four KI families had estimated \( {{\overline{{2\theta}}}_{\mathrm{f}}} < 0.4 \) (see Online Resource 4). Trees within families of the SFR population were also quite closely related on average, with \( {{\overline{{2\theta}}}_{\mathrm{p}}} \) of 0.4; however, the range of values was greater, with several families appearing to be predominantly HS (\( {{\overline{{2\theta}}}_{\mathrm{f}}} < 0.3 \), allowing for bias). Estimates of inbreeding (\( {{\overline{f}}_{\mathrm{p}}} \)) ranged between 0.14 and 0.28 for the wild populations and between 0.06 and 0.11 (close to zero once likely bias is accounted for) in the planted stands. This represents subpopulation average outcrossing rates from t = 0.5 to t = 1, assuming selfing is the main form of inbreeding. For several wild families, \( {{\overline{{2\theta}}}_{\mathrm{f}}} \) estimates approached 1 (and, therefore, \( {{\overline{f}}_{\mathrm{f}}} \) approached 0.5 and t approached 0), indicating nearly complete selfing. Three individuals from KI (two from the same family) were homozygous at all 75 loci, indicating several generations of ancestral selfing. For most families from cultivated stands, \( {{\overline{{2\theta}}}_{\mathrm{f}}} < 0.3 \), which, allowing for bias, is close to HS expectation, and only one family (from Kersbrook) showed signs of high levels of inbreeding with \( {{\overline{{2\theta}}}_{\mathrm{f}}}=0.6 \) and \( {{\overline{f}}_{\mathrm{f}}}=0.28 \). A whole-population positive, linear relationship between f and 2θ was found, with \( R^2 \) = 0.93.

The average co-ancestry estimate for dyads within subpopulations that are expected to be unrelated (i.e. between trees from different families) ranged between 0.06 and 0.17 and was, on average, 0.12, which is slightly above zero after allowing for a bias of ≈0.07. This class of inter-relatedness can be attributed to relatedness between progeny of some specific families rather than generally elevated relatedness between all dyads within a subpopulation. This was particularly strong in the Flinders Chase 2 subpopulation, where average inter-family relatedness (\( {{\overline{{2\theta}}}_{\mathrm{f}\text{-}\mathrm{f}}} \)) was 0.17, with some pairs of families having \( {{\overline{{2\theta}}}_{\mathrm{f}\text{-}\mathrm{f}}} > 0.5 \), indicating pollen flow between mothers, a common pollinator of the two mothers, relatedness of mothers and/or elevated ancestral inbreeding. Some evidence of relatedness among progeny of particular pairs of families was found within all of the wild provenances of E. cladocalyx and in at least one pair of families per subpopulation in the planted stands, except for Majorca. Inter-family relatedness was uniformly elevated (\( {{\overline{{2\theta}}}_{\mathrm{f}\text{-}\mathrm{f}}}=0.15 \)) in the Kersbrook SPA subpopulation. Average 2θ between dyads in different subpopulations was 0.08: this value is very close to zero once bias has been accounted for.

Relatedness, inbreeding and fitness

Differences in response of the fitness-related trait, DBH at age 7.8 years, to family-average relatedness (\( {{\overline{{2\theta}}}_{\mathrm{f}}} \)) and inbreeding (\( \overline{f} \)) were observed among subpopulation groups (Fig. 4). Multiple linear regression analyses including subpopulation group and \( \overline{f} \) then \( {{\overline{{2\theta}}}_{\mathrm{f}}} \) explained 25 and 31 % of the variance in DBH at 7.8 years, respectively, with the interaction between the SFR group and the inbreeding and relatedness parameters being the only significant terms in the models (p < 0.001) (Table 5). The significant negative relationship was similarly observed between inbreeding and the DBH and height traits in the SFR group of subpopulations, but not others, at the age of 5 years. Wild families from KI subpopulations, many of which were highly related and inbred (Table 1), showed no significant regression response in height and DBH traits to either elevated \( {{\overline{{2\theta}}}_{\mathrm{f}}} \) or \( \overline{f} \). The cultivated subpopulations had a narrow range of \( {{\overline{{2\theta}}}_{\mathrm{f}}} \) and \( \overline{f} \) was generally low. Though weakly significant relationships between survival and \( {{\overline{{2\theta}}}_{\mathrm{f}}} \) and \( \overline{f} \) (p = 0.04 and p = 0.05, respectively) were observed in the SFR subpopulation group, removal of a single family with high moment resulted in non-significant relationships. Survival was generally very high in all of the progeny trials at measurement age (trial populations averaged 88–92 %).

Table 5 Summary of regression analysis of family‐mean relatedness \( ({{\overline{{2\theta}}}_{\mathrm{f}}}) \) and inbreeding coefficient \( ({{\bar{f}}_{\mathrm{f}}}) \) against phenotypic trait means (DBH, height and percent survival at various ages) with percent variance accounted for by regression and probability that the fitted line is different to zero for groups of subpopulations in E. cladocalyx (where significant, the regression coefficients were all negative)
Fig. 4

Relationship between family-average relatedness (\( {{\overline{{2\theta}}}_{\mathrm{f}}} \)), inbreeding (\( \bar{f} \)) and fitness trait, DBH, at age 7.8 years. Separate regression lines are fitted for three groups of subpopulations: SFR (circles, significant at p < 0.001, \( R^2 \) = 0.38 and 0.48 for \( {{\overline{{2\theta}}}_{\mathrm{f}}} \) and \( \bar{f} \), respectively), KI (crosses, not significant) and cultivated stands (triangles, not significant)

Discussion

SNP marker-based assays of the breeding population yielded a variety of information that can be used to manage the breeding program. The panel of 75 loci proved suitable for making allele frequency-based population-level estimates and also provided information on short-distance LD decay. In accordance with other eucalypts (Grattapaglia 2007; Grattapaglia and Kirst 2008; Thavamanikumar et al. 2011; Thumma et al. 2005), LD was found to decay rapidly, though it was more extensive in the KI subpopulations, a result consistent with the elevated levels of inbreeding observed in this subpopulation group. Only one SNP, from the EcxCBS gene, was identified as a candidate for selection on the basis of its heterozygosity. This may reflect selection at this locus, though it may also result from allelic differentiation caused by founder or bottleneck effects. Dillon et al. (2013) identified 8 % of SNPs as likely to be under selection in a broadly similar set of Pinus radiata genes. We found that E. cladocalyx, from the subgenus Symphyomyrtus and monotypic section Sejunctae, appears to be highly homologous with E. grandis (subgenus Symphyomyrtus section Latoangulatae), with successful location of all SNPs within the currently mapped extent of the E. grandis genome.

The breeding population was strongly structured, with \( {{\widehat{F}}_{\mathrm{ST}}} \) totalling 18 % among subpopulations, the majority of which could be apportioned to the two groups of subpopulations corresponding to the wild SFR and KI regions identified by McDonald et al. (2003) using allozymes and by Steane et al. (2011) using Diversity Arrays Technology (DArT) markers. The cultivated stands clustered with the SFR subpopulations and were similarly diverse, indicative of a broad, original genetic base in the SFR, though they were less inbred, on average, probably reflecting the lack of neighbourhood structure (i.e. fewer closely situated near-relatives) in planted stands. Strong segregation in quantitative trait variation of growth, form and wood property traits corresponding to the wild groups of subpopulations has previously been identified (Bush et al. 2011b), with the phenotypically selected families from cultivated stands performing particularly well (Callister et al. 2008). Further selections from these cultivated sources could be made as infusions in future generations. Our estimate of \( {{{H}}_{\mathrm{E}}} \) averaged 0.28, higher than the isozyme-based estimate of 0.15 obtained by McDonald et al. (2003) from a young seedling population of E. cladocalyx. However, the patterns of diversity among subpopulations in this study were very similar to those found by McDonald et al. (2003), with strong regional differentiation of subpopulations and SFR having considerably greater diversity than KI, which supports their finding of SFR origins for the Wail cultivated stand. Wild subpopulations, particularly those from KI, showed strong signs of inbreeding, as evidenced by low MAF, a marked deficit of heterozygotes, low genetic diversity (H E and %P) and greater persistence of LD.

Estimates of relatedness between pairs of individual trees would be beneficial in first-generation breeding populations, which are typically derived from sources where only the mother’s identity is known; even the original subpopulation identity can be obscure in the case of land-race selections. Pairwise estimates can be used to construct genomic relationship matrices, which substitute for pedigree-based additive relationship matrices where pedigree does not exist (as is the case in testing and domestication of wild selections), to identify pedigree errors or to better calculate variance–covariance in accounting for IBD (Van Raden 2007). Our simulation study indicated that, with a panel of 75 SNP markers, bias in estimation of 2θ and f was of minor practical consequence and could in any case be consistently identified and corrected for. However, single-dyad estimates are not sufficiently precise to use reliably in a relationship matrix, and improving their precision would require more markers: 500–1,000 SNPs would be needed. Averaging across groups of dyads, for example families (∼10 individuals or more) or subpopulation groups, can nevertheless provide useful, cost-effective estimates. The cost-effectiveness of family-level estimates arises not only because fewer markers are used but also because far fewer individuals are genotyped: the family-level estimates can be applied to numerous family members across sites. For first-generation breeding populations, it would then be possible to construct a marker-based genomic relationship matrix, using a relatively modest genotyping effort, which will allow quantitative analysis to proceed while accounting for heterogeneity of family and subpopulation relatedness.
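As an illustration of how family-level estimates might be applied, the following minimal sketch averages noisy pairwise estimates within families and then fills a block-structured relationship matrix. The function names, the example values and the treatment of trees from different families as unrelated are all assumptions made for illustration. The diagonal uses the standard self-relationship 1 + f, here with the family-mean f applied to every member.

```python
from collections import defaultdict

def family_mean_relatedness(dyad_estimates):
    """Average pairwise 2*theta estimates within each family.

    dyad_estimates: iterable of (family_label, estimate) for sib pairs.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for fam, est in dyad_estimates:
        sums[fam] += est
        counts[fam] += 1
    return {fam: sums[fam] / counts[fam] for fam in sums}

def family_relationship_matrix(family_ids, rel_by_family, f_by_family):
    """Build a block-structured additive relationship matrix.

    family_ids    : family label of each tree, in matrix order
    rel_by_family : family-average relatedness (2*theta) per family
    f_by_family   : family-average inbreeding coefficient per family
    Trees in different families are treated as unrelated (an assumption).
    """
    n = len(family_ids)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        fam_i = family_ids[i]
        for j in range(n):
            if i == j:
                A[i][j] = 1.0 + f_by_family[fam_i]  # self-relationship
            elif fam_i == family_ids[j]:
                A[i][j] = rel_by_family[fam_i]      # within-family sibs
    return A

# Hypothetical values: one half-sib-like family, one related and inbred family.
dyads = [("fam1", 0.20), ("fam1", 0.30), ("fam2", 0.45), ("fam2", 0.55)]
rel = family_mean_relatedness(dyads)
A = family_relationship_matrix(
    ["fam1", "fam1", "fam2", "fam2"], rel, {"fam1": 0.0, "fam2": 0.2}
)
```

Only a sample of each family needs to be genotyped to obtain the family means; the same values are then reused for every member of that family in the matrix.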

The E. cladocalyx breeding population demonstrated a large degree of heterogeneity in relatedness among families, a result presaged by the study of McDonald et al. (2003) on germinant seedlings using eight isozyme loci. Our study indicates that f ranges between 0 and 0.5 and, accordingly, family outcrossing (t) ranges between 0 and 1. Elevated f might be largely attributed to selfing in the parental generation; however, in some families, evidence of very high levels of inbreeding was detected, with three individuals from KI, two of which were from the same family, found to be homozygous at all loci, indicating several generations of selfed matings or other ancestral inbreeding leading to allele fixation. Though the SNP loci used in this study are widely distributed throughout the genome, the individuals are being examined, as a follow-up, at a greater number of loci to determine the extent of fixation: it is possible that heterogeneous variable regions of the genome still exist. The strong correlation between 2θ and f indicates that much of the higher relatedness in E. cladocalyx is attributable to inbreeding. However, in several families, high 2θ is not accompanied by elevated f, and this is indicative of a significant proportion of FS relationships, a result not indicated by the isozyme study of McDonald et al. (2003).
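The correspondence between the stated ranges of f and t can be made explicit with a small sketch under simplifying assumptions that are ours rather than the study's estimator: the mother is non-inbred, pollen donors are unrelated to her, selfed progeny therefore have f = 0.5 and outcrossed progeny have f = 0. The function names are hypothetical.

```python
def expected_family_inbreeding(t):
    """Expected mean inbreeding of a family with outcrossing rate t.

    Assumes a non-inbred mother and unrelated pollen donors, so selfed
    progeny have f = 0.5 and outcrossed progeny have f = 0.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return (1.0 - t) * 0.5

def outcrossing_from_inbreeding(f_bar):
    """Invert the relation above: t = 1 - 2 * f_bar (same assumptions)."""
    if not 0.0 <= f_bar <= 0.5:
        raise ValueError("f_bar must lie in [0, 0.5] under these assumptions")
    return 1.0 - 2.0 * f_bar
```

Under these assumptions a family mean of f = 0 corresponds to complete outcrossing and f = 0.5 to complete selfing. Inbreeding beyond 0.5 in an individual cannot arise from a single generation of selfing of a non-inbred parent, which is why complete homozygosity points to ancestral inbreeding.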

Contrasting patterns of relationship between the fitness-related trait (DBH) and measures of relatedness and inbreeding were observed among the breeding populations. Among the cultivated stands, no significant relationship between inbreeding and either DBH or survival was indicated. Relatedness was generally quite close to HS on average, with low levels of inbreeding. The wild subpopulation groups (SFR and KI) had a wide range of family-average coefficients of relatedness and inbreeding. SFR family-average growth declined significantly in response to increased relatedness and inbreeding. Controlled pollination experiments have demonstrated a similar marked decline in fitness associated with selfing in a number of eucalypt species including E. grandis (Hodgson 1976), E. nitens (Hardner and Tibbits 1998), E. globulus (Hardner and Potts 1995) and E. regnans (Griffin and Cotterill 1988). In E. regnans and E. globulus, this has been accompanied by quite rapid mortality in progeny trials. The relationship between inbreeding and survival is only weakly significant for the SFR subpopulation group in the progeny trial at Lismore, though in the long run, suppression of the smaller, inbred families by the more vigorous ones would probably result in increased mortality. Costa e Silva et al. (2011) have shown that ID in E. globulus survival is differentially expressed across environments, and it may be that selection against inbred E. cladocalyx families is stronger in the wild and on more demanding plantation sites. However, KI families, which were on average more inbred than SFR, showed no significant signs of ID of survival or growth traits at this site. This is an unusual finding in eucalypts, though Costa e Silva et al. (2010a) have recently shown that ID is variable among families and subpopulations of E. globulus. This result may indicate a lower genetic load of deleterious recessive alleles in the KI subpopulations, where stands are generally small and isolated by natural or anthropogenic causes, or both. Mimura et al. (2009) have similarly found that levels of outcrossing are lower in small fragmented stands of E. globulus than in stands occurring as more extensive forest. It is possible that the KI group of subpopulations, which are markedly less genetically diverse than those of the SFR, may have gone through a genetic bottleneck which has resulted in purging of recessive alleles. McDonald et al. (2003) postulated that this may have resulted from changes in climate and sea levels during the Pleistocene, during which time the species retreated to small refugia on KI and more extensive open forests in the uplands of the SFR.

An often-implicit assumption in the analysis of first-generation eucalypt populations is that levels of inbreeding and relatedness among populations and families are homogeneous. This assumption appears to be critically flawed for E. cladocalyx, where inbreeding levels varied from low to very high within and among subpopulations. Taking the traditional approach of using a single coefficient of relationship to scale variance parameters, a value of \( \widehat{\rho}={1 \left/ {2.8 } \right.} \) is indicated. However, the respective values of \( \widehat{\rho} \) for the SFR, KI and cultivated subpopulations would be approximately 1/3, 1/2 and 1/4. Quantitative analysis could be improved by using the marker-based estimates of the coefficient of co-ancestry to adjust the additive relationship matrix at the family level, a relatively complex step not normally carried out for first-generation domestication breeding parameter estimation. This procedure was carried out for a small selection of E. cladocalyx wild families by Bush et al. (2011a), who demonstrated that some re-ranking of breeding values may result.
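The practical effect of the choice of coefficient of relationship can be seen in a short sketch in which the additive variance is obtained by dividing the among-family variance component by ρ. The variance components below are hypothetical, chosen only to show the direction and size of the change; the function names are likewise illustrative.

```python
def additive_variance(family_variance, rho):
    """Additive variance from the among-family component: sigma2_f / rho."""
    if rho <= 0.0:
        raise ValueError("rho must be positive")
    return family_variance / rho

def narrow_sense_h2(family_variance, phenotypic_variance, rho):
    """Narrow-sense heritability for a given coefficient of relationship."""
    return additive_variance(family_variance, rho) / phenotypic_variance

# Hypothetical variance components (illustration only).
sigma2_family = 1.0
sigma2_phenotypic = 10.0

h2_by_rho = {
    label: narrow_sense_h2(sigma2_family, sigma2_phenotypic, rho)
    for label, rho in [
        ("true half-sib, 1/4", 1 / 4),
        ("overall, 1/2.8", 1 / 2.8),
        ("SFR, 1/3", 1 / 3),
        ("KI, 1/2", 1 / 2),
    ]
}
```

With the same among-family variance, the heritability estimate halves when ρ moves from 1/4 to 1/2, so applying one value across groups with different relatedness misstates the additive variance for at least some of them.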

While correction of the additive relationship matrix will result in a more realistic additive variance model, the problem of non-additive variation, particularly dominance, is not addressed. The effects of dominance in this breeding population may be considerable, given the clearly different responses to inbreeding among subpopulations. The relative magnitude of the additive and non-additive genetic effects will impact on selection efficiency in the early phase of domestication. Substantial non-additive genetic variation associated with inbreeding was found in E. globulus (Costa e Silva et al. 2010a), and because of the persistence of highly inbred individuals in the E. cladocalyx breeding population, this is likely to be an important consideration. Inbreeding can be theoretically expected to give rise to greater among-family and within-family variance relative to outcrossed populations (Kelly 2004), and Costa e Silva et al. (2010b) have shown that this can amount to very large within-family and among-family differences in experimentally selfed and outcrossed populations of E. globulus. Heritability estimates for growth traits from three E. cladocalyx sites in Western Australia, which include a subset of the families described here, were unusually high (0.41–0.85) (Callister et al. 2008) despite application of \( \widehat{\rho}={1 \left/ {2.5 } \right.} \), a value that, as it applies to additive variation, is too high, on average, given the large proportion of material from cultivated stands in the Western Australian breeding population plantings. The result may well indicate significant underlying non-additive components of variation that have led to inflated variance associated with the inbred families.

Concluding remarks

The study has shown that molecular markers can be used to characterise eucalypt breeding population structure, clarifying genetic origins of land-race selections, revealing subpopulation structure and providing information on relatedness within and between families. We have shown that the breeding population is strongly structured and that the levels of inbreeding in some wild subpopulations are unusually high, while selections from cultivated stands are generally outcrossed. Modification of the usual assumption of homogeneity of relatedness within and between first-generation families, based on the marker data, may, therefore, improve quantitative analysis. We also observed differential responses to inbreeding among the subpopulations, with indications of ID in the SFR group of subpopulations. Accounting for the effects of non-additive variation on within-family and between-family variance is likely to be another important consideration for this species.