Introduction

Eucalypts are long-lived, evergreen trees belonging to the angiosperm family Myrtaceae that occurs predominantly in the southern hemisphere (Ladiges et al. 2003). Plantation forestry species of Eucalyptus are well known for their fast growth, straight form, valuable wood properties, wide adaptability to soils and climates, and ease of management through coppicing (Potts 2004). They are now planted in more than 90 countries where the various species are grown for industrial use in cellulose pulp production, energy supply in the form of charcoal for steel manufacture, sawn timber, essential oils, as well as for firewood, shade, and shelter (Myburg et al. 2007). Besides their role in plantation forestry, eucalypts are dominant or co-dominant trees in almost all vegetation types where they occur and are considered keystone species for ecological studies in their natural ranges (Doughty 2000).

Eucalyptus subgenus Symphyomyrtus, the most speciose of the genus with over 300 species, includes the majority of the 20 or so commercially planted species. In temperate regions, Eucalyptus globulus has been the premier choice for plantation forestry, providing fast growth and the best combination of wood properties for pulp and paper production. In the tropics, on the other hand, production forestry of Eucalyptus is currently based on a combination of interspecific hybrid breeding and clonal propagation, with Eucalyptus grandis as the pivotal species. Traits such as fast growth, wide adaptability, disease resistance, and tailored wood properties for specific end products are coalesced into elite clones which are in turn propagated for large-scale plantations (Grattapaglia and Kirst 2008).

The hypervariability and simple inheritance of microsatellites provide a powerful system for the unique identification of individuals for fingerprinting purposes, parentage testing, germplasm characterization, and population genetic studies. The individual identification of elite clones is currently a widespread application of molecular markers in tree breeding and production forestry. Quality control and quality assurance of large-scale clonal plantation operations become a crucial aspect in forestry, especially in vertically integrated production systems where the pulp mill plans on the availability of wood from specific clones with specific wood properties at specific times. Correct clonal identity also has important implications in several breeding procedures such as seed orchard management or controlled pollination programs affecting the expected gains of breeding cycles (Grattapaglia and Kirst 2008). Additionally, microsatellite markers coupled to Bayesian model-based clustering procedures (Pritchard et al. 2000) have been used in animals (Koskinen 2003; Kumar et al. 2003; Tadano et al. 2008) and increasingly in plants to assign individuals to species/populations or to estimate the most likely ancestral composition of admixed individuals, especially when the phenotypic differentiation between the species/populations in question is difficult and pedigrees are unavailable or ambiguous (Honjo et al. 2008; Millar et al. 2008; Muir and Schlotterer 2005; Sampson and Byrne 2008; Sarri et al. 2006).

The power of microsatellites for individual identification and population genetic studies in Eucalyptus has been demonstrated in a number of reports (Chaix et al. 2003; Grattapaglia et al. 2004b; Kirst et al. 2005; Ottewell et al. 2005). However, all the microsatellites currently available for clonal identification and population analysis are derived from genomic sequences containing di- or trinucleotide repeats (Brondani et al. 2006, 1998; Glaubitz et al. 2001; Ottewell et al. 2005; Steane et al. 2001). These markers, while providing powerful discrimination, do not provide the high-precision genotyping needed for comparative multilocus profiling across laboratories or even at different times on the same equipment. This is due to the small base pair differences among alleles and to the well-known phenomenon of stuttering during PCR, which renders allele calling challenging, especially in dinucleotide repeat microsatellites (Litt et al. 1993). In human forensic DNA, a consensus was reached several years ago that for individual identification, tetranucleotide repeat markers should be used as the gold standard (Bar et al. 1995; Gill et al. 1994). Currently, only tetra- and pentanucleotide repeat microsatellites are acceptable for routine human forensic casework (Holt et al. 2002; Krenke et al. 2002).

In vertebrates in general, motifs of length equal to or greater than that of tetranucleotides are relatively frequent and have been commonly used for marker development (Sharma et al. 2007). In plants, while tetranucleotide repeats have been observed at relatively high frequency both in mono- and dicotyledonous genomes (Morgante et al. 2002), only very few recent studies have reported the development of markers based on longer simple sequence repeats from expressed sequence tags (EST; Feng et al. 2009; Yi et al. 2006). In Eucalyptus, descriptive studies of existing EST databases (Ceresini et al. 2005; Rabello et al. 2005; Yasodha et al. 2008) confirmed the abundance of microsatellites seen previously in genomic library screening (Brondani et al. 2006). To date, however, no targeted marker development has been undertaken from these EST resources. In this study, we report the development and characterization of a set of 21 polymorphic microsatellite markers based on tetra-, penta-, and hexanucleotide repeats derived from a large collection of ESTs of Eucalyptus. Four of the most widely planted species worldwide that also represent contrasting phylogenetic sections within Symphyomyrtus were used to develop this set of microsatellite markers, evaluate their interspecific transferability, and assess their genetic information content for population analyses, individual fingerprinting, and assignment tests.

Materials and methods

EST database mining and primer design

Tetra-, penta-, and hexanucleotide repeat microsatellites were mined in a database that had approximately 88,000 phred-20 filtered 5′-sequenced ESTs generated during a sequencing effort in the Genolyptus project (Grattapaglia et al. 2004a). EST sequences from leaf and developing xylem RNA were mostly from E. grandis, although approximately 30% was derived from three other species: Eucalyptus urophylla, Eucalyptus pellita, and Eucalyptus globulus. With an optimized microsatellite pipeline based on the software MREPS (Kolpakov et al. 2003), simple sequence repeats were identified under the following parameters: two to six base SSR motifs, perfect structure, i.e., no microvariant interruptions, and a minimum core of three repeated units. The microsatellite markers were derived from the alignment of a variable number of ESTs from the four species using E. grandis as the reference sequence. Primer pairs flanking these microsatellites were designed targeting expected PCR products between 80 and 450 bp.
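The mining criteria above (two to six base motifs, perfect structure, a minimum core of three repeated units) can be illustrated with a minimal sketch. This is not the MREPS-based pipeline used in the study; it is a regular-expression stand-in, and the function name find_perfect_ssrs and its parameters are hypothetical.

```python
import re

def find_perfect_ssrs(seq, min_motif=2, max_motif=6, min_repeats=3):
    """Return (start, motif, repeat_count) for perfect tandem repeats in seq.

    Illustrative stand-in only: the study used an MREPS-based pipeline.
    """
    seq = seq.upper()
    hits = []
    for k in range(min_motif, max_motif + 1):
        # A k-base motif followed by at least (min_repeats - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit (e.g. ATAT).
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return hits
```

Calling find_perfect_ssrs(est_sequence, min_motif=4) would restrict the search to tetranucleotide or higher order motifs, which is the class of repeats targeted for marker development in this work.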

Microsatellite marker selection and preliminary screening

Only tetranucleotide or higher order motifs were targeted for marker development. No selection was practiced regarding the potential location of the microsatellite in the expressed sequence, base composition of the motif, or BLAST hit identity. Besides being a tetranucleotide or higher repeat motif, priority was given to longer microsatellites, i.e., with a larger number of repeated units based on the assumption that these would likely be more polymorphic at the population level. Screening of primer pairs for simple amplification, polymorphism, and interspecific transferability was carried out in a panel of six unrelated trees involving five different species of Eucalyptus, four of them as pure species (Eucalyptus urophylla, E. grandis, E. globulus, Eucalyptus camaldulensis) plus Eucalyptus dunnii in one of the two hybrid combinations (E. dunnii × E. grandis and E. urophylla × E. globulus). Regular primers at small scale were synthesized (AlphaDNA, Montreal, CA, USA) and used for PCR amplification with a common touchdown PCR thermal profile: a hot start for 5 min at 96°C; 10 cycles of 94°C for 1 min, 64°C for 1 min, and 72°C for 2 min; 20 cycles of 94°C for 1 min, 56°C for 1 min, and 72°C for 2 min; and a final elongation step at 72°C for 7 min. The same reaction composition was used as described earlier (Brondani et al. 2006). High-resolution agarose (3.5%) gel electrophoresis and ethidium bromide detection were used for PCR product visualization. Microsatellite markers were classified as transferable when amplification was observed in all five species and tentatively polymorphic when at least one difference in product size was observed among the individuals in the screening panel.

Plant material

A population sample of 16 unrelated trees of each of the four target species, E. grandis, E. urophylla (section Latoangulatae), E. globulus (section Maidenaria), and E. camaldulensis (section Exsertaria), was used for microsatellite characterization and to establish preliminary reference species data sets of allele frequencies for the assignment tests. A sample size n = 16, i.e., 32 alleles, provides a coefficient of variation of the mean squared error of the expected heterozygosity below 10% (Kirst et al. 2005), adequate for the purpose of this characterization analysis. Within each species, eight trees from each of two different provenances were sampled: Atherton (17°15′ S, 145°28′ E) and Coffs Harbor (30°18′ S, 153°07′ E) for E. grandis, Jeeralang (38°24′ S, 146°28′ E) and Flinders Island (40°00′ S, 148°07′ E) for E. globulus, Flores Island (8°39′ S, 122°15′ E) and Timor Island (9°37′ S, 124°10′ E) for E. urophylla, and Walsh River (17°17′ S, 144°88′ E) and Kennedy River (15°43′ S, 144°17′ E) for E. camaldulensis. These have been some of the most widely employed provenances in breeding programs in Brazil and thus most relevant for the evaluation of the assignment tests. A set of 24 elite public clones commercially planted in Brazil was used as a test set to evaluate the power of the microsatellite markers to assign individuals to their most likely source species. Four of the 24 clones had a documented hybrid origin as they were produced by controlled crosses of E. camaldulensis × (E. urophylla × E. globulus) and thus served as control cases for admixed individuals. The remaining 20 clones were of unknown origin, although anecdotal reports suggest a hybrid origin for several of them involving mainly E. grandis and E. urophylla.

Microsatellite genotyping

DNA extractions from expanded leaves of the target trees and microsatellite genotyping by fluorescence detection were carried out as described earlier (Missiaggia et al. 2005), with some modifications in the PCR protocol. PCR reactions in multiplexed systems were carried out in 10 μl volumes containing 1 μl of 10× Qiagen Multiplex PCR Buffer (Qiagen Inc., Valencia, CA, USA), equal concentration (0.1 μM) of all primers for all microsatellite markers co-amplified, and 2.0 ng of genomic DNA. The recommended Qiagen Multiplex PCR Handbook cycling protocol was used with an annealing temperature of 60°C and 30 PCR cycles. PCRs were carried out in hexaplex or heptaplex systems combining markers in such a way that loci whose alleles migrate in the same size range were labeled with different fluorochromes, either 6-FAM (blue), NED (yellow), or HEX or VIC (green). To assist in the design of the multiplexed genotyping systems, primer pairs for all selected microsatellites were screened for potential cross-reactivity (i.e., primer dimer and hairpin structures) using the web-based version of the software AutoDimer (Vallone and Butler 2004). Default parameters were used, and primer pairs that displayed primer dimer structures with score value ≥7 (i.e., number of matches minus number of mismatches) were avoided when choosing loci to be co-amplified. An aliquot of 1 μl of PCR product was mixed with 1 μl of freshly prepared ROX-labeled size standard (Brondani and Grattapaglia 2001) and 10 μl of Hi-Di formamide (Applied Biosystems, Foster City, CA, USA). The mixture was electroinjected in an ABI 3100 genetic analyzer and data collected under dye set D spectral calibration using Genescan and analyzed with Genotyper (Applied Biosystems).
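The primer dimer score described above (number of matches minus number of mismatches) can be sketched as follows. This is a simplified, ungapped sliding comparison written for illustration and is not the AutoDimer implementation; the function name dimer_score is hypothetical, and the score threshold is left to the caller.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def dimer_score(primer_a, primer_b):
    """Best ungapped antiparallel overlap score between two primers (5'->3').

    Score = complementary matches minus mismatches over the overlapping bases.
    Simplified sketch, not the AutoDimer algorithm itself.
    """
    a = primer_a.upper()
    b_rev = primer_b.upper()[::-1]  # read primer_b 3'->5' so it pairs antiparallel with a
    best = 0
    # Slide b_rev along a, scoring every possible overlap.
    for offset in range(-(len(b_rev) - 1), len(a)):
        score = 0
        for i, base_b in enumerate(b_rev):
            j = offset + i
            if 0 <= j < len(a):
                score += 1 if COMPLEMENT[a[j]] == base_b else -1
        best = max(best, score)
    return best
```

In a screening loop, every pair of primers intended for the same multiplex would be scored, and pairs reaching the chosen threshold would be assigned to different PCR reactions.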

Microsatellite characterization

The following parameters of genetic information content were estimated for each microsatellite marker and species separately: (1) number of alleles (A); (2) allele size range; (3) observed (H o) and expected (H e) heterozygosity and a p value of an exact test for Hardy–Weinberg equilibrium; (4) polymorphism information content (PIC; Botstein et al. 1980); (5) probability of identity (PI) that corresponds to the probability of two random individuals displaying the same genotype; and (6) paternity exclusion probability (PE) that corresponds to the power with which the locus excludes an erroneously selected individual tree as being the parent of an offspring. This last parameter was estimated taking into account frequent situations when using microsatellites for paternity analysis in forest trees: (PE_1) paternity exclusion probability for one candidate parent given the genotype of a known parent, a common situation when paternity is investigated in open-pollinated progeny individuals with maternal control, and (PE_2) paternity exclusion probability for a candidate parent pair, a common situation when paternity and maternity need to be checked in progeny individuals derived from controlled crosses, i.e., with maternal and paternal control. The software Cervus (Kalinowski et al. 2007) was used to estimate A, H o, H e, PIC, PI, and both versions of PE, and Powermarker (Liu and Muse 2005) was used to carry out an exact test for Hardy–Weinberg equilibrium for each microsatellite marker. Considering that Eucalyptus species are known for operating largely under a mixed mating model (Burczyk et al. 2002; Gaiotto et al. 1997), the frequency of null alleles at the 21 loci in the four species was estimated using an individual inbreeding model with the software INEST (Chybicki and Burczyk 2009). 
To account for missing data due to PCR failure, this analysis also provided a probability estimate (β) for absence of alleles due to random amplification failure as opposed to null allele homozygosity. The combined multilocus paternity exclusion probabilities and the probability of identity were also estimated for different combinations of multiplexed systems of microsatellites for genotyping applications.
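As an illustration of how three of these single-locus parameters relate to allele frequencies, the sketch below uses the standard textbook formulas: H e = 1 − Σp_i², PIC = H e − Σ_{i<j} 2p_i²p_j² (Botstein et al. 1980), and PI = Σp_i⁴ + Σ_{i<j}(2p_i p_j)² for unrelated individuals. It is not the Cervus implementation, which may apply sample size corrections, and the function name locus_stats is hypothetical.

```python
from itertools import combinations

def locus_stats(freqs):
    """Return (He, PIC, PI) for one locus from allele frequencies summing to 1.

    Uncorrected textbook formulas; software such as Cervus may apply
    sample size corrections that are not reproduced here.
    """
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    he = 1.0 - sum(p * p for p in freqs)
    pic = he - sum(2 * (p * q) ** 2 for p, q in combinations(freqs, 2))
    pi = sum(p ** 4 for p in freqs) + sum((2 * p * q) ** 2 for p, q in combinations(freqs, 2))
    return he, pic, pi
```

For a locus with two equally frequent alleles, the function returns H e = 0.5, PIC = 0.375, and PI = 0.375, which shows why loci with few alleles contribute limited discrimination power on their own.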

Evaluation of microsatellites for genetic distance and population structure analysis

Multilocus genotypes for the 21 microsatellites were used to estimate pairwise individual-level genetic distances among the 88 individuals (64 pure species individuals, 4 known hybrids, and 20 suspect hybrid elite clones) to specifically assess the effective discrimination ability for fingerprinting purposes. For the co-dominant data, a shared allele distance (D SA) was calculated based on the infinite allele model. D SA is estimated by 1 − P SA, where P SA is the proportion of shared alleles averaged across loci (Bowcock et al. 1994). Distance matrices (1,000 bootstrap replicates) were calculated using MICROSAT (Minch et al. 1995). The matrix of genetic distances was then used to graphically represent distance relationships between the 88 individuals with an unweighted pair group method with arithmetic mean (UPGMA) consensus tree (majority rule, strict) constructed using the NTSYS 2.0 package (Exeter Software, USA). Based on the genotype data at the 21 microsatellite loci, the 64 individual trees of the reference species were assigned probabilistically to a given number of populations inferred with a Bayesian approach without any prior population information using STRUCTURE 2.1 (Pritchard et al. 2000). The tests were done based on an admixture model where the allelic frequencies were correlated, applying a burn-in period of 50,000 iterations followed by 100,000 iterations for data collection. The analysis was run with K ranging from two to eight inferred clusters (four species and two provenances per species) performed with five independent runs each. The model choice criterion to detect the most probable value of K was ΔK (Evanno et al. 2005). Average results of ten runs at the most likely K were entered into DISTRUCT (Rosenberg 2004) to provide a graphic display of population structure. Pairwise estimates of population differentiation (F st) between the four Eucalyptus species were also obtained using the software Arlequin (Excoffier et al. 2005).
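The shared allele distance can be sketched for a pair of diploid individuals as follows. This is a minimal illustration of D SA = 1 − P SA and not the MICROSAT implementation; the function name, the dictionary layout of the genotypes, and the handling of missing data (loci skipped when either genotype is absent) are assumptions.

```python
def shared_allele_distance(ind_a, ind_b):
    """D_SA = 1 - P_SA for two diploid multilocus genotypes.

    Each genotype maps a locus name to a pair of allele sizes, or to None
    when the locus is missing (assumed convention for this sketch).
    """
    proportions = []
    for locus, geno_a in ind_a.items():
        geno_b = ind_b.get(locus)
        if geno_a is None or geno_b is None:
            continue  # skip loci not genotyped in both individuals
        remaining = list(geno_b)
        shared = 0
        for allele in geno_a:
            if allele in remaining:
                remaining.remove(allele)  # each allele copy can be matched once
                shared += 1
        proportions.append(shared / 2)
    if not proportions:
        raise ValueError("no loci genotyped in both individuals")
    return 1.0 - sum(proportions) / len(proportions)
```

Two ramets of the same clone would give a distance of 0, whereas two trees with no alleles in common at any locus would give a distance of 1.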

Evaluation of microsatellites for assignment tests

Assignment of the individuals of the test set, i.e., the 24 elite clones, to the clusters created based on the reference species sets was carried out with STRUCTURE 2.1 (Pritchard et al. 2000) using both the reference and test sets combined. Assignments were tested using prior population information for individuals from the reference data set and an admixture model to allow for more flexibility to deal with the complexities of these populations. The number of clusters (K) was set to the most likely value determined in the previous structure analysis. STRUCTURE 2.1 was thus used to estimate the posterior probability that each test individual (elite clone) belongs to a given cluster corresponding to each one of the Eucalyptus species under consideration. Average results of the posterior probabilities of ten independent runs at the most probable K were used to estimate the most likely hybrid composition of each individual elite clone. These values were entered into DISTRUCT (Rosenberg 2004) to provide a graphic display of the ancestral composition of each elite clone.

Results and discussion

Microsatellite development

The data mining and microsatellite pipeline used for microsatellite marker development revealed 1,261 potential markers that met the specified constraints and for which primer pairs could be designed. Details of that study will be the subject of a separate publication. Out of the set of 1,261 potentially useful markers, the number of microsatellites that displayed at least three repeated units as the core microsatellite were 83, 51, and 116, respectively, for tetra-, penta-, and hexanucleotide repeats. This distribution reflects the higher frequency of tetra- and hexanucleotide repeats seen in genic regions when compared to pentanucleotides in Eucalyptus (Rabello et al. 2005), in line with previous reports in both mono- and dicotyledonous plants (Morgante et al. 2002; Zhang et al. 2004). Preliminary marker screening for amplification success and polymorphism detection was carried out for 50 tetra-, 18 penta-, and 24 hexanucleotide repeat microsatellites that displayed the largest number of tandemly repeated units in silico. From the preliminary screening, 36 primer pairs (19 tetranucleotide, 5 pentanucleotide, and 12 hexanucleotide) showed robust amplification and indication of polymorphism. These were selected for high-resolution fluorescence-based screening. Except for one locus that was removed from any further screening steps, all those that were deemed polymorphic based on the low-resolution agarose gel screening did in fact display more than one allele when tested in high-resolution electrophoresis. For the purpose of this study, where the objective was to select a set of polymorphic and transferable microsatellites across the four target species, a relatively stringent threshold was set that markers had to be polymorphic in at least three of the four species, i.e., display at least two alleles in a limited sample of 16 trees per species. 
This constraint was met by 11 tetranucleotide-based microsatellites, three pentanucleotides, and seven hexanucleotides for which full information is presented including the motif, the expected amplicon size, forward and reverse primer pairs, Genbank accession number of the original sequence from which the microsatellite primer pairs were designed, and the database of sequence tagged sites (dbSTS) Id (Table 1). BLASTx functional annotation returned highest hits to Ricinus communis (nine loci), followed by Vitis vinifera (five loci), Populus trichocarpa (five loci), and one each to Arabidopsis thaliana and to Carica papaya. Most genes where these microsatellites are contained have not yet been functionally characterized (data not shown). All these microsatellites are located in the 5′-untranslated (UTR) region of the gene. The general abundance of microsatellites in the 5′-UTR of plant genes has been described (Morgante et al. 2002). Observations of gradients of microsatellite density along the direction of transcription in rice and Arabidopsis ESTs (Fujimori et al. 2003) and in the rice genome when introns were scrutinized (Parida et al. 2009) suggest that some genic non-coding microsatellites might take part in regulating gene expression. These observations were reported almost exclusively for di- and trinucleotide repeat microsatellites, not higher order repeats and for a relatively small proportion of genes. It is therefore unlikely that the microsatellites developed in our study would be under selective pressure to a point that would jeopardize the premise of neutrality for population genetics studies.

Table 1 Basic properties of the 21 microsatellite markers developed and evaluated in this study including the identification number in the dbSTS

Microsatellite characterization

The 21 microsatellites spanned a wide range of allele sizes (Table 2), which later proved very useful to design multiplexed sets of markers that allow an optimized and higher throughput genotyping. The size range of the alleles for most loci matched the expected size of the in silico predicted amplicon. However, for loci EMBRA943 and EMBRA1374, the observed size was significantly larger than the expected one, strongly suggesting amplification across intronic sequences. The allele size range did not vary much across the four species, an important aspect to design more generalized multiplex genotyping panels (Table 3). The number of alleles varied across loci, with three monomorphic microsatellites in E. urophylla (EMBRA1456, EMBRA1463, and EMBRA1945) and a maximum of ten alleles observed for EMBRA813 and EMBRA1364 in E. camaldulensis. The average number of alleles over all species and markers was 4.43, slightly higher for E. camaldulensis (5.10), although not significantly different among the four species (F = 1.408, p = 0.246). Expected heterozygosities (H e) in all four species were nominally larger than the observed heterozygosity (H o) for several loci. A goodness-of-fit test for Hardy–Weinberg equilibrium (HWE) revealed that only eight (three in E. grandis, two in E. globulus, two in E. urophylla, and one in E. camaldulensis) of the 84 tests deviated significantly from expectations at α < 0.01 (Table 2). However, more markers could show deviations as these tests have low power due to the relatively limited sample size. Marker EMBRA1945 did not fit HWE expectations for E. grandis and E. globulus, in both cases with a significant deficiency of heterozygotes, and was monomorphic in E. urophylla, suggesting the occurrence of null alleles.

Table 2 Descriptive statistics of the 21 microsatellites for the four Eucalyptus species
Table 3 Proposed multiplex systems for high-throughput genotyping with the developed microsatellites

The frequencies of null alleles for all 21 loci in all four species were estimated under the individual inbreeding model (IIM; Table 2). This model provides a useful approximation for species with a mixed mating system as it allows for an accurate estimate of null allele frequency regardless of the sample size, the number of loci, or the actual inbreeding coefficient (Chybicki and Burczyk 2009). It should be noted, however, that under this model, an estimate of null allele frequency should only be considered significantly different from zero when the locus deviates from HWE expectations and not purely based on its absolute estimated value. We found that an estimated frequency of null alleles between 0.1 and 0.3 was observed in those few cases (8 in 84 locus × species combinations) where a significant deviation from HWE (p < 0.01) was observed due to an excess of homozygotes. Furthermore, the probability (β) of random amplification failure for these few cases that deviated from HWE was always lower than 0.6%, indicating that the data are not consistent with random amplification failure but rather with the occurrence of true null alleles. No consistent pattern of deviation from HWE and frequency of null alleles was seen across species. In other words, no specific locus can be pointed out as being more prone to the occurrence of null alleles in all four species. Rather, it seems to follow a specific locus by species interaction. These results taken together indicate that the occurrence of null alleles is relatively rare and not a significant issue for these microsatellites selected for higher transferability. In the few particular cases where null allele frequencies can be considered significantly different from zero due to HWE deviation, the corresponding primers could be redesigned to attempt alternative flanking priming sequences. However, this might prove challenging due to the very high nucleotide diversity, around 1 SNP every 30 bp for E. globulus and up to 1 SNP every 16 bp for E. camaldulensis, as recently described in a range-wide re-sequencing survey of 23 genes (Kulheim et al. 2009).

Overall, this set of microsatellites has a lower information content when compared to dinucleotide repeat markers that typically display on average ten alleles per locus and heterozygosities in the range of 0.70–0.80 (Brondani et al. 2006; Ottewell et al. 2005). A lower variability of the tetra-, penta-, and hexanucleotide repeats was expected as the rate of mutation for longer simple sequence repeats has been reported to be lower in general for animals and plants (Chakraborty et al. 1997; Vigouroux et al. 2002). In spite of the lower number of alleles, the allele frequency distribution was such that good discrimination power both for parentage testing and individual identification could be reached for the majority of the microsatellites in all four species (Table 3). Average PIC, paternity exclusion probabilities (PE_1 and PE_2), and PI for the set of loci were not substantially different among the four species, and these parameters were within the same range for several loci in all four species. However, for some markers, these parameters differed among species, reflecting the difference in number of alleles and/or their frequency distributions. An example was EMBRA915, highly informative in E. camaldulensis but not so in E. globulus. The impact of allele frequency distribution could be visibly recognized in locus EMBRA1851 where, in spite of the higher number of alleles in E. globulus (A = 8) when compared to E. grandis (A = 4), the allele frequency distribution was such that the genetic information content of this locus was very similar in both species. Again, as expected, these tetra-, penta-, and hexanucleotide microsatellites are evidently less powerful when it comes to parentage and individual identification when compared to dinucleotide repeat microsatellites. However, their clear advantage arises when it comes to the precision of the allele calling, a key aspect for several applications in genetic analysis. 
Still some markers such as EMBRA813 (tetra), EMBRA1364 (tetra), EMBRA1374 (hexa), and EMBRA1851 (tetra) displayed very comparable genetic information content to the average dinucleotides (Kirst et al. 2005), with probability of paternity exclusion around 0.6–0.7 and probability of identity around 0.1–0.2. These could be pointed out as the most informative microsatellites across all four species. This result is important as it indicates that further screening of larger sets of tetra- and hexanucleotide repeat microsatellites is warranted and could lead to the discovery of several other markers with similar behavior. With the upcoming availability of the full genome sequence for E. grandis, it will be possible to develop a larger set of such higher order repeat markers that consolidate high information content and high-precision genotyping quality.

Microsatellite multiplexed systems for high-throughput genotyping

Besides evaluating the information content of these new markers, we were interested in providing a practical toolkit to apply them in routine genotyping in Eucalyptus. The relatively small number of alleles per locus turned out to be an advantage when it came to the ability of multiplexing loci in single electrophoretic runs. Narrower allele size ranges allow fitting more markers in the same fluorescence detection spectrum. Out of the 21 loci, 18 were selected with the best compromise of genetic information content across the four species. Loci EMBRA1307, EMBRA1456, and EMBRA1463 were left out as they had the lowest PIC values when all species together were considered and two of them were monomorphic in E. urophylla. Marker EMBRA1945, although monomorphic in E. urophylla, was kept for the 18-locus multiplex version as it is informative for the other three species. Given that different laboratories are used to different fluorescence dye sets, two multiplex options were designed, one based on a four-dye set and a second one on a five-dye set. The larger five-dye system with 18 loci (18-plex) is simply an extension of the first one, with 14 loci (14-plex), by the addition of four markers labeled with a fifth fluorochrome. The proposed multiplex designs provide flexibility to use only some or all of the markers and whatever fluorochrome combination is desired. The designs respect the compatibility of the allele size range, leaving usually between 20- and 30-bp difference between loci labeled with the same dye. Although new, rarer alleles will likely be detected as the number of individuals genotyped increases, such a difference should provide sufficient buffering capacity to accommodate several new alleles without overlapping. No significant allele dropout issues were seen for this set of markers by using the high-quality PCR reagents specifically designed for multiplexed amplification. 
It should be noted that the maximum size range was 422–452 bp for EMBRA943 (Table 2), which is well within the amplifiable range. Furthermore, even if some larger alleles emerge as more individuals are genotyped, it is unlikely that their size will be much larger than two or three further repeat units due to the slower evolution of such longer repeats.
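The spacing rule used in the multiplex designs (a buffer of roughly 20 to 30 bp between loci labeled with the same dye) can be checked programmatically. The sketch below is a hypothetical helper, not part of the published protocol; the function name, the panel layout, and the default gap of 20 bp are assumptions.

```python
from itertools import combinations

def spacing_conflicts(loci, min_gap=20):
    """List same-dye locus pairs whose allele size ranges are closer than min_gap bp.

    loci maps a locus name to (dye, smallest_allele_bp, largest_allele_bp).
    A negative gap means the two size ranges overlap.
    """
    conflicts = []
    for (n1, (dye1, lo1, hi1)), (n2, (dye2, lo2, hi2)) in combinations(loci.items(), 2):
        if dye1 != dye2:
            continue  # different dyes can share a size range
        gap = max(lo1, lo2) - min(hi1, hi2)
        if gap < min_gap:
            conflicts.append((n1, n2, gap))
    return conflicts
```

An empty result indicates that every pair of same-dye loci in the panel respects the chosen buffer; re-running the check as new, rarer alleles widen the observed size ranges would show when a panel needs to be rearranged.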

These two different multiplex systems have a very high combined power of paternity exclusion, above 99.9% in both parentage testing situations, with a higher power of the 18-plex over the 14-plex reaching >99.999%. They also provide a low combined probability of genetic identity between unrelated trees, below 10⁻¹⁰ in all species for the 14-plex and below 10⁻¹² for the 18-plex (Table 3). Using the protocol described in this work, we successfully amplified all loci in two separate PCR reactions in the case of the 14-plex and three PCRs in the 18-plex. When evaluating which microsatellites could be suitably co-amplified in the same PCR reaction, the following microsatellite combinations were avoided due to primer–dimer interactions with score values (number of matches minus number of mismatches) at or above the threshold value of 7 recommended for designing multiplex PCR reactions (Vallone and Butler 2004): EMBRA2014 with EMBRA813 (score = 8); EMBRA1977 with EMBRA813 (score = 7); EMBRA954 with EMBRA915 (score = 11); EMBRA943 with EMBRA915 (score = 9); and EMBRA943 with EMBRA925 (score = 7). No hairpin structures were detected in the analysis. All loci in each multiplex were analyzed in a single electroinjection, providing a very high data throughput analogous to the one routinely used in human DNA profiling (Krenke et al. 2002). It should be noted, however, that due to the potential incidence of indels immediately flanking these microsatellites, likely to occur in the highly diverse genome of Eucalyptus, difficulties in the multiplexing approach may arise as more individuals, populations, and species are genotyped.
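The combined multilocus values follow from the single-locus estimates under the usual assumption that loci are independent: the combined exclusion probability is 1 − Π(1 − PE_i) and the combined probability of identity is ΠPI_i. The sketch below illustrates this arithmetic; the function names are hypothetical and it does not reproduce the Cervus calculations.

```python
import math

def combined_exclusion(pe_values):
    """Combined exclusion probability over independent loci: 1 - prod(1 - PE_i)."""
    return 1.0 - math.prod(1.0 - pe for pe in pe_values)

def combined_identity(pi_values):
    """Combined probability of identity over independent loci: prod(PI_i)."""
    return math.prod(pi_values)
```

Because the per-locus terms multiply, a panel of many moderately informative loci can reach a very low combined probability of identity even when no single locus is highly discriminating, which is the rationale for multiplexing 14 or 18 such markers.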

Microsatellite performance for individual discrimination

A preliminary evaluation of the power of these microsatellites for genetic distance analyses showed that this set of microsatellites provides high resolution for individual discrimination (Fig. 1). Estimates of average shared allele distances (D_SA) among individuals within E. grandis (D_SA = 0.478), E. globulus (D_SA = 0.474), E. urophylla (D_SA = 0.502), and E. camaldulensis (D_SA = 0.536) were in the same range, although significantly different (F = 12.99; p = 3.59 × 10⁻⁸), with significantly larger values for E. camaldulensis when compared to the next highest estimate of D_SA (E. urophylla; t = 3.24, p = 0.00068), reflecting the higher information content of this set of loci in this species. When all 64 individuals of the species samples were analyzed jointly, the average distance increased to D_SA = 0.694 and ranged from 0.214 to 0.972. The average distance among the 20 elite clones of unknown origin was D_SA = 0.513 and ranged from 0.214 to 0.792. When plotted in a phenogram, all individuals could be clearly discriminated. Individuals belonging to each species clustered together in clearly separate branches as expected, while the elite clones formed a separate cluster positioned closer to E. urophylla. This distance-based analysis yielded essentially the same clusters as the population-based STRUCTURE analysis (see below). The basal nodes separating the major species clusters were, however, not supported by significant bootstrap values. Only a few bootstrap values >50 were obtained and only for nodes between subgroups inside the major species clusters or between individuals at the end of the branches (data not shown). This is in agreement with the observation that microsatellite data become less informative for phylogenetic analyses among distantly related taxa (Bowcock et al. 1994).
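As an illustration of the distance measure used above, the following minimal Python sketch computes a shared allele distance between two diploid multilocus genotypes as one minus the proportion of shared alleles (Bowcock et al. 1994). The function name, the dictionary layout, and the allele sizes are hypothetical and are not taken from the data of this study.

```python
from collections import Counter


def shared_allele_distance(ind1: dict, ind2: dict) -> float:
    """Shared allele distance between two diploid individuals.

    Each individual maps a locus name to a pair of allele sizes (bp).
    Loci missing in either individual are skipped.
    """
    shared = 0
    compared = 0
    for locus, genotype1 in ind1.items():
        genotype2 = ind2.get(locus)
        if genotype2 is None:
            continue
        # Multiset intersection: 0, 1, or 2 alleles shared at this locus.
        shared += sum((Counter(genotype1) & Counter(genotype2)).values())
        compared += 2
    if compared == 0:
        raise ValueError("no loci in common")
    return 1 - shared / compared


# Hypothetical genotypes at two loci (allele sizes in bp).
tree_a = {"locus1": (100, 104), "locus2": (200, 200)}
tree_b = {"locus1": (100, 108), "locus2": (200, 204)}
print(shared_allele_distance(tree_a, tree_b))
```

A matrix of such pairwise distances is the usual input for a UPGMA phenogram. The number of allelic differences between two genotypes is simply the number of alleles compared minus the number shared.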

Fig. 1

UPGMA phenogram based on pairwise estimates of shared allele distance among all 88 individual Eucalyptus trees, 64 of them from eight provenances of four species (CAM-KR E. camaldulensis Kennedy River, CAM-WR E. camaldulensis Walsh River, URO-FI E. urophylla Flores Island, URO-TI E. urophylla Timor Island, GLO-JR E. globulus Jeeralang, GLO-FI E. globulus Flinders Island, GRA-AT E. grandis Atherton, GRA-CH E. grandis Coffs Harbor) and 24 elite clones identified by their usual codification. Species and elite clone clusters are indicated, corresponding to the population structure analysis (see Fig. 2)

Provenances did not form discrete clusters, with the exception of E. grandis where individuals from Atherton and Coffs Harbor formed two separate clusters. All the elite clones but clone Sem3 formed a separate cluster closer to the E. urophylla branch. Clone Sem3 clustered directly with the E. urophylla individuals, suggesting a possible pure species origin, while the other 19 clones more likely have a hybrid composition with a predominance of E. urophylla. The 20 elite clones displayed unique multilocus genotypes with a minimum of nine allelic differences and an average of 21.5 differences out of the 42 alleles compared (2 × 21 loci) in all pairwise comparisons. These microsatellites provide high-resolution and high-quality fingerprints useful for clonal protection or quality control procedures in breeding and deployment operations.

Microsatellite performance for population structure analysis

The multilocus data for the species reference sets were subjected to a population structure analysis to test for the optimal number of clusters under an admixture model. Consistent with the peak of Evanno’s ΔK at K = 4 (Electronic supplementary materials (ESM) Fig. S1), the genetic structure of the species reference data set was partitioned into four groups (Fig. 2a). Each cluster correctly included all the individuals of each species. The estimates of the average proportion of membership (Q) of the predefined reference populations to clusters corresponding to the species were above 0.975 for all species, showing the very robust resolution possible using Bayesian analysis with data from these microsatellites (ESM Table S1). Estimates of F_ST between species based on this set of microsatellites were also high and significant for all pairwise comparisons (ESM Table S2). By using prior population information regarding provenance origin and assuming K = 8 (four species versus two provenances per species), only the two provenances of E. grandis (Atherton and Coffs Harbor) could be separated (Fig. 2b). This result is consistent with the way that individuals clustered in the phenogram (Fig. 1) and indicates that this set of microsatellites detects very little genetic variation between the two sampled provenances of each species, possibly with the exception of E. grandis, although it is important to note that there is no statistical support for a structure analysis at K = 8 (ESM Fig. S1). Atherton and Coffs Harbor are the only two provenances separated at a large geographical scale, latitude-wise, by over 2,000 km. E. globulus provenances Jeeralang and Flinders Island are nearby populations at ∼220 km, albeit the first at the southern tip of the continental coast of Australia and the second an island population to the south. Our results for E. globulus are thus consistent with the results of a range-wide survey of population structure of E. globulus in which a strong association was found between genetic similarity and geographic proximity (Steane et al. 2006). Similarly, for E. urophylla and E. camaldulensis, the two provenances within each species could not be separated, a result which is consistent with previous range-wide surveys that showed very little variation among provenances in E. urophylla (Payn et al. 2008) and E. camaldulensis (Butcher et al. 2002).
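The ΔK criterion referred to above can be sketched in a few lines of Python. This version computes, for each K with neighbours K − 1 and K + 1, the absolute second-order difference of the mean log-likelihood across replicate runs divided by the standard deviation of the log-likelihood at K. Using the means of replicate runs is one common implementation choice and is an assumption of this sketch, as are the function name and the log-likelihood values, which are invented for illustration and are not results of this study.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k: dict) -> dict:
    """Delta K for each K that has both K-1 and K+1 available.

    lnp_by_k maps K to a list of log-likelihoods, one per replicate run
    (at least two runs per K are needed for the standard deviation).
    """
    means = {k: mean(runs) for k, runs in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in means or k + 1 not in means:
            continue
        sd = stdev(lnp_by_k[k])
        if sd == 0:
            continue  # Delta K is undefined when replicate runs are identical
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        result[k] = second_diff / sd
    return result


# Invented log-likelihoods for K = 2..6, two replicate runs each.
runs = {
    2: [-5600, -5604],
    3: [-5000, -5010],
    4: [-4500, -4502],
    5: [-4490, -4510],
    6: [-4480, -4520],
}
dk = delta_k(runs)
best_k = max(dk, key=dk.get)
print(best_k, dk)
```

With these invented values the largest ΔK falls at K = 4, mimicking the shape of a curve in which the likelihood gain levels off and the variance among runs increases beyond the best-supported K.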

Fig. 2

Population analysis and assignment tests using STRUCTURE 2.1. Each individual is represented as a vertical line partitioned into K colored segments whose length is proportional to the individual coefficients of membership in each of the K inferred clusters that represent the four species of Eucalyptus. a Reference species sets using an admixture model with α = 0.0288 for K = 4. b Reference species sets using prior population information and an admixture model with K = 8, i.e., four species versus two provenances per species, showing the separation of the two provenances for E. grandis but not of the other species. c Reference species sets and test individuals (elite clones) for K = 4 using prior population information for the species and an admixture model showing the assignment of the elite clones to the four species clusters. The last four clones to the right are known hybrids of E. camaldulensis × (E. urophylla × E. globulus) derived from controlled crosses

Microsatellite performance for individual assignment tests

The reference data set was used to assign the test individuals (elite clones) to one of the four genetic clusters and estimate their most likely ancestral species composition (Fig. 2c and ESM Table S3). Assignments were performed using prior species information for the reference set under an admixture model, i.e., assuming that these elite clones have mixed ancestry. So, for example, there is a 99.04% posterior probability that clone Sem1 has recent ancestry in E. urophylla and only 0.22% in E. globulus and 88.56% that clone BA6021 has ancestry in E. urophylla and 10.87% in E. grandis. All 20 elite clones showed a very strong predominance of E. urophylla ancestry that varied between 87.18% and 99.31% (ESM Table S3). The four controlled hybrids of E. camaldulensis × (E. urophylla × E. globulus) showed the anticipated hybrid composition with a predominance of E. camaldulensis genome, although the relative proportions did not match the expectations, especially considering the small proportion of E. globulus contribution, theoretically expected at 25% but observed at only 1.2% for clone C1UGL3 and up to 13.19% for clone C1UGL1 (ESM Table S3). This less than expected contribution of E. globulus in these controlled hybrids might be the result of a strong selection for adaptability to tropical environments that took place during the development of these elite clones, which resulted in the exclusion of the E. globulus genome. The predominance of E. urophylla in the group of 20 elite clones of unknown origin is somewhat expected, although not at such high levels. E. urophylla was introduced in Brazil in the early 1970s and extensively used in hybrid combinations with E. grandis to provide higher levels of resistance to Eucalyptus canker caused by Cryphonectria cubensis (Alfenas et al. 1983; Heerden and Wingfield 2002). These hybrid clones have been anecdotally considered to be F1 hybrids of E. grandis × E. urophylla. This assignment analysis, however, performed with a set of microsatellites that provide a robust separation between these two species (ESM Fig. S1), is not consistent with this hypothesis. Potential explanations for this result include repeated backcrosses to E. urophylla, both spontaneous and controlled, coupled to a strong preferential selection for E. urophylla genome and possibly some level of misclassification of seed sources or individual trees during the breeding procedures. These results, taken together, indicate that the estimated proportions of ancestral genomic composition should be viewed as indicative and not be taken at face value. Furthermore, although these higher order repeat microsatellites display alleles with wide frequency differentials among the species and thus provide a clear distinction among them, the development of larger numbers of ancestry informative markers is warranted. Following intensive screening efforts, selected SNPs with contrasting allele frequencies among the populations under study have been found and used for assignment tests in admixed human (Lins et al. 2010) and animal (Stephens et al. 2009) populations. The discovery of such SNPs in Eucalyptus will soon be possible by employing next-generation re-sequencing efforts of a few tens of individuals of each target species at a reasonable coverage and mapping these sequences on the forthcoming Eucalyptus reference genome (Grattapaglia and Kirst 2008). However, the much higher nucleotide diversity in Eucalyptus (Kulheim et al. 2009) when compared to humans and domestic animals may compromise the robustness of SNP genotyping assays. So, although in principle SNPs will be more powerful and automatable for assignment tests, the development of robustly assayable SNPs across Eucalyptus species will depend on screening several hundred SNPs whose flanking sequences are conserved enough across species to make the assay work consistently. This same approach could also be very valuable to develop microsatellites less prone to the occurrence of null alleles as they are used across species.
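The theoretical expectations against which the assignment results were compared follow from simple pedigree arithmetic: in the absence of selection, each parent contributes half of the offspring genome. A minimal Python sketch, in which the function name and the nested-tuple pedigree encoding are assumptions made for illustration, reproduces the 25% E. globulus expectation for the three-way hybrids and the 50:50 expectation for an F1 hybrid.

```python
def expected_ancestry(pedigree) -> dict:
    """Expected genomic proportion of each pure species in an individual.

    A pedigree is either a species name (pure individual) or a
    (parent1, parent2) tuple whose members are themselves pedigrees.
    Assumes equal parental contribution and no selection.
    """
    if isinstance(pedigree, str):
        return {pedigree: 1.0}
    proportions = {}
    for parent in pedigree:
        for species, fraction in expected_ancestry(parent).items():
            # Each parent passes on half of its own expected composition.
            proportions[species] = proportions.get(species, 0.0) + 0.5 * fraction
    return proportions


# Three-way hybrid: E. camaldulensis x (E. urophylla x E. globulus)
three_way = ("E. camaldulensis", ("E. urophylla", "E. globulus"))
print(expected_ancestry(three_way))

# F1 hybrid: E. grandis x E. urophylla
print(expected_ancestry(("E. grandis", "E. urophylla")))
```

Departures of the estimated membership coefficients from these values, such as those reported above, point to selection, additional backcrossing, or pedigree errors rather than to the arithmetic itself.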

Conclusions

In summary, we have exploited existing resources of Eucalyptus ESTs to develop a fully operational set of microsatellite markers based on higher order repeats, still rare for plants in general. Although they are less variable than the existing Eucalyptus dinucleotide- and trinucleotide-based microsatellites, they provide a significant practical advantage: allele calling is easier because of the larger size difference between alleles. Multiplexed systems with up to 18 microsatellites in two or three PCR reactions were proposed that supply very high resolving power in all four studied species. These systems will be particularly useful for clone fingerprinting and parentage testing purposes, applications that require consistent allele calling for comparative analysis across different points in time or laboratories. These markers were also shown to provide good resolution for individual identification, species distinction, and individual assignment tests for some of the main planted species worldwide. A comparison of the observed and expected results of the assignment tests indicates that the estimated proportions of ancestral composition of individuals should be viewed as reliable leads but not be taken as definitive genomic proportions. Due to their genic origin, the interspecific transferability and genetic information content of these microsatellites will likely extend to other phylogenetically close species within the same subgenus, further emphasizing their practical value for Eucalyptus genetics and breeding.