Introduction

Molecular genetic markers are variable regions of DNA that provide valuable tools for genetic linkage mapping, association studies, phylogeographic studies, and the estimation of several population genetic parameters, such as diversity, gene flow, and inbreeding (Bruford and Wayne 1993). To date, the molecular markers most widely applied to tree species have been isozymes, restriction fragment length polymorphisms (RFLPs), random amplified polymorphic DNA (RAPDs), amplified fragment length polymorphisms (AFLPs), and simple sequence repeats (SSRs). Each marker technique has attributes that offer different advantages (Ritland and Ritland 2000). Isozymes are well studied and established, but are not numerous or highly polymorphic. RFLPs utilize probes derived from either genomic or complementary DNA (cDNA) and are codominant markers, but require a large amount of high-quality DNA. RAPDs and AFLPs do not require any sequence knowledge of the genome, and so are easy to apply to uncharacterized genomes. However, they are usually dominant and are often difficult to transfer between different mapping populations or species. Additionally, RAPDs are notoriously difficult to transfer across laboratories (Jones et al. 1997).

SSR markers exhibit codominance and are usually highly polymorphic, and thus, seem to be the ideal marker (Ritland and Ritland 2000). However, their development requires a significant investment, and their cross-species transferability is normally quite limited due to either disappearance of the repeat region, or to degeneration of the primer binding sites. Traditionally, the first stage of SSR marker development is to probe a genomic library with a particular SSR oligonucleotide and sequence positive clones. However, the success rate of identifying robust SSR markers from genomic DNA is typically low due to the high proportion of primers that do not amplify genetically interpretable PCR fragments (Squirrell et al. 2003).

In conifers, SSR discovery from genomic libraries (van de Ven and McNicol 1996; Pfeiffer et al. 1997; Rajora et al. 2000; Hodgetts et al. 2001; Scotti et al. 2002a, b) has been a particularly difficult process, with very low success rate, probably because of the large, repetitive nature of their genomes (Pfeiffer et al. 1997; Bérubé et al. 2003). Despite these problems, SSRs remain the marker system of choice for a number of conifer mapping projects (Paglia and Morgante 1998).

Expressed sequence tags (ESTs) are sequenced portions of messenger RNA and offer an alternative route for SSR marker discovery, particularly for the repetitive genomes found in conifers. The advent of large-scale databases with tens of thousands of ESTs provides resources for the novel, “in silico” identification of genetic markers. In marker development, EST databases have largely been used for identification of single nucleotide polymorphisms (SNPs) (Rafalski 2002). However, SSRs are found in both the untranslated regions of ESTs and occasionally within coding regions (Cardle et al. 2000).

One advantage of these “EST-SSRs” is that they are directly associated with a coding gene, and so may be useful for association with phenotypic traits. Also, because EST sequences are evolutionarily conserved, cross-species PCR amplification of EST-SSRs is expected to be more successful than that of SSRs developed from genomic DNA (Arnold et al. 2002; Saha et al. 2003); however, their levels of variability may not be as great due to selective constraints. Finally, with their relatively high levels of variability, EST-SSRs seem especially appropriate for the detection of selective sweeps (Vigouroux et al. 2002).

Here, we utilize an EST database, developed as part of the Genome British Columbia Forestry Genomics project, to identify and characterize SSR markers for spruce. This database provides a valuable and unique resource for the development of new SSR markers within spruce and also for comparative analysis of genome structure and organization. We report 25 new EST–SSR markers of primary use with white, Sitka, and black spruce. We also evaluate 101 previously reported spruce SSRs (derived from genomic DNA libraries), evaluate the use of all SSRs across 23 spruce species, and arrive at a total set of 42 robust microsatellite markers for spruce.

Materials and methods

Library construction and DNA sequencing

Nine directional cDNA libraries were constructed from a range of tissues (xylem, phloem, bark, foliage, and roots) at different developmental stages of seedlings and mature trees, as well as from trees or seedlings exposed to chemical elicitors (methyl jasmonate), or mechanical wounding. Tissues were obtained from three different spruce species: white spruce (Picea glauca) cultivar PG29, Sitka spruce (Picea sitchensis) cultivar Gb2-229, and the interior spruce (P. glauca × Picea engelmannii) cultivar Fal-1028. cDNA libraries were constructed (5′ EcoRI, 3′ XhoI) using the pBluescript II XR cDNA Library Construction Kit, following manufacturer’s instructions with modifications (Stratagene). Select cDNA libraries were normalized according to the Soares method (Soares et al. 1994). A complete technical description of library construction methods will be reported elsewhere.

Library-stock plasmid DNA was transformed into electrocompetent DH10B T1-phage-resistant Escherichia coli cells (Invitrogen) and robotically arrayed into 384-well plates from which glycerol stocks were prepared. Plasmid DNA was extracted from overnight 96-well cultures and BigDye Terminator (ABI) cycle sequenced on an ABI Prism 3700 DNA Analyzer, using conventional procedures and the −21 M13 forward primer (5′-TGTAAAACGACGGCCAGT-3′) to obtain predominantly 3′ end sequences. DNA sequence chromatograms were processed using the PHRED software (Ewing and Green 1998; Ewing et al. 1998). Sequences were quality trimmed according to the high-quality contiguous region determined by PHRED and then vector trimmed using CROSS_MATCH software (http://www.phrap.org). Sequences with less than 70 quality bases after trimming were discarded.
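
The length filter in the last step can be illustrated with a minimal sketch. This is a simplified stand-in, not the trimming algorithm used by PHRED: it takes the longest contiguous run of bases at or above a quality score and applies the 70-base cutoff stated above. The score threshold of 20 and the function names are assumptions made for illustration only.

```python
MIN_QUALITY_BASES = 70   # cutoff stated in the text
QUALITY_THRESHOLD = 20   # assumed score cutoff; not given in the text


def best_quality_window(quals, threshold=QUALITY_THRESHOLD):
    """Return (start, end) of the longest run of scores >= threshold."""
    best = (0, 0)
    start = None
    # A sentinel below any threshold closes a run that reaches the read end.
    for i, q in enumerate(list(quals) + [-1]):
        if q >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


def keep_read(quals):
    """Keep a read only if its best window has at least 70 quality bases."""
    start, end = best_quality_window(quals)
    return end - start >= MIN_QUALITY_BASES
```

A read whose best window is shorter than 70 bases would be discarded before assembly.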

EST database and SSR search

The EST database used for this search consisted of 34,846 EST sequences, which were quality clipped using PHRED and our own in-house software “EST Clean.” This step also removed poly-A tail sequence from the ESTs. The clipped sequences were aligned to generate 20,275 unigenes, using the CAP-3 software package (Huang and Madan 1999). We developed an EST–SSR discovery software package (BuildSSR, available at http://www.genetics.forestry.ubc.ca/ritland/programs.html) to search for SSRs within this unigene set. This included database organization, repeat-finding software, and tools for SSR-distribution analysis. This SSR-discovery pipeline identifies SSRs in the unigene set, constructs a summary table, and then builds a FASTA-format database that includes the repeat type, size, and position. A minimum perfect-repeat number of nine dinucleotide repeats, six trinucleotide repeats, and four tetranucleotide repeats was used for the search. ESTs containing SSRs were then annotated using BLAST software.
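
The repeat thresholds above can be expressed as a short search routine. The following is a minimal sketch using regular expressions, not the BuildSSR implementation; the function name and the returned tuple layout are assumptions made for illustration.

```python
import re

# Minimum perfect-repeat counts from the text: motif length -> repeats.
MIN_REPEATS = {2: 9, 3: 6, 4: 4}


def find_ssrs(seq):
    """Return (motif, repeat_count, start, end) for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for motif_len, min_rep in MIN_REPEATS.items():
        # The group captures one motif; \1{n,} demands n or more further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are copies of a shorter unit (e.g. AA, ATAT).
            if any(
                motif == motif[:d] * (motif_len // d)
                for d in range(1, motif_len)
                if motif_len % d == 0
            ):
                continue
            count = len(m.group(0)) // motif_len
            hits.append((motif, count, m.start(), m.end()))
    return sorted(hits, key=lambda h: h[2])
```

Positions are zero-based and half-open. Because the scan reports the leftmost motif phase, a repeat may be returned as a rotation of its conventional name (for example GAA rather than AAG), so motif classes would need to be normalized before counting.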

Primer design and PCR conditions

Primers spanning 44 EST-SSRs were designed. These repeats were detected in sequences from the interior, white, and Sitka spruce libraries. Primers were designed using the Primer 3 software (Rozen and Skaletsky 2000). Regions 50 bp from each end of the repeat were excluded from primer site consideration, and all primers were designed to have similar annealing temperatures to allow for uniform PCR cycling conditions. The forward primer was tailed with an M13 sequence (Oetting et al. 1995) to facilitate visualization of PCR products on a LiCor 4200 (LiCor, Lincoln, Neb., USA).
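
The 50-bp exclusion rule and the M13 tailing can be sketched as follows. This is an illustrative helper, not part of Primer 3; the function names are hypothetical, coordinates are assumed to be zero-based and half-open, and the tail sequence is the one quoted in the Table 1 caption.

```python
M13_TAIL = "CACGACGTTGTAAAACGAC"  # tail sequence quoted in the Table 1 caption
EXCLUSION_BP = 50                 # flank excluded on each side of the repeat


def primer_windows(seq_len, ssr_start, ssr_end, exclusion=EXCLUSION_BP):
    """Return the (left, right) intervals where primers may be placed."""
    left_end = max(0, ssr_start - exclusion)
    right_start = min(seq_len, ssr_end + exclusion)
    return (0, left_end), (right_start, seq_len)


def tail_forward_primer(primer):
    """Prepend the M13 tail to the 5' end of a forward primer."""
    return M13_TAIL + primer.upper()
```

An empty or very short interval on either side indicates that the repeat lies too close to the end of the sequence for a primer pair to be designed.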

The 101 previously described spruce SSR primers derived from the genomic DNA approach were also synthesized (Pfeiffer et al. 1997; Rajora et al. 2000; Scotti et al. 2000; Hodgetts et al. 2001; Scotti et al. 2002a, b; C. Newton, personal communication), with the forward primer tailed with an M13 sequence as above.

PCR was performed with 25 ng genomic DNA, 0.2 μM of each forward and reverse primer, 0.05 μM M13 IRD labeled primer, 0.2 mM dNTPs, 1.5 mM MgCl2, and 1 U of AmpliTaq DNA polymerase (Roche) in a 20-μl volume. PCR cycling conditions consisted of an initial denaturation step of 95°C for 2 min; 30 cycles of 95°C for 20 s, 53°C for 20 s, and 72°C for 30 s; followed by a final extension step of 72°C for 3 min.

Plant material

Fresh needle tissue from the current year’s growth was collected from 20 mature trees in wild populations of both white and Sitka spruce. The white spruce population was sampled in the region surrounding the town of Fort Nelson, situated in the northeast corner of the province of British Columbia, Canada. Trees of this population were sampled 1–2 km apart. The Sitka spruce population was located on Kodiak Island, Alaska, which marks the northern migrating tip of the species’ range. Trees of this population were sampled 30–50 m apart. DNA from 20 black spruce individuals was obtained from samples collected in Manitoba and Saskatchewan, Canada. Buds were collected from a single tree from each of 23 spruce species (for list see Table 3), which were growing as a collection at the British Columbia Ministry of Forests Kalamalka research station, Vernon, B.C. Only one genetic individual was available for each species in this collection. DNA was isolated from the bud and needle tissue following the CTAB method described by Doyle and Doyle (1990).

SSR testing and assay

The 44 SSRs developed from the EST database were tested on the population collections of Sitka, white, and black spruce described above. In addition, the 101 previously described SSR primer pairs were tested on a panel of two white, two Sitka, one black, and one red spruce individual. The SSR primer pairs that amplified products from these species were then tested on the collection of 23 spruce species.

In the testing and assay, presence or absence of microsatellite PCR products was scored on 2% agarose gels. When products were found, they were tested for polymorphism on 6% (Long Ranger) polyacrylamide gels, using a LiCor 4200 automated sequencer. Microsatellite products were detected by M13 tailed primer (Oetting et al. 1995).

Analyses

Observed heterozygosity (Ho), expected heterozygosity (He) and the inbreeding coefficient (F: F=1−Ho/He) were estimated for each SSR locus within each of the three spruce species (Sitka, white, black). Standard errors of F were determined by bootstrapping individuals within populations, using a Fortran 95 program written by K. Ritland.
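
For readers who want to reproduce these statistics, the following is a minimal sketch, not the Fortran 95 program used in the study. It assumes diploid genotypes given as pairs of allele sizes, computes He as one minus the sum of squared allele frequencies (the text does not state whether a small-sample correction was applied), and uses 1,000 bootstrap replicates, which is also an assumption.

```python
import random
from collections import Counter


def het_stats(genotypes):
    """Return (Ho, He, F) for one locus; genotypes is a list of allele pairs."""
    n = len(genotypes)
    ho = sum(a != b for a, b in genotypes) / n
    counts = Counter(allele for g in genotypes for allele in g)
    he = 1.0 - sum((c / (2 * n)) ** 2 for c in counts.values())
    f = 1.0 - ho / he if he > 0 else float("nan")
    return ho, he, f


def bootstrap_se_f(genotypes, n_boot=1000, seed=0):
    """Standard error of F from resampling individuals with replacement."""
    rng = random.Random(seed)
    n = len(genotypes)
    values = []
    for _ in range(n_boot):
        sample = [genotypes[rng.randrange(n)] for _ in range(n)]
        f = het_stats(sample)[2]
        if f == f:  # drop NaN from monomorphic resamples
            values.append(f)
    mean = sum(values) / len(values)
    return (sum((v - mean) ** 2 for v in values) / (len(values) - 1)) ** 0.5
```

With 20 individuals per population, as sampled here, standard errors obtained this way can be large, so F for a single locus should be interpreted with caution.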

Genetic distances between individuals in the 23 species set were estimated as the mean squared difference of allele sizes (Goldstein et al. 1995)—after sizes were normalized by dividing by the variance of allele size (specific for each locus)—using a Fortran 95 program written by K. Ritland. A total of 100 bootstrap datasets were constructed by resampling loci. For each replicate, the computer program NEIGHBOR (in PHYLIP, Phylogeny Inference Package, version 3.57c, Felsenstein 1995) was used to construct an unrooted tree, using the neighbor-joining (NJ) method (Saitou and Nei 1987). The 100 trees were then evaluated by CONSENSE (in PHYLIP) to find an overall consensus tree, with confidence numbers attached to each branch.
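
The distance calculation and the resampling of loci can be sketched as below. This is one reading of the description above, not the Fortran 95 program used in the study: each individual is represented at a locus by the mean of its two allele sizes, the squared difference is divided by the variance of allele size at that locus across all individuals, and the result is averaged over loci scored in both individuals. The data layout and function names are assumptions made for illustration.

```python
import random
from statistics import mean, pvariance


def locus_variances(data):
    """data: {individual: {locus: (size_a, size_b)}} -> {locus: variance}."""
    sizes = {}
    for genotypes in data.values():
        for locus, alleles in genotypes.items():
            sizes.setdefault(locus, []).extend(alleles)
    return {locus: pvariance(vals) for locus, vals in sizes.items()}


def distance(ind_a, ind_b, data, variances, loci=None):
    """Mean variance-standardized squared difference in allele size."""
    candidates = loci if loci is not None else list(variances)
    terms = [
        (mean(data[ind_a][l]) - mean(data[ind_b][l])) ** 2 / variances[l]
        for l in candidates
        if l in data[ind_a] and l in data[ind_b] and variances[l] > 0
    ]
    return mean(terms) if terms else float("nan")


def bootstrap_loci(loci, rng):
    """Resample loci with replacement for one bootstrap replicate."""
    return [rng.choice(loci) for _ in loci]
```

Each bootstrap replicate would pass its resampled locus list to `distance` for every pair of individuals, producing one distance matrix per replicate for tree construction.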

The mean squared allele size difference outperforms heterozygosity as a metric when divergence spans longer time periods (≥1,000 generations), particularly when standardized (Neff 2004). Hence, instead of standard measures such as Nei’s genetic distance or the proportion of bands not shared, we used a mean squared allele size difference, standardized by mutation rate (as the variance of allele size is proportional to the mutation rate; cf. Goldstein et al. 1995).

Results

From the Genome British Columbia (BC) spruce EST unigene database, 188 unique SSR sequences were found within 183 contigs. A total of 119 dinucleotide, 61 trinucleotide, and eight tetranucleotide repeats were found (Fig. 1). The most common class of repeat was AT (91 of 188 SSRs). Of the SSR sequences found in the EST database, 31 were at the extreme 3′ end of the ESTs (adjoining the poly-A tail) and 22 were at the 5′ end of the sequences; therefore, primers could not be designed for these sequences. The distribution of the SSR repeat types in relation to the coding sequence was non-random. Of the 31 SSRs at the 3′ end of the ESTs, 30 were AT repeats (the remaining SSR was an ATT repeat). Of the 22 repeat types at the 5′ end of the ESTs, 19 were AG repeats (the remaining SSRs were one each of AT, GAC, and AGA). Only two AC repeats were identified within the EST database. G+C content within the SSR-containing ESTs was 40.2%, which is comparable to the G+C content in Arabidopsis SSR-containing ESTs (43.8%) (Morgante et al. 2002).

Fig. 1 Distribution of simple sequence repeat motifs in the Genome British Columbia Forestry spruce EST database

Of the 145 primer pairs, 41 detected a single locus, and one previously developed SSR detected two loci across these four species (Table 1). This set of 42 primer pairs included 25 EST-SSRs and 17 previously developed SSRs.

Table 1 Primer pair sequences and repeat motifs. Forward primers were 5′ tailed with the M13 sequence 5′-CACGACGTTGTAAAACGAC- 3′ to facilitate visualization on LiCor sequencers

He, Ho, and F are shown in Table 2 (in some cases there were insufficient individuals to obtain adequate estimates of F). As is normal with microsatellites, heterozygosity varied widely among loci. The average heterozygosity was highest in white spruce (0.78), lower in black spruce (0.72), and lowest in Sitka spruce (0.55).

Table 2 Polymorphism in Sitka, white, and black spruce as described by the number of alleles, the expected (He) and observed (Ho) heterozygosities, the estimated inbreeding coefficient (F), and its estimated standard error (SE)

The EST-SSRs showed significantly less variation than the genomic-derived SSRs; He values were 6.25% less in white spruce, 15% less in black spruce, and 9% less in Sitka spruce. Likewise, the numbers of alleles at EST-SSR loci were comparably lower in all three species. Interestingly, F values were significantly lower at the EST-SSR loci than at the genomic-derived loci (0.02 vs 0.13 in Sitka, 0.03 vs 0.10 in white, and 0.09 vs 0.14 in black).

Of the 43 loci amplified by the 42 primer pairs determined to be informative across white, Sitka, black, and red spruce, 33 were identified in all 23 spruce species (Fig. 2; Table 3). The minimum number of species in which a particular locus was present was 17. Twenty-five EST–SSR primer pairs developed from the Genome BC spruce EST database were included in this set of markers. Of these, 20 amplified single-locus markers from all 23 spruce species tested, while five amplified single-locus markers from 22 of the 23 spruce species tested.

Fig. 2 Amplification of locus WS0092.A19 across the 23 spruce species

Table 3 Allele sizes resulting from the amplification of 37 simple sequence repeat loci across 23 spruce species

Figure 3 gives the NJ tree of microsatellite genetic distances among the 23 spruce species, and Fig. 4 gives the consensus tree of microsatellite genetic distances among 23 spruce species. The relatively deep rooting of each species is due to the variability and high evolutionary rate at SSR loci. While some clustering of related species is evident, bootstrap confidence levels are not high.

Fig. 3 Neighbor-joining tree of microsatellite genetic distances among the 23 spruce species

Fig. 4 Consensus tree of microsatellite genetic distances among 23 spruce species. The number at each fork indicates how many of the 100 trees contained the group consisting of the species to the right of that fork

Discussion

The EST–SSR markers are adjacent to coding genes, and the function of these genes can often be identified via sequence similarity to annotated genes in other plant species. Thus, they are useful in quantitative trait locus mapping and particularly in “genomic scans” (Vigouroux et al. 2002). Their association with coding genes makes EST-SSRs more likely to be single copy, which is particularly useful for species with large genomes such as spruce. Furthermore, as coding regions tend to be more conserved, this potentially increases the transferability of these EST-SSRs across spruce species. While the EST-derived SSR markers in this study were somewhat less variable than the genomic SSR markers, the F values were also significantly lower, suggesting a lower frequency of troublesome null alleles in EST-SSRs.

SSR locations in spruce ESTs

The SSRs exhibited differential distribution within the expressed sequences. AT repeats were preferentially found at the 3′ end of the EST sequences, while AG repeats were preferentially found at the 5′ end of sequences. While Scotti et al. (2000) found six AC repeat regions from a Norway spruce cDNA library clustered at the 3′ end of the expressed sequences, we found only two AC repeats within our 3′ EST collection. This may reflect a difference in SSR composition between Norway spruce and the North American spruces used for our cDNA libraries. Alternatively, by specifically targeting AC repeats, Scotti et al. (2000) may have identified the rare AC repeats found in expressed portions of the spruce genome.

SSR motif types in spruce ESTs

The most common SSR motif found in our EST database was AT, accounting for 91 of the 188 repeats identified. By contrast, in Arabidopsis ESTs, AAG is the most common class of SSR, and AT repeats are less prevalent (Cardle et al. 2000). AT repeats, however, are the second most common SSR type (after poly A repeats) in Arabidopsis and other plant genomic DNA (Cardle et al. 2000). The prevalence of AT repeats in spruce ESTs is also supported by SSR searches of Picea sequences in the EMBL database, where of seven SSRs developed, four were AT repeats (Besnard et al. 2003). This prevalence of AT repeats in spruce ESTs may be a hitherto unnoted feature, as other studies where SSRs have been isolated from spruce coding sequences have utilized specific repeat probes (not AT) (e.g., Scotti et al. 2000). Alternatively, as AT repeats were found to be preferentially clustered at the 3′ end of the ESTs, this preponderance of AT repeats may be a consequence of the 3′ sequencing of this EST database. We are currently in the process of obtaining full-length EST sequences, and a survey of these may reveal a different distribution of SSR repeats.

SSR polymorphism

The amplification of the SSRs in Sitka, white, and black spruce populations revealed high levels of polymorphism, as indicated by the high average number of alleles and the high He and Ho, both typical of SSR markers. This suggests that most of these SSR markers will be useful in parentage and clonal assessments because of their high potential for discrimination. They will also be useful in constructing genetic linkage maps, as these markers will likely be segregating in a range of crosses.

The F for specific SSRs allows identification of loci with putative null alleles, with significantly higher F values indicating the presence of null alleles. Null alleles can bias estimates of genetic variation and genetic structure, and are not useful for genetic mapping. Also, loci with prominent stutter bands often exhibited higher F values, due to the difficulty of scoring heterozygous genotypes for adjacent-sized alleles. In contrast to the genomic DNA-derived markers, our EST–SSR markers gave more uniform F values across the three spruce species. Two genomic-derived loci in particular showed consistent patterns across the three species: PAAC 19 (positive F) and UAPgAG150A (negative F).

Cross-species amplification of SSR markers

Of the 43 loci identified as informative in white, Sitka, black, and red spruce, the majority (33/43) were able to amplify alleles across all 23 spruce species tested. The minimum number of species in which a particular locus was identified was 17 (locus SPL3AG1A4). This suggests that the regions flanking the SSRs are well conserved across the spruce species tested, and that if a particular locus can be amplified from white, Sitka, black, and red spruce, then it is likely that the locus will be widely transferable throughout other spruce species as well. Of the 44 SSR markers developed from the EST database, 25 were informative in white, Sitka, and black spruce. From these 25 loci, 20 were identified in all 23 spruce species tested, while the remaining five loci were detected in 22 of the 23 species.

While SSRs are instrumental in genetic mapping (e.g., Dib et al. 1996), studies of kinship (e.g., Queller et al. 1993), and population structure (e.g., Bowcock et al. 1994), they have received limited use as a tool for phylogenetic reconstruction of closely related species (reviewed by Schlötterer 2001). This is mainly due to allele size homoplasy resulting from an exceptionally high mutation rate. However, when a genetic distance measure that takes into account the mutational process is used (Goldstein et al. 1995; Neff 2004), SSRs, particularly those developed from ESTs, can potentially be very informative in resolving newly diverged species complexes or groups with slower rates of evolution.

Interestingly, the tree topology obtained by microsatellite genetic distances among species (Figs. 3, 4) was similar to that obtained from phenetic and cladistic analyses of chloroplast DNA RFLPs (Sigurgeirsson and Szmidt 1993). Highlights of this similarity include P. mexicana and P. glauca clustering together in congruence with Sigurgeirsson and Szmidt’s (1993) “P. glauca alliance” and the association of P. asperata, P. koyamai, and P. koraiensis. Results from the bootstrap routine, however, showed no support for the branches of the phylogenetic tree generated. This is most likely a product of sampling a single individual per tree species. Because portions of the phylogenetic tree obtained matched the results of Sigurgeirsson and Szmidt (1993) and were in agreement with generally accepted views of classification within Picea, we propose that the microsatellite markers tested in this study, if applied to multiple individuals of each species, will likely prove to be powerful tools for investigating phylogenetic relationships within Picea.

Comparison of EST-derived SSRs with other SSRs in spruce

In this study, we found that the use of an EST database to develop novel SSR markers led to a high rate of success when compared to other studies. In addition, the EST–SSR markers developed and presented here have been readily transferable across species. Of 44 EST-SSRs, 25 were widely transferable across spruce species (~57%), while only 17 of 101 previously developed SSR markers were as widely transferable (~17%). These SSR markers are in the process of being placed onto a genetic linkage map of white spruce. This will increase their usefulness for other purposes such as population studies because markers evenly spaced throughout the genome will be able to be chosen. Also, the large allele size difference between different loci will allow placement of loci into “bins” for multiplexing. Although the potential for coamplification of loci has not been tested yet, even post-PCR pooling of these loci will save time and money by reducing the number of gels that have to be run.

Previous attempts at developing SSR markers from conifer genomic sequences have been hampered by a low success rate due to many primer pairs yielding complex banding patterns that cannot be genetically interpreted. This has been attributed to the large proportion of repetitive or low complexity sequence present in conifer genomes (Pfeiffer et al. 1997). The use of an EST database to identify SSR markers has resulted in the development of a higher proportion of useful and informative loci. This study identified a preponderance of AT repeats from spruce ESTs, in contrast to other plant genomes. When a full-length EST database is available for spruce, we will be able to determine if this SSR distribution is confirmed or if it is an artefact caused by the 3′-sequence data currently in our EST database.