Introduction

The conifers are classified with the seed plants, which include four living groups of gymnosperms, i.e. cycads (Cycadales), Ginkgo (Ginkgoales), gnetophytes (Gnetales), and conifers (Pinales), and the extant flowering plants or angiosperms (Magnoliophyta) (Raven et al. 2005; Gernandt et al. 2011). While angiosperms underwent tremendous adaptive radiation, reaching some 250,000 species (Kenrick 1999), the extant gymnosperms number fewer than 1,000 species (Farjon 2008). Despite the global success of angiosperms, the conifers, one division of the gymnosperms, still dominate many of the world’s temperate and boreal forest ecosystems, where they play a major role in global carbon cycles, are widely used in reforestation programs, and are critical to preventing soil erosion, among other functions.

Fully sequenced plant genomes are rapidly growing in number but, to date, they do not include a representative from the gymnosperm lineage. Until very recently, sequencing of conifer genomes had not been attempted owing to their extremely large sizes. Genome sizes ranging from 18,000 to over 35,000 Mbp have been estimated for economically and ecologically important conifers, such as many pines and spruces (Murray et al. 2004), making them on average > 200× the size of the Arabidopsis thaliana genome, and close to 24×, 10× and 7× the genomes of rice, maize and human, respectively (Fig. 1). It is now well known that large genomes among angiosperms are the consequence of multiple genome duplications and triplications, polyploidization events complemented with periods of transposon multiplication (Bennetzen 2002; Gaut and Ross-Ibarra 2008). Considerably less is known about the mechanisms that have led to the expansion of conifer genomes. For conifers, there is no evidence of polyploidization, but retrotransposons have been shown to be abundant and widespread in conifer genomes (Morse et al. 2009; Magbanua et al. 2011; Morgante and De Paoli 2011).

Fig. 1 Genome sizes of 181 conifers (Murray et al. 2004) and select angiosperms (Bennett and Leitch 2005) with completely or partially sequenced genomes. Each angiosperm is represented by a colored circle (see KEY), while conifers are represented by white circles with gray outlines, except for a few socioeconomically important conifer species, which are represented by labeled black circles

In addition to their unusually large sizes, early insights into the composition and structure of conifer genomes strongly suggest that they are very different from angiosperms, and that their organization cannot readily be predicted or deduced based on lessons learned from other plant genomes. For example, glimpses into the unique biology of conifer genomes have come from attempts to estimate and catalogue their protein coding gene content. Studies have identified a large overlapping set of sequences between conifers and angiosperms, including both herbaceous annuals and woody perennials (Ralph et al. 2008; Pozo et al. 2011; Rigault et al. 2011). However, the actual total number of genes in conifers and the proportion of the genome that they represent remain elusive (MacKay and Dean 2011). Results reviewed in this report suggest that a complete genome sequence will hold the key to resolving these questions among many others.

The extreme genome expansion due to retrotransposon insertions reported for conifers is likely to have affected gene structure in ways that remain to be described, and has resulted in large distances between genes distributed amid a vast ocean of non-coding DNA. We may also expect (see Lynch 2007) that it has favored the accumulation and retention of many neutral or slightly detrimental mutations, which in turn could have contributed further to genome inflation, in addition to providing a potential store of within-species genetic diversity. Understanding the relationships between the very large genome sizes, gene structure and genome organization in conifers will undoubtedly provide new insights into genome evolution and function. Along with advances in our understanding of conifer physiology and ecology, information gained from conifer genome sequencing will help us describe the potential for conifer species and populations to adapt to environmental change or respond to selective breeding, which will help us to better protect and improve them in the future.

Technology advancements have now made the sequencing of conifer genomes feasible, and, far from representing just another plant group of interest, it is expected that such analyses will significantly expand basic knowledge of plant genomes into new areas because of the uniqueness of the genomes themselves as well as the taxonomic position of conifers relative to plants that have been sequenced previously. Furthermore, many opportunities for applied research will result owing to the ecological importance of conifers combined with their potential role in solutions to global warming as well as their economic significance in meeting the global demand for wood and other biomaterials. The purpose of this review is to outline the current state of knowledge concerning conifer genomes and to highlight not only the opportunities for knowledge creation and application associated with conifer genome sequencing, but also some of the underlying challenges that may be expected.

Accomplishments in conifer genomics

Transcriptome sequencing and analysis

Although genome sequencing had not been attempted in conifers until very recently, large-scale investigations of expressed gene sequences have been ongoing for over a decade through cDNA analysis and clustering of expressed sequence tags (ESTs) to infer putative unigenes or transcript sets (reviewed by MacKay and Dean 2011). Knowledge of protein coding sequences has thus become rather extensive and has been fundamental to enhancing our understanding of a variety of biological processes and evolutionary mechanisms in conifers.

Over 90 % of the approximately 1 million gymnosperm ESTs found in dbEST are from conifers, mostly representing pines (Pinus) and spruces (Picea) (Pinaceae family) (MacKay and Dean 2011). Wood formation has received the most attention in gene discovery efforts, but other biological processes have also been targeted, including somatic embryogenesis, responses to defoliating insects, and root responses to water stress (for a partial list of larger projects, see Conifer Genome Network CGN; http://www.pinegenome.org/). Currently available expressed sequence resources include 328,662 ESTs and 17,379 unigenes (GenBank) for loblolly pine (Pinus taeda), as well as 313,110 and 186,637 ESTs with 27,848 and 19,944 unigenes for white spruce (Picea glauca) and Sitka spruce (Picea sitchensis), respectively (Table 1). Other conifers having EST datasets containing >10,000 sequences include Cryptomeria japonica, Pinus contorta, Pinus banksiana, Pinus pinaster, P. radiata, Pseudotsuga menziesii, Picea engelmannii × P. glauca, and Picea abies. Large full-length (FL) cDNA collections are available for P. glauca (23,000 unique fully sequenced cDNA insert sequences) and P. sitchensis (13,005 FL-cDNAs). Extensive gene catalogues have been derived from these data (Ralph et al. 2008; Rigault et al. 2011; Pozo et al. 2011) making feasible a wide variety of functional genomic approaches previously unavailable for conifer studies.

Table 1 Transcriptomics resources currently available for the five best-studied conifer genera

Conifer transcriptome sequencing using the 454 Life Sciences platform (Roche) recently added more than 10 million reads for over a dozen conifer species drawn from all seven families of conifers (Lorenz et al. 2012). Similar studies have developed resources specifically for P. contorta (Parchman et al. 2010) and P. glauca (Rigault et al. 2011). The 1,000 Plant Transcriptome (1KP) project (http://www.onekp.com/) set its sights on short-read sequence transcriptome scans from every plant family and numerous genera. Its current goals include generation of transcriptome scans comprising approximately 2 Gb of short-read (Illumina) sequences for at least 48 conifer species representing all seven conifer families, as well as several other gymnosperms and non-flowering plants. Once released, these sequences will constitute an enormous resource for comparative genomic studies of conifers.

Species-specific as well as multi-species conifer EST assemblies can be found in several public databases, including NCBI (Table 1). The recently developed database, EuroPineDB, integrates ESTs from both dideoxy and 454 sequencing for several pine species (Pozo et al. 2011). MacKay and Dean (2011) discussed the approaches used to generate the wide-ranging numbers of unigenes and assemblies found in some of these databases. Next-generation sequencing (NGS) greatly reduces the sequencing cost, but high-fidelity assembly can be difficult due to repeats, insertions/deletions, highly conserved paralogs, and other sequence variations that occur on scales ranging from a few tens of bases to complete genes (Wall et al. 2009). The large FL-cDNA datasets available for P. glauca and P. sitchensis are thus extremely important for their potential to serve as templates for guiding assembly of short-read sequences (Rigault et al. 2011).

Insights into gene and protein families and non-coding small RNAs

Phylogenetic analyses of EST and cDNA sequences have highlighted distinct evolutionary trajectories for gene families in conifers and angiosperms, including genes related to cell wall and wood formation, such as arabinogalactan proteins (AGPs), cellulose synthases (CesA), and expansins, defense-related genes, such as dirigent proteins and the cytochrome P450 monooxygenases of the terpenoid oxygenase superfamily, and transcription factors, such as auxin response factors (Aux/IAAs), KNOTTED-LIKE HOMEOBOX Class 1 (KNOX1), and R2R3-MYBs (see MacKay and Dean 2011). For example, the expressed KNOX1 genes in conifers were found to belong to only one of three sub-families found in angiosperms, but had more recent gene duplications (Guillet-Claude et al. 2004). The AGP, Aux/IAA, and dirigent protein families were comprised of clades representing numerous duplications that had occurred since the angiosperm-gymnosperm split (Li et al. 2010). And the PAL gene family was shown to have undergone gene duplication and loss events that have resulted in an expanded gene clade specific to gymnosperms (Bagal et al. 2012).

Small non-coding RNAs, including microRNAs (miRNA) and short-interfering RNAs (siRNA), have distinct functions and modes of formation in conifers, and contribute to transcriptional and post-transcriptional gene regulation (Morin et al. 2008). The miRNAs, which include many negative regulators of gene expression, predominantly accumulate as both 21- and 24-nucleotide RNAs in angiosperms, but only 21-nucleotide miRNAs were found in gymnosperms (Dolgosheina et al. 2008; Morin et al. 2008; Yakovlev et al. 2010). Specific miRNAs previously described in other plants have also been found in several conifer species (Dolgosheina et al. 2008), where they have been linked to important processes such as response to pathogens in P. taeda (see MacKay and Dean 2011). Novel miRNAs of uncharacterized function (51 in total) were identified in P. contorta (Morin et al. 2008), and in P. abies, miRNAs were shown to be involved in regulating temperature-dependent epigenetic memory (Yakovlev et al. 2010). Thus, current knowledge indicates that miRNAs are likely important for conifer biology, but the mechanism of their synthesis is unlike that in angiosperms, while siRNAs have not yet been described in conifers.

Transcript profiling

Custom cDNA arrays have been developed using sequences from various Pinaceae species (reviewed in MacKay and Dean 2011). Large cDNA arrays (>20,000 spots) using P. glauca (Holliday et al. 2008) and P. taeda (Lorenz et al. 2011) sequences, as well as an oligonucleotide microarray representing nearly 24,000 distinct P. glauca genes (Beaulieu et al. 2011), have enabled truly genome-scale analyses of conifer gene transcription.

Vascular tissue development and wood formation have been a major focus of transcript profiling investigations in conifer trees to date (MacKay and Dean 2011). Studies have characterized different stages of xylem differentiation and development, the response to mechanical stress, the transcripts that accumulate in secondary xylem relative to secondary phloem or needles (Pavy et al. 2008a), and transcripts, such as AGPs, lignin biosynthetic enzymes, alpha-tubulins, and various cell wall proteins, that are up-regulated in compression wood (see MacKay and Dean 2011). The latter paper also reviewed transcriptional profiling investigations of biotic factors affecting conifers, such as pathogens, insects and symbionts, as well as abiotic factors, such as drought, photoperiod and temperature as determinants of phenology, development and dormancy (El Kayal et al. 2011). These studies have only scratched the surface and a comprehensive understanding of conifer transcriptome dynamics and structure remains to be developed.

Genetic mapping: SNP resources and comparative mapping

Genetic linkage maps have been developed for the most economically important conifers using a variety of molecular markers (reviewed in Ritland et al. 2011). For example, as described by Echt et al. (2011) for loblolly pine (P. taeda L.), nuclear microsatellites (nSSR), microsatellites from expressed sequence tags (EST-SSRs), restriction fragment length polymorphisms (RFLPs) and expressed sequence tag polymorphisms (EST-Ps) were all combined to generate a consensus map of 460 markers covering 1,515 cM across 12 linkage groups. During the past 5 years, EST sequencing and the large-scale resequencing of amplicons gave rise to a scale change in SNP marker development (Le Dantec et al. 2004; Pavy et al. 2006) so that the pace of development for in silico SNP resources is speeding up dramatically. High-throughput genotyping methods have greatly benefitted from the accelerated rate of nucleotide polymorphism discovery, and in vitro SNPs verified by resequencing of genomic DNA using Sanger sequencing, as well as SNPs discovered in silico, have been used successfully to construct gene-based linkage maps for P. glauca (Pavy et al. 2008b; Pelgas et al. 2011), P. taeda (Eckert et al. 2009a), and P. pinaster (Chancerel et al. 2011). Comparative analysis of EST sequences across species has enabled the development of comparative orthologous sequence (COS) markers that can be used across a broad spectrum of conifer species. Such markers showed high levels of macro-synteny between P. taeda and P. sylvestris (Komulainen et al. 2003), P. taeda and P. pinaster (Chancerel et al. 2011), and across multiple genera in the Pinaceae family (Pelgas et al. 2006). Liewlaksaneeyanawin et al. (2009) developed a set of 239 COS markers that have proven valuable for synteny analyses between Picea, Pinus and Pseudotsuga species. Together with the reference cytomolecular map of loblolly pine (Islam-Faridi et al. 2007), the emerging high-density gene-based maps for several conifer species will provide further opportunities for studying and comparing the organization and evolution of these genomes.
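As a rough worked example of what the consensus-map figures quoted above imply, the following Python sketch computes the average spacing between adjacent markers for the loblolly pine map (460 markers, 1,515 cM, 12 linkage groups). The function name is illustrative, and even marker spacing is a simplifying assumption; real maps have unevenly distributed markers.

```python
def mean_marker_spacing(total_cm: float, n_markers: int, n_groups: int) -> float:
    """Average distance (cM) between adjacent markers, assuming even spacing.

    A linkage group carrying m markers has m - 1 intervals, so the whole
    map has n_markers - n_groups intervals.
    """
    intervals = n_markers - n_groups
    if intervals <= 0:
        raise ValueError("need more markers than linkage groups")
    return total_cm / intervals


# Values quoted in the text for the P. taeda consensus map.
spacing = mean_marker_spacing(total_cm=1515.0, n_markers=460, n_groups=12)
print(f"{spacing:.2f} cM between adjacent markers on average")
```

Under these assumptions the average interval is a little over 3 cM, which helps put the later discussion of marker density for scaffold anchoring into perspective.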

Applications: association studies and QTL analyses

Dense linkage maps can now be used to narrow down the location of loci that influence quantitative traits through classical quantitative trait locus mapping (Pelgas et al. 2011) (Tables 2, 3). So far, only a limited number of candidate gene association studies have been conducted with respect to wood-quality genes in P. taeda (González-Martínez et al. 2007), P. radiata (Dillon et al. 2010) and P. glauca (Beaulieu et al. 2011). Drought-related traits (associated with carbon-isotope discrimination) were also analyzed in P. taeda (González-Martínez et al. 2008) (Table 3). Associations between candidate genes and timing of growth cessation or cold tolerance initiation were examined in P. menziesii (Eckert et al. 2009b) and P. sitchensis (Holliday et al. 2010). More recently, Eckert et al. (2010) analyzed associations between a large set of SNPs (independent of function) and aridity-related environmental variants as a surrogate for the phenotype of each tree.

Table 2 Saturated linkage maps in conifers
Table 3 QTL and association genetics studies in conifers

These candidate gene-based studies have, in general, confirmed earlier results of QTL mapping, i.e. the effects of individual loci on quantitative traits are mostly small, and the total detected effects are still far from accounting for all of the heritability for a given trait. In maize, in a very large study with high power, the effects of individual SNPs on flowering time were even smaller, but the whole-genome associations accounted for most of the additive genetic variance (Buckler et al. 2009). Additional genomic sequence characterization in conifers will provide more comprehensive sets of markers that also account for gene promoters and non-genic regions of the genome. These expanded sets of markers, which will include more rare (and possibly larger effect) variants, will provide powerful tools for breeding programs.

Links to updated results from ongoing and concluded research programs and projects in the area of tree genomics have been collected as part of the FoResTTraC project and can be found at http://www.foresttrac.eu/index.php/resources-database as well as in Plomion et al. (2007, 2011).

Pilot studies in genome sequencing

Sequence structure: BAC sequencing

With the aim of characterizing the structure of transcriptional units as well as non-coding intergenic regions, large-insert “bacterial artificial chromosome” (BAC) genomic libraries have been constructed for P. pinaster, P. glauca, P. taeda, and Taxodium distichum (Bautista et al. 2007; Hamberger et al. 2009; Magbanua et al. 2011; www.mgel.msstate.edu).

Targeted screening of a small number of conifer BACs harboring gene or gene-like sequences has indicated that pseudogenes may be a frequent feature within conifer genomes (Kovach et al. 2010; Magbanua et al. 2011). Of two P. pinaster BACs hybridizing to a sequence for ferredoxin-dependent glutamate synthase (Fd-GOGAT), one contained a sequence coding for an intact Fd-GOGAT polypeptide while the other contained what appeared to be a pseudogene (Bautista et al. 2007). Of eight P. taeda BACs containing sequences hybridizing to known genes, most contained apparent pseudogenes (Kovach et al. 2010). P. glauca BAC clones harboring intact coding sequences for genes encoding a terpenoid synthase and a cytochrome P450 were successfully isolated using PCR-screening and amplicon sequencing to confirm identity (Hamberger et al. 2009).

In all of these cases, no more than one intact gene sequence was present in any given BAC assembly (>100 kb), suggesting that large intergenic regions separate the coding sequences in conifer genomes. Consistent with this interpretation and the very large size of conifer genomes, sequencing of several random BACs proved unsuccessful in identifying multiple coding sequences (unpublished data).

Conifer genomes contain very large amounts of repetitive DNA (Morse et al. 2009; Magbanua et al. 2011). For example, the P. taeda LEA3 BAC was composed of 18.8 % LTR retroelement sequences (Magbanua et al. 2011). Similarly, three conifer-specific LTR retroelements (PtIFG7, PtGypsyX1 and PtCopiaX1), as well as direct and tandem repeats from putative uncharacterized LTR retrotransposons, were abundant in P. taeda and P. pinaster BACs (Kovach et al. 2010; Cánovas et al. unpublished). The distribution of the Gypsy retrotransposons, Gymny and PtIFG7, and the Copia retroelement, TP1, suggests that pine genomes contain highly abundant and diverse repetitive DNA (Morse et al. 2009; Magbanua et al. 2011; Kovach et al. 2010), and it has been hypothesized that accumulation of retrotransposon derivatives could explain the tremendous size and complexity of conifer genomes (Morse et al. 2009). There is even evidence that the low-copy, non-genic portion of the pine genome is primarily composed of extremely diverged mobile elements (Morse et al. 2009; Kovach et al. 2010; Magbanua et al. 2011). These patterns seen in pine genomic sequences were consistent with major features observed in the spruce (P. glauca) genome (Hamberger et al. 2009). This information has important ramifications for efforts to develop reference genome sequences for conifers, since the quantity, complexity and divergence of repetitive sequences in genomic DNA profoundly affect the speed and quality of output from available sequence assembly tools.

Genome composition: genome filtration or reduced representation sequencing, and snapshot sequencing

The tools and methods used in genomic analyses are changing rapidly, driven in large part by recent advances in sequencing technologies. The new generation of sequencing technologies, represented by HiSeq/Illumina, SOLiD/ABI, GS-FLX Titanium and FLX+/Roche, HeliScope/Helicos, PacBio RS/Pacific Biosciences, and GridION/Oxford Nanopore, is providing new opportunities for high-throughput functional genomic research. Yet even with the vastly improved throughput of these new technologies, large genomes, such as those of P. abies (1.8 × 10¹⁰ bp) or P. taeda (2.7 × 10¹⁰ bp) compared to A. thaliana (1.6 × 10⁸ bp) (Fig. 1), demand novel approaches in order to reduce genome complexity to levels permitting sequence assembly with reasonable fidelity.
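To make the scale difference concrete, the short Python sketch below computes how many times larger the two conifer genomes are than that of A. thaliana, using only the genome sizes quoted in the preceding paragraph. The dictionary and function names are illustrative, not taken from any published tool.

```python
# Genome sizes in base pairs, as quoted in the text (P. abies = Picea abies).
GENOME_SIZE_BP = {
    "Picea abies": 1.8e10,
    "Pinus taeda": 2.7e10,
    "Arabidopsis thaliana": 1.6e8,
}


def fold_difference(species: str, reference: str = "Arabidopsis thaliana") -> float:
    """Return how many times larger the genome of `species` is than `reference`."""
    return GENOME_SIZE_BP[species] / GENOME_SIZE_BP[reference]


for name in ("Picea abies", "Pinus taeda"):
    print(f"{name}: {fold_difference(name):.1f}x the A. thaliana genome")
```

With these particular figures, both conifer genomes come out at more than one hundred times the size of the A. thaliana genome.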

Among several methods available for creating reduced-representation libraries for genome sequencing (Springer et al. 2004), restriction enzymes sensitive to DNA methylation at the C5 position of cytosines in CpG dinucleotides can be used to generate genome fractions that are enriched for genes. This approach can be used prior to sequencing to help eliminate the most highly repetitive elements, which can constitute >75 % of conifer genomes (Kovach et al. 2010; Morse et al. 2009). However, one pilot study using this approach did not find the expected degree of enrichment using P. taeda DNA, suggesting that methylation patterns in pine may be somewhat different from those typically seen in angiosperms (Rabinowicz et al. 2005). Sequence capture methods developed to enrich for exons in resequencing projects (Ng et al. 2009), as well as methods based on thermal renaturation kinetics (Peterson et al. 2002), represent alternative approaches for reducing the complexity of conifer genomic DNA for sequencing projects. Because of their tendency to enrich for gene space sequences, these techniques are particularly useful for targeting protein coding regions that correspond to the expressed sequences available from EST projects.

Genome scans have been performed at low coverage on diploid genomic DNA from P. taeda (Kovach et al. 2010), P. glauca (Rigault et al. 2011), and P. abies (Morgante and De Paoli 2011). Early analyses from these studies have likewise indicated that a significant proportion of the genome is comprised of uncharacterized repeats (Morgante and De Paoli 2011). However, while these authors reported that 3 % of the genome of P. abies (approximately 600 Mb) is comprised of sequences similar to genes, their finding was at odds with studies based on cDNAs that estimated the P. glauca transcriptome at between 40 and 50 Mb (Rigault et al. 2011). On the other hand, the finding that a large portion of the genome may be comprised of gene-like sequences could fit with earlier results suggesting an unexpectedly complex transcriptome in P. taeda (Lorenz and Dean 2002). These discrepancies could indicate that conifer genomes contain more non-coding pseudogene sequences than genes encoding expressed functional proteins, or that novel transcriptional mechanisms exist in conifers. Additional studies are underway using haploid genomic DNA from P. abies (Ingvarsson et al. unpublished data) and P. pinaster (Cervera et al. unpublished) to help test these hypotheses. Further low-coverage genome scans will provide important preliminary evidence for the potential of NGS in shotgun sequencing approaches to characterize conifer genomes.
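The size of the discrepancy described above can be checked with simple arithmetic. The Python sketch below uses only the figures quoted in the text (3 % of the genome, approximately 600 Mb of gene-like sequence, and a 40 to 50 Mb transcriptome); the function names are illustrative, and comparing a P. abies genome fraction with a P. glauca transcriptome estimate is itself an approximation.

```python
def implied_genome_size_mb(gene_like_mb: float, fraction: float) -> float:
    """Genome size implied by a gene-like amount and its share of the genome."""
    return gene_like_mb / fraction


def excess_ratio(gene_like_mb: float, transcriptome_mb: float) -> float:
    """How many times the gene-like sequence exceeds the transcriptome estimate."""
    return gene_like_mb / transcriptome_mb


GENE_LIKE_MB = 600.0                   # approximate figure reported for P. abies
TRANSCRIPTOME_MB = (40.0, 50.0)        # range estimated for P. glauca

print(f"Implied genome size: {implied_genome_size_mb(GENE_LIKE_MB, 0.03):,.0f} Mb")
for size in TRANSCRIPTOME_MB:
    print(f"Gene-like sequence is {excess_ratio(GENE_LIKE_MB, size):.0f}x "
          f"a {size:.0f} Mb transcriptome")
```

On these numbers the gene-like fraction exceeds the transcriptome estimate by roughly an order of magnitude, which is the gap that the pseudogene and novel-transcription hypotheses attempt to explain.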

Future directions, prospects, and barriers

Opportunities and strategies for sequencing conifer genomes

In the last 2 years, genome sequencing projects for seven different conifer species representing three different genera have been launched: (a) the P. abies genome project, a European consortium led by Sweden to sequence the genome of Norway spruce (http://www.upsc.se/Networks/Networks/sprucegenome.html); (b) PineRefSeq, a USDA-funded project to sequence the genomes of loblolly pine (P. taeda), sugar pine (Pinus lambertiana), and Douglas-fir (P. menziesii) (http://dendrome.ucdavis.edu/NealeLab/pinerefseq); (c) SMarTForests, a Genome Canada project to characterize the white spruce (P. glauca) genome; (d) a Spanish consortium, funded by MICINN, with INRA contribution, to generate an initial draft of the P. pinaster genome sequence; and (e) ProCoGen, an FP7-KBBE project from the European Commission, to sequence the maritime pine (P. pinaster) and Scots pine (P. sylvestris) genomes.

Recent advances in sequencing technology have made it a straightforward exercise to achieve 50–100× coverage from conifer giga-genomes using whole-genome shotgun sequencing (WGS). WGS is typically performed using paired-end sequencing of fragments from a number of different size classes (from 100 to 800 bp); however, these fragment sizes are too short to span most repetitive elements in conifers. WGS therefore needs to be complemented with jumping (mate pair) libraries generated from fragments ranging from a few kb to tens of kb. Because even 10–20 kb fragments are too short to span some repeat regions, other methods, for example, end-sequencing of fosmids or BACs, as well as sequencing of fosmid pools, are actively being researched (Philippe et al. 2012). The silver lining to conifer genome sequencing with respect to repeat sequences is that a significant portion of the repeats appear to be rather old and as a consequence are highly diverged in their sequences (Kovach et al. 2010; Morgante and De Paoli 2011). Therefore, the highly repetitive nature of conifer genomes may in the end not be an impassable obstacle for sequence assembly. Combined with appropriate approaches to reduce genome complexity (genome filtration or BAC/fosmid pooling), the assembly of a conifer reference genome sequence appears more and more feasible. Assembling WGS data from diploid tissue may lead to problems with allele splitting due to sequence polymorphisms and polymorphic repetitive elements. However, conifers have the advantage that bulk haploid tissue is available from megagametophytes, and WGS from haploid tissue may mitigate many of these assembly problems. The only drawback is that a single megagametophyte might not provide enough DNA to achieve the desired sequence coverage given the seed size in most of the conifer species currently targeted by genome projects. Regardless of which strategies are pursued, a critical and difficult step for most de novo conifer genome projects will be the merging of multiple assemblies produced from different kinds of data.
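The 50–100× coverage figure mentioned above can be translated into read counts with the standard relationship coverage = (number of reads × read length) / genome size. The Python sketch below is a minimal illustration under stated assumptions: a 20 Gb genome and 100 bp reads are hypothetical round numbers chosen for the example, and the uncovered-fraction estimate relies on the idealized Lander-Waterman (Poisson) model of uniformly random read placement, which repeat-rich genomes do not satisfy.

```python
import math


def reads_needed(genome_bp: int, coverage: float, read_len_bp: int) -> int:
    """Smallest number of reads N such that N * read_len / genome >= coverage."""
    return math.ceil(coverage * genome_bp / read_len_bp)


def expected_coverage(n_reads: int, read_len_bp: int, genome_bp: int) -> float:
    """Average depth of coverage for a given sequencing effort."""
    return n_reads * read_len_bp / genome_bp


def fraction_uncovered(coverage: float) -> float:
    """Expected fraction of bases with zero reads under a Poisson model."""
    return math.exp(-coverage)


GENOME_BP = 20_000_000_000   # assumption: a 20 Gb conifer genome
READ_LEN_BP = 100            # assumption: 100 bp short reads

for depth in (50, 100):
    n = reads_needed(GENOME_BP, depth, READ_LEN_BP)
    print(f"{depth}x coverage requires about {n:.2e} reads of {READ_LEN_BP} bp")
```

The calculation shows why raw depth is no longer the limiting factor; it says nothing about whether reads of that length can resolve repeats, which is the role of the mate-pair, fosmid and BAC strategies described in the text.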

A more significant short-term obstacle for conifer genome assembly may be accessing adequate computing resources, particularly computers with sufficient random-access memory (RAM), to handle the most advanced sequence assembly methods. Successful assembly of these reference genome sequences will require high-density genetic maps that can be used to position scaffolds along genetic linkage groups. The genetic maps currently available for conifers have relatively low marker density, given the very large numbers of sequence scaffolds expected from WGS assemblies. Construction of saturated genetic maps and expanded mapping populations for higher map resolution are currently in progress (e.g. Neale and Kremer 2011; Pelgas et al. 2011; Table 2).
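The idea of positioning scaffolds along genetic linkage groups can be sketched in a few lines of Python. Everything in this example is hypothetical: the marker names, map positions and scaffold identifiers are invented for illustration, and the placement rule (assign a scaffold to the single linkage group of its markers and order by median map position) is a deliberately simple stand-in for the methods a real project would use.

```python
from collections import defaultdict
from statistics import median

# Hypothetical genetic map: marker -> (linkage group, position in cM).
GENETIC_MAP = {
    "snp_a": ("LG1", 12.5),
    "snp_b": ("LG1", 14.0),
    "snp_c": ("LG1", 3.2),
    "snp_d": ("LG2", 40.1),
    "snp_e": ("LG2", 41.7),
}

# Hypothetical assembly: scaffold -> mapped markers found on it.
SCAFFOLD_MARKERS = {
    "scaffold_1": ["snp_a", "snp_b"],
    "scaffold_2": ["snp_c"],
    "scaffold_3": ["snp_d", "snp_e"],
    "scaffold_4": [],  # carries no mapped marker, so it cannot be anchored
}


def anchor_scaffolds(genetic_map, scaffold_markers):
    """Order scaffolds within linkage groups by the median cM of their markers.

    Scaffolds with no mapped markers, or with markers from more than one
    linkage group (a possible misassembly), are returned as unplaced.
    """
    placed = defaultdict(list)
    unplaced = []
    for scaffold, markers in scaffold_markers.items():
        hits = [genetic_map[m] for m in markers if m in genetic_map]
        groups = {lg for lg, _ in hits}
        if len(groups) != 1:
            unplaced.append(scaffold)
            continue
        lg = groups.pop()
        placed[lg].append((median(cm for _, cm in hits), scaffold))
    ordered = {lg: [s for _, s in sorted(items)] for lg, items in placed.items()}
    return ordered, unplaced


ordered, unplaced = anchor_scaffolds(GENETIC_MAP, SCAFFOLD_MARKERS)
print(ordered)
print("unplaced:", unplaced)
```

The unplaced list illustrates the point made in the text: when marker density is low relative to the number of scaffolds, many scaffolds carry no mapped marker and cannot be anchored.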

Application of genomic information to advanced breeding

While the conifer species used in commercial production worldwide have undergone a degree of artificial selection, which has improved productivity and quality of plantations (White et al. 2007), most are regarded as undomesticated and still occur in large, broadly distributed populations representing large pools of untapped genetic potential. These obligate out-crossing species are highly heterozygous and harbor diverse alleles that await discovery through high-throughput surveys. Provided with an understanding of how different combinations of alleles determine economic performance as well as fitness, future tree breeding and conservation programs will likely use cataloged individual genomes for purposes of controlled matings, for preserving diversity, and for selecting optimal combinations of alleles (Nelson and Johnsen 2008; Neale and Kremer 2011) using techniques such as genomic selection (Grattapaglia and Resende 2011). Iwata et al. (2011) have, in fact, used a modeling approach to demonstrate how genomic approaches may be optimized for greater efficiency in conifer tree breeding. The high degree of synteny between conifer species and the increasing availability of high-density genetic maps should bring the power of these genomics-based tree improvement approaches even to species that previously received relatively little attention (Liewlaksaneeyanawin et al. 2009; Chancerel et al. 2011; Jermstad et al. 2011; Pavy et al. 2012).

Improved understanding of conifer biology through genomics

The key to realizing the vision of both enhancing and preserving conifers is a deep understanding of the relationships between specific alleles and phenotypic traits, as well as the influence of environmental conditions on trait expression. Nowadays, knowledge of the function and relationship to phenotype for specific genes and alleles is growing rapidly in other plants and important strides have been made toward understanding these relationships in conifers. The possibilities for linking this information via conifer genomic sequences hold enormous potential for rapidly improving our understanding of fundamental conifer biology and for identifying genes and gene networks critical to the performance of these trees for commercial purposes as well as in the wild (Neale and Ingvarsson 2008; Dean 2011).

Unlocking the conifer genome will have several economic and environmental benefits. A fundamental understanding of the genes that control wood formation, responses to attack by invasive pests or to environmental risks, and the capacity for better and faster growth as well as efficient carbon sequestration will result in the availability of new tools to breed robust and well-adapted trees that assimilate CO2 from the atmosphere, or produce renewable bio-based products, including bio-fuels, more efficiently in a changing climate. This will accelerate the selection process and the discovery of new ways to capture the ecological and economic value contained in the genetic information of conifers.

These resources will also make it possible to proceed with genome-scale investigations that have so far only resided in our imaginations. For example, is there a biological basis for the rarity of polyploid conifers? What constitutes the microbiome of a conifer and to what degree does this microbiome influence conifer responses to the environment? Have conifers found a way to prevent viral infections or have we just overlooked conifer viruses as the agents of pathologies of unknown cause? Have the long life spans and generation times of conifers led their genomes to retain genes whose expression is only needed to guard against selective processes that occur on the order of millennia? Could such ‘hidden’ genes allow conifers to adapt more rapidly than expected to challenges such as climate change? In addition to such large-scale organismal questions, it stands to reason that having reference genome sequences for conifers will also make it possible to study a variety of small-scale molecular questions. For example, to what extent do uncharacterized types of genotypic variation, including copy number variation, regulatory non-coding RNAs, transposable elements and epigenetic imprinting, contribute to the diversity significant for adaptation?

The sequencing of conifer genomes is expected to rapidly fill these key gaps in our knowledge. We will thus be able to more accurately identify the natural stores of genetic diversity in conifers and, using this knowledge, devise plans to protect and preserve this key genetic resource. This knowledge could facilitate the identification of genetic mechanisms and their association with traits of interest, provided that databases and other information resources are designed with foresight. The transfer of knowledge will be facilitated by adopting standard nomenclature and data structure conventions that already exist in the genomic research communities. Development of these information resources for conifers requires new investment, and will benefit greatly from the wide range of genomic tools and resources for diverse organisms.

Perhaps the most critical investment going forward will be in the human capital associated with conifer genomics. Even with great technological advances, conifer genome projects are large-scale and have long time horizons. To fully realize the goals of this work will require the coordinated efforts of many researchers across national and international boundaries, through continuous assessment of progress and planning of future work.