5.1 Plant Material

Wild samples of P. mume were collected in Tongmai, Bomi County, Tibet, China, which is the western-end region of the origin of domesticated P. mume (Fig. 5.1), and were sequenced on the Illumina Genome Analyzer II.

Fig. 5.1

GPS-based map of the sampling locations of P. mume in Tongmai Town, Tibet. Number 4 marks the location of the sample used for P. mume sequencing

The genetic maps that were used to develop the integrated map for anchoring the scaffolds were derived from F1 populations, totalling 260 individuals from the cross between ‘Fenban’ and ‘Kouzi Yudie’ from Qingdao Meiyuan.

5.2 Methods

5.2.1 DNA Preparation and Whole-Genome Shotgun Sequencing

We used a whole-genome shotgun sequencing strategy with the Illumina Genome Analyzer. Total DNA was extracted from fresh young leaves of wild specimens from Tongmai Town, Tibet, China, using the plant genomic DNA extraction kit (TIANGEN, Beijing, China) in accordance with the manufacturer’s instructions. To obtain sufficient DNA to construct and sequence the libraries, we performed whole-genome amplification. We constructed eight paired-end sequencing libraries with insert sizes of approximately 180 base pairs (bp), 500 bp, 800 bp, 2 kb, 5 kb, 10 kb, 20 kb and 40 kb. In total, we generated 50.4 Gb of paired-end data with read lengths ranging from 50 to 100 bp. These data were checked and filtered as described previously (Li et al. 2010). Low-quality reads were removed if they met any of the following criteria: (a) ‘N’ bases made up more than 10% of the read length; (b) the read contained more than 40 bp of low-quality bases (quality score ≤ 5); (c) the read contained more than 10 bp of adapter sequence (allowing ≤ 2-bp mismatches); (d) the two reads of a small-insert pair overlapped each other (≥10 bp); or (e) Read1 and Read2 of a pair were completely identical, in which case the pair was considered an artefact of PCR. After applying these stringent criteria, 28.4 Gb of high-quality data were used for de novo genome assembly.
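The read-level filters listed above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the `Read` class, its field names and the precomputed `adapter_bp` value are hypothetical, and the overlap check for small-insert pairs is omitted because it requires an alignment step.

```python
from dataclasses import dataclass


@dataclass
class Read:
    seq: str             # base calls, e.g. "ACGTN..."
    quals: list          # one integer quality score per base
    adapter_bp: int = 0  # adapter-matching bases (assumed to be precomputed)


def fails_read_filters(read, max_n_frac=0.10, max_lowq_bp=40,
                       lowq_cutoff=5, max_adapter_bp=10):
    """Return True if a single read violates the N, quality or adapter criteria."""
    if not read.seq:
        return True
    # Criterion: 'N' bases over 10% of the read length
    if read.seq.upper().count("N") / len(read.seq) > max_n_frac:
        return True
    # Criterion: more than 40 bp with quality score <= 5
    if sum(1 for q in read.quals if q <= lowq_cutoff) > max_lowq_bp:
        return True
    # Criterion: more than 10 bp of adapter sequence
    return read.adapter_bp > max_adapter_bp


def keep_pair(read1, read2):
    """Keep a pair only if both reads pass and the two reads are not identical."""
    if fails_read_filters(read1) or fails_read_filters(read2):
        return False
    # Identical Read1 and Read2 are treated as a PCR artefact
    return read1.seq != read2.seq
```

In practice the thresholds would be passed through from the pipeline configuration; the defaults here simply mirror the values quoted in the text.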

5.2.2 K-mer Analysis

We determined the relationship between sequencing depth and the copy number of K-mers (a K-mer is a sequence of K base pairs, e.g. a 17-mer). When the sequencing error rate, heterozygosity rate and repeat rate of the genome are ignored, the K-mer frequency distribution should follow a Poisson distribution. The size of the genome was estimated as the total length of the sequence reads divided by the sequencing depth; the peak value of the frequency curve represents the overall sequencing depth. In K-mer terms, we estimated the genome size as G = (N × (L − K + 1) − B)/D, where N is the total number of sequence reads, L is the average length of sequence reads and K is the K-mer length, defined as 17 bp. Here, B is the total number of low-frequency (frequency ≤ 1 in this analysis) K-mers, G is the genome size and D is the overall depth, estimated from the K-mer distribution. It must be pointed out that, because the K-mer distribution approximates a Poisson distribution, not all low-frequency K-mers are errors. Subtracting all of them might lead to an underestimate of the genome size, especially at low sequencing depths.
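As a worked illustration of the estimate G = (N × (L − K + 1) − B)/D, the following Python sketch evaluates the formula directly. The function and argument names are ours, and the numbers in the usage note are made up for illustration; they are not values from this study.

```python
def estimate_genome_size(n_reads, mean_read_len, k, low_freq_kmers, peak_depth):
    """Evaluate G = (N * (L - K + 1) - B) / D.

    n_reads        -- N, total number of sequence reads
    mean_read_len  -- L, average read length in bp
    k              -- K, the K-mer length (17 in the text)
    low_freq_kmers -- B, number of low-frequency K-mers treated as errors
    peak_depth     -- D, depth at the peak of the K-mer frequency curve
    """
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    total_kmers = n_reads * (mean_read_len - k + 1)  # each read yields L - K + 1 K-mers
    return (total_kmers - low_freq_kmers) / peak_depth
```

For example, one million 100-bp reads with K = 17, four million low-frequency K-mers and a peak depth of 20 give (84,000,000 − 4,000,000)/20 = 4,000,000 bp.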

5.2.3 Genome Assembly

We performed a whole-genome assembly using SOAPdenovo (Version 1.05) (Li et al. 2010) with high-quality reads. Reads were loaded into the computer memory, and a de Bruijn graph data structure was used to represent the overlaps among the reads. The graph was simplified by removing erroneous connections and resolving tiny repeats using read paths. On the simplified graph, we broke the connections at repeat boundaries and output the unambiguous sequence fragments as contigs. Before generating scaffolds, in the ‘map’ step, SOAPdenovo realigned all usable reads to the contig sequences and obtained aligned paired ends (PEs). Subsequently, the software calculated the number of shared PE relationships between each pair of contigs, weighed the rate of consistent and conflicting PEs and constructed the scaffolds step by step, from short-insert to long-insert PEs. To close the gaps inside the constructed scaffolds, which were composed primarily of repeats that were masked before scaffold construction, we used the PE information to retrieve the read pairs that had one end mapped onto a unique contig and the other located in the gap region and performed local assembly using GapCloser (Version 1.12) with these collected reads.
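The idea of representing read overlaps as a de Bruijn graph and reading off unambiguous paths as contigs can be shown with a toy Python sketch. This is only an illustration of the general technique, not SOAPdenovo's implementation: it ignores reverse complements, sequencing errors and graph simplification, and all names are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_unambiguous(graph, start):
    """Walk from `start` while the path has exactly one way forward and back."""
    in_degree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_degree[node] += 1
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        # Stop at branch points (repeat boundaries) and at cycles
        if in_degree[nxt] != 1 or nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig
```

With the two overlapping reads `ATGGCGT` and `GCGTGCA` and k = 4, the walk from `ATG` reconstructs `ATGGCGTGCA`; if two reads diverge after a shared prefix, the walk stops at the branch.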

5.2.4 Estimation of Heterozygosity Rate

The heterozygosity rate was calculated by calling heterozygous SNPs. All high-quality reads were mapped onto the genome assembly using the software SOAP2 (http://soap.genomics.org.cn/soapaligner.html) with a cut-off of less than five mismatches. Subsequently, the alignment results were analysed for SNP mining using SOAPsnp (http://soap.genomics.org.cn/soapsnp.html). Sites that met the following criteria were termed criterion-effective sites: (a) the quality score of the consensus genotype in the SNP mining result is greater than 20; (b) the mapped best and second-best bases are supported by at least four unique reads; (c) the sequencing depth is more than 10X; and (d) SNPs are at least 5 bp away from each other. Criterion-effective sites at which the number of reads supporting the best base was smaller than four times the number of reads supporting the second-best base (reads supporting best base/reads supporting second-best base < 4) were identified as heterozygous sites. Finally, the heterozygosity rate was estimated as the number of heterozygous sites divided by the number of criterion-effective sites.
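The site criteria and the final ratio can be sketched in Python as follows. This is one possible reading of the criteria, not the authors' script: the `Site` fields are hypothetical, criterion (b) is interpreted here as the combined read support of the best and second-best bases, and the distance to the nearest other SNP is assumed to be precomputed.

```python
from dataclasses import dataclass


@dataclass
class Site:
    consensus_qual: int   # quality score of the consensus genotype
    best_reads: int       # unique reads supporting the best base
    second_reads: int     # unique reads supporting the second-best base
    depth: int            # sequencing depth at the site
    nearest_snp_bp: int   # distance to the nearest other SNP (assumed precomputed)


def is_effective(site):
    """Criteria (a) to (d) for a criterion-effective site."""
    return (site.consensus_qual > 20
            and site.best_reads + site.second_reads >= 4  # assumed reading of (b)
            and site.depth > 10
            and site.nearest_snp_bp >= 5)


def is_heterozygous(site):
    """Effective site with best/second-best read ratio below 4."""
    # Written as a product so that second_reads == 0 needs no special case
    return is_effective(site) and site.best_reads < 4 * site.second_reads


def heterozygosity_rate(sites):
    effective = [s for s in sites if is_effective(s)]
    if not effective:
        raise ValueError("no criterion-effective sites")
    return sum(is_heterozygous(s) for s in effective) / len(effective)
```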

5.2.5 Whole-Genome Mapping

High-quality DNA (high molecular weight, >200 kb), specific for whole-genome mapping, was prepared from fresh P. mume leaves. Whole-genome shotgun single-molecule restriction maps were generated with the automated Argus system (OpGen Inc., Maryland, USA). First, DNA molecules were deposited onto silane-derivatised glass surfaces in MapCards (OpGen Inc., MD, USA) and digested with Nhe I or BamH I for 20 min. Then, the DNA was stained with JOJO fluorescent dye (Invitrogen, CA, USA) and imaged with the Argus system. Overall, 243,174 single-molecule restriction maps (SMRMs) (>250 kb) with an average size of 344 kb were generated. Total size was approximately 83.6 Gb.

When using whole-genome mapping data to extend scaffolds, the original scaffold sequences were first digested in silico to generate corresponding restriction maps for each scaffold. Then, the in silico restriction maps were used as seeds to identify single molecules by length-based alignment using the Genome-Builder™ software package (http://www.opgen.com/products-and-services/softwares/genome-builder). These single-molecule maps were assembled together to extend the scaffolds with consensus restriction maps. Meanwhile, the scaffolds with low-coverage regions at both ends were trimmed, and only the high-quality extensions remained.
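The in silico digestion step can be illustrated with a short Python sketch that converts a sequence into an ordered list of restriction fragment lengths. It is not the Genome-Builder implementation; the function name and the enzyme table layout are ours. The recognition sites used are the standard ones for Nhe I (G^CTAGC) and BamH I (G^GATCC), both palindromic, so only one strand needs to be scanned.

```python
# Recognition motif and cut offset within the motif (cut after the first base)
ENZYMES = {
    "NheI": ("GCTAGC", 1),
    "BamHI": ("GGATCC", 1),
}


def in_silico_digest(sequence, enzyme):
    """Return the ordered fragment lengths (bp) produced by one enzyme."""
    motif, cut_offset = ENZYMES[enzyme]
    sequence = sequence.upper()
    cuts = []
    start = sequence.find(motif)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(motif, start + 1)
    # Fragment lengths are the distances between consecutive cut positions
    bounds = [0] + cuts + [len(sequence)]
    return [right - left for left, right in zip(bounds, bounds[1:])]
```

The resulting list of fragment lengths is the kind of ordered restriction map that can then be compared with single-molecule maps by length.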

To extend a sufficient number of scaffolds, this alignment-assembly process was iterated four times. All pairwise alignments were considered initial candidates for scaffold connection. When conflicts occurred, only the alignments with the highest scores were retained. Super-scaffolds were constructed using the adjacent overlapping relationships between scaffolds; simultaneously, the orientation between each pair of scaffolds could be determined. Details on scaffold alignment and orientation can be obtained from the manufacturer (OpGen).

5.2.6 Identification of RAD Markers

Young leaves of the 260 F1 seedlings from the cross between ‘Fenban’ and ‘Kouzi Yudie’, and of their parents, were collected for DNA extraction. Genomic DNA was isolated from the leaves using the Plant Genomics DNA kit (TIANGEN, Beijing, China) according to the manufacturer’s recommendations.

The RAD protocols were the same as in Chutimanitsakun et al. (2011), except that we used EcoR I (recognition site: 5′G^AATTC3′). The F1 plants were pooled in groups of 24 per sequencing library, with nucleotide multiplex identifiers (4, 6 and 8 bp). Approximately 830 Mb of 50-bp reads (3.1 Mb of read data for each progeny on average) were generated on the HiSeq 2000 NGS platform. SNP calling was performed using the SOAP2 + SOAPsnp pipeline.

5.2.7 Genetic Map Construction and Scaffold Anchoring

A total of 260 F1 seedlings of the cross between ‘Fenban’ and ‘Kouzi Yudie’ were used to construct the linkage map. Linkage analysis was performed using JoinMap version 3.0 (Van Ooijen and Voorrips 2001). The RAD-based SNP markers were first tested against the expected segregation ratios: SNPs heterozygous in both parents were expected to segregate at a 1:2:1 ratio, and SNPs heterozygous in one parent and homozygous in the other were expected to segregate at a 1:1 ratio. Markers showing segregation distortion (chi-square test, p < 0.01) were excluded before the genetic map was constructed. Subsequently, reads that contained SNP markers were aligned to the scaffolds using BLAT (Kent 2002), and only uniquely aligned SNPs with at least 87.5% identity and coverage ≥ 0.90 were retained. An LOD score of 12.0 was initially set as the linkage threshold for linkage group identification. Eight linkage groups, matching the number of P. mume chromosomes, were formed at this threshold. All SNP markers were used to construct the P. mume consensus map with the CP population model in JoinMap version 3.0.
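The segregation test can be illustrated with a small Python sketch of a chi-square goodness-of-fit test against a 1:1 or 1:2:1 ratio. It is a stand-alone illustration, not the JoinMap procedure; the function names are ours, and the p-value uses the closed forms of the chi-square survival function for one and two degrees of freedom so that only the standard library is needed.

```python
import math


def segregation_test(observed, ratio):
    """Chi-square goodness-of-fit test of genotype counts against a ratio.

    observed -- genotype class counts, e.g. [65, 130, 65]
    ratio    -- expected ratio, e.g. (1, 2, 1) or (1, 1)
    Returns (chi-square statistic, p-value).
    """
    if len(observed) != len(ratio):
        raise ValueError("observed and ratio must have the same length")
    total = sum(observed)
    expected = [total * r / sum(ratio) for r in ratio]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    if df == 1:
        p_value = math.erfc(math.sqrt(stat / 2))  # exact for 1 degree of freedom
    elif df == 2:
        p_value = math.exp(-stat / 2)             # exact for 2 degrees of freedom
    else:
        raise ValueError("only 1:1 and 1:2:1 style ratios are supported")
    return stat, p_value


def is_distorted(observed, ratio, alpha=0.01):
    """True if the marker deviates from the expected ratio at p < alpha."""
    return segregation_test(observed, ratio)[1] < alpha
```

For instance, counts of 100 and 160 among 260 progeny deviate strongly from 1:1 and would be flagged, whereas counts of 120 and 140 would not.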

To reduce the complexity caused by scaffolds that were anchored to hundreds of SNP markers, a tag SNP was selected from each scaffold with multiple SNPs. We calculated the recombination fractions between all pairs of SNPs in a scaffold and chose the SNP with the smallest summed recombination fraction. Tag SNPs were used to identify the order of scaffolds. Subsequently, two marginal SNPs were used to orient the scaffolds. Scaffolds with only one SNP marker could be anchored but not oriented due to a lack of markers. One hundred scaffolds (18% of 567 scaffolds) were labelled ‘uncertain orientation’.
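The tag SNP choice described above amounts to picking, within each scaffold, the marker whose recombination fractions to all other markers on that scaffold sum to the smallest value. A minimal Python sketch, with hypothetical names and a simple dictionary of pairwise recombination fractions as input:

```python
def pick_tag_snp(snps, recomb_fraction):
    """Return the SNP with the smallest summed recombination fraction.

    snps            -- list of SNP identifiers on one scaffold
    recomb_fraction -- dict mapping frozenset({snp_a, snp_b}) to the
                       pairwise recombination fraction
    """
    if not snps:
        raise ValueError("scaffold has no SNPs")

    def summed_fraction(snp):
        return sum(recomb_fraction[frozenset((snp, other))]
                   for other in snps if other != snp)

    return min(snps, key=summed_fraction)
```

A scaffold with a single SNP simply returns that SNP, which matches the case in the text where a scaffold can be anchored but not oriented.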

5.2.8 Identification of Repetitive Elements

There are two main types of repeats in the genome: tandem repeats and interspersed repeats. We used Tandem Repeats Finder (Version 4.04) (Benson 1999) to identify tandem repeats and Repbase (composed of numerous transposable elements, Version 15.01) to identify interspersed repeats in the P. mume genome. We identified transposable elements in the genome at the DNA and protein levels. For the former, RepeatMasker (Version 3.2.7) was applied using a custom library (a combination of Repbase and a de novo transposable element library of the P. mume genome). For the latter, RepeatProteinMask, an updated tool in the RepeatMasker package, was used to conduct RM-BlastX searches against the transposable element protein database (Jurka et al. 2005). Identified repeats were classified into various categories.

5.2.9 Gene Prediction

To predict genes, four approaches were used: de novo prediction, the homology-based method, the EST-based method and the alignment of transcript sequences to the genome. For de novo prediction, Augustus (Stanke et al. 2006), GENSCAN (Salamov and Solovyev 2000) and GlimmerHMM (Majoros et al. 2004) were used with parameters trained on Arabidopsis thaliana. For the homology search, we mapped the protein sequences of four sequenced plants (Cucumis sativus, Carica papaya, Fragaria vesca and A. thaliana) onto the P. mume genome using TBLASTN, with an E-value cut-off of 1e−5; homologous genomic sequences were aligned against matching proteins using GeneWise (Birney et al. 2004) for accurate spliced alignments.

In the EST-based prediction, 4699 ESTs of P. mume were aligned against the P. mume genome using BLAT (identity ≥ 0.95, coverage ≥ 0.90) to generate spliced alignments. The de novo gene sets (28,610–36,095 genes), the four homology-based gene sets (24,277–29,586 genes) and the EST-based gene set (2001 genes) were combined by GLEAN (Elsik et al. 2007) into a consensus gene set. Short genes (CDS length < 150 bp) and low-quality genes (gaps in more than 10% of the coding region) were filtered out. To finalise the gene set, we aligned RNA-Seq data from buds, fruits, leaves, roots and stems to the genome using TopHat (Version 1.2.0, implemented with bowtie1 Version 0.12.5) (Trapnell et al. 2009), and the alignments were used as input for Cufflinks (Version 0.93) (Trapnell et al. 2010) with default parameters. Open reading frames (ORFs) of those transcripts were predicted using structure parameters trained on perfect genes from homology-based predictions. Finally, based on their coordinates on the genome sequences, we manually combined the GLEAN gene set and the ORFs of transcripts to form the final gene set, which contained 31,390 genes.

5.2.10 RNA-seq Data Generation

Using TRIzol (Invitrogen), RNA was purified from five fresh tissues (bud, fruit, leaf, root and stem). The RNA sequencing libraries were constructed using the mRNA-Seq Prep kit (Illumina, San Diego, USA). Briefly, first-strand cDNA synthesis was performed with oligo-T primer and Superscript II reverse transcriptase (Invitrogen). The second strand was synthesised with Escherichia coli DNA Pol I (Invitrogen). Double-stranded cDNA was purified with a Qiaquick PCR purification kit (Qiagen) and sheared with a nebuliser (Invitrogen) to 100–500-bp fragments. After end repair and addition of a 3′-dA overhang, the cDNA was ligated to Illumina PE adapter oligo mix (Illumina) and size selected to 200 ± 20-bp fragments by gel purification. After 15 cycles of PCR amplification, the 200-bp paired-end libraries were sequenced using the paired-end sequencing module (90 bp at each end) of the Illumina HiSeq 2000 platform.

5.2.11 Gene Annotation

Genes were aligned to the Swiss-Prot (release 2011.6) and TrEMBL (release 2011.6) databases using BLASTP (1e−5) to determine the best match of the alignments. Motifs and domains of the genes were identified using InterProScan (Version 4.5) against the protein databases Pfam (release 24.0), PRINTS (release 40.0), PROSITE (release 20.52), ProDom (release 2006.1) and SMART (release 6.0). Gene ontology IDs for each gene were obtained from the corresponding InterPro entry. The genes were aligned against KEGG proteins (release 58), and the matches were used to assign KEGG pathways.

5.2.12 Identification of Noncoding RNA Genes

The tRNA genes were predicted by tRNAscan-SE (Version 1.23) (Lowe and Eddy 1997). For rRNA identification, the rRNA template sequences (e.g. A. thaliana and rice) were aligned against the P. mume genome using BLASTN to identify possible rRNAs. Other noncoding RNAs, including miRNA and snRNA, were identified using INFERNAL (Version 0.81) by searching against the Rfam database (Release 9.1).

5.2.13 Comparative Genome Analysis

Paralogous and orthologous genes were identified by BLASTP search (E-value cut-off 1e−5). After removing self-matches, syntenic blocks (≥5 genes per block) were identified using MCscan (Tang et al. 2008). The aligned results were used to generate dot plots; for self-aligned results, each block represents a paralogous region that arose from genome duplication, and for inter-aligned results, each block represents an orthologous region that was derived from a common ancestor. We calculated 4DTv (the number of transversions at fourfold degenerate sites) for each gene pair in a block and plotted the distribution of 4DTv values to estimate the timing of speciation or WGD events.

5.2.14 Identification of Duplicate and Syntenic Regions

Three new parameters were used to identify paralogous and orthologous relationships between P. mume, Malus × domestica, F. vesca and Vitis vinifera by BLASTN. Paralogous gene pairs that were identified during duplication analysis in P. mume and M. × domestica, respectively, and orthologous gene pairs that were identified by colinearity analysis between P. mume and M. × domestica, F. vesca and M. × domestica, P. mume and V. vinifera were validated by CloseUp (Hampson et al. 2004) analysis. Based on the syntenic and duplication relationships, Rosaceae ancestral chromosomes were reconstructed, and the paleo-history was analysed.

5.2.15 Identification of CBF and BEAT Genes

The CBF genes of P. mume were identified with A. thaliana CBF genes using BLASTP (E-value < 1e−10, identity > 30% and coverage > 70%). The BEAT genes of P. mume were identified with BEAT genes (GenBank ID: AF043464) using BLASTP (E-value < 1e−10, identity > 30% and coverage > 70%).
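The three BLASTP thresholds can be applied to parsed hits with a filter such as the Python sketch below. The dictionary keys are hypothetical, and coverage is assumed here to mean alignment length as a percentage of query length; the text does not state which definition was used.

```python
def passes_thresholds(hit, max_evalue=1e-10, min_identity=30.0, min_coverage=70.0):
    """Apply the E-value, identity and coverage cut-offs to one BLASTP hit.

    hit -- dict with keys 'evalue', 'identity' (percent),
           'aln_len' and 'query_len' (both in residues)
    """
    coverage = 100.0 * hit["aln_len"] / hit["query_len"]  # assumed definition
    return (hit["evalue"] < max_evalue
            and hit["identity"] > min_identity
            and coverage > min_coverage)


def filter_hits(hits):
    """Keep only the hits that satisfy all three thresholds."""
    return [hit for hit in hits if passes_thresholds(hit)]
```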

5.3 Results

5.3.1 Sequencing and Assembly

To construct the reference genome of P. mume, we initially sequenced two domesticated P. mume samples using the Illumina Genome Analyzer II. Both samples had high heterozygosity, as estimated using K-mer statistics (Fig. 5.2a, b), and the de novo assembly results from current NGS algorithms did not meet the reference quality standards. We then considered wild samples due to their lack of artificial grafting and asexual reproduction. We collected wild samples of P. mume in Tongmai, Bomi County, Tibet, China, which is the western-end region of the origin area of domesticated P. mume (Xing et al. 2009). In a previous study (Chen 1995; Xing et al. 2009), we determined that the origin area of P. mume was confined to an area of approximately 0.7 km², and that the primary distribution area was less than 0.3 km², below an elevation of 2230 m. Samples from this region were highly homozygous due to generations of self-fertilisation in a geographically isolated environment. We chose one such wild sample for sequencing, and its low heterozygosity was confirmed by sequencing and K-mer statistics (Fig. 5.2c).

Fig. 5.2

K-mer analysis. a, b Domesticated samples. c The wild P. mume sample used for genome assembly. The x-axis represents depth (X); the y-axis represents the proportion of K-mers at that depth. (Without consideration of the sequence error rate, heterozygosity rate and repeat rate of the genome, the 17-mer distribution should follow a Poisson distribution. In the actual data, because of sequencing errors, low-depth 17-mers account for a large proportion.)

We generated 50.4 Gb of sequencing data for this wild sample using the Illumina Genome Analyzer II, including three short-insert (180–800 bp) paired-end libraries and five large-insert (2–40 Kb) mate-pair libraries (Table 5.1). SOAPdenovo (Li et al. 2010), a genome assembler that is based on the de Bruijn graph, was used to assemble the P. mume genome. Based on K-mer statistics, the P. mume genome was estimated to be 280 Mb (Fig. 5.2c). Approximately 84.6% (237 Mb) of the genome was assembled. The contig N50 of the assembled sequence was 31.8 Kb (longest, 201.1 Kb), and the scaffold N50 was 577.8 Kb (longest, 2.87 Mb) (Table 5.2). By mapping raw reads back to the draft genome, we observed a heterozygosity rate of 0.03% in this wild sample, supporting our assumption of low heterozygosity in the wild sample. We identified 125,383 bp and 19,897 bp of sequence in the assembly that were similar to chloroplast and mitochondrial sequences, respectively. Although most of the regions were shorter than 1000 bp, we found a ~25,000-bp chloroplast-similar region, which might be plastid sequence or a plastid-transferred nuclear fragment.
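For readers unfamiliar with the N50 statistic quoted above, the following Python sketch shows the usual definition: the length of the shortest contig or scaffold in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name is ours, and the example lengths are illustrative only.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the cumulative sum reaches half the total
        if running >= half_total:
            return length
```

For lengths 2, 3, 4, 5, 6, 7, 8, 9 and 10 (total 54), the three longest sequences sum to 27, so the N50 is 8.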

Table 5.1 Construction of libraries, generation and filtering of sequencing data used for genome assembly
Table 5.2 Statistics of P. mume genome assembly

To improve the assembly of the P. mume genome, we performed whole-genome mapping (WGM), an automated high-throughput optical mapping method (Zhou et al. 2004), to generate an entire genomic map (Fig. 5.3a). We constructed the WGM map with a 300-fold whole-genome depth using BamH I and Nhe I independently. Through an iterative assembly strategy that combined WGM and sequence data, 170 scaffolds that had been assembled by NGS were grouped into 49 large scaffolds; thus, the scaffold N50 improved significantly from 578 Kb to 1.09 Mb (Table 5.2).
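As a rough consistency check on the roughly 300-fold mapping depth, the depth can be approximated from the single-molecule map statistics reported in the Methods (243,174 maps with an average size of 344 kb) and the estimated genome size of 280 Mb. The Python sketch below performs only this arithmetic; the function name is ours, and treating the mean map size as exact is an approximation.

```python
def mapping_depth(n_maps, mean_map_len_bp, genome_size_bp):
    """Approximate fold depth as total mapped length over genome size."""
    return n_maps * mean_map_len_bp / genome_size_bp


# Values taken from the text: 243,174 maps, 344 kb mean size, 280 Mb genome
depth = mapping_depth(243_174, 344_000, 280_000_000)
```

The total map length is about 83.65 Gb, which divided by 280 Mb gives a depth just under 300-fold, in line with the reported figures.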

Fig. 5.3

Whole-genome mapping assembly and anchoring of the P. mume genome. a Assembly of P. mume genome by whole-genome mapping. b Anchoring of the P. mume genome into eight linkage groups using 779 high-quality SNP markers. Yellow scaffolds were anchored by whole-genome mapping and SNP markers, whereas the blue scaffolds were anchored by SNP markers

Subsequently, we constructed a high-density genetic map by applying the restriction site-associated DNA (RAD) marker strategy (Chutimanitsakun et al. 2011; Baxter et al. 2011; Miller et al. 2007) in a segregating F1 population. The consensus genetic map contained eight linkage groups, consisting of a set of 1484 high-quality SNP markers (co-dominant markers), 779 of which were used in anchoring and orienting scaffolds (Fig. 5.3b). The genetic map improves the quality of the reference and should be useful in map-based cloning and further marker-assisted molecular breeding.

5.3.2 Genome Annotation

We annotated 106.8 Mb (45.0% of the assembled genome) of repetitive sequences (Table 5.3) in the P. mume genome by integrating the de novo and homology-based approaches. Transposable elements (TEs) were the predominant components, constituting 97.9% of all repetitive sequences. The long terminal repeat (LTR) Copia and Gypsy families are the most abundant TEs in the P. mume genome. The TE divergence rate suggested a lack of recent amplification (Fig. 5.4); these TE families were conserved in M. × domestica and F. vesca, although M. × domestica had a much higher proportion of the Gypsy family than P. mume and F. vesca (Table 5.4).

Table 5.3 Statistics of repeats in the P. mume genome
Fig. 5.4

Divergence rates of the transposable elements in the assembled scaffolds. The divergence rate was calculated based on the alignment between the RepeatMasker-annotated repeat copies and the consensus sequence in the repeat library

Table 5.4 Occurrence of transposable elements in sequenced Rosaceae genomes

To improve gene annotation, we generated 11.3 Gb of RNA-Seq data from five major tissues: bud, fruit, leaf, root and stem (Table 5.5). Using these data, integrated with ab initio and homology-based prediction methods, we annotated 31,390 protein-coding genes, comparable with the value in F. vesca (34,809) (Shulaev et al. 2011) and less than that found for M. × domestica (57,386) (Velasco et al. 2010), as expected. The average transcript length in P. mume was 2514 bp, with 4.6 exons per gene (Table 5.6). Approximately 82.5% of all genes could be functionally annotated, based on either known homologs or predicted sequence signatures, using Swiss-Prot, GO, TrEMBL (Bairoch and Apweiler 2000), InterPro (Zdobnov and Apweiler 2001) or KEGG (Kanehisa and Goto 2000), and we considered 98.3% of these annotations to be high confidence (Table 5.7). We noted that 25,854 (82.6%) gene models were expressed, of which 768, 308, 240, 762 and 179 genes were expressed solely in buds, fruits, leaves, roots and stems, respectively. We further integrated 82,832 peach EST sequences; together with the RNA-seq data, 85.1% of gene models were supported by transcripts. We also annotated the noncoding RNA genes in the current assembly, including 508 tRNA, 209 miRNA, 125 rRNA and 287 snRNA genes (Table 5.8).

Table 5.5 List of tissues and reads for whole-transcriptome sequencing mapped onto the P. mume genome
Table 5.6 General statistics of predicted protein-coding genes
Table 5.7 Functional annotation of predicted genes with homology or functional classification by each method
Table 5.8 Noncoding RNA gene fragments in the current P. mume assembly

5.3.3 Genome Evolution

Genome self-alignment via MCscan (Tang et al. 2008) (Fig. 5.5) and the 4DTv (the number of transversions at fourfold degenerate sites) distribution of duplicated pairs indicated that no recent whole-genome duplication occurred after P. mume diverged from M. × domestica (Fig. 5.6a).

Fig. 5.5

Whole-genome duplication in the P. mume genome mapped using gene collinear order information. Syntenic blocks are formed by red or blue dots, representing best hits across any two chromosomes in the same or opposite direction, respectively

Fig. 5.6

Evolution of P. mume. a 4DTv distribution of duplicate gene pairs in P. mume and M. × domestica, calculated based on alignment of codons with the HKY substitution model. b The duplication of P. mume by paralogous pairs in the P. mume genome (chromosomes Chr1 to Chr8). Each line represents a duplicated gene. The seven colours reflect the seven ancestral eudicot linkage groups (A1, A4, A7, A10, A13, A16 and A19)

We examined the paleo-history in Prunus and found that there was a triplicated arrangement (ancestral γ event). We aligned 27,819 gene models to the seven paleo-hexaploid ancestor chromosome groups in grape (Jaillon et al. 2007) and identified that the colinearity blocks contained 2772 orthologs; the extent of these blocks covered 78.1% of the P. mume genome.

Further, we aligned 27,819 P. mume gene models to themselves and identified seven major blocks of duplication, which corresponded to 194 gene pairs, covering 38.5% of the anchored genome. The chromosome-to-chromosome relationships P5-P7, P2-P4-P8, P1-P2-P4-P6, P1-P5, P2-P5-P8, P2-P4-P7 and P3-P4 (Fig. 5.6b) suggested that triplicated arrangement (γ event) marks remained in the P. mume genome. Based on the evidence of paleo-hexaploidisation (γ event) and lineage-specific duplications in eudicots, it was possible to examine chromosomal changes during the evolution of P. mume and other Rosaceae species.

5.3.4 Reconstruction of Ancestral Chromosomes of Rosaceae

We reconstructed nine ancestral chromosomes of the Rosaceae family and determined the history of chromosome fusion, fission and duplication in the three major Rosaceae subfamilies. Previous studies have reported the eudicots ancestor with seven proto-chromosomes (Jaillon et al. 2007) and the grass ancestor with five proto-chromosomes (Salse et al. 2009). Using M. × domestica as a reference, we analysed the syntenic relationships between the sequenced Rosaceae genomes of P. mume (P), F. vesca (F) and M. × domestica (M).

We identified 151 blocks that contained 4546 orthologous genes, covering 96.9% of the anchored P. mume genome, between P. mume and M. × domestica, versus 132 blocks with 2031 orthologous genes, covering 88.8% of the F. vesca genome, between F. vesca and M. × domestica. The chromosome-to-chromosome orthologous relationships are shown in Fig. 5.7.

Fig. 5.7

Synteny between P. mume, F. vesca and M. × domestica. Schematic representation of the orthologs identified between P. mume (P1–P8), F. vesca (F1–F7) and M. × domestica (M1–M17). Each line represents an orthologous gene. The nine different colours represent the blocks reflecting the origin from the nine ancestral Rosaceae linkage groups

Combining intergenomic and intragenomic analysis of the Rosaceae genomes, we noted the following primary chromosome pair combinations: M5-M10/P3-P6/F2-F3, M3-M11/P1-P3-P6/F2-F3-F5, M9-M17/P2-P4/F1-F6, M13-M16/P2/F4, M4-M12-M14/P1-P2-P5-P8/F5-F6, M5-M14/P7/F5, M1-M2-M7/P5/F7, M8-M15/P2/F2-F5 and M1-M2-M15/P1-P2-P3-P8/F1 (Fig. 5.7). We reconstructed a putative ancestral genome of Rosaceae and proposed an evolutionary scenario of P. mume, M. × domestica and F. vesca from the putative nine-chromosome ancestor (Fig. 5.8).

Fig. 5.8

Evolutionary model of the Rosaceae genome. The Rosaceae ancestor chromosomes are represented by nine colours. The various evolutionary processes from the common ancestor are indicated as R (whole-genome duplication (WGD)) and F (fusions of chromosomes). In the second layer, different colours in each chromosome represent the origin from the common ancestral chromosomes. The current structure of the Rosaceae genome is shown at the bottom of the figure. In some regions, we were not able to determine which ancestral chromosome they came from, and those regions are therefore represented as white spaces

In P. mume, chromosomes 4, 5 and 7 did not undergo rearrangement, coming directly from ancient chromosomes III, VII and VI, respectively. Chromosome 1 came from ancestral chromosomes II, V and IX. Chromosome 2 originated from ancient chromosomes IV and VIII, into which some of chromosomes III and V were inserted. Chromosome 3 came from chromosomes I, II and IX; chromosome 6 came from I, II and V; and chromosome 8 originated from ancient chromosomes V and IX. Thus, we hypothesised that at least eleven fissions and eleven fusions occurred in P. mume from the nine common ancestral chromosomes. For M. × domestica, at least one whole-genome duplication (WGD) and five fusions took place to reach the 17-chromosome structure, compared with 15 fusions for F. vesca to reach the 7-chromosome structure.

5.3.5 Early Blooming of P. mume

P. mume is nearly the first tree to bloom in early spring, blooming even at temperatures below 0 °C. Thus, P. mume has a specific mechanism to acclimate to cold weather and to release itself from dormancy. The dormancy-associated MADS-box (DAM) transcription factor family, which is related to dormancy induction and release (Sasaki et al. 2011), was identified in the P. mume assembly, and all six DAM genes were found in a tandem array (Fig. 5.9a, Table 5.9). In a previous study, the authors identified six DAM genes, all of which were transcriptional repressors in P. mume (Sasaki et al. 2011). They also found that all DAM genes were repressed during prolonged exposure to cold and maintained at low levels until endodormancy release. The authors of another study observed that expression of DAM3, DAM5 and DAM6 was suppressed by chilling temperatures, reaching a minimum at bud break in Prunus persica (Jiménez et al. 2010). These findings suggest that DAM inhibits dormancy release and that its expression is suppressed during prolonged exposure to cold to allow bud release from dormancy.

Fig. 5.9

Six tandemly arrayed DAM genes in P. mume. a Overview of PmDAM in the P. mume genome. Six PmDAM genes are located as tandem repeats. b Structures of PmDAM genes in P. mume. Boxes and lines represent exons and introns, whereas red points represent CBF-binding sites

Table 5.9 DAM gene orthologs in P. mume

We hypothesised that the DAM genes explained the early dormancy release in P. mume. To this end, we examined the phylogenetic relationships between DAM genes in P. mume, using the PHYML 3.0 software. The molecular evolution models of the six tandem DAM genes in P. mume suggested that they were derived from serial duplication events in the following order: PmDAM1, PmDAM3, PmDAM2, PmDAM5, PmDAM4 and PmDAM6 (Fig. 5.10). This model is consistent with previous studies in the peach genome (Jiménez et al. 2009), and we suggest that these duplication events are unique to the Prunus subfamily, as we did not find these tandem DAM genes in M. × domestica or F. vesca. Two other programs, MEGA 4 (Tamura et al. 2007) and PAML4 (Yang 2007), gave similar results. Estimation of pairwise dN and dS rates using MEGA 4 indicated significant purifying selection and no significant positive selection on the six sequences. In an analysis of P. mume DAM genes, most sites were highly conserved, with a dN/dS rate ratio near 0 or nearly neutral.

Fig. 5.10

Maximum likelihood rooted tree of six P. mume DAM genes; PtMADS27 was used as outgroup

In addition to DAM, the C-repeat binding transcription factor (CBF) mediates the establishment of early dormancy release: overexpression of PpCBF1 in M. × domestica results in a strong sensitivity to short day lengths and induced dormancy at optimal growth temperatures (Wisniewski et al. 2011). Horvath et al. (2010) noted that EeDAM1 was cold stress-responsive and contains putative CBF-binding sites, which are cis-regulating motifs that are targeted by the cold/drought stress CBF regulon, in the 2000-bp region upstream of the EeDAM1 translation start codon. This finding suggests that CBF controls the cold-responsive EeDAM1 gene (Horvath et al. 2010). Similar to EeDAM1, conserved CBF sites were identified in the 1000-bp region upstream of the translation start codons in DAM4 to DAM6 in peach and Japanese apricot (Jiménez et al. 2009; Sasaki et al. 2011). In the P. mume genome, we identified 13 CBF orthologous genes (Table 5.10) and seven CBF regulons, late embryogenesis-abundant (LEA) proteins/dehydrins (Table 5.11). In the upstream regions of DAM genes in P. mume, we noted more putative CBF-binding sites on DAM4, DAM5 and DAM6 than in peach and found novel sites, one on DAM1 and two on DAM6 (Fig. 5.9b). We suggest that these additional sites render P. mume more sensitive to cold and result in early blooming in spring. In summary, our analyses have increased our understanding of the molecular control of dormancy, flowering regulation and acclimation to cold. The CBF and DAM genes that we identified in P. mume might facilitate the molecular breeding of fruit and ornamental plants with different blooming times.

Table 5.10 CBF orthologs in P. mume, M. × domestica, F. vesca, P. trichocarpa, V. vinifera, O. sativa and A. thaliana
Table 5.11 Dehydrin orthologs in P. mume, M. × domestica, F. vesca, P. trichocarpa, V. vinifera, O. sativa and A. thaliana

5.3.6 Floral Scent

Floral scent, determined by a complex mixture of low-molecular-weight volatile molecules, has a significant function in the reproductive processes of many plants and enhances the aesthetic properties of ornamental plants (Pichersky and Dudareva 2007). In earlier research, we determined that the dominant compound classes in P. mume flowers were benzenoid/phenylpropanoid from the cinnamic acid pathway, in addition to terpene compounds.

Genes related to the biosynthesis of volatile compounds, such as benzyl alcohol acetyltransferase (BEAT), which catalyses the synthesis of benzyl acetate (Dudareva et al. 1998; Aranovich et al. 2007), have been identified in the P. mume genome. The BEAT gene family is notably expanded in P. mume (34 members) compared with M. × domestica (16), F. vesca (14), V. vinifera (4), P. trichocarpa (17) and A. thaliana (3) (Table 5.12). Twenty-six of the 34 P. mume BEAT genes lie in clusters, the largest of which contains 12 members arranged in tandem (Fig. 5.11), suggesting that, in contrast to the other sequenced plants, the P. mume BEAT genes originated from serial duplication events.
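Grouping family members into tandem clusters can be sketched as follows. This Python example is an illustration under stated assumptions, not the criterion applied in this study: each gene is described by a hypothetical (gene_id, chromosome, rank) tuple, where rank is its position in the gene order along the chromosome, and two family members are placed in the same cluster when at most max_gap other genes lie between them.

```python
from collections import defaultdict


def tandem_clusters(family_genes, max_gap=1):
    """Group family members into tandem clusters.

    family_genes: iterable of (gene_id, chrom, rank), with rank the
                  gene's index in the gene order of its chromosome.
    max_gap:      assumed maximum number of intervening genes allowed
                  between neighbouring members of one cluster.
    Returns a list of clusters (lists of gene ids, size >= 2).
    """
    by_chrom = defaultdict(list)
    for gene_id, chrom, rank in family_genes:
        by_chrom[chrom].append((rank, gene_id))

    clusters = []
    for chrom in sorted(by_chrom):
        members = sorted(by_chrom[chrom])
        current = [members[0]]
        for rank, gene_id in members[1:]:
            if rank - current[-1][0] - 1 <= max_gap:
                current.append((rank, gene_id))
            else:
                if len(current) >= 2:
                    clusters.append([g for _, g in current])
                current = [(rank, gene_id)]
        if len(current) >= 2:
            clusters.append([g for _, g in current])
    return clusters


if __name__ == "__main__":
    # Toy coordinates, not the positions of the BEAT genes.
    genes = [("g1", "chr1", 10), ("g2", "chr1", 11), ("g3", "chr1", 13),
             ("g4", "chr1", 40), ("g5", "chr2", 5), ("g6", "chr2", 6)]
    print(tandem_clusters(genes))
```

The number of clustered genes and the size of the largest cluster then follow from the lengths of the returned lists.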

Table 5.12 Numbers of orthologous genes found in P. mume (Pm), M. × domestica (Md), P. trichocarpa (Pt), A. thaliana (At), V. vinifera (Vv), F. vesca (Fv) and O. sativa (Os) that synthesise volatile molecules
Fig. 5.11

Phylogenetic relationships of BEAT genes in P. mume (Pm), F. vesca (mrna) and A. thaliana (AT)

In summary, the expansion of the BEAT gene family might increase the content of benzyl acetate and be related to the special fragrance of P. mume. Research on the P. mume genome should allow us to breed novel aromatic cultivars and other aromatic plants in the Rosaceae family.

5.4 Discussion

We are eager to establish an appropriate reference genome for Rosid species, which include one-third of all flowering plants (Hummer and Janick 2009). Currently, draft genome sequences are available for three model Rosaceae species: M. × domestica, F. vesca and P. mume. The domesticated apple (M. × domestica), the main fruit crop of temperate regions throughout the world, is highly heterozygous and has a large genome, which creates technical challenges in assembly; its draft assembly comprises 1629 metacontigs.

The strawberry (F. vesca) has a much smaller genome of ~240 Mb, allowing functional gene studies within Rosaceae. Although the strawberry is useful for functional genomics research, most related high-value fruit plants in the Rosaceae family, such as peach (P. persica), pear (Pyrus nivalis) and cherry (Prunus avium), are woody plants, not herbaceous ones. Unfortunately, the nearest relatives of woody fruit crops usually have a cumbersome polyploid genome.

After a detailed study of the origin and current distribution of P. mume, we obtained a suitable sample from an isolated group in the area of origin of P. mume for genomic sequencing and assembly. We chose P. mume for sequencing because of its small genome of ~280 Mb and low heterozygosity. With whole-genome mapping, we increased the scaffold N50 to 1.1 Mb and constructed P. mume pseudochromosomes using 779 SNP markers from eight linkage groups. We conclude that the combination of Illumina GA, whole-genome mapping technologies and the genetic map constructed by RAD can be used to perform de novo sequencing of plant genomes, allowing high-quality, rapid and low-cost sequencing of other plant species with similar conditions.
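The N50 statistic quoted above is the length of the scaffold at which the cumulative length of scaffolds, sorted from longest to shortest, first reaches half of the total assembly length. A minimal Python sketch of this standard definition is given below; the function name and the toy lengths are illustrative and are not taken from the P. mume assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    # Toy lengths: total 54, half 27, reached at 10 + 9 + 8.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # prints 8
```

Because N50 depends only on the length distribution, it describes contiguity and not correctness, which is why it is reported here together with anchoring to the genetic map.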

The sequence of the P. mume genome is a valuable resource for biological research and breeding. Based on the sequences of P. mume, M. × domestica and F. vesca, we reconstructed nine ancestral chromosomes of the Rosaceae family and inferred how the chromosomes of these species were shaped from a common ancestor. Analysis of the P. mume genome and transcriptome can provide insights into the mechanisms of floral scent, flower dormancy and disease resistance. The genome also increases our knowledge of the evolution of the Rosaceae family and the function of the plant relative system.