Introduction

Coffee is one of the world’s major crops, and is mainly cultivated in Africa, America and Asia. Smallholders with <5 ha account for about 70 % of world coffee production and more than 80 million people depend on the crop for their income. Coffee belongs to the 4th largest flowering plant family, Rubiaceae, which consists of more than 11,000 species in 660 genera (Robbrecht and Manen 2006). The Rubiaceae family belongs to the euasterid I clade of dicotyledonous plants, to which the Solanaceae family also belongs.

Although the Coffea genus includes more than 124 species (Davis et al. 2011), commercial coffee production relies mainly on two related species: Coffea arabica L. and Coffea canephora Pierre ex A. Froehner, which account for 65 and 35 % of world coffee production, respectively (International Coffee Organization, http://www.ico.org). C. arabica is a recent allotetraploid (2n = 4x = 44) species derived from a spontaneous hybridization between two closely related diploid species, C. eugenioides S.Moore and C. canephora, whereas C. canephora (2n = 2x = 22) is an allogamous diploid tropical tree consisting of polymorphic populations and of strongly heterozygous individuals. However, producing doubled haploid plants using haploid embryos, which occur spontaneously in association with polyembryony, offers the possibility of developing completely homozygous genotypes from heterozygous parents in a single generation (Couturon and Berthaud 1982; Lashermes et al. 1994). The haploid genome of C. canephora is estimated to be 710 Mb in size (Noirot et al. 2003), over five times larger than that of the model plant Arabidopsis thaliana (125 Mb) and more than 50 % larger than that of grapevine (Vitis vinifera) or poplar (Populus trichocarpa). Both macro- and microsynteny have been reported between coffee trees (Rubiaceae) and tomato (Solanaceae), and between coffee trees and species belonging to the Rosid clade (Cenci et al. 2010, 2012; Guyot et al. 2009, 2012; Lefebvre-Pautigny et al. 2010; Mahé et al. 2007). In particular, a high level of conservation was observed between coffee and grapevine (Vitaceae). Recent phylogenetic analyses identified the Vitaceae family as the earliest diverging lineage of the Rosid clade, meaning this family can be considered as the “sister” group of all other Rosid plant species (Jansen et al. 2006).

The ability to capture and efficiently use the abundant genetic resources in coffee breeding programmes is essential for sustainable coffee production. Until recently, coffee improvement mainly relied on conventional breeding methods. Moreover, coffee is a perennial plant with a time from seed to seed of about 5 years, which makes genetic studies difficult and time consuming. While some genomic information has been generated in the last few years, it is far less than what is available for many other agricultural species. Significant advances in our understanding of the coffee genome and its biology must be achieved in the coming decades to increase coffee quality and yield, and to protect the crop from major losses caused by insect pests, diseases and abiotic stresses related to climatic change.

The availability of a large-insert genomic DNA library is indispensable for physical mapping, map-based gene cloning, and analysis of gene structure and function in most organisms including plants. Large-insert genomic libraries constructed with bacterial artificial chromosomes (BACs) are known for the high structural stability of their genomic inserts and the easy handling of the Escherichia coli host cells. The construction of a high-quality BAC library and the sequencing of BAC ends are also two of the first steps in a whole-genome sequencing project. BAC-end sequences (BESs) are obtained by bidirectional end sequencing of the genomic DNA insert, with the help of universal priming sites in the cloning vector. Although the sequences sampled may not be truly random, due to the need for specific restriction sites for the construction of a BAC library, a BES project can provide significant clues about the composition of the genome and the evolution of a given species.

The use of BESs from a large number of BACs was first proposed as a strategy for identifying overlapping clones during whole genome sequencing since BESs greatly facilitate the assembly of contigs into scaffolds (Goff et al. 2002). In spite of continued improvements in sequencing technologies and the development of whole-genome shotgun (WGS) sequencing approaches for plant genomes, BESs remain extremely helpful during the genome sequence assembly process, in particular for repeat-rich regions or genomes (Feuillet et al. 2011). Furthermore, BESs also represent a sample of the whole genome which can be used to get a first glimpse into the sequence composition of a given genome. BES analyses have been performed in a number of plants in the initial stage of genome characterisation including rice, maize, Korean ginseng, papaya, Brassica rapa, wheat 3B, Musa acuminata, white clover, Brachypodium, potato, tomato, citrus, apple and carrot (Cavagnaro et al. 2009; Cheung and Town 2007; Datema et al. 2008; Febrer et al. 2007; Han and Korban 2008; Hong et al. 2004, 2006; Huo et al. 2008; Lai et al. 2006; Mao et al. 2000; Messing et al. 2004; Paux et al. 2006; Terol et al. 2008). BESs are also a rich source of genomic simple sequence repeats (SSRs), which act as reliable landmarks across the genome during genetic mapping, as reported in plant genomes such as cotton, soybean and Brassica napus (Cheng et al. 2009; Frelichowski et al. 2006; Shultz et al. 2007).

Here, we report the generation of coffee BESs from two BAC libraries from a homozygous doubled haploid plant. They represent approximately 13 % of the C. canephora nuclear genome, and provide initial insights into the content and composition of the coffee genome. Our analysis focused on microsatellite content, repeat element composition, protein-coding regions and comparative mapping of BES pairs to other sequenced plant genomes. This study provides the first glimpse of the genome of the DH200-94 accession, which is the accession chosen for the whole-genome sequencing of C. canephora.

Materials and methods

BAC libraries

Bacterial artificial chromosome libraries of the C. canephora accession DH200-94 were constructed from high molecular weight DNA isolated from 20 g of young leaf tissue by the Arizona Genomics Institute (AGI, USA). The accession DH200-94 is a doubled haploid plant produced from the clone IF200 via haploid plants that occur spontaneously in association with polyembryony. Two restriction enzymes, HindIII and BstYI, were used for partial digestion of the mega-size DNA; the resulting fragments were cloned into the pAGIBAC1 vector (Jetty et al. 2006) and transformed into the E. coli strain DH10B (Invitrogen, United States). A total of 36,864 clones were picked for each library. In order to evaluate the average BAC insert size, 768 BAC clones (384 clones from each library) were randomly chosen and the corresponding DNA was extracted, digested with the rare-cutter enzyme NotI and analysed by PFGE.

BAC-end sequencing

Bidirectional-end sequencing of the 73,728 BAC clones was performed using ABI dye-terminator chemistry at Genoscope (Evry, France) on ABI3730 sequencers (Applied Biosystems, Foster City, United States). Trace files were processed with the PHRED software for base calling and quality trimming, using a quality score threshold of 20. Cloning vector sequences were removed using an internal procedure based on the Smith-Waterman algorithm. The sequence data were then filtered for sequences contaminated by E. coli and by plant organelle genomes, based on matches with the A. thaliana mitochondrial genome (NC001284) and C. arabica chloroplast genome (NC008535) sequences. BESs shorter than 60 bp were also removed. Sequences were deposited in EMBL-EBI Bank (accession numbers FO535330, FO538768 to FO624989, and FO624992 to FO680656).

Identification of simple sequence repeats (SSR)

A pipeline software previously developed for SSR mining (Poncet et al. 2006) was used for the identification of BESs containing microsatellites. A total of 69,066 ESTs from C. canephora (available at NCBI and downloaded in May 2012) were assembled into 26,483 unigenes (16,870,301 bp) using the TGICL program (Pertea et al. 2003). A similar SSR analysis was performed on these unigenes. The SSR program was also used to design the primer pairs and to check their high specificity. The parameters were set for detection of mono-, di-, tri-, tetra-, penta-, and hexa-nucleotide motifs with a minimum of 15, 9, 6, 5, 4, and 3 repeats, respectively. The following primer design parameters were used: primer length from 18 to 21 bp (optimum 20 bp), PCR product size from 100 to 300 bp, optimum annealing temperature 60 °C.
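The repeat thresholds given above can be illustrated with a minimal sketch. It is not the pipeline of Poncet et al. (2006): it assumes that only perfect (uninterrupted) repeats are counted, and the function name `find_ssrs` is hypothetical.

```python
import re

# Minimum repeat counts per motif length, as stated in the text:
# mono 15, di 9, tri 6, tetra 5, penta 4, hexa 3.
MIN_REPEATS = {1: 15, 2: 9, 3: 6, 4: 5, 5: 4, 6: 3}


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    Assumption: only perfect repeats are detected; the published pipeline
    may treat interrupted or compound repeats differently.
    """
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # One motif of unit_len bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "ATAT"),
            # so that a locus is reported once, under its shortest motif.
            if any(motif == motif[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // unit_len))
    return sorted(hits)
```

For instance, a sequence carrying a run of 15 A residues and nine consecutive AT units would yield one mononucleotide and one dinucleotide SSR, whereas eight AT units fall below the threshold and are not reported.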

Analysis of known repetitive sequences

To tentatively identify repeated sequences based on sequence similarities, we performed BLASTn (Altschul et al. 1997) and Censor (Kohany et al. 2006) analyses on the 131,412 BESs from the C. canephora BstYI and HindIII libraries. We first used BLASTn with an E-value significance threshold of 1e−10 and a minimum of 50 aligned base pairs against the Plant Repeats database (http://plantrepeats.plantbiology.msu.edu/) for similarity searches (Jan. 2011). This reference database contains 6,715 sequences from Brassicaceae, Fabaceae, Gramineae and Solanaceae families. Because this search lacked detection sensitivity, we then used Censor and the transposable element (TE) protein database containing 10,307 reference amino acid sequences downloaded (http://www.girinst.org/) from Repbase (Version 17.01, Jun. 2012; 8,416 and 1,891 sequences from Class I and Class II TEs respectively) (Jurka et al. 2005) to detect similarities. We applied a minimum detected fragment length of 100 residues to be reported by Censor.
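The two BLASTn thresholds (E-value of 1e−10 and at least 50 aligned base pairs) amount to a simple filter on the hit table. The sketch below assumes the hits are available in the standard 12-column tabular BLAST output; the source does not state which output format was used, and the function name `filter_blast_hits` is hypothetical.

```python
def filter_blast_hits(lines, max_evalue=1e-10, min_aln_len=50):
    """Keep tabular BLAST hits that pass the E-value and length thresholds.

    Assumed column order (standard 12-column tabular output):
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    kept = []
    for line in lines:
        # Ignore blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        aln_len = int(fields[3])    # alignment length in bp
        evalue = float(fields[10])  # expectation value of the hit
        if evalue <= max_evalue and aln_len >= min_aln_len:
            kept.append((fields[0], fields[1], aln_len, evalue))
    return kept
```

A hit must satisfy both conditions to be retained, so a short but highly significant alignment is discarded just like a long but weak one.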

Bacterial artificial chromosome-end sequences from seven different plant species were retrieved from GenBank (http://www.ncbi.nlm.nih.gov/) or from the SOL genomic web site (http://solgenomics.net/): A. thaliana (46,193); Oryza sativa (124,625); V. vinifera (229,315), Solanum tuberosum (140,540), Solanum lycopersicum (76,975), Mimulus guttatus (BstYI library: 4,721; HindIII library: 3,398) and used for a similar analysis. Finally, we constructed our C. canephora TEs database by extracting identified sequences and by assembling them into contigs using a sequence assembly program (CAP3, Huang and Madan 1999) with default parameters.

Analysis of coding regions

For these analyses, we used both assembled and unassembled TE-masked BESs, named BESnr and BESr, respectively. The assembly was performed using TGICL (Pertea et al. 2003) with default parameters. Similarity searches were performed using BLAST against Coffea ESTs and protein databases.

For functional analysis, BESnr were compared with the UniProtKB (SwissProt + TrEMBL) protein database with an E-value cutoff of 1e−20. Annotations from the two databases were merged into a single annotation file as follows: for each BES, priority was given to the SwissProt annotation, which is more reliable. When no match was found in the SwissProt database, the TrEMBL annotation was then searched. If no match was found in either database, the BES was discarded. Gene ontology (GO) terms were assigned to the BESs present in this annotation file. Categories for the annotations were determined and visualized using BLAST2GO (Conesa and Götz 2008).
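The merging rule described above (SwissProt first, TrEMBL as fallback, otherwise discard) can be written as a short sketch. The dictionaries mapping BES identifiers to annotations and the function name `merge_annotations` are hypothetical and serve only as illustration.

```python
def merge_annotations(bes_ids, swissprot_hits, trembl_hits):
    """Merge two annotation sources, giving priority to SwissProt.

    swissprot_hits, trembl_hits: dicts mapping a BES identifier to its
    best annotation in the respective database (hypothetical inputs).
    Returns a dict: BES identifier -> (source database, annotation).
    """
    merged = {}
    for bes_id in bes_ids:
        if bes_id in swissprot_hits:
            merged[bes_id] = ("SwissProt", swissprot_hits[bes_id])
        elif bes_id in trembl_hits:
            merged[bes_id] = ("TrEMBL", trembl_hits[bes_id])
        # No match in either database: the BES is discarded.
    return merged
```

Recording the source database alongside each annotation keeps track of which entries rely on the less curated TrEMBL records.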

To determine the coding region, we used the BESr dataset in order to avoid biases and artifacts in estimating the coding fraction due to underestimation of the number of multigenic families. We performed a BLASTn with an E-value cutoff of 1e−6 using a Coffea spp. assembled unigene database, which consists of 70 % C. arabica ESTs and 30 % C. canephora ESTs (Vidal et al. 2010). The total length of matching sequences was calculated by summing the length of each match. When several HSPs (high-scoring segment pairs) were identified for a single query, thus revealing the presence of introns, the total length (HSPs + introns) was identified as the transcribed coding region.
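One way to read the rule for queries with several HSPs is that the transcribed region spans from the first matched base to the last, so that the gaps between HSPs (putative introns) are included. The sketch below implements this reading under the assumption that all HSPs of a BES belong to the same gene; the function names are hypothetical.

```python
def transcribed_length(hsps):
    """Length of the transcribed region covered by the HSPs of one BES.

    hsps: list of (query_start, query_end) pairs, 1-based and inclusive.
    Assumption: all HSPs belong to one gene, so the region between the
    outermost HSP coordinates (HSPs + introns) is counted as transcribed.
    """
    if not hsps:
        return 0
    starts = [min(s, e) for s, e in hsps]  # tolerate reversed coordinates
    ends = [max(s, e) for s, e in hsps]
    return max(ends) - min(starts) + 1


def total_transcribed(hsps_by_query):
    """Sum the transcribed lengths over all BESs (dict: BES id -> HSP list)."""
    return sum(transcribed_length(h) for h in hsps_by_query.values())
```

With a single HSP the function reduces to the length of that match, which corresponds to the simple summation described for queries without detectable introns.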

Comparative genome mapping

Unassembled high-quality sequences not containing known repetitive sequences were used to detect potential regions of microsynteny between the coffee genome and the available complete genome sequences of A. thaliana (Arabidopsis Genome Initiative, 2000), S. lycopersicum (Tomato Genome Consortium, 2012) and V. vinifera (Jaillon et al. 2007). The analysis was limited to BES pairs for which both forward and reverse sequences showed similarities with coding regions. These BES pairs were aligned against the genomes using tBLASTx, with an E-value cutoff of 1e−6. For each sequence, only the highest-scoring hits were retained for subsequent analysis. BAC clones were used for microsynteny analysis if both ends had a high score in the target model genome. All BES pairs which produced significant hits with V. vinifera and S. lycopersicum, co-localized (i.e. within a 10–300 kb interval) or not, were finally mapped on the chromosomes of these species using a dedicated in-house Perl script. Ratios of the number of co-localized BES pairs to the number of double hits were calculated on a sliding window and mapped along the V. vinifera and S. lycopersicum chromosomes with a window length of 1 and 2 Mb, and a step size of 250 and 500 kb, respectively.
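The sliding-window ratio can be sketched as follows. This is not the in-house Perl script: it assumes that each BES pair is represented by a single position on the reference chromosome (for example the position of one of its hits), and the function name `sliding_window_ratio` is hypothetical.

```python
def sliding_window_ratio(colocalized, all_hits, chrom_len, window, step):
    """Ratio of co-localized pairs to all double hits per sliding window.

    colocalized, all_hits: lists of positions (bp) on one chromosome;
    representing a BES pair by one position is an assumption of this sketch.
    Returns a list of (window_start, ratio), with ratio None when the
    window contains no double hit.
    """
    out = []
    start = 0
    while start < chrom_len:
        end = start + window
        n_all = sum(start <= p < end for p in all_hits)
        n_col = sum(start <= p < end for p in colocalized)
        out.append((start, n_col / n_all if n_all else None))
        start += step
    return out
```

For grapevine this would be called with `window=1_000_000` and `step=250_000`, and for tomato with `window=2_000_000` and `step=500_000`. Windows without any double hit are reported as `None` rather than zero, so that uncovered regions are not mistaken for regions lacking microsynteny.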

Results and discussion

Characterisation of the BAC library

With the aim of developing essential genomic resources for the genus Coffea, we constructed two high quality genomic BAC libraries with high molecular weight DNA from C. canephora. We chose a doubled haploid tree (acc. DH200-94) to reduce the complexity of a highly heterozygous plant and consequently increase the effective coverage of the genome. To limit cloning biases associated with a single restriction enzyme-based BAC library, two different enzymes, HindIII and BstYI, were used. Each of the two libraries contained 36,864 BAC clones that were arrayed in 96 microtiter plates of 384 wells each. The HindIII library, named CC__Ba, and the BstYI library, named CC__Bb, contained estimated average insert sizes of 166 and 121 kb, respectively (Fig. S1, supplemental data). Since the haploid genome size of C. canephora is on the order of 710 Mb (Noirot et al. 2003), genome coverages of ~8.6X and ~6.3X were estimated for the HindIII and BstYI libraries, respectively. The characteristics of the two high quality BAC libraries are summarised in Table S1 (supplemental data), and both resources are publicly available from the Arizona Genomics Institute Resource Center (http://www.genome.arizona.edu/orders/).
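The coverage estimates follow from the number of clones, the average insert size and the genome size. A minimal sketch of this arithmetic is given below; the function name `library_coverage` is hypothetical, and every clone is assumed to carry an insert of average size.

```python
def library_coverage(n_clones, mean_insert_kb, genome_mb):
    """Genome equivalents represented by a BAC library.

    Assumption: every clone carries an insert of the average size.
    coverage = (clones x insert size) / genome size, all in kb.
    """
    return n_clones * mean_insert_kb / (genome_mb * 1000)


hindiii = library_coverage(36_864, 166, 710)  # HindIII library
bstyi = library_coverage(36_864, 121, 710)    # BstYI library
print(f"HindIII: {hindiii:.1f}X, BstYI: {bstyi:.1f}X")
```

Rounded to one decimal, the two values agree with the ~8.6X and ~6.3X reported for the HindIII and BstYI libraries.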

BAC-end sequencing

A total of 134,827 high quality BESs were generated from 73,728 BAC clones of the two HindIII and BstYI libraries, with a success rate of 91.4 %. All generated sequences exceeded 60 bases in length with a PHRED-quality score ≥20. Sequences were compared to C. arabica chloroplast genome and A. thaliana mitochondrial genome sequences. Results indicated that a total of 1,389 (1.03 %) and 28 (0.02 %) reads showed high similarity with chloroplast or mitochondrial genomes, respectively. In addition, 1,998 BESs (1.48 %) appeared to be contaminated by cloning vector sequences. The two libraries contributed a similar number of reads and genomic raw sequences. Finally, 131,412 BESs with an average length of 682 bp were retained for further analysis. This represents ~92 Mb of genomic sequence data, corresponding to almost 13 % of the estimated size of the C. canephora genome. These data are an excellent resource to provide a first view of the composition, structure and evolution of the coffee genome.
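The read counts reported above can be cross-checked with a short tally. The sketch assumes that the three contaminant categories are mutually exclusive, which the source does not state explicitly, and the function name `bes_tally` is hypothetical.

```python
def bes_tally(n_clones, n_reads, chloroplast, mitochondrial, vector):
    """Success rate and number of BESs retained after filtering.

    Assumption: a read is counted in at most one contaminant category.
    Each clone is sequenced from both ends, hence 2 x n_clones attempts.
    """
    success_rate = n_reads / (2 * n_clones)
    retained = n_reads - chloroplast - mitochondrial - vector
    fractions = {
        "chloroplast": chloroplast / n_reads,
        "mitochondrial": mitochondrial / n_reads,
        "vector": vector / n_reads,
    }
    return success_rate, retained, fractions


rate, retained, fractions = bes_tally(73_728, 134_827, 1_389, 28, 1_998)
print(f"success rate: {100 * rate:.1f} %, retained BESs: {retained}")
```

Under this assumption the tally reproduces the 131,412 retained BESs and the 91.4 % success rate.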

Analysis of simple sequence repeats (SSR)

Simple sequence repeats (or microsatellites) are a class of molecular markers which are often polymorphic and are widely used to produce genetic maps. In addition to ESTs, BESs have also proved to be very useful to design SSR markers in plants (Cheng et al. 2009; Frelichowski et al. 2006; Shultz et al. 2007). With the search parameters used (see “Materials and methods” section), 9,875 SSRs were detected in 6.7 % of the obtained BESs. Mononucleotide motifs were the most abundant (47.8 %), followed by di- (22.9 %) and hexa-nucleotide repeats (13.5 %) (Table 1). Among the dinucleotide motifs, AT/TA was the most abundant (57.5 %) and no GC/CG motif was found. The most abundant trinucleotide motif was AAT/TTA (43.4 %), followed by AAG/TTC (24.2 %), whereas GGC/CCG was the least abundant motif (0.6 %) (Table 2).

Table 1 Distribution of simple sequence repeats in C. canephora BESs as a function of the length of the repeat motif
Table 2 Distribution of simple sequence repeats in C. canephora BESs as a function of the motif of di- and tri-nucleotide repeats

The distribution of microsatellite motifs differed between ESTs and BESs, whereas the overall SSR frequency was identical in the two datasets (0.1 SSR per kb). All the EST sequences available at NCBI for C. canephora were downloaded and screened for SSRs using the same program and the same parameters. A total of 1,865 SSRs were identified. Di-, tri- and mono-nucleotide motifs were the most abundant (25, 24.7 and 24.5 %, respectively) (Fig. 1). The motif frequency differed with the type of sequence (Fig. 2). Among the dinucleotide motifs, the most common repeat in the ESTs was the GA/CT motif (68.4 %), while among the trinucleotide motifs, the AAG/TTC motif was the most abundant (25 %).

Fig. 1 Distribution of simple sequence repeats in C. canephora BESs and ESTs as a function of the length of the repeat motif

Fig. 2 Distribution of simple sequence repeats in C. canephora BESs and ESTs as a function of the di- and tri-nucleotide motifs

Our results indicated that potential SSRs are more numerous in BESs (which sample both non-coding and coding regions) than in coding regions alone, as sampled by ESTs. All the novel SSRs detected in this analysis are currently being used to develop and refine a dense genetic map of C. canephora. Because it has been shown that many microsatellite markers defined in C. canephora can be used in other coffee species (Poncet et al. 2007), the markers reported here will help advance comparative genomic studies in the Coffea genus.

Identification and characterisation of repetitive DNA elements

The 131,412 BESs were first screened for repetitive DNA sequences by BLASTn searches of the Plant Repeats database in a way similar to the method published by Hsu et al. (2011). A relatively low number of repeated sequences were detected using this method with the exception of rRNA, which represented the majority of the repeats found (82 and 53 % of all repeated sequences in the BstYI and HindIII genomic libraries, respectively) (Fig. S2, supplemental data). Including the detection of rRNA, only 5.6 and 2 % of BESs in the BstYI and HindIII libraries, respectively, were found to contain repeated sequences. This result indicates that the method used lacks sensitivity and suggests that most TEs in C. canephora may diverge significantly at the nucleotide level from the sequences in the reference database.

To overcome the observed lack of sensitivity, we used Censor (Kohany et al. 2006) together with the Repbase TE protein database, which contains 10,307 reference sequences.

As in other plant genomes, we found that the C. canephora genome contains a significant proportion of TEs (Table 3). In total, 17,834 fragments representing 11.91 % of the whole length of the BESs showed similarities with known TE peptides (Table 3). Interestingly, we observed a bias depending on whether the BstYI or HindIII library was used for the analysis (14.96 vs. 8.81 %). Long terminal repeat (LTR)-retrotransposons (both Ty1-Copia and Ty3-Gypsy elements) were over-represented in the BstYI library compared to the HindIII library (Table 3). Equal amounts of the other types of TEs (LINE and Class II DNA transposons) were identified in the two libraries. We compared these results with those of a similar analysis conducted on publicly available BESs from M. guttatus BstYI and HindIII libraries (4,721 and 3,398 sequences, respectively). A bias in the representation of LTR-retrotransposons was also observed between the BstYI and HindIII libraries (Fig. S3 and S4, supplemental data). We concluded that the bias we observed between BstYI and HindIII libraries could be due to the restriction enzyme sites used to construct the two libraries. These results underline the need to construct BAC libraries based on different restriction sites to overcome cloning artifacts due to the use of a single enzyme.

Table 3 Number of TE sequences, length (bp) of TE sequences and number of BESs containing repeated DNA using Censor and the Repbase transposable element protein database (minimum length of detected fragment: 100 residues)

We also compared our results with a selection of other plant species. BESs from seven different reference plant genomes were retrieved and analysed in the same way as for C. canephora BESs (Fig. 3).

Fig. 3 Comparison between the ratios of sequence lengths similar to known TEs found in C. canephora BES sequences and a selection of BESs from model plant species (i.e. % total sequence lengths similar to TE/whole length of BESs). TEs were detected using Censor and proteins from Repbase as the reference library

Among the 17,834 fragments containing potential TEs, a large majority (92.2 %) showed homology with Class I retrotransposons (Table 3). This class is subdivided into Ty1-Copia and Ty3-Gypsy LTR retrotransposon superfamilies. In C. canephora Ty3-Gypsy clearly outnumbered Ty1-Copia LTR retrotransposons with a Ty3-Gypsy:Ty1-Copia ratio of 3:1.

This ratio is similar to those of O. sativa (2.95:1), Arabidopsis (2.94:1), S. tuberosum (2.48:1) and S. lycopersicum (2.45:1), but higher than those of Carica papaya (1.61:1) and M. guttatus (1:1). It contrasts with Prunus persica (1:1.18), Musa acuminata (1:2.47) and V. vinifera (1:3), where Ty1-Copia outnumbered Ty3-Gypsy TEs (Fig. 4). The second most abundant TE type found in C. canephora was the non-LTR retrotransposon LINE element (L1 family) (Table 3).

Fig. 4 Frequency of TE classes (Class I retrotransposons, Ty1-Copia, Ty3-Gypsy and Class II DNA transposons) in the BESs of C. canephora BstYI and HindIII libraries and a selection of reference plant species

Taken together, our analyses of BESs against known coding sequences of TEs indicate that 11.9 % of the genome corresponds to known repeat sequences. Thus the percentage of TEs in C. canephora appears to be similar to the percentage found for M. guttatus (11.6 %) and P. persica (10.9 %).

However, in plant genomes many elements lack coding sequences (as in MITEs), a large proportion of TE domains are non-coding (like the LTR domains of retrotransposons), and the coding sequences of elements inserted a long time ago are highly altered. The proportion of TEs may therefore be significantly higher than our current estimate.

We conclude that the C. canephora genome presents a significant proportion of TEs, similar to that of the M. guttatus and P. persica genomes. The most frequent elements were class I retrotransposons of the Ty3-Gypsy group. Considering the relative importance of LTR-retrotransposons, the transcriptional and transpositional activities of these elements should be analysed and used to study the genetic diversity of the C. canephora species.

Finally, all identified TE sequences were extracted from BESs and assembled into 2,475 contigs and 4,951 singlets (for a total size of 4.61 Mb) to construct the first database of C. canephora TE sequences (Online resource, Table S2). Among the assembled contigs, 615 were ≥1,000 bp and 61 were ≥2,000 bp. The longest assembled sequence was 4,704 bp and, when translated, showed similarities with Ty3-Gypsy proteins from the public database. This database can be used for masking procedures in the assembly of C. canephora genomic sequences and for detailed TE characterisations.

Analysis of coding regions

BAC-end sequences were analyzed to identify coding regions via homology searches. Overall, 54,501 BESs (41.5 % of BESr) displayed significant homology with Coffea spp. ESTs. The matching regions represented approximately 18 Mb of transcribed sequences and accounted for 20.1 % of the cumulative length of BESr (Table 4). At the C. canephora genome scale, this represents a transcribed portion of 142.7 Mb in length (20.1 % of 710 Mb). Assuming an average gene length (3.4 kb) similar to that of V. vinifera (Jaillon et al. 2007) and a total coding region of 142.7 Mb, the gene content of the C. canephora genome was estimated at 41,973 genes. This estimate is likely to be biased by the process used to generate the BESs or by the fact that the EST database used does not represent the complete transcriptome of C. canephora. Taking this probable overestimation into account, the estimate is in the same range as the number of protein-coding genes identified in the genome sequence of related plant species (i.e. 30,434; 34,727 and 35,004 in grapevine, tomato and potato, respectively).
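The extrapolation from transcribed fraction to gene number is a two-step calculation, sketched below. The function name `estimate_gene_count` is hypothetical; truncating the result to an integer and using the unrounded transcribed length are assumptions that happen to reproduce the reported figure.

```python
def estimate_gene_count(transcribed_fraction, genome_mb, mean_gene_kb):
    """Extrapolate a gene number from the transcribed fraction of the BESs.

    Assumptions: the BESs are a representative sample of the genome and
    all genes have the same average length (here taken from V. vinifera).
    Returns (transcribed length in Mb, estimated number of genes).
    """
    transcribed_mb = transcribed_fraction * genome_mb
    # Convert Mb to kb, divide by the mean gene length, truncate to integer.
    n_genes = int(transcribed_mb * 1000 / mean_gene_kb)
    return transcribed_mb, n_genes


transcribed_mb, n_genes = estimate_gene_count(0.201, 710, 3.4)
print(f"{transcribed_mb:.1f} Mb transcribed, about {n_genes} genes")
```

The estimate scales linearly with the transcribed fraction and inversely with the assumed gene length, so any bias in either quantity carries over directly to the gene number.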

Table 4 Summary of BLAST analyses using different BES sets against SwissProt and TrEMBL protein databases (E-value cutoff: 1e−20) and Coffee EST databases (E-value cutoff: 1e−6)

Additionally, the BESraw based analysis was split up into two separate analyses for the two BAC libraries to point out potential differences in coding sequence representation. This revealed a noticeable difference between the two libraries, the BstYI library showing a higher proportion of coding sequence compared to the HindIII library.

Coding sequences were annotated with GO terms. A total of 10,607 GO annotations were found and 3,812 BESs (corresponding to more than 3 % of the cumulative length of BESnr and to 4.2 % of the total number of BESnr) were characterised by at least one annotation. This proportion is rather low compared to results in the literature, e.g. 36 % in Citrus (Terol et al. 2008) or 24 % in Brachypodium (Huo et al. 2008), but is readily explained by the more conservative E-value threshold and protein databases we chose (UniProtKB vs. NCBInr). Annotations were divided into three classes as follows: molecular function (45 %/3,133 sequences), biological process (29 %/2,019 sequences) and cellular component (26 %/1,803 sequences). Figure 5 shows the distribution of GO terms of gene products predicted from BESs. Among the BESs in the biological process categories, 32 % corresponded to proteins involved in metabolic process and 26 % were associated with cellular process. Sequences in the molecular function categories were distributed as follows: 22 % with transferase activity, 14 % with hydrolase activity and 13 % with involvement in nucleotide binding. Among the BESs in the cellular component categories, more than half (53 %) encoded cellular proteins, 28 % membrane-bounded organelle proteins, and 5 % vesicle proteins. The distribution of GO categories in the coffee BESs is comparable to previous findings in other dicot species by BES approaches (Cavagnaro et al. 2009; Han and Korban 2008) and in Arabidopsis by a whole genome functional annotation (Berardini et al. 2004).

Fig. 5 Distribution of GO annotations of gene products predicted from the C. canephora BESs for biological process, molecular function and cellular component. Only major categories are presented

Comparative genome mapping

The BAC-end sequences were also used to evaluate synteny relationships between C. canephora and reference plant genomes, the latter represented by A. thaliana, S. lycopersicum and V. vinifera. We used tBLASTx searches (1e−6 as cut off value), with a set of BES pairs free of known TEs and containing different coding region families in both paired sequences as queries. A total of 14,325 pairs of sequences corresponding to the parameters described above were used in this analysis against reference whole-genome sequences. tBLASTx results were filtered according to the location and distance between each pair of BAC sequences and only the best hits were taken into consideration. BES pairs mapped between intervals of 10–300 kb within the same reference plant chromosome were then considered as potential microsyntenic regions.
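The criterion for a potential microsyntenic region can be expressed as a test on the best hits of the two ends of a BAC. In the sketch below, the function name `is_colocalized`, the representation of a hit as a (chromosome, position) pair and the inclusive interval bounds are assumptions made for illustration.

```python
def is_colocalized(hit_fwd, hit_rev, min_dist=10_000, max_dist=300_000):
    """Test whether both ends of a BAC map within 10-300 kb of each other.

    hit_fwd, hit_rev: (chromosome, position_bp) of the best hit of the
    forward and reverse BES on the reference genome (hypothetical format).
    Assumption: the interval bounds are inclusive.
    """
    chrom_f, pos_f = hit_fwd
    chrom_r, pos_r = hit_rev
    if chrom_f != chrom_r:
        return False  # ends map to different chromosomes
    return min_dist <= abs(pos_f - pos_r) <= max_dist
```

The upper bound is consistent with the insert sizes of the libraries (121 to 166 kb on average), while leaving room for local expansion or contraction between the coffee genome and the reference genome.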

Table 5 shows that the C. canephora BESs share more potential microsyntenic regions with the V. vinifera genome (331) than with the tomato (270) and Arabidopsis (88) genomes, suggesting that microsynteny is higher between coffee tree and grapevine than between coffee tree and tomato. However, the coffee tree and tomato genomes are more closely related than those of the coffee tree and V. vinifera. Coffee tree and tomato diverged from a common ancestor approximately 83–89 million years ago (Wikström et al. 2001), while the divergence between coffee tree and V. vinifera is estimated to have occurred 114–125 million years ago. This result confirms at the genome scale the high level of conservation of genome microstructure observed between C. canephora and grapevine (Cenci et al. 2010; Guyot et al. 2009) and the ancestral synteny established using conserved orthologous sequences (COSII) (Guyot et al. 2012). Although the two species are distantly related, the high level of conservation between the grapevine and coffee genomes at the microstructure level suggests limited genome evolution associated with the perennial habit of these species (Cenci et al. 2013). Regarding A. thaliana, the limited microsynteny observed is likely the consequence of the long history of segmental duplication and the resulting genome reshuffling that occurred in Arabidopsis (Blanc et al. 2000; Mahé et al. 2007).

Table 5 Microsynteny analyses between 14,325 C. canephora BES pairs (showing similarities with coding regions) and A. thaliana, S. lycopersicum and V. vinifera whole-genome sequences

Furthermore, all BES pairs (1,642) that produced significant hits with V. vinifera were mapped on the 19 grapevine chromosomes as shown in Fig. 6. Distributions of BES pairs that either co-localized within intervals of 10–300 kb (331 BES pairs) or not were compared. A similar comparative analysis was also performed with tomato (Fig. S5, supplemental data). A bias in the distribution of BES pairs showing significant hits along the chromosomes of both species was observed. Nevertheless, regions exhibiting a high frequency of BESs showing microsynteny were detected. It is noteworthy that the potential syntenic regions identified did not appear to be uniformly distributed on the different chromosomes. The occurrences of microsynteny appear to be preferentially located in the distal part of the chromosomes; this pattern is particularly clear for tomato chromosomes whereas it is less marked when comparing coffee and grape. This observation is likely related to the pericentric heterochromatin and distal euchromatin reported in the tomato chromosomes, resulting in a substantially higher density of genes in distal regions (Tomato Genome Consortium 2012). Although covered by BES pairs, some regions showed a low frequency of “microsyntenic” BES pairs (see grapevine chromosomes 10 and 19 or tomato chromosomes 6, 10 and 12 in Fig. 6 and supplementary data 4), suggesting that the conservation of structure between the coffee genome and the related genomes studied here is not uniform either along a given chromosome or from one chromosome to another.

Fig. 6 Mapping of C. canephora BES pairs on grapevine (V. vinifera) chromosomes. Chromosome numbers are shown on the left. Coloured vertical lines indicate the positions of the C. canephora BES pairs mapped on the grapevine genome (TBLASTx E-value 1e−6). Red bars indicate BES pairs (1,642) mapped whatever the interval and green bars indicate the positions of BES pairs (331) mapped within an interval of 10–300 kb (Table 5). The blue line represents the ratio between the number of BES showing microsynteny and the number of mapped BES on a 1 Mb sliding window (step size: 250 kb) along chromosomes

Conclusion

The two BAC libraries developed here are a genomic resource suitable for a broad range of applications in genetic and genomic research in coffee. In addition, the analyses and the data generated in this study provide a first glimpse of the genome constitution of C. canephora. Compared to other reference plant genomes, a high level of microsynteny was observed between coffee tree and grapevine, suggesting conservation of the microstructure of the genome. Furthermore, the present project should prove extremely useful for the ongoing C. canephora genome sequencing initiative. While the paired-end sequences generated from BACs should considerably facilitate the scaffolding of sequence contigs, the SSRs identified from BESs could be used for saturating existing linkage maps and for anchoring physical and genetic maps. Moreover, the TE database constructed from the present BES analysis should greatly improve the masking procedures in the assembly of C. canephora genomic sequences.