Introduction

Theobroma cacao L., which belongs to the Malvaceae family (Whitlock et al. 2001), is a tree that grows naturally in South America. T. cacao is a diploid species (2n = 2x = 20) with a small haploid genome size of 390 Mb (Lanaud et al. 1992; Figueira et al. 1992).

T. cacao germplasm was classified into three morphogenetic groups corresponding to the first cultivated cocoa varieties (Cheesman 1944), thus distinguishing between Forastero, Criollo, and Trinitario (hybrid form between the first two groups). However, the diversity of the Forastero genetic pool is very high (Lerceteau et al. 1997; Laurent et al. 1993a, b), and a new classification in ten major groups has been proposed very recently in relation to geographic and genetic differences between the Forastero populations (Motamayor et al. 2008).

Cocoa is cultivated in humid tropical regions (Africa, Latin America, and Asia), where it is one of the most economically important perennial crops. Cocoa is the third-largest product on the world market for tropical commodities after sugar and coffee. Cocoa is mainly produced on smallholdings and, according to the World Cocoa Foundation (Cocoa market update 2009), 40 to 50 million people depend on it for their livelihood worldwide. About 3.5 million tons of cocoa are produced annually, 70% coming from Africa, and the demand for cocoa is increasing. This crop is subject to several fungal diseases, including Moniliophthora sp. (Aime and Phillips-Mora 2005) and Phytophthora sp., causing major yield losses (globally, 30%) in all regions. A major objective of breeding programs is to provide new varieties with sustainable resistance to fungal diseases, by accumulating several sources of resistance in the newly created varieties.

Several linkage maps based on codominant or dominant markers (restriction fragment length polymorphisms (RFLPs), SSRs, random amplification of polymorphic DNAs (RAPDs), amplified fragment length polymorphisms (AFLPs) and isoenzymes) have already been published for T. cacao (Lanaud et al. 1995; Crouzillat et al. 1996; Risterucci et al. 2000). A high-density linkage map, with codominant markers only, was developed (Pugh et al. 2004) with the addition of 201 simple sequence repeat markers (SSRs), resulting in a mean genome coverage of one SSR every 3 cM.

SSRs are present in the majority of eukaryotic genomes and consist of DNA sequences of simple motifs (mono- to hexa-nucleotides) that are repeated in tandem (Litt and Luty 1989; Beckman and Soller 1990). SSR length variation is mainly due to the slippage of DNA polymerase or unequal crossover leading to the insertion or deletion of tandem repeats (Ellegren 2004). Allelic variation in SSR loci can readily be detected by PCR using specific primers flanking the SSR motif.

Among the different classes of molecular markers, SSRs have been extensively used in many crop species for numerous applications in plant genetics and breeding because of their simplicity, reproducibility, high degree of polymorphism, codominant inheritance, relative abundance, and good genome coverage (Powell et al. 1996). Moreover, SSRs are technically easy to analyze because of the small amount of DNA required and their suitability for multiplexing on automated systems. SSR markers are widely used in cultivar fingerprinting, genetic diversity assessment, molecular mapping, QTL detection, and marker-assisted selection.

The standard method for developing SSR markers is based on the creation of a small-insert genomic library, subsequent hybridization with tandem repeated oligonucleotides (SSR nucleotides) and sequencing of candidate clones, making the process quite time-consuming and laborious (Thiel et al. 2003). Furthermore, the SSRs acquired are limited to the probed SSR motifs (Chen et al. 2006). However, this approach has been widely used in the past for many species. With the increased public availability of genomic and EST sequences it is now possible to mine these sequence collections directly for SSR identification.

In cocoa, 273 SSR markers were developed and mapped following the construction of genomic libraries enriched with SSRs (Lanaud et al. 1999; Risterucci et al. 2000; Pugh et al. 2004). These SSRs were already useful tools for the identification of QTLs related to morphological or disease resistance traits (Crouzillat et al. 2000; Flament et al. 2001; Risterucci et al. 2003; Clement et al. 2003; Queiroz et al. 2003; Brown et al. 2007; Brown et al. 2005). The high level of polymorphism for these SSR markers also helped to create a composite map by mapping SSRs (bridge markers) common to three different maps (Brown et al. 2008), and led to the establishment of a consensus map derived from ten different progenies, which has enabled a meta-analysis of QTLs related to resistance traits in cocoa (Lanaud et al. 2009). The SSRs defined in genomic sequence often correspond to non-coding sequences and thus cannot reflect the functional diversity of cocoa populations. Alternatively, expressed sequence tag-based SSRs (EST-SSRs) can be identified in sequence databases using data mining pipelines composed primarily of SSR search and primer design programs. SSRs derived from EST libraries offer the advantage, over genomic SSRs, of being intrinsically associated with coding sequences within the genome (Eujayl et al. 2002), thus providing functional information about genome regions linked to trait variations.

To date, EST-SSRs have been identified, developed, and used in a variety of studies for several plant species. Firstly, the use of EST libraries to develop polymorphic SSR markers was investigated in some cereals such as barley (Thiel et al. 2003), durum wheat (Eujayl et al. 2002) and rye (Hackauf and Wehling 2002) and more recently in some trees such as apricot (Decroocq et al. 2003), prunus (Ogundiwin et al. 2009) and in some tropical species such as sugar cane (Cordeiro et al. 2001; Pinto et al. 2004), cotton (Han et al. 2004; Guo et al. 2007) or coffee (Poncet et al. 2006; Aggarwald et al. 2007).

Given their high level of transferability from one species to another (Gupta and Prasad 2009; Luro et al. 2008), EST-SSRs constitute a useful tool for functional diversity studies and comparative mapping between species (Varshney et al. 2005). Although genomic SSRs are more polymorphic than genic SSRs (Cho et al. 2000; Lee et al. 2004), the latter have some important advantages: (1) the estimated frequency of genic SSRs is higher than genomic SSRs (Morgante et al. 2002), (2) such genic markers are functional markers that can contribute to “direct allele selection”, if they are shown to be completely associated with, or even responsible for, the variation of a trait of interest (Sorrells and Wilson 1997), (3) EST-SSR markers provide a set of readily shared markers that can be used to unify different genetic maps and establish consensus maps and (4) they can be more easily transferred between related species if they are derived from putative candidate genes related to traits of interest that are relatively well conserved among taxa (Gupta and Rustgi 2004).

A first set of 34 EST-SSRs was published for cocoa (Borrone et al. 2007; Riju et al. 2009). These markers were identified in the first ESTs (6581 sequences) from leaf and bean cDNA libraries (Jones et al. 2002), and from suppressive subtractive hybridization (SSH) from leaves treated with inducers of the defense response (Verica et al. 2004), and shoot tips inoculated with witches’ broom caused by Moniliophthora perniciosa (Leal et al. 2007). More recently, Lima et al. (2010) identified 49 EST-SSRs from full length and SSH libraries constructed from shoots and pods elicited with M. perniciosa.

Recently, a large collection of cocoa ESTs was generated from various tissues under various conditions (Argout et al. 2008). In total, 149,650 valid EST sequences were generated, corresponding to 48,594 unigenes, of which 29,849 unigenes shared significant homology with public sequences from other species. This constitutes an available resource to search for new SSRs in large quantities.

In order to find potentially mappable loci, the allelic diversity of some loci was tested on eight genetically very different genotypes, including the parents of the progeny used to construct the reference map (Pugh et al. 2004). In this work, a subset of 314 EST-SSRs was analyzed for its ability to reveal polymorphism, and 115 of them were mapped on the cocoa reference map.

Materials and methods

SSR detection

EST-SSR markers were detected in a T. cacao unigene set generated from the large EST collection produced under international collaboration and stored in the ESTtik database (Argout et al. 2008). SSRs were identified using the MIcroSAtellite identification tool (MISA). To be considered true SSRs, repeats had to contain at least six repeat units for dinucleotide motifs and at least five for tri-, tetra-, penta-, and hexanucleotide motifs. The maximum interruption between two SSRs was set at 100 nucleotides. Primer3 software (Rozen and Skaletsky 2000) was then used to design three pairs of primers. The product size range was set between 100 and 280 nucleotides, the optimum, minimum and maximum primer sizes were set at 17 and 23 nucleotides respectively, and the optimum melting temperature at 56°C (ranging from 50°C to 63°C).
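The repeat thresholds above can be illustrated with a minimal Python sketch. This is not the MISA program itself: the function name `find_ssrs` and the regular-expression approach are assumptions made for illustration, only perfect repeats are reported, mononucleotide repeats are left out because no threshold is given for them, and the 100-nucleotide rule for compound SSRs is not implemented.

```python
import re

# Minimum number of tandem repeat units per motif length, as stated in the text:
# six for dinucleotides, five for tri- to hexanucleotides.
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(sequence):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    Hypothetical helper: positions are 0-based on the input sequence.
    """
    seq = sequence.upper()
    hits = []
    for motif_len, min_rep in MIN_REPEATS.items():
        # One motif copy followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "ATAT"),
            # since they belong to the shorter motif class.
            if any(motif == motif[:k] * (motif_len // k)
                   for k in range(1, motif_len) if motif_len % k == 0):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // motif_len))
    return sorted(hits)
```

For example, a sequence carrying six AT units is reported as a dinucleotide SSR, whereas the same motif repeated only five times falls below the threshold and is ignored.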

The localization of SSRs in the coding sequence was estimated using the prot4EST pipeline (Wasmuth and Blaxter 2004). ESTs were first translated into high quality predicted polypeptides and SSRs were classified according to the position found in the proteins produced: in the coding sequence, in 5′UTR or in 3′UTR.
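The classification step can be sketched as follows, assuming that the coordinates of the predicted coding sequence on each EST are already available (for instance from a translation pipeline). The function name, the 0-based half-open coordinates, and the rule that any overlap with the coding sequence counts as a coding SSR are all assumptions of this sketch, not details taken from prot4EST.

```python
def classify_ssr_region(ssr_start, ssr_end, cds_start, cds_end):
    """Classify an SSR by its position relative to the predicted CDS.

    All coordinates are 0-based and half-open, measured on the EST sequence.
    """
    if ssr_end <= cds_start:
        return "5'UTR"
    if ssr_start >= cds_end:
        return "3'UTR"
    # Assumption: an SSR overlapping the CDS, even partially, is called coding.
    return "CDS"
```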

EST functional annotation

Functional annotation of the EST-SSRs was carried out using the Blast2GO program (Conesa et al. 2005), which assigns gene ontology (GO) annotations (Ashburner et al. 2000) on the basis of Blast hit definitions. EST-SSR Blast searches were run against the NCBI non-redundant protein database (Wasmuth and Blaxter 2004) with an E-value threshold of 1e-6. A maximum of ten hits was retained per query. The Gene Ontology mapping step was performed against a locally built Blast2GO database and the Annotation Cutoff was set at 75.

Polymorphism detection

The polymorphism of SSRs was revealed by genotyping eight genotypes originating from contrasting genetic groups: LAN28 (Criollo), SCA6 (Upper-Amazon Forastero), MAT 1–6 (Lower-Amazon Forastero), SA16, SNA1001, SNA1003 (Ecuadorian Nacional varieties) and two parents of the reference map: UPA402 (Upper-Amazon Forastero) and UF676 (Trinitario).

PCR reactions were performed on an MJ Research PTC Thermal cycler (MJ Research, Waltham, Mass.) in a 20-μl volume containing 10 ng of cocoa DNA, 0.25 μM of forward primer, 0.25 μM of reverse primer, 2 mM MgCl2, 50 mM KCl, 10 mM Tris–HCl (pH 8.3), 0.2 mM dNTP mix, 0.34 nmol of α-(33P) CTP (2 μCi) and 1 U Taq polymerase (Eurobio, France). The PCR reaction began with an initial denaturation step (95°C, 5 min), followed by ten cycles of 30 s at 95°C, primer annealing by touch-down (45 s from 55°C to 50°C), then 45 s at 72°C. The program continued with 25 cycles of 30 s at 95°C, primer annealing for 45 s at 50°C and 45 s at 72°C, and was completed with a final extension at 72°C for 8 min.

Twenty microliters of loading buffer (98% deionized formamide, 10 mM EDTA, bromophenol blue and xylene cyanol) was added to each reaction. Samples were denatured at 94°C, and 5 μL of each sample was subjected to electrophoresis at 55 W on 5% denaturing polyacrylamide gels containing 7.5 M urea in 0.5X TBE buffer (pH 8.0). The gels were dried and exposed for 48–72 h to X-ray film (Fermentas). Potentially mappable loci were identified from the polymorphism observed between the parents of the reference map.

Mapping population

EST-SSRs were mapped on the same mapping population used to establish the previous cocoa reference map (Risterucci et al. 2000; Pugh et al. 2004). One hundred twenty-five progeny individuals, planted in Ivory Coast, were analyzed. This population resulted from a cross between two heterozygous cacao clones, UPA 402 (♀), an Upper-Amazon Forastero, and UF676 (♂), a Trinitario. In this case, there were three possible segregation types: loci heterozygous for only one of the two parents (either UPA 402 or UF676) and homozygous for the other, which segregated 1:1, and loci heterozygous for both parents, which segregated 1:2:1 or 1:1:1:1. The latter situation made it possible to establish the bridge markers needed to integrate each parental dataset and construct a consensus linkage map (Grattapaglia and Sederoff 1994).

Genotyping of the mapping population

Several panels were designed for genotyping with the selected polymorphic EST-SSR markers. Forward or reverse primers were labeled at the 5′ terminus with a fluorescent dye (Fam, Hex or Dragonfly Orange™). Simplex PCR assays with fluorescently labeled primer pairs were carried out in 10 μl reactions using one of two standard PCR protocols. The PCR reactions were performed with 3.3 pmole of each primer, 25 nmoles of MgCl2, 2.5 nmoles of each nucleotide, and 0.225 U of Taq Polymerase (BioTaq™, Bioline). The PCR products were pooled according to the defined panels.

Plates for the PCR and pooling steps were prepared with a robot pipetting station. The fluorescent PCR products were then analyzed on a MegaBACE™ 1000 Sequencer (Amersham Biosciences). The internal size standard was fluorescently labeled with a fourth dye (ET-ROX™ 400, Amersham Biosciences). The raw data were then analyzed using Genetic Profiler software (version 1.1).

The PCR reaction was initiated with a denaturation step at 95°C for 12 min, followed by ten cycles of 15 s at 94°C, 15 s at 50°C, 30 s at 72°C and 20 cycles of 15 s at 89°C, 15 s at 50°C and 30 s at 72°C, then a final extension at 72°C for 10 min.

Linkage mapping and map construction

The map was constructed using JoinMap software, version 4.0 (Van Ooijen 2006), by integrating the new EST-SSR loci in addition to the genomic SSRs, defense and resistance gene analogs, RFLPs, and isoenzymes previously mapped (Pugh et al. 2004). The segregation of each SSR marker was tested with a chi-square test for goodness-of-fit to the expected Mendelian segregation ratio. JoinMap software is able to combine data of several segregation types to construct a genetic map. Here, we used the population type “CP”, planned for a population resulting from the cross between two heterozygous diploid parents with possibly known linkage phases. A logarithm of the odds (LOD) score of five was used to identify linkage groups. The Kosambi mapping function, with a LOD of five and jump threshold of three, was used to convert recombination frequencies into map distances (Kosambi 1944).
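As an illustration of the last step, the Kosambi mapping function converts a recombination fraction r into a map distance d = 25 ln[(1 + 2r)/(1 − 2r)] centimorgans. The short Python sketch below implements this standard formula and its inverse; the function names are chosen for this example and do not correspond to any JoinMap interface.

```python
import math


def kosambi_cm(r):
    """Map distance in cM for a recombination fraction r (0 <= r < 0.5)."""
    if not 0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return 25.0 * math.log((1 + 2 * r) / (1 - 2 * r))


def kosambi_r(d_cm):
    """Inverse function: recombination fraction for a distance in cM."""
    return 0.5 * math.tanh(2 * d_cm / 100.0)
```

Under this function, a recombination fraction of 0.10 corresponds to a distance of about 10.1 cM, slightly more than the raw percentage, because the function allows for double crossovers under partial interference.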

Marker nomenclature

Designation of the EST-SSR loci was based on the nomenclature used for the last reference map (Pugh et al. 2004). These new loci were therefore named mTcCIR, like the SSR markers that were previously identified from genomic DNA.

Results and discussion

Polymorphism screening

In total, 149,650 valid EST sequences were generated corresponding to 48,594 unigenes. Of them, 29,849 unigenes showed significant homology with public sequences from other species. A functional characterization of ESTs was performed to assign Gene Ontology annotations. A set of 314 EST-SSRs identified in genes with a known function was selected from part of the cocoa EST collection constructed at CIRAD (Argout et al. 2008). After primer definition, the polymorphism of SSR loci was evaluated on 8 genotypes of various genetic origins (Table 1).

Table 1 Polymorphism revealed by 314 EST-SSR markers on eight genotypes

Amplification was successful for 251 of the 314 EST-SSRs screened. Of them, 174 revealed polymorphism between the eight genotypes tested, representing 55.4% of the total SSRs screened. Of those polymorphic SSRs, we were able to genotype 115 SSRs with the MegaBACE™ 1000 sequencer and integrate them on the genetic map. The markers mapped on the new map therefore corresponded to 36.6% (115) of the EST-SSRs screened (Table 2) and the 59 polymorphic but unmapped EST-SSRs amounted to 18.8% (Table 3).

Table 2 Characteristics of 115 EST-SSR loci mapped on the reference map
Table 3 Characteristics of 59 unmapped EST-SSR loci but revealing polymorphism on eight genotypes

When polymorphism was tested for eight different genotypes, 20.1% of loci were not amplified. This result can be partly explained by the presence of introns. Indeed, we defined the primers on cDNA sequences and the amplifications were tested on genomic DNA, which may have generated a larger amplified DNA fragment than expected, due to intron length.
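The percentages reported in this section all use the 314 screened EST-SSRs as the denominator. The following Python sketch simply recomputes them from the counts given in the text; the variable names are illustrative.

```python
# Counts taken from the text.
screened = 314
amplified = 251
polymorphic = 174
mapped = 115


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


not_amplified = screened - amplified        # 63 loci
unmapped_polymorphic = polymorphic - mapped  # 59 loci

summary = {
    "polymorphic": pct(polymorphic, screened),
    "mapped": pct(mapped, screened),
    "polymorphic_unmapped": pct(unmapped_polymorphic, screened),
    "not_amplified": pct(not_amplified, screened),
}
```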

EST-SSR distribution

The most frequent classes of SSRs presenting good amplification (251 EST-SSRs) were dinucleotide repeats (49.4%) and trinucleotide repeats (43%), compared with tetranucleotide (1.6%) and hexanucleotide repeats (0.4%). Dinucleotide SSR repeats and composite SSRs (often composed of dinucleotides) were the most polymorphic and revealed the largest number of alleles per locus out of the eight genotypes screened. For EST-SSRs with good amplification, 78.2% of loci were polymorphic when they consisted of dinucleotide repeats and 58.3% for the trinucleotide repeats (Table 4). This observation can be explained by polymerase errors that are more frequent when the repeat element is short.

Table 4 Distribution and polymorphism of different types of SSR repeats depending on their position in the gene (5′UTR, CDS, and 3′UTR)

The SSR loci showed strong heterogeneity in their level of polymorphism depending on the different gene regions from which they originated (5′UTR, 3′UTR, and CDS). The SSRs originating from the CDS region were the least polymorphic, with 53.5% of polymorphic loci, compared with the 5′UTR (68.9%) and 3′UTR (82.5%) regions. This result was expected and related to better conservation of the coding region, thus providing fewer polymorphic SSRs. The SSRs originating from the 5′UTR region were less polymorphic than those originating from the 3′UTR region. This observation can be explained by the presence, in the 5′UTR region, of important regions involved in the regulation of gene expression, which need to be more conserved.

The distribution of SSR repeat patterns was very different depending on the EST regions considered. In the CDS region, the main repeats were of the trinucleotide or hexanucleotide type (83.7% of SSRs). This percentage was much higher than in the 5′UTR and 3′UTR non-coding regions, where trinucleotide patterns accounted for 37.2% and 32.1% of SSRs, respectively. This result seems logical because a change in the number of trinucleotide or hexanucleotide repeats retains the correct reading frame of the coding sequence, causing minimum disruption.
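The reading-frame argument can be made explicit with a small Python sketch: gaining or losing repeat units changes the length of the coding sequence by the motif length times the number of units, and the frame is preserved only when that change is a multiple of three. The function name is an assumption made for this example.

```python
def preserves_reading_frame(motif_length, repeat_change=1):
    """True if adding or removing `repeat_change` repeat units of a motif
    of `motif_length` bases leaves the downstream reading frame intact."""
    return (motif_length * repeat_change) % 3 == 0
```

A single-unit change in a tri- or hexanucleotide SSR always preserves the frame, whereas a dinucleotide SSR does so only when the number of units changes by a multiple of three.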

Linkage analysis and map construction

Of the 115 EST-SSRs mapped, 74 (64.4%) were heterozygous for UF 676 only, nine (7.8%) were heterozygous for UPA 402 only, and 32 (27.8%) were heterozygous for both parents (bridge markers) with two, three, or four alleles. These percentages were similar to those found for SSRs mapped by Pugh et al. (2004).

In order to check the co-linearity of the parents, two parental maps were constructed separately (data not shown) with JoinMap 4.0 software. The large number of bridge markers, 154 (amounting to 26.4% of total markers), made it easier to construct the consensus map between the two parents. However, the number of heterozygous loci in both parents was found to be very heterogeneous between the linkage groups, ranging from 0 for LG8 to 34 for LG9. LG8 and a large share of LG10 showed markers segregating in parent UF676 only. The resulting order of the loci and map distances between markers are shown graphically in the integrated linkage map (Fig. 1).

Fig. 1 Linkage map of Theobroma cacao containing 582 codominant markers including 115 new EST-SSR loci. This linkage map is based on cross UPA 402 × UF676 and contains 582 codominant markers (115 EST-SSRs, 269 SSRs, 177 RFLPs, 16 Rgene-RFLPs, and five isoenzymes). EST-SSRs are designated by a black dot. Polymorphic markers for UPA 402 only, for UF 676 only and for both parents are designated on the right, left and in the middle respectively. Markers showing distorted segregation ratios are denoted with a circle for UPA402 or an asterisk for UF676; *P < 0.1; **P < 0.05; ***P < 0.01; ****P < 0.005; *****P < 0.001. The segregation distortion regions (SDRs) are identified with braces

The complete map contains 582 codominant markers (115 EST-SSRs, 269 SSRs from genomic DNA, 177 RFLPs, five isoenzyme loci, and 16 Rgenes-RFLPs) arranged in ten linkage groups for a LOD of 5.0, corresponding to the haploid chromosome number of T. cacao. The total length of the map was estimated at 779.2 cM, comparable to the last reference map (Pugh et al. 2004) which was 782.8 cM in length. The average distance between markers on this map was evaluated at 1.3 cM against 1.7 cM before. A large variation was found in the number of mapped loci between linkage groups, ranging from 32 in LG7 to 80 in LG9. Marker density between the different linkage groups was relatively homogeneous, from 1.1 cM for LG5 to 1.8 cM for LG8. Only LG10 showed lower marker density with an average of 2.2 cM between markers. The current level of genome SSR coverage was estimated approximately at one SSR every 2 cM. A genetic map was constructed with the 384 SSRs and EST-SSRs only (data not shown); it accounted for 93.8% (730.6 cM) of the total length of the map.
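The summary figures in this paragraph follow directly from the marker counts and map lengths. The Python sketch below recomputes them as a consistency check; all values are taken from the text and the variable names are illustrative.

```python
# Marker counts per class and map lengths (cM), as given in the text.
markers = {"EST-SSR": 115, "genomic SSR": 269, "RFLP": 177,
           "isoenzyme": 5, "Rgene-RFLP": 16}
map_length_cm = 779.2
ssr_only_map_length_cm = 730.6

total_markers = sum(markers.values())
ssr_markers = markers["EST-SSR"] + markers["genomic SSR"]

# Average spacing between all markers and between SSR-type markers only.
mean_spacing_cm = round(map_length_cm / total_markers, 1)
ssr_spacing_cm = round(map_length_cm / ssr_markers, 1)

# Share of the total map length covered by the SSR-only map.
ssr_map_share_pct = round(100 * ssr_only_map_length_cm / map_length_cm, 1)
```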

EST-SSR markers were distributed throughout all linkage groups, but their distribution was not random. The number of EST-SSRs per group ranged from two in LG7 to 21 in LG9. The average spacing between EST-SSRs was heterogeneous, ranging from 4.3 cM for LG9 to 8.3 cM for LG2. LG7 and LG10 were very poor in EST-SSRs, the average spacing between two EST-SSRs in these groups being more than 24 cM. These two chromosomes were also those that incorporated fewer markers (32 and 33 markers, respectively).

In many plants, the level of polymorphism of EST-SSRs has been reported to be lower than that of genomic SSRs (Cho et al. 2000; Eujayl et al. 2001). In our study, the polymorphism revealed by the EST-SSRs (55.4% of loci) was similar to that found for genomic SSRs (52%). EST-SSR loci represented from 14% to 30% (with an average of 19.6%) of total loci on chromosomes. Only two chromosomes (LG7 and LG10) contained a very small proportion of EST-SSRs (6.3% and 9.1% of all markers respectively). One hypothesis explaining this situation would be that these chromosomes are less rich than the rest of the genome in coding regions or they have large homozygous regions that we cannot detect on the genetic map due to a lack of polymorphism.

The addition of new markers enabled us to increase the density of the previous map, but did not lead to the saturation of LG10. The non-homogeneous distribution of the polymorphism on the chromosomes can be explained by the dynamics of domestication (Pugh et al. 2004). Such selection can fix some segments of chromosome (Temnykh et al. 2000). LG8 was an exception because all the markers were mapped from UF676 only. The explanation is probably linked to the UPA402 pedigree. Indeed, UPA402 is an Upper-Amazon Forastero clone derived from a full-sib cross and probably includes several homozygous chromosome regions.

Segregation distortion

Of the 582 loci mapped, 88.5% followed the Mendelian ratio expected from the genotypes of the parents. A total of 67 markers (11.5%), comprising 19 RFLPs, two isoenzymes, one R gene, 24 SSRs, and 21 newly included EST-SSRs, deviated significantly from the expected ratio at P < 0.05.

After calculation of the chi-square test at P < 0.05, several segregation distortion regions (SDRs) appeared on the genetic map. An examination of the direction of segregation distortion showed that the loci were skewed for the male parent (UF 676) only, or for the female parent (UPA402) only.
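The test applied to each locus can be sketched in Python as a chi-square goodness-of-fit against the expected Mendelian ratio. This is a minimal illustration rather than the JoinMap implementation: the function name and the example genotype counts are hypothetical, and significance is judged against the standard tabulated critical values at P = 0.05 instead of an exact P value.

```python
# Standard chi-square critical values at P = 0.05 for 1, 2 and 3 degrees of freedom.
CRITICAL_005 = {1: 3.841, 2: 5.991, 3: 7.815}


def chi_square_segregation(observed, expected_ratio):
    """Return (statistic, distorted) for observed genotype class counts
    tested against an expected ratio such as (1, 1) or (1, 2, 1)."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    statistic = 0.0
    for count, part in zip(observed, expected_ratio):
        expected = total * part / ratio_sum
        statistic += (count - expected) ** 2 / expected
    degrees_of_freedom = len(observed) - 1
    return statistic, statistic > CRITICAL_005[degrees_of_freedom]
```

With 125 progeny, hypothetical counts of 60 and 65 for a 1:1 locus give a statistic of 0.2 and no evidence of distortion, whereas counts of 80 and 45 give 9.8, which exceeds the critical value of 3.841.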

A large proportion of the EST-SSR loci (18.3%) displayed distortion when compared with the expected Mendelian ratio. These new loci were twice as distorted as SSRs from genomic libraries (8.9%) and significantly more distorted than all other markers (11.4%). All these EST-SSR loci were clustered in SDRs in the genome. The genetic map constructed with new markers highlighted all the SDRs found on the previous map (Pugh et al. 2004). There were five major SDRs at P < 0.05, from UF676 in LG2, LG3, LG5, LG6, and from UPA402 in LG9. The segregation distortions from UPA402 were only found in LG9. Some markers distorted at P < 0.01 were detected around regions of high distortion and at the end of chromosomes (LG3, LG7, and LG9). Two other small SDRs were observed in LG8 and LG4 around the cluster of R genes. These non-random SDRs, due to preferential recombination in some regions of the genome, seemed specific to the crosses. The map constructed by Brown et al. (2005), using an F2 population produced from the cross between Sca6 and ICS1, showed very different SDRs. Only a distortion in a region of LG3 was common to both maps between mTcCIR 254 and mTcCIR 81 (12.1 cM). This SDR has been recognized as the largest for both maps.

The rate of segregation distortion on our map (11.5% of total markers) was slightly lower than on the map of Brown et al. (2005), where they found 14.8% of skewed markers at P < 0.05. These rates were relatively low compared with those found in other species such as potato (Gebhart et al. 1989) with 25.5% of skewed markers, or maize with 19% to 36% (Lu et al. 2002). The origin of these distortions in T. cacao remains unknown. The hypothesis most often proposed is the direct selection of genes closely linked to lethal or sub-lethal genes, or the presence of incompatibility alleles (Pugh et al. 2004). The presence of loci in segregation distortion in one or both parents could reveal chromosomal regions containing gametophytic factors subjected to selection (Lu et al. 2002). Distortions of this type have already been observed in other tree species (Barreneche et al. 1998; Dettori et al. 2001). Segregation distortions are generally very high for interspecific crosses (Paterson et al. 1988), but can also be high in some intraspecific crosses (Hall and Willis 2005). The extent of segregation distortion seems to be related to the type of cross and more particularly to the degree of genome divergence between the parents. In T. cacao, structural changes have never been reported and the genes responsible for gametic selection are not known.

Classification of EST-SSRs according to gene ontology

The mapped EST-SSRs were classified according to their similarity to Arabidopsis genes of known function, using the gene ontology tool (Conesa et al. 2005). Genes were distributed according to three criteria: cellular component, biological process, and molecular function. For each criterion, genes were grouped into different classes (ontology) for each level. The higher the level, the more precise the gene classification. A gene may be represented in several classes. Here, a large panel of genes was represented in the different criteria from level 2 or 3 (Fig. 2).

Fig. 2 Classification of EST-SSRs in three criteria according to gene ontology. a Biological process (level 2 + 3). b Cellular component (level 3). c Molecular function (level 3)

Mapping of EST-SSRs involved in cocoa quality traits or plant defense

Improving or maintaining the aromatic qualities of chocolate and providing sustainable resistance to very devastating diseases, especially those caused by Moniliophthora or Phytophthora pathogens, are among the main goals of breeding programs. Of the 115 new EST-SSRs mapped, 14 are located in good candidate genes that could participate in plant defense/resistance mechanisms or cocoa quality elaboration.

Eight EST-SSRs were located in genes putatively involved in biochemical processes important for the quality of chocolate. mTcCIR 297 (oleosin), 305 (acyl-ACP thioesterase), 306 (3-ketoacyl-ACP synthase), and 310 (24-sterol C-methyltransferase) were located in genes involved in lipid biosynthesis and seed storage. Cocoa seed storage lipids account for 50% of dry seed weight and participate in enhancing cocoa flavor. mTcCIR 298 and mTcCIR 304 were located in vicilin and aspartic proteinase genes, respectively. Vicilin is one of the major seed storage proteins. During fermentation, and after an increase in acidity, aspartic proteinases participate in the proteolytic digestion of vicilin, enabling the formation of cocoa-specific aroma precursors (Voigt et al. 1994). Another important class of biochemical compounds involved in chocolate quality is the flavonoids, a diverse group that plays an important role in plant development (Pourcel et al. 2007). In cocoa, proanthocyanidins have been shown to be beneficial to human health by improving cardiovascular condition (Rimbach et al. 2009). The EST-SSR approach has led to the mapping of a putative dihydroflavonol reductase enzyme and a WD-repeat family transduction factor involved in phenylpropanoid biosynthesis regulation (mTcCIR 299 and 429).

Another important goal for cocoa growing and production is a reduction in yield losses due to fungal diseases. Three pathogens, Phytophthora sp. (black pod), M. perniciosa (witches’ broom), and Moniliophthora roreri (frosty pod) cause yield losses of 20% to 80% in the different production regions worldwide.

This study led to the mapping of seven EST-SSRs located in genes putatively involved in plant defense or resistance. Six of them were located in genes involved in conventional reactions that lead to defense activation in plants: glutamine synthase (mTcCIR 292), pathogenesis-related protein (mTcCIR 293) (Stintzi et al. 1993), leucine-rich repeat protein (mTcCIR 301), which is involved in protein/protein interactions and could play a role in plant resistance to pathogens (Fritz-Laylin et al. 2005), bacterial-induced lipoxygenase (mTcCIR 351), which is activated in plant-pathogen interactions (Melan et al. 1993; Véronési et al. 1996), calmodulin-binding protein (mTcCIR 433), and a putative WRKY transcription factor (mTcCIR 439). The seventh (mTcCIR 304) was located in a gene encoding vicilin, a 7S globulin protein stored in the seed. This protein is very important for cocoa aromatic quality and is also involved in seed defense as an antifungal agent in some plants (Marcus et al. 1999; Ribeiro et al. 2007). This role has also been described in Malva parviflora, which belongs to the same family as T. cacao (Wang et al. 2001).

Interestingly, the majority of these markers (mTcCIR 292, 293, 301, 304, and 439) were colocalized with quantitative trait loci (QTL) related to resistance identified in a meta-QTL analysis of disease resistance traits recently conducted in T. cacao (Lanaud et al. 2009).

These EST-SSRs could be considered as potential candidate genes for studying resistance and quality traits.

EST-SSRs from genomic sequences

To increase the number of EST-SSRs, we compared the SSRs (mTcCIR 1 to mTcCIR 291) previously identified in genomic sequences and already published (Lanaud et al. 1999; Pugh et al. 2004) with cocoa ESTs present in international databases. In total, 249 genomic sequences containing an SSR were analyzed with the FASTA program (Pearson and Lipman 1988). Fifty-six sequences, or 22.5% of the genomic sequences tested, revealed similarity with cocoa ESTs. The sequence homology was often partial, which may have been due to a common domain between two genes or to the presence of introns in the genomic sequence. Nineteen (7.6%) genomic sequences (genomic SSRs) were found to have complete homology with a cacao EST (Table 5).

Table 5 EST-SSRs from genomic sequences

Conclusions

With the development and mapping of a new set of 115 EST-SSR markers, which were integrated with a set of previously mapped codominant markers (SSRs, RFLPs, R genes-RFLPs, and isoenzymes), a new linkage map was obtained with 582 codominant markers divided into ten linkage groups. This map contains 134 EST-SSRs (23% of total markers), 115 from EST sequences and 19 from genomic sequences. This map includes a set of 384 SSRs, molecular markers easily transferable to other laboratories, especially in cocoa-producing countries. This large number of codominant markers will be useful for all genetic analyses, such as fingerprinting, mapping, QTL detection, diversity analyses, and marker-assisted selection in T. cacao. The increased density of SSR markers on the genetic map will facilitate association mapping studies aimed at the identification of marker/trait associations in natural populations or collections, in which a larger number of generations of recombination has occurred between ancestors than in a controlled progeny. EST-SSRs also represent useful tools to study the functional diversity of populations. Indeed, they could have a role in gene expression or function, and depending on their position (5′UTR, coding region, 3′UTR), they could affect gene transcription, activate or delete genes, and be responsible for gene silencing or transcription slippage (Varshney et al. 2005). Due to the conservation between species belonging to the same family, they are also useful tools for comparative genomic and phylogenetic studies.