Introduction

Although the genus Coffea contains approximately 80 taxa (Bridson and Verdcourt, 1988), only two species are commercially exploited: Coffea arabica L. and Coffea canephora Pierre ex Froehner. C. arabica is an autogamous species and the only allotetraploid (2n = 4x = 44) of the genus, whereas the other species are diploid (2n = 2x = 22) and generally self-incompatible (Krug and Carvalho, 1951). C. arabica is native to the highlands of Ethiopia (Sylvain, 1955), but there are also records of wild arabica coffee on the Boma Plateau of Sudan (Thomas, 1942) and on Mount Marsabit in Kenya (Berthaud and Charrier, 1988).

Cultivation of arabica coffee started in Arabia, specifically in Yemen, five centuries ago. In the early 18th century, progenies of a single plant from Indonesia that had been cultivated in Europe were spread to South America and became the genetic basis of the main cultivars of Brazil and other countries (Chevalier and Dagron, 1928; Carvalho, 1945).

In view of this narrow genetic basis and the need to conserve genetic resources, FAO organized a mission in 1964–65 to collect spontaneous and subspontaneous coffee germplasm in the probable native regions of the species (FAO, 1968). This initiative resulted in 621 seed samples from various collecting sites in Ethiopia and the Republic of Eritrea, which were carefully documented and sent to six institutions in India, Tanzania, Ethiopia, Costa Rica, Peru and Portugal for further research (FAO, 1968).

The accessions in the Coffea Germplasm Collection of Instituto Agronômico de Campinas (IAC) include plant material originating from 308 of those seed samples, obtained from plants of the Centro Agronómico Tropical de Investigación y Enseñanza (CATIE) at Turrialba, Costa Rica, and collected in 1973 and 1987. The IAC collection also holds a substantial amount of cultivated coffee germplasm. In addition, several samples of plants from Yemen were introduced by A.B. Eskes, a researcher from IRCC, Montpellier, after a detailed morphological and agronomic characterization (Eskes and Mukred, 1990).

Since their introduction at IAC, all these materials have been under continuous evaluation for botanical, agronomic and technological traits (Carvalho et al., 1983; Mazzafera et al., 1990; Silvarolla et al., 2000). However, despite these studies, the degree and structure of genetic diversity in the IAC collection remain poorly understood. Molecular markers could help reveal the structure of genetic diversity, which is important information for setting up a core collection and for improving the conservation, accessibility and use of the genetic resources in this collection (Hamon et al., 1999).

Previous studies have evaluated the genetic diversity of some Ethiopian and FAO accessions using morphological and agronomic characteristics (Montagnon and Bouharmont, 1996), RAPD markers (Lashermes et al., 1996; Anthony et al., 2001; Chaparro et al., 2004; Aga et al., 2003) and ISSR markers (Aga et al., 2005). However, only a few accessions from Yemen were analyzed in these studies (Montagnon and Bouharmont, 1996; Lashermes et al., 1996; Anthony et al., 2002), and different results were reported regarding the grouping and genetic structure of accessions collected in different regions of Ethiopia and Yemen. Chaparro et al. (2004) did not find an association between grouping and the site of origin in Ethiopia. Montagnon and Bouharmont (1996) showed a separation between coffee trees growing east and west of the Great Rift Valley, with accessions from south–eastern and southern Ethiopia grouped together with cultivated plants from Yemen. Anthony et al. (2001) and Aga et al. (2003, 2005) also distinguished eastern and western groups among accessions from Ethiopia, but with low genetic differentiation between them and with cultivated plants forming a separate group (Anthony et al., 2001).

Microsatellite markers (SSR) have proved efficient for assessing genetic diversity and relationships among coffee trees (Moncada and McCouch 2004; Maluf et al., 2005). We therefore considered that SSR could be an alternative tool to elucidate the genetic structure of C. arabica as well as to characterize the diversity levels of the genetic resources in the Coffea Germplasm Collection of IAC. Furthermore, this is the first study to use microsatellite markers to analyze a representative sample of both materials collected from the center of origin and diversity of C. arabica and cultivated plants from Yemen and Brazil.

Material and methods

Plant material

The evaluated plant material consisted of 115 coffee accessions from the Coffea Germplasm Collection of IAC. These included 73 accessions of C. arabica L. derived from spontaneous and subspontaneous trees from various regions of Ethiopia and Eritrea (Fig. 1), 13 accessions of C. arabica from Yemen, 1 accession of C. arabica cv. ‘Geisha’ (Table 1), and 13 commercial cultivars of C. arabica developed by the IAC Breeding Program (Table 2). In addition, 5 accessions of C. eugenioides S. Moore, 4 accessions of C. racemosa Lour. and 6 accessions of Coffea canephora Pierre ex Froehner were included as outgroup species because of their importance in the evolutionary history of coffee and as sources of target genes for breeding programs. The sample of spontaneous and subspontaneous accessions from Ethiopia and Eritrea was representative of the total number of such accessions in the Germplasm Collection of IAC.

Fig. 1 Collecting sites of the Coffea arabica accessions evaluated. The political organization of Ethiopia follows FAO (1968), except for the separation of Eritrea

Table 1 List of Coffea arabica accessions from the Coffea Germplasm Collection of IAC analyzed with SSR markers. “E” identifies material collected in Ethiopia and Eritrea by FAO and “T” identifies material from the CATIE collection. Spontaneous (spont) and subspontaneous-derived (sub) material originated from a single plant, a random sample or a representative sample according to FAO (1968)
Table 2 List of commercial Coffea arabica cultivars developed by IAC evaluated through SSR markers

Genomic DNA extraction

Total genomic DNA was extracted from frozen young leaves according to Stewart Jr. and Via (1993), using CTAB as detergent. All DNA samples were diluted to a final concentration of 20 ng/μl.

SSR amplification

Primer sequences to amplify SSR loci were obtained from Combes et al. (2000) and Rovelli et al. (2000) (Table 3). Final reaction conditions were 40 ng of genomic DNA, 1.5× reaction buffer, 0.1 mmol l−1 of dNTP, 2 mmol l−1 of MgCl2, 5 pmol of each primer and 1.5 U of Taq DNA polymerase. The complete thermal cycle program was 5 min at 95°C, followed by 30 cycles of 1 min at 95°C, 1 min at 60°C and 1 min at 72°C, and a final elongation of 5 min at 72°C.

Table 3 GenBank accession numbers of microsatellite loci, locus codes and respective forward (F) and reverse (R) primer sequences

Primers were fluorescently labelled and amplified products were separated on a 5% acrylamide gel using an ABI 377 automated sequencer (Applied Biosystems). Products were detected with the Gene Scan software (Applied Biosystems) using internal molecular size markers (GENESCAN-500 ROX), and analyzed with the Genotyper v 2.0 software (Applied Biosystems).

Data analysis

Gels were scored by presence or absence of bands or alleles. Due to the tetraploid condition of C. arabica, it is impossible to distinguish between diallelic duplex and simplex and among different types of triallelic combinations of SSR loci. Therefore, for each individual plant, fragment frequencies were analyzed as multilocus fingerprints, in which each allele was either scored present or absent.
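The scoring scheme described above can be illustrated with a minimal sketch. The accession names, locus names and fragment sizes below are hypothetical, and the function name `fingerprint_matrix` is introduced here only for illustration; it is not the software used in this study.

```python
from typing import Dict, List, Set, Tuple

Calls = Dict[str, Dict[str, Set[int]]]  # accession -> locus -> fragment sizes (bp)

def fingerprint_matrix(calls: Calls) -> Tuple[List[Tuple[str, int]], Dict[str, List[int]]]:
    """Convert fragment calls into presence (1) / absence (0) vectors."""
    # Every (locus, fragment size) pair seen in any accession becomes one band.
    bands = sorted({(locus, size)
                    for loci in calls.values()
                    for locus, sizes in loci.items()
                    for size in sizes})
    # Dosage is ignored: a band is scored 1 whether it is carried once or more.
    matrix = {
        acc: [1 if size in loci.get(locus, set()) else 0 for locus, size in bands]
        for acc, loci in calls.items()
    }
    return bands, matrix

# Hypothetical fragment calls for three accessions at two loci.
calls = {
    "acc1": {"locA": {97}, "locB": {120, 126}},
    "acc2": {"locA": {97}, "locB": {120}},
    "acc3": {"locA": {97, 101}, "locB": {126, 130}},
}
bands, matrix = fingerprint_matrix(calls)
```

Each row of `matrix` is a multilocus fingerprint; because only presence or absence is recorded, the allele dosage that cannot be resolved in a tetraploid is never needed.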

Genetic diversity within groups of C. arabica and within diploid species was evaluated by the average number of alleles per locus (A), the proportion of polymorphic loci (P) and Shannon’s genetic index (H′) (Bussel, 1999). Groups of C. arabica consisted of accessions from the same collecting region (Table 1) and of the commercial cultivars developed by IAC. Cultivar ‘Geisha’ was excluded from the genetic diversity analysis because it was represented by only one accession. The estimation of A was an exception to the multilocus fingerprint approach, since the amplified products of each primer pair were considered alleles of the same locus. P was calculated by dividing the number of polymorphic bands by the total number of amplified bands in each group. Shannon’s genetic index for each marker was calculated for each group as:

$$H' = -\sum_i p_i \log_2 p_i$$

where $p_i$ is the frequency of the presence or absence of a band in that group.
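As a rough illustration of these two indices, the sketch below computes H′ per band and P for one group from a presence/absence matrix. It assumes that the sum in H′ runs over the two states of a band (present and absent), which is our reading of the definition above; the example data and function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[int]]  # rows = accessions of one group, columns = bands

def shannon_band(column: List[int]) -> float:
    """Shannon index of one band from the frequencies of presence and absence."""
    p = sum(column) / len(column)
    # A state with zero frequency contributes nothing to the sum.
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

def group_diversity(rows: Matrix) -> dict:
    columns = [list(col) for col in zip(*rows)]
    freqs = [sum(col) / len(col) for col in columns]
    amplified = [f for f in freqs if f > 0]              # bands present in the group
    polymorphic = [f for f in amplified if f < 1]        # present in some, not all
    return {
        # H' is averaged over all bands, including monomorphic ones.
        "H": sum(shannon_band(col) for col in columns) / len(columns),
        "P": len(polymorphic) / len(amplified),
    }

# Hypothetical group of four accessions scored for five bands.
group = [
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0],
]
result = group_diversity(group)
```

In this toy group the first band is monomorphic and contributes zero to H′, while the other four bands are polymorphic, so P is 4/5.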

Following the method of Bussel (1999), the partitioning of genetic variation within and between groups of C. arabica was estimated. The average diversity over all populations for each locus (Hpop) and the total diversity in the 99 accessions of C. arabica for each locus (Hsp) were calculated (for more details see Bussel, 1999). Then the component of diversity within populations (Hpop/Hsp) and the component between populations (GST = (Hsp − Hpop)/Hsp) were estimated.

H′, Hpop, Hsp and GST are average per-locus values calculated over all loci, including monomorphic ones, according to Bussel (1999). The partitioning of genetic diversity was carried out for four sets of accessions: all accessions (Analysis 1), all accessions without the Eritrea group (Analysis 2), spontaneous and subspontaneous accessions (Analysis 3), and spontaneous and subspontaneous accessions without the Eritrea group (Analysis 4). The Eritrea group was excluded from Analyses 2 and 4 because it contains just one individual, which could bias GST values by decreasing Hpop; excluding it improves understanding of the genetic structure of the species.
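The partitioning can be sketched as follows. This is an illustrative reading, not the original computation: it assumes that Hpop and Hsp are first averaged over all bands and that GST is then taken from those averages, and the group names and data are hypothetical. The last lines show why a group with a single individual, whose within-group diversity is necessarily zero, lowers Hpop and raises GST.

```python
import math
from typing import Dict, List

Matrix = List[List[int]]  # rows = accessions, columns = bands

def band_shannon(rows: Matrix, j: int) -> float:
    """Shannon index of band j from its presence/absence frequencies."""
    p = sum(r[j] for r in rows) / len(rows)
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

def partition(groups: Dict[str, Matrix]) -> Dict[str, float]:
    pooled = [row for rows in groups.values() for row in rows]
    n_bands = len(pooled[0])
    # Hpop: within-group diversity, averaged over groups and then over bands.
    h_pop = sum(
        sum(band_shannon(rows, j) for rows in groups.values()) / len(groups)
        for j in range(n_bands)
    ) / n_bands
    # Hsp: diversity of the pooled accessions, averaged over bands.
    h_sp = sum(band_shannon(pooled, j) for j in range(n_bands)) / n_bands
    return {"Hpop": h_pop, "Hsp": h_sp, "GST": (h_sp - h_pop) / h_sp}

# Fully differentiated groups: all diversity lies between groups.
split = partition({"west": [[1, 0], [1, 0]], "east": [[0, 1], [0, 1]]})
# Identical groups: all diversity lies within groups.
mixed = partition({"a": [[1, 0], [0, 1]], "b": [[1, 0], [0, 1]]})
# Adding a one-accession group lowers Hpop and therefore raises GST.
with_singleton = partition({"a": [[1, 0], [0, 1]], "b": [[1, 0], [0, 1]],
                            "single": [[1, 0]]})
```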

Genetic distance among all accessions (including C. arabica cv. ‘Geisha’) was estimated as the complement of Jaccard’s (1908) coefficient (Link et al., 1995). Genetic distances were also estimated using the Dice coefficient (Dice, 1945), which is equivalent to that of Nei and Li (1979), in order to compare values with other studies. Cluster analysis was performed with the UPGMA method on the distance matrix based on the complement of Jaccard’s coefficient. Bootstrap analysis (Felsenstein, 1985) with 1,000 replicates was performed to evaluate the reliability of the tree topology, using the software Treecon (Van de Peer and Watcher, 1994).
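For readers unfamiliar with these measures, the sketch below computes both distances from binary fingerprints and runs a bare-bones UPGMA on the Jaccard distances. It is a stand-in written for illustration, not the Treecon implementation used here, it omits the bootstrap step, and the three profiles are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

def _counts(x: Sequence[int], y: Sequence[int]) -> Tuple[int, int]:
    shared = sum(1 for i, j in zip(x, y) if i and j)                # bands in both
    unshared = sum(1 for i, j in zip(x, y) if bool(i) != bool(j))   # bands in one only
    return shared, unshared

def jaccard_distance(x: Sequence[int], y: Sequence[int]) -> float:
    a, u = _counts(x, y)
    return u / (a + u) if a + u else 0.0

def dice_distance(x: Sequence[int], y: Sequence[int]) -> float:
    a, u = _counts(x, y)
    return u / (2 * a + u) if a + u else 0.0

def upgma(profiles: Dict[str, List[int]]) -> list:
    """Return the merges as (cluster_a, cluster_b, node_height) in merge order."""
    sizes = {name: 1 for name in profiles}  # cluster label -> number of members
    dist = {frozenset(p): jaccard_distance(profiles[p[0]], profiles[p[1]])
            for p in combinations(profiles, 2)}
    merges = []
    while len(sizes) > 1:
        # Join the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: dist[frozenset(p)])
        height = dist[frozenset((a, b))] / 2
        na, nb = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance to the new cluster is the size-weighted mean of the old ones.
        for c in sizes:
            dist[frozenset((merged, c))] = (
                na * dist[frozenset((a, c))] + nb * dist[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb
        merges.append((a, b, height))
    return merges

# Hypothetical fingerprints of three accessions scored for five bands.
profiles = {"A": [1, 1, 1, 0, 0], "B": [1, 1, 0, 0, 0], "C": [0, 0, 1, 1, 1]}
merges = upgma(profiles)
```

Shared absences are ignored by both coefficients, which suits presence/absence data in which the absence of a band in two accessions is weak evidence of similarity.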

Results

The multilocus fingerprint approach was used to analyze the genetic diversity and structure of coffee accessions. Nevertheless, specific results for each primer pair are reported to characterize the individual SSR loci.

SSR locus characterization

The E12-3CTG (SSR9) locus showed a profile with 3 and 4 peaks even in diploid species such as C. canephora and C. eugenioides. The 4-1CTG (SSR5) locus was monomorphic (band of 97 bp) in C. arabica and C. eugenioides and did not amplify any fragment in C. racemosa and C. canephora species. Primers E6-3CTG, E8-3CTG and M3 also amplified monomorphic bands within C. arabica accessions (a total of 12 bands), and those were polymorphic among evaluated species.

Genetic diversity

Sixteen SSR primer pairs produced a total of 121 bands or alleles. All tested markers detected polymorphisms among the accessions evaluated; 54 bands were polymorphic among C. arabica accessions and 15 among cultivated plants (Yemen, ‘Geisha’ and Brazilian cultivars). Of the 54 polymorphic bands of C. arabica, 7 were present at high frequency (0.8 or more), 31 were rare alleles at low frequency (0.2 or less) and 16 had frequencies between 0.2 and 0.8.
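The frequency classes used above can be expressed as a short sketch. The thresholds come from the text; the function name and the list of example frequencies are hypothetical. The final lines only re-derive the share of intermediate-frequency bands from the reported counts.

```python
from typing import Dict, List

def frequency_classes(freqs: List[float], low: float = 0.2,
                      high: float = 0.8) -> Dict[str, int]:
    """Count bands that are frequent (>= high), rare (<= low) or intermediate."""
    counts = {"high": 0, "rare": 0, "intermediate": 0}
    for f in freqs:
        if f >= high:
            counts["high"] += 1
        elif f <= low:
            counts["rare"] += 1
        else:
            counts["intermediate"] += 1
    return counts

# Hypothetical frequencies of six polymorphic bands.
example = frequency_classes([0.95, 0.80, 0.10, 0.20, 0.50, 0.35])

# Counts reported in the text for the 54 polymorphic bands of C. arabica.
reported = {"high": 7, "rare": 31, "intermediate": 16}
share_intermediate = 100 * reported["intermediate"] / sum(reported.values())
```

With the reported counts, the intermediate class amounts to 16 of 54 bands, or about 29.6%.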

Genetic diversity analysis showed the highest values of H′ in diploid species (Table 4). Values of A in diploid species were similar to or even lower than those observed in C. arabica groups, probably because of the allotetraploid nature of C. arabica, which duplicates A values per plant. H′ values were relatively high in accessions from Kaffa, Illubabor and Sidamo provinces, moderate in Gojjam and Harar, and very low in cultivated plants. The P index was higher in diploid species and within the Kaffa and Illubabor groups. The discrepancy of H′ relative to A and P values in the Kaffa and Illubabor groups was due to the presence of rare alleles (Table 4).

Table 4 Genetic diversity within species and groups of coffee species assessed by average number of alleles per locus (A), proportion of polymorphic loci (P) and Shannon’s genetic index averaged over all loci (H′)

Genetic distances ranged from 0 to 0.88 between all possible pairs of genotypes, from 0 to 0.37 among C. arabica accessions, from 0 to 0.30 among spontaneous and subspontaneous accessions from Ethiopia and Eritrea, and from 0 to 0.19 among cultivated accessions (Fig. 1). Genetic distances among C. arabica accessions were also calculated using only the polymorphic bands exclusive to C. arabica (44.6% of total fragments). Results showed that this distance ranged from 0 to 0.65 using Jaccard’s coefficient or its complement and from 0 to 0.49 using Dice’s coefficient.
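The upper bounds obtained with the two coefficients can be cross-checked with a standard identity. In the worked equation below, a is the number of bands shared by two accessions and u the number present in only one of them; these symbols are introduced here for illustration and do not appear elsewhere in the text.

```latex
\begin{align*}
d_J &= \frac{u}{a+u}, \qquad d_D = \frac{u}{2a+u},\\
d_D &= \frac{d_J}{2-d_J},\\
d_J = 0.65 \;&\Rightarrow\; d_D = \frac{0.65}{1.35} \approx 0.48 .
\end{align*}
```

This is close to the reported upper bound of 0.49 for the Dice coefficient; the small difference is compatible with rounding of the reported values.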

Genetic structure

GST values showed a strong genetic structure in all accessions of C. arabica (Table 5). A strong genetic structure was also observed in spontaneous and subspontaneous accessions from Ethiopia. Exclusion of the Eritrea group, which contains just one individual, increased Hpop values and decreased GST values, but the results still showed a strong genetic structure in the evaluated accessions (Table 5).

Table 5 Partitioning of genetic diversity generated by 121 SSR bands within and between groups of Coffea arabica accessions. Hpop, Hsp and GST are the average per locus values calculated over all loci

The hierarchical clustering analysis presented in Fig. 2 showed four major clusters comprising grouped accessions of each species. The analysis of species relationships showed C. canephora closer to C. arabica, followed by C. eugenioides. Also, C. racemosa was distantly related to C. arabica.

Fig. 2 Dendrogram of the 115 Coffea accessions listed in Tables 1 and 2 based on Jaccard genetic distance obtained from SSR markers using the UPGMA method. Numbers (%) on the branches correspond to bootstrap values above 50% (1,000 replications). Letters indicate the geographical origin of accessions: W (West of the Great Rift Valley) and E (East of the Great Rift Valley)

Grouping of C. arabica accessions (Fig. 2) revealed two main clusters. The first contains only cultivated plants, including the accessions from Yemen, cultivar ‘Geisha’ and commercial cultivars from Brazil. An exception was the accession from Sidamo (SI4), which was also included in this group with materials from Yemen. Yemen accessions of type Tessawi (Table 1) were distinguished from the others in a separate cluster. The second group encompassed essentially all spontaneous and subspontaneous accessions from Ethiopia and Eritrea. Despite the low bootstrap values (Fig. 2), there was a clear separation of these accessions into two large subgroups: the first included mainly accessions from Sidamo and the second included all other accessions from the west side of the Great Rift Valley.

Discussion

SSR locus characterization

In general, the amplification patterns for each SSR locus evaluated corresponded to those previously reported. The profile of the E12-3CTG (SSR9) locus, with 3 and 4 peaks in the diploid species C. canephora and C. eugenioides, confirmed that this primer pair amplifies a number of independent loci (Rovelli et al., 2000). For this reason, the E12-3CTG locus was not used to estimate the number of alleles per locus (A).

The locus 4-1CTG (SSR5) was monomorphic in C. arabica and C. eugenioides and did not amplify any fragment in C. racemosa and C. canephora. According to Rovelli et al. (2000), this locus showed diploid-type segregation in C. arabica. Thus, it is probable that amplification occurred only in the putative genome provided by C. eugenioides (Lashermes et al., 1999). However, in another analysis, Poncet et al. (2004), using different primer sequences to amplify this same SSR locus, identified a monomorphic band of 239 bp also in C. canephora and other diploid species, including C. eugenioides. This new primer pair amplified the total SSR sequence deposited in GenBank, while the primers of Rovelli et al. (2000) amplified only part of the SSR sequence.

These results suggest that the 4-1CTG locus is present in C. arabica, C. eugenioides, C. canephora and other Coffea species (see Poncet et al., 2004). However, there is an interspecific polymorphism in the flanking regions of the SSR that prevents amplification in C. canephora, C. racemosa and the putative C. canephora genome of C. arabica (Lashermes et al., 1999). Interestingly, Rovelli et al. (2000) identified heterozygotes in accessions of C. arabica var. Caturra, showing a polymorphism in repeat motifs within cultivated plants of C. arabica that was not observed in our study.

Genetic diversity

The high variability detected in spontaneous and subspontaneous accessions of C. arabica was observed mainly in coffee trees from Sidamo, Kaffa and Illubabor provinces, although it must also be noted that 50% of accessions were sampled from the Kaffa and Illubabor regions. Significant levels of genetic diversity in coffee plants from Kaffa and Illubabor were also reported by Anthony et al. (2001) and Chaparro et al. (2004). Indeed, the high variability among accessions from these regions in the collections is a consequence of the great effort made to collect samples with as much visual, botanical and agronomic diversity as possible (FAO, 1968). The high genetic diversity detected in Sidamo accessions could be visualized by the proportion of shared alleles among genotypes within both divergent groups of C. arabica, cultivated plants and spontaneous and subspontaneous accessions from Ethiopia.

The genetic distance values among C. arabica accessions were similar to those reported by Orozco-Castillo et al. (1994) and Anthony et al. (2001, 2002), but lower than those of Moncada and McCouch (2004). Genetic diversity evaluated by allele distribution (29.6% of alleles with frequencies between 0.2 and 0.8) revealed a degree of diversity intermediate between that identified by Anthony et al. (2001) and that found by Chaparro et al. (2004), with 17 and 39.6% of alleles with frequencies between 0.2 and 0.8, respectively.

The proportion of polymorphic loci and the Shannon’s index values estimated in this study are in the same range as those of a similar analysis of arabica coffee in natural populations from Ethiopia (Aga et al., 2003). These authors determined P and H′ values ranging from 37% to 73% and from 0.2 to 0.4, respectively, in 9 plants per population. However, their P and H′ values were calculated from bands amplified in C. arabica, while in the present study all amplified bands, including those specific to diploid species and those monomorphic in C. arabica, were included in the index calculation. If only bands amplified in C. arabica were considered, the values for the Kaffa group, for example, would be P = 55.6% and H′ = 0.328, identical to those observed by Aga et al. (2003). In addition, the observed genetic distances among accessions were similar in both studies.

Therefore, these results suggest that there is significant genetic diversity in the coffee collection of IAC. Furthermore, the genetic diversity estimated for Ethiopian accessions was higher than that of cultivated plants. Thus, there is a large variation that can be a source for the introgression of desirable characteristics into commercial cultivars, and that has already been used by the coffee breeding program of IAC (Bettencourt and Carvalho, 1968; Moraes et al., 1974; Fazuoli, 1981; Silvarolla et al., 2004).

Genetic structure

The amplification results for the 4-1CTG locus in C. arabica, C. eugenioides and C. canephora and the species relationships observed in this work agree with the hypothesis that C. arabica originated as a natural hybrid between C. canephora and C. eugenioides (Lashermes et al., 1999). Despite contrasting results concerning which of the two species is closer to C. arabica, several molecular marker and phylogenetic studies have recognized C. canephora and C. eugenioides as more closely related to C. arabica than other Coffea species (Lashermes et al., 1993, 1997; Cros et al., 1998; Ruas et al., 2000, 2003). These diversity analyses also identified C. racemosa as the species most genetically distant from C. arabica.

According to Bussel (1999), GST calculated using Shannon’s index is in accordance with GST estimated by other methods, such as AMOVA and modified F-statistics. GST is also a good estimate for evaluating the genetic structure of tetraploid, autogamous collections such as this one of C. arabica plants, where heterozygosity cannot be assessed and Hardy–Weinberg equilibrium cannot be assumed. The GST values of C. arabica observed in this work fitted the values expected from the breeding system: according to Bussel (1999), GST values in natural populations are around 60% for autogamous species and around 15% for allogamous species.

The GST results for C. arabica accessions, together with the cluster analysis, indicated a strong genetic structure in the species; that is, genetic diversity was observed among groups rather than within groups. Cluster analysis clearly separated cultivated plants from Ethiopian and Eritrean spontaneous and subspontaneous accessions. In addition, the analysis distinguished a morphological type from Yemen (Tessawi), and Yemen accessions from Brazilian cultivars, although both groups exhibited very low genetic diversity. These results agree with the well-described narrow genetic basis of cultivated plants of C. arabica (Lashermes et al., 1996; Anthony et al., 2001; Moncada and McCouch 2004; Maluf et al., 2005) and with historical data. Brazilian coffee originated from a few plants introduced in the early 18th century, and these plants derived from the first cultivated plants in Yemen (Chevalier and Dagron, 1928; Carvalho, 1945).

Partitioning of the genetic diversity of the spontaneous and subspontaneous accessions showed a lower value of GST (0.464) than that observed when all accessions were analyzed. However, the GST value still indicated a strong genetic structure in these accessions. Accordingly, accessions from Gojjam were grouped together and there was a separation between Sidamo accessions and those from the opposite side of the Great Rift Valley: Kaffa, Gojjam and Illubabor. However, these groupings were not supported by high bootstrap values, indicating that the clusters were established based on a few distinct markers.

A genetic structure and low differentiation between southern/south–eastern and south–western coffee trees of Ethiopia were also found by Anthony et al. (2001) and Aga et al. (2003, 2005). Taken together, these analyses may support the hypothesis that southern and south–eastern coffee trees of Ethiopia were introduced from the South West (Montagnon and Bouharmont, 1996; Anthony et al., 2001).

Based on morphological and agronomic traits, Montagnon and Bouharmont (1996) were the first to point out a genetic structure in C. arabica accessions, with a division between accessions from the west and east sides of the Great Rift Valley. However, accessions from the east side were grouped with cultivated plants from Yemen. According to Montagnon and Bouharmont (1996), most of the evaluated characteristics were affected by domestication, and the similarity found could result from the fact that plants collected from Sidamo, Harar and other provinces of the East are cultivated rather than wild-type, since primary forests were eradicated in this region. The authors then suggested two hypotheses to explain the origin of the cultivated plants of C. arabica: eastern plants could have been introduced from the West, or they could have been selected from wild-type plants situated to the east of the Rift. Nevertheless, the authors considered the existence of a single center of wild coffee trees in Ethiopia, located west of the Great Rift Valley, to be unlikely. They also suggested that the arabica plants transferred to Yemen could have originated from south–eastern Ethiopia.

In the present study, cluster analysis showed accessions from Sidamo to be more closely related to the cultivated plants, with one accession from Sidamo (SI4) grouped within the Yemen group. Similar results were obtained by Moncada and McCouch (2004). It is interesting that the accession from Eritrea and two accessions from Shoa clustered with the Sidamo accessions (Fig. 2). According to FAO (1968), E579 (ER) was probably introduced into Eritrea from Yemen, E37 (SH3) was a cultivated plant, and E16 (SH1) was also a cultivated plant, originating from Harar seeds. This observed similarity between accessions from Yemen and from eastern Ethiopia agrees with the results of Montagnon and Bouharmont (1996). Likewise, these relationships corroborate the putative separation between Ethiopian accessions from east and west of the Great Rift Valley, but with low differentiation between the two groups (Anthony et al., 2001).

The cultivar ‘Geisha’, derived from plants from Kaffa province (Jones, 1956; FAO, 1968), was grouped with accessions from Yemen and was very distant from Kaffa accessions. This suggests that plants of the same origin were genetically separated by the domestication process and that the West is the primary origin of plants from the East.

Hence, our results suggest that there was a straight evolutionary pathway in the domestication of C. arabica trees. The species originated in the south–western highlands of Ethiopia and was thereafter introduced into the South and South East (by natural colonization and/or by humans). Later on, some plants were transferred by Arabs to Yemen. After a long period in Yemen, these cultivated plants were spread around the world. However, the possible occurrence of wild plants in the south and south–east of Ethiopia cannot be ruled out.

SSR markers confirmed previously reported phylogenetic relationships among Coffea species and proved to be an efficient method for analyzing genetic diversity and structure within C. arabica. Our results agreed with the well-described narrow genetic basis of cultivated plants of C. arabica and showed significant genetic diversity in the accessions from Ethiopia and Eritrea included in the Coffea Germplasm Collection of IAC. These results indicate the importance of non-cultivated accessions as a source of genetic variability for coffee improvement.