Introduction

The Coffea genus belongs to the Rubiaceae family and includes around 124 species, most of which are diploid (2n = 2x = 22). The only allotetraploid is C. arabica L. (2n = 4x = 44) (Davis et al. 2011), which originated from a natural cross between Coffea eugenioides S. Moore and C. canephora Pierre ex A. Froehner (Lashermes et al. 1999); it is autogamous, with approximately 10% cross-fertilization (Carvalho and Krug 1949). C. arabica and C. canephora are the most important agronomic species, supplying 63% and 37% of world coffee production in 2016, respectively (ICO 2017).

One of the main objectives of breeding programs is to create more productive cultivars, adapted to the local conditions of interest. Some limitations faced by coffee breeders are the long time (about 25 years) and the considerable resources needed to develop new varieties due to the perennial nature of these species (Moreno 2004). An important challenge is the reduced genetic variability available in commercial plantations (Moncada et al. 2016).

Coffee plantations in Mexico include the cultivars Typica, Bourbon, Caturra Rojo, Mundo Novo, Garnica and Caturra Amarillo, which are susceptible to coffee leaf rust (Hemileia vastatrix Berk. & Br.) (Escamilla et al. 2005; López-García et al. 2016). Plant breeding for resistance to this disease is the best long-term solution (Avelino et al. 2015). Applying molecular markers is therefore particularly desirable for C. arabica because of its narrow genetic base (Ferrão et al. 2015). Molecular information, when combined with phenotypic variables, allows the selection of superior genotypes and maximizes selection gains (Sousa et al. 2017), as breeders can select more diverse germplasm and avoid crossing closely related accessions (Pailles et al. 2017). Molecular markers have shown that the genetic diversity of C. arabica is lower than that of C. canephora (Cubry et al. 2008; Lashermes et al. 2011; Ferrão et al. 2015).

DArT genotyping by sequencing (DArTseq™) is a new, highly informative and high-throughput genome marker technology. It combines the DArT marker platform with next-generation sequencing and allows rapid identification of single nucleotide polymorphisms (SNPs) (Kilian et al. 2012; Cruz et al. 2013; Raman et al. 2014). Compared to simple sequence repeat (SSR) markers, SNP analysis does not require DNA separation by size and can therefore be automated in high-throughput assay formats. The genotyping profiles of SNPs can be compared across different laboratories and genotyping platforms (Zhou et al. 2016). DArTseq™ has been applied successfully to evaluate the genetic diversity of Solanum lycopersicum (Pailles et al. 2017), Solanum tuberosum (Berdugo-Cely et al. 2017) and Allium sativum (Egea et al. 2017); in the Coffea genus, it has been reported for C. canephora (Garavito et al. 2016).

Since the Mexican Coffee Institute (INMECAFE) closed down in 1989, Mexico has implemented few coffee breeding programs and imported most of the leaf rust-resistant coffee cultivars used to face the sanitary crisis of 2012. The term “Core Collection” refers to the subset of accessions of a larger collection that includes, with minimum redundancy, the majority of the genetic diversity of a crop, a wild species or a group of species (Van Hintum et al. 2000). In this sense, this work aims to develop a “Core Collection” representative of the Central Collection that is conserved in the National Bank of Coffee Germplasm located in Huatusco, Veracruz, Mexico. The objectives of this study are: (1) to evaluate the diversity and genetic structure of the central coffee collection; (2) to assess the reproducibility and error rates of the markers and their broad representation in the genome and (3) to propose a new collection with representative and divergent promising genotypes for establishing a coffee breeding program in Mexico.

Materials and methods

Plant material and DNA extraction

A total of 87 accessions of Coffea spp. (Table 1) from the National Bank of Coffee Germplasm, located at 19° 10′ 27″ N and 96° 57′ 50″ W and 1345 masl in Huatusco, Veracruz, Mexico, were characterized with SNP markers using the DArTseq™ method. Six young, fully expanded leaves were collected from a single plant per accession and stored in a freezer at −80 °C until use. Genomic DNA was extracted from the frozen leaves by the CTAB (cetyltrimethylammonium bromide) method (Hoisington et al. 1994), with two additional chloroform washes for further cleaning. The DNA concentration was measured with the NanoDrop 8000 V 2.1.0 spectrophotometer and the quality was evaluated on a 1% agarose gel.

Table 1 List of 87 accessions of the Coffea genus genotyped by DArTseq

DArTseq analysis based on SNP

For genotypic characterization, the next-generation sequencing technology DArTseq™ was used. DArTseq™ combines DArT complexity reduction methods, based on methyl filtration, with next-generation sequencing platforms (Kilian et al. 2012). A genomic representation of the samples was generated by digestion/ligation of the genomic DNA with a combination of two restriction enzymes (PstI-CTGCAG-, HpaII-C/CGG and GGC/C-) and barcoded adapters that identify each sample, so that the samples could be run within a single lane on the Illumina HiSeq2500 instrument (Illumina Inc., San Diego, CA).

Fragments carrying the HpaII site were effectively amplified in 30 cycles of PCR with the following reaction profile: (1) denaturation at 94 °C for 1 min; (2) 30 cycles of [94 °C for 20 s (denaturation), 58 °C for 30 s (primer annealing) and 72 °C for 45 s (primer extension)] and (3) final extension at 72 °C for 7 min. After PCR, equimolar quantities of the amplified fragments from each sample in the 96-well microtiter plates were pooled and applied to the c-Bot bridge PCR (Illumina), followed by sequencing in the Illumina HiSeq2500 system (Illumina Inc., San Diego, CA).

The amplified fragments were sequenced up to 77 base pairs, generating approximately 500,000 unique reads per sample. The analytical program developed and patented by DArT Pty Ltd, Australia, was used to generate two types of data: (1) scores for dominant “presence/absence” markers, called SilicoDArTs (PAVs), and (2) SNP markers. The FASTQ files (full 77 bp reads) were filtered by quality parameters to select high-quality markers for this specific study. The PAV markers generated by DArTseq™ were not used in this study.

Data analysis

The dartR package of the R software automatically calculates several quality parameters for each SNP marker, such as call rate, polymorphic information content (PIC) and reproducibility. For the data analysis, the final set of polymorphic SNP markers was used. The average rate of missing values per marker was 14.7%. Markers with more than 10% missing values were eliminated, and the missing values of the remaining markers were imputed from the observed allelic frequencies using the statistical software R (R Core Team 2018). The missing genotypes were imputed by generating random samples from the marginal distribution of the observed genotypes, that is:

$${x}_{ij}\sim Bernoulli({\widehat{p}}_{j}),$$

where \(Bernoulli\left({\widehat{p}}_{j}\right)\) denotes a Bernoulli random variable with parameter \({\widehat{p}}_{j}\), the allelic frequency of marker j calculated from the non-missing genotypes (Crossa et al. 2010). Once the markers were imputed, the minor allele frequency (MAF) was obtained and all markers with MAF < 5% were eliminated. To investigate the relationship between genotypes, hierarchical clustering based on Euclidean distances, with the Ward.D2 method as the agglomeration criterion between groups (Murtagh and Legendre 2014), was performed using all polymorphic SNP markers. For the heat map, the genomic relationship matrix G was calculated using the following expression:

$$\mathbf{G}=\frac{\mathbf{Z}{\mathbf{Z}}^{\prime}}{p},$$

where Z is the matrix of markers with n = 87 rows (individuals) and p = 1739 columns (markers), obtained by centering and standardizing the columns of the marker matrix (Kaufman and Rousseeuw 2005; López-Cruz et al. 2015). Afterwards, a genlight object was generated using the dartR and adegenet packages of the R software (Gruber et al. 2017). A principal coordinate analysis (PCoA) was then performed to summarize the genetic distances among the accessions.
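The calculation of G can be sketched in a few lines. The example below is a minimal illustration in plain Python, not the R code used in the study; the function names, the toy 0/1 marker matrix and the use of the sample standard deviation (n − 1 denominator, as in R's scale function) are assumptions made for the example.

```python
import math


def standardize_columns(markers):
    """Center each marker column to mean 0 and scale it to unit sample SD."""
    n, p = len(markers), len(markers[0])
    z = [[0.0] * p for _ in range(n)]
    for j in range(p):
        column = [row[j] for row in markers]
        mean = sum(column) / n
        # Assumption: sample standard deviation (n - 1 denominator).
        sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
        if sd == 0:
            raise ValueError(f"marker {j} is monomorphic and cannot be standardized")
        for i in range(n):
            z[i][j] = (column[i] - mean) / sd
    return z


def genomic_relationship_matrix(markers):
    """Return G = ZZ'/p for an n x p marker matrix (rows are individuals)."""
    z = standardize_columns(markers)
    n, p = len(z), len(z[0])
    return [
        [sum(z[i][k] * z[j][k] for k in range(p)) / p for j in range(n)]
        for i in range(n)
    ]


# Toy data (hypothetical): 4 individuals scored at 3 markers.
toy_markers = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
]
G = genomic_relationship_matrix(toy_markers)
```

In this sketch the rows of G sum to zero because every column of Z is centered, and the trace of G equals n − 1 because every column has unit sample variance.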

The population structure of the germplasm was analyzed using STRUCTURE v.2.3.4 (Pritchard et al. 2000). The number of hypothetical subpopulations (K) was estimated with the STRUCTURE software through the application of a Bayesian clustering approach for the organization of genetically similar accessions into the same subgroups. A series of Markov Chain Monte Carlo (MCMC) simulations were conducted for each K-value from 1 to 5 with a burn-in length of 10,000, followed by 10,000 iterations. The best K-value was estimated based on the membership coefficient (Q) for each individual in each cluster. The Q values indicate the level of relatedness of each accession to various subgroups.

Results

Genetic-statistical analyses

A total of 16,995 SNP markers, derived from 34,000 unique sequences, were obtained by DArTseq™ from 87 accessions of different Coffea spp. Missing values were found in more than 8000 of the 16,995 SNP markers. Most of the markers showed reproducibility > 95% and a call rate > 85%, and the average PIC was 0.10. The proportions of monomorphic markers and missing data were 40.95% and 14.7%, respectively. Because the downstream analyses require complete genotype data, the missing values were imputed based on the observed allelic frequencies. After removing the markers with more than 10% missing data and those with MAF < 5% (Fig. 1), 1739 polymorphic SNP markers remained for the analysis. The technical and biological replicates made it possible to evaluate the reliability of the DArTseq™ method in coffee species.
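The filtering and imputation steps reported above can be sketched as follows. This is a minimal illustration in plain Python, not the R code used in the study: it assumes each marker is stored as a list of 0/1 calls with None for a missing value, and the function names, the toy data and the fixed random seed are hypothetical.

```python
import random


def missing_rate(calls):
    """Fraction of accessions with a missing call (None) for one marker."""
    return sum(c is None for c in calls) / len(calls)


def impute_marker(calls, rng):
    """Replace each missing call with a Bernoulli draw at the observed frequency."""
    observed = [c for c in calls if c is not None]
    p_hat = sum(observed) / len(observed)  # allelic frequency from non-missing calls
    return [c if c is not None else int(rng.random() < p_hat) for c in calls]


def minor_allele_frequency(calls):
    """MAF of a fully observed 0/1 marker."""
    p = sum(calls) / len(calls)
    return min(p, 1.0 - p)


def filter_and_impute(markers, max_missing=0.10, min_maf=0.05, seed=0):
    """Drop markers above max_missing, impute the rest, then drop markers below min_maf."""
    rng = random.Random(seed)  # fixed seed (assumption) so the sketch is repeatable
    kept = []
    for calls in markers:
        if missing_rate(calls) > max_missing:
            continue
        imputed = impute_marker(calls, rng)
        if minor_allele_frequency(imputed) >= min_maf:
            kept.append(imputed)
    return kept


# Toy data (hypothetical): 3 markers scored on 20 accessions.
toy = [
    [1] * 10 + [0] * 9 + [None],  # 5% missing, polymorphic: kept
    [1] * 17 + [None] * 3,        # 15% missing: removed by the missing-data filter
    [1] * 19 + [None],            # 5% missing but monomorphic: removed by the MAF filter
]
clean = filter_and_impute(toy)
```

The order of operations mirrors the text: the missing-data filter is applied first, imputation second and the MAF filter last, so the MAF is computed on complete data.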

Fig. 1 Frequency distribution of the minor allele (MAF)

Clustering analysis

After imputation and elimination of markers based on MAF, a heat map of the 87 accessions was obtained using the genomic relationship matrix G (Fig. 2). Figure 3a and b show the first two principal components based on the Euclidean distance matrix, with the groups identified by different colors. For PCoA there were 1639 polymorphic SNP markers in the genlight object. PCoA illustrated the genetic divergence among the accessions, although the first two components explain only 32.2% of the total variability. The population distribution determined by these markers is consistent with the output of the hierarchical clustering and the population structure analysis. C. arabica accessions were located in the top two quadrants, while Coffea liberica Bull ex Hiern and C. canephora were mainly located in the bottom quadrants.

Fig. 2 Heat map for the 87 accessions of Coffea spp. from the National Bank of Coffee Germplasm in Mexico using DArTseq Technology

Fig. 3 a Principal component 1 vs Principal component 2 of the PCoA explain 32.2% of the variability, b PCA represents the grouping for the 87 accessions of Coffea spp.

The accessions of Coffea spp. were grouped by hierarchical clustering using the Ward.D2 criterion (Murtagh and Legendre 2014) as the agglomeration method; the resulting dendrogram is shown in Fig. 4. Five well-defined groups can be identified by cutting the tree with a horizontal line at a height of 95 (Table 2). The accessions belonging to each group were obtained using the “cutree” routine of the statistical software R (R Core Team 2018). The genomic relationship matrix G and the dendrogram showed that there is genetic diversity among the accessions of Coffea spp., and these materials could be promising for use in future breeding programs.
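To make the clustering step concrete, the sketch below implements Ward agglomeration on Euclidean distances and stops merging once the cheapest remaining merge would exceed a chosen height. Because Ward merge heights never decrease, this gives the same partition as cutting the finished tree at that height. It is a plain-Python illustration under stated assumptions, not the R routine used in the study: the function names and toy coordinates are hypothetical, the merge height is written so that two single points merge at their Euclidean distance (our reading of the Ward.D2 convention), and the cut height of 5 applies only to the toy data, not to the height of 95 used for the coffee accessions.

```python
import math


def ward_height(size_a, centroid_a, size_b, centroid_b):
    """Merge height: sqrt(2 * |A||B| / (|A| + |B|)) times the centroid distance."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(centroid_a, centroid_b))
    return math.sqrt(2.0 * size_a * size_b / (size_a + size_b) * sq_dist)


def ward_clusters(points, cut_height):
    """Agglomerate with Ward's criterion until the next merge exceeds cut_height."""
    # Each cluster is stored as (member indices, centroid coordinates).
    clusters = [([i], list(p)) for i, p in enumerate(points)]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                h = ward_height(len(clusters[a][0]), clusters[a][1],
                                len(clusters[b][0]), clusters[b][1])
                if best is None or h < best[0]:
                    best = (h, a, b)
        height, a, b = best
        if height > cut_height:
            break  # every remaining merge lies above the cut
        members_a, cen_a = clusters[a]
        members_b, cen_b = clusters[b]
        na, nb = len(members_a), len(members_b)
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(cen_a, cen_b)]
        merged = (members_a + members_b, centroid)
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return sorted(sorted(members) for members, _ in clusters)


# Toy data (hypothetical): five accessions described by two coordinates.
toy_points = [[0, 0], [0, 1], [10, 0], [10, 1], [50, 50]]
groups = ward_clusters(toy_points, cut_height=5)
```

With the toy data, the two tight pairs merge at height 1 and every later merge lies far above the cut, so three groups remain.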

Fig. 4 Dendrogram of 87 accessions of Coffea spp. obtained with Euclidean distances calculated from SNP and Ward.D2 method with proximity criterion between groups

Table 2 Grouping of the 87 accessions of Coffea spp. derived from the dendrogram

Population structure analysis

The model-based Bayesian cluster analysis in STRUCTURE revealed the population structure of the collection (Fig. 5). Five distinct sub-populations were found across the accessions, denoted Pop1, Pop2, Pop3, Pop4 and Pop5. The genetic diversity within each sub-population was estimated by the expected heterozygosity, which varied from 0.07 (Pop2) to 0.28 (Pop1); it was 0.09 for Pop3, 0.16 for Pop4 and 0.24 for Pop5. The genetic divergence among the populations, measured by Nei’s net nucleotide distance (D), indicated that Pop2 was distantly related to the other sub-populations: Pop1 (D = 0.34), Pop3 (D = 0.32), Pop4 (D = 0.31) and Pop5 (D = 0.23). The genetic distance observed between Pop2 and Pop5 (D = 0.18) was the smallest among the pairs of populations examined (Table 3).
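For biallelic markers, both summary statistics have simple closed forms, sketched below in plain Python. The sketch assumes the common definitions of expected heterozygosity per locus, 2p(1 − p), and of net nucleotide distance as the between-population allelic difference minus the mean within-population heterozygosity, which reduces to (pA − pB)² per locus. The estimators implemented in STRUCTURE may differ in detail, and the function names and allele frequencies are invented for illustration.

```python
def expected_heterozygosity(freqs):
    """Mean of 2p(1 - p) over loci, given one allele frequency per biallelic locus."""
    return sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)


def net_nucleotide_distance(freqs_a, freqs_b):
    """Mean over loci of between-population difference minus mean within-population difference."""
    total = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        # Probability that two alleles drawn from different populations differ.
        between = pa * (1.0 - pb) + (1.0 - pa) * pb
        # Average probability that two alleles drawn within a population differ.
        within = (2.0 * pa * (1.0 - pa) + 2.0 * pb * (1.0 - pb)) / 2.0
        total += between - within
    return total / len(freqs_a)


# Hypothetical allele frequencies at three loci in two sub-populations.
pop_a = [0.9, 0.5, 0.1]
pop_b = [0.1, 0.5, 0.8]
he_a = expected_heterozygosity(pop_a)
d_ab = net_nucleotide_distance(pop_a, pop_b)
```

Under these definitions, a sub-population whose loci are close to fixation has a low expected heterozygosity, and two sub-populations with identical allele frequencies have a net distance of zero.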

Fig. 5 Population structure of 87 coffee accessions using SNP marker data

Table 3 Genetic divergence among (net nucleotide distance) and within (expected heterozygosity) populations, and the proportion of membership of the population samples

The proportion of membership of individual accessions to each sub-population is illustrated in the bar plot of the population assignment test in the STRUCTURE analysis (Fig. 5). The estimated proportion of membership (Q) suggested that two different species (C. liberica [83] and C. canephora [85, 86 and 87], red color) were assigned entirely to Pop1. Most C. arabica accessions were assigned to Pop2 (green color). CIRAD F1 hybrids were included in Pop3 (blue color). The remaining accessions showed intermediate and/or highly mixed genetic composition and were hence considered heterogeneous (Pop4 [yellow color] and Pop5 [pink color]). One CIRAD F1 hybrid accession (76) also shared large amounts of genetic information with Pop4 and Pop5 (Table 3).
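The way membership coefficients translate into group assignments can be sketched as below. This is an illustration only: the 0.8 cut-off for calling an accession admixed is an assumption (no threshold is stated for this study), and the function name and Q values are hypothetical.

```python
def assign_subpopulation(q_values, labels, threshold=0.8):
    """Return the label with the largest Q, or 'admixed' if that Q is below threshold."""
    best = max(range(len(q_values)), key=lambda k: q_values[k])
    if q_values[best] >= threshold:
        return labels[best]
    return "admixed"


labels = ["Pop1", "Pop2", "Pop3", "Pop4", "Pop5"]

# Hypothetical membership coefficients (each row sums to 1).
q_pure = [0.97, 0.01, 0.01, 0.005, 0.005]
q_mixed = [0.02, 0.08, 0.30, 0.35, 0.25]

pure_label = assign_subpopulation(q_pure, labels)
mixed_label = assign_subpopulation(q_mixed, labels)
```

A stricter or looser threshold changes only how many accessions are labeled as admixed, not the underlying Q estimates.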

Discussion

Genetic-statistical analyses

A total of 1739 SNP markers were used in the present study to provide a detailed molecular characterization of 87 accessions of Coffea spp. held in the National Bank of Coffee Germplasm in Mexico. The different statistical approaches yielded similar relationships among genotypes.

The quality parameters of SNP markers in Coffea spp. were comparable with those of other species: watermelon (Yang et al. 2016), Physaria spp. (Von Mark et al. 2013), Sorghum bicolor (Mace et al. 2008), cassava (Xia et al. 2005) and wheat (Akbari et al. 2006). Based on the polymorphism value, PIC is classified into three categories: high (PIC value higher than 0.5), medium (value between 0.25 and 0.5) and low (lower than 0.25) (Vaiman et al. 1994; Xie et al. 2010). The mean PIC value of the 1739 SNP markers in this population was 0.10. Moncada and McCouch (2004) also observed a relatively low PIC value (0.30) in arabica cultivars using SSR markers. Mishra et al. (2012) obtained a mean PIC value of 0.346 in Indian commercial coffee cultivars using polymorphic SRAP markers. Sousa et al. (2017) found a mean PIC value of 0.35 with 11,187 SNP markers. The low PIC value is evidence of the narrow genetic base of C. arabica. The average PIC value of Coffea spp. was similar to the values reported for SNP markers of watermelon (0.13) and Physaria spp. (0.12), but lower than those of Sorghum bicolor (0.41), cassava (0.42) and wheat (0.44).
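The PIC categories above can be illustrated with a short calculation. The sketch below uses the classical formula PIC = 1 − Σp_i² − ΣΣ2p_i²p_j² as an assumption; the PIC reported by the DArTseq™ pipeline may be computed differently. The function names and the example frequencies are hypothetical, and the thresholds are those quoted in the text.

```python
def pic(allele_freqs):
    """Classical PIC for one marker, given the frequencies of all its alleles."""
    homozygosity = sum(p * p for p in allele_freqs)
    cross = 0.0
    for i in range(len(allele_freqs)):
        for j in range(i + 1, len(allele_freqs)):
            cross += 2.0 * allele_freqs[i] ** 2 * allele_freqs[j] ** 2
    return 1.0 - homozygosity - cross


def classify_pic(value):
    """Categories quoted in the text: high > 0.5, medium 0.25 to 0.5, low < 0.25."""
    if value > 0.5:
        return "high"
    if value >= 0.25:
        return "medium"
    return "low"


# Hypothetical biallelic SNPs: one balanced, one with a rare allele.
balanced = pic([0.5, 0.5])
rare = pic([0.95, 0.05])
```

Under this formula a biallelic marker reaches its maximum PIC of 0.375 when both alleles have frequency 0.5, so a single SNP cannot fall into the high category, whereas multi-allelic markers such as SSRs can.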

The SNP markers used in this study are more abundant and show a co-dominant inheritance pattern, which makes them more effective at discriminating accessions than the AFLP, RAPD, SSR and ISSR markers used in previous studies of the genetic diversity of coffee (Lashermes et al. 2011; Garavito et al. 2016; Sant'Ana et al. 2018). Sant'Ana et al. (2018) identified 6696 SNPs from a collection of 107 wild accessions of C. arabica from Ethiopia and confirmed great allelic richness in wild accessions, especially in accessions from forests located on the west side of the Great Rift Valley. Sousa et al. (2017) selected 11,187 SNP markers from a coffee population resulting from crosses between the Catuaí and Hybrid of Timor genotypes; the genotyping data provided detailed information on parental genotypes and led to the identification of new candidates as parents for a breeding program.

Our work was done with only a subset of the complete collection of the National Bank of Coffee Germplasm in Mexico. Future studies using the entire collection would be of great value in increasing knowledge about the phenotypic and genotypic diversity of C. arabica and related species in Mexico. This study shows that there are genetic differences between C. arabica groups, so the selection of genetically diverse parental lines and the exploitation of heterosis from targeted crosses are promising alternatives in a coffee breeding program.

Clustering analysis

The genomic relationship matrix clustering and the principal coordinate analysis were used to identify both between- and within-species diversity. These analyses grouped the 87 genotypes into five diverse clusters on a principal component plot. The first two components accounted for 32.2% of the total variation. These data may be interpreted as follows: there is a high genetic distance between C. canephora (accessions 94, 95 and 96) and C. liberica (93), revealing inter-species diversity. This was also shown by Steiger et al. (2002) using AFLP markers, who reported that C. canephora and C. liberica were more genetically distinct. Finally, there seems to be a low genetic distance among C. arabica accessions, but within the sub-population of CIRAD F1 hybrids, accession number 83 is more distant than the rest, possibly because it derives from different progenitors. Anagbogu et al. (2019) applied multidimensional scaling (MDS) and found 36.2% of variation in the re-classification of 46 genotypes of C. canephora through genotyping-by-sequencing-single nucleotide polymorphism (GBS-SNP) analysis. The genomic relationship matrix G can also be used to study the structure of the population of interest or in genomic prediction.

The dendrogram obtained by the Ward.D2 method showed that the 87 genotypes were separated into five dissimilar groups: the first group comprised mostly C. arabica genotypes; the second comprised C. arabica genotypes and one C. liberica genotype; the third comprised a small set of C. arabica genotypes and one C. canephora genotype; the fourth comprised the F1 hybrids (CIRAD, France); and the fifth comprised two C. canephora genotypes. The formation of five distinct groups shows that the clustered genotypes form homogeneous groups with similar characteristics, and that the genetic diversity lies among the distinct groups. Bikila et al. (2017) showed genetic diversity in a core collection of 50 C. canephora clones genotyped with 46,074 SNP markers and obtained six different groups.

Population structure analysis

As in the dendrogram analysis of this central collection based on SNP markers, the population structure analysis, using K = 5, formed five different groups. The first group clustered the C. liberica and C. canephora species; the second group clustered most of the C. arabica accessions of the central collection, which showed the greater dissimilarity of these accessions from the C. liberica and C. canephora species; the third group clustered the CIRAD F1 hybrids. Steiger et al. (2002), using AFLP markers, also showed that C. canephora and C. arabica were more genetically similar, revealing inter-species diversity even though C. arabica resulted from a recent hybridization between C. canephora and C. eugenioides (Lashermes et al. 1999). The fourth and fifth groups comprised different C. arabica accessions. SNP markers and this type of genetic-statistical analysis provide more accurate and less subjective genetic information than that generated from phenotypic data, which is useful in breeding programs (Sousa et al. 2017).

The results obtained from this Coffea spp. central collection are similar to those reported by Sant'Ana et al. (2018), whose population structure analyses revealed two to three groups (K = 2 and K = 3), corresponding to the east and west sides of the Great Rift Valley and an additional group formed by wild C. arabica accessions collected in the western forests. Sousa et al. (2017) analyzed the population structure of coffee genotypes of interest for breeding studies using 11,187 SNP markers, from which two groups (K = 2) were obtained.

Conclusion

DArTseq™ technology identified 1739 polymorphic SNP markers, which discriminated five divergent groups at a dendrogram height of 95 and detected low genetic variation among the Coffea spp. of the central collection. The identified groups contain promising genotypes and could be useful for the establishment of a coffee breeding program in Mexico. Our study confirmed that genotyping by DArTseq™ can be successfully used in studies of genetic diversity.