Introduction

The use of molecular markers to detect polymorphism at DNA level is one of the most significant developments in molecular biology techniques. Molecular markers are more advantageous than phenotypic markers for presenting the highest selection gain for the traits of interest and for being uninfluenced by the environment. In addition, they can be used at the plant development stage (Gartner et al. 2013).

Among the several types of molecular markers, SNP have stood out for being the most abundant type of polymorphism in genomes, for being codominant and biallelic, and for automatizing the technique used to obtain these markers (Resende et al. 2008; Liao and Lee 2010). In recent years, these markers have frequently been used in studies on humans (Gabriel et al. 2002; Ojopi et al. 2004; Pinto et al. 2008; Gnirke et al. 2009) and in some animal and plant species (Abatepaulo et al. 2008; Caetano 2009; Yang et al. 2013); (Zhang et al. 2011; Neves et al. 2013, 2014; Zhang et al. 2015; Resende et al. 2016). However, they have not yet been identified in several plant species of economic relevance, such as the coffee species Coffea canephora, as they are relatively recent.

With the development of new next-generation sequencing (NGS) technologies, the discovery of SNP became easier and faster, and the costs of data point were reduced (Carvalho and Silva 2010).

The company RAPiD Genomics, located in Florida, USA, developed a methodology to obtain SNP using NGS developed for humans (Gnirke et al. 2009) and adapted to plants (Neves et al. 2013, 2014). This technology uses a method of genotyping-by-sequencing of specific regions of the genome.

Once obtained, the SNP can be used in several relevant studies of the species, such as in the analyses of genetic diversity and population structure. These analyses validate SNP to evaluate the discriminatory capacity in genetically related populations, besides the genetic variability and population structure of the species studied.

The presence of genetic variability is a basic and necessary condition for success in a genetic breeding program. The genetic variability observed in C. canephora species is mostly due to the large geographic distribution (especially of the genus Coffea), the domestication process, and the natural allogamy observed in the species (Davis et al. 2006; Babova et al. 2016).

The diversity of C. canephora was first described by Berthaud (1986). The author identified two distinct genetic groups based on their respective centers of diversity: the Guinean group, formed by West African genotypes (Guinea and Ivory Coast), and the Congolese group, made up of Central African genotypes. Other studies have also indicated that the division of C. canephora into two groups is strongly related to its geographic isolation and historical events of glaciations that occurred 18,000 years ago (Gomez et al. 2009; Cubry et al. 2013).

The Congolese group, unlike the Guinean group, was disseminated and progressively improved throughout the domestication and breeding processes of C. canephora. The Congolese group was divided into five subgroups, namely SG1, SG2, B, C, and UW (recently discovered and made up of wild accessions from Uganda) (Musoli et al. 2009).

The subgroup SG1 is formed by genotypes occurring from Benin to Gabon, also known as Conilon coffee. SG1 is the subgroup mostly adapted to Brazil and is present in the main varieties developed in the country. The subgroups SG2 (from the Democratic Republic of Congo), B (from the Central African Republic), and C (from Cameroon) consist of genotypes known as Robusta coffee. These coffee plants are tall, vigorous, with large leaves and fruits, are resistant to coffee rust, and more susceptible to drought (Marraccini et al. 2012).

Conilon and Robusta are considered as two heterotic groups with distinct and complementary traits within the C. canephora species. Thus, the expressive genetic variability observed between and within C. canephora population makes this species extremely important in breeding programs that aim at new varieties that benefit from heterosis (Lashermes et al. 2000; Leroy et al. 2005).

Rapid advances in molecular technologies have opened new perspectives for the evaluation of genetic diversity, involving different statistical approaches (Ferrão et al. 2015). A genetic dissimilarity matrix (or similarity) can be constructed by the analysis of all possible pairwise genotypic combinations, which can subsequently be used to establish patterns of similarities and dissimilarities (Kosman and Leonard 2005). A Bayesian approach implemented in the STRUCTURE software (Pritchard et al. 2000) has also been used to study the genetic diversity and population structure. In this software, genotypic data are used for the probabilistic classification of genotypes considering K populations (in which K may be unknown).

Therefore, this study aimed to identify SNP molecular markers for C. canephora species, validate them for genetic studies, and use them to select parents and hybrids for breeding.

Materials and methods

Genetic material

The population was selected to represent a broad genetic variability of the species C. canephora. To this end, divergent materials with agronomic traits of interest were selected.

An interpopulational partial diallel was carried out to form base populations to initiate the recurrent selection program of C. canephora in the breeding program of the Agricultural Research Company of Minas Gerais (Epamig), in association with the Federal University of Viçosa (UFV) and the Brazilian Company of Agricultural Research—Coffee (Embrapa Café). Five genotypes of Conilon (male parent) and five genotypes of Robusta (female parent) were used in the diallel. Besides these parents, five Conilon genotypes, four Robusta genotypes (Table 1), and 20 Hybrids (Conilon × Robusta) obtained from the diallel (Table 2) were also included in the study population. Twenty-one genotypes of the Conilon varietal group and 22 of the Robusta varietal group (Table 3) of the Epamig/UFV/Embrapa breeding program were selected, totaling 72 genotypes.

Table 1 Parents Conilon and Robusta
Table 2 Description of the crosses to obtain Hybrids that make up the population
Table 3 Genotypes of the Conilon and Robusta varietal groups

The genetic material of Conilon was obtained from the Capixaba Institute of Research, Technical Assistance and Rural Extension (INCAPER), and the Robusta material was obtained from the Centro Agronómico Tropical de Investigación y Enseñanza (CATIE). The materials were brought to EPAMIG (Viçosa, MG) germplasm bank by the researcher Antonio Alves Pereira.

Genomic DNA extraction

Young and fully expanded leaves from the 72 genotypes were collected. Genomic DNA was extracted using the methodology described by (Diniz et al. 2005). DNA concentration was verified in NanoDrop 2000, and its quality was evaluated in 1% agarose gel. DNA concentration of the samples was standardized and taken to RAPiD Genomics for the construction of probes, sequencing, and identification of SNP molecular markers.

Probes design and SNP identification

The sequencing-based genotyping was carried out using targeted enrichment, followed by next-generation sequencing, as formerly described (Gnirke et al. 2009; Song et al. 2016). A total of 40,000 120-bp probes were designed, using a combination of genomic resources available for C. arabica and C. canephora, aiming to reduce the complexity of the genome and enrich the libraries for the targeted regions. The genomic resources consisted of C. arabica specific expressed sequence tags (ESTs) and C. canephora specific ESTs obtained from a database of the Brazilian Coffee Genome Project, consisting of more than 200 thousand ESTs and corresponding to about 33 thousand transcripts (Vieira et al. 2006). Moreover, probes from the reference genome of C. canephora (Denoeud et al. 2014) were also developed, with an estimated size of 710 Mb. The EST database was also mapped against the reference genome to determine candidate SNP that represented good candidate regions to become a polymorphic probe. Furthermore, all the probes were designed in regions that were not repetitive in the C. canephora genome and absent in the plastid genome.

The process consisted of targeting ESTs that were unique to C. arabica and did not match the canephora reference genome. Based on these unigenes, 1879 probes derived exclusively from C. arabica contigs were designed. However, the reference genome of C. canephora was published with a total length of about 568.6 Mb, corresponding to 80% of the estimated total length (710 Mb) of the genome this species (Denoeud et al. 2014). Thus, the sequences of these probes were incorporated into the reference genome of this species. Afterward, we identified 8236 probes mapping to 7347 annotated genes that contained at least one SNP between the EST database and the reference genome. Then, the reference genome was used to design an additional set of 11,879 probes that were mapped uniquely in the genome and represented 11,879 genes, besides the 7347 genes formerly described. Finally, to add up to 40,000, 18,006 probes were selected in non-genic regions of the reference genome by breaking the genome in 10 kb intervals and selecting probes that are well-distributed and that covered the entire genome.

The 72 coffee samples were prepared for NGS and hybridized against the probes synthesized in-solution. The analysis used the protocols described by Neves et al. (2013). C. canephora and C. arabica genotypes were sequenced using the sequencing platform Illumina Hi-Seq after capturing 40,000 target regions in the coffee genome. The sequencing product was separated into each individual barcodes, corresponding to each genotyped sample. Low-quality bases with less than 20 quality score in the 3′ end were trimmed out, followed by a low-quality filter that removed reads with more than 10% of the read with less than 20 quality score. Filtered reads were aligned against the reference genome of C. canephora using Mosaik (Lee et al. 2014). Single Nucleotide Polymorphisms (SNP) were identified using diploid settings on Freebayes (Garrison and Marth 2012), leading to the identification of 162,026 SNPs (SNP specific to C. arabica, specific to C. canephora, and common to both species) in 27,651 polymorphic probes, with a mean of five SNP per probe.

Quality analysis of SNP

The VCFtools software (Danecek et al. 2011) was used for quality analysis of the SNPs (Resende et al. 2016). The following quality parameters were used: MinDP—defined as the minimum sequencing depth per SNP and per individual, in which SNP with values lower than that determined in the filter were considered as missing data; DPrange—mean sequencing depth range of the population for which an SNP is maintained. SNP markers with mean coverage in the population outside the range determined in each filter were removed; Miss—percentage of missing data accepted in the population after filtering by the MinDP parameter. SNP markers with Miss values above that set in the parameter were removed; MAF—minor allele frequency that refers to the frequency at which the alternative allele occurs in a given population; and MinQ—minimum quality accepted for SNP. SNP with quality values lower than those established by this criterion were removed.

Two data filtering analyses were applied. To obtain the dataset of the first filter, the parameters MinDP3, DPrange15-750, Miss0.4, MAF0.01, and MinQ10 were considered. For the second filtering, the parameters MinDP5, DPrange20-350, Miss0.2, MAF0.03, and MinQ20 were considered. The graphical analysis was performed using the Circos (Krzywinski et al. 2009).

Analysis of genetic diversity and population structure

The SNP data were coded as 0, 1, and 2 for being biallelic and codominant markers. Thus, considering the genotypes of the individuals as A1A1, A1A2, or A2A2 in each locus, these genotypes were coded as 0, 1, or 2, respectively.

The genetic dissimilarity (or similarity) matrix was obtained by the arithmetic complement of the unweighted index using the GENES software (Cruz 2013). Genetic distance can be estimated by the following equation:

$$D_{{ii^{'} }} = 1 - (\frac{1}{2}\mathop \sum \limits_{j = 1}^{L} p_{j} c_{j} )$$

where: Dii′ = genetic distance between pairs of accessions i and i′;

$$\mathop \sum \limits_{j = 1}^{L} p_{j} c_{j} = 1$$

\(p_{j} = \frac{{a_{j} }}{A}\): weight associated with loco j, determined by: aj: total number of loco alleles j; A: total number of alleles studied; cj: number of common alleles between pairs of accessions i and i′.

The cluster analysis to construct the dendrograms was performed using the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) and Neighbor-Joining (NJ), in the statistical software MEGA 7.0 (Kumar et al. 2016).

Another cluster analysis was performed using the STRUCTURE software (Pritchard et al. 2000), which considered the K values ranging from 1 to 6, with ten replicates. Each run was carried out with a 25,000 burn-in period, followed by 75,000 MCMC (Markov Chain Monte Carlo). The best number of genetic groups was estimated by calculating the ΔK, as described by Evanno et al. (2005), using the software STRUCTURE HARVESTER (Earl and VonHoldt 2012).

Principal Components Analysis (PCA) was performed using the pcaMethods R package (Stacklies et al. 2007) implemented in ClustVis (Metsalu and Vilo 2015).

Results

The median alignment was 95% and a total of 117,450 SNP markers for C. canephora were identified with the probes distributed throughout the coffee genome. After the quality analyses, 57,529 SNP markers were obtained with the first filter, and 33,485 SNP markers were obtained with the second filter (Fig. 1). For the genetic analyses, the 33,485 SNPs with the strictest quality parameters were considered in the study. These 33,485 SNP, as well as their respective positions in the genome, were made available (Online Resource 1). The number of SNP per chromosome, considering the second filter, ranged from 124 to 5076. The highest number of SNP was observed on chromosomes 0 and 2 (Fig. 1).

Fig. 1
figure 1

Number of SNP markers by chromosomes obtained from raw data and filters 1 and 2. SNP markers distributed throughout the UNIGENES from the EST sequences of Coffea arabica and from the 11 Coffea canephora chromosomes and chromosome 0 of Coffea canephora

A summary of the distribution of SNP in the genome of C. canephora is presented in layers (Fig. 2). The first (black, from outer to inner layers) represents each chromosome. The second layer corresponds to the number of raw SNP (blue). The last two layers (red and green) express the density of SNP after the first and second filters, respectively.

Fig. 2
figure 2

Circular visualization of the SNP distribution in the Coffea canephora genome. From the outer to the inner layers, the graphic is separated into four layers: (i) Chromosomes; (ii) number of raw SNP; (iii) number of SNP after first filter; (iv) number of SNP after second filter

The genetic distance matrix was generated using the molecular data, and two large groups (I and II) were formed with the dendrogram using the UPGMA technique. The first group was formed by accessions of the varietal group Robusta, and the second group consisted of accessions of the varietal group Conilon and the Hybrids (Conilon × Robusta). The second group can be divided into two subgroups, II. a and II. b, consisting of the accessions of Conilon and Hybrids, respectively (Fig. 3).

Fig. 3
figure 3

Dendrogram obtained using the UPGMA technique, based on the arithmetic complement of the unweighted index of 72 Coffea canephora genotypes (26 Conilons, 26 Robustas, and 20 Hybrids) and 33,485 SNP markers. Green, red, and blue lines represent the varietal groups (Conilon and Robusta) and the Hybrids, respectively

The Hybrid H0915-1 was allocated in the group of Robusta accessions. The genotypes Conilon UFV 3628-2 and UFV 3628-45, together with the genotype Robusta UFV 3374-29 were allocated in the subgroup of Hybrids (Fig. 3).

Results of the NJ algorithm for cluster analysis were equivalent to those obtained by the UPGMA methodology. The two dendrograms allocated the genotypes in the Conilon, Robusta, and Hybrid groups, respectively. The Hybrid H0915-1 was also allocated in the group of Robusta genotypes. The genotypes Conilon UFV 3628-2 and UFV 3628-45 and the genotype Robusta UFV 3374-29 were allocated closer to the subgroup of the Hybrids (Fig. 4).

Fig. 4
figure 4

Neighbor Joining (NJ) dendrogram of the 72 C. canephora genotypes (26 Conilons, 26 Robustas, and 20 Hybrids). Green, red, and blue lines represent the varietal groups (Conilon and Robusta) and the Hybrids, respectively

The Bayesian approach, performed by the STRUCTURE software (Pritchard et al. 2000), confirmed the results of the clustering methods. According to the methodology proposed by Evanno et al. (2005), the highest value of ΔK was obtained for K = 2 (Fig. 5), showing that genotypes can be separated into two groups, based on the Conilon or Robusta varietal group (Fig. 6a).

Fig. 5
figure 5

Graphic obtained with the values of ΔK for visualization of the best K (K = 2), according to the methodology proposed by Evanno et al. (2005)

Fig. 6
figure 6

Bar graphic of the STRUCTURE software used to study the diversity of the 72 Coffea canephora genotypes (26 Conilons, 26 Robustas, and 20 Hybrids). The 72 genotypes are represented below the graphic, and were divided into two (K = 2) and three (K = 3) groups, Fig. 6a, b respectively, according to the Conilon and Robusta varietal group, and also allowed discriminating Hybrid genotypes resulting from crosses between the two heterotic groups. The red bar is associated to the Robusta genotypes, the green bar to the Conilon genotypes and the blue bar to the Hybrid genotypes

The Hybrid H0915-1 presents approximately 80% genetic similarity with Robusta genotypes. The genotypes Conilon UFV 3628-2 and UFV 3628-45 showed approximately 50% genetic similarity with the Conilon group and 50% genetic similarity with the Robusta group. The genotype Robusta UFV 3374-29 showed approximately 80% genetic similarity with the Robusta group and 20% genetic similarity with the Conilon group (Fig. 6a).

Considering the division of the population into three groups (K = 3) (Fig. 6b), Robusta genotypes were allocated to group I (red bars). Group II (blue bars) was composed of the Hybrids and the genotypes Conilon UFV 3628-2 and UFV 3628-45 and the genotypes Robusta UFV 3368-58 and UFV 3374-29. Group III (green bars) clustered the Conilon genotypes. Hybrid H0915-1 was allocated to group II but presented about 38% of genetic mixture with group I. The genotype Robusta UFV 3368-58 was allocated to group II but presented about 42% of genetic mixture with group I. The genotype Robusta UFV 3374-29 was allocated to group II but presented about 31% of genetic mixture with group I and 5% of genetic mixture with group III.

Values of genetic dissimilarities estimated between and within the varietal groups (Conilon and Robusta) and the Hybrids are presented in Tables 4 and 5. Table 4 shows the lowest genetic dissimilarities and the mean distance observed between and within the groups. Table 5 shows the five major genetic dissimilarities obtained between and within the varietal groups and the Hybrids.

Table 4 Lower values of genetic dissimilarity and mean genetic dissimilarity obtained between and within the varietal groups (Conilon and Robusta) and the Hybrids
Table 5 Greater values of genetic dissimilarity obtained between and within the varietal groups (Conilon and Robusta) and the Hybrids

The lowest estimates of genetic distances were verified among the genotypes of the Conilon varietal group, and the lowest distance (0.0071) was observed between the accessions Conilon UFV 3628-45 and UFV 3628-2 (Table 4). These genotypes were grouped in the Hybrids subgroup, based on the analyses of the dendrograms.

The highest values of genetic distances were detected between the genotypes of the Conilon and Robusta varietal groups. The highest distance (0.3923) was obtained between the genotypes Conilon UFV 3629-11 and Robusta UFV 3371-19 (Table 5). The lowest genetic distance was obtained between the genotypes Conilon UFV 3628-45 and Robusta UFV 3374-29. According to the analyses of the dendrograms, the two genotypes were allocated to the Hybrid subgroup. The overall mean genetic distance was 0.2395.

The Hybrids H0915-1, H0919-2, and H0921-1 presented the highest values of genetic dissimilarity when compared with other Hybrids and the genotypes of the Conilon and Robusta groups. Among these three Hybrids, the greatest genetic distance was observed between H0915-1 and H0921-1 (0.2284) (Table 5). According to the analyses of the dendrograms, Hybrid H0915-1 was allocated to the Robusta group.

Table 6 shows the five largest distances between and within the varietal groups and Hybrids, disregarding the genotypes Conilon UFV 3628-2 and UFV 3628-45, the genotype Robusta UFV 3374-29, and the Hybrid H0915-1 (for having been mistakenly allocated in their respective groups).

Table 6 Greater values of genetic dissimilarity obtained between and within the varietal groups (Conilon and Robusta) and the Hybrids, disregarding the accessions Conilon UFV 3628-2 and Conilon UFV 3628-45, Robusta UFV 3374-29, and the Hybrid H0915-1

Based on Table 6, genotypes of the Conilon and Robusta varietal groups with greater genetic dissimilarities can be selected to form base populations. Hybrids can also be selected; for instance, Hybrids H0911-2, H0919-2, H0922-1, H0912-1, H0921-1, and H0918-1 stood out for their higher genetic distance values. These Hybrids can be used to compose a set of hybrid varieties and thus prevent problems with genetic autoincompatibility present in C. canephora species.

A PCA was performed using SNP markers, revealing two principal components (Fig. 7). PC1 and PC2 contributed with variations of 68.4 and 19.0%, respectively. Cumulatively, these two PCs contributed with 87.4% of the total variation observed. PCA formed three groups. The genotypes Conilon UFV 3628-45, UFV 3628-2, and UFV 3629-11; the genotype Robusta UFV 3368-58; and the Hybrid H0915-1 were not allocated to any group. The genotype Robusta UFV 3374-29 was allocated to the Hybrids group. The genotypes Conilon UFV 3628-45 and UFV 3628-2, although not allocated to any group, were closer to the Hybrid group. Hybrid H0915-1 was close to the Robusta group. Although genotype Robusta UFV 3368-58 was allocated in any group, it was located between the Robusta and Hybrid groups, corroborating the results of the software STRUCTURE for K = 3 (Fig. 6b).

Fig. 7
figure 7

Analysis of Principal Components of the 72 genotypes of Coffea canephora, showing the formations of three groups. X and Y axis show principal component 1 and principal component 2 that explain 68.4 and 19% of the total variance, respectively

PCA results agree with those obtained in the previous analyses, except for genotype Conilon UFV 3629-11. Although this genotype was not allocated in any group, it was located close to the Conilon group by the PCA analysis. Results also show that Robusta genotypes have greater spatial dispersion (genetic variability) in relation to the Conilon genotypes.

Discussion

Due to the large number of data obtained from genotyping with thousands of molecular markers, even a low error rate can be detrimental in subsequent analyses. Thus, some parameters for quality control and data filtering must be used to remove these errors. These evaluations allow identifying below-standard markers and samples, which must be removed before the analysis to reduce false positive error (type I error) and false negative error (type II error) (Anderson et al. 2010).

The RAPiD Genomics platform presented a large potential for SNP identification and genotyping since a significant number of these markers was identified in C. canephora when compared with other studies (Zhou et al. 2016; Garavito et al. 2016). A greater number of SNP was identified on chromosomes 0 and 2. However, chromosome 0 is not a true chromosome, but a set of unsorted sequence scaffolds of C. canephora (Denoeud et al. 2014). A study using a mixed model to multiple harvest-location trials applied to genomic prediction in Coffea canephora identified the highest number of SNP on chromosome 2 after filtering the data (Ferrão et al. 2017). These results may be due to the length of chromosome 2 in the genome of C. canephora (Denoeud et al. 2014).

These markers were able to divide the genotypes of the studied population into distinct groups. Thus, the efficiency of these SNP markers has been proven in the analyses of genetic diversity and population structure.

A technique based on the reduction of complexity by using the restriction enzymes that aim at gene-rich regions and by using the NGS technology, known as DArTseq, was applied to test the performance of the markers derived from this method in coffee plants. By studying the genetic diversity of the species C. canephora cultivated in Vietnam and Mexico, 4,021 SNPs were identified (Garavito et al. 2016). These SNP markers constitute a valuable tool for breeders. In addition, they increase the knowledge on the genetic diversity of C. canephora and contribute to the understanding of the genetic background of important varieties from major coffee producers (Garavito et al. 2016).

In another study, a set of SNP markers was developed and used to assist the identification of coffee germplasm, demonstrating the usefulness of EST sequences as an approach for the rapid development of a high-throughput targeted genotyping tool (Zhou et al. 2016). By using ESTs of the C. arabica, C. canephora, and C. racemosa obtained from public databases, 7538 SNP markers were developed, of which 180 were selected for validation using 25 accessions of C. arabica and C. canephora from Puerto Rico. Based on the validation result, a panel of 55 SNP markers that are polymorphic between both coffee species was constructed (Zhou et al. 2016).

In the UPGMA, the SNP were able to classify the genotypes into different groups, based on the varietal group (Conilon and Robusta). It also discriminated the Hybrid genotypes obtained from the cross between Conilon and Robusta genotypes. The results were confirmed by the analysis using the NJ algorithm; by the Bayesian approach, using the STRUCTURE software (Pritchard et al. 2000); and by the PCA analysis. These three analyses were also efficient in separating the Conilon and Robusta varietal groups. However, as in the previous UPGMA cluster analysis, the Hybrid genotypes were also discriminated.

Robusta and Conilon heterotic groups present distinct and striking phenotypic traits. Nevertheless, the classification of these materials is no easy task. This fact is due to their strictly allogamous reproduction form, high phenotypic amplitude, and heterozygosis, in addition to possible natural crosses between the two varietal groups. Thus, studies on the genetic diversity using molecular markers are crucial in breeding programs for the conservation of genetic sources and the correct evaluation of genotypes (Ferrão et al. 2013; Ren et al. 2013; Ferrão et al. 2015).

Three accessions (Conilon UFV 3628-2, Conilon UFV 3628-45, and Robusta UFV 3374-29) were not allocated to their respective varietal groups, but to the subgroup of the Hybrids. This result suggests that these accessions probably derived from natural crossings and were mistakenly classified.

Conilon genotypes, including UFV 3628-2 and UFV 3628-45, are derived from half-sib seeds. Since Conilon is allogamous and has an autoincompatibility system, the two seeds that originated these two genotypes may have been originated from the pollination with a Robusta parent, instead of Conilon, forming natural Hybrids in their collection site.

The genotype Conilon UFV 3628-2 was chosen as the parent to obtain Hybrids in the interpopulational partial diallel due to its prominence in yield, resistance to rust, and earliness. The genotype Conilon UFV 3628-45 also stood out for these traits. Resistance to rust is evidence that these accessions may be natural Hybrids since Conilon genotypes are usually susceptible to rust. Resistance to this disease is usually obtained from Robusta (Cubry et al. 2008; Musoli et al. 2009). In addition, relatively high yield was observed in these accessions when compared with the other Conilon genotypes. In the mean of 5 years of evaluations (2012–2016) (data not shown), the genotypes Conilon UFV 3628-2 and Conilon UFV 3628-45 were in the fifth and eighth position, respectively, in yield data, when compared with the other Conilon genotypes evaluated in the breeding program. Yield increase may also be the result of Hybrid vigor, corroborating the molecular data that indicate these natural Hybrid genotypes.

The genotype Robusta UFV 3368-58 probably derived from natural crossings and was also mistakenly classified. When using the software STRUCTURE for K = 3, this genotype was allocated to group II; by the NJ algorithm, it is also located close to the Hybrid genotypes; and in the PCA analysis, despite not being allocated to any group, this genotype is located between the Robusta and Hybrid groups.

The accession Robusta UFV 3374-29, also indicated as a probable natural Hybrid by the molecular data, stood out for the mean yield of the 5 years of evaluations, although data were slightly inferior when compared with those of the accessions Conilon UFV 3628-2 and Conilon UFV 3628-45. This accession is prominent for its earliness, which was probably obtained from its possible parent Conilon. The analysis using the STRUCTURE software for K = 2 revealed approximately 78% similarity with Robusta and approximately 22% similarity with Conilon. This contribution of the Conilon parent appears to be sufficient to guarantee the earliness of the genotype Robusta UFV 3374-29. In the analysis using the software STRUCTURE for K = 3, the genotype Robusta UFV 3374-29 was allocated to group II, proving that this genotype is a natural Hybrids.

The Hybrid H0915-1 was allocated to the Robusta group in both cluster analyses. Based on the molecular data, one of the parents of Hybrid H0915-1 is the genotype Conilon 3628-2, which is a natural Hybrid. Thus, this Hybrid was formed by the cross [(Robusta × Conilon) × Robusta]. This explains why the Hybrid H0915-1 was genetically closer to Robusta genotypes in the cluster analyses. This result was confirmed by the STRUCTURE software for K = 2, which shows approximately 84% similarity with Robusta and approximately 16% similarity with Conilon. Using the STRUCTURE software for K = 3, Hybrid H0915-1 was allocated to group II but presented about 38% of genetic mixture with group I, which comprises Robusta genotypes.

The genetic distance matrix showed that the Conilon varietal group presented the lowest values of genetic dissimilarity. The presence of genetic variability within the base population is responsible for the success of the breeding programs (Grandillo 2014). Thus, to obtain the base population of genotypes of the Conilon varietal group, genotypes with the highest values of genetic distance observed in the distance matrix must be selected. Therefore, in this germplasm, the five crosses presented in Table 6 should be prioritized, if the objective is to obtain crosses only between accessions of the Conilon group. The possibility of increasing the variability of this group in the Germplasm Bank, by introducing new accessions, should also be considered.

To form the base population of the Robusta varietal group, the accessions presented in Table 6 must also be prioritized, since they have the highest values of genetic dissimilarity and are important sources of variability for breeding programs.

Higher genetic distance between genotypes of the Conilon and Robusta heterotic groups were expected. Therefore, the higher distance value (0.3923) between the genotype Conilon UFV 3629-11 and the genotype Robusta UFV 3371-19 is explained by the fact that these genotypes present distinct and complementary traits.

In breeding programs, such as that of Epamig/UFV/Embrapa, which aim at identifying heterotic groups and contrasting parents to explore Hybrid vigor, the correct evaluation of diversity is a key factor for the choice of parents. The expectation that divergent or contrasting parents generate good Hybrids is because if two parents are genetically close, there is a trend for them to share many common genes or alleles. Thus, while crossing them, little complementarity and low Hybrid vigor will occur due to the low level of allelic heterozygosity in the cross (Ferrão et al. 2013).

One of the main focuses of the Epamig/UFV/Embrapa breeding programs is to recommend a combination of Hybrid coffee seeds that avoid genetic incompatibility problems; are more drought-tolerant than clonal varieties; and are associated with rust resistance, maturation uniformity, and greater fruit size. Thus, by observing the results and disregarding H0915-1, which was allocated to the Robusta varietal group, the Hybrids H0911-2, H0919-2, H0922-1, H0912-1, H0921-1, and H0918-1 (Table 6) may be remarkable for the breeding program since they present the highest values of genetic dissimilarity.

Results revealed the great efficiency of technique used in the present study for the identification and genotyping of SNP markers, as well as the ability os these SNP to evaluate the genetic diversity. The most common molecular markers used in diversity studies in the genus Coffea are the Random Amplified Polymorphic DNA (Sera et al. 2003), Amplified Fragment Length Polymorphism (Prakash et al. 2005; Brito et al. 2010; Diola et al. 2011), and Simple Sequence Repeat (Prakash et al. 2005; Ferrão et al. 2013). A successful study has been carried out using SNP molecular markers, obtained by the genotyping-by-sequencing (GBS) approach (Ferrão et al. 2017). The authors investigated genomic selection models suitable for use in C. canephora. According to the authors, the results can be used as a basis for further studies on the genus Coffea and can be expanded for similar perennial crops (Ferrão et al. 2017). Another recent work, using SNP markers in coffee, obtained by the GBS technique, enabled a more comprehensive and significant study of the evolutionary history of the genus Coffea (Hamon et al. 2017).

Conclusions

The methodology of SNP identification and genotyping used in this study presents great potential to detect and select an expressive number of SNP markers.

The markers were efficient in evaluating the genetic diversity and population structure of C. canephora. Based on their analysis, promising crosses were selected within and between the varietal groups, according to the C. canephora breeding programs.

Due to the low genetic diversity of genotypes of the Conilon varietal group, when the objective is to increase the genetic base, genotypes with the highest values of genetic distance must be selected, and the variability of this group in the Germplasm Bank must be amplified by introducing new accessions to the Conilon population.

The highest values of genetic distances were observed between genotypes of Conilon and Robusta heterotic groups.

Hybrids with higher values of genetic distances can be selected, which are important for C. canephora breeding programs since they avoid genetic incompatibility problems.