Introduction

Genomics has many practical applications in common bean, such as the use of molecular tools for individual identification, quantification of genetic variability in diverse germplasm and assisted selection for common bean diseases (Blair et al. 2007, 2013; Cardoso et al. 2013, 2014). These applications are being routinely incorporated into several common bean breeding programs as a new paradigm, with a direct and quantifiable impact on the optimization of processes that accelerate selection cycles, as well as on the management and enrichment of genetic diversity in breeding populations. Genomic approaches in common bean can now reach new levels, since the availability of the whole genome sequences of two important genotypes representing the Andean and Mesoamerican gene pools (Schmutz et al. 2014; Vlasova et al. 2016) has opened new perspectives for the development of molecular tools with high added value for plant breeding. Building on these advances, SNP markers make whole-genome genotyping approaches more accessible and affordable. Genome-wide SNP discovery by re-sequencing has been performed in important crops and applied for different purposes, such as genome-wide diversity studies, association mapping (Zhou et al. 2015) and genotyping in a genomic selection context (Poland et al. 2012; Crossa et al. 2013; Jarquin et al. 2014). For common bean, continuous efforts to develop molecular tools that capture its broad genetic diversity are moving towards effective genomics-assisted breeding for agronomic and quality traits (Gepts 2014; Rodriguez et al. 2015).

An increasing number of single-nucleotide polymorphism (SNP) markers for the common bean (Phaseolus vulgaris L.) crop, most derived from polymorphisms between the Andean and Mesoamerican gene pools, have been described in the last ten years. The first effort on SNP discovery was based on the analysis of expressed sequence tags (ESTs) (Ramírez et al. 2005). Later, Gaitán-Solís et al. (2008) used the strategies of target genome sequencing and complexity reduction to identify a SNP set. Subsequently, Galeano et al. (2009a) developed SNPs from SSCP (single-strand conformation polymorphism) markers and by mismatch nuclease cleavage (Galeano et al. 2009b). McConnell et al. (2010) generated over a thousand gene-based SNP markers for the common bean. The first study that discovered a large number of SNPs (3487 SNPs) in the common bean was conducted by Hyten et al. (2010) using high-throughput DNA sequencing of a reduced representation library. Souza et al. (2012) compared sequence-tagged sites (STSs) from a group of six diverse genotypes to identify a new set of 677 SNPs for genetic analysis. Based on gene sequences conserved between gene pools, Blair et al. (2013) evaluated hundreds of SNPs and determined their usefulness for genetic analysis and map saturation. More recently, Müller et al. (2015) filtered the SNP bank made available by Hyten et al. (2010) to genotype and perform a comparative analysis with SSRs (simple sequence repeats) to estimate genetic diversity and population structure.

Based on SNP genotyping data, several studies have concluded that most SNP markers derived from inter-gene pool accessions are useful for distinguishing Andean and Mesoamerican genotypes but are less useful for distinguishing genotypes within each gene pool. The tendency for the sampled accessions to cluster and follow the same pattern of differentiation as the target genotypes used to derive the SNPs is a problem associated with SNP development when only a few individuals are used for this purpose, as reported for barley (Moragues et al. 2010), rice (Thomson et al. 2012) and the common bean (Blair et al. 2013; Müller et al. 2015). This problem is a consequence of the ascertainment bias present in SNP panels developed by exploring polymorphisms between the genotypes used for their development. For the common bean, although SNP assay success has been high, the proportion of polymorphic SNPs tends to be high when contrasting Andean and Mesoamerican gene pools are considered, and decreases when genetic diversity is assessed within a gene pool (Cortés et al. 2011; Blair et al. 2013). In this context, depending on the purpose of the study and the origin of the germplasm evaluated, different sets of SNPs should be developed, as previously proposed by Nagasaki et al. (2010).

The rapid advance in SNP marker discovery and genotyping has largely been facilitated by progress in new DNA sequencing technologies, known as next-generation sequencing (NGS) (Margulies et al. 2005; Mardis 2008). The available NGS platforms have reduced DNA sequencing costs and time, require lower DNA quantities and have increased the efficiency and accuracy of high-throughput DNA sequencing (Ganal et al. 2009). However, the cost of fully sequencing individual genomes is still prohibitive, mainly for crops with lower economic value and, consequently, scarce financial support for genomic research. Robust, cost-effective and accessible technologies that allow broad sequence polymorphism genotyping of larger sets of samples are therefore desirable. Methods that reduce genome complexity followed by deep sequencing have been successfully used for SNP discovery in different organisms, including plant species (Baird et al. 2008; Etter et al. 2011). Restriction site-associated DNA sequencing (RAD-seq) is a reduced-representation sequencing method that uses NGS to generate hundreds of thousands of short sequence tags to simultaneously discover and score SNP markers in hundreds of individuals at a reasonable cost (Willing et al. 2011). RAD-seq has been used to identify SNPs that are useful for population genetics and QTL mapping studies in eggplant (Barchi et al. 2011), barley (Chutimanitsakun et al. 2011), sunflower (Talukder et al. 2014), and other crops. In this work, we used RAD-seq to discover polymorphic SNPs across a set of 12 diverse common bean genotypes, followed by the evaluation of common bean genetic diversity and structure using the Golden Gate genotyping system (Illumina). Additionally, we assessed the potential of RAD for direct genotyping-by-sequencing in the common bean. In summary, 1032 RAD-SNPs were identified in common bean, aligned to coding and noncoding sequences and positioned along the bean genome. For a subset of 384 RAD-SNPs, ≥96 % were successfully amplified and were broadly evaluated for a diverse set of genetic parameters and linkage disequilibrium, providing important insights for breeding programs. After integration with the previously developed SNPs (totaling 673 SNPs), the operational SNP panels made available here give common bean breeding programs the means to characterize diverse common bean germplasm at the DNA level and to address different research questions.

Materials and methods

Plant material and DNA extraction

A set of 12 common bean genotypes from the EMBRAPA Rice and Beans gene bank was used for RAD-seq development and included nine commercial cultivars and three breeding lines introduced by research institutions from Brazil and abroad (Supplementary Material 1). Total genomic DNA was extracted using the DNeasy Plant Mini Kit (Qiagen, Hilden, NRW, Germany) according to the manufacturer’s protocol. To genotype the SNPs derived from RAD-seq, we sampled 95 common bean genotypes, including 45 commercial cultivars and 14 breeding lines developed by research institutions from Brazil (47) and abroad (12), 34 Brazilian landraces and two wild accessions, all belonging to the Mesoamerican (74 genotypes) and Andean (21 genotypes) gene pools (Supplementary Material 1). Genomic DNA was isolated as described by Grattapaglia and Sederoff (1994). For both sets of samples, DNA was quantified in a Qubit® spectrophotometer (Invitrogen, Carlsbad, CA, USA).

RAD library development and sequencing

For SNP discovery, high quality DNA samples from the 12 selected genotypes described above were divided into two equimolar pools of six individuals each. The RAD-seq experiment was carried out by FLORAGENEX, Inc (Oregon, USA). In brief, the RAD-seq reduced representation libraries were prepared by digestion with the PstI enzyme, followed by adaptor/barcode ligation, DNA purification, selective DNA amplification and size selection. The two resulting RAD libraries, one from each pool, were sequenced on an Illumina HiSeq platform (Illumina, San Diego, CA, USA) using 75 bp paired-end reads.

Identification of SNPs

RAD-seq derived sequencing reads were subjected to a series of steps to obtain a high quality SNP set, geared toward the elimination of false positives and assay success for the Golden Gate Genotyping Technology (GGGT, Illumina) (Grattapaglia et al. 2011). The reads were submitted to systematic in silico SNP filtering, leveraging the availability of a complete genome sequence of the Mesoamerican variety BAT93 (Vlasova et al. 2016). Polymorphic sites were retained only if they met the following constraints: (1) location in a non-repeat region of the genome (avoiding transposable elements); (2) reads mapping uniquely to the reference genome (to avoid non-specific amplification); (3) bi-allelic SNPs only; (4) an allele frequency ranging from 40 to 60 %; and (5) adjacent SNPs separated by at least 50 kb along the genome. Finally, the selected set of SNPs was submitted to the proprietary Assay Design Tool (ADT, http://icom.illumina.com/Custom/Index) to select the subset of SNPs with the highest recommended design scores. The list of the sequences of the selected primers and their positions in the BAT93 genome assembly is presented in Supplementary Material 2.
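The five filtering constraints can be illustrated with a minimal Python sketch. It is not the pipeline used in this work: the `CandidateSNP` record, its field names and the greedy left-to-right thinning used to enforce the 50 kb spacing are all assumptions made for illustration, since the text does not state how the spacing rule was applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateSNP:
    """Hypothetical record for one putative SNP (field names are assumptions)."""
    chrom: str
    pos: int              # position on the reference assembly, in bp
    alleles: tuple        # alleles observed across the pooled reads
    allele_freq: float    # frequency of one allele across the pools (0..1)
    in_repeat: bool       # True if the site lies in an annotated repeat
    unique_mapping: bool  # True if the supporting reads map to a single locus


def filter_snps(candidates, min_freq=0.40, max_freq=0.60, min_spacing=50_000):
    """Apply constraints (1)-(4) per site, then thin to >= min_spacing bp (5)."""
    passing = [
        s for s in candidates
        if not s.in_repeat                          # (1) non-repeat region
        and s.unique_mapping                        # (2) uniquely mapped reads
        and len(s.alleles) == 2                     # (3) bi-allelic only
        and min_freq <= s.allele_freq <= max_freq   # (4) 40-60 % frequency
    ]
    passing.sort(key=lambda s: (s.chrom, s.pos))

    kept, last_pos = [], {}
    for s in passing:
        prev = last_pos.get(s.chrom)
        # (5) assumed greedy rule: keep a SNP only if it is far enough
        # from the previously kept SNP on the same chromosome.
        if prev is None or s.pos - prev >= min_spacing:
            kept.append(s)
            last_pos[s.chrom] = s.pos
    return kept
```

A greedy pass keeps the first qualifying SNP in each window; a real design step might instead prefer the SNP with the best assay design score within each window.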

SNP genotyping

Based on the high quality SNPs described above, a subset of 384 loci was used to synthesize an Oligo Pool Assay (OPA VC0013574) specific for use with the Illumina BeadXpress genotyping platform (Kim and Misra 2007), according to the manufacturer’s instructions. Automatic allele calling was performed using Genome Studio version 1.8.4 (Illumina). The individual SNP genotyping performance was assessed by the call rate using a GenCall score cutoff ≥0.25. The clustering quality was assessed by the GenTrain algorithm, provided as a module in the proprietary software Genome Studio (Illumina).

Genetic diversity and cluster analysis

The SNPs were characterized for genetic diversity by genotyping 95 common bean genotypes (Supplementary Material 1) and estimating several parameters, namely the number of alleles (A), number of alleles per locus (Ā), number of private alleles (Ap), expected heterozygosity or gene diversity (H E), observed heterozygosity (H O), fixation index (F IS), genetic probability of identity (PI) and genetic probability of exclusion (PE), for each locus and across loci, using the program GenAlEx v6.5 (Peakall and Smouse 2012). A cluster analysis using simple matching dissimilarity and the neighbor-joining method with 1000 bootstrap replicates was performed with DARwin v5.0.158 (Perrier and Jacquemoud-Collet 2006). All statistical analyses (overall significances) were performed using the non-parametric Wilcoxon test implemented in the Real Statistics Resource Pack software (Release 4.3) (Zaiontz 2013).
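For readers less familiar with these parameters, the sketch below computes per-locus H E, H O, the fixation index and PI from diploid genotype calls using the standard textbook formulas (H E = 1 − Σp², F = 1 − H O/H E, PI = Σp⁴ + ΣΣ(2pᵢpⱼ)²). It is an illustration only and is not the GenAlEx implementation; in particular, it uses the uncorrected H E, and whether a sample-size correction was applied in this study is not stated here. The function name and input layout are assumptions.

```python
from collections import Counter


def locus_stats(genotypes):
    """Per-locus diversity statistics from diploid genotype calls.

    genotypes: list of (allele, allele) tuples; None marks a missing call.
    Returns a dict with allele frequencies, He, Ho, F and PI.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes at this locus")
    n = len(called)

    counts = Counter(allele for g in called for allele in g)
    freqs = {allele: c / (2 * n) for allele, c in counts.items()}

    he = 1.0 - sum(p * p for p in freqs.values())       # expected heterozygosity
    ho = sum(1 for a, b in called if a != b) / n         # observed heterozygosity
    f = 1.0 - ho / he if he > 0 else float("nan")        # fixation index

    # Probability of identity: chance that two random individuals share a genotype.
    alleles = sorted(freqs)
    pi = sum(freqs[a] ** 4 for a in alleles)
    for i, a in enumerate(alleles):
        for b in alleles[i + 1:]:
            pi += (2 * freqs[a] * freqs[b]) ** 2

    return {"freqs": freqs, "He": he, "Ho": ho, "F": f, "PI": pi}
```

For a bi-allelic locus with both alleles at a frequency of 0.5 and no heterozygotes, as expected for fully inbred lines, this gives H E = 0.5, F = 1 and PI = 0.375, the lowest PI a single bi-allelic marker can reach.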

Population structure analysis

The Principal Coordinates Analysis (PCoA) and the genetic differentiation index (F ST) were estimated with a confidence interval of 95 % for 9999 permutations, calculated using GenAlEx v6.5 (Peakall and Smouse 2012). The PCoA was calculated using a simple matching dissimilarity matrix. The program STRUCTURE v2.3.4 (Pritchard et al. 2000) was used to infer the structure of the genotypes. A number of populations (K) ranging from 1 to 10 was assumed, with 10 iterations each, to identify the K with the highest log likelihood. In each run, the admixture model was applied with 1,000,000 Markov chain Monte Carlo (MCMC) replications and a burn-in period of 500,000. The criterion proposed by Evanno et al. (2005) was used to determine the most probable K through the program STRUCTURE HARVESTER v0.6.93 (Earl and vonHoldt 2012), followed by analysis using the software CLUMPP v1.1.2 (Jakobsson and Rosenberg 2007) to determine the consensus among the 10 iterations of the most probable K. The program DISTRUCT v1.1 (Rosenberg 2004) was used for the graphical display of the population structure. An analysis of molecular variance (AMOVA) was performed to test the structure of the genetic diversity of the genotypes using the program Arlequin v3.5 (Excoffier and Lischer 2010), with 10,000 permutations.
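The Evanno criterion selects K from the rate of change of the log likelihood between successive K values, ΔK = |L(K+1) − 2L(K) + L(K−1)| / sd[L(K)]. The sketch below is a minimal illustration of that formula, not the STRUCTURE HARVESTER code: it assumes the second difference is taken on the mean log likelihood across runs and that the sample standard deviation is used, and the function names are hypothetical.

```python
from statistics import mean, stdev


def evanno_delta_k(lnp_by_k):
    """Compute Delta K for each interior K.

    lnp_by_k: dict mapping K -> list of Ln P(D) values, one per replicate run
    (at least two runs per K). Returns a dict K -> Delta K.
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    delta = {}
    for k in sorted(lnp_by_k):
        # Delta K is undefined for the smallest and largest K tested.
        if k - 1 not in means or k + 1 not in means:
            continue
        sd = stdev(lnp_by_k[k])
        if sd == 0:
            continue  # identical runs: the ratio is undefined, so skip this K
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        delta[k] = second_diff / sd
    return delta


def most_probable_k(lnp_by_k):
    """Return the K with the largest Delta K."""
    delta = evanno_delta_k(lnp_by_k)
    return max(delta, key=delta.get)
```

Because ΔK cannot be computed for the first and last K tested, a range of K = 1 to 10 yields ΔK values for K = 2 to 9 only.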

Linkage disequilibrium

Linkage disequilibrium (LD) was estimated using Tassel v.5.0 (Bradbury et al. 2007; Glaubitz et al. 2014). The standardized disequilibrium coefficient r 2 (parameter r 2 > 0.10) and corresponding p-values (two-sided Fisher’s exact test) were estimated. For the calculation of LD, SNP loci with MAF < 0.05 (Minor Allele Frequency) were discarded from the analysis. The heterozygous genotypes were considered missing data, and the loci were considered to have a significant LD if r 2 > 0.10 and p value <0.001. Haploview 4.2 (Barrett et al. 2005) was used to assess haplotypes based on 363 SNPs with MAF > 0.05 and missing data ≤25 %. The heterozygous alleles were considered missing data. Only SNPs incorporated up to a distance of 20 Mb were considered for the determination of haplotype blocks and D′ and r 2 with confidence intervals according to Gabriel et al. (2002). The LD parameters and haplotype blocks were calculated for the entire sample (95 genotypes) and, separately, for the Andean (21 genotypes) and Mesoamerican (74 genotypes) gene pools. In addition, the haplotype blocks were estimated using the RAD-SNP loci made available by this work and the BARC-SNP loci reported by Müller et al. (2015), totaling 708 SNPs (363 RAD-SNPs and 345 BARC-SNPs) analyzed using a common set of 88 genotypes.
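As a reminder of what the r 2 statistic measures, the sketch below computes it for one pair of bi-allelic loci from the standard definition r² = D² / (p_A(1 − p_A) p_B(1 − p_B)), with D = p_AB − p_A p_B. It is an illustration rather than the Tassel implementation. It assumes each inbred genotype contributes one allele per locus, with heterozygous or missing calls passed as `None`, in line with the treatment of heterozygotes as missing data described above; the function name is hypothetical.

```python
def r_squared(locus_a, locus_b):
    """Pairwise r^2 between two bi-allelic loci.

    locus_a, locus_b: equal-length lists with one allele per genotype
    (e.g. "A" or "G"); None marks a missing or heterozygous call.
    Returns NaN when either locus is monomorphic among the shared calls.
    """
    pairs = [(a, b) for a, b in zip(locus_a, locus_b)
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("no genotypes called at both loci")
    n = len(pairs)

    # Use the alleles of the first genotype as the reference alleles;
    # r^2 does not depend on which allele is chosen at a bi-allelic locus.
    ref_a, ref_b = pairs[0]
    p_a = sum(1 for a, _ in pairs if a == ref_a) / n
    p_b = sum(1 for _, b in pairs if b == ref_b) / n
    p_ab = sum(1 for a, b in pairs if a == ref_a and b == ref_b) / n

    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")
    return d * d / denom
```

Two loci whose alleles always co-occur give r² = 1, while loci whose alleles are combined at random give values near 0; the threshold of r 2 > 0.10 used here therefore flags even moderate associations.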

Alignment of the SNPs in the P. vulgaris genome

Using BLASTN (Altschul et al. 1997) with E-value ≤1.0E−10, the flanking sequences of the SNPs were aligned against the genome of P. vulgaris, Andean variety (G19833) (Schmutz et al. 2014). Annotation routines were run with the BLAST2GO tool (Conesa et al. 2005), and the ESTs were aligned to the non-redundant (Nr) Viridiplantae protein database at NCBI (http://www.ncbi.nlm.nih.gov/) and categorized by Gene Ontology (GO) terms.

Results

RAD-SNP performance

A total of 23,748 putative SNPs were identified between the two groups derived from the RAD-seq libraries. Of these SNPs, approximately 58 % were transitions, giving an estimated transition/transversion ratio of 1.40. After applying the successive filters, a set of 3357 (14 %) SNPs was suitable for Illumina Golden Gate genotyping, and the 1032 SNPs with the highest scores attributed by the ADT (Assay Design Tool) are made available in this article (Supplementary Material 2). A sample of 384 SNPs, selected based on genome coverage, was assembled into a GoldenGate assay and used to genotype 95 common bean genotypes.
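The relationship between the share of transitions and the transition/transversion ratio can be checked with a short sketch. Transitions are purine-to-purine or pyrimidine-to-pyrimidine changes (A↔G, C↔T); all other substitutions are transversions. The function names below are hypothetical, and the code is an illustration rather than the procedure used to obtain the reported ratio.

```python
# The two unordered allele pairs that constitute transitions.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def is_transition(allele_1, allele_2):
    """True for A<->G or C<->T substitutions, False for transversions."""
    return frozenset((allele_1.upper(), allele_2.upper())) in TRANSITIONS


def ts_tv_ratio(snps):
    """Transition/transversion ratio for a list of (allele_1, allele_2) pairs."""
    n_ts = sum(1 for a, b in snps if is_transition(a, b))
    n_tv = len(snps) - n_ts
    return n_ts / n_tv if n_tv else float("inf")


def ratio_from_fraction(ts_fraction):
    """Ts/Tv ratio implied by the fraction of SNPs that are transitions."""
    return ts_fraction / (1.0 - ts_fraction)
```

A transition share of exactly 0.58 implies a ratio of 0.58/0.42 ≈ 1.38, so the reported ratio of 1.40 is consistent with a share of approximately 58 % (a ratio of 1.40 corresponds to about 58.3 %).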

Of the 384 SNPs evaluated in this study, 371 (96.61 %) were successfully amplified, with call rates ranging from 0.6198 to 0.9917 and GenTrain scores ≥0.2620, as provided by Genome Studio (Ritchie et al. 2011); 363 SNPs (94.53 %) were polymorphic and eight (2.08 %) were monomorphic across the 95 common bean genotypes. Of the 13 SNPs excluded from the analyses, three failed to amplify and 10 did not have well-defined clusters. An analysis of the genotypes grouped by gene pool showed 264 (68.75 %) polymorphic SNPs within the Andean group and 302 (78.64 %) in the Mesoamerican group. With the greatest allele frequency estimated at 0.728, the mean H E was 0.384 ± 0.004, ranging from 0.093 (EMBRA_1.0_Pv01_49615062_A_G) to 0.506 (EMBRA_1.0_Pv01_6538061_A_G and EMBRA_1.0_Pv05_5947407_T_C). The average estimate of H E for the Brazilian germplasm (n = 81) was 0.372, significantly different (p ≤ 0.05) from the overall value (n = 95; H E = 0.384). For the Andean gene pool (n = 21), the H E value was estimated at 0.161, while for the Mesoamerican gene pool (n = 74), it was 0.156. Considering only the cultivars/breeding lines, the total H E for the Brazilian accessions (n = 47) was 0.316 (Andean n = 7, H E = 0.139 and Mesoamerican n = 40, H E = 0.135), significantly lower (p ≤ 0.05) than that of the accessions from abroad (n = 12; H E = 0.438). Considering only the Brazilian landraces (n = 34), the total H E was 0.428 (Andean n = 11, H E = 0.195 and Mesoamerican n = 23, H E = 0.130). The H O and F IS indices were estimated at 0.009 ± 0.001 and 0.976 ± 0.003, respectively, for the entire set of samples (Table 1).

Table 1 Descriptive statistics of the 363 SNPs characterized in 95 genotypes of the common bean

Considering the germplasm grouped by origin, of the 363 polymorphic SNPs, a set of 156 differentiated the Andean (57 private alleles) from the Mesoamerican group (99 private alleles) based on the allelic pattern. The estimated power of exclusion was high, reaching 100 % (a set of 131 SNPs differentiated all genotypes). For the 363 SNPs, the combined PI was estimated at 1.6 × 10^−123 (Table 1), with individual values ranging from 0.375 to 0.829. The genetic differentiation (F ST) between the Andean and Mesoamerican populations was estimated at 0.656 (p < 0.01). A comparative analysis of descriptive statistics was performed between the SNPs previously published by Müller et al. (2015) and the data presented in this work for the same set of 88 accessions of common bean (Table 2).

Table 2 Comparison between the descriptive statistics for the 363 RAD-SNPs and 345 BARC-SNPs characterized for 88 genotypes of the common bean

For the Andean group, which is composed of cultivars, breeding lines and landraces, the higher proportion of polymorphic SNPs was identified in the landraces (68.3 % polymorphic SNPs, n = 11), while for the Mesoamerican group, composed of the same subgroups, the higher proportion (70.0 %) was identified in the line/cultivar germplasm (n = 49). For the two wild accessions evaluated, 25.6 % of the SNPs were polymorphic. A slightly higher H E was observed for the Andean landraces (n = 11, 0.195) compared to the Mesoamerican line/cultivar (n = 49, 0.152) and wild accessions (n = 2, 0.170). The genetic differentiation (F ST), which is directly related to the variance in allele frequency among populations, indicated an overall estimate of ≥0.525 (p < 0.01) between the Andean and Mesoamerican gene pool accessions. Among the subgroups of accessions grouped by gene pool (line, cultivar and landrace), the highest differentiation was observed between the Mesoamerican landrace and Andean line/cultivar subgroups (0.704, p < 0.01), and the lowest differentiation was observed between the Mesoamerican line/cultivar and Mesoamerican landrace subgroups (0.009, p > 0.05) (Table 2).

Allelic polymorphism of SNP markers was tested in eight biparental crosses, five involving parental lines of the Mesoamerican gene pool and three inter-gene pool crosses (Table 3). On average, 59 (16.25 %) and 208 (57.30 %) of the 363 SNPs were polymorphic among the intra- and inter-gene pool crosses, respectively. Also considering these 363 SNPs, the proportion of common polymorphic SNP markers was 40.50 % (147) in the inter-gene pool accessions and only 0.55 % (two) in the intra-gene pool accessions.

Table 3 Evaluation of SNP polymorphism among the biparental crosses between the Andean (A) and Mesoamerican (M) inter- (A × M) and intra- (M × M) gene pools

Cluster analysis

Genetic relationships calculated between common bean accessions on the panel of 384 SNPs showed a large genetic dissimilarity between gene pools and a relatively small genetic dissimilarity within each gene pool, as verified in the neighbor-joining tree (Fig. 1). This tree was generated for the group of 95 genotypes and clustered the genotypes into two major groups corresponding to the Andean and Mesoamerican gene pools. We also observed the formation of an intermediate subgroup composed of accessions classified as Mesoamerican (cultivars/lines and landraces), indicating that introgressions from Andean gene pool genotypes occurred during the development of these materials, through artificial or natural crosses. The Mesoamerican wild accessions G23554 and PHAS8328 grouped together and were distinguished from the domesticated accessions of the Mesoamerican gene pool. The genetic dissimilarity coefficient varied between the genotypes, with an overall observed value of 0.375, and ranged from a maximum of 0.267 (Mesoamerican wild) to a minimum of 0.135 (Andean line/cultivar).

Fig. 1 Neighbor-joining trees representing common bean relationships based on 363 SNPs. The Andean genotypes are represented in dark green (landrace) and light green (cultivar/line); the Mesoamerican genotypes are shown in red (cultivar/line) and yellow (landrace); and the wild genotypes are indicated in light blue

Population structure and AMOVA

As in the cluster analysis, the PCoA showed a clear structuring of the genotypes based on Mesoamerican and Andean origins, corroborating the high F ST values calculated. The first axis, which explained the largest share of the molecular variation in the data (91.4 %), consistently separated the accessions by gene pool (Fig. 2). In the structure analysis, K = 2 was selected as the best partition, with a subdivision based on Andean (green) and Mesoamerican (red) origin (Fig. 3). In the Andean group (n = 21), 18 accessions showed a strong genetic structure and the remaining three were hybrids resulting from hybridization between Mesoamerican and Andean genotypes. At K = 3, the Mesoamerican group was subdivided into Mesoamerican genotypes with a prevalence of the black commercial grain type (63.6 %) and Mesoamerican genotypes with a prevalence of the Carioca commercial grain type (61.1 %). The group with a predominance of the black grain type was subdivided in two at K = 4, with no evident explanation for this split, which is probably due to directed crosses in breeding programs. At K = 5, an additional fragmentation of the Mesoamerican group was observed and seven accessions were assigned as distinct from the remainder. Of these seven, five accessions were classified as hybrids, resulting from hybridization among different germplasm not characterized in the present study, and the remaining two accessions presented a distinct genetic composition (US Pinto 111 and PHAS 8328). In the AMOVA, the RAD-SNPs explained approximately 80 % of the differentiation of the accessions based on Andean and Mesoamerican origin, while 20 % was explained by populations. However, when the analysis was based on breeding (cultivar/line) and traditional (landrace) germplasm, only 2.84 % of the variation was explained by this grouping, and greater differentiation (97.15 %) was observed between individuals (within populations).

Fig. 2 Principal coordinates analysis (PCoA) of the 95 common bean genotypes with 363 SNPs

Fig. 3 Population structure inferred by the Bayesian approach based on RAD-SNPs for K = 2 to 8. Each individual is represented by a vertical line that is divided into colored segments based on the proportion of the division identified for 2–8 subpopulations. The groups include: A1 (green): Andean; M1 (pink): cultivars BRS Marfim and IPR88 Uirapuru, landrace Bico Ouro Mulatinho-MG and one Andean landrace (Jalo-GO); M2 (purple): Mesoamerican with a prevalence of 63.6 % of the cultivars with Black commercial grain type; M3 (red): Mesoamerican with a prevalence of 61.1 % of the cultivars with Carioca commercial grain type; M4 (different colors per genotype): line from Guatemala PI 181996 (brown), Canadian line 153315 (yellow), North American lines Red Mexican (purple) and US Pinto 111 (yellow), Mexican line SEA (yellow/red) and wild genotypes of Mexican origin PHAS 8328 (orange) and G23554 (light green)

Linkage disequilibrium and haplotype blocks

Of the 363 polymorphic SNP loci, four did not align to the reference genome of P. vulgaris (Andean v.1.0) through Blast analysis. From the remaining 359 SNPs placed in the genome, only one, with MAF < 0.05, was removed. For the entire set of samples, of the 358 SNPs (total of 63,903 comparisons), 58.77 % were in LD (37,556 comparisons with r 2 > 0.1 and p < 0.001). This LD estimate decreased to 0.16 % (33 comparisons with r 2 > 0.1 and p < 0.001) in the Andean (21 genotypes) and 0.93 % (174 comparisons with r 2 > 0.1 and p < 0.001) in the Mesoamerican (74 genotypes) gene pools. The total number of excluded SNPs was 152 and 164 for the Andean and Mesoamerican genotype sets, respectively. Another comparative analysis between the RAD-SNP loci presented in this work and the SNPs previously published by Müller et al. (2015) is presented in Table 4.
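The comparison counts and percentages above can be reproduced from the number of SNPs analyzed, since n loci yield n(n − 1)/2 pairwise comparisons. The sketch below is a worked check only; it assumes that the excluded SNPs reported for each gene pool (152 and 164) are subtracted from the 358 SNPs of the whole sample, an interpretation that is not stated explicitly in the text. The function name is hypothetical.

```python
from math import comb


def ld_percentage(n_snps, n_pairs_in_ld):
    """Return (total pairwise comparisons, percentage of pairs in LD)."""
    total_pairs = comb(n_snps, 2)  # n * (n - 1) / 2 unordered pairs
    return total_pairs, 100.0 * n_pairs_in_ld / total_pairs


# Whole sample: 358 SNPs, 37,556 pairs in LD.
whole_pairs, whole_pct = ld_percentage(358, 37556)

# Assumed SNP counts per gene pool: 358 minus the excluded SNPs.
andean_pairs, andean_pct = ld_percentage(358 - 152, 33)
meso_pairs, meso_pct = ld_percentage(358 - 164, 174)

print(whole_pairs, round(whole_pct, 2))    # 63903 58.77
print(andean_pairs, round(andean_pct, 2))  # 21115 0.16
print(meso_pairs, round(meso_pct, 2))      # 18721 0.93
```

Under that assumption, all three percentages agree with the values reported for the whole sample and for the Andean and Mesoamerican gene pools.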

Table 4 Estimates of linkage disequilibrium calculated for pairs of loci (r 2 > 0.1 and p < 0.001) based on 345 BARC-SNPs, 363 RAD-SNPs and 580 BARC-SNPs and RAD-SNPs

Of the 359 RAD-SNP markers that were genotyped and aligned against the P. vulgaris genome, 290 with a call rate ≥0.75 and MAF ≥ 0.05 were used for the haplotype analysis. A total of 21 haplotype blocks were identified on eight chromosomes, ranging from one block (chromosomes 7, 8 and 10) to six blocks (chromosome 1) per chromosome, and comprised 140 RAD-SNPs (40 %, average of 6.67 SNPs/block). The largest block was identified on chromosome 2, spanning 18,735 kb and 14 SNPs, and the smallest block was on chromosome 1, spanning 13 kb and two SNPs. Together, the blocks represented 20 % of the entire genome. The frequency of haplotypes ranged from 0.011 to 0.810, with the most frequent haplotype located on chromosome 1. For the Andean accessions (n = 21), three blocks were formed: one located on chromosome 1 (with 12 different haplotypes, one of which was common to 70 % of the genotypes) and two blocks located on chromosome 9 (with four distinct haplotypes each, one of which was present in 80 % of the genotypes). For the Mesoamerican pool, no blocks were identified.

Considering all 704 SNPs (359 RAD-SNPs and 345 BARC-SNPs-PV loci) that aligned to the genome, a set of 580 SNPs with a call rate ≥0.75 and MAF ≥ 0.05 was selected and analyzed, presenting an LD of 70.21 % (117,883 comparisons with r 2 > 0.1 and p < 0.001). A total of 42 blocks were identified spanning the 11 chromosomes (Table 5); the largest block was on chromosome 2 (9929 kb and 30 SNPs) and the smallest block was on chromosome 6 (5 kb and two SNPs). The haplotype frequencies ranged from 0.011 to 0.795, and the highest frequency belonged to block 1 of chromosome 1. The haplotype blocks had an average length of 8828 kb/block and covered 60 % of the sequence length. For the Andean group (Table 6), from a set of 325 SNPs (call rate ≥0.75 and MAF ≥ 0.05), four blocks were identified, with the largest on chromosome 1 (13,964 kb and 12 SNPs). For the Mesoamerican gene pool, from a set of 348 SNPs (call rate ≥0.75 and MAF ≥ 0.05), six blocks were identified, with the largest on chromosome 4 (11,011 kb and 11 SNPs) and the smallest on chromosome 6 (2180 kb and 4 SNPs).

Table 5 Overall common bean haplotype blocks based on 580 SNPs
Table 6 Andean (AND) and Mesoamerican (MESO) haplotype block based on 580 SNPs covering the common bean genome

SNP marker distribution in the common bean genome and annotation

Alignments of the 1377 SNPs against the genome of P. vulgaris, including the 1032 SNPs made available by this work and the 345 SNPs previously reported by Müller et al. (2015), are represented in Fig. 4. A total of 1368 SNPs (1023 RAD-SNPs and 345 BARC-SNPs) mapped across the 11 common bean chromosomes, with the highest and lowest representation on chromosomes 1 (176 SNPs) and 4 (74 SNPs), respectively. Of the 1032 SNPs, 485 (47.00 %) were located in genes (Supplementary Material 3), and nine genes were represented by two SNPs. The GO enrichment analysis revealed that, for the biological process category, a large number of genes (70.31 %) encoded proteins associated with cellular and metabolic processes, such as the specific terms “growth” (15 genes), “reproduction” (12 genes), “signaling” (28 genes) and “response to stimulus” (55 genes). The main classes in the molecular function and cellular component categories were catalytic activity (38.35 %) and cell (28.66 %), respectively (Fig. 5).

Fig. 4 SNP density physically mapped across eleven chromosomes of P. vulgaris L. and represented by the number of SNPs within 1 megabase (Mb) intervals. A total of 1377 SNPs were represented throughout the genome. The number of genotyped RAD-SNPs, BARC-SNPs and combined SNPs were 359, 345 and 673, respectively

Fig. 5 Functional classification of 1032 RAD-SNP sequences derived from P. vulgaris annotated with 2nd level GO terms, including attributes based on cellular component, molecular function and biological process

Discussion

For common bean, RAD sequencing was very useful for discovering genome-wide polymorphic SNP markers. The identification of a large number of RAD-SNPs (3357) and the amplification success rate of 96.61 % for the 384 tested SNPs demonstrate the potential of this set of markers, which is made available to the scientific community. The amplification success rate was comparable to the rate reported for common bean by Müller et al. (2015) (96.37 %) and higher than that described for soybean by Wu et al. (2010). The filter settings applied ensured a stringent selection of 3357 SNPs with a high probability of amplification (≥3242 SNPs). However, further analyses are needed to validate the level of polymorphism of these SNPs. This study indicated that SNPs identified by the RAD strategy could offer an opportunity to genotype populations with different genetic structures. In addition, the use of different restriction enzymes to cleave the DNA would improve polymorphism detection between genetically related genotypes.

The proportion of polymorphic RAD-SNP loci increased between gene pools (~7.0 %) and, overall, 363 RAD-SNP loci (H E = 0.396) presented similar levels of genetic diversity for the same set of 88 genotypes of common bean when compared with the previously developed 345 SNPs (H E = 0.390) described by Müller et al. (2015). Based on RAD-SNP markers, the number of polymorphic SNPs within the Andean group increased from 35.94 to 72.7 % (264 SNPs), and a slight decrease occurred in the Mesoamerican group (from 96.23 % to the actual proportion of 83.3 %). Consequently, the He value for the Andean group (n = 21; 0.163) was 2.2-fold higher compared to estimates previously obtained with BARC-SNP loci (He = 0.072) (Müller et al. 2015). Regarding the domesticated germplasm, the genetic diversity of the Andean accessions from Brazil were high (n = 19, H E = 0.161) compared with those observed (n = 60, H E = 0.089) by Rodriguez et al. (2015) including domesticated accessions from 12 countries. The differences most likely were consequence of the different sets of SNP markers used. The increased estimate of genetic diversity lies in the ease with which different samples belonging to the Andean group were pooled and sequenced (n = 4), allowing the detection of sequence variation among these genotypes and, consequently, captured a higher allele frequency distribution than previously obtained. We also detected higher genetic diversity (p ≤ 0.05) for the cultivated germplasm from abroad (n = 12; H E = 0.438) compared with the Brazilian ones (n = 47; H E = 0.316), which is probably due to diverse nature of the international germplasm, representatives of six countries. Despite the higher recombination within Mesoamerican genepool, and the expectation of being detected high LDs due to the inbreeding nature of the P. vulgaris, the high level of LD in the whole sample was predominantly due to the gene pool structure. 
The slightly increased LD in the Mesoamerican compared to the Andean gene pool (Table 4), and the high number of haplotype blocks of reduced size (Table 6), were probably favored by the selection imposed by breeding programs. The use of crosses among elite parents, the most common type of cross made in a typical bean breeding program, could reduce the genetic gains from selection, owing to the high genetic similarity among parents from the same gene pool. In Brazil, all the common bean grain consumed is from the Mesoamerican gene pool. As expected, the genetic bottleneck following successive breeding efforts in this gene pool has allowed increased linkage disequilibrium to arise, compared with accessions from the Andean gene pool (Hudson 2001). The number of haplotypes is expected to increase drastically as the number of genotyped SNPs increases.
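To make the diversity comparisons above concrete, the following minimal sketch computes expected heterozygosity (He) per locus as 1 minus the sum of squared allele frequencies, then averages over loci. It is an illustration only: the software and estimator actually used are not stated in this section, the function names and toy genotypes are hypothetical, and each inbred line is assumed to contribute a single allele per locus with no small-sample correction applied.

```python
from collections import Counter


def expected_heterozygosity(alleles):
    """Gene diversity for one locus: 1 - sum(p_i ** 2).

    `alleles` holds one allele call per inbred line; None marks a missing call.
    No small-sample (n / (n - 1)) correction is applied in this sketch.
    """
    calls = [a for a in alleles if a is not None]  # drop missing calls
    if not calls:
        raise ValueError("no non-missing allele calls at this locus")
    n = len(calls)
    counts = Counter(calls)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def mean_he(loci):
    """Average He over a list of loci (each locus is a list of allele calls)."""
    values = [expected_heterozygosity(locus) for locus in loci]
    return sum(values) / len(values)


# Hypothetical toy data: three loci scored on four inbred lines.
toy_loci = [
    ["A", "A", "G", "G"],  # two alleles at 0.5 each
    ["C", "C", "C", "T"],  # alleles at 0.75 and 0.25
    ["T", "T", "T", "T"],  # monomorphic locus
]

print(round(mean_he(toy_loci), 4))
```

With the values reported above, the ratio between the Andean estimates is 0.163 / 0.072 ≈ 2.26, consistent with the roughly 2.2-fold increase described in the text.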

Considering crosses involving common bean parents from distinct origins (Andean × Mesoamerican), RAD-SNP loci were considerably less polymorphic (~57 %) than the previously identified BARC-SNP loci (~78.0 %, Müller et al. 2015), while for intra-gene pool crosses, the rate of polymorphism was similar between these markers (~16.0 and ~17.0 %, respectively). Thus, these sets of markers complement each other for linkage analysis. For an important common bean mapping population for drought tolerance (F9 RIL population generated by SSD), derived from the cross between the Durango (SEA5; Singh et al. 2001) and Nueva Granada races (AND277; Blair et al. 2009) and developed at the International Center for Tropical Agriculture (CIAT, Cali, Colombia), 495 high-quality polymorphic SNPs were identified for use in linkage analysis. Focusing on bacterial wilt resistance genes, the biparental population Ouro Branco × CNFP10132 (F5:7 generation; Müller et al. 2015), whose parental lines were genotyped in this study, also has great potential to be analyzed with the set of 442 polymorphic SNPs made available in the present study. Integrating genotyping with sequencing technology for QTL mapping of large recombinant inbred line (RIL) populations will clearly provide quicker advances (Spindel et al. 2013). However, the immediate availability of molecular tools can minimize efforts and accelerate QTL mapping studies.
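The polymorphism rates quoted for parental combinations can be illustrated with a short sketch that counts the loci at which two homozygous parents carry different alleles. This is not the procedure used in the study: the function names, locus identifiers and genotype calls are hypothetical, and each parent is assumed to be fully inbred, so that a single allele per locus describes its genotype.

```python
def scored_loci(parent_a, parent_b, missing=None):
    """Loci with a non-missing call in both parents."""
    return [
        locus
        for locus, allele in parent_a.items()
        if allele != missing and parent_b.get(locus, missing) != missing
    ]


def polymorphic_loci(parent_a, parent_b, missing=None):
    """Loci at which the two parents carry different alleles."""
    return [
        locus
        for locus in scored_loci(parent_a, parent_b, missing)
        if parent_a[locus] != parent_b[locus]
    ]


def polymorphism_rate(parent_a, parent_b, missing=None):
    """Fraction of jointly scored loci that are polymorphic between parents."""
    scored = scored_loci(parent_a, parent_b, missing)
    if not scored:
        raise ValueError("no locus was scored in both parents")
    return len(polymorphic_loci(parent_a, parent_b, missing)) / len(scored)


# Hypothetical calls for two inbred parents (locus names are made up).
parent_1 = {"snp_001": "A", "snp_002": "C", "snp_003": "G", "snp_004": "T", "snp_005": None}
parent_2 = {"snp_001": "A", "snp_002": "T", "snp_003": "G", "snp_004": "C", "snp_005": "A"}

print(polymorphic_loci(parent_1, parent_2))
print(polymorphism_rate(parent_1, parent_2))
```

Loci with a missing call in either parent are excluded from both the numerator and the denominator, which is one of several reasonable conventions.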

The RAD-SNPs developed from two pools of DNA were very informative for capturing the genetic diversity between and within gene pools, compared to the SNPs obtained from the sequence alignment of individual genotypes. Within the Andean group, the number of SNP loci with MAF > 0.10 was considerably higher for the RAD-SNPs (total of 205 RAD-SNP loci vs. 55 BARC-SNPs), while for the Mesoamerican group, the total number of SNPs was similar (193 RAD-SNP loci vs. 141 BARC-SNPs). As these estimates were obtained for the same sample set, this study demonstrated the efficient use of RAD sequencing to enable SNP development with adequate MAF (≥1 %). Gautier et al. (2013) showed that NGS of a pool of diploid individuals provides more accurate SNP allele frequency estimates than individual-based analysis and, in addition, provides a very cost-effective approach to estimate patterns of genetic diversity for a large set of populations. While the 363 RAD-SNP markers tended to amplify the differences between the Andean and Mesoamerican gene pools (FST = 0.656), the dendrogram analysis demonstrated that landraces and lines/cultivars within each gene pool share a high proportion of SNP loci. The differentiation was slightly lower than previously estimated for the same set of samples (FST = 0.759, Müller et al. 2015). The strategy of pooling samples for SNP design proved more efficient at capturing genetic variation in the Andean germplasm; however, the structuring by origin was still predominant (K = 2). Even with the screening of 12 individuals for the development of SNPs in the present work, a bias towards common polymorphisms and an overestimation of genetic differentiation were observed. Thus, following the proposition of Goretti et al. (2014), a more representative sample set should be used to develop SNP markers, allowing more realistic estimates of genetic diversity and genetic differentiation and reducing the ascertainment bias.
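As a rough illustration of how differentiation between the two gene pools can be quantified, the sketch below computes a multi-locus FST in the (HT - HS) / HT form for two groups from biallelic allele frequencies. The estimator used to obtain the values reported above is not specified in this section, so this simple form is an assumption; the function names and frequencies are hypothetical, and the two groups are weighted equally regardless of sample size.

```python
def heterozygosity(p):
    """Expected heterozygosity 2pq of a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_two_pools(freqs_pool_1, freqs_pool_2):
    """Multi-locus FST = sum(HT - HS) / sum(HT) for two equally weighted groups.

    Each argument lists the frequency of the same reference allele per locus.
    """
    if len(freqs_pool_1) != len(freqs_pool_2) or not freqs_pool_1:
        raise ValueError("both groups need the same non-empty set of loci")
    sum_between = 0.0
    sum_total = 0.0
    for p1, p2 in zip(freqs_pool_1, freqs_pool_2):
        h_s = (heterozygosity(p1) + heterozygosity(p2)) / 2.0  # within groups
        h_t = heterozygosity((p1 + p2) / 2.0)                  # groups combined
        sum_between += h_t - h_s
        sum_total += h_t
    if sum_total == 0.0:
        raise ValueError("all loci are monomorphic across both groups")
    return sum_between / sum_total


# Hypothetical allele frequencies at three loci in each gene pool.
andean = [1.0, 0.5, 0.9]
mesoamerican = [0.0, 0.5, 0.1]

print(round(fst_two_pools(andean, mesoamerican), 4))
```

Loci fixed for alternative alleles in the two groups push the estimate towards 1, whereas loci with similar frequencies in both groups pull it towards 0. This is why a marker panel enriched for between-pool polymorphisms tends to inflate differentiation.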

As the set of genotypes used in this study is representative of the Brazilian germplasm, it can be supposed that, owing to the great extent of LD (58.77 %) and the small number of haplotypes, a reduced number of SNP markers will be required to detect linkage between markers and target genes related to phenotypic variation. Certainly, given the autogamous nature of the common bean and the domestication process to which this crop was subjected (Bitocchi et al. 2013), this statement holds true. Despite the significant LD (70 %) and haplotype block genome coverage (60 %) identified through the analysis of 580 SNPs for the combined germplasm, the results suggest that dense SNP coverage will be needed to identify haplotypes that capture most of the genetic variation across inter-gene pool populations. SNP filtering by MAF (≥0.05) reduced the number of useful SNPs (~330) for LD and haplotype analysis by half and, consequently, limited block identification (four and six blocks for the Andean and Mesoamerican groups, respectively). The identification of representative tag SNPs spanning the entire genome in common bean has been a challenge and depends on the characterization of a larger number of SNPs; however, few SNPs will likely be required for whole-genome sampling. This has implications for the design of genetic association studies. As the idea behind GWAS is to genotype a large number of markers distributed across the genome, the high LD observed within a gene pool could be efficiently exploited in breeding programs to increase the efficiency of selection. As more plants are genotyped and phenotyped, functional nucleotide polymorphisms, that is, changes in specific genes causing desired phenotypes, will be identified for the development of functional markers.
However, additional work to determine which gene within a haplotype block is related to the evaluated trait could be carried out through sequence similarity searches against model plant databases or the development of TILLING mutants (Porch et al. 2009).
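The MAF filtering and pairwise LD discussed above can be sketched as follows. This is a simplified illustration rather than the pipeline used in the study: it assumes fully inbred lines, so that each line contributes one haplotype, with alleles coded 0/1 and no missing data, and it uses the common r² statistic. All names and genotype vectors are hypothetical.

```python
def minor_allele_frequency(calls):
    """MAF of a biallelic SNP coded 0/1, one call per inbred line."""
    p = sum(calls) / len(calls)
    return min(p, 1.0 - p)


def filter_by_maf(snps, threshold=0.05):
    """Keep SNPs whose MAF is at least `threshold`."""
    return {
        name: calls
        for name, calls in snps.items()
        if minor_allele_frequency(calls) >= threshold
    }


def r_squared(snp_a, snp_b):
    """Pairwise LD as r^2 = D^2 / (pA * (1 - pA) * pB * (1 - pB))."""
    n = len(snp_a)
    if n == 0 or n != len(snp_b):
        raise ValueError("both SNPs must be scored on the same non-empty set of lines")
    p_a = sum(snp_a) / n
    p_b = sum(snp_b) / n
    # Frequency of the haplotype carrying allele 1 at both SNPs.
    p_ab = sum(1 for a, b in zip(snp_a, snp_b) if a == 1 and b == 1) / n
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0.0:
        raise ValueError("r^2 is undefined when a SNP is monomorphic")
    d = p_ab - p_a * p_b
    return d * d / denom


# Hypothetical 0/1 calls for four SNPs scored on eight inbred lines.
toy_snps = {
    "snp_a": [0, 0, 1, 1, 0, 1, 0, 1],
    "snp_b": [0, 0, 1, 1, 0, 1, 0, 1],  # identical to snp_a
    "snp_c": [0, 1, 0, 1, 0, 1, 0, 1],
    "snp_d": [0, 0, 0, 0, 0, 0, 0, 0],  # monomorphic, removed by the filter
}

kept = filter_by_maf(toy_snps, threshold=0.05)
print(sorted(kept))
print(r_squared(kept["snp_a"], kept["snp_b"]), r_squared(kept["snp_a"], kept["snp_c"]))
```

Filtering before computing r² avoids undefined values for monomorphic SNPs. It also shows why a stricter MAF threshold reduces the number of marker pairs available for block detection.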

In this study, we were interested in assessing the potential of RAD for direct genotyping-by-sequencing in the common bean. The developed markers, added to the previously published SNPs (Müller et al. 2015), resulted in 708 high-quality and genetically informative SNPs with a uniform physical distribution across the P. vulgaris genome. Both SNP panels have broad application in breeding and are recommended for use in genetic analyses of common bean germplasm. In addition, with the several genotyping methods made available in recent years, the demand for SNP genotyping can be increasingly met. This work has allowed an accurate assessment of population structure and quantification of genetic diversity, and may be useful in whole-genome association studies. It is noteworthy that most of the germplasm used in the present study is of Brazilian origin (85 %) and, for this reason, the genetic diversity captured was suitable for answering questions related to such accessions. However, the public availability of these markers will enable a wide variety of additional analyses to be carried out by scientists investigating and comparing the genetic variation across germplasm from other countries, as well as exploring the genetic basis of complex traits.