Introduction

With consumer sales of $2 billion in the USA from 2006 to 2008 (ers.usda.gov), common bean is the most important edible legume in the Americas, Africa and Europe (Osorno and McClean 2014). Beans are rich in protein and fiber, and an excellent source of minerals, such as potassium and iron, and vitamins like thiamine, vitamin B6, and folate (Garden-Robinson and McNeal 2013; Bennink and Rondini 2008). Considering the importance of dry edible beans for the human diet, as well as their economic impact, the development of genomic resources, such as high-resolution genetic maps and markers, plays a pivotal role in dry bean breeding to help breeders to improve their germplasm.

In recent years, there have been vast improvements in dry bean genetics, such as the development of the customized 6000-Golden Gate iSelect BeadChip panel (Hyten et al. 2010), which drastically improved mapping quality, providing more than 5000 SNPs. Moghaddam et al. (2014) developed 2687 market class specific InDel (Insertion/Deletion) markers, distributed across the genome, which can be used for mapping, as well as for phylogeny. In addition, the release of the Phaseolus vulgaris genome by Schmutz et al. (2014) significantly helps to facilitate genomics research.

Another helpful tool for genetic analyses is genotyping-by-sequencing (GBS; Elshire et al. 2011), a next-generation sequencing (NGS) method that captures SNP data using a reduced-representation library (RRL). It has become an important tool to analyze genomes and generally provides improved genomic data in terms of marker distribution and density. However, the quality of the GBS results depends on two main factors: genome size and library preparation (Hamblin and Rabbi 2014). Preparing a GBS library consists of a digestion step, adapter annealing, PCR amplification, and sample pooling. The pooled samples are then sequenced using the Illumina platform, whereby a specific fraction, rather than the entire genome, is sequenced.

The original GBS methodology by Elshire et al. (2011) uses a single restriction enzyme (e.g. ApeKI) to digest DNA samples. A barcoded adapter is then ligated to one side of the restricted DNA and a common adapter to the other side. However, since both cut-sites are identical, 25 % of the fragments will carry common adapters on both sides, and another 25 % will carry barcoded adapters on both sides. In both cases, bridge amplification in the Illumina flow cell is not possible (Illumina.com), which results in a loss of data. The use of two enzymes and a Y-adapter as the common adapter, as described by Poland et al. (2012), helps to prevent these issues. Additionally, the enzyme choice is highly important, since it influences the DNA fragment size and the number of fragments represented in the GBS library. For example, a frequent cutter produces many small DNA fragments, resulting in a GBS library with low coverage per fragment. Ideal fragment sizes range from 150 to 300 bp for single-end reads, and from 250 to 500 bp for paired-end reads (support.illumina.com; Hamblin and Rabbi 2014).
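The 25 %/50 %/25 % split described above follows from counting the possible adapter pairings. The following is a minimal sketch, assuming that each fragment end ligates a barcoded or a common adapter independently and with equal probability; this assumption and all names in the code are illustrative and are not taken from Elshire et al. (2011).

```python
from fractions import Fraction
from itertools import product

# Two adapter types compete for both (identical) ends of a single-enzyme fragment.
ADAPTERS = ("barcoded", "common")


def adapter_pair_fractions():
    """Fraction of fragments per unordered adapter pairing.

    Assumption: each end ligates either adapter with probability 1/2,
    independently of the other end.
    """
    pairs = list(product(ADAPTERS, repeat=2))  # (left end, right end)
    fractions = {}
    for left, right in pairs:
        key = "+".join(sorted((left, right)))
        fractions[key] = fractions.get(key, Fraction(0)) + Fraction(1, len(pairs))
    return fractions


fractions = adapter_pair_fractions()
for pairing, share in fractions.items():
    print(f"{pairing}: {float(share):.0%}")
```

Under this assumption, only the mixed pairing (one barcoded and one common adapter) yields fragments that can be sequenced, which is half of the ligated fragments.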

Thanks to the availability of the P. vulgaris genome sequence (Schmutz et al. 2014), in silico digest analyses can be used to optimize the DNA fragment size distribution of the library. The chromosome-scale assembly of the common bean genome is 521 Mb and consists of 41 % repetitive DNA. The gene models are organized in gene islands and can also be found in heterochromatic regions (Schmutz et al. 2014). For a uniform distribution of markers, non-methylation-sensitive enzymes are therefore preferable for a Phaseolus GBS library. In contrast, Huang et al. (2014) used methylation-sensitive enzymes in a PstI and MspI double digest to study green foxtail [Setaria viridis (L.) P. Beauv.] and generated 39,416 SNPs from 252 genotypes. This approach increased coverage at particular loci, mainly in non-methylated regions, to improve the detection of heterozygotes. Increased coverage at particular loci was also the reason to develop a double-digest library to study blackcurrant (Ribes nigrum L.; Russell et al. 2014). While the use of methylation-sensitive enzymes would lose information about the gene islands located in methylated regions of the Phaseolus chromosomes, it would increase coverage at certain loci in the non-methylated areas of the chromosomes. Sonah et al. (2013) used single digests of ApeKI, MspI, and PstI for in silico digestion of the soybean [Glycine max (L.) Merr.] genome and found ApeKI the most suitable candidate for GBS library construction and Illumina sequencing, producing 800,000 DNA fragments between 100 and 400 bp, and 10,120 SNPs from eight genotypes. In common bean, Hart and Griffiths (2015) also compared digestion enzymes, such as ApeKI and PstI, and used eight different adapter concentrations for each enzyme in order to optimize GBS results. This approach resulted in 7530 high-quality SNPs, after imputation and selection for minor allele frequency of ≥0.05, from a 96-plex ApeKI GBS library, which they found superior to the PstI library. For that study, a RIL population of 84 lines and 12 parental checks was used.

The objective of this study was to improve GBS quality and SNP density for dry edible bean libraries by comparing various library preparation methods.

Materials and methods

A priori genome analysis

The following factors were considered in order to optimize the GBS protocol for dry beans: (1) the genome size and structure of P. vulgaris; (2) DNA methylation and restriction sites; (3) restriction fragment size selection; (4) the Illumina sequencing method (HiSeq 2500 rapid single-end run); and (5) the number of samples.

Because the Phaseolus genome contains gene islands within the heterochromatic regions (Schmutz et al. 2014), a uniform distribution of markers across both the euchromatic and heterochromatic regions is preferable for the purpose of mapping. To optimize coverage, the sequencing method and the number of DNA fragments going into the GBS library both play an important role. A HiSeq 2500 rapid single-end run has an approximate output of 130 million 200-bp reads (Illumina.com). Commonly, 96 samples are run at a time, providing more than 1.3 million reads per sample. Depending on the number of DNA fragments used for sequencing, the theoretical coverage can be calculated as reads per sample divided by the number of fragments. Also of benefit are double-digest libraries, constructed with a Y-adapter (Poland et al. 2012), to prevent data loss caused by unassigned reads.

In order to evaluate different enzymes and enzyme combinations and to better estimate how certain combinations may benefit GBS library construction, the reference genome was digested in silico using in-house software (https://github.com/mrmckain/REDFreq; Table 1; Fig. 1a–d). After digesting the P. vulgaris genome (phytozome.net) in silico with five enzyme combinations (Table 1), TaqαI and MseI double digestion appeared to be the best fit for library construction, considering the five criteria described above. Neither of these enzymes is methylation sensitive. The combination of TaqαI and MseI provides coverage across all the chromosomes, including the heterochromatic regions, and an optimized number of fragments between 300 and 800 bp (Fig. 1) for bridge amplification.
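The idea behind an in silico digest can be illustrated with a short sketch: locate every recognition site in the sequence, cut there, and count the fragments that fall inside the size-selection window. This is not the REDFreq implementation, whose internals are not described here; the function names and the toy sequence are hypothetical. The recognition sites are the standard ones (T^CGA for TaqαI, listed under the key `TaqI`; T^TAA for MseI; G^CWGC for ApeKI). The sketch counts all fragments regardless of which enzyme produced each end.

```python
import re

# Recognition pattern and cut offset (bases from the start of the site).
# All three sites read the same on both strands, so one strand is scanned.
ENZYMES = {
    "TaqI": ("TCGA", 1),       # T^CGA (TaqalphaI recognizes the same site)
    "MseI": ("TTAA", 1),       # T^TAA
    "ApeKI": ("GC[AT]GC", 1),  # G^CWGC, W = A or T
}


def cut_positions(sequence, enzyme_names):
    """Sorted cut coordinates for a complete digest with the given enzymes."""
    cuts = set()
    for name in enzyme_names:
        pattern, offset = ENZYMES[name]
        # A lookahead reports overlapping recognition sites as well.
        for match in re.finditer(f"(?=({pattern}))", sequence.upper()):
            cuts.add(match.start() + offset)
    return sorted(cuts)


def fragment_lengths(sequence, enzyme_names):
    """Lengths of all fragments produced by the digest."""
    bounds = [0] + cut_positions(sequence, enzyme_names) + [len(sequence)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


def count_in_window(lengths, min_bp, max_bp):
    """Number of fragments inside the size-selection window (inclusive)."""
    return sum(min_bp <= length <= max_bp for length in lengths)


# Toy sequence with one TaqI, one MseI and one ApeKI site (illustration only).
toy = "AAAATCGAGGGGGGTTAACCCCGCAGCTTTT"
double = fragment_lengths(toy, ["TaqI", "MseI"])
single = fragment_lengths(toy, ["ApeKI"])
print(double, single, count_in_window(double, 8, 12))
```

On a real genome, the window would be 300 to 800 bp. Because a run of "N" never matches a recognition pattern, a hardmasked sequence yields no cuts inside masked regions.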

Table 1 Enzymes and enzyme combinations used for in silico digestion of the P. vulgaris L. genome
Fig. 1 In silico digestion of the P. vulgaris L. genome with ApeKI, using the unmasked genome (a), which includes all repetitive genome sequences, and (b) the hardmasked genome, which hides repetitive sequences by using “N’s” instead and therefore simulating methylation in this region. In silico digestion with TaqαI/MseI, (c) unmasked and (d) hardmasked. The red rectangle represents the number and length of DNA fragments for size selection. (Color figure online)

DNA extraction and GBS library preparation

Twenty-five dry bean genotypes (supplemental table S1) out of a pool of 96 samples were chosen according to DNA quality (260/280 nm absorbance ratio >1.8) and digested with MseI/TaqαI to develop a GBS library. A second library was developed using DNA digested with ApeKI. All samples are part of the Mesoamerican Diversity Panel (MDP; www.beancap.org). The plants were grown in the greenhouse at 22 °C with supplemental light (600 W high-pressure sodium lamps) from 6:00 am to 8:00 pm until sampling at the first trifoliate leaf stage. High molecular weight DNA of each individual was extracted from young leaves using a CTAB protocol (Doyle and Doyle 1987).

Both GBS libraries (ApeKI and MseI/TaqαI) were constructed based on a modified protocol of Poland et al. (2012). The only differences in library construction were the enzymes themselves and the corresponding adapters for ligation. Both libraries were size selected for ideal bridge amplification. After ligation, DNA fragments <300 bp were removed from individual samples using 0.7 volumes of Sera-Mag™ Magnetic SpeedBeads prepared according to Rohland and Reich (2012). Individual samples (4 µl of sample solution) were checked via PCR using a 34 s extension time, and visualized on a 3 % agarose gel to estimate product size range and quantity. Barcoded samples were pooled and used for library construction. The PCR extension time during the library preparation step was limited to 17 s to reduce amplification of DNA fragments larger than 800 bp. The pools were sequenced by the HudsonAlpha Genome Sequencing Center, Huntsville, AL, USA, as 200 bp single-end reads on one lane of an Illumina HiSeq 2500 using the high-output run mode or two lanes (on-board clustering) of a HiSeq using the rapid run mode, respectively.

Data processing and handling

The read quality was checked with FastQC 0.11.2 (bioinformatics.babraham.ac.uk). Raw fastq reads of all accessions were split into separate fastq files, based on their barcodes, using either an in-house barcode splitter (for MseI/TaqαI) or Stacks 1.30 (for ApeKI) (Catchen et al. 2013). All reads were trimmed to 190 bp at the 3′ end based on Phred-scaled quality scores >20. The trimmed sequences were aligned to the non-masked (repetitive sequences not masked with “N’s”) reference genome of P. vulgaris (phytozome.net) using bowtie2 (Langmead and Salzberg 2012). SNPs were called using VarScan (Koboldt et al. 2012) (supplemental figure S1). Several filters were applied using VCFtools 0.1.12b (Danecek et al. 2011) to minimize the number of false-positive SNPs. Only SNPs with all of the following characteristics were retained for analysis: (1) less than 50 % missing data; (2) only one alternative allele; (3) a minor allele frequency of more than 5 %; (4) mapped to one of the 11 pseudo-chromosomes; (5) a Phred-scaled mapping quality greater than 25; and (6) a total read depth >100×.
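The six retention criteria listed above combine with a logical AND. The sketch below expresses them as a single predicate in plain Python for illustration only; the filtering in this study was done with VCFtools, and the record type, field names, and chromosome naming (`Chr01` to `Chr11`) used here are assumptions, not part of the original pipeline.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical per-site summary; field names are illustrative."""
    chrom: str
    missing_fraction: float   # share of genotypes without a call
    n_alt_alleles: int
    minor_allele_freq: float
    mapping_quality: float    # Phred-scaled
    total_depth: int          # reads summed over all samples


# Assumed naming of the 11 pseudo-chromosomes.
PSEUDO_CHROMOSOMES = {f"Chr{i:02d}" for i in range(1, 12)}


def passes_filters(site):
    """True only if the site meets all six criteria from the text."""
    return (
        site.missing_fraction < 0.50          # (1) missing data < 50 %
        and site.n_alt_alleles == 1           # (2) one alternative allele
        and site.minor_allele_freq > 0.05     # (3) MAF > 5 %
        and site.chrom in PSEUDO_CHROMOSOMES  # (4) on a pseudo-chromosome
        and site.mapping_quality > 25         # (5) mapping quality > 25
        and site.total_depth > 100            # (6) total read depth > 100x
    )


kept = Site("Chr01", 0.10, 1, 0.20, 40, 350)
on_scaffold = Site("scaffold_12", 0.10, 1, 0.20, 40, 350)
rare_allele = Site("Chr05", 0.10, 1, 0.03, 40, 350)
print([passes_filters(s) for s in (kept, on_scaffold, rare_allele)])
```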

Individual genotype calls were regarded as low quality if their read depth was below a threshold of 3×, 5×, or 8×, and the analysis was run at each of these thresholds. In order to make both runs comparable (total number of SNPs, average and maximum SNP distance), ApeKI and MseI/TaqαI HiSeq runs were normalized to an average of 1,000,000 reads per sample after mapping to the reference genome, to exclude sequencing effects, such as number of reads.

Results

This study describes an optimized GBS method, using in silico digestion of the Phaseolus genome for fragment size optimization. This analysis compared both single-enzyme (Elshire et al. 2011) and double-enzyme digests (Poland et al. 2012).

The MseI/TaqαI enzyme combination covers both the euchromatic and the heterochromatic chromosome regions, and creates approximately 35,000 fragments in the desired length range from 300 to 800 bp. In comparison, a GBS library constructed using ApeKI, which is widely used for GBS, produced more than 60,000 fragments, even after size selecting for DNA fragments >300 bp. This would result in a theoretical coverage of approximately 20×, considering 1.3 million reads per sample. Unfortunately, the number of DNA fragments going into the ApeKI GBS library is hard to predict, due to the partial methylation sensitivity of the enzyme. For this reason, the unmasked P. vulgaris genome was also used for in silico digestion. The unmasked genome digest with ApeKI showed 185,000 DNA fragments in the same size range, lowering the theoretical coverage drastically. Due to the increased number of DNA fragments, the resulting theoretical coverage is less than half that of a TaqαI/MseI GBS library, since more fragments have to be covered by a limited number of reads.

The use of a Y-adapter ensures that all sequenced fragments will be flanked by one barcoded adapter and one common adapter. The fragments are also size selected to narrow the fragment pool for sequencing. The upper size limit is determined by the length of the PCR elongation step, and the lower size limit by removal of small fragments using magnetic beads. Size selection was adjusted for genome size, project interests (mapping), and the Illumina sequencing technique used (HiSeq 2500 rapid single-end run). The pool of size-selected fragments was multiplexed, and a library was prepared for sequencing. To analyze the data obtained from sequencing, an in-house computing pipeline was used (supplemental figure S1).

From the two GBS libraries, we obtained a total of 17,784,641 (ApeKI) and 63,633,785 (MseI/TaqαI) raw reads, respectively. The read count of the ApeKI library varied from 23,530 to 3,473,482 reads per sample with an average of 711,385 reads per sample. The number of reads of the MseI/TaqαI library varied from 1,837,775 to 4,998,047 reads per sample, averaging 2,545,351 reads (supplemental table S1). On average, about 50 % of the ApeKI reads and almost 67 % of the MseI/TaqαI reads mapped to the reference genome, so that 4.76 times more reads mapped from the MseI/TaqαI library than from the ApeKI library. After filtering, the average coverage for the ApeKI library was 22.4×, and for the MseI/TaqαI library it was 16.1× (Table 2). Despite the lower coverage of mapped reads, the MseI/TaqαI library shows far more mapped read sites (247,482) than the ApeKI library (2444), in which read mapping is distorted along the chromosomes (Fig. 2a, b).

Table 2 Number of sites with mapped reads per sample and library construction and their corresponding average coverage. Total number of sites and average coverage per library
Fig. 2 Mapped read distribution across the chromosomes. The x-axis represents the number of tags along the chromosome, while the y-axis shows the tag location on the chromosome in bp. (a) ApeKI library; (b) MseI/TaqαI library

The number of SNPs called is strongly correlated with the raw read counts as well as with the percentage of mapped reads. For 3× coverage, 6779 SNPs out of 112,513 sites (6.0 %) were kept after filtering the data generated by the ApeKI library, contrasting with 121,740 SNPs out of 523,605 sites (23.3 %) obtained from the MseI/TaqαI library. After filtering the data using 5× coverage, 4080 SNPs out of 112,516 sites (3.6 %) of the ApeKI library, and 97,329 SNPs out of 523,602 sites (18.6 %) of the MseI/TaqαI library were kept. With filters and 8× coverage applied, 937 SNPs out of 69,733 sites (1.3 %) were kept for the ApeKI library, and 55,752 SNPs out of 360,482 sites (15.5 %) for the MseI/TaqαI library. After normalization to an average of 1,000,000 reads per library and sample, 18,981 (3×), 11,424 (5×) and 2624 (8×) SNPs, respectively, were obtained from the ApeKI library, and 71,827 (3×), 57,424 (5×) and 32,894 (8×) from the MseI/TaqαI library.
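The normalization procedure is not spelled out in the text, but the normalized counts reported above are consistent with simple linear scaling of the filtered SNP counts. Factors of 2.8 (ApeKI) and 0.59 (MseI/TaqαI) reproduce them, and both are close to 1,000,000 divided by the approximate mean number of mapped reads per sample (about 356,000 and 1.7 million). Both the linear model and the two factors are assumptions inferred from the reported numbers, and the sketch below only checks that arithmetic.

```python
# Filtered SNP counts per minimum read depth, as reported in the text.
RAW_SNPS = {
    "ApeKI": {3: 6779, 5: 4080, 8: 937},
    "MseI/TaqI": {3: 121740, 5: 97329, 8: 55752},
}

# Assumed scale factors, inferred from the reported values (not stated
# in the source): roughly 1,000,000 / mean mapped reads per sample.
SCALE = {"ApeKI": 2.8, "MseI/TaqI": 0.59}


def normalized_snps(library):
    """Scale each filtered SNP count linearly and round to whole SNPs."""
    factor = SCALE[library]
    return {depth: round(count * factor)
            for depth, count in RAW_SNPS[library].items()}


for library in RAW_SNPS:
    print(library, normalized_snps(library))
```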

SNPs were detected on average every 69,559 bp in the ApeKI library for 3× coverage, and every 126,295 bp for 5× coverage. The maximum distance between adjacent SNPs was 4.5 and 6.2 Mbp, respectively. For 8× coverage, SNPs were detected on average every 541,330 bp, and the maximum distance between two adjacent SNPs was 13.2 Mbp. SNP distribution in the MseI/TaqαI library is far denser, with SNPs on average 4,307 bp apart and at most 487,180 bp apart for 3× coverage, and on average 5,399 bp apart and at most 610,250 bp apart for 5× coverage. For the MseI/TaqαI library filtered for 8× coverage, the average distance between two SNPs was 9,449 bp, with a maximum distance between two adjacent SNPs of 1.8 Mbp.

Discussion

Based on the in silico digests, increased coverage was expected for double-digested fragments (MseI/TaqαI) compared with single-digested fragments (ApeKI). However, increasing the number of DNA fragments going into the GBS library does not necessarily result in a decrease in detected SNPs. This is because a calculated 20× coverage for ApeKI (using DNA fragments in the optimal size for bridge amplification) can still be considered useful for SNP detection. Only rare DNA fragments, which were either underrepresented in the GBS library or did not amplify well during bridge amplification, would be left undetected.

However, ApeKI’s methylation sensitivity makes it difficult to predict restriction fragment sizes. In silico digestion of the unmasked P. vulgaris genome (Fig. 1a; Table 1) shows a much larger number of fragments (185,000) that are ideal for bridge amplification, resulting in a decrease in mapped reads with high coverage. The mapped read distribution (Fig. 2) of the ApeKI library compared to the MseI/TaqαI library shows that both methods have a similar and uniform distribution of tags along the chromosomes. This indicates that ApeKI is not much affected by methylation, resulting in more restriction fragments than originally anticipated and leading to low coverage in many parts of the genome. Still, the single-digest library also shows a distorted tag distribution, which is caused by low read counts and a low number of mapping reads compared to the MseI/TaqαI library.

The SNP distribution was also found to be similar between the two GBS libraries. Because ApeKI is partially methylation sensitive, it was expected to cut inefficiently in methylated DNA regions such as centromeres and telomeres. However, the distribution of SNPs is concentrated within the centromeric regions of the chromosomes, similar to the enzyme combination MseI/TaqαI (Fig. 3a, b).

Fig. 3 SNP distribution for 3× (a, b) and 5× (c, d) coverage across the chromosomes, normalized to an average of 1 million mapped reads. Red line represents moving average trend line. Not shown is the 8× coverage, due to low SNP count for the ApeKI library. (Color figure online)

Size selection via magnetic beads is another method to improve GBS results, not only because it restricts the library to DNA fragments of optimum length for bridge amplification, but also because it avoids tedious adapter adjustments. If the ApeKI digest were not size selected, approximately 470,000 DNA fragments (Fig. 1) per sample would go into the GBS library, resulting in a calculated average coverage of 2.8×. With size selection, this theoretical value more than doubles for ApeKI, considering 185,000 fragments per sample. Size selection is even more important for creating the MseI/TaqαI library. Here, without size selection, 2.5 million fragments per sample would be included in the library and reduce coverage to 0.52×. However, no study suggests what value this theoretical coverage should have in order to achieve good GBS data. In this study, a theoretical coverage of more than 20× was considered to be sufficient. In soybean, ApeKI digestion produced approximately 800,000 DNA fragments for library preparation (Sonah et al. 2013). Fragment representation was further reduced using primers with an additional base or bases. The approach of optimizing the digest for an increased number of fragments ideal for bridge amplification, with a later reduction of fragments for higher coverage, makes the size selection step redundant. However, adapter adjustments for library preparation still remain an obstacle.
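The theoretical coverages quoted in this study follow from dividing the reads per sample by the number of fragments entering the library. The sketch below recomputes them from the fragment counts given in the text, assuming 1.3 million reads per sample; the dictionary labels and function name are illustrative.

```python
READS_PER_SAMPLE = 1_300_000  # about 130 million reads over 96 samples

# Approximate fragment counts per sample, taken from the text.
FRAGMENT_COUNTS = {
    "MseI/TaqI, 300-800 bp": 35_000,
    "ApeKI, size selected (first estimate)": 60_000,
    "ApeKI, size selected (unmasked genome)": 185_000,
    "ApeKI, no size selection": 470_000,
    "MseI/TaqI, no size selection": 2_500_000,
}


def theoretical_coverage(reads_per_sample, n_fragments):
    """Average number of reads available per fragment."""
    return reads_per_sample / n_fragments


coverage = {label: theoretical_coverage(READS_PER_SAMPLE, n)
            for label, n in FRAGMENT_COUNTS.items()}
for label, value in coverage.items():
    print(f"{label}: {value:.2f}x")
```

Only the two smallest fragment pools exceed the 20× threshold that this study treats as sufficient.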

Both GBS runs performed very differently during sequencing and generated different numbers of reads. Comparing an average of about 710,000 reads, of which 50 % mapped (≈356,000 mapped reads), for the ApeKI digest with an average of 2.5 million reads, of which 67 % mapped (≈1.7 million mapped reads), for the double digest is problematic. Both GBS runs were therefore normalized to 1 million reads per sample on average to eliminate sequencing effects. After normalization and depending on coverage (3×, 5× and 8×), the GBS library generated by the double digest using the enzyme combination MseI/TaqαI provided 3.8–12.5 times more SNPs than the single digest using ApeKI for library preparation (Fig. 3). Hence, this study introduces an optimized GBS protocol for dry edible beans, which takes several measures into account, such as in silico digestion of the reference genome, the use of Y-adapters, and size selection, to provide dense SNP coverage that is useful for QTL mapping and GWAS.
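The 3.8–12.5-fold range can be checked directly from the normalized SNP counts reported in the Results. The short sketch below does only that division; the variable and function names are illustrative.

```python
# Normalized SNP counts per minimum read depth: (ApeKI, MseI/TaqI).
NORMALIZED_SNPS = {
    3: (18981, 71827),
    5: (11424, 57424),
    8: (2624, 32894),
}


def fold_increase(apeki_snps, double_digest_snps):
    """How many times more SNPs the double digest yields."""
    return double_digest_snps / apeki_snps


folds = {depth: round(fold_increase(*counts), 1)
         for depth, counts in NORMALIZED_SNPS.items()}
print(folds)
```

The advantage of the double digest grows with the read-depth threshold, because the ApeKI library loses SNPs faster as the threshold rises.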