1 Introduction

It was a significant achievement when the first plant genome sequence, that of Arabidopsis thaliana, was published in 2000 [1], heralding the application of genomics tools to plant research. The choice of this first species, with one of the smallest plant genomes and limited dispersed repetitive DNA, was partly driven by the cost and efficiency of the sequencing technologies then available. Today, transformative advances in sequencing platforms and chemistries, which have led to dramatic reductions in cost per base, have played a major role in deciphering multiple complex genomes. To date, as many as 55 plant genomes have been sequenced and made publicly available [2] (http://www.phytozome.net/). Combined with such reference genome sequences, next generation sequencing (NGS) has allowed a multitude of new approaches to be applied to the identification, analysis, and visualization of fundamental genetic variation. Identifying and utilizing natural and induced genetic variation remains a prime objective in plant research, with important implications for population genetics, evolution, and crop breeding. The most abundant, and perhaps most informative, variants that can be exploited are single nucleotide polymorphisms (SNPs), which have proven to be ideal markers for the study of plant genomes [3].

A number of approaches have been described to capture genome-wide natural and induced genetic variation by NGS. The majority of these approaches rely on reduced representation, which delimits the portion of a large and complex genome to be assessed to a manageable size. Initially proposed by Altshuler et al. [4], reduced representation allowed a high density SNP map to be generated for a genome previously thought to be too large for such analyses. However, it has been the combination of reduced representation, NGS, and multiplex indexing of samples that has provided the ability to study extremely large genomes at reasonable cost. The relative simplicity and cost-effectiveness of the genotyping-by-sequencing (GBS) approach has encouraged its application in multiple species, including both model and non-model plants [5–8]. The increased marker density on offer has also led to its growing use in the anchoring of genome sequence assemblies, effectively removing the need to generate expensive and error-prone physical maps [9–11]. The main current limitation is the bioinformatic and computational burden that is generated, with regard to both data processing and storage.

GBS now takes many forms. The first GBS data were generated using restriction site associated DNA sequencing (RAD-seq) [12], which utilized a single restriction enzyme combined with shearing of the digested DNA to capture a suitable portion of the genome. By optimizing enzyme choice and eliminating the need for DNA shearing, the Cornell group simplified the approach and allowed more extensive multiplexing, which reduced costs further [13]. There have been several modifications to the basic protocols, predominantly incorporating two-enzyme digestion, including 2b-RAD [14], ddRAD-seq [15], and a variant of the Cornell GBS approach by Poland et al. [6] that utilizes methylation-sensitive enzymes to further reduce the representation of the target genome. Several reviews have described the different approaches to GBS in plants [16–19].

The common feature of all these approaches is the type and volume of data produced: all have exploited the Illumina sequencing platforms, generating millions of sequence reads, usually of 100 bp or less, for each indexed sample. Thus the bioinformatics pipeline described in the following chapter is applicable to any of the published protocols, in either single-end or paired-end read format. All the methods can be used in the absence of a reference genome; however, the use of a reference genome is generally far more effective in ensuring the robust identification of genome-wide SNPs. The following chapter will focus on the analysis of GBS data where a complete or draft genome is available, although tools developed to analyze GBS data in the absence of a reference genome are also listed (see Publicly Available Software and Tools for GBS).

2 Materials

In this chapter, we discuss a bioinformatics pipeline (Fig. 1) designed to identify genetic variants such as SNPs and insertions/deletions (InDels) from NGS data generated by most major RAD and GBS approaches. The pipeline uses a suite of publicly available software tools and custom Perl scripts. Alternative pipelines that have been developed are listed in Publicly Available Software and Tools for GBS.

Fig. 1

Bioinformatics workflow for genetic variant discovery using next generation sequencing-based genotyping approaches such as RAD-seq and GBS. The genetic variant calling pipeline comprises three major steps: raw data processing, read mapping to a reference genome, and variant discovery. Each of these steps is further divided into multiple sub-steps. The bioinformatics tools (shown in purple), input and output file formats (green), and the purpose, methodology, or general outcome of each sub-step (bullet points) in the workflow are presented

2.1 Publicly Available Software and Tools for GBS

  1.

    Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic) is a multithreaded command line tool that can be used for trimming adapter sequences and low quality regions from Illumina sequencing reads [20].

  2.

    Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) is an ultrafast short read alignment tool that can be used for aligning sequencing reads against a reference genome [21]. It should be noted that other alignment tools are available for this application, most commonly BWA [22].

  3.

    SAMtools (http://samtools.sourceforge.net/) is a package of utilities designed for manipulating alignments in the SAM (Sequence Alignment/Map) or BAM (Binary Alignment/Map) format, including sorting, merging, indexing, and generating alignments in a per-position format [23].

  4.

    BCFtools (http://samtools.github.io/bcftools/) is a set of utilities that manipulate variant calls in the Variant Call Format (VCF ) and its binary counterpart (BCF).

  5.

    GATK (Genome Analysis Toolkit) (http://www.broadinstitute.org/gatk/) provides a wide variety of tools for variant discovery and genotyping [24–26].

  6.

    STACKS (http://creskolab.uoregon.edu/stacks/) allows de novo assembly of short read GBS data and the identification of genetic variation in the absence of a reference genome [27].

  7.

    TASSEL-GBS (http://www.maizegenetics.net/) is an implementation of a GBS analysis pipeline in the TASSEL software package [28].

2.2 In House Tools

A set of utility Perl scripts (listed in Table 1) was written to perform various tasks associated with data processing, read alignment, and SNP discovery. These scripts are open source and freely available upon request.

Table 1 List of utility Perl scripts designed to perform various tasks associated with genetic variant discovery using RAD-Seq and GBS data sets

3 Methods

The basic workflow for variant discovery using NGS data generated by RAD-seq and GBS approaches can be divided into three sequential steps: (1) raw data processing, (2) read alignment to a reference genome or de novo assembly of the sequence tags, and (3) variant discovery and annotation. In general, these three steps are shared by most of the currently available genotyping pipelines. In the following subsections, each of these steps is reviewed to provide background for the available bioinformatics tools that are customized to perform the various tasks associated with them.

3.1 Raw Data Processing

RAD-seq and GBS employ a highly multiplexed sequencing strategy for constructing reduced representation libraries for the Illumina NGS platform (see Note 1). Demultiplexing, which separates reads into their corresponding samples based on barcode matching, is the first key step in processing the raw sequencing data. Demultiplexing of Illumina reads is generally carried out using Illumina CASAVA or MiSeq Reporter software; however, CASAVA cannot demultiplex RAD-seq and GBS reads, which carry customized inline barcodes in only one of the adapter sequences. We have developed a Perl script, util_barcode_splitter.pl (Table 1), to demultiplex RAD-seq and GBS reads; a simplified sketch of the underlying logic is shown below.
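
In essence, the demultiplexer only needs to match each read's 5' inline barcode against the known barcode list, strip the barcode, and write the read to a per-sample file. The following minimal Perl sketch illustrates this logic; it is not util_barcode_splitter.pl itself, the barcodes and file names are placeholders, and a production script would also tolerate barcode mismatches and handle the residual enzyme overhang.

```perl
#!/usr/bin/env perl
# Minimal single-end FASTQ demultiplexer: bins reads by exact 5' inline barcode.
# Barcodes and file names are illustrative only.
use strict;
use warnings;

my %barcodes = (ACGT => 'sample1', TGCA => 'sample2');  # barcode => sample name

# Open one output file handle per sample, plus one for unassigned reads.
my %out;
for my $sample (values %barcodes, 'unassigned') {
    open $out{$sample}, '>', "$sample.fastq" or die "Cannot write $sample.fastq: $!";
}

open my $in, '<', 'lane1_R1.fastq' or die "Cannot read lane1_R1.fastq: $!";
while (my $header = <$in>) {
    my $seq  = <$in>;
    my $plus = <$in>;
    my $qual = <$in>;
    my $sample = 'unassigned';
    for my $bc (keys %barcodes) {
        if (substr($seq, 0, length $bc) eq $bc) {
            $sample = $barcodes{$bc};
            # Remove the barcode from both the sequence and quality strings.
            substr($seq,  0, length $bc) = '';
            substr($qual, 0, length $bc) = '';
            last;
        }
    }
    print { $out{$sample} } $header, $seq, $plus, $qual;
}
close $_ for $in, values %out;
```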

Raw sequencing data often contain various types of errors and artifacts, such as base calling errors, low quality bases, adapter contamination, and duplicate reads [29]. It is therefore necessary to perform quality assessment and correction by filtering or trimming low quality reads or regions. Numerous publicly available software packages can be used for pre-processing of sequencing reads, such as Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic), PRINSEQ (http://prinseq.sourceforge.net/), FastqMcf (http://code.google.com/p/ea-utils/wiki/FastqMcf), FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/), and cutadapt (http://code.google.com/p/cutadapt/). In our pipeline (Fig. 1), we have adopted Trimmomatic, a fast, multithreaded command line tool that can be used to (1) remove adapter sequences, (2) trim leading and trailing low quality regions (below a user defined quality threshold), (3) scan each read with a sliding window of user defined size and cut when the average quality per base drops below a threshold, and (4) keep only those read pairs where both reads are longer than a specified minimum length. Trimmomatic is also designed to handle "read-through" in paired-end data. Read-through occurs when a fragment shorter than the read length is sequenced, resulting in overlapping read pairs that include both the target fragment and adapter sequence. It is essential to remove one of the reads in this case to avoid overstating read depth during variant calling.
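
For orientation, a Trimmomatic paired-end run covering the four operations above could be launched from a Perl wrapper as follows; the jar path, adapter file, and threshold values are illustrative and should be tuned to the data set at hand.

```perl
#!/usr/bin/env perl
# Illustrative Trimmomatic paired-end invocation; jar path, adapter file,
# and thresholds are placeholders.
use strict;
use warnings;

my $cmd = join ' ',
    'java -jar trimmomatic.jar PE -threads 8 -phred33',
    'sample1_R1.fastq sample1_R2.fastq',              # input read pair
    'sample1_R1.paired.fq sample1_R1.unpaired.fq',    # read 1 outputs
    'sample1_R2.paired.fq sample1_R2.unpaired.fq',    # read 2 outputs
    'ILLUMINACLIP:adapters.fa:2:30:10',  # (1) adapter and read-through removal
    'LEADING:3 TRAILING:3',              # (2) trim low quality leading/trailing bases
    'SLIDINGWINDOW:4:15',                # (3) cut when a 4 bp window averages Q<15
    'MINLEN:36';                         # (4) keep read pairs of at least 36 bp
system($cmd) == 0 or die "Trimmomatic failed: $?";
```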

Amplification by polymerase chain reaction (PCR) is often used for target enrichment during the preparation of libraries for next-generation sequencing. PCR duplicates, which result from the same original DNA template being sequenced many times, can have a detrimental effect on the quality of variant calls, especially when coverage is low (see Note 2). Computational methods for the detection and removal of PCR duplicates are available that generally rely on identifying reads with identical alignment positions on the reference genome. Because read mapping is a computationally intensive process (see Note 3), an alternative method that detects PCR duplicates by direct comparison of read sequences is valuable, especially when the proportion of PCR duplicates is very high. To this end, we have developed a Perl script, util_find_uniq_reads.pl (Table 1), that compares read sequences and removes duplicate reads.
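
The core of a sequence-based duplicate filter is a hash keyed on the read sequence, as in the minimal sketch below. This is a simplified stand-in for util_find_uniq_reads.pl, not the script itself; file names are placeholders, and a paired-end version would key the hash on the concatenated sequences of both mates.

```perl
#!/usr/bin/env perl
# Remove exact duplicate reads from a single-end FASTQ file by hashing read
# sequences; only the first occurrence of each sequence is kept. Note the
# hash holds every distinct sequence, so memory grows with library complexity.
use strict;
use warnings;

open my $in,  '<', 'sample1.fastq'      or die $!;
open my $out, '>', 'sample1.uniq.fastq' or die $!;

my %seen;
while (my $header = <$in>) {
    my $seq  = <$in>;
    my $plus = <$in>;
    my $qual = <$in>;
    next if $seen{$seq}++;   # skip reads whose sequence was seen before
    print $out $header, $seq, $plus, $qual;
}
close $in;
close $out;
```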

3.2 Read Alignment to a Reference Genome

After read cleanup, alignment of the short reads to a reference genome is the first step in a high-throughput genotyping workflow. In the absence of a reference genome, paired-end sequencing data generated by RAD-seq or GBS approaches can be assembled de novo using software packages such as STACKS [27], UNEAK [30], or RApiD [31] to produce mini-contigs that can serve as a reference for read mapping and genotyping (see Note 4). In the last few years, a myriad of efficient short-read alignment programs have been developed, such as MAQ [32], mrsFast [33], STAMPY [34], Bowtie2 [35], BWA [22], and SOAP2 [36]. Most of these widely used aligners rely either on hashing algorithms (MAQ, mrsFast, STAMPY) or on Burrows–Wheeler transform (BWT) [37] based indexing (Bowtie2, BWA, and SOAP2) for short read mapping. The hash-based aligners use hash tables to store information about either the reference genome or the short reads; their major drawback is that they require a prohibitive amount of memory (see Note 3). The second generation BWT-based aligners are preferred as they consume only a limited amount of memory [38, 39].

In our genotyping workflow (Fig. 1), we have adopted Bowtie2, which is faster, more sensitive, and more accurate than BWA and SOAP2 across a wide range of parameter settings [35]. Bowtie2 supports both local and global (end-to-end) alignment of short reads [35]. A local alignment considers only a segment of the read and clips unaligned characters from one or both ends to maximize the alignment score; global alignment, by contrast, involves all characters in the read. In our experience, the local alignment mode is faster and useful for mapping reads generated by GBS, although it is less accurate (due to increased multi-mapping) than global alignment. GBS does not involve size fractionation of the sequencing library and hence sometimes yields fragments that are either too short to be useful or produce paired-end reads that overlap completely. The RAD-seq protocol, on the other hand, includes a size fractionation step, and most reads generated by this approach do not overlap and can be aligned in an end-to-end manner. An example of the variation in the distribution of predicted enzyme sites for both RAD-seq (EcoRI) and GBS (PstI and MspI), together with a representation of the relative genome coverage of each method, has been demonstrated for the Brassica oleracea genome [11]. RAD captured a greater portion of the genome, with a high percentage of the potential sites being tagged and sequenced, while GBS coverage was affected by the degree of cytosine methylation.
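
For reference, the alignment step might be run as follows; the index prefix, file names, and thread count are placeholders, and substituting --end-to-end for --local selects the global mode discussed above.

```perl
#!/usr/bin/env perl
# Illustrative Bowtie2 paired-end alignment in local mode; swap --local for
# --end-to-end to run the global mode. The index prefix (ref_index) and all
# file names are placeholders.
use strict;
use warnings;

# Build the FM index once per reference genome.
system('bowtie2-build reference.fa ref_index') == 0 or die "indexing failed: $?";

my $cmd = join ' ',
    'bowtie2 --local -p 8 -x ref_index',
    '-1 sample1_R1.paired.fq -2 sample1_R2.paired.fq',
    '-S sample1.sam';
system($cmd) == 0 or die "bowtie2 failed: $?";
```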

Multi-mapped reads are those that align to multiple locations within the reference genome sequence [40]. Most eukaryotic organisms, especially plants with polyploid genomes, carry orthologous and paralogous gene families that contain multiple copies of the same gene with nearly identical or similar sequences. Shorter reads, being less specific, tend to produce more multi-mapping events. In polyploid plant species, the proportion of multi-mapped reads ranges from 20 to 60%; discarding such a high proportion of reads would result in a significant loss of valuable information. Bowtie2 searches for and reports all valid alignments that score better than a given cutoff. We use the Perl utility scripts bowtie2_extract_best_global_hit.pl and bowtie2_extract_best_local_hit.pl to scan the SAM files and retain, for each multi-mapped read, the best hit, provided its alignment score is at least X = 6 (end-to-end) or X = 12 (local) penalty points better than the runner-up. The larger the value of X, the more confident we can be that a read is uniquely mapped, but more alignments are discarded as a consequence.
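
Bowtie2 records each alignment's score in the AS:i SAM tag and, when a read aligns more than once, the runner-up score in the XS:i tag, so the filter reduces to a comparison of these two tags. The sketch below captures the idea; it is a simplified stand-in for the bowtie2_extract_best_*_hit.pl scripts, and the file names are placeholders.

```perl
#!/usr/bin/env perl
# Best-hit filter: keep an alignment only if its score (AS tag) beats the
# second-best score (XS tag) by at least $X. Reads with no XS tag aligned
# uniquely and are always kept.
use strict;
use warnings;

my $X = 6;    # use 6 for end-to-end alignments, 12 for local alignments

open my $in,  '<', 'sample1.sam'         or die $!;
open my $out, '>', 'sample1.besthit.sam' or die $!;

while (<$in>) {
    if (/^@/) { print $out $_; next; }      # pass SAM header lines through
    my ($as) = /\bAS:i:(-?\d+)/;
    my ($xs) = /\bXS:i:(-?\d+)/;
    next unless defined $as;                # unaligned read: no AS tag
    next if defined $xs && $as - $xs < $X;  # runner-up too close: discard
    print $out $_;
}
close $in;
close $out;
```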

Bowtie2 outputs alignments in SAM format, which stores alignment data as human readable tab-delimited text. SAM files tend to be very large; BAM, a compressed binary version of the SAM format, is preferred for downstream variant detection analyses due to its smaller size. We use the "view" command of SAMtools to convert mapped reads from SAM to BAM format. For downstream analysis, the alignments in BAM files must also be sorted and indexed by chromosomal position, which we achieve with the sort and index utilities of SAMtools.
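
A typical sequence of SAMtools commands for this conversion, sorting, and indexing is shown below; SAMtools 1.x syntax is assumed and the file names are placeholders.

```perl
#!/usr/bin/env perl
# Convert, coordinate-sort, and index the alignments (SAMtools 1.x syntax).
use strict;
use warnings;

my @steps = (
    'samtools view -b -o sample1.bam sample1.sam',      # SAM -> BAM
    'samtools sort -o sample1.sorted.bam sample1.bam',  # sort by position
    'samtools index sample1.sorted.bam',                # create the .bai index
);
system($_) == 0 or die "failed: $_ ($?)" for @steps;
```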

3.3 Variant Discovery

The next step after mapping reads to a reference genome is to call sequence variants (SNPs and InDels) from the processed BAM file. Multiple variant-calling software tools are available, including SAMtools:mpileup/BCFtools [23], GATK [24–26], SOAP [41], SNVer [42], and GNUMAP [43]. A recent study systematically evaluated these commonly used variant-calling pipelines and found very poor concordance between the variants called by each method [44]. Each SNP calling method is designed around a different set of assumptions about the reference genome and reads, and their suitability in different situations depends upon various factors, including the nature of the genotypes, the presence or absence of multi-allelic SNPs, and the required sensitivity and specificity of SNP detection. In our variant-calling workflow, we have implemented two of the most commonly used SNP callers, SAMtools:mpileup/BCFtools [23] and GATK [24, 25]. Both of these pipelines also call InDels.

SAMtools:mpileup computes the likelihood of each possible genotype using the MAQ (Mapping and Assembly with Quality) model framework, a general Bayesian framework for picking the base that maximizes the posterior probability with the highest Phred quality score, and outputs this information in BCF (binary variant call format); it does not itself call the variants. BCFtools performs the actual calling and allele frequency estimation by applying the genotype likelihood information in the BCF files, generating output in VCF (variant call format), the emerging standard for storing variant data. Identification of InDels from paired-end reads is more challenging than that of SNPs, as incorrect placement of insertions or deletions during read alignment to a reference genome may lead to false positive SNPs. SAMtools:mpileup deploys a concept called Base Alignment Quality (BAQ; [45]), invoked by default, to provide an efficient and effective way of ruling out false positive SNPs caused by alignment artifacts: with the BAQ strategy, the probability of a base being misaligned can be accurately measured. Although the combination of SAMtools:mpileup and BCFtools offers a straightforward way of calling SNPs and InDels, the approach is limited to diploid calling, as SAMtools:mpileup is designed to compute and handle only biallelic variants [45]. We have successfully used SAMtools:mpileup for variant calling and genetic linkage mapping of populations produced from biparental crosses (Bollina et al., in preparation; [10, 11]).
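
An illustrative invocation of this two-stage pipeline is given below. SAMtools/BCFtools 1.x syntax is assumed (older releases used "bcftools view -vcg" in place of "bcftools call"), and the file names are placeholders; multiple BAM files can be listed for multi-sample calling.

```perl
#!/usr/bin/env perl
# Illustrative SAMtools/BCFtools variant calling (1.x command syntax assumed).
use strict;
use warnings;

my $cmd = 'samtools mpileup -u -f reference.fa sample1.sorted.bam'  # genotype likelihoods as BCF
        . ' | bcftools call -m -v -o variants.vcf';  # call SNPs/InDels, emit variant sites only
system($cmd) == 0 or die "variant calling failed: $?";
```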

GATK is similar to SAMtools but adds processing steps such as local realignment around InDel loci to clean up alignment artifacts, marking of non-informative duplicate reads, and recalibration of both base and variant quality to improve the overall accuracy of variant calling [24–26, 44]. GATK includes two variant calling tools, UnifiedGenotyper and HaplotypeCaller. The UnifiedGenotyper uses a Bayesian genotype likelihood model to estimate the posterior probability of the allele frequency at each locus; it can additionally utilize information from multiple samples and supports SNP calling from non-diploid samples. The HaplotypeCaller, which combines a local de novo assembler with a more advanced hidden Markov model (HMM) likelihood function, outperforms the UnifiedGenotyper in discovering sequence variants; however, it currently supports only diploid calling and lacks multithreading support.
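
A representative GATK 3.x UnifiedGenotyper invocation is sketched below; the jar path and file names are placeholders, and GATK additionally expects a FASTA index (.fai), a sequence dictionary (.dict), and read groups plus an index for each input BAM.

```perl
#!/usr/bin/env perl
# Illustrative GATK 3.x UnifiedGenotyper run (jar path and file names are
# placeholders; prerequisites noted in the text above must be in place).
use strict;
use warnings;

my $cmd = join ' ',
    'java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper',
    '-R reference.fa -I sample1.sorted.bam',
    '-nt 8',                  # UnifiedGenotyper supports multithreading
    '-o variants.gatk.vcf';   # use -T HaplotypeCaller for the HMM-based caller
system($cmd) == 0 or die "GATK failed: $?";
```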

Filtering raw SNP candidates is an essential step in the genotyping workflow, as it helps reduce false positive calls arising from biases in the sequencing data and removes calls that do not fulfill specific thresholds for SNP and genotype properties. Filtering of false positive calls based on read depth and quality thresholds is embedded within some of the currently available variant calling pipelines, such as SAMtools and GATK. We perform additional filtering based on missing genotype calls and minor allele frequency (MAF). The level of missing data depends on sequencing coverage, which is influenced by the multiplexing level and the output of the sequencing platform [18, 46]. Missing data can be reduced by sequencing at higher depth and reducing the multiplexing level. An alternative is to impute missing values with plausible substitutes (see Note 5); in recent years, algorithms [47–49] have been developed that impute missing genotype data with great accuracy. MAF refers to the frequency at which the least common allele occurs in a given population [50]. We use the Perl utility script filter_vcf.pl (Table 1) to filter on missing genotypes and MAF, generally ignoring SNPs with a MAF below 5%; a simplified sketch of this filter follows this paragraph. The final output of most variant calling pipelines is in VCF format, which can be viewed using genomic viewers such as Tablet [51] or IGV [52] (Fig. 2). We have also developed Perl scripts to export genotype scores in tab delimited formats for ease of downstream processing and analysis. The last step of our genotyping workflow involves merging SNPs with identical segregation patterns. The cartoon in Fig. 3 depicts the logic of our approach for creating haplotype blocks by merging closely linked SNP markers with identical segregation patterns, providing a recombination bin framework that can be easily incorporated into genetic mapping analysis.
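
The sketch below illustrates the missing-data and MAF filter described above. It is a simplified stand-in for filter_vcf.pl, assumes a multi-sample, biallelic, diploid VCF, and the thresholds and file name are placeholders.

```perl
#!/usr/bin/env perl
# Filter VCF sites on the fraction of missing genotype calls and on MAF.
use strict;
use warnings;

my $max_missing = 0.20;   # drop sites with more than 20% missing calls
my $min_maf     = 0.05;   # drop sites with minor allele frequency below 5%

open my $in, '<', 'variants.vcf' or die "Cannot read variants.vcf: $!";
while (my $line = <$in>) {
    chomp $line;
    if ($line =~ /^#/) { print "$line\n"; next; }   # keep header lines
    my @col     = split /\t/, $line;
    my @samples = @col[9 .. $#col];                 # per-sample genotype columns
    my ($missing, $alt, $total) = (0, 0, 0);
    for my $field (@samples) {
        my ($gt) = split /:/, $field;               # GT is the first subfield
        if ($gt =~ /\./) { $missing++; next; }      # uncalled genotype, e.g. ./.
        my @alleles = split m{[/|]}, $gt;           # diploid calls, e.g. 0/1
        $total += @alleles;
        $alt   += grep { $_ ne '0' } @alleles;
    }
    next if $missing / @samples > $max_missing;
    next unless $total;                             # every call was missing
    my $maf = $alt / $total;
    $maf = 1 - $maf if $maf > 0.5;                  # fold to the minor allele
    print "$line\n" if $maf >= $min_maf;            # site passes both filters
}
close $in;
```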

Fig. 2

Genome viewer (IGV; Thorvaldsdottir et al. [52]) images illustrating alignment to the reference genome of short paired-end reads generated by RAD-seq (a) and GBS (b) approaches. The top two/three tracks represent the reference contig and the positions of the restriction site(s): EcoRI (RAD-seq) or PstI and MspI (GBS). The following tracks show reads from each individual library aligned back to the reference using Bowtie2. Read bases that match the reference are displayed in gray and those that do not match (sequence variants) are shown in yellow

Fig. 3

Overview of the approach used for generating haplotypes by merging SNPs with identical segregation patterns. In the example shown in this cartoon, five RAD SNPs (at positions 100, 200, 300, 400, and 500 bp) were identified on scaffold1234. SNP#1 and SNP#2 have identical segregation patterns, except for the missing data points, as do SNP#3 to SNP#5. Instead of using all five SNPs for genetic mapping, we combine SNPs with identical scores. The locus name of each merged RAD SNP (haplotype) provides additional information: the first part of the name is the scaffold name, the next number indicates the order of the SNP pattern identified on the scaffold, the next two numbers indicate the base pair positions between which the haplotype pattern was found, and the final number indicates the count of independent SNPs sharing this pattern
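
The merging logic of Fig. 3 lends itself to a compact implementation. The sketch below is a simplified, hypothetical version: it reads tab-delimited lines of scaffold, position, and a per-individual call string (sorted by scaffold and position, with '-' denoting a missing call) and emits one merged locus per run of compatible patterns; for brevity the block index is global rather than per scaffold.

```perl
#!/usr/bin/env perl
# Merge consecutive SNPs whose segregation patterns agree at every individual,
# treating missing calls ('-') as compatible with any call.
use strict;
use warnings;

my @block;      # [scaffold, position] of each SNP in the current block
my $pattern;    # consensus segregation pattern of the current block
my $n = 0;      # running block index used in the locus name

sub flush_block {
    return unless @block;
    my ($scaf, $start) = @{ $block[0] };
    my $end = $block[-1][1];
    # Locus name: scaffold, block order, start/end positions, SNP count.
    printf "%s_%d_%d_%d_%d\t%s\n", $scaf, ++$n, $start, $end, scalar @block, $pattern;
    @block = ();
}

while (<>) {
    chomp;
    my ($scaf, $pos, $calls) = split /\t/;
    my $compatible = @block && $block[0][0] eq $scaf;
    if ($compatible) {
        for my $i (0 .. length($calls) - 1) {
            my ($p, $q) = (substr($pattern, $i, 1), substr($calls, $i, 1));
            next if $p eq $q || $p eq '-' || $q eq '-';
            $compatible = 0;
            last;
        }
    }
    flush_block() unless $compatible;
    $pattern = $calls unless @block;   # start a new block
    # Fill missing calls in the block pattern from the incoming SNP.
    for my $i (0 .. length($calls) - 1) {
        substr($pattern, $i, 1) = substr($calls, $i, 1)
            if substr($pattern, $i, 1) eq '-';
    }
    push @block, [$scaf, $pos];
}
flush_block();
```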

3.4 Conclusion

The advent of very high throughput NGS platforms, together with new technical methodologies that take advantage of these gains, has provided an opportunity to establish high resolution genetic analysis in any species. The ability to profile large numbers of targeted loci for sequence variation in highly multiplexed sets of discrete individuals provides a platform for a range of applications. An initial limitation to the full deployment of these approaches has been the dearth of readily available bioinformatics tools to process the raw data into output that can be readily incorporated into classical genetic analyses. This chapter has outlined some of the recently available bioinformatics resources that enable researchers to establish GBS applications for genetic analysis in their laboratories, provided an example pipeline that can be used for this purpose, and described the key factors that need to be considered in experimental design.

4 Notes

  1.

    Assessing sequencing data requirements: In many instances, both RAD and GBS have been attempted with a number of restriction enzymes. However, the choice of a particular enzyme and the volume of sequencing data required depend on several factors, such as the genome size, sample multiplexing needs, GC content, frequency of the cut site (frequent to rare), and the desired distribution of sites throughout the genome. In silico analysis of the genome with a candidate enzyme cut site (as sketched below) provides a useful preview prior to selection. The RAD Counter tool provided on the RAD wiki website (https://www.wiki.ed.ac.uk/display/RADSequencing/Home) offers a simple Excel format in which the user inputs the relevant parameters above to establish the optimal experimental design and ensure appropriate read depth is reached.
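
    A first-pass in silico survey can be as simple as counting occurrences of the recognition sequence in the genome FASTA, as in the sketch below; the enzyme and file name are illustrative.

```perl
#!/usr/bin/env perl
# Count occurrences of a restriction recognition sequence (PstI shown; EcoRI
# would be GAATTC) in a genome FASTA to gauge how many fragments a digest
# would yield. Each record is held in memory, so large chromosomes need RAM.
use strict;
use warnings;

my $site = 'CTGCAG';    # PstI recognition sequence
my ($count, $length, $seq) = (0, 0, '');

open my $in, '<', 'reference.fa' or die $!;
while (my $line = <$in>) {
    chomp $line;
    if ($line =~ /^>/) {
        $count  += () = $seq =~ /$site/g;   # count sites in previous record
        $length += length $seq;
        $seq = '';
    } else {
        $seq .= uc $line;
    }
}
$count  += () = $seq =~ /$site/g;           # last record in the file
$length += length $seq;
close $in;

printf "%d %s sites in %d bp (one per %.0f bp on average)\n",
    $count, $site, $length, $count ? $length / $count : 0;
```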

  2.

    Removal of duplicate reads: advantages and limitations: Duplicate reads arising from PCR amplification during library preparation are perfect copies of a DNA template sequenced multiple times. The proportion of duplicate reads can vary enormously, and duplicates can artificially inflate read coverage, which may have a detrimental effect on the quality of variant calls. Hence the dataset used for variant calling should include only one copy per set of duplicate reads. Duplicate reads can be detected and removed by comparison of either the read sequences or their alignment coordinates. However, the risk of removing identical or almost identical reads that arise from duplicated genomic regions, especially in organisms carrying polyploid genomes, poses a serious challenge. Additionally, it is impossible to differentiate between duplicate reads arising from amplification bias and identical GBS tags originating from the same restriction site(s) at a particular genomic location. This is not an issue for paired-end RAD tags, as the additional DNA fragmentation combined with the size fractionation step in the RAD-sequencing protocol produces paired-end tags with at least one variable end. We therefore advise against the removal of duplicate GBS tags, whereas the decision on removing duplicate RAD tags should depend on the ploidy status or the level of segmental duplication in the organism under consideration.

  3.

    Computational resources: The analysis of GBS and RAD data requires nontrivial computational resources. To reduce analysis time, the use of multiple CPU cores is recommended. Many desktop computers will be limited by the available RAM in the number of samples they can process. Additionally, the output of the analysis steps requires significantly more hard disk space than the raw sequencing data. As an example of the computational requirements, 96 GBS samples were processed using 16 CPU cores for Trimmomatic, Bowtie2, and GATK. The total time required to process the samples was approximately 13 h, with a peak memory usage of 21 GB of RAM. The samples were demultiplexed from 9.7 GB of compressed FASTQ data and produced approximately 68 GB of uncompressed output using a pipeline optimized to reduce the production of intermediary output files.

  4.

    Single-end or paired-end mapping: Variant calling can be performed with either single-end or paired-end data, with paired-end data offering increased coverage. It is also difficult to accurately map single reads originating from regions of high sequence homology, such as repeat rich or duplicated genomic regions; sequencing both ends of each fragment can partly overcome this difficulty. Filtering of paired-end sequencing data on adapter contamination, quality, and length thresholds produces a small proportion of orphaned single-end reads. In such cases, both single-end and paired-end mapping can be performed, followed by merging of the separately generated alignment files before the variant discovery step, as sketched below.
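
    A minimal sketch of this two-pass mapping and merge might look as follows; SAMtools 1.x syntax is assumed and all names are placeholders.

```perl
#!/usr/bin/env perl
# Align surviving paired-end and orphaned single-end reads separately, then
# merge the coordinate-sorted BAM files before variant calling.
use strict;
use warnings;

my @steps = (
    'bowtie2 --local -x ref_index -1 s1_R1.paired.fq -2 s1_R2.paired.fq -S s1.pe.sam',
    'bowtie2 --local -x ref_index -U s1.unpaired.fq -S s1.se.sam',
    'samtools view -b -o s1.pe.bam s1.pe.sam',
    'samtools view -b -o s1.se.bam s1.se.sam',
    'samtools sort -o s1.pe.sorted.bam s1.pe.bam',
    'samtools sort -o s1.se.sorted.bam s1.se.bam',
    'samtools merge s1.merged.bam s1.pe.sorted.bam s1.se.sorted.bam',
    'samtools index s1.merged.bam',
);
system($_) == 0 or die "failed: $_ ($?)" for @steps;
```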

  5.

    Data imputation: One issue with both RAD and GBS is the amount of missing data that can result from sequencing, especially when it is carried out at a low level of coverage/depth. Ideally such an outcome is avoided in the first place by ensuring optimal levels of depth through an appropriate experimental design (see Note 1). However, when high levels of missing data do result, it is possible to adopt the imputation approaches currently available for different experimental designs and population structures [49, 53]. It is also possible to limit the amount of missing data in some types of populations, for example biparental genetic mapping populations, as described in the main text: merging SNP loci based on identical segregation patterns creates haplotype blocks with minimal missing data and a resultant recombination bin framework for genetic mapping analysis.