Introduction

Sequence-driven research in molecular biology began with the pathbreaking work of two groups led by Sanger and Gilbert (Sanger et al. 1977; Maxam and Gilbert 1977). High-throughput sequencing (HTS) techniques, popularly known as next-generation sequencing (NGS), were introduced in 2005 and have revolutionized biomedical research by substantially increasing the scale and resolution of various biological applications. They yield many times more reads than conventional Sanger sequencing at a markedly lower cost per sequenced nucleotide. Next-generation sequencing generates a huge amount of data, necessitating powerful computing resources and efficient algorithms. All commercial platforms share three common phases, namely, preparation of a sequencing library by adding adapters (defined sequences), immobilization of the DNA fragments of the sequencing library on a solid surface, and sequencing (Myllykangas et al. 2011). Several commercial platforms are available, such as 454 pyrosequencing (Roche Applied Science), the Genome Analyzer (Illumina), and SOLiD (Applied Biosystems). NGS can be used for whole genome sequencing, targeted resequencing, identification of transcription factor binding sites, and expression profiling of noncoding RNA. It can also be applied to detect molecular variants such as single nucleotide variants, genomic insertions and deletions, and genomic rearrangements. RNA-seq can be used to determine the expression level of known genes and to discover novel genes. ChIP-seq can be used for screening protein-DNA interactions at a genome-wide scale. The whole genome sequencing and assembly of an organism are performed in various phases (Fig. 1). The major focus of this article is to introduce the reader to three common high-throughput sequencing platforms, with more emphasis on the various computational methods used to analyze next-generation sequencing data obtained from plant genomes.

Fig. 1 Flow chart showing various steps in genome sequencing and assembly using next-generation sequencing technology

Next-Generation Sequencing Platforms

The three most popular sequencing platforms widely used to date are Roche 454 pyrosequencing, Illumina (Solexa), and SOLiD (Applied Biosystems).

454 Pyrosequencing

The 454 platform, introduced by Roche/454 Life Sciences, was the first next-generation technology and is based on pyrosequencing. In pyrosequencing, a complementary strand is synthesized on a single-stranded DNA template by the sequential addition of nucleotides, and each incorporation is detected by the emission of light. The platform achieves a high throughput (~500 Mbp/run) with 400 bp read lengths. Its main drawback is a high error rate in homopolymer regions.

Illumina

The Illumina platform generates a much higher throughput (~1.5 Gbp) with a shorter read length (~150 bp) than 454. Although the reads are short, the platform generates high-quality sequences with an error rate of less than 1 %. The method is based on sequencing by synthesis (reversible terminator chemistry). Because of this chemistry, long homopolymer runs do not affect the quality of the sequence. However, the major challenges associated with this technology are processing a large number of short reads and the difficulty of de novo assembly and resequencing with such reads.

SOLiD

The ABI/SOLiD platform generates the highest throughput achieved to date by any method (~3 Gbp/run) with a read length of 75 bp. This platform requires one of two kinds of library preparation: fragment or mate-paired. Clonal bead populations are prepared in microreactors, and the modified microbeads are deposited onto a glass slide. Sequencing is performed through multiple cycles of ligation, detection, and cleavage.

Genome Assembly Algorithms

Most biological applications of next-generation sequencing, such as quantification of the transcriptome, assembly of new genomes, and identification of protein binding sites, require alignment of sequence reads to a reference sequence as a first step of analysis. The process of aligning and merging short reads into longer sequences is known as assembly. It is like a jigsaw puzzle where each short read is an individual piece and the whole genome sequence is the finished puzzle. Many alignment tools developed in the last 4 years outperform classical aligners in speed and accuracy. Alignment algorithms can be based on hash tables, suffix trees, and merge sorting (Miller et al. 2010). The hash table concept started with the basic local alignment search tool (BLAST), which finds significant local alignments by looking up exact matches to k-mers (seeds) in a hash table. Ma et al. (2002) improved this method by introducing the spaced seed (i.e., a seed with internal mismatches), which turned out to be the most popular approach for alignment of short reads. Eland, SOAP, SeqMap, MAQ, RMAP, ZOOM, and Novoalign are popular programs for aligning short reads to a reference genome using spaced seeds. Although a spaced seed allows mismatches within the seed, it never permits a gap in the seed. Eland, developed by Anthony Cox at Illumina, was the first of these programs and aligns short oligonucleotides against a reference genome. SOAP is an efficient program for gapped and ungapped alignment of short reads onto a reference genome (Li et al. 2008a). SeqMap can map a large number of short reads to a genome based on an index-filtering algorithm (Jiang and Wong 2008). MAQ builds assemblies by mapping short reads to a reference genome using quality scores (Li et al. 2008b). The RMAP software package has tools to map paired-end reads using a more sophisticated quality score (Smith et al. 2009).
ZOOM maps short reads onto a reference genome with improved sensitivity and speed (Lin et al. 2008). Novoalign, commercial software developed by Novocraft Technologies, is an aligner for short reads from the Illumina Genome Analyzer. Another seeding approach, the q-gram filter, builds an index that allows a gap within the seed. Two programs, SHRiMP and RazerS, are based on the q-gram filter algorithm. SHRiMP is highly efficient in mapping short reads to a reference genome with high polymorphism (Rumble et al. 2009). RazerS is a popular read mapper with improved performance for long reads with large numbers of indels (Weese et al. 2009).
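
The seed-based idea described above can be illustrated with a small sketch. It is not the implementation of any of the programs named here; the seed mask, the toy reference, the read, and all function names are assumptions chosen for illustration. The reference is indexed in a hash table keyed by a spaced seed, and each candidate position is then verified by counting mismatches without gaps.

```python
from collections import defaultdict


def seed_key(window, mask):
    """Keep only the bases at positions where the mask has a '1'."""
    return "".join(base for base, m in zip(window, mask) if m == "1")


def build_index(reference, mask):
    """Hash table: spaced-seed key -> list of reference start positions."""
    index = defaultdict(list)
    span = len(mask)
    for pos in range(len(reference) - span + 1):
        index[seed_key(reference[pos:pos + span], mask)].append(pos)
    return index


def map_read(read, reference, index, mask, max_mismatches=2):
    """Seed with the first window of the read, then verify each candidate.

    Returns (position, mismatches) pairs for ungapped hits.
    """
    hits = []
    key = seed_key(read[:len(mask)], mask)
    for pos in index.get(key, []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # the read would run past the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


# Hypothetical toy data: the read differs from the reference at one base
# that falls on a "don't care" (0) position of the seed mask.
mask = "1101101"
reference = "ACGTACGTTAGCCGATAGGCTTACGATCG"
read = "TATCCGATAGG"
index = build_index(reference, mask)
hits = map_read(read, reference, index, mask)
print(hits)
```

In this toy case a contiguous 7-base seed would fail, because the first seven bases of the read do not occur exactly in the reference, whereas the spaced seed still finds the candidate position.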

The algorithms based on suffix/prefix tries may represent the trie in the form of a suffix tree (McCreight 1976), an enhanced suffix array (Abouelhoda et al. 2004), or an FM-index (Simpson and Durbin 2010). All of these algorithms first identify exact matches and then build inexact alignments based on the exact matches. The suffix trie is a data structure that stores all the suffixes of a string in order to allow fast string matching. A trie needs a huge amount of space and is impractical even for a small genome. Thus, various data structures such as the suffix tree, suffix array, and FM-index are used to reduce the space. A suffix tree requires 12–17 bytes per nucleotide and is impractical for holding the human genome in memory (Li and Homer 2010). The enhanced suffix array is more space efficient than the suffix tree and takes only 6.25 bytes per nucleotide. The FM-index is the most space-efficient data structure, using 0.2–2 bytes per nucleotide, and an FM-index of the entire human genome needs 2–8 GB of memory. The FM-index is the most widely used of these data structures because of its small memory footprint. Bowtie, BWA, SOAP2, BWT-SW, and BWA-SW are the most popular programs built upon the FM-index. Bowtie is a very fast and memory-efficient aligner for large genomes based on Burrows-Wheeler indexing (Langmead et al. 2009). The Burrows-Wheeler alignment (BWA) tool is another efficient short-read aligner for large genomes that allows mismatches and indels and is based on the Burrows-Wheeler transform (Li and Durbin 2009). SOAP2 is a short oligonucleotide alignment program with reduced memory usage and improved alignment speed (Li et al. 2009a). BWT-SW is an efficient tool to find all local alignments (Lam et al. 2008). Burrows-Wheeler Aligner’s Smith-Waterman (BWA-SW) alignment is an efficient algorithm to align long reads of up to 1 Mb against a large sequence database (Li and Durbin 2010). In contrast, MUMmer (Kurtz et al. 2004) and OASIS (Meek et al. 2003) are based on the suffix tree, whereas Segemehl (Hoffmann et al. 2009) and Vmatch (Abouelhoda et al. 2004) use the enhanced suffix array as their data structure. The yet other aligner of biological sequences (YOABS) is a very efficient long-read alignment program having the advantages of both hash- and trie-based algorithms (Galinsky 2012).
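
How an FM-index finds exact matches can be sketched as follows. This is a minimal, uncompressed illustration and not the code of Bowtie or BWA: the suffix array is built by naive sorting, the occurrence table is stored in full, and all names and the toy text are assumptions. Real aligners sample and compress these tables to reach the memory footprints quoted above.

```python
def build_fm_index(text):
    """Return (suffix array, C table, occurrence table) for text + '$'."""
    text += "$"  # sentinel that sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)

    # c_table[ch]: number of characters in the text that are smaller than ch.
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    # occ[ch][i]: number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, last in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == last else 0)
    return sa, c_table, occ


def find(pattern, sa, c_table, occ):
    """Backward search: narrow a suffix-array interval one base at a time."""
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


# Hypothetical toy text in which the pattern occurs twice.
sa, c_table, occ = build_fm_index("GATTACAGATTACA")
print(find("TTACA", sa, c_table, occ))
```

Each step of the search costs a constant number of table lookups, so the time to locate the interval depends on the pattern length rather than on the size of the indexed text.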

More than 50 short-read alignment software packages are available, although only a few of them are popular among users. Table 1 lists popular alignment software packages for short reads. All programs generate output in the Sequence Alignment/Map (SAM; Li et al. 2009b) or Binary Alignment/Map (BAM; Carver et al. 2010) format, which can be viewed through alignment viewers (Table 2) such as GBrowse (Stein et al. 2002), LookSeq (Manske and Kwiatkowski 2009), Tablet (Milne et al. 2010, 2013), BamView (Carver et al. 2010; Carver et al. 2013), GenomeView (Abeel et al. 2012), IGV (Thorvaldsdóttir et al. 2013), and MGAviewer (Zhu et al. 2013). The SAM format, which holds extensive information about each read, its properties, and its alignment to a reference sequence, can be created and manipulated using SAMtools (Li et al. 2009b). The BAM format is the compressed binary form of the SAM format, and SAMtools can convert between the two.
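
To make the content of a SAM record concrete, the sketch below splits one alignment line into the 11 mandatory tab-separated fields of the SAM format and interprets two bits of the FLAG field. The example record and the helper names are hypothetical, and in practice such files would normally be handled with SAMtools rather than parsed by hand.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities
    tags: tuple  # optional fields, kept as raw strings


def parse_sam_line(line):
    """Parse one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tuple(fields[11:]),
    )


def is_unmapped(record):
    return bool(record.flag & 0x4)


def is_reverse_strand(record):
    return bool(record.flag & 0x10)


# Hypothetical record: a read aligned to the reverse strand of chr1.
line = "read1\t16\tchr1\t100\t37\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNM:i:0"
record = parse_sam_line(line)
print(record.rname, record.pos, is_reverse_strand(record))
```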

Table 1 Popular programs for short-read alignment
Table 2 Alignment viewers of SAM/BAM format

De Novo Assembly of Short Reads

De novo sequence assembly is a method in which individual short reads are merged into long continuous sequences (contigs) that reconstruct the original template. In fact, short reads of 40 nucleotides can be used to assemble the vast majority of protein-encoding genes in most prokaryotic genomes, albeit with many gaps. Table 3 shows common programs for assembling short reads without a reference genome. All algorithms for assembling capillary-based sequence reads of 400–1,000 nucleotides into long contiguous sequences adopt a common approach known as the overlap-layout-consensus (OLC) approach (Li et al. 2012). The OLC algorithm first finds overlaps between sequence reads, then arranges the best-fitting pairs of reads into a layout, and finally derives a consensus sequence from this layout. The overlap step is computationally expensive, and therefore various algorithmic approaches have been adopted to improve its computational efficiency. The OLC approach is adopted by many popular programs such as Arachne (Batzoglou et al. 2002), Celera Assembler (Myers et al. 2000), CAP3 (Huang and Madan 1999), PCAP (Huang et al. 2003), PHRAP (de la Bastide and McCombie 2007), Phusion (Mullikin and Ning 2003), and Newbler (a commercial assembler developed by Roche Diagnostics). Most of the assemblers designed for short-read sequences are based on the De Bruijn graph (DBG; Li et al. 2012) and the Eulerian path approach (Pevzner et al. 2001). Some of the popular software packages based on DBG and Eulerian paths are Euler (Pevzner et al. 2001), Euler-USR (Chaisson et al. 2009), Velvet (Zerbino and Birney 2008), ABySS (Simpson et al. 2009), ALLPATHS-LG (Gnerre et al. 2011), SOAPdenovo (Li et al. 2010), and Gossamer (Conway et al. 2012). Graph-based assembly algorithms create a model in which each string is a node and the relations between strings are represented as edges.
In the De Bruijn graph (DBG) algorithm, reads are chopped into smaller fragments (k-mers), and the k-mers are converted into a DBG from which the genome sequence is finally determined. The optimal solution is obtained by finding a Eulerian path (i.e., a path that traverses every edge of the graph exactly once) through the graph. In contrast, the string graph assembler (SGA) is based on a string graph, which keeps all reads intact and generates a graph based on overlaps between reads (Simpson and Durbin 2012).
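
The De Bruijn graph approach can be sketched on a toy example. The sketch assumes error-free reads, a single strand, and a repeat-free sequence, none of which holds for real data, and it is not the algorithm of any particular assembler listed above; the reads and function names are hypothetical. Each distinct k-mer is treated as an edge between its (k-1)-base prefix and suffix, and the sequence is spelled out along a Eulerian path, that is, a path using every edge exactly once.

```python
from collections import defaultdict


def kmers(read, k):
    """All overlapping k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Each distinct k-mer becomes an edge: (k-1)-mer prefix -> suffix."""
    graph = defaultdict(list)
    distinct = sorted({km for read in reads for km in kmers(read, k)})
    for km in distinct:
        graph[km[:-1]].append(km[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes that a Eulerian path exists."""
    balance = defaultdict(int)  # out-degree minus in-degree
    for node, successors in graph.items():
        balance[node] += len(successors)
        for nxt in successors:
            balance[nxt] -= 1
    # Start at the node with one surplus outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1), next(iter(graph)))
    remaining = {node: list(succ) for node, succ in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence: first node plus the last base of each later node."""
    path = eulerian_path(build_de_bruijn(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


# Hypothetical overlapping reads sampled from one short sequence.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCA"]
contig = assemble(reads, k=4)
print(contig)
```

With a smaller k the same reads can admit more than one Eulerian path, which is one simple way to see why repeats make short-read assembly ambiguous.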

Table 3 Some programs for de novo assembly of short reads

Scaffolding Algorithms

The large assembled regions of sequence, known as contigs, need to be joined together to obtain the whole genome sequence. The final process of joining multiple contigs into a continuous genome sequence (scaffold or supercontig) is known as scaffolding or finishing. It is done in four consecutive steps, namely, contig orientation, contig ordering, contig distancing, and gap closing. Contigs are oriented in the same direction (the 5′-3′ direction in prokaryotes) by using the reverse complementary sequence where necessary. The contigs are then placed in an appropriate order, starting at the origin of replication and extending in the 5′-3′ direction of DNA replication. The distance between contigs can be estimated once the orientation and order are correct. The final step of closing and filling gaps can result in a finished genome. Paired-end reads provide additional information for linking two contigs in a genome. The scaffolding process may be based on a graph in which each contig is treated as a node and contigs joined by matching read pairs are connected by edges. The algorithm finds an optimal path through the graph. Scaffolding may be made more accurate by using additional information such as reference sequences of related organisms, restriction maps, and RNA-seq data. Some of the popular programs for scaffolding (Table 4) are SOAPdenovo (Li et al. 2010), ABySS (Simpson et al. 2009), Bambus (Pop et al. 2004), SOPRA (Dayarian et al. 2010), SSPACE (Boetzer et al. 2011), OPERA (Gao et al. 2011), MIP Scaffolder (Salmela et al. 2011), GRASS (Gritsenko et al. 2012), and RACA (Kim et al. 2013).
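
The graph view of scaffolding can be sketched as follows. This is a deliberately simplified illustration rather than the method of any scaffolder named above: the contig names, the link threshold, and the greedy chain walk are all assumptions, and distance estimation and gap closing are left out. Contigs are nodes, an edge is kept when enough read pairs connect two contigs, and the contigs are ordered by walking the resulting chain.

```python
from collections import Counter, defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Used to flip a contig so that all contigs share one orientation."""
    return seq.translate(COMPLEMENT)[::-1]


def link_graph(pair_hits, min_links=2):
    """Connect two contigs when at least min_links read pairs span them.

    pair_hits holds one (contig_of_read1, contig_of_read2) tuple per pair.
    """
    counts = Counter(frozenset(p) for p in pair_hits if p[0] != p[1])
    graph = defaultdict(set)
    for pair, n in counts.items():
        if n >= min_links:
            a, b = tuple(pair)
            graph[a].add(b)
            graph[b].add(a)
    return graph


def order_contigs(graph):
    """Walk a simple chain of contigs from one end to the other."""
    ends = sorted(c for c, nbrs in graph.items() if len(nbrs) == 1)
    if not ends:
        return []
    order, prev, cur = [ends[0]], None, ends[0]
    while True:
        candidates = [n for n in graph[cur] if n != prev]
        if len(candidates) != 1:
            break  # end of the chain, or an ambiguous branch
        prev, cur = cur, candidates[0]
        order.append(cur)
    return order


# Hypothetical mapping results: the single c1-c3 link is treated as noise.
pair_hits = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c2", "c3"),
             ("c3", "c2"), ("c1", "c3"), ("c1", "c1")]
scaffold_order = order_contigs(link_graph(pair_hits))
print(scaffold_order, reverse_complement("ATGC"))
```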

Table 4 Popular programs for scaffolding

Biological Applications of Next-Generation Sequencing

Genome Sequencing

The worldwide effort to understand the genetic basis of common and rare genetic disorders has gained momentum with the advent of next-generation sequencing technology. NGS will greatly help in the identification of single nucleotide polymorphisms (SNPs) and haplotype mapping (International HapMap Consortium 2003) in individual human genomes and lay a foundation for personalized medicine. The 1000 Genomes Project (http://www.1000genomes.org) became a reality with the availability of NGS technology. Cancer biology is another area where next-generation sequencing can decipher the novel molecular pathways involved in tumorigenesis (Hahn and Weinberg 2002). Next-generation sequencing will also influence the rapidly emerging area of synthetic biology, where a new enzyme or a novel genetic network may be developed (Khalil and Collins 2010).

Functional Genomics

Functional genomics focuses on applying genomics data to understand dynamic life processes. Like microarray technology, RNA-seq is widely used to quantify the expression levels of different genes (Wang et al. 2009). It has several advantages over microarray analysis: no prior sequence information is needed, both highly and lowly expressed genes are detected, and the structure of transcripts, including alternative promoters and alternative splicing sites, can be identified in detail. In RNA-seq technology, the relative abundance of a transcript is estimated by counting the number of sequence reads that map to it. This method accurately estimates relative RNA levels under different experimental conditions or in different cell types.
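
The read-counting idea can be sketched in a few lines. The gene names and read assignments are hypothetical, and scaling to counts per million is one common normalization chosen here as an assumption; it is not prescribed by the text, and it does not correct for transcript length.

```python
from collections import Counter


def counts_per_million(read_assignments):
    """Scale raw per-gene read counts to counts per million mapped reads.

    read_assignments holds one gene name per mapped read.
    """
    counts = Counter(read_assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}


# Hypothetical assignments of ten mapped reads to three genes.
assignments = ["geneA"] * 6 + ["geneB"] * 3 + ["geneC"]
cpm = counts_per_million(assignments)
print(cpm)
```

Because the values are scaled by the total number of mapped reads, samples sequenced to different depths can be compared on the same scale.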

Epigenetics

Epigenetics deals with heritable regulatory changes in chromosomes without any change in the DNA sequence (Bird 2007). Epigenetic changes such as DNA methylation, histone modification, and ncRNA have an important role in maintaining chromosome structure. The posttranslational modifications of histones such as methylation, acetylation, ubiquitination, and phosphorylation generate different “marks” for different functional properties. DNA-binding proteins, histone modifications, or nucleosomes can be mapped across the genome using the ChIP-seq approach, in which chromatin immunoprecipitation (ChIP) is followed by sequencing (Park 2009). DNA methylation involves methylation of the cytosine base in DNA and can be identified by a version of NGS known as Methyl-seq (Brunner et al. 2009). Active gene regulatory elements can be better understood by using another NGS approach known as DNase-seq (Song and Crawford 2010). Noncoding RNA (ncRNA) has been implicated in many epigenetic events such as X-chromosome inactivation and gene silencing (Mercer et al. 2009).

Current Status of Next-Generation Sequencing in Plant Genomics

NGS has been used extensively for whole genome sequencing of plants in the last 5 years (Table 5). Arabidopsis thaliana (125 Mb) was the first plant to be completely sequenced, in 2000, using Sanger sequencing (Arabidopsis Genome Initiative 2000). It was followed by the sequencing of two major rice varieties, namely, japonica (420 Mb; Goff et al. 2002) and indica (466 Mb; Yu et al. 2002), in 2002 and of the first fruit crop, grapevine (487 Mb; Jaillon et al. 2007), in 2007 using the same sequencing method. Subsequently, the draft genomes of papaya (372 Mb; Ming et al. 2008) and the legume Lotus japonicus (315 Mb; Sato et al. 2008) were developed in 2008. The sorghum genome (730 Mb; Paterson et al. 2009) and maize genome (2.3 Gb; Schnable et al. 2009) were sequenced in 2009, and the soybean genome (1.1 Gb; Schmutz et al. 2010) in 2010, on the Sanger platform. However, the pace of sequencing plant genomes increased rapidly with the advent of next-generation sequencing (NGS) technology. The cucumber genome (243.5 Mb; Huang et al. 2009) was sequenced by taking advantage of both Illumina GA technology (high sequencing depth and low unit cost) and Sanger technology (long read and clone length). In 2010, the wild grass Brachypodium distachyon was sequenced using both methods (International Brachypodium Initiative 2010). The cocoa genome (430 Mb; Argout et al. 2011) was sequenced using two NGS platforms, namely, Roche 454 and Illumina, along with Sanger sequencing. The apple genome (604 Mb; Velasco et al. 2010) was sequenced using both Roche 454 technology and Sanger technology. The woodland strawberry (209.8 Mb) was sequenced using three NGS platforms: Roche 454, Illumina Solexa, and Life Technologies SOLiD (Shulaev et al. 2011). In 2011, the potato genome (727 Mbp) was sequenced using two major NGS platforms, Roche 454 and Illumina Genome Analyzer, along with conventional Sanger sequencing technology (Potato Genome Sequencing Consortium 2011).
The cannabis genome (534 Mb) was sequenced using Roche 454 and Illumina Genome Analyzer IIx or HiSeq platforms (van Bakel et al. 2011). The draft genome of pigeon pea (606 Mb) was sequenced with Illumina technology along with Sanger technology (Varshney et al. 2012). A close relative of Arabidopsis, Thellungiella parvula, is endemic to saline habitats, and its genome (140 Mb) was investigated using Roche 454 and Illumina GA2 (Dassanayake et al. 2011). The date palm genome (658 Mb) was sequenced de novo using Illumina GA2 and Sanger sequencing (Al-Dous et al. 2011). A draft consensus sequence of the grape genome (504 Mbp) was developed with 1.7 million SNPs (Velasco et al. 2007). The Brassica rapa genome was sequenced by the Brassica rapa genome sequencing project consortium (Wang et al. 2011). The draft genome of the cotton plant (775 Mb) was sequenced using the Illumina HiSeq 2000 platform (Wang et al. 2012). The genome of melon (375 Mb), a close relative of cucumber, was sequenced using Roche 454 pyrosequencing, Illumina, and Sanger technologies (Garcia-Mas et al. 2012). The tomato genome (760 Mb) was sequenced using Roche 454 GS FLX, Illumina GA2, and SOLiD along with Sanger sequencing (Tomato Genome Consortium 2012). The banana genome (472 Mb) was sequenced with the combined application of Roche 454, Illumina GA2, and Sanger technologies (D’Hont et al. 2012). Recently, both Roche 454 (GS FLX or FLX Titanium) and Illumina (GA2 or HiSeq 2000) have been applied to decipher the genome sequence of barley (4.98 Gbp) (The International Barley Genome Sequencing Consortium 2012). Since bread wheat has a larger genome (17 Gb) than other cereals and is hexaploid in nature, the successful completion of bread wheat genome sequencing using 454 pyrosequencing and of wheat A-genome (4.94 Gb) sequencing on the Illumina HiSeq platform is a significant event in the next-generation sequencing of crops (Brenchley et al. 2012; Ling et al. 2013).
The completion of the wheat genome will not only pave the way for better productivity of the wheat crop but also help decipher the role of polyploidy in plant genome evolution. Recently, whole genome sequencing of sweet orange (Citrus sinensis; 367 Mb) was done on the Illumina GAII sequencer (Qiang et al. 2013). The draft genome sequence of chickpea (Cicer arietinum; 740 Mb) was completed on the 454/Roche GS FLX Titanium platform (Jain et al. 2013). The complete genome of sacred lotus (929 Mb) was sequenced with the combined application of Illumina and 454 technologies (Ming et al. 2013). Other whole genome sequencing projects are underway in many plant species such as amborella (Amborella trichopoda), columbine (Aquilegia sp.), sugar beet (Beta vulgaris), monkey flower (Mimulus guttatus), rose gum tree (Eucalyptus grandis), flax (Linum usitatissimum), cassava (Manihot esculenta), and pear (Pyrus bretschneideri). Some lower plant species were sequenced in order to understand the evolution of vascular plants on land. The genomes of a green alga (Chlamydomonas reinhardtii; 120 Mb; Merchant et al. 2007), a moss (Physcomitrella patens; 480 Mb; Rensing et al. 2008), and a lycophyte (Selaginella moellendorffii; 213 Mb; Banks et al. 2011) were sequenced using conventional Sanger sequencing and have revealed insights into the genomic evolution of land plants.

Table 5 Overview of plant genomes sequenced applying next-generation sequencing

Future Prospects in Next-Generation Sequencing and Assembly

NGS-based technology has wide scope for solving many existing problems in genomics. However, the short read length and intrinsic error rate of NGS are a major problem and a prohibitive factor for de novo assembly of large genomes. Therefore, this technology depends largely on the availability of a reference genome. This problem is expected to ease in the future as read lengths increase. Although NGS provides deep coverage, it has a low throughput in comparison to microarrays. However, this limitation may be alleviated by developing large-scale parallel NGS. With an increase in the number of reference genomes, whole genome resequencing is expected to become more popular for interrogating the diversity of crop genomes. New dedicated algorithms are needed to deal with complex repeats in plant genomes for better assembly quality. Along with assembly algorithms, the quality and quantity of next-generation sequencing data should improve in the near future.

Conclusion

In this work, three common NGS platforms and various computational methods for the analysis of NGS-derived sequence data are discussed. The impact of NGS technology on plant genome sequencing to date, especially on crop genomes, is described. It is expected that NGS technology will grow further in sensitivity and speed and will decipher the genomes of other plants, helping to understand genome evolution and to reveal key genomic features relevant to agricultural productivity.