1 Introduction to Next-Generation Sequence Assembly

The sequence assembly process was developed to work around a limitation of current sequencing technologies: none of them can sequence a whole genome or chromosome in a single read. In first- and next-generation sequencing methods (see Chap. 3), the whole genome is sheared into short random fragments with short overlaps. Each fragment is sequenced independently, and each resulting sequence is called a “read”. The process of repositioning these random reads to reconstruct the whole genome is known as the “sequence assembly process” [1, 2].

Depending on the sample, the type of raw data generated by the sequencing instruments, and the aim of the study, the assembly process takes several forms, including genome, transcriptome, and metagenome sequence assembly. If the sequenced material is genomic DNA, the process is called genome assembly. Likewise, if the sequenced material is mRNA, the process is called transcriptome assembly, whereas assembling reads obtained from environmental samples that contain a mixture of organisms is called metagenome assembly. The ever-increasing number of applications in genomics, transcriptomics, metagenomics, and single-cell sequencing demonstrates the need to acquire sequences from viral, microbial, bacterial, or eukaryotic communities [3]. While the details of the assembly process and the assembly tools employed differ in each case, the sequence assembly process always shares the same stages.

The process of sequence assembly starts with filtering the reads to remove or correct errors and then computing a set of overlaps among them to discover their arrangement. These overlaps are used to connect the reads together into long contiguous structures called “contigs”. Similarly, contigs can also be connected together to form even longer sequence stretches called “scaffolds” [4].
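The overlap step described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular assembler: it assumes error-free reads, exact suffix-prefix matches, and a single strand, and the function names and the `min_overlap` parameter are hypothetical.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `left` that equals a
    prefix of `right`, or 0 if it is shorter than `min_overlap`."""
    max_len = min(len(left), len(right))
    # Try the longest candidate first so the first hit is the best one.
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads into one longer sequence if they overlap."""
    length = overlap_length(left, right, min_overlap)
    if length == 0:
        return None
    # Append only the part of `right` that extends beyond the overlap.
    return left + right[length:]


print(overlap_length("ACGTAC", "TACGGA"))  # 3
print(merge_reads("ACGTAC", "TACGGA"))     # ACGTACGGA
```

Repeating such merges over all overlapping read pairs is, in spirit, how reads grow into contigs; real assemblers tolerate mismatches and use indexed data structures rather than this quadratic comparison.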

Depending on whether reference sequences are available, sequence assembly follows one of two main approaches: comparative sequence assembly or de novo sequence assembly. In comparative sequence assembly (also known as reference-based sequence assembly), reference sequences from the same organism or a closely related species guide the reconstruction process [5]. De novo assembly, on the other hand, does not involve reference sequences and is consequently a more complicated process [1].

2 Sequence Assembly Framework

Sequence assembly is a multiphase process. The phases work together to produce the final assembled sequence. Not only does the organization of these phases differ from one assembly process to another, but some phases are missing entirely from certain assembly processes (Fig. 8.1) [6].

Fig. 8.1
Schematic representation of the five stages of the next-generation sequence assembly process (Note: G″ is a repaired version of graph G with N nodes and E edges)

The first phase, commonly known as the error correction phase, filters the reads by removing or correcting sequencing errors. The filtered reads are then fed into the second phase, which organizes them into a graph in which the relationships between nodes are represented as edges. This representation helps to cope with the limited computational resources available for managing the high-throughput nature of next-generation sequencers. However, the resulting graph may contain erroneous nodes or structures that were overlooked during the first phase. These erroneous structures must therefore be removed or resolved, in the so-called graph simplification phase, before the contigs are constructed. Following the graph simplification phase, the contigs are produced by finding the paths on the graph that connect the reads together. Subsequently, the scaffolding phase involves filtering the contigs, detecting misassembled contigs, and uncovering the relationships between contigs to build scaffolds [6]. Finally, the assembly assessment phase evaluates the assembled contigs/scaffolds against different metrics that reflect the quality, consistency, and accuracy of the algorithm used in the reconstruction process [7, 8].

There are many differing viewpoints on how to design an assembler. Some designers rely on the early correction of errors in order to facilitate the remaining phases of the assembly process (i.e., graph building and simplification) [9–15]. Other designers propose to delay error correction until the graph simplification phase, since both of these phases aim at removing errors. Moreover, merging these two phases would reduce the overall computation time [16–22]. In addition, there are stand-alone error correction tools, scaffolding tools, and assessment tools that perform these phases independently of the other assembly phases. Certain designers rely on these independent tools to complete the missing parts of their assemblers.

2.1 Error Correction Phase

Correcting the errors introduced by sequencing platforms is one of the major challenges in the next-generation environment. These errors range from simple ambiguous bases to substitution and indel errors (see Chap. 4). Detecting these errors early makes the later stages of the assembly process more efficient. The general approach followed by most error correction algorithms is to examine the richness of the reads (i.e., read coverage) produced by next-generation sequencers as a key to distinguishing between correct and incorrect reads. This approach can be disrupted by repeats and non-uniform sampling of genomic sequences, which can lead to ambiguous choices during error correction [23].
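One common way to exploit coverage, shown here only as an assumed illustration, is to count k-mers across all reads and treat rarely seen k-mers as likely sequencing errors. The text above does not prescribe this technique; the function names and the `min_count` threshold are hypothetical choices for the sketch.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_kmers(reads, k, min_count):
    """Return k-mers seen fewer than `min_count` times. With sufficient
    coverage, true genomic k-mers recur across reads, so rare k-mers are
    candidates for sequencing errors (an assumption of this sketch)."""
    counts = count_kmers(reads, k)
    return {kmer for kmer, count in counts.items() if count < min_count}


# Three identical reads plus one read with a single substitution (T -> A).
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
print(sorted(weak_kmers(reads, k=3, min_count=2)))  # ['AAC', 'CGA', 'GAA']
```

The sketch also shows why repeats and uneven sampling cause trouble: a correct k-mer from a poorly covered region falls below the threshold just as an erroneous one does, while an erroneous k-mer that happens to match a repeat copy can appear well supported.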

2.2 Graph Construction Phase

There are diverse paradigms for graph construction, each based on a different graph model. These paradigms must overcome a host of computational challenges related to graph representation and to the path-finding algorithms used for contig building. They generally fall into four main categories: overlap-based construction, k-mer-based construction, greedy-based construction, and hybrid-based construction [24, 25]. The paradigms, their algorithms, and the accompanying challenges are discussed in detail in Chap. 9.
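As a concrete picture of the k-mer-based category, the following minimal sketch builds a de Bruijn-style graph, a common realization of k-mer-based construction. The details of each paradigm are deferred to Chap. 9, so this example is an assumption for illustration only; the function name and the dictionary representation are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn-style graph: each k-mer contributes one edge from
    its (k-1)-length prefix to its (k-1)-length suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Repeated k-mers add repeated edges, preserving multiplicity.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


print(build_de_bruijn(["ACGT"], k=3))  # {'AC': ['CG'], 'CG': ['GT']}
```

A path that follows the edges AC, CG, GT spells out ACGT, which is how path finding on such a graph yields contigs.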

2.3 Graph Simplification Phase

As mentioned previously, some errors are not recognized during the error correction phase and can subsequently complicate the work of the path-finding algorithms that attempt to connect reads and assemble accurate contigs. These errors form diverse structures in the assembly graph, which must be identified and corrected before contig building begins.
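One erroneous structure often discussed for assembly graphs is the "tip", a short dead-end branch. The text above does not name specific structures, so the following is an assumed example: it removes only length-one tips from a graph given as a dictionary mapping each node to a set of successors, and the function name is hypothetical.

```python
def remove_tips(graph):
    """Remove length-one tips: dead-end nodes that hang off a branching
    node which also has at least one successor that continues onward."""
    # Include nodes that appear only as successors, and copy the sets so
    # the caller's graph is left unchanged.
    nodes = set(graph) | {s for succs in graph.values() for s in succs}
    pruned = {node: set(graph.get(node, ())) for node in nodes}

    removed = set()
    for node, succs in pruned.items():
        if len(succs) > 1:
            dead_ends = {s for s in succs if not pruned[s]}
            # Prune only if some branch continues; otherwise it is unclear
            # which dead end is the genuine end of the sequence.
            if len(dead_ends) < len(succs):
                succs -= dead_ends
                removed |= dead_ends

    # Delete pruned nodes that no remaining edge points to.
    still_referenced = {s for succs in pruned.values() for s in succs}
    for node in removed - still_referenced:
        del pruned[node]
    return pruned


graph = {"A": {"B"}, "B": {"C", "X"}, "C": {"D"}}
simplified = remove_tips(graph)
print(sorted(simplified["B"]))  # ['C']
print("X" in simplified)        # False
```

Practical simplification also has to handle longer tips and other structures, and typically weighs coverage before discarding anything.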

2.4 Scaffolding Phase

The process of creating scaffolds is not as simple as the process of creating contigs. The goal of scaffolding is to order and orient the contigs that result from the assembly process. Scaffolding is guided by paired-end reads, which are used to filter contigs, detect misassembled ones, and allow accurate contig extension into repeated regions [6, 26].
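The ordering part of scaffolding can be sketched as follows. This is a deliberately simplified illustration under several assumptions: each link `(a, b)` is taken to mean that a read pair places contig `a` before contig `b`, orientation and gap sizes are ignored, and contigs caught in a cycle of links are left out. The function name and the `min_links` threshold are hypothetical.

```python
from collections import Counter


def order_contigs(links, min_links=2):
    """Chain contigs into scaffolds using paired-end link support.

    `links` is a list of (contig_a, contig_b) tuples, one per read pair
    whose mates map to two different contigs.
    """
    support = Counter(links)
    best_next = {}  # contig -> chosen successor
    taken = set()   # contigs that already have a predecessor

    # Consider the best-supported links first; ties are broken by name.
    for (a, b), n in sorted(support.items(), key=lambda kv: (-kv[1], kv[0])):
        if n < min_links or a == b:
            continue  # too little evidence, or a self-link
        if a in best_next or b in taken:
            continue  # keep at most one successor and one predecessor
        best_next[a] = b
        taken.add(b)

    # Every scaffold starts at a contig that has no predecessor.
    scaffolds = []
    for start in sorted(set(best_next) - taken):
        chain = [start]
        while chain[-1] in best_next:
            chain.append(best_next[chain[-1]])
        scaffolds.append(chain)
    return scaffolds


links = [("c1", "c2")] * 3 + [("c2", "c3")] * 2 + [("c1", "c3")]
print(order_contigs(links))  # [['c1', 'c2', 'c3']]
```

The single read pair linking c1 directly to c3 falls below the support threshold and is ignored, which mirrors how weakly supported links are filtered before contigs are joined.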

2.5 Assembly Assessment Phase

Assessing the performance of an assembler depends on the metric(s) used during the evaluation process. One approach targets the contiguity of the resulting contigs/scaffolds and uses different statistical metrics to assess the final assembled sequence [27–34]. Another approach scrutinizes the accuracy of the assembled contigs/scaffolds and uses a previously finished genome as a reference against which to assess the draft sequence [29, 31]. Additional evaluation strategies include examining the constraints imposed by paired-end libraries, the nature of the sequences being assembled, and the sequencing experiments themselves [31, 35, 36].
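A widely used contiguity statistic is N50: the length of the contig at which, going from longest to shortest, the running total first reaches half of the total assembled length. The text above does not name a specific metric, so N50 is offered here as an assumed example, and the function name is hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig where the running total reaches
        # at least half of the total assembled length.
        if running * 2 >= total:
            return length
    return 0


print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # 8
```

In this example the total length is 32, and the two longest contigs already sum to 16, so N50 is 8. A larger N50 indicates a more contiguous assembly, but it says nothing about correctness, which is why reference-based accuracy checks are used alongside it.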

Since an assembler is a software program with a set of functionalities, it must be assessed not only in terms of its output but also in relation to other factors. These include responsiveness to user commands, the friendliness of the user interface components, and setup requirements. Evaluating such functionalities allows a targeted assessment of the usability of an assembler [37–39].