
1 Introduction to Assembly Assessment

The assessment of the assembly process is mainly performed from two perspectives. The first is assembly quality, which evaluates the contiguity, consistency, and accuracy of the assembled genomes using different approaches [1–4]. The second is the performance and usability of the assembler, which covers issues such as hardware and software requirements, ease of installation and execution, user-friendly interfaces, run time per analysis, memory required per 1 GB of data, and responsiveness to user commands [5–8].

2 Contiguity and Consistency Measures

2.1 Contiguity Assessment

Statistical metrics are usually used to assess the contiguity of the assembled contigs/scaffolds. These metrics include the distribution of contig/scaffold lengths, their maximum, minimum, and average lengths, the number of resulting contigs/scaffolds, the total assembled length, the total length of the underlying short reads, and the Nx score. N50 and N75 are the most important metrics for measuring contig/scaffold contiguity. Nx is defined as the length L such that x % (50 % or 75 %, respectively) of the assembled bases lie in contigs/scaffolds of length L or greater [1–4, 9–12]. Although a large Nx score indicates greater contiguity in the assembled contigs/scaffolds, misassembled contig/scaffold sequences may also inflate the score [13].
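To make the Nx definition concrete, the following Python sketch computes N50 and N75 from a list of contig lengths; the function name `nx` and the toy lengths are illustrative only and not tied to any particular tool.

```python
def nx(lengths, x=50):
    """Return the Nx value (e.g., N50, N75) of a set of contig/scaffold lengths.

    Nx is the length L such that contigs/scaffolds of length >= L contain
    at least x% of the total assembled bases.
    """
    if not lengths:
        raise ValueError("no contig lengths supplied")
    threshold = sum(lengths) * x / 100.0
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


# Toy example: ten contigs of decreasing length.
contigs = [900, 700, 500, 400, 300, 200, 150, 100, 80, 50]
print(nx(contigs, 50), nx(contigs, 75))  # N50 = 500, N75 = 300
```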

2.2 Consistency Assessment

Paired-end libraries carry abundant information, including the orientation of each read pair and an estimate of the insert size between its mates, and consistency-based approaches exploit this information in the evaluation process. After the assembly is complete, the read pairs are mapped back to the draft sequence, and their observed separation distances and orientations are compared with the values expected from the library. The number of satisfied constraints then indicates how valid the assembled sequence is [14]. A more recently introduced approach also aligns the paired-end reads to the assembled genome to generate Feature-Response Curves (FRC), which address the tradeoff between the contiguity and the accuracy of the assembly results [15, 16]. Other consistency methods target the type of sequence being assembled (such as haplotype sequences) [3], or use the constraints imposed by read coverage [17] or by optical maps [18] to assess the assembled sequences.
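The following Python sketch illustrates the general idea of such a constraint check under simplifying assumptions (both mates map to the same contig, and the library is characterized only by an insert-size range and a forward-reverse orientation); the `MappedPair` record and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MappedPair:
    contig: str                # contig both mates map to (simplifying assumption)
    left_start: int            # leftmost coordinate of the left mate
    right_end: int             # rightmost coordinate of the right mate
    proper_orientation: bool   # True if the mates point toward each other (FR)


def pair_consistency(pairs, min_insert, max_insert):
    """Fraction of read pairs whose mapped distance and orientation satisfy
    the library constraints; a rough proxy for assembly consistency."""
    satisfied = 0
    for p in pairs:
        observed_insert = p.right_end - p.left_start
        if p.proper_orientation and min_insert <= observed_insert <= max_insert:
            satisfied += 1
    return satisfied / len(pairs) if pairs else 0.0


# Toy example with a 300 +/- 50 bp library.
pairs = [
    MappedPair("ctg1", 100, 410, True),    # insert ~310 bp: satisfied
    MappedPair("ctg1", 500, 1_900, True),  # insert too large: possible misjoin
    MappedPair("ctg2", 50, 330, False),    # wrong orientation: possible inversion
]
print(pair_consistency(pairs, 250, 350))   # 1 of 3 constraints satisfied
```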

3 Accuracy Measures

Comparing draft assemblies to a completed reference sequence is the most important approach for evaluating assembly quality [3, 9]. The reference can be an assembled genome of the same species or of a closely related species. The comparison can take different forms, such as aligning the two sequences with one of the available alignment tools (i.e., the tools mentioned in Chap. 2) to report the percentage of the reference covered by the assembled sequence [5, 19], the long-range contiguity of the assembled contigs/scaffolds [20], and their accuracy together with the modification patterns introduced into the assembled sequences, such as insertions, deletions, and substitutions [21]. The comparison also assists in identifying core genetic components and novel genes [22]. The number of misassembled contigs/scaffolds (i.e., breaks) and the number of misaligned bases (i.e., mis-calls) are likewise used as accuracy metrics when aligning to a reference sequence [23]. A different assessment strategy is needed when no reference genome is available. In this case, the comparison relies on independent genetic material from a public database. Such genetic components (e.g., mRNAs or cloned genes) can be used only if they and the assembled sequences belong to the same type of organism; when this criterion cannot be met, accuracy approaches enlist components from closely related organisms or conserved sequences [1, 22].
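As a minimal illustration of the reference-coverage perspective, the sketch below merges aligned blocks, given as (start, end) intervals on the reference (e.g., extracted from an aligner's coordinate output), and reports the percentage of reference bases covered; the interval values are made up for the example.

```python
def reference_coverage(alignments, reference_length):
    """Percentage of reference bases covered by at least one aligned
    contig/scaffold block; `alignments` is a list of (start, end) intervals
    on the reference (0-based, end-exclusive)."""
    covered = 0
    last_end = 0
    for start, end in sorted(alignments):
        start = max(start, last_end)   # ignore overlap with earlier blocks
        if end > start:
            covered += end - start
            last_end = end
    return 100.0 * covered / reference_length


# Toy example: three aligned blocks on a 10 kb reference.
blocks = [(0, 4_000), (3_500, 6_000), (8_000, 9_000)]
print(reference_coverage(blocks, 10_000))  # 70.0 % of the reference covered
```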

4 Assembler’s Performance Measures

The runtime and memory usage of an assembler are the most important criteria for the usability measure. Depending on the available computational resources, current assemblers used in next-generation environments fall into two categories. In the first category, the assembler runs on a single machine with very large memory requirements, e.g., to assemble human and mammalian genomes [19, 24]. In the second category, assemblers run on tightly coupled cluster machines [25]. The high-throughput nature of next-generation sequencing technologies and the presence of short-read sequences and their quality scores impose a major constraint on the available system memory. To save memory, most assemblers formulate the assembly problem as a set of graph nodes and rely on efficient data structures to accommodate these nodes. These graph models were discussed earlier (see Sect. 9.3), including their advantages and disadvantages with respect to computational resources and several studies that reformulated their representations for efficient storage in memory. However, no memory-efficient solution is presently available for next-generation sequence assemblers, creating a need for new tools and algorithms in this area.
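A minimal way to capture both criteria is to wrap the assembler invocation and record wall-clock time together with the peak resident memory of the child process, as in the Python sketch below (Unix-specific; the commented command line is only a placeholder for a real assembler run).

```python
import resource
import subprocess
import time


def profile_assembler(command):
    """Run an assembler command and report wall-clock time and the peak
    resident memory of the child process (Unix only; on Linux, ru_maxrss
    is reported in kilobytes)."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    elapsed = time.perf_counter() - start
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak_kb


# Hypothetical invocation; substitute the actual assembler and its arguments.
# elapsed_s, peak_kb = profile_assembler(["velveth", "out_dir", "31", "-fastq", "reads.fq"])
# print(f"{elapsed_s:.1f} s, {peak_kb / 1024:.0f} MB peak RSS")
```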

5 Assessment Tools and Evaluation Studies for Assessing Assembly Quality

Several studies have evaluated assembly quality by combining the approaches discussed above or by defining novel strategies, and there are also tools designed specifically for assessing sequence assembly quality. However, assessment tools that account for the complexity of the data sets being assembled, the assembly algorithms, different parameter settings, and the nature of the sequencing experiments are still lacking [21, 26]. It is also important to note that there is always a tradeoff between the different quality measures; for instance, maximizing one measure (e.g., improving contig/scaffold connectivity) may decrease another (e.g., contig/scaffold accuracy). Here, we mention some studies that attempted to design assessment approaches and metrics applicable to a wide range of next-generation sequence assembly techniques, and then review the available assembly assessment tools.

5.1 Evaluation Studies for Assessing Assembly Quality

Assemblathon [27] is one of the studies that defined its own statistical metrics in addition to the existing ones. It uses haplotype sequences as the reference for newly defined metrics such as NG50, which is computed like N50 but relative to the average length of the haplotype sequences (an estimate of the genome size) rather than the total assembly length. Similarly, CPNG50/SPNG50 denote analogous measures computed over contig/scaffold paths that are consistent with the haplotype sequences, while CC50 measures the connectivity between any two randomly chosen points in the assembled genomes. The more recently published version of the Assemblathon [28] addressed some practical issues in assembly evaluation, including handling diverse assembly results produced by different assemblers with different parameter settings, choosing assemblers based on the metrics of interest, and disregarding contiguity metrics when studying the genetic components of the assembled sequences.
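A sketch of NG50, mirroring the earlier Nx function but using an assumed genome size as the denominator, shows how the two metrics can diverge on the same set of contig lengths; the genome size here is illustrative.

```python
def ngx(lengths, genome_size, x=50):
    """NGx: the length L such that contigs/scaffolds of length >= L cover at
    least x% of the (estimated) genome size, rather than of the assembly size."""
    threshold = genome_size * x / 100.0
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return 0  # the assembly does not reach x% of the genome size


contigs = [900, 700, 500, 400, 300, 200, 150, 100, 80, 50]
print(ngx(contigs, genome_size=5_000, x=50))  # NG50 = 400, versus N50 = 500 above
```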

E-size is yet another statistical metric, introduced in GAGE [13]. It is defined as the expected length of the contig/scaffold that contains a base chosen at random from the reference genome. GAGE also discussed the different factors that can affect the evaluation process, such as the complexity of the genome being assembled and the assembler employed. It further reported that individual statistical measures cannot be used alone to indicate assembly quality because they do not adequately represent both the contiguity and the accuracy of the assembled sequences. A more recent version of this study, GAGE-B [29], evaluated different bacterial genome assemblers using libraries with high-coverage reads and studied the effect of coverage and read length on assembly quality.
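Under this definition, the E-size reduces to the sum of squared contig/scaffold lengths divided by the genome size, as in the sketch below (falling back to the total assembly size when the genome size is unknown); the helper name and example values are illustrative.

```python
def e_size(lengths, genome_size=None):
    """E-size: the expected length of the contig/scaffold containing a base
    chosen uniformly at random, i.e., the sum of squared lengths divided by
    the genome size (or, if unknown, the total assembly size)."""
    total = genome_size if genome_size else sum(lengths)
    return sum(l * l for l in lengths) / total


contigs = [900, 700, 500, 400, 300, 200, 150, 100, 80, 50]
print(round(e_size(contigs), 1))         # relative to the assembly size
print(round(e_size(contigs, 5_000), 1))  # relative to a 5 kb reference
```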

Additionally, Haiminen et al. [30] reported that the assessment process can be affected by the nature of the sequencing experiments, such as the average length of the short reads, their coverage, and the rate of sequencing errors. Furthermore, they assign a different score to each mis-called base according to the type of modification operation involved, such as substitutions, insertions, deletions, reorderings, redundancies, and relocations. The accuracy of the assembled sequence is then determined by aggregating these scores.
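A simplified sketch of such weighted scoring is shown below; the penalty values are placeholders, since the actual weights used by Haiminen et al. are study-specific.

```python
# Hypothetical per-event penalties; the published weights differ, so these
# numbers are placeholders only.
PENALTIES = {
    "substitution": 1.0,
    "insertion": 2.0,
    "deletion": 2.0,
    "reordering": 5.0,
    "redundancy": 3.0,
    "relocation": 5.0,
}


def accuracy_score(events):
    """Aggregate a per-assembly error score from (operation, count) pairs;
    a lower score indicates a more accurate assembly."""
    return sum(PENALTIES[op] * count for op, count in events)


print(accuracy_score([("substitution", 120), ("insertion", 4), ("relocation", 1)]))
```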

5.2 Assembly Assessment Tools

QUAST [31] is an assessment tool that combines metrics for use with or without a reference genome. It uses N50, NG50, NA50, and NGA50, measuring assembly quality in terms of aligned blocks rather than whole contigs/scaffolds. QUAST also incorporates other metrics discussed above, such as the total number of misassembled contigs/scaffolds and of genetic components. Moreover, it provides a full set of functionality for generating statistical reports supplemented with plots and figures.

Computing Genome Assembly Likelihoods (CGAL) [32] introduced a likelihood metric for de novo assembly evaluation based on the uniformity of read coverage, errors in the sequenced reads, the distribution of insert sizes, and the number of unassembled reads.
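The sketch below only mimics the general shape of such a likelihood: it sums the log-probabilities of mapped reads (which a full model such as CGAL derives from coverage, error, and insert-size distributions) and assigns a small fixed probability to each unassembled read; it is not CGAL's actual model, and all values are made up.

```python
import math


def assembly_log_likelihood(read_probs, n_unmapped, unmapped_prob=1e-30):
    """Toy log-likelihood of an assembly given the reads: the sum of the log
    probabilities of the mapped reads plus a small fixed probability for
    each unassembled read. Higher (less negative) is better."""
    ll = sum(math.log(p) for p in read_probs)
    ll += n_unmapped * math.log(unmapped_prob)
    return ll


# Toy comparison of two assemblies of the same read set.
print(assembly_log_likelihood([1e-5, 2e-5, 5e-6], n_unmapped=1))
print(assembly_log_likelihood([1e-5, 1e-6, 1e-8], n_unmapped=3))
```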

REAPR [33] is another reference-free assessment tool that identifies errors in the assembled sequences using paired-end reads and provides useful information to the end users that reflects the quality of the algorithm used in the assembly process.

6 Assessment of Transcriptome and Metagenome Assembly Quality

The assessment of assembled transcripts also represents a challenge in the next-generation environment, since it depends on the abundance of the reference transcripts, their lengths, their different splicing isoforms, and the existence of novel transcripts. Martin and Wang proposed metrics for assessing transcriptome assembly at different levels of complexity, assuming a set of well-expressed reference transcripts originating from the same transcriptome [34]. These metrics include accuracy, completeness, contiguity, chimerism, and variant resolution. Although they measure the assembled transcripts against a set of reference transcripts, they provide useful insight into the number of correctly assembled bases, the percentage of coverage with respect to the reference transcripts, the number of chimeric transcripts introduced during the assembly process, and the percentage of variants captured in the assembled transcripts [34, 35]. If reference transcripts are not available, complementary approaches may be used instead, such as examining whether full-length ORFs are encoded in the different isoforms and validating them with proteomic assays [36].
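As a small illustration of the completeness metric, the sketch below counts reference transcripts as recovered when a chosen fraction of their length is covered by assembled transcripts; the 80 % threshold and the coverage values are assumptions for the example.

```python
def completeness(covered_fraction_per_transcript, threshold=0.8):
    """Fraction of expressed reference transcripts recovered by the assembly,
    where a transcript counts as recovered if at least `threshold` of its
    length is covered by assembled transcripts (threshold is illustrative)."""
    recovered = sum(1 for f in covered_fraction_per_transcript if f >= threshold)
    return recovered / len(covered_fraction_per_transcript)


# Coverage fractions for five reference transcripts, e.g. derived from
# aligning the assembled transcripts back to the reference set.
print(completeness([1.0, 0.95, 0.6, 0.85, 0.2]))  # 0.6
```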

The evaluation of metagenomic sequence assemblies is another formidable challenge in the next-generation sequencing environment due to the presence of a variety of genetic materials from different microbial communities. Mende and colleagues [37] proposed a number of metrics for evaluative purposes, including the number of chimeric contigs, the accuracy of contigs based on their defined scoring scheme, and the variety of genetic components in the resulting assembly sequences.

Charuvaka and Rangwala [38] presented an entropy metric to measure the degree of chimerism in contig sequences, and further exploited paired-end reads and sequence coverage to measure assembly quality. More recently, Assembly Likelihood Evaluation (ALE) [39] introduced a reference-independent framework for assessing metagenomic and single-cell assemblies. ALE uses statistical methods that draw on different sources of information, such as paired-end constraints and relevant factors of the sequencing experiment (i.e., coverage, errors, and read length). In addition, it reports various assembly errors, such as base-call errors, misassembled chimeric sequences, and genome rearrangements resulting from indel operations.
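The sketch below illustrates one simple way such an entropy score can be computed, from the source-genome labels of the reads assembled into a contig; it is a simplification of the published metric, and the labels are hypothetical.

```python
import math
from collections import Counter


def contig_entropy(read_sources):
    """Shannon entropy of the source-genome labels of the reads assembled
    into one contig; 0 means all reads come from a single genome, while
    higher values indicate a chimeric metagenomic contig."""
    counts = Counter(read_sources)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


print(contig_entropy(["E.coli"] * 90 + ["B.subtilis"] * 10))  # mildly chimeric
print(contig_entropy(["E.coli"] * 100))                       # 0.0: pure contig
```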