Background

DNA sequencing technologies have undergone massive changes and improvements over the last four decades. Their origin can be traced back to 1977, when the Sanger chain termination method and the Maxam–Gilbert chemical degradation method were first reported (Maxam and Gilbert 1977; Sanger and Coulson 1975; Sanger et al. 1977). The maximum sequence length generated by Sanger sequencing is about 1 kb.

Sanger sequencing requires a large amount of DNA, generally produced by cloning target sequences into vectors and amplifying them in prokaryotic cells such as E. coli. The invention of the polymerase chain reaction (PCR) created a tremendous opportunity for later biotechnologies. PCR using oligos immobilized on flow cells enables clonal amplification of templates, which is essential for next-generation sequencing (NGS), also called second-generation sequencing.

NGS comprises a number of robust technologies characterized by massively parallel sequencing of DNA molecules. Four major NGS platforms were made commercially available in consecutive years: the 454 system by Roche, the GA/Solexa system by Illumina, the SOLiD system by ABI, and the Ion Torrent system by Life Technologies (Goodwin et al. 2016; Liu et al. 2012; Mardis 2013). Over the past decade, Illumina emerged as the dominant provider of sequencers owing to their low cost, high speed and high yield.

Illumina has released a series of instruments to meet various output demands. Some of these sequencers produce large numbers of short reads of ≤ 300 bp (billions of reads can be generated in a single run) for complex eukaryotic genomes or small microbial genomes in a relatively short period of time (Liu et al. 2012). Owing to this, Illumina sequencers have gained wide popularity, and NGS has been deployed across many domains of genomics, including oncology, microbiology, environmental genomics and metagenomics, for biological, medical, environmental, and agricultural studies (Ashley 2016; Deurenberg et al. 2017; Gardy and Loman 2018; Giampaoli et al. 2018; Hoper et al. 2016; Scheben et al. 2017; Yuan et al. 2017). However, read length remains a bottleneck for biological studies (Ulahannan et al. 2013).

Limitations of NGS

NGS is advantageous in many respects, such as low cost, high speed and high yield, and has been intensively employed in various biological analyses during the past 15 years. Several studies have addressed the same biological question with different NGS and third-generation sequencing (TGS) methods. These works have revealed a number of intrinsic limitations of NGS that may have a significant impact on the accuracy of biological studies. Among these limitations, short read length is the most noticeable drawback of NGS. It limits the precision of many biological studies, especially genome assembly and transcriptome analysis.

For genome assembly, deducing the genome sequence from billions of short reads poses computational challenges resulting from genomic complexity and from time and hardware limitations. These challenges have become critical for the assembly of large genomes, because short reads often yield highly fragmented assemblies owing to unresolvable repetitive regions or regions with high GC content (Petersen et al. 2017; Salzberg and Yorke 2005; Schmutz et al. 2004). Using short reads to analyze segmental duplications, structural variations (SVs) or paralogous regions may result in a significant number of false positives (Guan and Sung 2016; Treangen and Salzberg 2011). This issue has been documented empirically in a number of reports. Despite the advances in sequencing technologies and bioinformatics, de novo assembly of large genomes remains challenging (Chin et al. 2014).

Complexities resulting from heterozygosity, transposable elements, GC-rich regions, tandem repeats and interspersed repetitive regions of 10 kb–10 Mb or more remain unresolved by NGS short reads (Alkan et al. 2011). Sequencing of polymorphic tandem repeats (TRs) by NGS is severely impaired by read length (Mousavi et al. 2018), and TR expansions determined from reads of 100–250 bases might be inaccurate (Bahlo et al. 2018). For complex chromosomal rearrangements and structural variants, previous analytical approaches such as fluorescent in situ hybridization (FISH), array CGH, PCR and NGS are either laborious or imprecise (Pang et al. 2010; Quinlan and Hall 2012). Short paired-end reads, although able to offer single-base precision, frequently cannot be mapped precisely to repetitive regions (e.g., trinucleotide repeats) (Tattini et al. 2015). In contrast, the long reads of single-molecule sequencing (SMS) offer a superior alternative for characterizing complex genomic rearrangements (CGRs) (Chaisson et al. 2015; Huddleston et al. 2017; Merker et al. 2018; Spies et al. 2017).

Transcriptome analysis encounters constraints from short reads similar to those in genome assembly. Short reads often cannot be used to infer full-length RNA transcripts or to precisely determine specific isoforms (Bayega et al. 2018; Martin and Wang 2011). Because of this, studies were unable to fully address gene regulation, the protein-coding potential of the genome and phenotypic diversity.

NGS is also limited by its inability to directly sequence RNA and epigenetic/methylation markers. RNA sequencing by NGS requires conversion of RNA molecules to the corresponding cDNA molecules, which are then sequenced as DNA. This procedure is used in all aspects of biological studies, especially for the transcription of protein-coding and non-coding genes (Costa-Silva et al. 2017). Epigenetic modification plays a vital role in the regulation of eukaryotic gene expression. DNA methylation analysis has largely relied on bisulfite conversion to detect 5-methylcytosine (5mC). Although genome-wide 5mC profiling became feasible with NGS, it is still limited by uneven coverage as well as sequencing and mapping artifacts (Nair et al. 2018; Smith et al. 2009; Warnecke et al. 2002).
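The principle behind bisulfite-based 5mC detection can be illustrated with a minimal sketch. It relies on the well-known chemistry that unmethylated cytosine is deaminated and read as T, whereas 5mC is protected and still read as C. The function names are hypothetical, only the top strand is modeled, and conversion is assumed to be complete and error-free, which real data are not.

```python
def bisulfite_convert(seq, methylated_positions):
    """Return the sequence as read after bisulfite treatment and PCR.

    Unmethylated C is read as T; methylated C (5mC) is still read as C.
    Positions are 0-based indices into `seq`.
    """
    methylated = set(methylated_positions)
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


def call_methylation(reference, converted):
    """Infer 5mC positions: a reference C that is still C after conversion."""
    return [i for i, (r, c) in enumerate(zip(reference.upper(), converted))
            if r == "C" and c == "C"]


reference = "ACGTCCGA"
converted = bisulfite_convert(reference, methylated_positions=[1])
print(converted)
print(call_methylation(reference, converted))
```

Because most cytosines become thymines, converted reads have reduced sequence complexity, which is one reason mapping artifacts arise with short reads.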

Moreover, short-read sequencing normally involves large equipment, laborious experimental procedures and extensive bioinformatics analysis, and is thus unable to meet the need for fast field testing. The whole process, from transporting biological samples to a sequencing facility through library preparation, sequencing and data analysis, may take a long time (Quick et al. 2016).

Development of single-molecule sequencing

Shortcomings of NGS fostered the development of single-molecule sequencing (SMS), also called third-generation sequencing (TGS). As achieved by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), SMS has increased read length by 1–2 orders of magnitude over Sanger sequencing, and a small portion of ONT reads may achieve an increase of 3 orders of magnitude (Giordano et al. 2017; Jain et al. 2018; Judge et al. 2015; Weirather et al. 2017).

The development of SMS has been phenomenal, with a number of crucial technologies developed using cross-disciplinary expertise. Firstly, pores (for Nanopore sequencing) or wells (for PacBio sequencing) at nanometer scale were made and immobilized on a well-designed matrix to harbor single nucleic acid molecules for sequencing (Benner et al. 2007; Eid et al. 2009). Secondly, novel sequencing mechanisms were implemented to detect single fluorescent signals (by PacBio) or to distinguish electrical changes (by Nanopore), which are then converted to nucleotide bases as well as base modifications (Hartel et al. 2019; Rhoads and Au 2015). Thirdly, extremely sensitive detection systems were designed independently to detect base-level signals (Benner et al. 2007; Clark et al. 2011). Moreover, a number of software tools were developed for base calling and error correction.

Similar to Sanger sequencing and most NGS platforms, PacBio sequencing adopts a sequencing-by-synthesis mechanism in which a DNA strand serves as the template for synthesis of the complementary strand by DNA polymerase. Single-molecule real-time (SMRT) sequencing, developed by PacBio (founded in 2004), was the first widely deployed long-read sequencing technology. Reads generated by SMRT sequencing can reach about 200 kb. Such read lengths give long-read technologies an edge in base-to-base comparison of genomes for identifying genetic variation and for understanding gene function and disease association with significantly improved accuracy. The RS system, the first SMRT sequencer, released in 2009, sequences circular single-stranded DNA templates (i.e., SMRTbells) in real time. A SMRTbell is made by ligating hairpin adaptors to both ends of a target double-stranded DNA molecule. Each hairpin loop carries a binding site for the sequencing primer (Ardui et al. 2018; Travers et al. 2010).

DNA template-polymerase complexes are built and immobilized at the bottom of zero-mode waveguide (ZMW) chambers (Fig. 1). When a fluorescently labeled nucleotide is incorporated, a light pulse is generated and recorded. Unlike NGS, in which fluorophores are attached to the nucleobase, SMRT sequencing uses nucleotides whose fluorophore is attached to the terminal phosphate. Formation of the phosphodiester bond releases the fluorophore together with the pyrophosphate (PPi) to which it is conjugated. Base-linked fluorophores were found unsuitable for SMRT sequencing because of poor incorporation (Ardui et al. 2018; Eid et al. 2009; Rhoads and Au 2015).

Fig. 1

Figure reproduced by copyright permission of AAAS

SMRT sequencing. a Graphic representation of a ZMW chamber with a DNA polymerase bound to a single molecule of DNA template. b Association of a phospholinked nucleotide with the template in the polymerase active site elevates the fluorescence signal on the corresponding color channel. Formation of the phosphodiester bond then releases the dye–linker–pyrophosphate product, which diffuses out of the ZMW and ends the fluorescence pulse. The polymerase shifts to the next position, binds the next nucleotide and generates the next pulse. (Eid et al. 2009)

General features of PacBio sequencers are listed in Table 1.

Table 1 Summary of general features of PacBio sequencers

Nanopore sequencing by ONT, founded in 2005, is based on an innovative technology capable of distinguishing minor changes in ionic current when nucleotide bases of single-stranded DNA/RNA molecules pass through protein nanopores immobilized on a saline solution-immersed membrane to which a fixed voltage is applied (Fig. 2).

Fig. 2

Nanopore sequencing. DNA or RNA molecules are fed into nanopores by a motor protein, and each causes alterations in the electric signal when passing through the pore. These signal fluctuations are then converted into basecalls by dedicated algorithms. (Image taken from the Oxford Nanopore Technologies website.)

Under an applied voltage, the 5′ end of ssDNA is electrophoretically fed into a matrix-embedded protein nanopore, a mutant form of Mycobacterium smegmatis porin A (MspA) engineered to have a short and narrow constriction (about 1.2 nm wide and 0.6 nm long) that achieves single-nucleotide resolution (Manrao et al. 2012). When a DNA molecule passes through the nanopore, distinct current signals are obtained for each nucleotide. The alterations in ionic current are recorded for each pore and converted to a base sequence by a basecaller, previously based on Hidden Markov Models and currently on Recurrent Neural Networks (Boza et al. 2017; Hartel et al. 2019). ONT developed the portable MinION sequencer to enable real-time field sequencing of small amounts of DNA from biological or environmental samples (Lu et al. 2016). PromethION is ONT’s benchtop instrument; it holds 48 flow cells that can be run in parallel and generates up to 100 Gb of data per flow cell per run. General features of ONT sequencers are listed in Table 2. Nanopores can be used to sequence both the plus and minus strands of DNA fragments with customized library preparation kits (2D, now 1D²) to increase sequencing accuracy. ONT’s sequencers also allow RNA molecules to be sequenced directly (Keller et al. 2018).

Table 2 Summary of general features of ONT sequencers

The longest read achieved with a MinION flow cell to date is more than 2 Mb (Payne et al. 2018). Ultra-long reads (ULRs) are an important feature with strong potential to facilitate the assembly of large genomes. This was demonstrated by the assembly of a contiguous yeast genome from MinION long reads (LRs) (Istace et al. 2017), followed by a human genome assembly from MinION LRs plus ULRs without Illumina short reads (Jain et al. 2018), which is discussed further below.

Comparison between PacBio and ONT platforms

PacBio and ONT share a common advantage of long-read length and a common disadvantage of high error rate of ~ 5–20% randomly distributed before error correction (Sedlazeck et al. 2018). As such, Illumina short reads are frequently incorporated with long reads for hybrid assembly of large eukaryotic genomes (Antipov et al. 2016; Giordano et al. 2017) or for hybrid sequencing of transcriptomes to characterize transcript isoforms or fusion genes (Deonovic et al. 2017; Weirather et al. 2015).
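Why randomly distributed errors are amenable to correction can be shown with a short calculation. The sketch below computes the probability that a simple majority vote across overlapping reads is wrong at one position. It assumes errors are independent between reads and, pessimistically, that all erroneous reads agree on the same wrong base. The function name is hypothetical, and the 13% error rate is the RSII figure quoted in the Giordano et al. comparison. Systematic errors, such as those in homopolymers, violate the independence assumption and are not removed by added depth in this way.

```python
from math import comb


def majority_error(per_read_error, n_reads):
    """Probability that more than half of n_reads are wrong at one position.

    Assumes independent errors and an odd number of reads (no ties).
    Treating every wrong read as voting for the same wrong base makes
    this an upper bound on the consensus error.
    """
    if n_reads % 2 == 0:
        raise ValueError("use an odd number of reads to avoid ties")
    p = per_read_error
    need = n_reads // 2 + 1  # wrong votes needed to outvote the truth
    return sum(comb(n_reads, k) * p**k * (1 - p)**(n_reads - k)
               for k in range(need, n_reads + 1))


for n in (1, 5, 15, 31):
    print(n, f"{majority_error(0.13, n):.2e}")
```

Under these assumptions, the consensus error falls rapidly with depth, which is consistent with the high post-correction accuracies reported for both platforms.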

PacBio and ONT also differ in a number of aspects. To compare the performance of the two platforms, we summarize three reports, on genome assembly (Giordano et al. 2017), transcriptome analysis (Weirather et al. 2017), and structural variant calling (Sedlazeck et al. 2018), respectively. Since every sequencing platform endeavors to improve its sequencing quality, the pros and cons of each platform are expected to shift as new sequencers and bioinformatics tools continue to evolve. Readers are encouraged to keep up to date through frequent literature searches.

Giordano and colleagues conducted a comprehensive comparison of genome assembly efficiency between PacBio RSII (read length 5–60 kb, average 12 kb, error rate 13%, accuracy 99.9% after correction, throughput 1 Gb/run), ONT MinION (2D, R7.3/R9 flow cells), and Illumina MiSeq (80× coverage of 2 × 150 bp reads) at various read depths using four strains of S. cerevisiae (Giordano et al. 2017). They found that RSII performed much better than MinION in throughput per run, while the two sequencers were broadly similar in error rate, with RSII slightly more accurate. At 31X coverage, either RSII or MinION reads alone were able to complete the assemblies of the 16 nuclear chromosomes, but not the mitochondrial genome. With Illumina short reads added, both PacBio and ONT datasets could achieve an accuracy above 99.9%. It is worth noting, however, that the output and quality of MinION flow cells have improved significantly since 2017. MinION reads can now be much longer than RSII reads in both average and maximum length.

They also evaluated eight assembly pipelines (namely, PBcR-Self, PBcR-MiSeq, Canu, Falcon, ABruijn, SMARTdenovo, Miniasm and Racon) at various depths, with or without base error correction or consensus construction prior to assembly (Chin et al. 2016; Jayakumar and Sakakibara 2019; Koren et al. 2012, 2017; Li 2016; Lin et al. 2016; Vaser et al. 2017). Miniasm turned out to be the fastest. However, it missed 4–5% of the genome because it performs no base error correction. On the other hand, PBcR-MiSeq was the most accurate (> 99.68% of the reference genome covered with identity above 99.9%), as it used MiSeq short reads to correct long reads produced by either MinION or RSII. These results indicate the importance of error correction and the superiority of hybrid assembly over assembly from long reads alone.

A comparison between PacBio RSII and ONT MinION Mk 1B on transcriptomes of human embryonic stem cells (hESCs) was reported by Weirather and colleagues (Weirather et al. 2017). In the study, short reads produced by HiSeq 4000 were used in Hybrid-Seq libraries to evaluate how essential short reads are for full-length transcript sequencing and characterization of transcript isoforms. Results indicated that RSII had better quality in terms of error rate (1.72% in PacBio CCS vs. 13.4% in ONT 2D), while ONT had higher yield and a better throughput/cost ratio, and that both SMS platforms are suitable for full-length transcriptome analysis. In general, the Hybrid-Seq strategy performs slightly better in making full use of PacBio and ONT reads in transcriptome analysis, as both SMS platforms can benefit from short reads to improve accuracy (median error reduced from 0.13% with ONT alone to 0.05% with ONT-Illumina).

A comparison between PacBio and ONT on the accuracy of structural variant calling using the well-studied human genome NA12878 was reported by Sedlazeck and colleagues (Sedlazeck et al. 2018). From 28X ONT data and 55X PacBio data, 26,567 and 15,499 SVs were identified, respectively. Most (94.80%) SVs identified from PacBio data were confirmed by ONT, Illumina or other call sets, while ONT had worse concordance: among the 11,433 SV calls unique to ONT, most (96.01%) were deletions and 92.88% were within homopolymers or other repeats. In contrast, among the 773 SV calls unique to PacBio, most (66.49%) were insertions and only 41.8% were within homopolymers or repeats. The bias in ONT is suspected to result from errors in base calling. However, the results were influenced by coverage. The authors also showed that 15X PacBio reads can call 69.64% of SVs at a precision of about 80%, whereas at 30X coverage the proportion of SVs called increases to about 80% with a precision of ~ 85%. Subsampled ONT data at 20X coverage were able to call 84.23% of SVs at a precision of 82.24%, which was better than PacBio data at 30X.
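The two quantities reported above, the fraction of true SVs that are called (recall) and the fraction of calls that are correct (precision), can be made concrete with a small sketch. The call and truth sets below are invented for illustration, the function name is hypothetical, and calls are matched by exact identity, whereas real benchmarks usually allow some tolerance in breakpoint position and size.

```python
def precision_recall(calls, truth):
    """Compare an SV call set against a truth set by exact match."""
    calls, truth = set(calls), set(truth)
    true_positives = len(calls & truth)
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical (chromosome, position, type) records.
truth = {("chr1", 1000, "DEL"), ("chr1", 5000, "INS"),
         ("chr2", 300, "DEL"), ("chr3", 42, "INV")}
calls = {("chr1", 1000, "DEL"), ("chr1", 5000, "INS"),
         ("chr2", 300, "DEL"), ("chr2", 900, "DEL")}

precision, recall = precision_recall(calls, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Reporting both values matters because raising coverage can increase recall while also adding false calls, so neither number alone describes caller performance.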

Besides the above-mentioned issues, sequencing speed is also a critical factor. To achieve better accuracy, PacBio sequencing trades off speed and output and proceeds at only a few bp per second, while the ONT system can exceed 400 bp per second. Key differences between PacBio and ONT sequencers are summarized in Table 3.

Table 3 Comparison between ONT and PacBio

Ultra-long reads as a unique feature of ONT

Using the ONT MinION sequencer, Jain et al. (2018) assembled a human genome with the Canu assembler from 30X coverage of ULRs and LRs, without Illumina short reads. Across the 53 R9.4 flow cells used, they found that ULRs were best produced from high molecular weight DNA freshly extracted from cells. To the best of our knowledge, ULRs have not been demonstrated on PacBio sequencers.

The comparison made by Jain et al. between ONT and PacBio reads also showed that, with read correction and equivalent coverage for both, the genome assembled from ONT reads has lower identity to the reference genome GRCh38 than that assembled from PacBio reads (92% vs. > 99%). The frequency of deletions is also higher for ONT reads, caused by homopolymers and low-complexity regions, suggesting that ONT is more error prone than PacBio. However, as also reported in the comparison, errors in ULRs are distributed more or less evenly and do not increase with read length.

To further illustrate the ULRs described by Jain et al., here we display the ONT data used in their study (https://github.com/nanopore-wgs-consortium/NA12878/blob/master/Genome.md), together with several other PacBio and ONT datasets from recent studies (Fig. 3) (Supplementary Tables 1–3). Based on previous reports and to obtain a reasonable separation, we provisionally define ULRs as reads with length ≥ 300 kb.

Fig. 3

Comparison of read length between ONT and PacBio datasets. Recent datasets generated by ONT or PacBio were analyzed by NanoPack tool (De Coster et al. 2018) (https://github.com/wdecoster/nanopack). (ONT datasets: JAIN (MinION), ONT1 (MinION), ONT2 (MinION), ONT3 (PromethION), ONT4 (PromethION); PacBio datasets: PB0 (RS II), PB1 (Sequel), PB2 (Sequel II)). Numbers written below the blue line are number of reads with length ≥ 100,000 bp per million of total reads. Numbers written above the red line are number of reads with length ≥ 300,000 bp per million of total reads

Figure 3 shows the results of the comparison of ONT and PacBio datasets (Supplementary Tables 1–3). We observed that although the median read length of ONT data is comparable to that of PacBio, a small portion of ONT reads was above 300 kb in length. In contrast, the PacBio data do not contain any reads above 150 kb, and the N50 of these datasets varies depending on the sequencing protocol used. Size selection and the sequencing kit are among the many contributing factors.
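The per-million counts annotated in Fig. 3 normalize the number of very long reads by dataset size, so that runs of different yield can be compared. A minimal sketch of that normalization follows, using the ≥ 100 kb and ≥ 300 kb thresholds from the figure. The function name and the toy read lengths are hypothetical and are not taken from the datasets analyzed here.

```python
def reads_per_million(read_lengths, min_length):
    """Number of reads with length >= min_length per million total reads."""
    total = len(read_lengths)
    if total == 0:
        return 0.0
    long_reads = sum(1 for n in read_lengths if n >= min_length)
    return long_reads * 1_000_000 / total


# Toy dataset of 1000 reads: mostly 8 kb, a few very long ones.
lengths = [8_000] * 996 + [120_000] * 3 + [350_000]

print(reads_per_million(lengths, 100_000))  # reads >= 100 kb per million
print(reads_per_million(lengths, 300_000))  # ULRs (>= 300 kb) per million
```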

It would be interesting to further understand how the protocol influences the production of ULRs. We thus compared standard ONT protocols with the method used by Jain et al. for producing more ULRs. Figure 4 shows the results generated from three protocols: the first with the standard ligation kit (SQK-LSK108 1D ligation kit), the second with the standard rapid kit (SQK-RAD002 genomic DNA kit) and the third with a modified protocol using the rapid kit (SQK-RAD002 genomic DNA Rapid kit) on input DNA extracted by a modified Sambrook and Russell protocol, each followed by MinION sequencing. Mean read length from the modified protocol was about 3.5 and 1.5 times higher than that from the standard ligation kit and the standard rapid kit, respectively, while the N50 from the modified protocol was more than 8.5 and 2.5 times higher than that from the standard ligation kit and the standard rapid kit, respectively. ULRs (i.e., read length ≥ 300 kb) totaled around 2 Gb, 472 Mb and 84 Mb for the modified protocol, the standard ligation kit and the standard rapid kit, respectively. The modified protocol is about 700-fold more efficient than the ligation kit in producing ULRs (7000 vs. 12 ULRs per million reads).

Fig. 4

Comparison of read lengths between libraries produced from different protocols with a focus on ultra-long reads. Reads produced from ligation kit, rapid kit and modified protocol (Ultra) were compared to highlight ultra-long reads. Figures are generated with NanoPack tools (https://github.com/wdecoster/nanopack). Numbers written below the blue line are number of reads with length ≥ 100,000 bp per million of total reads. Numbers written above the red line are number of reads with length ≥ 300,000 bp per million of total reads
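The protocols above are compared by mean read length and by N50, the length L such that reads of length ≥ L contain at least half of all sequenced bases. A minimal sketch of the N50 calculation is given below with toy values. The function name is hypothetical, and dedicated tools such as NanoPack report these statistics directly.

```python
def n50(lengths):
    """Length L such that reads of length >= L hold at least half the bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


toy_lengths_kb = [2, 3, 4, 5, 6, 10]
mean_kb = sum(toy_lengths_kb) / len(toy_lengths_kb)
print(mean_kb, n50(toy_lengths_kb))
```

Because N50 weights each read by the bases it contributes, a modest number of very long reads shifts N50 more than it shifts the mean, which is in line with the larger fold change seen for N50 than for mean read length with the modified protocol.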

The advantage of ULRs in genome assembly is evident from the report by Jain et al. (2018). Interestingly, adding 5X coverage of ULRs increased NG50 at least twofold (from ~ 3 Mb with long reads alone to ~ 6.4 Mb). Assembly contiguity also increased, while yield per flow cell was not compromised by ULRs and sequencing accuracy did not decrease with the read length of ULRs.

Applications leveraging single-molecule sequencing technologies

De novo genome assembly

Long reads greatly benefit genome assembly by increasing N50 and contiguity, whereas short reads result in highly fragmented assemblies. Since long reads span regions of high GC content, low complexity and repeats, they resolve the bubbles formed in de Bruijn graph-based assembly and also determine the lengths of microsatellites and tandem repeat regions.
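The effect of read span on graph ambiguity can be illustrated with a toy de Bruijn graph. In the sketch below, a node is a (k-1)-mer and a branching node is one with more than one distinct successor. Since a k-mer must fit within a read, k serves here as a stand-in for read length. The sequence, the function name and the choice of k values are hypothetical, and the k-mers are taken from an error-free sequence rather than from reads. Long-read assemblers in practice often use overlap-based strategies rather than this construction.

```python
from collections import defaultdict


def branching_nodes(sequence, k):
    """Count (k-1)-mer nodes that have more than one distinct successor."""
    successors = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        successors[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return sum(1 for nxt in successors.values() if len(nxt) > 1)


# An 8-bp repeat occurring twice with different flanking sequence.
repeat = "GCGTACGT"
genome = "AAC" + repeat + "TTG" + repeat + "CCA"

print(branching_nodes(genome, 4))   # k shorter than the repeat
print(branching_nodes(genome, 11))  # k longer than the repeat
```

With k shorter than the repeat, the graph contains branch points where the assembler cannot tell which flank follows; once k exceeds the repeat length, each node has a single successor and the path is unambiguous.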

Giordano et al. (2017) showed that the yeast genome could be assembled from 31X of PacBio or ONT reads with accuracies of about 99% and 98%, respectively. De novo assembly of the GM12878 human genome from 30X Nanopore reads can produce an assembly with an NG50 of around 3 Mb, whereas adding 5X ULRs can increase the NG50 to ~ 6.4 Mb (Jain et al. 2018). Thus, SMS platforms offer a better solution for large genome assembly (Seo et al. 2016; Shi et al. 2016). In particular, the ONT method, which can provide reads of up to 2 Mb, would be extremely helpful for achieving high contiguity and resolving repetitive regions.

Sequencing tandem repeats in human diseases

Tandem repeat (TR) regions are short genomic sequences of up to a thousand base pairs that are repeated multiple times in the human genome (de Koning et al. 2011). Although mostly found in non-coding regions, some tandem repeats also occur in coding regions (Usdin 2008). It has been reported that up to 9% of genes have tandem repeats within their coding regions and about 12% of genes have tandem repeats in their promoter regions. These regions are hypermutable, are often used in forensics and population genetics, and are associated with several genetic diseases. Compared with other genomic elements, tandem repeats have mutation rates that are 10- to 10,000-fold higher (Ameur et al. 2019; Duitama et al. 2014).

The read lengths offered by SMS allow fine-scale profiling of tandem repeat regions, and various diseases caused by tandem repeat expansions have been studied with both PacBio and ONT (Ameur et al. 2019; Loomis et al. 2013; Mitsuhashi et al. 2017).

McGinty et al. (2017) demonstrated the potential of nanopore sequencing in characterizing the role played by tandem repeats in chromosomal rearrangement, with a sequencing time much shorter than that of PacBio sequencing. Using nanopore sequencing of 11 individuals (6 of them Alzheimer’s patients), De Roeck et al. (2018, 2019) showed that variable number tandem repeat (VNTR) expansion results in increased risk of Alzheimer’s disease. They used NanoSatellite software for resolving tandem repeats from PromethION data. In a direct comparison, both PacBio and ONT were able to sequence through the SCA36 ‘GGCCTG’ and the C9orf72 G4C2 repeat expansions. Both regions were cloned into plasmids and then sequenced with PacBio and ONT MinION. While median read lengths were similar on the two platforms, MinION had a higher percentage of reads that spanned these repeat regions (Ebbert et al. 2018).
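What it means for a read to span a repeat expansion can be sketched as follows: the read must contain the whole repeat run plus some unique flanking sequence on both sides, so that the copy number can be counted. The example uses the SCA36 ‘GGCCTG’ motif mentioned above. The function names, flank length and toy read are hypothetical, and exact string matching is a deliberate simplification, since real long reads carry errors and tools such as NanoSatellite use more robust approaches.

```python
import re


def longest_repeat_run(read, motif):
    """Return (copies, start, end) of the longest uninterrupted motif run."""
    best = (0, -1, -1)
    for match in re.finditer(f"(?:{re.escape(motif)})+", read):
        copies = (match.end() - match.start()) // len(motif)
        if copies > best[0]:
            best = (copies, match.start(), match.end())
    return best


def spans_repeat(read, motif, min_flank=5):
    """True if the run has at least min_flank bases on each side."""
    copies, start, end = longest_repeat_run(read, motif)
    return copies > 0 and start >= min_flank and len(read) - end >= min_flank


read = "ACGTTAGC" + "GGCCTG" * 12 + "TTAGACCA"
print(longest_repeat_run(read, "GGCCTG"))
print(spans_repeat(read, "GGCCTG"))       # full read reaches both flanks
print(spans_repeat(read[:60], "GGCCTG"))  # truncated read ends inside repeat
```

A read that ends inside the repeat gives only a lower bound on the copy number, which is why the percentage of spanning reads is a useful platform metric.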

Although both PacBio and ONT suffer from high error rates, the ability of ONT to detect nucleotides directly without DNA synthesis, and hence without GC bias, makes it attractive for analyzing tandem repeats (Bahlo et al. 2018). With ONT sequencing, novel disease-associated tandem repeats can be identified in a cost-effective manner (Gießelmann et al. 2018).

Detecting complex chromosomal rearrangements and structural variants

Complex genomic rearrangements (CGRs) refer to insertions, deletions, inversions, duplications, and translocations of variable genomic sequences (Sudmant et al. 2015). These genomic sequences, which are frequently repetitive in sequence and polymorphic in structure and length, contribute to the etiology of a number of diseases, including cancer (de Koning et al. 2011; Macintyre et al. 2016; Tubio 2015), early-onset Alzheimer’s disease (Rovelet-Lecrux et al. 2006), and autism (Hedges et al. 2012).

McGinty et al. (2017) demonstrated the potential of nanopore sequencing to characterize the DNA repair pathways involved in (GAA)n-induced CGRs. In the study, they showed that ONT long reads can detect the CGR breakpoints with single base pair resolution. The intricacies of CGR would not have been discovered without long reads.

Characterization of genomic SVs by paired-end (PE) short reads often results in false negatives or false positives, whereas long reads are more likely to span questionable repetitive regions or SV breakpoints. To facilitate alignment and SV detection with long reads, the open-source methods NGMLR and Sniffles were introduced by Sedlazeck and colleagues (Sedlazeck et al. 2018). A comparison between PacBio and ONT data using these methods was discussed above in “Comparison between PacBio and ONT platforms”. In a genomic analysis of two chromothripsis patients, short reads and long reads were compared for the identification of complex structural variants. The authors showed that 32% more SVs could be identified using NanoSV with long reads from ONT’s MinION than with short reads (Cretu Stancu et al. 2017).

Haplotype phasing of variants and dissecting the complexities of highly polymorphic regions MHC/HLA

A diploid human genome generally has 4–5 million sites that differ from a reference genome. Most genomic variations are heterozygous, and their density across the genome varies with the ethnic and geographic background of the parents (Eberle et al. 2017). The haplotype information of parental alleles affects the analysis of allele-specific expression, DNA-binding sites, de novo mutations and other genomic features. Owing to the limitations of current methods, there is great interest in acquiring haplotype information directly from the reads (Raymond et al. 2005; Tewhey et al. 2011). In simple terms, variants can be phased if they are present on the same read. Longer reads cover more variants to be phased, but read length, sequencing errors and fluctuating coverage can be major limiting factors that may induce false positives and false negatives (Cretu Stancu et al. 2017; Delaneau et al. 2013; Laver et al. 2016).
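The idea that variants can be phased when they share a read can be sketched by grouping heterozygous sites into phase blocks: two sites fall in the same block if a chain of reads links them. The union-find sketch below only delimits blocks and does not assign alleles to haplotypes. The function name, site positions and alleles are hypothetical, and reads are assumed to be error-free.

```python
def phase_blocks(reads):
    """Group heterozygous sites into blocks linked by shared reads.

    `reads` is a list of dicts mapping site position -> observed allele.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for read in reads:
        sites = sorted(read)
        for site in sites:
            find(site)  # register sites covered by a single-site read
        for a, b in zip(sites, sites[1:]):
            union(a, b)

    blocks = {}
    for site in list(parent):
        blocks.setdefault(find(site), []).append(site)
    return sorted(sorted(block) for block in blocks.values())


short_reads = [{100: "A", 250: "G"}, {5000: "T"}, {9000: "C", 9100: "A"}]
long_read = {100: "A", 250: "G", 5000: "T", 9000: "C", 9100: "A"}

print(phase_blocks(short_reads))
print(phase_blocks(short_reads + [long_read]))
```

In this toy case the short reads leave three disconnected blocks, while a single long read covering all five sites merges them into one.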

The hyperpolymorphic major histocompatibility complex (MHC), or human leukocyte antigen (HLA) region, encodes proteins of the antigen presentation pathway. Variations in the 3.6 Mb genomic region of the MHC, located on chromosome 6p21, are associated with immune response. These genes define susceptibility or resistance to various infections and confer hypersensitivity to specific drugs (Traherne 2008; Trowsdale 1993).

In clinical practice, precise HLA genotyping is imperative before allogeneic hematopoietic stem cell transplants or organ transplants to ascertain HLA compatibility between donor and recipient (Sasazuki et al. 2016). However, unambiguous HLA genotyping is technically challenging because of high polymorphism, high sequence similarity and extreme linkage disequilibrium (Profaizer et al. 2016). To date, 24,093 allelic variants have been identified in human genomes (http://www.ebi.ac.uk/ipd/imgt/hla/stats.html).

The advent of NGS greatly benefited HLA genotyping, as it offered higher throughput and better resolution than previous technologies (Erlich et al. 2011; Ozaki et al. 2015; Wang et al. 2012). Although NGS methods provide good resolution, sequencing time and the phasing of HLA genes and associated regulatory regions remain a challenge (Hosomichi et al. 2015). TGS techniques offer longer read lengths and provide sequence information for HLA regions that are hard to assemble with short reads, facilitating the identification of novel alleles, phasing and haplotype identification (Ambardar and Gowda 2018; Lang et al. 2018; Turner et al. 2018). Despite the high error rate, TGS technologies provided 100% accuracy in class I HLA typing at two-field resolution (Liu et al. 2018). For resolving HLA architecture, a combination of NGS and TGS can improve clinical practice.

RNA sequencing to detect alternative splicing/transcripts or RNA isoforms

Long-read RNA sequencing is superior to NGS short-read RNA-Seq for detecting alternatively spliced transcripts or transcript isoforms. Short reads cannot span full gene transcripts, and their coverage of inter-exonic or intra-exonic regions may fluctuate severely, making the data difficult to interpret by bioinformatics means (Bayega et al. 2018; Steijger et al. 2013). SMS long-read sequencing has been found particularly useful for comprehensive characterization of RNA isoforms at various levels, including single-cell transcriptome analysis (Boldogkoi et al. 2019).

In a study using an RNA-seq approach together with Nanopore MinION long-read sequencing to investigate isoform diversity in individual B cells, Byrne et al. (2017) showed that complex isoforms can be precisely quantified at the single-cell level. Their approach can be very useful for the study of immune response. In another analysis, of alternatively spliced transcripts in the plant Amborella trichopoda, Liu and colleagues demonstrated the feasibility of using PacBio Iso-Seq long reads to identify alternative splicing events without a reference genome (Liu et al. 2017).
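When a reference genome is available, the advantage of full-length reads for isoform analysis can be sketched simply: each read that covers a whole transcript reveals its complete chain of introns, so reads can be grouped into isoforms directly instead of being inferred from fragments. In the sketch below the function names and coordinates are hypothetical, alignments are assumed to be accurate, and exon blocks are given as half-open (start, end) genomic intervals.

```python
from collections import Counter


def intron_chain(exons):
    """Return the introns implied by a read's aligned exon blocks."""
    ordered = sorted(exons)
    return tuple((prev_end, next_start)
                 for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]))


def count_isoforms(reads):
    """Count reads per distinct intron chain."""
    return Counter(intron_chain(exons) for exons in reads)


reads = [
    [(100, 200), (300, 400), (500, 600)],  # exons 1-2-3
    [(120, 200), (300, 400), (500, 580)],  # same introns, ragged read ends
    [(100, 200), (500, 600)],              # exon 2 skipped
]

for chain, n in count_isoforms(reads).items():
    print(n, chain)
```

Grouping by intron chain rather than by read boundaries tolerates ragged 5′ and 3′ ends, so the first two toy reads count as one isoform and the exon-skipping read as another.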

Direct sequencing of RNA

RNA is recognized as a crucial component for interrogating biological phenomena, and direct RNA sequencing is gaining new momentum through the ONT platform.

For the PacBio SMRT sequencing method, cDNA prepared from RNA can be used as input without a fragmentation step, thereby providing full-length information on the RNA. Because the chemistry of ONT sequencing relies on nanopores through which either a DNA or an RNA molecule can pass, ONT permits direct RNA sequencing. A recent study of the herpes simplex virus type 1 transcriptome by Depledge et al. (2019) demonstrated the superior capability of direct RNA-seq in redefining transcriptional complexity, as a novel class of chimeric transcripts was detected. They also stated that the high error rate can be partially overcome by error correction using Illumina short reads. Direct sequencing of the influenza RNA genome was reported by Keller et al. (2018), who designed an adapter targeting the conserved termini of the viral genome in order to direct the (-) sense RNA into the MinION protein nanopore. Taken together, nanopore direct RNA sequencing possesses ample advantages and is expected to benefit the understanding of host–pathogen interactions.

Direct sequencing of epigenetic/methylation markers

Both PacBio and ONT methods have proven more advantageous than the current bisulfite method, as they provide direct identification of various types of nucleotide methylation, not limited to 5mC (Clarke et al. 2009; McIntyre et al. 2019; Rand et al. 2017; Simpson et al. 2017; Xiao et al. 2018).

Euskirchen and colleagues screened glioma samples to identify copy number variations and methylation profiles in IDH1, IDH2, H3F3A, TP53 and TERT promoters using deep amplicon sequencing with Nanopore MinION Mk 1B/R9 or R9.4 flow cells (Euskirchen et al. 2017). The study was designed to achieve same-day detection of structural variants, point mutations, and methylation profiles using a nanopore device. A significant correlation was observed between the outcomes of nanopore sequencing and data generated from short-read exome sequencing, Sanger sequencing, SNP array, and/or genome-wide methylation microarray. The nanopore sequencing method outperformed hybridization-based methods and current sequencing technologies in terms of time to diagnosis and the laboratory equipment and expertise required. Overall, the ONT method can be applied to precision medicine for cancer patients in resource-limited settings, within a short time and in a cost-effective manner.

Fast sequencing in PGS for decision making

Preimplantation genetic screening (PGS) is the process of screening all 23 pairs of chromosomes to identify genetic defects in an embryo prior to implantation. The success of in vitro fertilization (IVF) depends on the selection of a viable embryo, which was previously based on morphological and developmental characteristics and chromosomal status (Lee et al. 2015b). The reliability of these criteria was found to be very low, and new methods for detailed assessment of genetic defects and aneuploidy are desired (Fragouli et al. 2010; Lee et al. 2015a). NGS-based methods are advantageous over CGH-based methods in cost, in the detection of partial or segmental aneuploidy and mosaicism in multicellular samples, and in automation (Fragouli et al. 2011; Yang et al. 2015). Friedenthal et al. (2018) compared CGH with NGS in single thawed euploid embryo transfer (STEET) and revealed that the implantation rate and the ongoing pregnancy/live birth rate were higher with NGS-based PGS.

There is very little information available about the use of PacBio sequencing in PGS, whereas Nanopore MinION has been suggested by Wei et al. (2018) as a faster sequencing tool for PGS. They showed that the whole procedure of embryo selection can be completed within a day, making the protocol faster than any NGS-based method. The study also showed that the results obtained with Nanopore sequencing were comparable with those of other NGS-based methods. Given that traditional NGS-based methods can be laborious, time-consuming and costly, Nanopore MinION offers a less laborious, faster and cheaper route to PGS decision making, and is thus able to facilitate fresh embryo transfer and reduce stress and cost for patients.

A fast and portable sequencing method in investigating outbreak of human infectious diseases

In the case of an outbreak of a human infectious disease, the first and foremost task is to sequence the genome of the infectious agent. Analysis of the genomic sequence can help infer viral evolution and facilitate the identification of genetic sequences crucial for the agent's survival and transmission, and thus expedite diagnosis and treatment.

PacBio long-read sequencing is able to overcome some of the limitations of short-read methods, such as repetitive sequences and high GC content. However, it also requires a laborious laboratory setup, long runtime and high cost, which limits its use during pathogen outbreaks.

In a disease outbreak, a portable and rapidly deployable setup is desired to lower the cost of transport and to accelerate the diagnostic process. ONT’s MinION has demonstrated its efficacy in handling the Ebola outbreak in western Africa (Hoenen et al. 2016; Quick et al. 2016) and the Lassa virus (LASV) outbreak in Nigeria in 2018 (Kafetzopoulou et al. 2019). Kafetzopoulou et al. (2018) compared Illumina and ONT methods in metagenomic sequencing. By recovering whole-genome sequences of dengue virus and chikungunya virus directly from the serum and plasma of patients, they demonstrated the feasibility of nanopore metagenomic sequencing with lower resource requirements. Nanopore devices can work in less favorable locations and conditions and do not need a sophisticated laboratory setup (Greninger et al. 2015; Hansen et al. 2018). These devices can also reduce risk and avoid the cost and time incurred in carrying samples from place to place.

Summary

Many biological questions can be answered with the various sequencing technologies available to date. The choice of sequencing method for a human genetics project is generally context dependent. Additionally, one can consider the cost, accuracy, running time and technical biases of these methods. The improved read length of TGS/SMS technologies is a milestone in the field of human genetics. Both PacBio and ONT have been continuously upgrading their sequencing solutions toward increased read length and reduced error rate and sequencing cost. Oxford Nanopore’s ability to generate ultra-long reads and to differentiate modified nucleotides at high speed is among its advantages over PacBio methods. Longer reads can greatly benefit genome assemblies of complex organisms, the resolution of tandem repeats and complex structural rearrangements in human diseases, the phasing of haplotypes, the deciphering of the MHC sequence and the identification of correct RNA isoforms. Direct detection of RNA molecules or epigenetic modifications removes the need for reverse transcription in RNA sequencing and for bisulfite treatment in methylation analysis. Sequencing solutions that reduce analysis time would improve decision making in PGS and in outbreaks of infectious diseases.

Despite their advantages, long reads produced by ONT sequencing methods suffer from a high error rate, which might hamper the accuracy of genome sequencing projects. This error rate is expected to decrease to a certain extent as the signal detection systems of these sequencers improve. Many researchers have shown that although long-read technologies have been developed, short reads have not yet lost their relevance. The high accuracy of short reads and the greater length of long reads can be combined to achieve better accuracy. Another limitation faced by projects involving long reads is the computational requirement of analysis. During genome assembly, as the number of reads increases, the number of overlaps to be computed between the reads grows rapidly (quadratically for a naive all-versus-all comparison). Nonetheless, with methods that leverage the power of ultra-long reads, SMS will be a revolution in genomic studies and will create new possibilities.
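The computational burden of overlap detection can be made concrete with a small calculation. The sketch below assumes a naive all-versus-all scheme in which every pair of reads is examined once; the function name `candidate_pairs` is introduced here only for illustration, and practical assemblers typically reduce this workload with indexing strategies rather than comparing every pair.

```python
def candidate_pairs(n_reads):
    """Number of unordered read pairs in a naive all-versus-all comparison."""
    return n_reads * (n_reads - 1) // 2


# Pair counts for illustrative read-set sizes (not taken from any real run).
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} reads -> {candidate_pairs(n):>13,} candidate pairs")
```

Under this assumption, a tenfold increase in the number of reads raises the number of candidate pairs roughly a hundredfold, which is why overlap computation tends to dominate long-read assembly time as data sets grow.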