Keywords

1 Introduction

The COVID-19 is a contagious infectious disease caused by the new SARS-CoV-2 virus (Severe Acute Respiratory Syndrome Coronavirus 2). The virus is thus known due to its taxonomic and genomic relationships with SARS-CoV-1, identified in 2002 as the etiological agent of an epidemic of the Severe Acute Respiratory Syndrome (SARS). The term COVID, designated in 2020 by the World Health Organization (WHO), means Corona Virus Disease (Coronavirus Disease), while “19” refers to the year 2019 when the first cases were reported in the city of Wuhan, China [1].

The new Coronavirus belongs to the order Nidoviral of the family Coronaviridae and to the ß subgroup genus [2, 3]. According to phylogenetic analyzes, its relationship with the ß genus of the Coronavirus family was evidenced specifically with SARS-CoV-1 [4]. On January 10, 2020, the first genome sequencing of this virus was published in Wuhan, and, until the month of August 2020, more than 78,000 genomic sequences of SARS-CoV-2 obtained worldwide have already been deposited in the GISAID database [5].

2 SARS-CoV-2

SARS-CoV-2 has a positive, linear, single-stranded RNA genome of approximately 30 kb, enveloped and with a diameter of 60–140 nm, with 14 6-11 open reading frames (ORFs) encoding 29 proteins. The ORF1ab encodes proteins for RNA replication and genes for non-structural proteins (nsp) and structural proteins [3]. Sixteen non-structural proteins (nsp 1-16) (Table 1) have been reported to be of great importance for their specific functions in Coronavirus replication. However, the functions of some of the nsps are still unknown or not well understood [6]. The virus also has accessory proteins that interfere with the host’s innate immune response [7].

Table 1 Functions of SARS-CoV-2 non-structural proteins (nsps) [6]

SARS-CoV-2 still has four structural proteins: the main one, which is the protein S (spike), allows the virus to bind to the cell membrane, connecting the host’s ACE2 receptor, being an important determinant of the reach and pathogenicity of the host. This protein is functionally subdivided into the S1 domain, responsible for binding to the receptor and the S2 domain responsible for cell membrane fusion. Protein E (envelope) is associated with the formation of the coronavirus envelope and contains a hydrophobic domain. Finally, protein M (membrane) and protein N (nucleocapsid) interfere in cellular, viral replication and packaging processes [3].

3 Mutations

Genetic mutations are modifications that change the nucleotide sequence of an organism’s DNA (or RNA). In the SARS-CoV-2 genome, the most reported mutations are found, mainly, in spike (S) protein, RNA polymerase, primase RNA and nucleoprotein, being related to changes in the transmissibility and virulence of the virus. For example, the most significant mutations are found in the gene that encodes S protein, such as mentioned in [3]. It is also showed that primer-independent RNA primase (nsp8) contains more mutations than any other protein [8].

Therefore, it is important to sequence and analyze, phylogenetically, the origin of the new coronavirus to understand and prevent new outbreaks of zoonoses caused by mutations that can occur constantly. Thus, we trace in this work relationships with other coronaviruses of wild origin. In addition, this work also shows that knowing the mutation rate, defined as the probability that a change in genetic information will pass to the next generation, helps to predict the infectious capacity of the virus and the damage to human health [9, 10].

4 Sequencing

Sequencing a genome literally means writing a sequence of letters A, C, G and T (or G, C, U and A in the case of RNA), where each letter refers to one of the four types of base, or nucleotides present in DNA, i.e., the substances:

  • A: Adenine

  • C: Cytosine

  • G: Guanine

  • T: Thymine (or U: Uracil, in the case of RNA).

The genetic material of SARS-CoV-2 is more specifically formed by RNA [11].

Nowadays, there are the first, second and third generation sequencing, being the last two called next-generation sequencing, which allow the reading and analysis of these nucleotides for molecular studies. Originally, first-generation sequencing was developed by Sanger in 1975, being a method that generates small sequencing fragments of different sizes, starting from the same point. It was the basic method used for the sequencing of the human genome, however, it has limitations related to the high cost for long size sequences, in addition to the high processing time, since it processes little genetic material per unit of time [12].

Second-generation sequencing (SGS), which emerged in the world in 2005, is able to supply the high cost and low performance of first-generation sequencing, and has a higher yield, with the capacity to sequence a large number of molecules in parallel. In fact, SGS techniques have a low cost for the amount of data it generates, however, they also need polymerase chain reaction (PCR) amplification, which can cause errors, in addition to increasing the complexity and time required for sample preparation [13].

The Sanger sequencing and SGS deliver only short-read DNA fragments within the range of 50–1000 bases [14]. So, due to the need for technologies that are capable of performing long and high-speed readings, third generation sequencing (TGS) is the more suitable. Unlike the other ones, TGS technologies have a direct target on genetic molecules, allowing sequencing in real-time, and making readings available for analysis as soon as they pass through the sequencer [13].

TGS has a low sequencing cost and easy sample preparation without the need of PCR amplification and an execution time significantly faster than SGS technologies. In addition, TGS is able to produce long reads (exceeding several kilobases) for the resolution of the assembly problem and repetitive regions of complex genomes, however [15].

The three commercially available TGS technologies are Pacific Biosciences (PacBio) real-time sequencing, Illumina’s Illuminate Tru-seq Synthetic Long-Read technology, and Oxford Nanopore Technologies’ sequencing platform (available on devices MiniON, GridION and PromethION [16]).

Oxford Nanopore Technologies (ONT), like the others TGS, is cheaper and faster than SGS methods. This has potential advantages over current widely used sequencing technologies (Ion Torrent and Illumina), which depend on sequencing groups of amplified molecules [17]. MinION has the capacity to produce >1,000,000 sequences per day, with average reading lengths of around 20,000 bases and maximum reading lengths close to 1,000,000 bases [18], however, the speed error is greater than SGS methods [17].

ONT sequencing has been increasingly applied in clinical virology, due to its ability to perform long readings. It is also able to get, in the shortest time possible, sample responses to discover as much information about the pathogen candidate, which is an important feature to deal with public health emergencies. Li et al. [19] noticed that Nanopore is efficient in terms of time, both for detection (of nucleotides) and for data analysis, and has been used in different organisms, from the simplest to the more complex (like humans) because there is no maximum limit on the size of the sequence, as occurs in other techniques [20].

Nanopore sequencing has other advantages, such as tracking biomarkers or genes, requiring a low volume of samples with low concentration, and the fact that it does not require complex steps such as amplifications and conversions of genetic material. Thus, Nanopore proves to be a cheaper and more efficient technique than techniques that use PCR [21]. It can be also used for real-time detection of pathogens in complex clinical samples [22]. In addition, one of the great advantages demonstrated by this technique is the direct RNA sequencing. This method is attractive because it eliminates the need for reverse transcription and, therefore, can reduce initialization errors and non-random copies introduced by reverse transcriptase [23].

Its functioning is based on the polarization of the membranes of the Nanopore, allowing to perceive changes in the signal of the electric current of the RNA or DNA molecules when they pass through the pores [24]. Basically, the ionic current depends on the nitrogenous base that is passing at the time of reading, making it possible to know the sequence of nucleotides, as changes in ionic currents are happening [21].

In Brazil, scientists from the Adolfo Lutz Institute, in collaboration with University of São Paulo (USP) and the University of Oxford, used the Nanopore methodology to carry out the first genetic sequencing of the new coronavirus (SARS-CoV-2) in Latin America, carried out in just 48 h after the first confirmed case [25].

The estimated time to sequence SARS-CoV-2 using another technique is longer, so it is evident that the ONT has been used on several occasions. For example, the entire MinION workflow, from sample preparation to DNA extraction, sequencing, bioinformatics and interpretation, was carried out in approximately 2.5 h [26].

There is no doubt about the importance of the MinION sequencer to determine the SARS-CoV-2 sequence, however, its error rate must be considered, which ranges from 5 to 20% [26]. That is why currently there are different protocols to reduce the error rate, such as nanoCORR—Error-correction tool for nanopore sequence data [27], NanoOK—Software for nanopore data, quality and error profiles [28], and Nanocorrect—Error-correction tool for nanopore sequence data [29, 30]. Also, in ARTIC network—Real-time molecular epidemiology for outbreak response—the protocol starting a MinION sequencing run using MinKNOW [31]. Otherwise, without these protocols it is difficult to carry out an analysis and/or trace the phylogeny of this virus. Thus, with different protocols and sequencing methods in tandem repetition (concatemer), reduction of error rates from 1 to 3% is obtained [26].

This NGS used in Brazil to sequence the new coronavirus shows its benefits, allowing the monitoring of the epidemic in real-time and tracing the behavior of the virus locally and globally. Also, through the analysis of genetic variations, it is possible to predict the route of transmission and dispersion of the virus (allowing, in the future, the development of vaccines), verifying the evolution of the effectiveness of drugs on it, and conducting epidemiological research. Thus, as previously mentioned, sequencing SARS-CoV-2 using an efficient and cost-effective technology, such as is the case of portable sequencing devices as the Oxford Nanopore MinION, is of great importance for the humanity today [23].

5 Conclusions

After analyzing the various studies available in the literature so far, it is concluded that sequencing and analyzing the new coronavirus, in order to detect its mutations, is extremely necessary to prevent new outbreaks (of zoonoses). From this kind of analysis, it is possible to trace a phylogenetic relationship between the virus that causes COVID-19 and other coronaviruses present in the wild to predict its possible damage to human health and its capacity for infection, as well as to know and analyze its mutation rate, as it is transmitted from human to human, and predict future therapeutic targets [9].