1 Introduction

The COVID-19 pandemic started in December 2019 and is caused by the SARS-CoV-2 virus. Until the end of 2020, no specific treatment was available. Drug and vaccine development is progressing with the help of bioinformatic approaches. First, however, we outline the historical use of in silico approaches in viral outbreaks.

1.1 Historical Facts in Previous Viral Outbreaks

Interestingly, COVID-19 is not the first viral disease to which bioinformatic approaches have been applied, although their use has expanded greatly during the pandemic. Earlier applications were seen in the Zika and Ebola virus outbreaks, as discussed below. Valuable lessons learnt there were carried over and amplified during the COVID-19 pandemic.

1.1.1 Zika Virus (2015–2016) and Ebola Virus (2014–2016) Outbreaks

Zika virus disease is caused by a flavivirus transmitted primarily by Aedes mosquitoes. Symptoms are mild, last 2 to 7 days, and include fever, rash, conjunctivitis, muscle and joint pain, malaise, and headache. Zika virus outbreaks have been reported multiple times in the last century, with the most recent one in Brazil in 2015. As a countermeasure, Brazilian and American scientists came together in an open, collaborative drug discovery effort. The scientific community reported the application of various computational strategies for drug repurposing and heavy use of molecular docking against the viral proteins. These reports included protein homology modelling, X-ray crystal structures, and the discovery of novel ligands and proteins under the OpenZika project. However, owing to a lack of corroborating in vitro and animal studies, many projects were shut down. Ebola virus, on the other hand, was first discovered in 1976, with fruit bats of the Pteropodidae family as natural hosts, and its most recent outbreak took place in West Africa. Ebola virus disease, previously called Ebola haemorrhagic fever, is a severe, often fatal illness with a mortality rate of around 50%. The drug discovery process for Ebola consisted of computational pharmacophore analysis of Ebola-active compounds, machine learning, in vitro testing, and identification of FDA-approved drugs as candidates (see review [1]). Table 6.1 summarizes the landmark computational studies for the two virus outbreaks.

Table 6.1 Computational approaches in Ebola and Zika drug discovery

2 COVID-19 Pandemic

By 2020, COVID-19 had become a major calamity, as shown by the numbers of COVID-19-positive cases and deaths on the WHO dashboard (https://covid19.who.int/). The WHO declared a global health emergency to coordinate scientific and medical efforts to rapidly develop a cure for patients [13]. Governments implemented social distancing and lockdowns to curb the spread. Many existing antiviral medications have been tried in the hope of slowing the disease or even curing severely affected patients and thus decreasing the mortality rate. One very promising strategy, therefore, was to use bioinformatic approaches, as had been done in the Ebola and Zika outbreaks.

COVID-19 is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a new strain of coronavirus that was isolated from the Huanan Seafood Market, Wuhan, China, in December 2019. The identified primary reservoir is the horseshoe bat, and transfer to humans takes place through unknown intermediate hosts [14]. In general, the family of coronaviruses can cause respiratory, gastrointestinal, hepatic, and central nervous system diseases. SARS-CoV-2 and its nearest neighbours in the phylogenetic tree, i.e. SARS-CoV and MERS-CoV, cause severe respiratory diseases. The widespread transmission of SARS-CoV-2 is due to travel and human-to-human contact [15]. SARS-CoV-2 is currently known to be sensitive to heat and UV rays and to be effectively destroyed by 75% ethanol, acetic acid, and chlorine-containing disinfectants [16].

2.1 Current Bioinformatic Efforts in COVID-19

We highlight below the bioinformatic efforts used in drug discovery research for COVID-19 until the end of 2020.

2.1.1 Genomic Efforts: Sequencing Efforts

Whole genome sequences of SARS-CoV-2 have been obtained from patients in several countries, including Brazil, China, Germany, and the USA. They were made publicly available as soon as they were sequenced to accelerate scientific research. There are currently more than 300 samples online. All the sequenced samples were found to be closely related, with few mutations, which points towards a common ancestor. For example, the Brazilian genome differed from the Wuhan reference strain by three mutations, and two of these three mutations were shared with a German sample. Efforts have been made to provide complete genome sequences and thus aid the analysis of how the genes are translated into proteins and what functions those proteins may have, especially in the SARS-CoV-2 infectivity pathway. Genome sequencing is the starting point for all analyses of the structure and function of the resulting proteins. In addition, it provides knowledge about the origins of SARS-CoV-2 and thus its transmission profile. For scientific research to move forward rapidly, it was crucial that whole genome sequencing of SARS-CoV-2 was performed at a fast pace.

SARS-CoV-2 belongs to the genus Betacoronavirus and the subgenus Sarbecovirus. The virion, or infecting particle, consists of an envelope containing a single-stranded, positive-sense RNA. The first genome, accession number NC_045512.2, was isolated from a patient in Wuhan, Hubei province, China, and named SARS-CoV-2 Wuhan-Hu-1. The numbers of GenBank sequences and next-generation sequencing entries stand at 39751 and 4266, respectively (data taken from NCBI-NLM SARS-CoV-2 Resources on November 12, 2020). Current estimates suggest a genome size of 29.9 kb and 11 open reading frames (ORFs). The organization of the genes encoding the various proteins is shown below.

5′-leader-UTR-replicase (ORF1ab)-Spike (S)-Envelope (E)-Membrane (M)-Nucleocapsid (N)-3′-UTR-poly(A) tail-3′

Figure 6.1 illustrates the location of the different proteins on the SARS-CoV-2 virion. Unsurprisingly, the genome of SARS-CoV-2 is very similar to those of SARS-CoV (82%), bat-CoV-RaTG13 (96%), and bat-SL-CoVZC45 (86.9%). The main difference lies in the longer branch length to the bat viruses. Although mutations are being observed between the various SARS-CoV-2 strains isolated from patients, the reported similarity among them is 99.98%. Phylogenetic analysis of 160 genomes has shown three main variants (classified as ancestral types A, B, and C) with characteristic mutations: the synonymous mutations T29095C and T8782C are identified in type A and type B, respectively, and the non-synonymous mutations C28144T (Leu to Ser) and G26144T (Gly to Val) are detected in type B and type C, respectively [17]. This knowledge is being employed to identify genome-based community hotspots. For some mutations, radical changes in protein functionality, host specificity, or virus infectivity have been seen (Table 6.2). Mutational hotspots are located at positions 1397, 2891, 14408, 17746, 17857, 18060, 23403, and 28881. Knowledge of mutational variants is crucial for assessing possible drug resistance and differences in COVID-19 clinical presentation, and for designing COVID-19 vaccines and rapid diagnostic assays. Structural genomic analysis has shown that the viral genome encodes four structural proteins (spike, envelope, membrane, and nucleocapsid) and non-structural proteins such as the main protease and RdRp [23].
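The mutation labels used above (e.g. T8782C) and the percent similarity between strains can both be derived from a pairwise comparison of aligned genomes. The following is a minimal Python sketch under stated assumptions: the two sequences are already aligned to equal length, positions are 1-based alignment columns (which match reference coordinates only if the alignment contains no insertions relative to the reference), and the 20-base toy sequences are invented for illustration. It is not the pipeline used in the cited studies.

```python
from typing import List

# Characters that are excluded from comparison: alignment gaps and unknown bases.
SKIP = {"-", "N"}


def call_substitutions(reference: str, sample: str) -> List[str]:
    """List point substitutions between two pre-aligned, equal-length sequences.

    Labels follow the <reference base><position><sample base> pattern,
    with 1-based positions counted along the alignment.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    calls = []
    pairs = zip(reference.upper(), sample.upper())
    for position, (ref_base, alt_base) in enumerate(pairs, start=1):
        if ref_base in SKIP or alt_base in SKIP:
            continue
        if ref_base != alt_base:
            calls.append(f"{ref_base}{position}{alt_base}")
    return calls


def percent_identity(reference: str, sample: str) -> float:
    """Percentage of compared columns with identical bases (gaps and N skipped)."""
    compared = [
        (r, s)
        for r, s in zip(reference.upper(), sample.upper())
        if r not in SKIP and s not in SKIP
    ]
    if not compared:
        return 0.0
    matches = sum(1 for r, s in compared if r == s)
    return 100.0 * matches / len(compared)


# Hypothetical toy data, not real SARS-CoV-2 sequence.
toy_reference = "ATGGCTTACGATCGTACGTA"
toy_sample = "ATGGCTTACGTTCGTACGCA"

if __name__ == "__main__":
    print(call_substitutions(toy_reference, toy_sample))
    print(percent_identity(toy_reference, toy_sample))
```

On real genomes, the inputs would come from a multiple sequence alignment against the reference, and insertions and deletions would need separate handling.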

Fig. 6.1 Proteins on SARS-CoV-2 virion

Table 6.2 Common SARS-CoV-2 proteins and current structure models

2.1.2 Protein Structure-Based Methods

2.1.2.1 Homology Modelling of SARS-CoV-2 Proteins

Considerable work has been accomplished in developing homology models of SARS-CoV-2 proteins. Leading the way is the main viral protease (Mpro). Other proteins that have been modelled heavily are the spike (S) protein, the human target ACE2, and RdRp. Table 6.2 provides the current list of protein targets and highlights whether a specific target is over- or underutilized in computational studies.

Knowledge of the three-dimensional structure of proteins is crucial for developing drugs that can modulate a protein’s action. Structural information, obtained through X-ray crystallography, cryogenic electron microscopy, or nuclear magnetic resonance, reveals binding sites and the mechanism of action or inhibition. Three-dimensional structures are also crucial to molecular docking studies, for which they serve as the starting point. These models have been used in docking to understand mechanisms of viral infection and possible treatments. Currently, there are over 400 structures in total available for various SARS-CoV-2 targets on RCSB PDB.

2.1.2.2 Molecular Docking Studies

Docking simulations have been performed for both small-molecule and antibody-based treatment strategies. Small-molecule strategies require a target protein structure, screening of molecules that bind this protein by molecular docking, and validation by molecular dynamics simulations. Many studies have been performed in a very short span of time using the rapidly released target structures (Table 6.2). In the antibody-based approach, antigen-antibody docking simulations have predicted high-affinity binding of human antibodies to SARS-CoV-2 proteins. For example, the human antibody CR3022 has been shown to possess a high affinity for the spike protein [24].

Unfortunately, many early studies lacked proper validation of molecular docking data (redocking or molecular dynamics), and thus reproducibility of the results is highly doubtful. This trend was seen for many drug or herbal chemical candidates in the early days of COVID-19. Studies that came a little later validated the data by molecular dynamics and provided important parameters regarding drug/herbal chemical stability inside the binding pocket of protein target and possible hydrogen bonding and hydrophobic interactions during the simulations [25,26,27,28]. Very few direct validations of molecular docking protocol by redocking with the known ligand, i.e. native ligand on the X-ray crystal structure, were performed in the early days [1, 24, 29].
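Redocking validation is usually summarized as the root-mean-square deviation (RMSD) between the docked pose of the native ligand and its crystallographic pose. The Python sketch below illustrates that calculation under stated assumptions: atoms are listed in the same order in both poses, both poses are in the receptor's coordinate frame (no superposition is performed), ligand symmetry is ignored, and the 2.0 Å acceptance cutoff is a widely used convention rather than a value taken from this chapter. The coordinates are invented toy values.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def pose_rmsd(crystal: Sequence[Coord], docked: Sequence[Coord]) -> float:
    """RMSD between two ligand poses, in the same units as the coordinates.

    No fitting is applied: the poses are compared in place, as is usual
    when a ligand is redocked into its own crystal structure.
    """
    if len(crystal) != len(docked) or len(crystal) == 0:
        raise ValueError("poses must be non-empty and have matching atom counts")
    squared_sum = sum(
        (a - b) ** 2
        for atom_crystal, atom_docked in zip(crystal, docked)
        for a, b in zip(atom_crystal, atom_docked)
    )
    return math.sqrt(squared_sum / len(crystal))


def redocking_passes(rmsd: float, cutoff: float = 2.0) -> bool:
    """True if the redocked pose is within the cutoff (assumed 2.0 angstrom)."""
    return rmsd <= cutoff


# Hypothetical three-atom ligand; every docked atom is displaced by 0.5 angstrom.
toy_crystal = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
toy_docked = [(0.3, 0.4, 0.0), (1.8, 0.4, 0.0), (1.8, 1.9, 0.0)]

if __name__ == "__main__":
    value = pose_rmsd(toy_crystal, toy_docked)
    print(round(value, 3), redocking_passes(value))
```

A docking protocol that cannot reproduce the crystallographic pose of a known ligand within the chosen cutoff gives little reason to trust its poses for untested compounds.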

2.1.2.3 Drug Repurposing

Because drug discovery is a long process, starting from a completely new drug candidate would have meant a long wait while the COVID-19 pandemic was gaining momentum. Drug repurposing, previously employed for many related (MERS and SARS-CoV) and unrelated diseases, therefore appeared to be a much more rapid alternative. In this process, many drugs have been screened [1, 24]. The candidate molecules may be tested in SARS-CoV-2 enzyme assays or for their ability to antagonize protein function or slow viral infectivity. It is hypothesized that drugs previously used in other viral diseases could also be used. It is important to make sure that the drugs are generally available for further experimentation and development in case they show promise; it would be futile to pursue compounds that exist only as structural moieties or are only at an investigational stage. A simple and promising repurposing strategy is to screen a class of compounds rather than drugs at random. FDA-approved HIV-1 protease inhibitors, e.g. atazanavir, ritonavir, and lopinavir, as well as hepatitis C NS3/4A protease inhibitors, have been successfully docked into the Mpro active site of SARS-CoV-2. China and South Korea have treated COVID-19 patients with Kaletra, a combination of lopinavir 200 mg/ritonavir 50 mg, with some benefits [30].

2.1.3 Database Generation

Databases are essential for accelerating research, especially because they provide a common reference point for scientists worldwide. Online databases offer large storage capacity and are continuously updated. For SARS-CoV-2, extensive comparisons between the genomic data available online have provided a better understanding of the virus’s origin. Genomic comparisons also helped in producing specific primers for RT-PCR detection kits as early as possible. Many databases have proven extremely useful in the pursuit of treatment strategies for COVID-19. Primary databases, such as nucleotide and protein sequence databases, are used extensively to submit and obtain sequence information for COVID-19 targets. Secondary databases such as PROSITE, PRINTS, Pfam, InterPro, and PhenomicDB (a genotype-phenotype database) support the understanding of protein structure and function. Molecular structure databases such as Protein Data Bank, SCOP, CATH, and PubChem are important sources of structural information on molecular targets. A few COVID-19-specific databases are listed in Table 6.3.

Table 6.3 COVID-19-specific online databases

These databases have emerged because of the widespread effect of the COVID-19 pandemic across countries and the need for easy sharing of information between scientists. The WHO COVID-19 database provides general sociodemographic information on infected individuals and deaths. ViPR (Virus Pathogen Database and Analysis Resource) has been updated to include SARS-CoV-2-related data, tools, and analyses. CoV-AbDab provides data on COVID-19 antibodies. The International Nucleotide Sequence Database Collaboration (INSDC) has released a public statement entitled ‘INSDC Statement on SARS-CoV-2 sequence data sharing during COVID-19’, which highlights the importance of sharing SARS-CoV-2 sequence data within the international scientific community. The INSDC recommends that all researchers working with SARS-CoV-2 sequence data submit both their raw and consensus (or assembled) SARS-CoV-2 data to the INSDC databases, which are freely available to the scientific community. The COVID-19 Data Portal hosted by EMBL-EBI (https://www.covid19dataportal.org/) enables data sharing across the globe. The initiative facilitates international collaboration to accelerate scientific discovery, monitor the pandemic, and help develop treatments and a vaccine for the new coronavirus. Other SARS-CoV-2 resources can be accessed through NCBI at https://www.ncbi.nlm.nih.gov/sars-cov-2/.
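As an illustration of how such publicly shared sequence data can be retrieved programmatically, the Python sketch below builds a request URL for the NCBI E-utilities efetch service and parses FASTA text of the kind it returns. It is a hedged example, not a description of any workflow in the studies cited here: the helper names are hypothetical, no network request is actually made, and NCBI's usage guidelines (rate limits, optional API key) would apply to real use.

```python
from typing import Dict
from urllib.parse import urlencode

# Public NCBI E-utilities endpoint for record retrieval.
EFETCH_ENDPOINT = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession: str) -> str:
    """Build an efetch URL requesting a nucleotide record as plain-text FASTA."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH_ENDPOINT}?{urlencode(params)}"


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {record identifier: sequence}.

    The identifier is the first whitespace-delimited token of each header line.
    """
    chunks: Dict[str, list] = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


if __name__ == "__main__":
    # Accession of the Wuhan-Hu-1 reference genome named in the text.
    print(efetch_fasta_url("NC_045512.2"))
    # Toy FASTA text, invented for illustration.
    print(parse_fasta(">NC_045512.2 toy record\nATTA\nAAGG\n"))
```

The returned URL could be passed to any HTTP client; the parser is kept separate so that it can be applied equally to files downloaded from other repositories.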

2.1.4 Other Approaches

It should be noted that earlier virus research did not employ many of the tools discussed above. Ebola and Zika research saw the first applications of bioinformatics in the viral disease field, and COVID-19 marks the point at which in silico drug discovery became an established, constructive option. The current bioinformatic tools have helped scientists accelerate drug and vaccine development for SARS-CoV-2. Since the advance of bioinformatics in 1990, many tools have come into use in addition to genome sequencing and molecular modelling. Software tools (both online and offline) are being used extensively for sequence analysis, complete genome sequencing, expressed sequence tags, identification of unknown genes, discovery of splice variants, causes of differences between viruses, pharmacogenetics, next-generation sequencing, etc. Multiple alignment tools, which align experimentally obtained genetic sequences, are important for sequence comparisons and sequence-based database searches and additionally help in phylogenetic analysis. This is evident from the use of RNA sequencing of bronchoalveolar fluid samples from SARS-CoV-2 patients to identify the origins of the virus. The phylogenetic analysis revealed almost 90% similarity of the virus sequences to betacoronaviruses from bats [35]. BLAST (Basic Local Alignment Search Tool) has also been widely used to compare sequences, whether nucleic acid or amino acid. BLAST uses the information already available on biological sequences to infer genetic relationships between organisms. For example, sequence similarity between the SARS-CoV-2 genome and viral metagenomes from pangolins has been seen. The availability of these bioinformatic approaches was particularly important in the discovery of newer drugs for COVID-19, because an understanding of the genetic sequence needs to be followed by protein analysis, i.e. determining the function of the gene product. This was made possible by comparing related sequences and thus inferring similar functionality. Once the protein function could be determined (coupled to experimental evidence), known drugs against related targets could be taken forward for drug repurposing. In addition, drugs could be developed using computational approaches once this basic understanding was obtained [29].
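The sequence comparisons described above rest on pairwise alignment. The Python sketch below shows the classical Needleman-Wunsch global alignment as a small, self-contained illustration; it is not BLAST, which uses a faster heuristic local search, and the scoring values (match +1, mismatch -1, gap -2) and the toy sequences are assumptions chosen for readability.

```python
from typing import Tuple


def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -2) -> Tuple[int, str, str]:
    """Global alignment of two sequences; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table with the best of: diagonal (match/mismatch), up, left (gaps).
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        step = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + step:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))


def aligned_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns with identical, non-gap characters."""
    if not aligned_a:
        return 0.0
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)


if __name__ == "__main__":
    # Toy sequences, invented for illustration.
    total, row_a, row_b = needleman_wunsch("GATTACA", "GATACA")
    print(total)
    print(row_a)
    print(row_b)
    print(round(aligned_identity(row_a, row_b), 1))
```

The dynamic programming table grows with the product of the two sequence lengths, which is why database-scale searches rely on heuristics such as BLAST rather than exhaustive alignment.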

In addition, owing to the rapid pace of COVID-19 research, a comprehensive online repository of knowledge about SARS-CoV-2, its proteins, its mechanism of infection, and more has been created. Vaccine design is also using computational methods to design multi-epitope vaccines against SARS-CoV-2. The process includes prediction of potential epitopes from antigenic protein sequences and construction of the vaccine, followed by molecular docking simulation to assess its binding affinity to the target protein. For example, antigenic epitopes from spike glycoprotein, nucleocapsid, ORF3a, and non-structural proteins have been attempted (reviewed in [24]). Vaccine design has benefitted immensely from the structural genomic and interactomic road maps that describe the molecular mechanisms of viral infection. An example is the X-ray crystal structure of the RBD in complex with human antibody CR3022 (6W41), used for epitope engineering. T- and B-cell epitopes have also been identified through informatics (details in [23]). SARS-CoV models based on ECFP6 descriptors and a Bayesian algorithm have been built as part of an attempt to develop assay control software (Ekins 2020). The method is fast, does not require crystal structures, and enables scoring of small-molecule structures against many models simultaneously. Systems pharmacology approaches have also provided important insights into promising antiviral drugs against SARS-CoV-2 based on pathogenesis mechanism and host specificity. Novel algorithms, mainly network-based and expression-based, are also being used in COVID-19 research. Network-based algorithms generate networks, for example, a drug-target network or the human protein interactome, and use them to identify putative drug candidates. One such study generated an HCoV-host interactome and integrated drugs specific for various targets [36]. Some expression-based algorithms have also aided drug repurposing by linking ACE2 to SARS-CoV-2 in the early days and thus suggesting two potential repurposed drugs that change ACE2 expression. Functional analysis of genomes is carried out in parallel to identify the cellular functions of gene products, coupled with transcriptomics, proteomics, metabolomics, phenomics, and even systems biology. See reviews [23, 24] for details. Table 6.4 provides a few examples of applications of the above-mentioned approaches in COVID-19.
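The idea behind network-based repurposing can be sketched with a simple proximity measure: drugs whose targets lie close to virus-associated host proteins in the interactome are ranked first. The Python example below is a minimal illustration under stated assumptions. The "closest distance" measure (the mean, over a drug's targets, of the shortest-path distance to the nearest virus-associated protein) is one commonly used choice and is not claimed to be the exact method of the cited study; the protein and drug names are hypothetical placeholders.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Graph = Dict[str, Set[str]]


def build_graph(edges: Iterable[Tuple[str, str]]) -> Graph:
    """Build an undirected adjacency map from protein-protein interaction pairs."""
    graph: Graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path_lengths(graph: Graph, source: str) -> Dict[str, int]:
    """Breadth-first search distances (in edges) from source to reachable nodes."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


def closest_distance(graph: Graph, drug_targets: Set[str],
                     disease_proteins: Set[str]) -> float:
    """Mean distance from each drug target to its nearest disease protein."""
    if not drug_targets:
        raise ValueError("a drug needs at least one target")
    total = 0
    for target in drug_targets:
        distance = shortest_path_lengths(graph, target)
        reachable = [distance[p] for p in disease_proteins if p in distance]
        if not reachable:
            return float("inf")  # target is disconnected from the disease module
        total += min(reachable)
    return total / len(drug_targets)


def rank_drugs(graph: Graph, drugs: Dict[str, Set[str]],
               disease_proteins: Set[str]) -> List[str]:
    """Drug names ordered from most to least proximal to the disease proteins."""
    return sorted(drugs, key=lambda name: closest_distance(
        graph, drugs[name], disease_proteins))


# Hypothetical toy interactome and drug-target assignments.
toy_graph = build_graph([("P1", "P2"), ("P2", "P3"), ("P3", "P4"),
                         ("P4", "P5"), ("P2", "P6")])
toy_disease = {"P1", "P6"}
toy_drugs = {"drug_B": {"P4", "P5"}, "drug_A": {"P2"}}

if __name__ == "__main__":
    print(rank_drugs(toy_graph, toy_drugs, toy_disease))
```

Published approaches additionally compare the observed proximity with that of randomly chosen protein sets of matching degree, so that highly connected targets do not appear proximal by chance; that normalization is omitted here for brevity.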

Table 6.4 Examples of bioinformatic approaches used in COVID-19

3 Conclusion

There are many lessons to be learnt from the application of in silico approaches in the COVID-19 pandemic. On the positive side, molecular modelling reduces the cost and time of developing a drug candidate. This is propelled by databases and the collaborative efforts of scientists, who can directly compare the various drug candidates in these databases and assess them for further development.

This was the third instance of application of these techniques, and an important lesson is therefore to rely more on high-resolution crystal structures than on homology models. Otherwise, it will be very difficult to process the huge volume of docking results generated each time. Database generation should also become a priority, to increase collaborative capacity and, with it, data validation by scientists throughout the world. In addition, performing molecular docking on commercially available drugs and herbal chemicals would limit the number of docked molecules while keeping it feasible to take a drug candidate to the next stage of development. Docking chemicals that are not readily available leads to extra extraction or synthesis steps that could be avoided. A collaborative approach, in which scientists team up to work on different aspects such as bioinformatics, chemical synthesis, in vitro testing, and in vivo testing, is thus likely to be fruitful. Notably, comprehensive information about virus pathogenesis in the human body is still not available, which hinders progress in drug development.

In conclusion, given the flood of incomplete or unvalidated molecular modelling studies, proper steps should be taken to curb the propagation of pseudoscience.

4 Future Outlook

The application of bioinformatic approaches in the Ebola and Zika outbreaks and the COVID-19 pandemic has shown that, when properly streamlined, these approaches can be used to plan and develop antiviral drugs and to support the response to pandemics. Target identification, interaction mapping, understanding of structure-activity relationships, and molecular docking and dynamics provide important tools for elucidating the underlying pathogenesis and targeting it with novel drug candidates. When coupled with proper experimental testing, drug development will have far-reaching results. These approaches can increase the safety of drug molecules and their probability of success, e.g. through an understanding of their physicochemical properties. Time must be spent on elucidating the underlying mechanisms so as to provide relief from the clinical symptoms of COVID-19.