2.1 Introduction

Bioinformatics is an interdisciplinary branch of the biological sciences that applies computational methods to the collection, storage, and analysis of biological data. In recent years, several omics projects in plants have been carried out, and they have generated a vast amount of sequencing data. These omics data are generated through traditional or high-throughput next-generation sequencing (NGS) approaches and belong to the genome, transcriptome, proteome, or metabolome of the plants (Knasmüller et al. 2008). The term genome refers to the complete nuclear chromosomal DNA sequence of an organism, whereas the total messenger RNA (mRNA) content of a cell at a given time is termed the transcriptome. The transcriptome varies with plant developmental stage and external environmental conditions. Translation of the mRNA produces the proteome. During cell metabolism, primary and secondary metabolites are generated, and the complete set of metabolites present in the cell is called the metabolome (Lister et al. 2009; Saito and Matsuda 2010). In addition, changes in gene expression that arise during the lifetime of an organism without altering its original genetic material (DNA), and that can be inherited by the next generation, are termed epigenetic changes.

The data and related information obtained from plant omics can be useful for generating high-density linkage maps, allele mining, QTL mapping, genome-wide association studies (GWAS), SNP genotyping, identification of simple sequence repeats (SSRs), and a better understanding of metabolic pathways and their regulation. All this information may be helpful for better plant breeding and improvement programs.

In addition, with the support of advanced experimental evidence, various bioinformatics databases have been developed and curated (Shinozaki and Sakakibara 2009). These databases help to discover new and previously unknown information about novel plants and other organisms. The National Center for Biotechnology Information (NCBI) is among the world's largest resource databases, storing a vast amount of data in various categories. Various other databases dedicated to specific plants are also available, such as the rice genome annotation project (RGAP) database for rice (Kawahara et al. 2013) and The Arabidopsis Information Resource (TAIR) for Arabidopsis (https://www.arabidopsis.org/), along with broader resources such as Phytozome (https://phytozome.jgi.doe.gov/pz/portal.html) and OmicsDI, an open source platform facilitating the access and dissemination of omics datasets (https://www.omicsdi.org). Phytozome and OmicsDI are among the comprehensive omics databases and include several kinds of datasets, including genomic, transcriptomic, proteomic, and metabolomic data (Goodstein et al. 2012). Another important tool is ODG (Omics database generator), which is used for generating, querying, and analyzing multi-omics comparative databases to facilitate biological understanding (Guhlin et al. 2017). A list of various omics integration software tools and web applications is provided in Table 2.1.

Table 2.1 Summary of multi-omics integration software tools and web applications

The present chapter describes the available tools and techniques used for the curation, interpretation, and functional analysis of biological data using web-based resources. It also describes the databases available online that can be used to extract functional and structural information about unknown genes and proteins of novel plants. Relevant resources for validating metabolic pathways are included as well. A basic overview of the workflow of the different omics analyses is provided in Fig. 2.1.

Fig. 2.1 Basic workflow of omics analysis

2.2 Relevance of Bioinformatics in Genomics

DNA polymorphism is the variation of nucleotides in the genomic DNA. These variations can arise as single nucleotide polymorphisms (SNPs), insertions and deletions (InDels), or simple sequence repeats (SSRs). SNPs are locations within the genome where the original nucleotide is substituted with another nucleotide, whereas InDels are insertions or deletions of nucleotides in the genome; both kinds of change are heritable from one generation to the next. The length of insertions and deletions in the genomic DNA varies from one to many bases. However, insertions or deletions of three nucleotides are very common (Chai et al. 2018; Jain et al. 2014). This could be an evolutionary adaptation, as three nucleotides code for one amino acid. SSRs are another type of genetic variation in the genome and consist of tandem repeats of a motif of one to ten nucleotides. In practice, however, only repeats with a motif of two or more nucleotides and a specified number of repetitions are considered SSRs during analysis (Agarwal et al. 2015; Daware et al. 2016; Parida et al. 2015; Dwivedi et al. 2017).
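The SSR definition above (a short motif repeated in tandem a minimum number of times) can be illustrated with a minimal Python sketch. The function name and the thresholds (motifs of 2 to 6 bases, at least 3 copies) are illustrative assumptions, not values taken from the cited studies, and dedicated SSR-mining tools apply more refined rules.

```python
import re

def find_ssrs(sequence, min_motif=2, max_motif=6, min_repeats=3):
    """Return (start, motif, repeat_count) for each tandem repeat found.

    Thresholds are illustrative assumptions; positions are 0-based.
    """
    sequence = sequence.upper()
    # A motif of min_motif..max_motif bases (shortest tried first), followed
    # by at least (min_repeats - 1) further copies of the same motif.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_motif, max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip single-base runs such as AAAAAA
        repeats = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, repeats))
    return hits

print(find_ssrs("GGCATATATATCC"))
```

In this toy sequence the dinucleotide AT occurs four times in tandem starting at position 3, so it is reported as a candidate SSR.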

Identification of DNA polymorphisms is essential for gene mapping, QTL analysis, and marker-assisted breeding. Various techniques have been used to identify DNA polymorphisms, including gel-based techniques, like random amplified polymorphic DNA (RAPD), amplified fragment length polymorphism (AFLP), restriction fragment length polymorphism (RFLP), microsatellites, SSR, and simple sequence length polymorphism (SSLP), and non-gel-based techniques, like SNPs and InDels. SNPs/InDels are the most popular non-gel-based DNA marker systems and represent the positions where the DNA sequence differs by one or more bases. SNPs/InDels have gained importance due to their ubiquity in the genome coupled with characteristics such as stability, robustness, efficiency, and cost-effectiveness (Alkan et al. 2011; Kumar et al. 2012b; McCouch et al. 2010; Rafalski 2002; Steemers and Gunderson 2007). Next-generation sequencing (NGS) is an easy and cost-effective method for the discovery of SNPs/InDels in a population. A large number of SNPs have been discovered via genome re-sequencing in several plant species, like Arabidopsis (Atwell et al. 2010), rice (Huang et al. 2010, 2011; Jain et al. 2014; McNally et al. 2009; Meyer et al. 2016; Zhao et al. 2011), maize (Kump et al. 2011; Tian et al. 2011), chickpea (Deokar et al. 2014; Thudi et al. 2014), and soybean (Hwang et al. 2014; Lam et al. 2010).

Since NGS generates a huge amount of SNP/InDel data, a large number of bioinformatics tools are available to analyze these changes and to validate their biological significance (Li and Wei 2015; Seal et al. 2014). Among them, GATK and FreeBayes are two important tools for discovering SNPs/InDels from genome-mapped files (Garrison and Marth 2012; Van der Auwera et al. 2013). Genome mapping of the sequence reads is performed using different tools, mainly TopHat, STAR, and Bowtie (Dobin et al. 2013; Trapnell et al. 2009; Wu et al. 2018). Once a DNA polymorphism is identified, it is annotated using the SnpEff software (Cingolani et al. 2012). This helps to understand the effect of SNPs/InDels on various transcriptional, post-transcriptional, and post-translational modifications. These genetic variations can be further associated with various traits using genome-wide association studies (GWAS) in plants (Marees et al. 2018). The SNPs/InDels associated with particular traits can then be used in genetic engineering and crop breeding to improve crop productivity.
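As a small illustration of what a variant caller's output encodes, the following Python sketch classifies reference/alternate allele pairs as SNPs, insertions, or deletions and flags InDels whose length is a multiple of three. It is a simplified sketch, not the logic used by GATK, FreeBayes, or SnpEff; the tuple layout (chromosome, position, REF, ALT) and all function names are assumptions made for this example.

```python
from collections import Counter

def classify_variant(ref, alt):
    """Classify one REF/ALT allele pair by comparing allele lengths."""
    ref, alt = ref.upper(), alt.upper()
    if len(ref) == len(alt):
        if ref == alt:
            return "no change"
        # Equal-length alleles longer than one base are not simple SNPs.
        return "SNP" if len(ref) == 1 else "other"
    return "insertion" if len(alt) > len(ref) else "deletion"

def is_in_frame(ref, alt):
    """True when an InDel adds or removes a multiple of three bases."""
    return len(ref) != len(alt) and abs(len(alt) - len(ref)) % 3 == 0

def summarize(variants):
    """Count variant classes in a list of (chrom, pos, ref, alt) tuples."""
    return Counter(classify_variant(ref, alt) for _, _, ref, alt in variants)

# Hypothetical records, for illustration only.
example = [("chr1", 101, "A", "G"), ("chr1", 250, "T", "TACG"), ("chr2", 77, "GCA", "G")]
print(summarize(example))
print([is_in_frame(ref, alt) for _, _, ref, alt in example])
```

A three-base insertion such as T to TACG leaves the reading frame intact, whereas the two-base deletion GCA to G would shift it if it fell inside a coding region.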

2.3 Application of Bioinformatics in Epigenomics

DNA methylation is an epigenetic modification that occurs by the addition of a methyl group to the genomic DNA. It plays a crucial role in the regulation of chromatin structure and of gene expression in eukaryotes. DNA methylation mainly occurs at cytosine and adenine nucleotides; cytosine methylation is the predominant form in higher eukaryotes. In plants, DNA methylation occurs in three different sequence contexts, CG, CHG, and CHH (where H = A, C or T). This methylation is established and maintained by de novo methyltransferases (DRM1/2/CMT3) via the RNA-directed DNA methylation (RdDM) pathway and by MET1 proteins (Cao and Jacobsen 2002; Lindroth 2001). Epigenetic modifications are highly stable and heritable, and they regulate cellular and developmental processes, including agronomically important traits in plants (Manning et al. 2006; Miura et al. 2009; Soppe et al. 2000). DNA methylation analysis has been carried out in different plants to study its role in developmental processes and stress responses (Chinnusamy and Zhu 2009; Dowen et al. 2012; Gehring et al. 2009; Hsieh et al. 2009; Lang-Mladek et al. 2010; Mirouze et al. 2009; Saze et al. 2003; Zemach et al. 2010).
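The three sequence contexts can be made concrete with a short Python sketch that assigns a forward-strand cytosine to CG, CHG, or CHH from the two bases that follow it. The function name and the 0-based indexing are assumptions made for this example; cytosines on the reverse strand would need the reverse complement of the sequence.

```python
def cytosine_context(sequence, position):
    """Return 'CG', 'CHG', 'CHH', or None for the base at a 0-based position.

    H stands for A, C, or T. None is returned when the base is not a
    cytosine or when the context cannot be determined (sequence end, N).
    """
    seq = sequence.upper()
    if seq[position] != "C":
        return None
    downstream = seq[position + 1 : position + 3]
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CG"
    if len(downstream) < 2 or downstream[0] not in "ACT":
        return None
    if downstream[1] == "G":
        return "CHG"
    if downstream[1] in "ACT":
        return "CHH"
    return None

seq = "ACGTCAGTCTTC"
print([(i, cytosine_context(seq, i)) for i, base in enumerate(seq) if base == "C"])
```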

Various techniques have been developed to study genome-wide DNA methylation (HPLC, mass spectrometry, SssI methyltransferase tritium labeling, and methyl-sensitive restriction enzymes). Initially, these methods were low throughput because they could capture DNA methylation in only a few genes (Karan et al. 2012; Wang et al. 2011). Later, microarrays proved to be the first high-throughput technique for studying DNA methylation (Schumacher et al. 2006). Subsequently, next-generation sequencing (NGS) based techniques were developed to capture DNA methylation at single-base resolution and have been used to study DNA methylation in various plants, including Arabidopsis and rice (Dowen et al. 2012; Garg et al. 2015; Rajkumar et al. 2020; Wang et al. 2011). These techniques provide more in-depth knowledge about DNA methylation, its distribution, and its regulation.

Bioinformatics tools such as Bismark and methylKit are highly efficient for analyzing DNA methylation data. Bisulfite sequencing is a widely used technique for studying DNA methylation, in which unmethylated cytosine is converted into uracil (read as thymine after amplification and sequencing), whereas methylated cytosine remains unchanged (Li and Tollefsbol 2011). After bisulfite treatment, the DNA is sequenced on an NGS platform, and the sequencing data then need to be mapped to the genomic DNA. A specialized sequence aligner is required to align bisulfite-converted reads to the genome, and the most widely used one is Bismark (Krueger and Andrews 2011). The mapped reads are then mined with another bioinformatics tool, methylKit (Akalin et al. 2012), which extracts the methylated cytosines throughout the genome. This information is further used, together with different databases, to annotate the methylation sites and to study the biological relevance of methylation for various biological processes and metabolic pathways.
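The principle of bisulfite sequencing can be sketched in a few lines of Python: unmethylated cytosines are read as thymine after conversion and amplification, methylated cytosines are still read as cytosine, and the methylation level at a site is the fraction of reads that still show a C. This is a conceptual sketch for the forward strand only, not the algorithm implemented in Bismark or methylKit; the function names are hypothetical.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Simulate bisulfite treatment plus amplification on the forward strand.

    Cytosines at the given 0-based positions are treated as methylated and
    stay C; every other cytosine is read as T.
    """
    converted = []
    for i, base in enumerate(sequence.upper()):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)

def methylation_level(c_reads, t_reads):
    """Fraction of reads supporting methylation at one cytosine site."""
    total = c_reads + t_reads
    return c_reads / total if total else None

print(bisulfite_convert("ACGTCCA", {1}))   # only the cytosine at index 1 is methylated
print(methylation_level(8, 2))
```

A site covered by 8 reads showing C and 2 reads showing T would be scored as 80% methylated under this simple definition.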

2.4 Bioinformatic Tools to Identify the Transcriptomic Alterations

2.4.1 RNA-Seq Analysis

The transcriptome can be defined as the total mRNA in a cell at a particular time. mRNA is transcribed from one strand of the genomic DNA and is then translated into protein with the help of ribosomes. The transcriptome of a cell can be studied by microarray and by RNA sequencing (RNA-seq) (Jain 2012; Wang et al. 2011). Microarrays have lower throughput and several limitations compared with RNA-seq.

A microarray is based on the hybridization of DNA probes designed for every gene (Page et al. 2007). The probes are highly specific for their genes. mRNA from one condition is labeled with a green dye, and mRNA from the other condition is labeled with a red dye. These labeled mRNAs are hybridized on a chip containing DNA probes for the various genes. Once a labeled mRNA hybridizes with its probe, it emits a fluorescent signal, which is detected by a highly sensitive camera. The color patterns of the two conditions are then overlaid, and differential expression between the conditions is estimated from the relative signal intensities. To analyze these data, GeneSpring GX from Agilent is one of the most widely used bioinformatics tools (Agapito 2019). It combines different utilities that provide powerful, accessible statistical tools for data analysis and visualization. It is designed for the needs of biologists and supports the interpretation of transcriptomics, genomics, proteomics, metabolomics, and NGS data within their biological context. It allows researchers to identify biologically significant genes and pathways quickly and reliably.
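The comparison of red and green signal intensities is commonly summarized as a log2 ratio per probe. The Python sketch below shows that calculation; the pseudocount and the two-fold threshold are illustrative assumptions, and real analyses also include background correction and normalization steps that are omitted here.

```python
import math

def log2_ratio(red_intensity, green_intensity, pseudocount=1.0):
    """Log2 ratio of red over green signal; the pseudocount avoids log(0)."""
    return math.log2((red_intensity + pseudocount) / (green_intensity + pseudocount))

def call_direction(ratio, threshold=1.0):
    """Label a probe using an assumed two-fold (|log2 ratio| >= 1) cutoff."""
    if ratio >= threshold:
        return "higher in red-labeled sample"
    if ratio <= -threshold:
        return "higher in green-labeled sample"
    return "similar"

ratio = log2_ratio(7.0, 1.0)
print(ratio, call_direction(ratio))
```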

RNA-seq is one of the most advanced techniques for studying the transcriptome and is based on next-generation sequencing (NGS) (Børsting and Morling 2015; Jain 2012; Lister et al. 2009). It has various advantages over microarrays, as it can be used to study alternative splicing and polyadenylation and to discover novel genes or transcripts (Rao et al. 2018). During RNA-seq library preparation, mRNA is converted into cDNA to enhance stability. The cDNA is mechanically sheared into small fragments (100–500 nucleotides). These fragments are attached to the sequencing chip through adapter sequences. The attached fragments are then PCR amplified using primers based on the adapter sequences to increase the number of copies of each molecule. These cDNA fragments are then sequenced (Kumar et al. 2012a; Zhong et al. 2011). The sequencing platform uses the sequencing-by-synthesis approach. Based on read length, sequencing technologies are divided into two groups, short-read and long-read (Berbers et al. 2020), and both groups have advantages and disadvantages. Short-read sequencing technology can provide greater read depth, whereas long-read technology provides longer reads but shallower read depth (Reinert et al. 2015).

Once sequencing is complete, the sequencing reads are mapped to the genome sequence of the respective plant. Mapping of sequencing reads is done with various bioinformatics tools, such as TopHat, SOAP, STAR, Salmon, and Bowtie (Dobin et al. 2013; Kim et al. 2013; Patro et al. 2017; Trapnell et al. 2009; Xie et al. 2014). Among these, STAR is a widely preferred aligner, and it can report the number of reads mapped to each gene in every sample (Dobin et al. 2013). After normalization, the mapped read counts are used to estimate differential gene expression between two samples or conditions. Various bioinformatics tools are used to estimate differential gene expression, including edgeR, DESeq, limma, Cufflinks/Cuffdiff, RSEM, and Salmon (Ghosh and Chan 2016; Li and Dewey 2011; Love et al. 2014; Patro et al. 2017; Pollier et al. 2013; Ritchie et al. 2015; Robinson et al. 2010). edgeR, DESeq, and limma are the most used tools for the identification of differentially expressed (DE) genes (Love et al. 2014; Ritchie et al. 2015; Robinson et al. 2010).
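To show why read counts are normalized before samples are compared, the following Python sketch scales raw counts to counts per million (CPM) and computes a log2 fold change between two conditions. This is a deliberately simple stand-in: edgeR, DESeq, and limma use more elaborate normalization and statistical models. The gene identifiers, the counts, and the pseudocount are hypothetical.

```python
import math

def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}

def log2_fold_change(cpm_treated, cpm_control, pseudocount=1.0):
    """Per-gene log2(treated / control); the pseudocount avoids division by zero."""
    return {
        gene: math.log2((cpm_treated[gene] + pseudocount) / (cpm_control[gene] + pseudocount))
        for gene in cpm_control
    }

# Hypothetical raw counts for two samples sequenced to different depths.
control = {"geneA": 250, "geneB": 750}
treated = {"geneA": 1500, "geneB": 1500}
lfc = log2_fold_change(counts_per_million(treated), counts_per_million(control))
print(lfc)
```

Although geneB has twice as many raw reads in the treated sample, its share of the library falls from 75% to 50%, so its log2 fold change is negative once sequencing depth is taken into account.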

The DE genes are further used to discover the biological processes and pathways that they regulate. The biological processes are identified with tools such as Enrichr and BiNGO (Kuleshov et al. 2016; Maere et al. 2005). To annotate the DE genes, these tools use functional annotations from ontology databases. To discover the role of DE genes in biological pathways, the KEGG pathway database (https://www.genome.jp/kegg/pathway.html) is used. DE genes are also used to discover transcription regulatory elements using different databases (Table 2.2). Among these, the plant cis-acting regulatory elements database (PlantCARE) and PLACE are among the most suitable and widely used (Guo et al. 2008).
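Enrichment analysis asks whether a functional category contains more DE genes than expected by chance. One common way to express this is a one-sided hypergeometric test, sketched below in pure Python. This is a minimal illustration under assumed inputs and is not presented as the exact procedure of any particular tool, which would also correct for multiple testing across many categories.

```python
from math import comb

def enrichment_p_value(population, annotated, sample, overlap):
    """Probability of seeing at least `overlap` annotated genes by chance.

    population: number of genes in the background set
    annotated:  background genes carrying the annotation (e.g. one GO term)
    sample:     number of DE genes
    overlap:    DE genes carrying the annotation
    """
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(population, sample)

# Hypothetical numbers: 20000 genes, 200 in the category, 500 DE genes, 15 shared.
print(enrichment_p_value(20000, 200, 500, 15))
```

With these made-up numbers about 5 shared genes would be expected by chance, so observing 15 yields a small p-value.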

Table 2.2 Databases for the study of promoter sequences and regulatory elements of a gene

Several public databases are available for storing transcriptomic data, such as Genevestigator, NASCArrays, ArrayExpress, Stanford Microarray Database, Omics DI, and Gene Expression Omnibus (Bhardwaj and Somvanshi 2015). One example is the Chickpea Transcriptome Database (CTDB), which holds information about the tools used for transcriptome sequencing, transcription factor families, conserved domain(s), and molecular markers in chickpea (Verma et al. 2015) (Table 2.2).

2.4.2 Tools and Databases for Transcription Factor Binding Site

Chromatin immunoprecipitation sequencing (ChIP-seq) is a method for analyzing protein–DNA interactions. It couples chromatin immunoprecipitation (ChIP) with NGS to identify the binding sites of DNA-associated proteins. It can be used to discover the binding sites of any such protein and has primarily been used to study transcription factor (TF) binding sites and chromatin-associated proteins (Mundade et al. 2014).

ChIP-seq includes a few critical steps before the DNA fragments attached to the TF/protein are sequenced. It starts with crosslinking of the protein to the DNA using formaldehyde (Hoffman et al. 2015; Klockenbusch and Kast 2010; Nadeau and Carlson 2007). However, along with protein–DNA crosslinking, there is a chance of contamination of the reaction mixture with RNA–protein complexes. The crosslinked sample is fragmented to obtain DNA–protein crosslinked fragments, which are pulled down using an antibody. The DNA fragments are then sequenced on a deep short-read sequencing platform. The first step in ChIP-seq data analysis is known as peak calling.

The most popular bioinformatics tool for peak calling is MACS (Feng et al. 2012; Zhang et al. 2008). MACS empirically models the shift size of ChIP-seq tags and uses it to improve the spatial resolution of the predicted binding sites. Once the binding sites across the whole genome are predicted, they must be annotated to find the respective genes located downstream. This can be performed with HOMER and with the various other databases available for annotating binding sites and the related TFs (Table 2.3) (Heinz et al. 2010, 2018). The annotation provides information about the binding sites and the genes and pathways they regulate. This information can be used to identify genes and relevant pathways that can be targeted in crop improvement.
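The annotation step can be pictured as assigning each called peak to a nearby gene. The Python sketch below links a peak summit to the gene with the closest transcription start site (TSS) on the same chromosome. It is a simplified illustration rather than the procedure HOMER follows; strand and upstream/downstream orientation are ignored, and the gene names and coordinates are hypothetical.

```python
def annotate_peak(chrom, summit, genes):
    """Return (gene_name, distance_in_bp) for the closest TSS, or None.

    genes is a list of (name, chromosome, tss_position) tuples. Only genes
    on the same chromosome as the peak are considered.
    """
    candidates = [(name, abs(tss - summit)) for name, g_chrom, tss in genes if g_chrom == chrom]
    if not candidates:
        return None
    return min(candidates, key=lambda item: item[1])

# Hypothetical gene models and peak summits, for illustration only.
genes = [("geneA", "chr1", 1000), ("geneB", "chr1", 5000), ("geneC", "chr2", 1200)]
for chrom, summit in [("chr1", 1300), ("chr1", 4100), ("chr2", 900)]:
    print(chrom, summit, annotate_peak(chrom, summit, genes))
```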

Table 2.3 Databases for transcription factor prediction

2.4.3 Tools and Databases for Analysis of Post-Transcriptional Modifications

Alternative splicing is another important post-transcriptional event studied in transcriptome analysis. Alternative splicing events are divided into five categories: exon skipping, mutually exclusive exons, alternative 5′ donor site, alternative 3′ acceptor site, and intron retention (Bedre et al. 2019; Eckardt 2013; Shang et al. 2017; Shankar et al. 2016). Intron retention is the most common alternative splicing event under normal or stress conditions (Shankar et al. 2016). The recommended tools for identifying alternative splicing are TopHat, MapSplice, SpliceMap, HMMsplicer, STAR, and HISAT (Au et al. 2010; Dimon et al. 2010; Dobin et al. 2013; Kim et al. 2015; Trapnell et al. 2009; Wang et al. 2010). These tools provide information about alternative splicing in mRNA. Various bioinformatics tools are also available for computing the differential expression of the transcript isoforms produced by alternative splicing (Kim et al. 2013; Patro et al. 2017). This helps to identify specific isoforms produced during stress or at different developmental stages (Akhter et al. 2018; Jiang et al. 2015; Shankar et al. 2016). Biologists can use this information to gain a deeper understanding of plant development and stress responses.
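Two of these event types, exon skipping and intron retention, can be recognized by comparing the exon coordinates of an isoform with those of a reference transcript. The Python sketch below illustrates the idea on toy coordinates; it is an assumption-laden simplification (exons as sorted, inclusive start/end pairs on one strand, exact coordinate matches) and not the method used by the splice-aware aligners listed above.

```python
def skipped_exons(reference_exons, isoform_exons):
    """Internal reference exons missing from the isoform while both flanking exons are kept."""
    kept = set(isoform_exons)
    skipped = []
    for i in range(1, len(reference_exons) - 1):
        exon = reference_exons[i]
        if exon not in kept and reference_exons[i - 1] in kept and reference_exons[i + 1] in kept:
            skipped.append(exon)
    return skipped

def retained_introns(reference_exons, isoform_exons):
    """Reference introns that lie entirely inside a single exon of the isoform."""
    introns = [
        (left_end + 1, right_start - 1)
        for (_, left_end), (right_start, _) in zip(reference_exons, reference_exons[1:])
    ]
    return [
        (start, end)
        for start, end in introns
        if any(ex_start <= start and end <= ex_end for ex_start, ex_end in isoform_exons)
    ]

# Hypothetical transcript models.
reference = [(1, 100), (201, 300), (401, 500)]
print(skipped_exons(reference, [(1, 100), (401, 500)]))
print(retained_introns(reference, [(1, 300), (401, 500)]))
```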

RNA secondary structure is another feature that arises in RNA at the post-transcriptional level (Ding et al. 2014; Wang et al. 2019b; Yang et al. 2018). It is known that genomic DNA is folded into specific shapes in the nucleus. Similar folding has been reported for RNA, which folds after transcription to achieve its function or stability. It is well established that ribosomal RNA folds into a distinct three-dimensional shape that includes internal loops and helices. It binds to ribosomal proteins to form the ribosomal subunits required for protein synthesis. Various studies have been carried out to determine mRNA secondary structure in plants using NGS techniques (Ding et al. 2014; Wang et al. 2019b; Yang et al. 2018). It has been observed that variations in mRNA secondary structure affect various transcriptional and post-transcriptional events (Li et al. 2012). Several bioinformatics tools are available that can predict the secondary structure of RNA (Gruber et al. 2008; Reuter and Mathews 2010; Wang et al. 2019a). It has also been observed that the RNA secondary structures predicted by bioinformatics tools and those detected by NGS techniques are very similar (Li et al. 2012).
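To give a feel for how secondary structure can be predicted computationally, the Python sketch below implements the classic Nussinov dynamic programming recurrence, which finds the maximum number of nested base pairs in a sequence. It is named here as a teaching simplification: the prediction tools cited above rely on thermodynamic energy models rather than simple pair counting. The minimum loop length of three bases and the acceptance of G-U wobble pairs are assumptions of this example.

```python
def can_pair(a, b):
    """Watson-Crick pairs plus the G-U wobble pair."""
    return {a, b} in ({"A", "U"}, {"G", "C"}, {"G", "U"})

def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs (Nussinov recurrence).

    min_loop is the minimum number of unpaired bases enclosed by any pair.
    """
    rna = rna.upper().replace("T", "U")
    n = len(rna)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    # Fill the table for increasingly long subsequences rna[i..j].
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                 # case 1: base j stays unpaired
            for k in range(i, j - min_loop):       # case 2: base j pairs with base k
                if can_pair(rna[k], rna[j]):
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]

print(max_base_pairs("GGGAAACCC"))
```

For the toy hairpin GGGAAACCC the three G-C pairs close a loop of three adenines, so the function reports three base pairs.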

2.5 Importance of Bioinformatics in Proteomics and Metabolomics

Proteins regulate various biochemical and physiological functions in cells. The dysregulation of proteins may result in various diseases, like cancer, neurodegenerative disease, and metabolic imbalance. A protein is synthesized from mRNA during translation and then folds into a three-dimensional (3D) structure. If the 3D structure is not folded properly, the protein will not be able to perform its activity or to interact with other proteins. Knowledge of protein structure and protein–protein interactions can be obtained from various databases (Table 2.4).

Table 2.4 Important computational tools for predicting protein structure and protein–protein interactions

One of the most advanced techniques available for proteomic analysis is mass spectrometry (Di Falco 2018; Reinders et al. 2004). All proteins are first extracted from the sample and digested with specific proteases to generate defined peptides. The peptides obtained are analyzed by liquid chromatography coupled to mass spectrometry (LC-MS) (Lluveras-Tenorio et al. 2017). During the analysis, peptides eluting from the chromatography column are selected, and the data are recorded as mass spectra. The resulting tandem spectra provide information about the sequence of each peptide. The identified proteins are then functionally annotated using gene ontology (GO) terms and the KEGG pathway database. GO terms provide information about the cellular component, biological process, and molecular function of the respective genes and proteins. Cellular component GO terms describe the location of the protein within the cell. Biological process GO terms describe the biological processes in which the gene products take part, and molecular function GO terms describe the activities performed by the genes or proteins rather than the entities (molecules or complexes) that perform them (Hill et al. 2008). Similarly, the KEGG pathway database provides knowledge about the metabolic pathways regulated by these proteins. Researchers use this information to infer the pathways regulated by these genes and to translate the findings into genetic engineering and crop improvement.
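The protease digestion step can be mimicked in silico. The Python sketch below assumes trypsin as the protease, which is an assumption of this example since the text does not name one, and applies the commonly used rule of cleaving after lysine (K) or arginine (R) unless the next residue is proline (P). Missed cleavages and modifications are ignored, and the protein sequence is hypothetical.

```python
import re

def tryptic_digest(protein, min_length=1):
    """Split a protein sequence into peptides using a simple trypsin rule.

    Cleaves after K or R, but not when the following residue is P.
    Splitting on a zero-width pattern requires Python 3.7 or later.
    """
    peptides = re.split(r"(?<=[KR])(?!P)", protein.upper())
    return [peptide for peptide in peptides if len(peptide) >= min_length]

print(tryptic_digest("MKRPLAKGR"))
```

Peptide lists of this kind are what database search engines compare against the observed tandem spectra; a `min_length` above 1 can be used to drop very short peptides.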

Different public databases are available for MS proteomics research. These include the Global Proteome Machine Database (GPMDB), Mass Spectrometry Interactive Virtual Environment (MassIVE), PRIDE, PeptideAtlas, PeptideAtlas SRM Experiment Library (PASSEL), and ProteomicsDB. Moreover, to improve the integration and sharing of public databases, the ProteomeXchange consortium has been established for the benefit of the scientific community (Perez-Riverol et al. 2015).

Metabolomics is another branch of omics that involves the comprehensive assessment and quantification of the metabolites present in the cell. Metabolites represent a diverse group of low molecular weight molecules, including lipids, amino acids, peptides, nucleic acids, organic acids, vitamins, thiols, and carbohydrates. These metabolites play different roles in biological systems, and their roles in various plant stress and developmental processes need to be understood (Hussein and El-Anssary 2019; Bartwal et al. 2013; Jwa et al. 2006; Saito and Matsuda 2010; Shankar et al. 2016). This information can then be used by biologists in genetic engineering or plant breeding to improve crop plants. Various methods have been developed to study metabolites, including GC, HPLC, UPLC, and CE coupled to MS, as well as NMR spectroscopy (Boizard et al. 2016; Boros et al. 2018; Garcia-Perez et al. 2020; Lluveras-Tenorio et al. 2017; Patel et al. 2017; Yang et al. 2013, 2020). These methods help in the separation, detection, characterization, and quantification of metabolites and their related pathways. However, the diversity of these molecules makes it challenging to study the metabolites with a single technique. Thus, more than one technique is used to identify the different metabolites in the plant system.

2.6 Challenges and Opportunity in Omics Study

Various advancements have been achieved in the field of omics. It is now possible to detect most of the RNA, DNA, and protein content present in a cell. However, several challenges persist and need to be addressed. Even today, library preparation for DNA or RNA sequencing does not capture all the DNA and RNA molecules, and a large number of RNA and DNA molecules are degraded during sample preparation. Genome re-sequencing, even with advanced technology, is not able to cover 100% of the genome of any organism. Mapping of sequencing reads on the genome and/or transcriptome also yields a lot of redundancy, and this problem is more prominent in plants with genomes of ≥2n (diploid or higher ploidy). Proteomics and metabolomics studies are at a very early stage, and the recent growth in large-scale proteomics data poses a substantial challenge for the available bioinformatics tools to validate these results (Cho 2007; Hongzhan et al. 2007; Reinders et al. 2004; Schubert et al. 2017). In proteomic analysis, a large number of challenges remain to be resolved besides sample preparation, such as data assembly and database searching for functional annotation (Reinders et al. 2004; Schubert et al. 2017). Only those proteins whose information is present in a database can be annotated, so identifying a novel protein is very challenging.

To capture all the DNA and RNA, new methods and techniques are being developed. Single-molecule sequencing is evolving as a new approach to improve genome coverage, and the analysis of these data is also being improved. It can provide complete sequence information for all the mRNAs expressed in a cell or tissue, which will also enable a deeper understanding of the post-transcriptional modifications that occur in RNA. Implementation of this method may help to address the limitations of protein sequencing and quantification. During sample preparation, separate parts of a tissue are used to extract DNA, RNA, proteins, or metabolites, which adds a batch effect to the analysis. Molecular signatures are now being analyzed from single cells, so developing methods to extract the entire molecular signature from the same cell or tissue is a great opportunity. Recently, a few protocols have been developed to extract DNA and RNA from the same tissue, but they still need a lot of optimization. In bioinformatics analysis, all tools and techniques come with some limitations, and novel techniques and methods are being developed to overcome them. Hopefully, more advanced technology will be developed in the future to solve these challenges and limitations.