32.1 Introduction

Designer crops with improved quality and yield, as well as disease-free crops, are in demand today. Crop improvement and protection are therefore the first priority, and computational analysis of sequenced plant genomes plays a very important role in maximizing yield, improving fruit and grain quality and developing disease-resistant crop varieties (Chen and Chen 2008; King 2004; Mochida and Shinozaki 2010; Batley and Edwards 2016; Moody 2004). Developing sequence markers based on the identification of single nucleotide polymorphisms and simple sequence repeats has now become a feasible method for crop improvement. Many techniques, databases, tools and software packages have been developed to understand and analyze biological systems. This chapter describes standard bioinformatics techniques together with specific tools and software.

32.2 Bioinformatics Techniques

32.2.1 Comparative Analysis

Comparative analysis is a field of biological sequence analysis in which the genomic features of different organisms are compared. These features may include the DNA sequence, regulatory region sequences, genes and gene order. The major principle of comparative analysis is that features shared by homologous sequences are often encoded in DNA that is evolutionarily conserved between them, whereas regions that differ are involved in diversity (Hardison 2003; Ong et al. 2016; Gebhardt et al. 2005; Sayers et al. 2019) (Fig. 32.1).

Fig. 32.1 Genome availability details in the NCBI database for retrieval and comparison of sequences

32.2.2 Sequence Analysis

Sequence analysis is the process of examining a DNA, RNA or protein sequence, including homologous (orthologous and paralogous) gene sequences, to understand its evolution, function, structure or features. It relies on sequence alignment and on searches against biological sequence databases such as reference genes and proteins, UniProtKB/Swiss-Prot and the Protein Data Bank. Sequence analysis includes the comparison of homologous sequences to find similarities and differences; the identification of intrinsic features of a sequence such as active sites, post-translational modification sites, gene structures, reading frames, the distribution of introns and exons and regulatory elements; the identification of sequence differences and variations such as point mutations, single nucleotide variants (SNVs) and single nucleotide polymorphisms (SNPs) that can serve as genetic markers; the study of the evolution and genetic diversity of sequences and organisms; and the identification of molecular structure from sequence alone. The Basic Local Alignment Search Tool (BLAST) is one of the most widely used tools for these tasks (Aljanabi 2001; Bolger et al. 2018; Martinez 2013; Demuth and Hahn 2009; Lyons and Freeling 2008; Altschul et al. 1990; McClure et al. 1994; Pirovano and Heringa 2008; Bawono et al. 2017) (Fig. 32.2).

Fig. 32.2 Basic local alignment search tool web page for sequence similarity analysis
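
The comparison of homologous sequences described above can be illustrated with a minimal Python sketch. It assumes two sequences that have already been aligned to the same length, with `-` marking gaps; the function name `compare_aligned` and the example sequences are hypothetical and serve only as an illustration. The sketch reports percent identity over the ungapped columns and lists the single-base differences, which are the raw material for SNP markers.

```python
def compare_aligned(seq_a, seq_b):
    """Return (percent identity, list of (position, base_a, base_b)).

    Both inputs must be pre-aligned strings of equal length; '-' is a gap.
    Positions are 1-based alignment columns. Gap columns are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = 0
    compared = 0
    variants = []
    for pos, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()), start=1):
        if a == "-" or b == "-":
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            matches += 1
        else:
            variants.append((pos, a, b))  # candidate single-base variant
    identity = 100.0 * matches / compared if compared else 0.0
    return identity, variants


# Hypothetical aligned fragments from two accessions
identity, variants = compare_aligned("ATGCCGTA-A", "ATGACGTATA")
print(f"{identity:.1f}% identity, variants: {variants}")
```

In practice the alignment itself would come from a dedicated alignment programme; the sketch only covers the comparison step that follows.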

32.2.3 Gene Identification

Gene finding, also called gene hunting or gene prediction, is the process of identifying the regions of genomic DNA that encode genes. It is one of the first and most important steps in understanding the genes and genome of an organism once the sequence is available in the public domain. Gene finding is a key step in genome annotation; it follows genome sequence assembly and distinguishes coding (exonic) regions from non-coding (for example, intronic) regions (Alioto 2012; Wang et al. 2004; Mochida and Shinozaki 2010) (Fig. 32.3).

Fig. 32.3 Softberry server is a collection of software tools for genomic research focused on computational methods for high throughput biomedical data analysis
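
As a rough illustration of the simplest form of gene finding, the following Python sketch scans the forward strand of a DNA sequence for open reading frames (ORFs), that is, stretches that start with ATG and end at an in-frame stop codon. This is a deliberate simplification: real gene predictors for eukaryotic genomes also model introns, splice sites and the reverse strand. The function name `find_orfs`, the `min_codons` threshold and the example sequence are assumptions made for this sketch.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    start is 0-based, end is exclusive and includes the stop codon.
    min_codons counts the coding codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical sequence with one short ORF in frame 2
print(find_orfs("CCATGAAACCCGGGTAGTT"))
```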

32.2.4 Phylogenetic Analysis

Phylogenetic analysis is the study of the evolutionary relationships among groups of homologous genes or organisms (e.g. species or populations). These relationships are inferred from sequence or morphological data using phylogenetic inference methods: distance-matrix methods such as Neighbor-Joining (NJ), UPGMA (Unweighted Pair Group Method with Arithmetic mean), WPGMA (Weighted Pair Group Method with Arithmetic mean) and the Fitch–Margoliash method, optionally rooted using outgroups; maximum parsimony, implemented for example with branch and bound, the Sankoff–Morel–Cedergren algorithm, MALIGN and POY; maximum likelihood; and Bayesian inference. A phylogenetic tree is a branching diagram that represents the evolutionary relationships among the selected organisms or species, inferred from similarities and differences in their genetic or physical characteristics. Phylogenetic analyses have become central to understanding genomes, diversity, evolution and ecology (Thompson et al. 1994, 2002).
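
Of the distance-matrix methods listed above, UPGMA is the easiest to sketch. The following minimal Python example repeatedly joins the two closest clusters and recomputes distances as size-weighted averages. The function name `upgma`, the taxon labels and the distance values are hypothetical, and the output is a Newick string showing topology only (no branch lengths), so this is an illustration rather than a replacement for a phylogenetics package.

```python
from itertools import combinations


def upgma(names, matrix):
    """Cluster taxa with UPGMA and return a Newick string (topology only).

    names  : list of taxon labels
    matrix : symmetric square list of lists of pairwise distances
    """
    # Active clusters: id -> (newick text, number of leaves)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Pairwise distances keyed by (smaller id, larger id)
    dist = {(i, j): matrix[i][j] for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        # Join the pair of clusters with the smallest distance
        i, j = min(dist, key=dist.get)
        (text_i, n_i), (text_j, n_j) = clusters.pop(i), clusters.pop(j)
        new_dist = {}
        for k in clusters:
            d_ik = dist[tuple(sorted((i, k)))]
            d_jk = dist[tuple(sorted((j, k)))]
            # Size-weighted mean, so every original leaf counts equally
            new_dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        # Drop distances that involve the two merged clusters
        dist = {p: v for p, v in dist.items() if i not in p and j not in p}
        dist.update(new_dist)
        clusters[next_id] = (f"({text_i},{text_j})", n_i + n_j)
        next_id += 1
    newick, _ = next(iter(clusters.values()))
    return newick + ";"


# Hypothetical distances between four taxa
taxa = ["A", "B", "C", "D"]
distances = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(taxa, distances))
```

UPGMA assumes a constant rate of evolution across lineages; Neighbor-Joining relaxes that assumption and is usually preferred when rates vary.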

32.2.5 Protein–Protein Interaction

Protein–protein interactions (PPIs) are highly specific physical contacts between two or more protein molecules that result from biochemical events driven by the hydrophobic effect and electrostatic forces. The STRING database distinguishes known interactions (from curated databases or experimentally determined), predicted interactions (based on gene neighbourhood, gene fusions or gene co-occurrence) and other interactions (based on text mining, co-expression or protein homology) (De Las Rivas and Fontanillo 2010; Kozakov et al. 2017; Szklarczyk et al. 2019) (Figs. 32.4 and 32.5).

Fig. 32.4 STRING is a database for functional protein association networks

Fig. 32.5 ClusPro server is a web-based server for the direct docking of two interacting proteins

32.2.6 Microarray Data Analysis

NCBI developed the Gene Expression Omnibus (GEO) database in 2000 for high-throughput gene expression data. Microarray data analysis is used to infer information from the data generated by DNA, RNA and protein microarray experiments; this information allows researchers to investigate the expression levels of a very large number of genes, up to the entire genome of an organism, in a single experiment. GEO is a public repository that accepts MIAME (Minimum Information About a Microarray Experiment) compliant submissions of both array-based and sequence-based data. It holds freely available microarray data, next-generation sequencing data and other high-throughput functional genomics data submitted by the scientific community (Clough and Barrett 2016). Because the data generated by these experiments are complex, bioinformaticians and life scientists analyze them with specialized software. GEO provides tools to query and download experimental datasets and gene expression profiles, and to analyze and visualize data directly on the GEO server (Fig. 32.6).

Fig. 32.6 Gene Expression Omnibus (GEO) is a database repository of high throughput gene expression data and microarrays
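
A common first step when comparing expression levels between two conditions is the log2 fold change. The Python sketch below assumes already normalized intensities for a few replicates per condition; the gene names, the values and the `pseudocount` parameter (added to avoid division by zero) are hypothetical choices made for this illustration, not part of any GEO tool.

```python
import math
from statistics import mean


def log2_fold_change(treated, control, pseudocount=1.0):
    """Log2 ratio of mean treated to mean control expression."""
    return math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))


# Hypothetical normalized intensities: gene -> (treated replicates, control replicates)
expression = {
    "geneA": ([7.0, 7.0, 7.0], [1.0, 1.0, 1.0]),
    "geneB": ([3.0, 3.0, 3.0], [3.0, 3.0, 3.0]),
}
fold_changes = {gene: log2_fold_change(t, c) for gene, (t, c) in expression.items()}
for gene, lfc in fold_changes.items():
    print(gene, round(lfc, 2))
```

A positive value indicates higher expression in the treated samples and a negative value lower expression; a real analysis would add a statistical test and multiple-testing correction.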

32.2.7 Structure Prediction and Refinement

Protein structure prediction is the construction of the three-dimensional (3D) structure of a protein from its amino acid sequence. The predicted model describes the folds and the secondary and tertiary structure derived from the primary sequence. Structure prediction is highly important in drug design and in the design of novel enzymes (Krieger et al. 2003; Xiang 2006; França 2015; Cavasotto and Phatak 2009; Xu et al. 2000).

32.2.8 Molecular Docking Calculation

Molecular docking models the interaction of two or more molecules that form a stable complex. Based on the binding properties of the ligand and the target, it generates a three-dimensional structure of the complex. Docking predicts the preferred orientation of one molecule relative to a second molecule in the bound state. Knowledge of this orientation may in turn be used to predict the binding strength, or binding affinity, between receptor and ligand using scoring functions. Because it predicts the binding conformation of ligands in the binding site of a target receptor, molecular docking is a prominent method for structure-based drug design. Characterizing binding behaviour plays an important role in the rational design of novel pesticides, herbicides, insecticides and fungicides (Ferreira et al. 2015; Guedes et al. 2014; Morris and Lim-Wilby 2008; Meng et al. 2011; de Ruyck et al. 2016; Pagadala et al. 2017; Zhao and Caflisch 2015; Kroemer 2007; Sousa et al. 2006; Jones and Willett 1995; Lybrand 1995; Goodsell et al. 1996; Gschwend et al. 1996; Trosset and Cavé 2019).

32.3 Bioinformatics Databases

Biological Data Model

A biological data model is a library of life sciences information and biological databases, together with a collection of computational analysis tools, literature and high-throughput experimental data. Biological databases contain information from research areas including genomics, phylogenetics, proteomics, metabolomics, microarray gene expression and phenomics. This information includes gene structure and function, macromolecular structure, cellular and chromosomal localization, and SNPs and mutations in sequences and structures (Wheeler et al. 2005; Galperin and Fernández-Suárez 2012). NCBI is one such resource, and it provides the popular search engine Entrez. Entrez is NCBI’s primary text search and retrieval system; it integrates the PubMed and PMC databases of biomedical literature with many molecular databases covering genomes, genes, DNA, genetic variation, gene expression, and protein sequence and structure.
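
Entrez can also be queried programmatically through NCBI's E-utilities web interface, which is not described in this chapter. The following Python sketch only builds an `esearch` request URL and does not contact the server; the helper name `esearch_url`, the chosen database and the search term are assumptions made for illustration, and NCBI's usage guidelines (for example, request rate limits) should be checked before sending real queries.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an Entrez esearch URL for the given database and query term."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Hypothetical query: nucleotide records for rice
url = esearch_url("nuccore", "Oryza sativa[Organism]", retmax=5)
print(url)
```

The resulting URL can be opened with any HTTP client; the response lists the record identifiers that match the query.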

32.3.1 NCBI

NCBI stands for the National Center for Biotechnology Information. It is part of the National Library of Medicine (NLM) at the National Institutes of Health (NIH), Bethesda, Maryland. The NCBI was established in 1988 through legislation sponsored by Senator Claude Pepper. NCBI resources cover chemicals and bioassays, data and software, DNA and RNA sequences, domains and structures, genes and expression, genetics and medicine, genomes and maps, homology, literature, protein sequence and structure, sequence analysis, taxonomy, training and tutorials, and variation (NCBI Resource Coordinators 2016; Wheeler et al. 2005) (Figs. 32.7, 32.8, and 32.9).

Fig. 32.7 National Center for Biotechnology Information web page

Fig. 32.8 NCBI genome details page 1 (genome information can be searched by kingdom, group, subgroup and organism name in the NCBI database)

Fig. 32.9 NCBI genome details page 2 (genome information for the plants group of the kingdom Eukaryota, with its subgroups)

32.3.2 DDBJ

DDBJ (DNA Data Bank of Japan), founded in 1986, is a biological databank that mainly contains DNA sequence information. DDBJ is located at the National Institute of Genetics (NIG) in Shizuoka Prefecture, Japan. It is a member of the INSDC (International Nucleotide Sequence Database Collaboration), a joint effort with GenBank (USA) and the European Nucleotide Archive (UK) to collect and share DNA and RNA sequence data. The DDBJ Sequence Read Archive (DRA), the NCBI Sequence Read Archive (SRA) and the EBI Sequence Read Archive (ERA) share new and updated nucleotide sequence data, and the three databases (DDBJ, NCBI and EMBL) are synchronized on a daily basis through continuous interaction between the staff of the collaborating organizations (Kodama et al. 2012) (Fig. 32.10).

Fig. 32.10 DNA Data Bank of Japan web homepage

32.3.3 EMBL

The European Molecular Biology Laboratory (EMBL) is a molecular biology research institution founded in 1974. It is supported by 25 member states and funded by public money from them, and its research is conducted by approximately 85 independent groups. The web-based sequence submission systems of the three collaborating databases are WebIn at EMBL-EBI, Sakura (“cherry blossoms”) at DDBJ and BankIt at the NCBI (Madeira et al. 2019) (Fig. 32.11).

Fig. 32.11 European Molecular Biology Laboratory web page

32.3.4 Ensembl Plants

Ensembl Plants is an integrative database of genome-scale information for plants. It includes genome sequences, gene models, polymorphic loci and functional annotation, together with various tools for the analysis of sequence data. It also contains additional information such as variation data, individual genotype data, linkage, population structure and phenotype data (Bolser et al. 2016, 2017) (Fig. 32.12).

Fig. 32.12 Ensembl Plants front page for genome-scale information of plant species

32.3.5 PlantGDB

PlantGDB is a resource for comparative genomics and a database of molecular sequence data for plant genomes. PlantGDB contains assembled unique transcripts (PUTs), genome survey sequence (GSS) assemblies, genome browsers and workflow management tools (Dong et al. 2004; Duvick et al. 2008) (Fig. 32.13).

Fig. 32.13 PlantGDB database for the comparative plant genomics information

32.3.6 Phytozome

Phytozome is a comparative hub for plant genome and gene family data and analysis. It provides a view of genome organization, gene families, gene structure and the evolutionary history of genes at the sequence level. It also provides access to the sequences and functional annotations of plant genomes and genes (Goodstein et al. 2012) (Fig. 32.14).

Fig. 32.14 Homepage of Phytozome database

32.3.7 UniProt

UniProt is a freely accessible database of protein sequences and functional annotation; many entries are derived from genome sequencing projects. It contains a large amount of information on the biological function of proteins, derived from the literature. The main aim of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional annotation information (UniProt Consortium 2018) (Fig. 32.15).

Fig. 32.15 UniProt database

32.3.8 PDB

The PDB (Protein Data Bank) is a databank of three-dimensional (3D) structural data for a large number of biological molecules, such as nucleic acids and proteins. The structural data are typically obtained by X-ray crystallography, NMR spectroscopy and cryo-electron microscopy. They are submitted by structural biologists from all around the world and are freely accessible online. The main member organizations of the PDB are PDBe, PDBj, RCSB and BMRB. The PDB is overseen by an international organization called the Worldwide Protein Data Bank, wwPDB (Berman et al. 2000; Berman 2008; Laskowski et al. 1997) (Fig. 32.16).

Fig. 32.16 Protein Data Bank homepage

32.3.9 MMDB

The Molecular Modeling Database (MMDB) is a database of experimentally determined three-dimensional biomolecular structures hosted by the National Center for Biotechnology Information (Chen et al. 2003) (Fig. 32.17).

Fig. 32.17 Molecular modeling database of NCBI

32.3.10 GEO

GEO (Gene Expression Omnibus) is a gene expression database that archives and freely distributes microarray datasets, next-generation sequencing datasets and other high-throughput functional genomics datasets deposited by the research community. The main goals of GEO are to provide a versatile and robust database in which researchers can efficiently store high-throughput functional genomic data; to offer simple submission procedures and formats that support complete and well-annotated data deposits; and to provide user-friendly mechanisms that allow users to review, query, locate and download studies and gene expression profiles of interest (Clough and Barrett 2016) (Fig. 32.18).

Fig. 32.18 Gene Expression Omnibus database of deposited high-throughput gene expression profiling data

32.4 Bioinformatics Tools and Software

32.4.1 BiGGEsTS

BiclusterinG Gene Expression Time Series (BiGGEsTS) is a free graphical application based on biclustering algorithms, developed mainly for the analysis of gene expression time series data (Gonçalves et al. 2009) (Fig. 32.19).

Fig. 32.19 BiclusterinG Gene Expression Time Series

32.4.2 HCE

HCE (Hierarchical Clustering Explorer) implements a hierarchical clustering algorithm that enables researchers to determine groupings in their data, with an informative dendrogram, colour mosaic visual feedback and dynamic query controls (Seo et al. 2006) (Fig. 32.20).

Fig. 32.20 Hierarchical Clustering Explorer

32.4.3 ClustVis

ClustVis is a web tool that allows researchers to upload their data and create heat maps and PCA (Principal Component Analysis) plots. Data can be uploaded as a file or pasted into a text box (Metsalu and Vilo 2015) (Fig. 32.21).

Fig. 32.21 ClustVis web tool

32.4.4 BLAST

BLAST (Basic Local Alignment Search Tool) finds regions of local similarity between sequences. The BLAST programme compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of the matches (Altschul et al. 1990; Mount 2007) (Fig. 32.22).

Fig. 32.22 Basic Local Alignment Search Tool
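
The statistical significance that BLAST reports is usually expressed as an expect value (E-value). The Python sketch below shows the underlying Karlin–Altschul relationship, E = K·m·n·e^(−λS), and the corresponding bit score. It is a simplification: it uses the raw query and database lengths rather than the length corrections BLAST applies, and the values chosen for `lam` and `k` are illustrative assumptions, since the real parameters depend on the scoring matrix and gap penalties.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a normalized bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Illustrative (assumed) statistical parameters and search sizes
lam, k = 0.267, 0.041
score = 100            # raw alignment score
m, n = 250, 5_000_000  # query length and total database length

print(f"bit score: {bit_score(score, lam, k):.1f}")
print(f"E-value:   {e_value(score, m, n, lam, k):.2e}")
```

The two quantities are linked by E = m·n·2^(−S′), where S′ is the bit score, which is why a higher bit score always corresponds to a lower E-value for a given search space.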

32.4.5 Clustal

Clustal Omega, ClustalW and ClustalX (the Clustal series) are widely used programmes for multiple sequence alignment (Higgins et al. 1996; Chenna et al. 2003; Sievers and Higgins 2014) (Fig. 32.23).

Fig. 32.23 Clustal series homepage

32.4.6 BioEdit

BioEdit is a free sequence alignment editor for editing and manipulating sequence alignment data (Tippmann 2004) (Fig. 32.24).

Fig. 32.24 BioEdit is a biological sequence alignment editor tool

32.4.7 MEGA

MEGA (Molecular Evolutionary Genetics Analysis) is a tool for manual and automatic sequence alignment, phylogenetic tree construction, estimation of rates of molecular evolution, web-based database mining and testing of evolutionary hypotheses (Kumar et al. 2018) (Fig. 32.25).

Fig. 32.25 Molecular Evolutionary Genetics Analysis

32.4.8 FigTree

FigTree is a graphical viewer for phylogenetic trees and a programme for producing publication-ready tree figures (Rambaut 2012) (Fig. 32.26).

Fig. 32.26 FigTree server

32.4.9 Circos

Circos is a visualization tool that displays data in a circular layout, which makes it well suited to identifying and analyzing the similarities and differences that arise from gene and genome comparisons (Krzywinski et al. 2009) (Fig. 32.27).

Fig. 32.27 Circos server

32.4.10 Prosite

PROSITE is a protein database that consists of entries describing protein families, functional domains and functional signature sites, together with the amino acid patterns and profiles used to detect them in sequences (Sigrist et al. 2002) (Fig. 32.28).

Fig. 32.28 PROSITE server
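
PROSITE patterns use a compact notation that maps closely onto regular expressions. The Python sketch below converts the most common pattern elements (`x` for any residue, `[..]` for allowed residues, `{..}` for excluded residues, `(n)` or `(n,m)` for repeats, and `<` or `>` as sequence-end anchors outside brackets) into a regular expression. The function name `prosite_to_regex` and the test sequence are hypothetical, and less common syntax, such as `>` inside square brackets, is not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")  # N-terminal anchor
        anchor_end = element.endswith(">")      # C-terminal anchor
        element = element.strip("<>")
        # Separate the residue specification from an optional repeat count
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":
            regex = "."                        # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"    # excluded residues
        else:
            regex = core                       # single residue or [ABC]
        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if anchor_start else "") + regex + ("$" if anchor_end else ""))
    return "".join(parts)


# N-glycosylation site pattern, scanned against a hypothetical sequence
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
for match in re.finditer(regex, "MKNGSAAPLNPTQ"):
    print(match.start() + 1, match.group())
```

Note that `re.finditer` reports non-overlapping matches only, so a dedicated pattern scanner may report additional overlapping hits.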

32.4.11 CDD

The Conserved Domain Database (CDD) is a protein database that consists of well-annotated multiple sequence alignment models, stored as position-specific score matrices (PSSMs), for the identification of conserved domains via RPS-BLAST. CDD includes NCBI-curated domains, which use 3D-structure information to define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from external sources such as Pfam, SMART, COG, PRK and TIGRFAMs (Marchler-Bauer et al. 2017) (Fig. 32.29).

Fig. 32.29 Conserved Domain Database

32.4.12 InterProScan

InterProScan is a tool that automatically annotates protein families and domains by scanning sequences against the InterPro database. InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites (Mitchell et al. 2019) (Fig. 32.30).

Fig. 32.30 InterProScan server

32.4.13 EasyModeller

EasyModeller is a graphical user interface programme for homology modeling, used to predict models of protein tertiary structure (Kuntal et al. 2010) (Fig. 32.31).

Fig. 32.31 EasyModeller

32.4.14 RAMPAGE/PROCHECK

The PROCHECK server checks the stereochemical quality of a protein structure model; it produces a Ramachandran plot and analyzes the overall and residue-by-residue geometry. RAMPAGE likewise assesses a model through its Ramachandran plot (Laskowski et al. 2017; Lovell et al. 2003) (Figs. 32.32 and 32.33).

Fig. 32.32 RAMPAGE server

Fig. 32.33 PDBSum
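
A Ramachandran plot is built from the backbone dihedral angles phi and psi of each residue. The Python sketch below computes the dihedral angle defined by four points, which is the basic calculation behind such a plot; for phi the four atoms would be C(i-1), N(i), CA(i) and C(i), and for psi they would be N(i), CA(i), C(i) and N(i+1). The function name `dihedral` and the coordinates are hypothetical, and the sketch does not read structure files or classify residues into allowed regions.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four 3D points."""
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal of the plane through p0, p1, p2
    n2 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = _dot(_cross(n1, n2), b1) / math.sqrt(_dot(b1, b1))
    return math.degrees(math.atan2(y, x))


# Hypothetical coordinates: the fourth point is rotated a quarter turn about the central bond
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Validation servers compare the resulting (phi, psi) pairs against regions that are favoured or allowed in well-refined structures.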

32.4.15 VERIFY3D

The VERIFY3D server determines the compatibility of an atomic model (3D) with its own amino acid sequence (1D) by assigning a structural class to each residue based on its location and environment (alpha, beta, loop, polar, non-polar, etc.) and comparing the results with those of known good structures (Eisenberg et al. 1997) (Fig. 32.34).

Fig. 32.34 SAVES server

32.4.16 YASARA

YASARA (Yet Another Scientific Artificial Reality Application) is a computer programme for molecular visualization, modeling and docking (Krieger and Vriend 2014) (Fig. 32.35).

Fig. 32.35 Yet Another Scientific Artificial Reality Application

32.4.17 BIOVIA Discovery Studio 2019

BIOVIA Discovery Studio, which works together with BIOVIA Pipeline Pilot, is used for simulations, macromolecule design and analysis, antibody modeling, structure-based design, pharmacophore and ligand-based design, QSAR, ADMET and predictive toxicology, X-ray analysis and visualization (Fig. 32.36).

Fig. 32.36 BIOVIA Discovery Studio

32.4.18 PatchDock

The PatchDock server performs protein–protein docking and protein–small molecule docking and returns the predicted complexes (Schneidman-Duhovny et al. 2005) (Fig. 32.37).

Fig. 32.37 PatchDock server

32.4.19 Hex

The Hex tool/server is a graphics programme for calculating and visualizing the docking modes of pairs of protein and DNA molecules. Hex can also calculate protein–ligand docking and superpose molecules (Macindoe et al. 2010) (Fig. 32.38).

Fig. 32.38 Hex server

32.5 Plant and Pathogen Genomics

The five main types of organisms that cause plant diseases are viruses, bacteria, fungi, protozoa and worms/nematodes; the diseases they cause range from minor damage to the death of the plant. The availability of plant and pathogen genomes gives us opportunities to understand these biological systems and disease mechanisms (Tables 32.1, 32.2, and 32.3).

Table 32.1 List of important plant diseases with their causal organisms; most of the pathogen genomes are available in the NCBI database
Table 32.2 List of important plant pathogen genome details
Table 32.3 Plant genome sequence details

32.6 Conclusion

The applications of bioinformatics to plant pathology have played a pivotal role in understanding host and pathogen evolution and the molecular interactions between host and pathogen. The availability of next-generation sequencing data for candidate model organisms of all kingdoms, generated by high-throughput technology, makes it convenient to study biological systems and to understand the biological sequence–structure–function correlation using in-silico tools, technologies and databases. Genome assembly and annotation, BioProject and BioSample submission, sequence data submission, data retrieval, data analysis, variation analysis, conserved domain analysis, gene identification, regulatory element analysis, gene expression analysis, structure prediction, structure visualization, structure analysis, structure classification, molecular modeling, 3D epitope identification and mapping, drug design, active site analysis and molecular docking all play an important role in determining biological function and understanding the sequence–structure–function relationship. All of these in-silico techniques will be further helpful in genomics-assisted crop improvement and in the development of designer crops with high yield and superior quality.