Introduction

The complete genomes of three legume species, namely Medicago truncatula, Lotus japonicus and soybean (Glycine max), have been sequenced (Bertioli et al. 2009; Cannon et al. 2009; Sato et al. 2008; Zhu et al. 2005; Schmutz et al. 2010). Among these, M. truncatula is considered a model species and is taxonomically closer to cool-season legumes such as pea, lentil, faba bean, and chickpea (Bordat et al. 2011). Transferring the genomic and biological knowledge from model legumes to economically important cool-season pulse crops (e.g., pea, lentil, and chickpea), warm-season food legumes (e.g., peanut and common bean), and forage legumes (e.g., alfalfa and clover) will provide a major opportunity for advancing their genomic resources (Young et al. 2005; Young and Udvardi 2009; Varshney and May 2012). For example, it can foster gene identification in species that have received less attention owing to their large genomes (Gepts et al. 2005). Sequencing of other legumes, including common bean (Ramírez et al. 2005; David et al. 2008), is progressing rapidly, and draft genome sequences of some of them, such as pigeonpea (Varshney et al. 2009, 2011; Singh et al. 2012) and chickpea (Garg et al. 2011; Varshney et al. 2013), are already available.

Genome sequencing projects have produced a wealth of sequence data, which need to be properly analysed to enable prediction of potential functional elements, genes and transcription factors. Rapid progress has been made in developing bioinformatics tools and databases for such analyses and for understanding the various features of a sequenced genome (Kushwaha et al. 2008; Dutt et al. 2010; Kumari et al. 2010). Similarly, in-silico comparative genomics provides a great opportunity for unravelling the behaviour of genes and genomes (Udvardi 2002; Kushwaha et al. 2012). Comparative genomics uses information about signature regions at the gene level and syntenic relationships at the genome level to understand the structure and function of a newly sequenced genome, as well as to deduce its evolutionary relationships (Goffard and Weiller 2006). Gene hunting, i.e., the investigation of coding and non-coding functional elements of the genome, is another important application of comparative genomics (Yadav et al. 2007; Kushwaha et al. 2011). It attempts to discover both similarities and differences in the genes, proteins, RNA, and regulatory regions of different organisms in order to infer structural and functional relationships. Comparative genomics is now focusing on the discovery of regulatory regions and siRNA molecules in the genome. The biological datasets available in web repositories allow comparative analysis and validation of new data against existing datasets. Databases maintained under a common data model, such as those of NCBI, are integrated with each other to enable their effective utilization. The experimental datasets thus give us opportunities to understand the functional and biological roles of unknown genes/proteins from different legumes, and the legume-related biological databases provide a valuable information resource for research and analysis (Table 12.1). Ultimately, however, the main aim of bioinformatics is the identification of the regulatory mechanisms and functions of genomes and their evolution (Marla and Singh 2012).

Table 12.1 Important biological databases related to legumes

Bioinformatics for Legume Genome Annotation

Sequencing determines the primary structure of an unbranched biopolymer, and functional elements can then be predicted from the DNA or protein sequence. Sequencing a genome is a complicated task in which DNA sequencing is used to determine the order of nucleotides in small DNA fragments that together make up the genome. First-generation DNA sequencing used the chain termination method developed by Frederick Sanger and co-workers (Sanger and Coulson 1975; Sanger et al. 1977). This technique relies on sequence-specific termination of a DNA synthesis reaction using modified nucleotide substrates. However, newer technologies such as pyrosequencing are gaining an increasing share of the sequencing work, and the next-generation DNA sequencers that achieve sequencing by synthesis are based on this approach. These sequencers do not require in vivo library construction and are faster and much cheaper to use, so they are being used for rapid genome sequencing. An example of the nearly complete C. cajan genome, sequenced by a group of Indian scientists using second-generation DNA sequencers, is depicted in Fig. 12.1.

Fig. 12.1 An example of the pigeonpea (C. cajan) genome sequence deposited in NCBI by a group of Indian scientists [Reprinted from Singh N. K., Gupta D. K., Jayaswal P. K., Mahato A. K., Dutta S., Singh S., Bhutani S., et al. (2012) The first draft of the pigeonpea genome sequence. J. Plant Biochem Biotechnol 21: 98–112 with permission from Springer Science + Business Media]

After sequencing, the reads must be assembled and the resulting sequence annotated. Genome assembly is a very difficult computational task owing to the large numbers of identical sequences (repeats) found in genomes. These repeats can be thousands of nucleotides long, and some of them occur at many different locations. In a shotgun sequencing project, the entire DNA from a source (usually a single organism, ranging from a bacterium to a mammal) is first fragmented into millions of small pieces. These pieces are then “read” by automated sequencers, and each read can be up to 1,000 nucleotides long. A genome assembly algorithm works by aligning all the reads with one another to detect every place where two reads overlap. Overlapping reads are merged to form contigs, and linking information between contigs is then used to create scaffolds. Finally, the scaffolds are positioned along the physical map of the chromosomes.
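The overlap-and-merge idea described above can be illustrated with a toy greedy assembler. This is only a minimal sketch under strong assumptions (error-free reads, a single strand, exact suffix/prefix overlaps); the function names and the example reads are hypothetical, and production assemblers use far more elaborate graph-based methods.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`
    (0 if the overlap is shorter than `min_len`)."""
    start = 0
    while True:
        # Look for the first min_len characters of b somewhere in a.
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        # The candidate is a true overlap only if it runs to the end of a.
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap
    until no overlaps of at least `min_len` remain; returns the contigs."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    olen = overlap(a, b, min_len)
                    if olen > best_len:
                        best_len, best_i, best_j = olen, i, j
        if best_len == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads sampled from the sequence ATGGCGTACCTTA
reads = ["ATGGCGT", "GCGTACC", "TACCTTA"]
print(greedy_assemble(reads))
```

Because the greedy step always takes the largest overlap, two copies of a repeat can be merged incorrectly, which is one reason why repeats make real assembly so difficult.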

Assembly tools and packages have been developed by various research groups; e.g., the Short Oligonucleotide Analysis Package (SOAP), which includes de novo assembly tools, was developed by the Beijing Genomics Institute (BGI).

Genome annotation elucidates biological information from the assembled genome sequence. In this process, which includes “gene prediction”, one identifies functional elements in the genome and attaches biological information to them. Genome annotation can be done by the methods described by Kawaji and Hayashizaki (2008). Basic annotation can be done using the Basic Local Alignment Search Tool (BLAST) to find similarities and differences between sequences, but nowadays more and more additional information is added to annotation platforms. Completely annotated genome data are deposited in biological databases such as NCBI, DDBJ, Phytozome, Ensembl and EMBL. These databases use genome context information, experimental datasets, and integrated tools and resources to provide gene and genome annotations through their subsystems approach. For sequence assembly, the AMOS tools, currently maintained by the University of Maryland, can be used for manipulating sequence files, and CABOG assembles large genomic DNA sequences produced by whole-genome shotgun sequencing. Important annotation tools include Apollo, BLAST, Parser, MATLAB, the Bioconductor packages in R, Artemis and AAT. Manatee is a web-based gene evaluation and genome annotation tool for visualization, modification and storage of genome annotation data. PASA is a eukaryotic genome annotation tool that exploits spliced alignments of expressed transcript sequences to model gene structures. Several bioinformatics tools are available for annotation, genome sequence alignment, de novo assembly, evolutionary analysis and RNA sequence analysis; some of these are listed in Table 12.2.

Table 12.2 Bioinformatics software available for genome annotation and de novo assembly

Hiremath et al. (2011) carried out a large-scale transcriptome analysis in chickpea (C. arietinum L.) using next-generation sequencing technologies such as Roche 454 and Illumina/Solexa. They identified a total of 103,215 tentative unique sequences (TUSs) and assigned functions to 49,437 (47.8 %) of them. Comparison of the chickpea TUSs with the M. truncatula genome assembly (Mt 3.5.1 build) resulted in 42,141 aligned TUSs with putative gene structures (including 39,281 predicted intron/splice junctions). These TUSs were also used to identify 728 SSR, 495 SNP, 387 conserved orthologous sequence (COS) and 2,088 intron-spanning region (ISR) markers. Similarly, Kudapa et al. (2012) produced a transcriptome assembly for pigeonpea, referred to as CcTA v2, which comprised 21,434 transcript assembly contigs (TACs), of which 16,622 (77.5 %) could be mapped onto the soybean genome. Based on knowledge of intron junctions, 10,009 primer pairs were designed from 5,033 TACs for amplifying ISRs. By in silico mapping of BAC-end-derived SSR loci of pigeonpea on the soybean genome as a reference, putative chromosome-level mapping positions were predicted for 6,284 ISR markers covering all 11 pigeonpea linkage groups. These transcript assemblies and markers provide a useful resource for basic and applied research on genome analysis and crop improvement in chickpea and pigeonpea.

The localization of ORFs, optimization of gene structure, identification of coding regions and location of regulatory motifs together describe the organization of a gene family and its associated functions. Identifying a gene family is a good way to investigate how its members are related to each other and how they have evolved (Thornton and DeSalle 2000). The availability of EST datasets for a genome gives a better understanding of transcripts with tissue-specific expression. With bioinformatics tools and databases, anyone can compare experimental datasets with a query sequence. In-silico approaches utilize information from expressed sequence tags and from proteins, often derived from mass spectrometry, to improve genomic annotations. A variety of software tools have been developed to help scientists in their quest for gene and genome annotations. The locations of genes and of other genetic control elements are often described as the biological “parts list” for the assembly of an organism. Scientists are still at an early stage of delineating this “parts list” and of understanding how all the parts fit and work together. Genes and genetic control elements can be investigated using publicly available biological databases and tools accessible via the web and other electronic means. Statistical tools for the analysis of deep sequencing data include ANDES, while DAGchainer computes chains of syntenic genes within complete genome sequences. DNA sequence analysis tools such as the k-mer tools, ESTmapper, Snapper (for mapping reads) and ATAC are available for aligning sequences and genomes, and MUMmer can be used for rapid alignment of entire genomes.

Bioinformatics for Sequence Analysis

In bioinformatics, sequence analysis refers to the process of subjecting a DNA, RNA or protein sequence to analytical methods and algorithms in order to understand its features, function, structure, or evolution. The methodologies used include biological database mining, comparative analysis and sequence alignment. With the development of statistical algorithms and matrix-based tools for the prediction of gene and protein sequences, the rate of addition of new sequences to the databases has increased exponentially. Such a collection of sequences does not, by itself, increase the scientist’s understanding of the biology of organisms. However, comparing these new sequences to those with known functions is a key way of understanding the biology of the organism from which a new sequence comes. Thus, sequence analysis can be used to assign functions to genes and proteins by studying the similarities between the compared sequences. Many tools and techniques are now available that perform sequence comparisons (sequence alignment) and analyze the resulting alignments to understand the underlying biology. Sequence analysis in molecular biology includes a wide range of applications, some of which are listed below.

  1. Comparison of different sequences in order to detect similarities among them and, often, to infer if the sequences are related (homologous).

  2. Identification of intrinsic features of the different sequences, such as active sites, post-translational modification sites, gene structures, reading frames, distributions of introns and exons and the regulatory elements.

  3. Identification of sequence differences and variations such as point mutations and single nucleotide polymorphisms (SNPs) in order to develop the genetic markers.

  4. Unraveling the evolutionary process and assessment of genetic diversity of the sequences and the organisms.

  5. Identification of molecular structure from sequence data alone.

Sequence analysis is based on sequence alignment, i.e., comparison between query and subject sequences, in which two or more sequences can participate. Alignment between two sequences is called pairwise alignment, and alignment among more than two sequences is called multiple sequence alignment. Two approaches, global and local alignment, are used to search for runs of identical or similar characters and thereby find similarities and dissimilarities within sets of sequences. Global alignment finds the best alignment across the whole length of the two sequences and therefore forces alignment even in regions that differ. Local alignment instead concentrates on regions of high similarity within parts of the participating sequences. The Basic Local Alignment Search Tool (BLAST) is an example of local alignment (Fig. 12.2). Five main flavors of basic BLAST are available for comparing a query with subject sequences: for a protein query, one can use BLASTp and tBLASTn; for a nucleotide query, BLASTn, BLASTx or tBLASTx can be used. Other specialized BLAST programs are also available for conserved domain detection, SNP detection, global sequence alignment, etc.
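The difference between global and local alignment can be made concrete with the two classical dynamic-programming formulations, Needleman-Wunsch (global) and Smith-Waterman (local). The sketch below computes only the optimal score, not the alignment itself; the scoring values (match +1, mismatch -1, gap -2) and the example sequences are assumptions chosen for illustration, and BLAST itself uses a faster heuristic rather than this exhaustive procedure.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Best pairwise alignment score by dynamic programming.

    local=False gives the Needleman-Wunsch (global) score,
    local=True gives the Smith-Waterman (local) score.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment: a poor prefix can always be discarded.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# Two hypothetical sequences that share only the core "ACGT"
s1, s2 = "TTTACGTGGG", "CCACGTAA"
print("global:", alignment_score(s1, s2))
print("local: ", alignment_score(s1, s2, local=True))
```

The local score rewards the shared core alone, whereas the global score is pulled down by the dissimilar flanks that it is forced to align.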

Fig. 12.2 A page showing basic local alignment search tool (BLAST; http://blast.ncbi.nlm.nih.gov/)

Gene Identification and Characterization Using Comparative Genomics/Proteomics

In computational biology, gene hunting or gene prediction refers to the process of identifying the regions of genomic DNA that function as genes, i.e., encode proteins or various types of RNA molecules, or as other functional elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced. Earlier, “gene finding” was based on cumbersome experiments on living cells and organisms, but the availability of comprehensive genome sequences and powerful computational resources has greatly facilitated it. Some of the tools and database servers dedicated to gene prediction are listed in Table 12.3.
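As a simple illustration of computational gene finding, the following sketch scans both strands of a DNA sequence for open reading frames (an ATG start codon followed in frame by a stop codon). It is a deliberately naive example: the gene prediction servers referred to above rely on statistical gene models that also handle introns and splice sites, which this sketch ignores. The function names, the `min_codons` threshold and the test sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end, orf) tuples for every ATG...stop
    open reading frame of at least `min_codons` codons (stop included).

    Coordinates are 0-based, end-exclusive, and for the "-" strand they
    refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first start codon in this frame
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None                   # look for the next ORF
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))
```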

Table 12.3 A list of some important gene prediction servers

The genome sequence of the “Asha” variety of pigeonpea was obtained using GS-FLX Phase D chemistry and GS-FLX Titanium chemistry, and the reads were assembled using “Newbler GS De Novo assembler version 2.5.3”, which compares all sequence reads pairwise and joins reads with overlaps into contigs (Singh et al. 2011). The consensus sequence of a contig is determined from an average of all aligned reads at each nucleotide position, and overlapping contigs are finally merged to make scaffolds. The finished sequence was passed through the fgenesh tool of the Molquest software using Arabidopsis thaliana gene models as a reference. Predicted genes of >500 bp were BLAST-searched against the NCBI database, the search output was processed using BLAST Parser software, and the gene annotations were manually curated and categorized by function. Singh et al. (2012) predicted a total of 59,515 genes, the largest being 11,523 bp and the smallest 501 bp. Of these, 47,004 were protein-coding genes, of which 1,213 were related to plant defense and 152 were involved in abiotic stress tolerance.

Comparative phylogenetic studies within the legume family revealed high syntenic relationships between sequenced legumes and other important legumes (Wojciechowski et al. 2004), e.g. between Medicago truncatula and pea (Kaló et al. 2004), and common bean and soybean (Lee et al. 2001), but limited synteny is also reported to be present among other legumes, e.g., between cool-season and warm-season legumes (Zhu et al. 2005). Whole genome sequencing of some important legumes is likely to be completed in the near future, and this will facilitate a comprehensive assessment of synteny. Comparative genomics for synteny studies can accelerate exploitation of genomic resources, and facilitate more rapid progress in research efforts in an efficient and cost-effective manner. A detailed study of the syntenic relationships is a critical issue to be addressed for better allocation of genomic information from sequences of model legumes to other legumes and to other crop species. Based on conservation of synteny between pigeonpea and soybean genomes, Singh et al. (2012) found that chromosomes 1, 3, 4 and 9 of pigeonpea showed the maximum conservation with chromosomes 2, 5, 7, 8, 12, 13, 15 and 17 of soybean. Chromosome 1 of pigeonpea showed the highest number of matches with chromosomes 8 and 5 of soybean. Similarly, chromosome 2 of pigeonpea showed the maximum number of hits with chromosomes 19 and 10 of soybean. 
Pigeonpea chromosome 3 showed the maximum number of hits with chromosomes 13 and 15 of soybean, and chromosome 4 with chromosomes 12 and 13. Chromosome 5 showed the highest number of matches with chromosomes 13, 12 and 17 of soybean, chromosome 6 with chromosomes 9 and 3, chromosome 9 with chromosomes 2, 12, 3, 11 and 16, chromosome 10 with chromosomes 18, 17 and 2, chromosome 11 with chromosomes 14 and 18, and chromosome 7 with chromosomes 10 and 20, while chromosome 8 of pigeonpea showed only minor synteny with chromosomes 13 and 14 of soybean. Overall, however, Singh et al. (2012) concluded that synteny between the pigeonpea and soybean genomes was limited.

Bioinformatics for Computational Evolutionary Biology

A phylogenetic tree (phylogeny) is a textual or visual representation of the evolutionary relationships among groups of organisms, among a family of related nucleotide or protein sequences, or among other entities, based upon similarities and differences in their physical and genetic characteristics. Such a study can use morphological features (e.g., shape, size, length, etc.) and molecular data (e.g., DNA and protein sequences). The taxa/entities joined together in the tree are implied to have descended from a common ancestor. Phylogenetic trees are useful in bioinformatics, systematics and comparative biology. Trees may be rooted or unrooted, and the main approaches to phylogeny reconstruction are distance-based methods, topology search methods and Bayesian methods. Some phylogenetic tree terminologies are shown in Fig. 12.3.
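To show what a distance-based method does, the sketch below clusters a few aligned sequences with UPGMA, one of the simplest distance-based procedures, using the proportion of differing sites (p-distance) as the distance. UPGMA is chosen here only for illustration and is not necessarily the method used for the trees in this chapter; the sequence names and sequences are hypothetical, and the output is a Newick-style topology without branch lengths.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(seqs):
    """Cluster aligned sequences {name: sequence} by UPGMA and return the
    resulting rooted topology as a Newick-style string."""
    sizes = {name: 1 for name in seqs}            # cluster name -> number of leaves
    dist = {frozenset((a, b)): p_distance(seqs[a], seqs[b])
            for a, b in combinations(seqs, 2)}
    while len(sizes) > 1:
        pair = min(dist, key=dist.get)            # closest pair of clusters
        a, b = sorted(pair)
        merged = f"({a},{b})"
        new_size = sizes[a] + sizes[b]
        for other in sizes:
            if other in pair:
                continue
            # Size-weighted average distance from the merged cluster to `other`.
            d = (dist[frozenset((a, other))] * sizes[a]
                 + dist[frozenset((b, other))] * sizes[b]) / new_size
            dist[frozenset((merged, other))] = d
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        del sizes[a], sizes[b]
        sizes[merged] = new_size
    return next(iter(sizes)) + ";"


# Hypothetical aligned sequences
seqs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACTA", "D": "TCGAACTA"}
print(upgma(seqs))
```

UPGMA assumes a constant rate of evolution along all lineages, so its rooted output should be read with that limitation in mind.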

Fig. 12.3 Phylogenetic tree terminologies

A rooted phylogenetic tree defines the common ancestor of all the entities at the leaves of the tree, i.e., the operational taxonomic units (OTUs). One example, a rooted phylogenetic classification of the Toll/interleukin-1 receptor (TIR) domain among different organisms, depicts the way this family might have been derived during evolution (Fig. 12.3). Phylogenetic relationships among genes can help to predict which genes might have similar functions, e.g., through ortholog detection.

The TIR domain is mainly involved in plant immune responses against various pathogens. An example of Toll/interleukin-1 receptor classification is provided here. The TIR domain of C. cajan was used to find homologues in different organisms using the Basic Local Alignment Search Tool (BLAST). Selected homologues from different species were used for multiple sequence alignment and phylogenetic classification: ClustalW was used for multiple sequence alignment, and MEGA was used to find the best tree topology. Figure 12.4 shows the rooted tree inferred from selected TIR domain sequences of Populus, Vitis, Solanum, Arachis, Medicago, Glycine, Cajanus and Oryza. Interestingly, the TIR domain of Oryza spp. forms an outgroup, while the remaining TIR domains are much more closely related to one another; this may be expected because Oryza is a monocot.

Fig. 12.4 Example of rooted tree of TIR domain homologues from C. cajan with six other plant species (Singh et al., unpublished data)

The TIR domain identified from C. cajan was further used to determine the number of TIR loci present in the Cajanus genome, and a total of 148 TIR domains were identified in the available C. cajan genome sequence datasets (Taxid: 3821). Figure 12.5 shows an unrooted tree depicting the various TIR domains derived from the Cajanus genome itself. Unrooted trees specify relationships but do not depict the evolutionary path. Various online and offline software tools are available for phylogenetic studies (Table 12.4). Legume diversity and evolution in a phylogenetic context have been reviewed earlier by Doyle and Luckow (2003).

Fig. 12.5 Example of unrooted tree of identified TIR domains from C. cajan

Table 12.4 Tools and servers for multiple sequence alignment and phylogenetic analysis

In-Silico Analysis for Gene Expression Data

An expressed sequence tag (EST) is a short, usually terminal, sequence of a cDNA. An EST thus results from one-shot sequencing of a cloned mRNA, i.e., several hundred base pairs of sequence starting from one end of a cDNA. The cDNAs used for EST generation are typically individual clones from a cDNA library. ESTs may be used to identify gene transcripts, and they are instrumental in gene discovery and gene sequence determination. The identification of ESTs has proceeded rapidly, and ~73 million ESTs are now available in the public database GenBank. dbEST is a division of GenBank established in 1992, and the data in dbEST are submitted directly by laboratories worldwide. Using EST datasets, anyone can infer gene function from expression data. ESTs contain enough information to permit the design of precise probes for DNA microarrays, which can be used to measure gene expression. For normalization and management of expression microarray data, one can use Ginkgo (Comparative Genomic Hybridization package). The TM4 and Magnolia packages are also designed for microarray data management for researchers who use PFGRC microarrays. The programme SNP Filter Scripts can be used to detect false-positive SNP calls in raw data from Affymetrix GeneChip resequencing arrays. Several other freely available tools, including MAGIC and CLUSFAVOUR, can be used for microarray data analysis. A short nucleotide variation analysis server is also available for this type of study (Fig. 12.6).

Fig. 12.6 Short nucleotide variation BLAST page

Bioinformatics in Legume Nutritional Genomics

By manipulating the promoter regions of genes encoding seed-specific proteins, one can improve the nutritional quality of a crop species. Bioinformatics tools can play a major role in the study of promoter regions and in the identification of cis-acting (cis-regulatory) elements. A cis-acting element is a region of DNA or RNA that regulates the expression of genes located on the same chromosome. The term is derived from the Latin word cis, which means “on the same side as”. Cis-regulatory elements are often binding sites for one or more trans-acting factors. They may be located upstream of the coding sequence of the gene concerned (in the promoter region or even further upstream), in an intron, or downstream of the coding sequence. A transcription factor (sometimes called a sequence-specific DNA-binding factor) is a protein that binds to specific DNA sequences, thereby controlling the flow of genetic information (transcription) from DNA to mRNA. Transcription factors perform this function alone or with other proteins in a complex, by promoting (as an activator) or blocking (as a repressor) the recruitment of RNA polymerase to specific genes. Therefore, identification of potential cis-acting elements can help in improving the nutritional quality of seeds and/or other traits of economic/agronomic value.

Databases of plant cis-acting regulatory elements such as PlantCARE and PLACE can be used as portals for in-silico analysis of promoter sequences of plant genes (Fig. 12.7). Yadav et al. (2007) identified seed storage protein promoter-specific cis-acting elements in cloned and sequenced promoter regions of seed storage protein genes from different cultivars of wheat, rice and oat. PlantProm DB provides a collection of proximal promoter sequences for RNA polymerase II with experimentally determined transcription start sites from various plant species. PlnTFDB (plntfdb.bio.uni-potsdam.de/) and PlantTFDB (http://planttfdb.cbi.pku.edu.cn/) are important databases for retrieving and investigating transcription factor genes. In addition, species-specific transcription factor databases are available online (Fig. 12.8).
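The kind of promoter scan performed by such servers can be sketched as a pattern search with IUPAC degenerate bases. The motif set below is a small illustrative assumption with simplified consensus patterns; it is not an export from PLACE or PlantCARE, the promoter sequence is hypothetical, and only the forward strand is scanned.

```python
import re

# Illustrative, simplified consensus motifs (assumed for this sketch only).
MOTIFS = {
    "TATA-box": "TATAWA",
    "G-box": "CACGTG",
    "CAAT-box": "CCAAT",
}

# IUPAC nucleotide codes translated into regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "S": "[CG]", "R": "[AG]", "Y": "[CT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def scan_promoter(seq, motifs=MOTIFS):
    """Return sorted (position, motif_name, matched_sequence) tuples for
    every occurrence of each motif on the forward strand (0-based)."""
    seq = seq.upper()
    hits = []
    for name, pattern in motifs.items():
        regex = "".join(IUPAC[ch] for ch in pattern)
        # A lookahead is used so that overlapping occurrences are all reported.
        for m in re.finditer(f"(?=({regex}))", seq):
            hits.append((m.start(), name, m.group(1)))
    return sorted(hits)


promoter = "GGCCAATCTCACGTGTTTATAAATAGC"   # hypothetical promoter fragment
for pos, name, site in scan_promoter(promoter):
    print(pos, name, site)
```

Short motifs such as these occur frequently by chance, so hits from any pattern scan are candidates that still need experimental or comparative support.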

Fig. 12.7 Plant cis-acting elements prediction server (PLACE; http://www.dna.affrc.go.jp/PLACE/)

Fig. 12.8 Plant transcription factor database PlantTFDB (http://planttfdb.cbi.pku.edu.cn/)

Prediction for Function of Protein Sequences

Structural visualization, 3D structure prediction, classification and structural alignment play important roles in predicting the function of a protein sequence of interest. Homology modeling, threading and ab-initio methods can be used for protein structure prediction. Homology modeling (comparative modeling) constructs an atomic-resolution model of the “target” protein using an experimental three-dimensional structure of a related homologous protein (the “template”) determined by NMR or X-ray techniques. It relies on the identification of one or more known protein structures likely to resemble the structure of the query sequence, and on the production of an alignment that maps residues in the query sequence to residues in the template sequence. Protein structures have been shown to be more conserved than amino acid sequences among homologues, but sequences with less than 20 % sequence identity can have very different structures. Several servers for homology modeling, threading and ab-initio prediction are available in the public domain (Table 12.5). Commercial software such as MOE, Schrödinger and Discovery Studio can also be used for protein modeling and simulation. For ab-initio or de novo protein modeling, one can use I-TASSER and ROBETTA, which are freely available. Using these protein modeling servers, one can predict the three-dimensional structure of the target protein.
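Since template selection depends on the sequence identity between target and template, a small helper for computing identity from a pairwise alignment is sketched below. The definition used here (identical residues divided by the number of columns in which neither sequence has a gap) is one of several conventions in use and is an assumption of this sketch; the aligned sequences are hypothetical.

```python
def percent_identity(aligned_query, aligned_template):
    """Percent identity over the columns of a pairwise alignment in which
    neither sequence has a gap ('-')."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(q, t) for q, t in zip(aligned_query, aligned_template)
             if q != "-" and t != "-"]
    if not pairs:
        return 0.0
    matches = sum(q == t for q, t in pairs)
    return 100.0 * matches / len(pairs)


# Hypothetical target/template alignment
identity = percent_identity("MKT-AYIAK", "MKTLAYLA-")
print(f"{identity:.1f} % identity")
if identity < 20.0:
    print("below the 20 % level at which structures may differ substantially")
```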

Table 12.5 List of servers for homology modeling, threading and ab-initio based structure prediction

Qualitative and Quantitative Study of Predicted Models

Finally, predicted 3D models can be subjected to a series of tests to assess their internal consistency and reliability. The quality of a model can be checked with Verify3D [http://nihserver.mbi.ucla.edu/Verify_3D/], ERRAT [http://nihserver.mbi.ucla.edu/ERRATv2/], etc. The stereochemical properties based on backbone conformation can be evaluated by inspecting the Psi/Phi/Chi/Omega angles using the Ramachandran plot of the PDBsum database [http://www.ebi.ac.uk/pdbsum/], RAMPAGE [http://mordred.bioc.cam.ac.uk/~rapper/rampage.php], etc. Quantitative analysis, such as accessible surface area prediction, can be done using the Volume Area Dihedral Angle Reporter [VADAR; http://vadar.wishartlab.com/]. Bond lengths and bond angles of the model can be checked against standard values using WHAT IF [http://swift.cmbi.ru.nl/whatif/]. ResProx (Resolution-by-proxy; http://www.resprox.ca/) can be used to estimate the equivalent resolution of a model. For example, we have predicted a 3D model of the Toll/interleukin-1 receptor (TIR) domain of R genes from C. cajan using comparative homology modeling, and the best evaluated model has been deposited in the Protein Model DataBase (PMDB; http://mi.caspur.it/PMDB/) (Fig. 12.9).
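The backbone angles plotted in a Ramachandran plot are dihedral angles defined by four consecutive atoms: phi by C(i-1), N(i), CA(i), C(i) and psi by N(i), CA(i), C(i), N(i+1). The sketch below computes a signed dihedral angle from four Cartesian coordinates using only the standard library; the function name and the test coordinates are hypothetical, and the validation servers named above apply considerably more than this single calculation.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond for four points.

    For phi pass C(i-1), N, CA, C; for psi pass N, CA, C, N(i+1).
    """
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)            # central bond
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)          # normal of the plane p0-p1-p2
    n2 = _cross(b1, b2)          # normal of the plane p1-p2-p3
    b1_len = math.sqrt(_dot(b1, b1))
    b1_unit = tuple(x / b1_len for x in b1)
    x = _dot(n1, n2)
    y = _dot(_cross(n1, n2), b1_unit)
    return math.degrees(math.atan2(y, x))


# Hypothetical coordinates giving a +90 degree torsion
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```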

Fig. 12.9 Structure of TIR domain (PM0078097) from C. cajan developed using TIR domain structure from Arabidopsis thaliana (3JRN) based on homology modelling [Courtesy of Vinay Kumar Singh]

Integrated Bioinformatics Tools

Integrated tools such as MEME and MAST are useful servers for motif elucidation (Fig. 12.10). For functional elucidation and characterization of proteins, one can use InterProScan, PROSITE, Pfam, ProDom, etc. (Fig. 12.11). SWISS-PROT, dbSNP and SNP flanks tools and databases can be used for SNP/variant detection. An example of the signature region of the Toll/interleukin-1 receptor domain from C. cajan is given in Fig. 12.12.

Fig. 12.10 A server to discover motifs (highly conserved regions) in groups of related DNA or protein sequences

Fig. 12.11 Server for protein functional elucidation based on domain and signature motifs

Fig. 12.12 Toll-like interleukin receptor domain from C. cajan

Molecular Docking

In bioinformatics, molecular docking is a method that predicts the preferred orientation of one molecule relative to a second when the two are bound to each other in a stable complex. Knowledge of the possible orientations can in turn be used to predict the binding affinity between the two molecules using energy scoring functions. Using a molecular docking approach, one can predict the binding orientation of a ligand (small molecule) to its protein target (receptor), together with the total energy and shape energy terms, and thereby predict the affinity and activity of the small molecule. The interaction between ligand and receptor protein can result in activation or inhibition of the protein. Two approaches are the most popular among the different molecular docking strategies. The first uses a matching technique that describes the protein and the ligand as complementary surfaces. The second simulates the actual docking process, in which the ligand–protein interaction energies are calculated. Molecular docking plays an important role in rational drug design. To study the interaction between a ligand (inhibitor or cofactor) and a protein target, one can use HEX, BIOSOLVEIT, DOCKING SERVER and other servers listed in Table 12.6.

Table 12.6 List of servers related to inhibitor, cofactor and protein docking

Plant–Pathogen Interactions

Many microbes establish a wide range of interactions with host plants; some of these are pathogenic and some are symbiotic. Such interactions involve complex recognition events between the plant and the microbe, leading to a cascade of signalling events and the regulation of a number of genes required for, or associated with, the interaction. The combined components of the transcriptomes of both plant and microorganism that are expressed during the interaction give rise to the term “interaction transcriptome”. High-throughput methods to study differential gene transcription, or proteomics, coupled with bioinformatics will accelerate our understanding of the molecular bases of plant–microbe interactions (Birch and Kamoun 2000; Samac and Graham 2007). For example, Soria-Guerra et al. (2010) conducted a transcriptome profiling study of the response to soybean rust (Phakopsora pachyrhizi) to identify soybean rust resistance genes in Glycine tomentella. Among 38,400 genes monitored using a soybean microarray, 1,342 genes exhibited significant differential expression between uninfected and P. pachyrhizi-infected leaves at 12, 24, 48, and 72 h post-inoculation (hpi) in both rust-susceptible and rust-resistant genotypes. Differentially expressed genes were grouped into 12 functional categories, and a large number of these genes relate to basic plant metabolism. These findings provided better insight into the mechanisms underlying resistance and the general activation of plant defense mechanisms in response to rust infection in soybean.

Further, sequencing of EST libraries from pathogen-inoculated or elicitor-treated plants and microarray transcript analyses have enabled the elucidation of genome-wide gene expression changes associated with defence (Ameline-Torregrosa et al. 2006). Samac et al. (2011) used microarray analysis to identify the genes associated with disease defence responses in M. truncatula. They compared the genes expressed in response to three pathogens (Colletotrichum trifolii, Erysiphe pisi and Phytophthora medicaginis) and identified genes unique to an interaction.
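
Comparing gene lists across interactions in this way amounts to simple set operations. The following is a minimal sketch, not the analysis pipeline of the cited study. It assumes that the differentially expressed genes for each pathogen are already available as sets, and the gene identifiers shown are hypothetical.

```python
def genes_unique_to_each(responses):
    """Map each pathogen to the genes that respond to it and to no other.

    `responses` maps a pathogen name to the set of gene IDs called
    differentially expressed in that interaction.
    """
    unique = {}
    for pathogen, genes in responses.items():
        others = set()
        for other, other_genes in responses.items():
            if other != pathogen:
                others |= set(other_genes)
        unique[pathogen] = set(genes) - others
    return unique


def genes_shared_by_all(responses):
    """Genes responding in every interaction (needs at least one entry)."""
    return set.intersection(*(set(g) for g in responses.values()))


# Hypothetical gene IDs, for illustration only
example = {
    "Colletotrichum trifolii": {"geneA", "geneB", "geneC"},
    "Erysiphe pisi": {"geneB", "geneC", "geneD"},
    "Phytophthora medicaginis": {"geneC", "geneE"},
}
```

With the hypothetical input above, `genes_unique_to_each(example)` assigns one interaction-specific gene to each pathogen, and `genes_shared_by_all(example)` returns the single gene common to all three responses.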

Fusarium wilt, the most serious disease of pigeonpea, is a common vascular wilt disease caused by the fungus Fusarium sp. The release of draft genome assemblies of six strains of different Fusarium spp. (Rep and Kistler 2010) provides opportunities to understand the host–pathogen interaction at the computational level. In this context, bioinformatics approaches help in understanding the host–pathogen interaction at the protein level, where protein–protein interactions are used to investigate the biological process. Protein–protein interactions are interactions between two or more proteins that bind together to carry out their biological function. Protein–protein docking helps in understanding these interactions at the computational level. HEX, Z-DOCK and other tools are commonly used for protein–protein interaction studies (Fig. 12.13).

Fig. 12.13 An automated protein–protein interaction server

Bioinformatics in Molecular Marker Development

For trait analysis using association mapping approaches, and for various other population studies, including patterns of evolution, population structure and genetic diversity, a number of software packages are available in the public domain (Table 12.7). Bioinformatics plays a very important role in molecular marker development, for which several bioinformatics tools and servers are available (Table 12.8). Well-optimized primers are essential for good specificity and efficiency. Primer pairs can be designed from genomic, mRNA, cDNA and SNP-based sequences. One can design degenerate, expression and universal primers using the servers listed in Table 12.8. For example, Jayashree et al. (2006) developed a database of EST-based simple sequence repeats from cereals and legumes. Based on the available resources, anyone can design EST-SSR-based markers for wet-lab experimentation. Large-scale transcriptome assemblies generated using next-generation sequencing technologies, such as Roche/454 and Illumina/Solexa, are now used for the development of molecular markers, which will serve as a useful resource to accelerate genetic research and breeding applications in legumes. For example, Hiremath et al. (2011) developed 728 SSR, 495 SNP, 387 conserved orthologous sequence (COS) and 2,088 intron-spanning region (ISR) markers in chickpea. Kudapa et al. (2012) predicted 6,284 ISRs covering all 11 pigeonpea linkage groups.

Table 12.7 Statistical analysis tool and software details with uniform resource locator
Table 12.8 List of servers used in molecular marker development
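
Two of the basic quantities checked during primer optimization, GC content and melting temperature, can be sketched as follows. This is only an illustration and not the algorithm of any server in Table 12.8. It uses the Wallace rule, Tm = 2(A+T) + 4(G+C), which is a rough approximation for short oligonucleotides, and the acceptance thresholds are assumed values chosen for the example.

```python
def gc_content(primer):
    """Percentage of G and C bases in a primer."""
    primer = primer.upper()
    return 100.0 * sum(primer.count(base) for base in "GC") / len(primer)


def wallace_tm(primer):
    """Approximate melting temperature (deg C) by the Wallace rule.

    Tm = 2*(A+T) + 4*(G+C); only a rough guide for short oligos.
    """
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def passes_basic_checks(forward, reverse, gc_range=(40.0, 60.0), max_tm_diff=5):
    """Check a primer pair against assumed GC and Tm-difference thresholds."""
    low, high = gc_range
    gc_ok = all(low <= gc_content(p) <= high for p in (forward, reverse))
    tm_ok = abs(wallace_tm(forward) - wallace_tm(reverse)) <= max_tm_diff
    return gc_ok and tm_ok
```

Dedicated primer design tools additionally screen for hairpins, primer dimers and off-target binding, which this sketch does not attempt.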

Mishra et al. (2012) retrieved a total of 18,552 pea EST sequences (equivalent to 11.3 MB) from the EST database available in the NCBI public domain and analysed them for repeat patterns using the Tandem Repeats Finder program at http://c3.biomath.mssm.edu/trf.html, followed by their assembly using the CAP3 software program (Huang and Madan 1999). After pre-processing, they identified SSR-containing sequences with MISA (MIcroSAtellite identification tool, http://pgrc.ipk-gatersleben.de/misa/), a Perl script-based program. Assembly yielded 10,800 unigenes, and screening of these unigenes with MISA revealed 2,612 (14.1 %) eSSRs in 2,395 (12.9 %) SSR-containing ESTs, from which 577 (24.1 %) primer pairs were designed. Of these, 68 randomly selected primer pairs showed a high rate (48–85 %) of transferability to leguminous species, with a high level of polymorphism, good reproducibility and an average of 3.8 alleles/locus. Similarly, De Caire et al. (2012) retrieved a total of 6,327 mRNA sequences and screened them with a Java-based programme to design gene-based SSR markers. They successfully identified 45 new polymorphic eSSR markers. The eSSRs identified in these two studies will be used in linkage mapping analyses and provide a good scaffold for comparative mapping in pea and other sequenced legumes.
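
The core step of SSR detection, scanning a sequence for perfect tandem repeats of short motifs, can be sketched with regular expressions. This is not the MISA implementation; it is a simplified illustration that handles only perfect repeats of di- to hexanucleotide motifs, and the minimum repeat counts in `MIN_REPEATS` are assumed values chosen for the example.

```python
import re

# Minimum number of repeats per motif length (assumed thresholds)
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    for k in range(1, n):
        if n % k == 0 and motif[:k] * (n // k) == motif:
            return False
    return True


def find_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples, 0-based start."""
    sequence = sequence.upper()
    hits = []
    for size, min_count in sorted(min_repeats.items()):
        # A motif of `size` bases followed by at least min_count - 1 copies
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_count - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip motifs such as "ATAT", which are reported as (AT)n instead
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // size))
    return sorted(hits)
```

Applied to each unigene in turn, a scan of this kind yields the positions and motifs from which flanking primers can then be designed. Compound and interrupted repeats would need additional handling.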

The molecular markers can be used for linkage mapping using mapping populations developed from biparental crosses. Software such as MAPMAKER, QTL-ALL, QTLNETWORK, QUANTO, QU-GENE and QUTIE is used for mapping of markers and oligogenes, while QTL Cartographer, QGENE, QTL CAFE, QTL EXPRESS and others are available for mapping of quantitative trait loci (QTLs). The genes/QTLs detected for target traits need to be confirmed in independent replicate studies. Further, the markers found linked to the genes/QTLs have to be validated in unrelated germplasm/materials before they can be used for marker-assisted selection (MAS) in plant breeding programmes. Alternatively, marker–trait associations can be detected by linkage disequilibrium (LD)-based association mapping, which uses germplasm collections/breeding lines in place of biparental mapping populations.
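
Pairwise LD between two biallelic markers is commonly summarized by the coefficient D = p(AB) − p(A)p(B) and by r² = D² / [p(A)(1 − p(A)) p(B)(1 − p(B))]. The following minimal sketch computes both from phased haplotypes. It is an illustration only, not the procedure of any package in Table 12.7, and the 0/1 allele coding is an assumption of the example.

```python
def ld_statistics(p_ab, p_a, p_b):
    """Return (D, r_squared) for two biallelic loci.

    p_ab: frequency of haplotypes carrying allele A at locus 1 and B at locus 2
    p_a, p_b: frequencies of allele A and allele B
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r_squared


def ld_from_haplotypes(haplotypes):
    """Compute (D, r_squared) from phased haplotypes.

    `haplotypes` is a list of (allele_at_locus1, allele_at_locus2) pairs,
    with alleles coded 0 or 1 (assumed coding).
    """
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n
    p_b = sum(h[1] for h in haplotypes) / n
    p_ab = sum(1 for h in haplotypes if h[0] == 1 and h[1] == 1) / n
    return ld_statistics(p_ab, p_a, p_b)
```

An r² close to 1 indicates that one marker almost fully predicts the other, whereas a value near 0 indicates that the two loci segregate independently in the panel.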

Conclusion and Perspectives

The omics era of the twenty-first century provides opportunities to understand the legume genome at the sequence, structural and functional levels. While legume omics is still in its infancy, it holds great promise and is expected to yield insights into many aspects of the evolution and regulatory mechanisms of legume species. The rapid development of various molecular tools and techniques, including large-scale analysis of genome organization, gene expression, protein–protein interaction and protein–ligand interaction, is generating enormous amounts of data, which need to be analyzed and interpreted to develop biologically meaningful concepts. The need to handle such large amounts of data has forced the rapid development of bioinformatics techniques to create, manage and utilize databases of biological information, and of tools and software packages to make efficient and meaningful use of these databases. A variety of software packages are now available to serve the various needs of researchers. However, there is still a need to develop user-friendly bioinformatics tools to decipher the functional features of legume genome sequences.