1 Introduction

Bioinformatics is one of the best examples of an interdisciplinary science: it combines biology, mathematics, computer science and statistics. The term ‘Bioinformatics’ was coined by the Dutch systems biologist Paulien Hogeweg in 1970 [1]. Over the last few decades it has become an important part of biological research because it allows biological data to be processed quickly. Bioinformatics tools are now routinely used to identify and characterize novel genes, to calculate physicochemical properties, and to predict the tertiary structure of proteins, evolutionary relationships and biomolecular interactions. These tools cannot generate information as reliable as that obtained by experimentation, but experimental techniques are difficult, costly and time consuming. The in silico approach therefore helps researchers reach an approximate, informed decision before conducting a wet lab experiment. The role of bioinformatics is not limited to the generation of data; it extends to the storage of large amounts of biological data and to their retrieval and sharing among researchers. The design of databases, the development of tools to retrieve data from them and the creation of web user interfaces are the major roles of bioinformatics scientists. Life sciences researchers have been using such databases since the 1960s [2]. Bioinformatics emerged as a discipline in the mid-1980s, and the National Center for Biotechnology Information (NCBI) was established by the US government in 1988.

Biological databases are commonly classified as primary, secondary or composite. Primary databases contain only gene and protein sequence information and structure information. Secondary databases contain information derived from primary databases, and composite databases provide criteria for searching multiple resources. Along with these databases, literature databases, structural databases, metabolic pathway databases, genome databases for specific organisms, protein-protein interaction databases, phylogenetic information databases, RNA molecule databases and protein signalling databases are also discussed in detail.

Bioinformatics also integrates data mining techniques, e.g. genetic algorithms, support vector machines, artificial intelligence and hidden Markov models, to develop software for sequence-, structure- and function-based analysis.

Owing to the flood of genome sequencing projects, vast amounts of data have accumulated at a very high rate. Raw data alone are not meaningful, however, because the knowledge they contain is hidden, and that knowledge is often far more valuable than the data themselves. A new field therefore emerged in the mid-1990s to extract knowledge from raw data; it is called knowledge discovery in databases (KDD) or simply data mining (DM) [3, 4]. First, the data, which are the raw material and relate to some specific problem, are collected, checked and selected. The selected data are then preprocessed, i.e. erroneous data, also called outliers, are identified and removed. After preprocessing, algorithms or mathematical models are applied to extract useful patterns; this step is data mining proper. The patterns are interpreted by experts and evaluated for their novelty, correctness, comprehensibility and usefulness. Finally, the information is made available to the end user in databases in graphical or otherwise presentable form. A flow chart of the KDD process is shown in Fig. 1.

Fig. 1 Systematic diagram of KDD process
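The stages of the KDD process (selection, preprocessing, data mining and evaluation) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the record layout, the z-score rule for outliers, the cutoff of 2.0 and the trivial "pattern" (the fraction of values above a threshold) are all hypothetical choices made for the example and are not taken from the KDD literature cited above.

```python
from statistics import mean, stdev


def select(records, key):
    """Selection: keep only records that actually carry the measurement of interest."""
    return [r[key] for r in records if r.get(key) is not None]


def preprocess(values, z_cutoff=2.0):
    """Preprocessing: drop outliers lying more than z_cutoff standard deviations from the mean.

    The z-score rule and the cutoff are assumptions made for this sketch.
    """
    if len(values) < 3:
        return list(values)
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return list(values)
    return [v for v in values if abs(v - mu) / sigma <= z_cutoff]


def mine(values, threshold):
    """Data mining: a deliberately trivial pattern, the fraction of values above a threshold."""
    if not values:
        return 0.0
    return sum(v > threshold for v in values) / len(values)


def kdd(records, key, threshold):
    """Run selection, preprocessing and mining, and report a summary for expert evaluation."""
    selected = select(records, key)
    cleaned = preprocess(selected)
    return {
        "n_selected": len(selected),
        "n_cleaned": len(cleaned),
        "fraction_above": mine(cleaned, threshold),
    }
```

Calling `kdd(records, "gc", 0.405)` on a list of dictionaries with a hypothetical `gc` field returns the number of selected and cleaned values together with the mined fraction, which an expert would then judge for novelty and usefulness.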

A number of reviews and scientific articles on databases have been published in the specialized bioinformatics literature [5, 6]. However, none of these articles is readily useful to a scientist from outside the bioinformatics/computational biology discipline. In the present chapter we therefore introduce various bioinformatics databases to the non-specialist reader, to help him or her extract information useful for a given project. Each section presents the basic idea of its area, supported by the literature, and, where necessary, closes with a tabulated summary of related databases.

2 Biological Databases

Owing to advances in high-throughput sequencing techniques, sequencing the whole genome of an organism is now quite easy, which has led to the production of massive amounts of data. The storage, retrieval and sharing of these data among researchers are handled efficiently by databases, i.e. large bodies of persistent data organized in a meaningful way. Databases are usually associated with software designed to update, query and retrieve the various components of the stored data. Databases differ in the nature of the information being stored (e.g., sequences, structures, 2D gel or 3D structure images, and so on) and in the manner of data storage (e.g., flat files, tables in a relational database, or objects in an object-oriented database). Because submitted sequences and the associated information are made freely accessible to the scientific community, a large number of databases have been developed worldwide. Each database thus becomes an autonomous representation of a molecular unit of life.

A biological sequence database is a large collection of data of biological significance. Each biological molecule, such as a nucleic acid, protein or other polymer, is identified by a unique key. The stored information is kept for future use and also serves as an important input for sequence and structure analyses. Biological databases are mainly categorized into primary, secondary and composite databases, which are discussed in detail in the following sections.

2.1 Primary Databases

In a primary database, the sequence or structure data are obtained through experiments such as yeast two-hybrid assays, affinity chromatography, XRD or NMR. SWISS-PROT [7], UniProt [8–10], PIR [11], TrEMBL (translation of DNA sequences in EMBL) [7], GenBank [12], EMBL [13], DDBJ [14], the Protein Data Bank (PDB) [15] and wwPDB (worldwide Protein Data Bank) [16] are well-known examples of primary databases. A primary database is basically a collection of gene and protein sequence and structure information only. GenBank (USA), EMBL (Europe) and DDBJ (Japan) exchange data on a daily basis to ensure comprehensive coverage. SWISS-PROT is a protein sequence database established in 1986 in collaboration between the University of Geneva and the EMBL [17]. Its annotations have made it the database of choice for most researchers. SWISS-PROT entries contain information produced by both wet lab and dry lab work [17], and the database is cross-linked to several others, such as GenBank, EMBL, DDBJ, PDB and a number of secondary protein databases. The protein data in SWISS-PROT focus mainly on model organisms and human, whereas TrEMBL provides information on proteins from all organisms [7]. PIR is another comprehensive collection of protein sequences; it offers features such as text search for a protein molecule and web-based analyses such as sequence alignment, identification of peptide molecules and peptide mass calculations [11, 17, 18]. The PIR Protein Sequence Database was developed at the National Biomedical Research Foundation (NBRF) in the 1960s by Margaret Dayhoff as a database of protein sequences for investigating evolutionary relationships among proteins [11, 17, 18]. UniProt is another comprehensive, freely available collection of protein sequences; it combines SWISS-PROT, PIR and TrEMBL [8–10]. The worldwide Protein Data Bank (wwPDB) contains over 83,000 structures and aims to make every 3D structure of protein molecules freely available to the scientific community.

2.2 Secondary Databases

A secondary database contains information derived from primary databases, e.g. conserved sequences, active site residues of protein families, patterns and motifs [19, 20]. Examples of secondary databases are SCOP [21], CATH [22], PROSITE [23], PRINTS [24] and eMOTIF [25]. The first secondary database to be developed was PROSITE, which is maintained by the Swiss Institute of Bioinformatics. Within PROSITE, motifs are encoded as regular expressions, also called patterns. The PRINTS fingerprint database is another secondary database; it is maintained at University College London (UCL) and stores motifs as ungapped, unweighted local alignments [24]. The SCOP (Structural Classification of Proteins) database, maintained by the MRC Laboratory and Centre for Protein Engineering, describes structural and evolutionary relationships among proteins of known structure [21]. In SCOP, proteins are classified hierarchically into families, superfamilies and folds to reflect these relationships. CATH (Class, Architecture, Topology, and Homology) is another secondary database, a hierarchical classification of protein structures maintained at UCL [26]. CATH has five levels within its hierarchy:

  • Class includes secondary structure content and packing. Four classes of domain are recognised: (i) mainly-α, (ii) mainly-β, (iii) α−β, which includes both alternating α/β and α + β structures, and (iv) Protein structures with low secondary structure content.

  • Architecture includes arrangement of secondary structures, without connectivities; (e.g., barrel, roll, sandwich, etc.).

  • Topology describes the overall shape and the connectivity of secondary structures.

  • Homology includes domains that share at least 35 % sequence identity and are thought to share a common ancestor, i.e. are homologous.

  • Sequence is the last level, where structures are clustered on the basis of sequence identity.
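The regular-expression encoding of PROSITE motifs mentioned earlier in this section can be illustrated with a short sketch that translates a PROSITE-style pattern into a Python regular expression and scans a sequence with it. The translation covers only the basic syntax elements (`x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repetition, `<` and `>` for the sequence termini); the function names and the example sequence are hypothetical, and the code is not the scanner used by PROSITE itself.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Excluded residues: {P} becomes [^P]. This must happen before
        # repetition counts are rewritten, because those introduce new braces.
        element = element.replace("{", "[^").replace("}", "]")
        # Repetition counts: x(2,4) becomes x{2,4}.
        element = element.replace("(", "{").replace(")", "}")
        # Lower-case x stands for any residue; residues themselves are upper case.
        element = element.replace("x", ".")
        # Anchors for the N- and C-terminus of the sequence.
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


def find_motifs(sequence, pattern):
    """Return (start, matched_text) for each non-overlapping match of the pattern."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]
```

For instance, the N-glycosylation site pattern `N-{P}-[ST]-{P}.` becomes `N[^P][ST][^P]`. Note that `re.finditer` reports non-overlapping matches only, so overlapping occurrences of a motif would need a lookahead-based variant.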

2.3 Composite Databases

A composite database is an amalgamation of several primary database sources, designed so that multiple resources can be searched with different criteria in a single query. One example is the National Center for Biotechnology Information (NCBI), an extensive collection of nucleotide, protein sequence and many other databases that is freely accessible to the research community. NCBI interlinks genetic sequence data, protein sequence data, structure data, phylogenetic tree data, genome data and literature references. Links exist both between records of the same type and across databases; for example, articles in the literature database PubMed link to gene sequences, protein sequences, 3D structures and genome information. Links between genetic sequence records are based on BLAST sequence comparisons [27], while links between structure records are based on VAST structure comparisons [28]. PubMed, the NCBI literature database, contains citations for biomedical literature from MEDLINE, journals and online books. NCBI also hosts the nucleotide sequence database GenBank [12], a collection of genome sequences from more than 250,000 species; these data can be retrieved through Entrez, NCBI’s integrated retrieval system, whereas the literature is accessible via PubMed [12, 29, 109]. For each sequence, GenBank provides the related literature, organism, untranslated regions, exons/introns, repeat regions, coding regions, terminators, translations, promoters, bibliography etc. Sequences can be submitted to GenBank by individual laboratories as well as by large-scale genome sequencing projects. The NCBI protein sequence database contains sequences from several sources, including translations of annotated coding regions in GenBank and RefSeq as well as records from SwissProt, PIR, PRF and PDB. The NCBI genome database contains information on genomes, including sequences, maps, chromosomes, assemblies and annotations. The protein structure database at NCBI is the Molecular Modeling Database (MMDB), which contains experimentally resolved structures of proteins, RNA and DNA molecules derived from the Protein Data Bank (PDB). MMDB also adds value-added features such as computationally identified 3D domains, which can be used to identify similar 3D structures, as well as links to the literature and information about chemicals/drugs bound to the structures. PubChem, the small-molecule database integrated with NCBI, contains small chemical structures and their biological activities (Fig. 2).

Fig. 2 Circles represent various databases; straight lines between circles represent links between different data types among different databases
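Retrieval through Entrez can also be scripted. The sketch below only builds the request URL for the `efetch` endpoint of the NCBI E-utilities and does not contact the server; the helper name `efetch_url` is hypothetical, and the accession used in the usage note is purely illustrative. NCBI's usage guidelines (request rate limits, API keys) apply to any real query and are not handled here.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities web service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db, ids, rettype="fasta", retmode="text"):
    """Build an efetch URL that requests the given records from an Entrez database.

    db      -- Entrez database name, e.g. "nucleotide" or "protein"
    ids     -- list of accession numbers or UIDs
    rettype -- record format requested from the server
    retmode -- encoding of the returned data
    """
    params = {
        "db": db,
        "id": ",".join(ids),
        "rettype": rettype,
        "retmode": retmode,
    }
    # safe="," keeps the comma-separated identifier list readable in the URL.
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params, safe=',')}"
```

A call such as `efetch_url("nucleotide", ["NM_000546"])` yields a URL whose contents could then be downloaded, for example with `urllib.request.urlopen`, and parsed as FASTA text.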

NRDB (Non-Redundant DataBase) is another composite database; it contains data from GenPept (derived from automatic GenBank CDS translations), SWISS-PROT, PIR, GenPeptupdate (the daily updates of GenPept), SPupdate (the weekly updates of SWISS-PROT) and PDB sequences. Similarly, INSD (International Nucleotide Sequence Database) is a composite collection of nucleic acid sequences from EMBL, GenBank and DDBJ. UniProt (universal protein sequence database) is also a composite database, with sequences derived from PIR-PSD, Swiss-Prot and TrEMBL. In the same way, wwPDB (worldwide PDB) is a composite database of 3D structures maintained by RCSB PDB (Research Collaboratory for Structural Bioinformatics Protein Data Bank), MSD and PDBj [30].
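The idea behind a non-redundant composite database can be shown with a small sketch that merges records from several sources and collapses entries whose sequences are identical. This is a simplified illustration, assuming that redundancy means exact sequence identity after case normalization; the function name and the record layout are hypothetical, and the actual NRDB build procedure is not reproduced here.

```python
def merge_non_redundant(*sources):
    """Merge (identifier, sequence) records from several sources.

    Records with identical sequences (ignoring case) are collapsed into one
    entry that keeps the identifiers from every contributing source.
    """
    merged = {}
    for source in sources:
        for identifier, sequence in source:
            key = sequence.upper()
            # dict preserves insertion order, so the first source that
            # contributes a sequence determines its position in the output.
            merged.setdefault(key, []).append(identifier)
    return [(identifiers, sequence) for sequence, identifiers in merged.items()]
```

Keeping all identifiers for a collapsed entry preserves the cross-references to the source databases while storing each sequence only once.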

2.4 Specialized Databases

The Rfam database contains secondary structures of RNA molecules and their gene expression patterns. It was introduced by the Wellcome Trust Sanger Institute and is analogous to the Pfam database for annotating protein families [31]. Numerous curated databases are accessible worldwide; IntAct, for example, contains data on protein interactions. MINT (Molecular INTeraction database) is another curated database, which has been merged with the IntAct database maintained by EMBL-EBI [32]. MINT stores information about protein-protein interactions derived from published articles [33]. For metabolic pathway analysis in human, Reactome is a freely available database that provides diverse information on metabolic and signal transduction pathways [34].

The Transporter Classification Database (TCDB) is a database of membrane transporters [35] based on the Transport Classification (TC) system, which classifies transport proteins in a manner analogous to the Enzyme Commission system for enzymes [36]. Similarly, the Carbohydrate-Active enZYmes Database (CAZy) is a database of carbohydrate-modifying enzymes and related information. These enzymes are classified into families on the basis of amino acid sequence similarity or the presence of particular catalytic domains [37].

Xenbase is a specialized database that contains genomic and biological information about frogs of the genus Xenopus [38], whereas the Saccharomyces Genome Database (SGD) provides comprehensive information about yeast (Saccharomyces cerevisiae) together with web-based bioinformatics tools for analysing the data it holds [39]. SGD may be used to study functional interactions among gene sequences and gene products in other fungi and in other eukaryotes. Likewise, WormBase is a specialized database developed and maintained by an international consortium to provide accurate, current data on the molecular biology of C. elegans and related nematodes. WormBase also provides web-based tools for analysing the stored information. Another up-to-date database, FlyBase, provides information on the genes and genomes of Drosophila melanogaster along with web-based bioinformatics tools for searching gene sequences, alleles, phenotypes and images of Drosophila species [40]. Similarly, wFleaBase provides information on genes and genomes for species of the genus Daphnia (water flea). Daphnia is considered a model organism for studying the complex interplay between genome structure, gene expression and population-level responses to chemicals and environmental changes. Although wFleaBase contains data for all Daphnia species, the primary species are D. pulex and D. magna.

MetaCyc is a curated database of metabolic pathways taken from the published literature and covering all domains of life. It contains 2260 pathways from 2600 different organisms. MetaCyc covers pathways involved in both primary and secondary metabolism, together with their reactions, enzymes and associated genes [41]. PANTHER (Protein ANalysis THrough Evolutionary Relationships) is another pathway database, comprising over 177 pathways, primarily signaling pathways. Each pathway is made up of components, where a component is a single protein or group of proteins in a given organism [42]. Its pathway diagrams are interactive and include bioinformatics tools for visualizing gene expression data. KEGG (Kyoto Encyclopedia of Genes and Genomes) is a collection of databases developed and maintained by the Bioinformatics Center of Kyoto University and the Human Genome Center of the University of Tokyo [43]. KEGG covers metabolic pathways in yeast, mouse, human etc. and has more recently been expanded with signaling pathways for the cell cycle and apoptosis. Reactome is a collection of metabolic and signaling pathways and their reactions [34]. These pathways and reactions can be viewed, but not edited, through a web browser. Although human is the main organism catalogued, the database also contains data for 22 other species such as mouse and rat.

TreeBASE is a collection of phylogenetic trees and the data used to construct them. It accepts all types of phylogenetic data, for species trees as well as gene trees, from all domains of life [44]. PhylomeDB [45] is another public database of gene-based phylogenetic information, which allows users to explore the evolutionary history of genes. PhylomeDB also provides an automated pipeline for reconstructing phylogenetic trees across different genomes. Table 1 lists genomic, protein sequence and specialized databases.

Table 1 List of gene and protein based databases, their description along with their webpage’s URL

2.5 Drawbacks of Biological Databases

Life sciences researchers are often interested not in a few database entries but in a very large number of entries, which must then be processed further; for such tasks, searching through web interfaces is not a good option. To handle large amounts of data, the collection and processing of the relevant data must be automated. Each database should therefore offer a programming interface that allows bioinformatics software developers to query and search it from their own programs [57]. Database connectivity standards such as ODBC (Open Database Connectivity) and JDBC (Java Database Connectivity) provide such programming interfaces for querying, searching and analyzing data. The major limitation is that database providers only sometimes allow public access to these interfaces; DDBJ (DNA Data Bank of Japan) and KEGG (Kyoto Encyclopedia of Genes and Genomes) are among those that do. Apart from the lack of programming interfaces, present biological databases have other drawbacks, such as inadequate description of the database contents and uncertainty about the authenticity of the data and the data sources. Web interfaces often do not allow all fields of a database to be searched, and Boolean search operators such as ‘and’, ‘or’ and ‘not’ are either unsupported or supported only in a limited way, so the desired query results are not obtained. Primary nucleotide and protein sequence databases contain many redundant sequences, which should be removed. Biologists are usually not familiar with database design principles or database query languages, yet they need knowledge of the database schema to query a database directly. In most cases flat files, which have no standardized format, are used for data exchange. The many formats used by thousands of biological databases create problems in data preprocessing. Many types of information are often missing, including functional annotations of genes and proteins, genotype-phenotype relationships, and detailed pathway information and reactions.
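What programmatic access with full Boolean search looks like can be sketched with Python's built-in `sqlite3` module and an in-memory table. The schema, the accession numbers and the rows are invented for the example and do not correspond to any real biological database; the point is only that a program can combine AND, OR and NOT over any field once a query interface is exposed.

```python
import sqlite3

# Hypothetical entries: accession, organism, keyword, sequence length.
DEMO_ROWS = [
    ("X0001", "Homo sapiens", "kinase", 450),
    ("X0002", "Mus musculus", "kinase", 430),
    ("X0003", "Homo sapiens", "transporter", 610),
    ("X0004", "Homo sapiens", "kinase", 95),
]


def build_demo_db(rows=DEMO_ROWS):
    """Create an in-memory database with a single, hypothetical 'entry' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entry ("
        "accession TEXT PRIMARY KEY, organism TEXT, keyword TEXT, length INTEGER)"
    )
    conn.executemany("INSERT INTO entry VALUES (?, ?, ?, ?)", rows)
    return conn


def search(conn, organism, keyword, min_length):
    """Boolean query: organism AND keyword AND NOT (length below min_length)."""
    cursor = conn.execute(
        "SELECT accession FROM entry "
        "WHERE organism = ? AND keyword = ? AND NOT length < ? "
        "ORDER BY accession",
        (organism, keyword, min_length),
    )
    return [row[0] for row in cursor]
```

Passing the search terms as parameters rather than formatting them into the SQL string keeps the query safe against malformed input, which matters when such a function is driven by large automated batches.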

3 Gene Identification and Sequence Analyses

Owing to the lack of genome annotation and of high-throughput experimental approaches, computational gene prediction has become a challenging and interesting area for bioinformatics/computational biology scientists. Gene prediction is crucial, especially for disease identification in human. Computational gene prediction methods fall into two major classes: ab initio methods and similarity/homology-based methods [58]. These analyses are mainly useful for gene expression analysis, because gene expression depends directly or indirectly on promoters, terminators and untranslated regions. Gene prediction therefore involves identifying these regulatory regions together with transit peptides, introns/exons and open reading frames (ORFs), as well as variable regions that are used as signatures for various diagnostic purposes. Sequence analysis is consequently one of the most commonly used approaches to gene prediction in bioinformatics.
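Locating an open reading frame is the simplest signal-based step in gene finding and can be sketched directly. The function below scans the three forward reading frames for an ATG start codon followed in frame by a stop codon. It is a toy illustration under stated assumptions: only the forward strand is examined, introns are ignored, and the minimum length of three codons is an arbitrary choice; it is not the algorithm of any of the gene prediction programmes named in this section.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for ORFs on the forward strand.

    start is the index of the A in ATG, end is the index just past the stop
    codon, and min_codons counts the start and stop codons as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in the same frame
    return orfs
```

A fuller treatment would also scan the reverse complement and, for eukaryotes, would have to model splice sites, which is precisely what makes eukaryotic gene prediction harder.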

Gene prediction is more difficult in eukaryotes than in prokaryotes because of the presence of introns. Homology-based gene prediction depends on complementary DNA (cDNA) and Expressed Sequence Tags (ESTs); however, cDNA/EST information is often unavailable or incomplete, which makes finding novel genes extremely difficult. Homology based on sequence information has nevertheless been shown to improve the accuracy with which prokaryotic and eukaryotic genes are identified. Local and global sequence alignments, performed by BLAST and NEEDLE respectively, are widely used in homology/similarity-based gene prediction. Protein homology has been incorporated into many gene prediction programmes, such as GENEWISE [59], GENOMESCAN [60], GeneParser [61] and GRAIL [62]. Novel genes can often be found by ab initio gene identification; examples of ab initio programs are GENSCAN [63], GENIE [64], HMMGene [65] and GENEID [66]. When these methods were applied to Drosophila melanogaster, they showed a very low false-positive rate and predicted 88 % of the already verified exons and 90 % of the coding sequences [67]. Given this accuracy, the methodology can be used to annotate large genomic sequences and to predict new genes.
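Global alignment of the kind performed by NEEDLE rests on the Needleman-Wunsch dynamic programming algorithm, which can be sketched compactly. The version below is a teaching sketch, not the NEEDLE implementation: it uses a linear gap penalty and a simple match/mismatch score, whereas production tools typically use substitution matrices and affine gap penalties. The scoring values are assumptions chosen for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + step:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

Local alignment as used by BLAST differs in that scores are not allowed to fall below zero and the traceback starts from the highest-scoring cell, so only the best-matching region is reported.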

Recently, Lencz et al. identified an intergenic single nucleotide polymorphism (SNP; rs11098403) at chromosome 4q26 that was linked with schizophrenia and bipolar disorder. They performed a genome-wide association study (GWAS) coupled with cDNA and RNA-Seq analysis on a set of 23,191 individuals [68]. Similarly, Peng and co-workers predicted 31,987 genes from the Phyllostachys heterocycla draft genome by gene prediction approaches based on FgeneSH++ [69]. Sequence analysis refers to the study of the features of biomolecules such as nucleic acids and proteins that are responsible for their unique function(s). The first step is the retrieval of sequences from public databases. The sequences are then analysed with various tools that predict specific features associated with function, structure or evolutionary relationship, or that identify homologs with high accuracy. The database should be selected according to the nature of the analysis. For example, the Entrez system behind PubMed [70] allows one to search for different patterns in the given data, and pattern discovery can also be performed with Expression Profiler [71] and GeneQ [72]. Another set of tools is dedicated to sequence comparison, such as BLAST (Basic Local Alignment Search Tool) [27] and ClustalW [73], and to data visualization, such as Jalview [74], GeneView [75], TreeView [76] and Genes-Graphs [77], which allow researchers to view data graphically. Table 2 lists the databases and programmes used in primary sequence analyses.

Table 2 List of gene identification and sequence analyses programmes and their description along with their webpage’s URL

4 Phylogenetic Analyses

Phylogenetic analysis helps to determine the evolutionary relationships among a group of related organisms or related genes and proteins [83, 84], to predict specific features of a molecule of unknown function, to track gene flow and to determine genetic relatedness [85]. It is based mainly on similarity at the sequence level: the higher the similarity, the closer the organisms lie on the tree. Phylogenetic trees are constructed by distance, parsimony and maximum likelihood methods. None of these methods is perfect; each has its own strengths and weaknesses. For example, distance-based methods give average performance, whereas maximum parsimony and maximum likelihood methods are more accurate; their major disadvantage is that they take longer to run than distance-based methods [86]. Among the distance-matrix methods, Neighbour Joining (NJ) and the Unweighted Pair Group Method with Arithmetic mean (UPGMA) are the simplest. Table 3 lists phylogenetic analysis programmes.

Table 3 List of phylogenetic analyses programmes and their description along with their webpage’s URL

In functional genomics, when the function of a particular gene is not known, phylogenetic analysis is used to find related genes, which ultimately helps to identify the function and other features of that gene.
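The distance-based approach described in this section can be sketched with a small UPGMA implementation. The sketch assumes pre-aligned sequences of equal length, uses the uncorrected p-distance (the fraction of differing positions) and returns only the tree topology in Newick-like notation without branch lengths. These simplifications, the function names and the way ties are resolved (by input order) are choices made for the example; the programmes listed in Table 3 are considerably more elaborate.

```python
from itertools import combinations


def p_distance(s1, s2):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(s1, s2)) / len(s1)


def upgma(seqs):
    """Cluster sequences by UPGMA and return the topology as a Newick-like string.

    seqs maps a taxon name to its aligned sequence.
    """
    clusters = {name: 1 for name in seqs}  # cluster label -> number of taxa
    dist = {
        frozenset((a, b)): p_distance(seqs[a], seqs[b])
        for a, b in combinations(seqs, 2)
    }
    while len(clusters) > 1:
        # Join the two closest clusters.
        pair = min(dist, key=dist.get)
        a, b = sorted(pair)
        merged = f"({a},{b})"
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance from the new cluster to every other cluster is the
        # size-weighted average of the distances from its two members.
        for other in clusters:
            d_a = dist.pop(frozenset((a, other)))
            d_b = dist.pop(frozenset((b, other)))
            dist[frozenset((merged, other))] = (
                d_a * size_a + d_b * size_b
            ) / (size_a + size_b)
        del dist[pair]
        clusters[merged] = size_a + size_b
    return next(iter(clusters)) + ";"
```

UPGMA assumes a constant rate of evolution along all lineages; Neighbour Joining relaxes that assumption, which is one reason it is often preferred for real data.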

5 Predicting Protein Structure and Function

Protein molecules begin life as unstructured strings of amino acids, which finally fold into a three-dimensional (3D) structure. A protein must fold into the correct topology to perform its biological functions. 3D structures are mostly determined by X-ray crystallography and NMR, which are costly, difficult and time consuming. X-ray crystallography fails if good crystals cannot be obtained, and NMR is limited to small proteins [88]. As a result, very few structures determined by NMR and XRD are submitted to NCBI each month. Correct prediction of the secondary and tertiary structure of proteins remains one of the most challenging tasks for bioinformaticians/computational biologists. Predicting the correct secondary structure is the key not only to predicting a satisfactory tertiary structure but also to predicting protein function [88]. Protein structure prediction methods are classified into three categories: (i) ab initio modeling [89], (ii) threading or fold recognition [90] and (iii) homology or comparative modeling (Šali and Blundell 1993) [91]. Threading and comparative modeling build protein models by aligning query sequences with known structures determined by X-ray crystallography or NMR. When templates with sequence identity ≥30 % are found, high-resolution models can be built by the template-based methods. If no templates are available in the Protein Data Bank (PDB), the models are built from scratch, i.e. by ab initio modeling [92]. Homology modeling is the most accurate prediction method so far and is used frequently. In one of our studies, good-quality homology models of superoxide dismutase (SOD) from the Antarctic cyanobacterium Nostoc commune were obtained with the Modeller software package; SOD helps the organism cope with the environmental stresses prevailing in its natural habitat [93]. Bioinformatics tools can also identify secondary structure elements such as helices, sheets and coils. Protein tertiary structures are stabilized by these elements, which play an important role in establishing weak electrostatic interactions. Table 4 lists tools for predicting the secondary structure of protein molecules.
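The choice between template-based and ab initio modeling described above can be expressed as a simple decision rule on sequence identity. The sketch below takes pairwise alignments between the query and candidate templates, computes percent identity over the columns where neither sequence has a gap, and selects template-based modeling when the best template reaches the 30 % cutoff mentioned in the text. The definition of identity (several are in use), the function names and the template labels are assumptions made for this illustration.

```python
def percent_identity(aligned_query, aligned_template):
    """Percent identity over alignment columns in which neither sequence has a gap."""
    pairs = [
        (q, t)
        for q, t in zip(aligned_query, aligned_template)
        if q != "-" and t != "-"
    ]
    if not pairs:
        return 0.0
    return 100.0 * sum(q == t for q, t in pairs) / len(pairs)


def choose_strategy(alignments, cutoff=30.0):
    """Pick a modeling strategy from query/template alignments.

    alignments maps a template label to (aligned_query, aligned_template).
    Returns (strategy, best_template_or_None, best_identity).
    """
    best_template, best_identity = None, 0.0
    for template, (query_row, template_row) in alignments.items():
        identity = percent_identity(query_row, template_row)
        if identity > best_identity:
            best_template, best_identity = template, identity
    if best_template is not None and best_identity >= cutoff:
        return "template-based", best_template, best_identity
    return "ab initio", None, best_identity
```

In practice alignment coverage and template quality (resolution, completeness) would also enter the decision, so identity alone should be read as a first filter rather than a complete criterion.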

Table 4 List of protein structure and function prediction programmes, their description along with their webpage’s URL

6 Predicting Molecular Interactions

Interactions between biomolecules, for example protein-protein, protein-DNA and protein-RNA interactions, affect various biological activities and have become a popular area of research [100]. Protein-protein interactions play an essential role in cellular activities such as signalling and transport, and a major role in homeostasis, cellular metabolism etc. [101]. Bioinformatics helps to predict the 3D structure of proteins and the interaction patterns between different biomolecules. These predictions are based on parameters such as interface size, amino acid position and the types of chemical groups involved, as well as on van der Waals forces, electrostatic interactions and hydrogen bonds. Table 5 lists tools for studying protein-protein interactions.
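One of the parameters mentioned above, interface size, can be estimated from atomic coordinates with a simple distance criterion. The sketch below marks a residue as part of the interface when any of its atoms lies within a cutoff distance of any atom of the partner chain. The 5 Å cutoff, the residue labels and the dictionary layout are assumptions for the example, and the brute-force comparison of all atom pairs is only practical for small structures.

```python
from math import dist


def interface_residues(chain_a, chain_b, cutoff=5.0):
    """Return the residues of each chain that contact the other chain.

    chain_a, chain_b -- dicts mapping a residue label to a list of (x, y, z)
                        atom coordinates, in the same length unit as cutoff
    cutoff           -- maximum atom-atom distance counted as a contact
                        (5.0 is an assumed, commonly used order of magnitude)
    """
    in_a, in_b = set(), set()
    for res_a, atoms_a in chain_a.items():
        for res_b, atoms_b in chain_b.items():
            # Two residues are in contact if any atom pair is close enough.
            if any(dist(p, q) <= cutoff for p in atoms_a for q in atoms_b):
                in_a.add(res_a)
                in_b.add(res_b)
    return sorted(in_a), sorted(in_b)
```

The number of residues returned for each chain gives a crude measure of interface size; energy-based terms such as electrostatics and hydrogen bonding would require a proper force field and are outside the scope of this sketch.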

Table 5 List of molecular interaction databases and programmes, their description along with their webpage’s URL

7 Discussion, Conclusion and Future Prospects

Bioinformatics has emerged as a challenging discipline that has developed very fast in the last few years, driven by the large amount of data generated by genome sequencing projects. Such data need preprocessing before useful knowledge can be extracted from them by data mining techniques. The processed data are stored in biological databases, from which they can be retrieved in a meaningful manner. Databases containing nucleotide and protein sequences are called primary databases; their drawback is that they contain redundant sequences. Secondary databases, which contain information derived from primary databases, have largely solved this problem, and redundancy is also minimal in the Swiss-Prot database. Composite databases, e.g. NCBI, provide better search criteria for searching multiple primary resources at once. NCBI also links the literature, structure, chemical molecule, genome, gene and protein sequence databases. In addition, various specialized databases now provide information about protein-protein interactions, protein families, experimentally known metabolic pathways, genome sequences, protein structures and phylogenetic trees. These databases have some drawbacks, e.g. lack of description of the data they contain, redundancy of sequences etc. A major drawback of most databases is that they do not provide a programming interface that would let researchers write their own programmes to download and process large amounts of stored data. Bioinformatics is used not only in designing biological databases but also in developing software tools for sequence, structure and evolutionary analysis of genes and proteins, which saves time, energy and cost in biological research. A number of bioinformatics programmes have been designed to predict genes in genomic sequences using computational approaches such as artificial intelligence, genetic algorithms, support vector machines, hidden Markov models and dynamic programming. However, the best predictors are based on hybrid methods, which combine more than one of these approaches. Bioinformatics tools have also been developed to construct parsimony, distance-based and maximum likelihood trees for exploring evolutionary relationships among species. The parsimony method is successful when sequence identity is high, while maximum likelihood performs well when sequence variation is high. Bioinformatics has proved to be a boon in structure-based drug design by predicting the structures of drug targets by different approaches, irrespective of whether template structures are available in the PDB. Homology modelling has proved the best predictor among all the methods. Moreover, bioinformatics tools predict protein-protein interactions, which play an essential role in cellular activities such as signalling, transport, homeostasis, cellular metabolism and various biochemical processes. Given the developments in the field, bioinformatics tools and software packages can be expected to give more specific, more accurate and more reliable results in the coming years. In future, bioinformatics will contribute to the functional understanding of whole genomes, which will lead to enhanced discovery of gene expression and interaction patterns, individualised gene therapy and new drug discovery. Thus, bioinformatics and other scientific disciplines should move forward together for the welfare of humanity.