
1 Introduction

Public databases, the literature and the World Wide Web now hold a flood of nucleic acid sequence information, bioinformatics tools and phylogenetic inference methods. The last 20 years have seen the rapid development of prokaryotic genomics. Since the sequencing of Haemophilus influenzae in 1995 (Johnston 2010; Fleischmann et al. 1995), more than 11,364 whole genome sequences, organized in three major groups of organisms, i.e. eukaryota, prokaryota (archaea and bacteria) and viruses, have become available in the Genome database of NCBI, including complete chromosomes, organelles and plasmids as well as draft genome assemblies. Of these 11,364 whole genome sequences, 7,473 genome projects running across the world belong to microbes alone: 1,696 microbial genome projects are complete, assembly is under way for 2,247 organisms, and 3,531 genome projects are still unfinished (Benson et al. 2002). The developing technology of nucleic acid sequencing, together with the recognition that sequences of building blocks in informational macromolecules (nucleic acids, proteins) can be used as ‘molecular clocks’ that contain historical information, led to the development of the three-domain model in the late 1970s, primarily based on small subunit ribosomal RNA sequence comparisons. The information currently accumulating from the complete genome sequences of an ever-increasing number of prokaryotes is now leading to further modifications of our views on microbial phylogeny. Prokaryotic genomics has had a revolutionary impact on our view of the microbial world and also on the methodologies for microbiological studies.

2 Complexity of Microbial Genomes

Analysis of genomic sequences has revealed that microbial genomes are very diverse. This is due to the complicated nature of microbial evolution. Mutations play a key role in the evolution of eukaryotic genomes, whereas the contents of prokaryotic genomes are also changed by gene loss, gene rearrangement, horizontal gene transfer, and so on (McHardy et al. 2007; Doolittle 1999; Woese 1987). This means that even strains of the same species can differ significantly. For example, the two Escherichia coli strains O157:H7 and K-12 differ by more than 1,000 genes (Perna et al. 2001). The dynamic nature of microbial genomes complicates several tasks in microbiological studies. One of these is the development of strategies to prevent and treat microbe-related diseases. Since microbe-related diseases are common threats to public health, microbes, especially bacteria, have been studied for many years. One point of progress was the introduction of antibiotics to treat bacterial infections. However, the use of antibiotics has been challenged by the emergence of antibiotic resistance among bacteria.

Plasmids play an important role in conferring antibiotic resistance in microbes. It is believed that antibiotic resistance evolves via natural selection. However, antibiotic resistance can also be introduced to bacteria via horizontal gene transfer (Boerlin and Reid-Smith 2008). Plasmids are extra-chromosomal genetic elements that constitute up to 10% of the total DNA found in many species of bacteria (Mølbak et al. 2003; Thomas 2000). Because plasmids are capable of cell-to-cell transfer between bacterial species, genes harboured by plasmids are widely shared, playing a critical role in the evolution of bacteria (Feinbaum 2001; Summers 1996). Establishing accurate relationships between plasmids will help us to understand an important factor in the dissemination of antibiotic resistance genes, and establishing accurate relationships between bacteria will help us to identify the factors that cause diseases, the risks of outbreaks, and methods for preventing disease transmission. Unfortunately, the complexity of microbial genomes becomes apparent when we try to compare the genetic contents of strains and to build a phylogenetic tree from them (McHardy et al. 2007; Doolittle 1999).

3 Obtaining Data (Wet Lab Approach)

One characteristic of microbiological studies in the genomics era is that we can generate a huge amount of data efficiently. Numerous genomics-based experimental methods are available. These methods are usually called molecular methods since they are often based on genetic characteristics. Compared to traditional phenotype-based methods, molecular methods are cost effective, easy to implement, and generate highly discriminatory data (Foxman et al. 2005; Tenover et al. 1997). Of these methods, the most widely used for nucleic acid amplification is the polymerase chain reaction (PCR) assay. This assay uses a specific primer pair to amplify a unique genomic target nucleotide sequence for analysis. Following PCR, a variety of post-amplification methods are used to evaluate the product, such as direct sequence analysis, the use of genus- or species-specific probes, and restriction enzyme analysis of the product, e.g., restriction fragment length polymorphism (RFLP) analysis. Pulsed-field gel electrophoresis (PFGE) is often considered the gold standard for molecular typing. Multiple locus variable-number tandem repeat analysis (MLVA) assays are a potentially powerful alternative or complementary tool. Another powerful technique is the DNA microarray, a high-throughput genomic method that has been widely used in biological studies. To construct a DNA microarray, single-stranded fragments of DNA (also called probes) representing the genes of an organism are attached to a surface of glass or plastic. Each fragment can bind to a complementary DNA or RNA strand. Typically, more than 30,000 spots can be put on one slide, and it is possible to create a microarray representing every gene in a genome. Thus, microarrays can provide genome-wide information which allows a comprehensive genetic analysis of an organism or a sample.

DNA microarrays have been used for genotyping, expression analysis, and studies of protein-DNA interactions (Bilitewski 2009). When used for assessing the genetic relationships of bacterial strains, microarrays may be prepared for a whole genome, composed of the open reading frames (ORFs) of one complete genome sequence (Zhou 2003). However, this type of microarray is limited by its reliance on one complete reference sequence, which may not contain genetic content specific to nonsequenced strains. One possible improvement is to include specific genes from multiple whole-genome sequences or to use mixed-genome microarrays (MGMs), which use randomly selected gene fragments from many strains of bacteria as probes (Wan et al. 2007; Borucki et al. 2004; Call et al. 2003).

The wealth of microbial genomic data, and the knowledge derived from it, makes it possible to study microorganisms systematically. Sequence-based identification requires the recognition of a molecular target that is large enough to allow discrimination of a wide variety of microbes. One such target is the rDNA gene complex, which is present in all microbial pathogens. In bacteria, this complex is composed of a 16S rRNA gene and a 23S rRNA gene separated by a genomic segment called the internal transcribed spacer (ITS). In fungi there are three genes (18S, 5.8S, and 28S) with spacers located between them (ITS1 and ITS2). The rDNA gene complex contains highly variable sequences that provide unique signatures for the identification of species, as well as conserved regions that encode the structural constraints shared within organism groups. It has been shown that the ITS regions contain the most variability and that these regions are useful under most circumstances for species recognition. Because these variable sequence regions (ITS) are surrounded by conserved sequences (16S/23S and 18S/5.8S/28S), an amplification system using universal (or consensus) bacterial or fungal primers can be used. Once amplification with the consensus primers has occurred, the sequence is determined and the unknown sequence is compared with known sequences held in a large database (such as the GenBank database of the National Center for Biotechnology Information (NCBI)) to determine similarity, which may subsequently lead to species identification. However, how to manipulate the massive amount of available data, how to retrieve genomic information effectively, and how to process large-scale data efficiently are all challenging problems. Because of these problems, the field of bioinformatics has emerged and has become an integral part of microbial studies (Foster et al. 2012).
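The idea of amplifying a variable region between conserved primer sites can be illustrated in silico. The following is a minimal sketch only: the function names and the toy template and primer sequences are invented for illustration (they are not real 16S/23S consensus primers), and exact string matching is assumed, whereas real primers tolerate mismatches and degenerate bases.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def extract_amplicon(template, forward_primer, reverse_primer):
    """Return the region bounded by the two primers (primers included),
    or None if either primer site is missing.

    The forward primer is searched as written. The reverse primer anneals
    to the opposite strand, so its reverse complement is searched on the
    template strand, downstream of the forward site.
    """
    template = template.upper()
    start = template.find(forward_primer.upper())
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse_primer)
    end = template.find(reverse_site, start + len(forward_primer))
    if end == -1:
        return None
    return template[start:end + len(reverse_site)]


# Toy example: invented sequences, not real rDNA primers.
toy_template = "TTGACCGTAAGGCATCATCATGGCCTTAGCAA"
toy_forward = "GACCGTAAGG"   # hypothetical conserved site, template strand
toy_reverse = "TGCTAAGGCC"   # hypothetical primer for the opposite strand
amplicon = extract_amplicon(toy_template, toy_forward, toy_reverse)
print(amplicon)
```

The extracted region stands in for the sequence that would then be compared against a reference database for identification.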

4 Bioinformatics

Bioinformatics has evolved into a full-fledged multidisciplinary subject that integrates developments in information and computer technology as applied to biotechnology and the biological sciences. It uses computer software tools for database creation, data management, data warehousing, data mining and global communication networking. Analyzing the data requires knowledge from many branches, including biology, mathematics, computer science, the laws of physics and chemistry, and a sound knowledge of information technology. Bioinformatics is not limited to computing on data; it can be used to solve many biological problems and to find out how living things work. It is the comprehensive application of mathematics (e.g., probability and statistics), science (including biochemistry and molecular biology) and a core set of problem-solving methods (e.g., computer algorithms) to the understanding of living systems. Bioinformatics is the recording, annotation, storage, analysis, and searching/retrieval of nucleic acid sequences (genes and RNAs), protein sequences and structural information. This includes databases of the sequences and structural information as well as methods to access, search, visualize and retrieve the information.

Functional genomics, biomolecular structure, proteome analysis, cell metabolism, biodiversity, and drug and vaccine design are some of the areas in which bioinformatics is an integral component. Bioinformatics concerns the creation and maintenance of databases of biological information through which researchers can both access existing information and submit new entries. The most pressing tasks in bioinformatics involve the analysis of sequence information. Computational biology is the name given to this process.

5 Bioinformatics and Its Scope

Bioinformatics has evolved into a full-fledged scientific discipline over the last decade. The definition of Bioinformatics is not restricted to computational molecular biology and computational structural biology. It now encompasses fields such as comparative genomics, structural genomics, transcriptomics, proteomics, cellunomics and metabolic pathway engineering. Developments in these fields have direct implications to healthcare, medicine, discovery of next generation drugs, development of agricultural products, renewable energy, environmental protection etc.

Bioinformatics integrates advances in computer science, information science and information technology to solve complex problems in the life sciences. The core data comprise the genomes and proteomes of organisms from humans to microbes, the 3-D structures and functions of proteins, microarray data, metabolic pathways, cell lines, hybridomas, biodiversity, etc. The sudden growth of quantitative data in biology has made data capture, data warehousing and data mining major issues for biotechnologists and biologists. The availability of enormous amounts of data has brought a realization of the inherent biocomplexity issues, which call for innovative tools for the synthesis of knowledge. Information technology, particularly the internet, is used to collect, distribute and access ever-increasing data, which are later analyzed with mathematics- and statistics-based tools. Bioinformatics has a key role to play in cutting-edge research and development areas such as functional genomics, proteomics, protein engineering, pharmacogenomics, the discovery of new drugs and vaccines, molecular diagnostic kits, agro-biotechnology, etc. This has attracted the attention of several companies and entrepreneurs. As a result, a large number of bioinformatics-based start-ups have been launched and the trend is likely to continue. A bioinformatician must possess expertise in the essential multidisciplinary fields that comprise the core of this new science. Quality research and education in bioinformatics are vital not only to meet the existing challenges but also to set and accomplish new goals in the life sciences.

6 The Potential of Bioinformatics

The potential of bioinformatics in the identification of useful genes leading to the development of new gene products, drug discovery and drug development has led to a paradigm shift in biology and biotechnology. These fields are becoming more and more computationally intensive. The new paradigm, now emerging, is that all the genes will be known “in the sense of being resident in database available electronically”, and that the starting point of biological investigation will be theoretical: a scientist will begin with a theoretical conjecture and only then turn to experiment to follow or test the hypothesis. With a much deeper understanding of biological processes at the molecular level, bioinformatics scientists have developed new techniques to analyze genes on an industrial scale, resulting in a new area of science known as ‘Genomics’. This is the science that deals with the study of whole genomes. It largely encompasses the biology of genetics at the molecular level, i.e., the constitution of DNA and RNA, their analysis, the translation of the chemical information carried by these materials into biological data, and the digitization of that huge body of biological data through computational means.

The shift away from single-gene biology has resulted in the development of strategies, from lab techniques to computer programmes, to analyze whole batches of genes at once. Genomics is revolutionizing drug development, gene therapy, and our entire approach to health care and human medicine. Genomic discoveries are being translated into practical biomedical results through bioinformatics applications. Work on proteomics and genomics will continue using highly sophisticated software tools and data networks that can carry multimedia databases. Thus, research will focus on the development of multimedia databases in various areas of the life sciences and biotechnology. There will be an urgent need for the development of software tools for data mining, analysis and modelling, and downstream processing. It is now universally recognized that bioinformatics is the key to the new, data-intensive molecular biology that will lead us in this century.

7 Activities in Bioinformatics

We can split the activities in bioinformatics in two areas:

  1. The organization: this includes the creation of databases of biological information and the maintenance of the databases. This is very important as we are sequencing tens of millions of bases a year and undertaking to sequence whole organism genomes. The growth of the sequence databases is an unbroken exponential.

  2. Analysis of the data: this includes the following:

    • Development of methods to predict the structure and/or function of newly discovered proteins and structural RNA sequences.

    • Clustering protein sequences into families of related sequences and the development of protein models.

    • Aligning similar proteins and generating phylogenetic trees to examine evolutionary relationships

    • The development of new algorithms and statistics with which to assess relationships among members of large data sets.

    • The development and implementation of tools that enable efficient access and management of different types of information and

    • The analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures in studying microbial diversity.

8 The Need for Bioinformatics

  • Whole genome analysis and sequences

  • Experimental analysis involving thousands of genes simultaneously

  • DNA Chips and Array Analyses – expression arrays, Comparative analysis between species and strains

  • Proteomics: ‘Proteome’ of an organism.

  • Medical applications: Genetic Disease – Pharmaceutical and Biotech Industry

  • Forensic applications

  • Agricultural applications

9 Databases

Computational analysis and comparative microbial genomic studies are developing rapidly, leading to different types of function prediction concepts, the most important of which are gene context and gene content analysis. Gene content analysis is a comparison of gene repertoires across different genomes (Shah et al. 2005; Luscombe et al. 2001). Postgenomic problems such as protein structure determination and gene function identification become more tractable (Gomez et al. 2008) as the number of completely sequenced genomes rapidly increases. Predicting the structures of proteins encoded by genes of interest provides subtle clues regarding the functions of these proteins (Idekar et al. 2001).
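A comparison of gene repertoires can be sketched with plain set operations. This is an illustrative sketch under strong assumptions: the gene identifiers and strain names below are hypothetical, and it presumes that equivalent (orthologous) genes have already been assigned the same identifier, which in practice is the difficult step. The Jaccard index is used here as one possible summary of shared gene content; the source does not prescribe a particular measure.

```python
def compare_gene_content(genes_a, genes_b):
    """Compare two gene repertoires given as sets of gene identifiers."""
    shared = genes_a & genes_b
    union = genes_a | genes_b
    return {
        "shared": shared,
        "only_a": genes_a - genes_b,
        "only_b": genes_b - genes_a,
        # Jaccard index: shared genes as a fraction of all genes observed.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Hypothetical repertoires for two strains (identifiers are made up).
strain_x = {"geneA", "geneB", "geneC", "geneD"}
strain_y = {"geneB", "geneC", "geneE"}

result = compare_gene_content(strain_x, strain_y)
print(sorted(result["shared"]), sorted(result["only_a"]),
      sorted(result["only_b"]), result["jaccard"])
```

Applied to every pair of genomes, the same function would yield a similarity matrix that could feed a clustering or tree-building step.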

Various databases have been established for storing genomic data, and the internet makes it possible for these data to be accessed and shared by the public. Since there are different types of genomic data, it is impossible to build one database containing all data. Currently there are two types of genomic databases. Primary databases (for example, NCBI GenBank) contain sequences and structures together with related annotations, bibliographies, and cross-references to other databases, and provide the basis for biological studies; secondary databases contain biological knowledge obtained by analyzing genomic sequence and structure data. The database of Clusters of Orthologous Groups of proteins (COGs, http://www.ncbi.nlm.nih.gov/COG), for example, contains information for phylogenetic analysis (Tatusov et al. 1997, 2003). The Ribosomal Database Project (RDP) provides ribosome-related data and annotated bacterial and archaeal small-subunit 16S rRNA sequences (Cole et al. 2005, 2009; Larsen et al. 1993). Knowledge from these databases can help to process biological data efficiently. For example, the Gene Ontology database has been used to process microarray datasets (Barrell et al. 2009; Harris et al. 2004). Nucleic acid sequence analysis has proven to be a valuable asset for organism identification in a number of applications. Some of the most interesting applications of this technology are the identification of variant strains of known species, the identification of uncultivable organisms in clinical samples, and the recognition of new species.

10 Web-Based Resources for Microbial Genomics

MicrobesOnline: MicrobesOnline is a website for browsing and comparing prokaryotic genomes. MicrobesOnline is a product of the Virtual Institute for Microbial Stress and Survival, which is sponsored by the US Department of Energy Genomic Science Program.

Integrated Microbial Genomes (IMG): The Integrated Microbial Genomes (IMG) system serves as a community resource for comparative analysis and annotation of all publicly available genomes from three domains of life, in a uniquely integrated context.

CAMERA (Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis): The aim of CAMERA is to serve the needs of the microbial ecology research community by creating a rich, distinctive data repository and a bioinformatics tools resource that will address many of the unique challenges of metagenomic analysis.

DOE JGI Microbial Genomics Database: From this site we can get details about JGI projects, or go directly to the individual microbial sites. All of the individual sites include direct access to download sequence file(s), BLAST, and view annotations.

GOLD™ Genomes OnLine Database: GOLD is a World Wide Web resource for comprehensive access to information regarding complete and ongoing genome projects, as well as metagenomes and metadata, around the world.

JCVI Comprehensive Microbial Resource (formerly The Institute for Genomic Research): The Comprehensive Microbial Resource (CMR) is a free website used to display information on all of the publicly available, complete prokaryotic genomes.

Sanger Centre Bacterial Genomes: The Sanger Institute bacterial sequencing effort is concentrated on pathogens and model organisms. The site provides a list of projects funded, underway or completed; all data from these projects are immediately and freely available.

Microbial Genomes from Genome Channel: Genome Channel is a computer-annotated listing of genomes maintained by the Computational Biology group at Oak Ridge National Laboratory.

Protein Data Bank: The RCSB PDB provides a variety of tools and resources for studying the structures of biological macromolecules and their relationships to sequence, function, and disease.

KEGG (Kyoto Encyclopaedia of Genes and Genomes): A grand challenge in the post-genomic era is a complete computer representation of the cell, the organism, and the biosphere, which will enable computational prediction of higher-level complexity of cellular processes and organism behaviours from genomic and molecular information. Towards this end a bioinformatics resource named KEGG has been developed as part of the research projects of the Kanehisa Laboratories in the Bioinformatics Centre of Kyoto University and the Human Genome Centre of the University of Tokyo.

11 Data Retrieval Methods and Online Resources for Microbial Diversity

In order to use the information available in databases, an efficient information retrieval method is needed to obtain all related information quickly. Such methods differ depending on the type of data to be retrieved. FASTA and BLAST are the two most widely used methods for retrieving sequence data. FASTA was the first fast sequence searching algorithm used for comparing a query sequence against a database (Plewniak 2008; Pearson 1990). The FASTA algorithm performs a rapid and approximate search for matched sequence segments followed by application of the Smith-Waterman alignment algorithm (Plewniak 2008; Pearson 1991) to these segments. Depending on the application, several software tools are freely available online for retrieving microbial data. Some of them are briefly described below:

11.1 Pairwise Alignment

A number of computational methods have been developed and used in genomic studies. Of these methods, genetic sequence alignment is the foundation for many others and is widely used in comparative genomics. A good alignment method should give biologically meaningful results and at the same time be computationally efficient. There are two types of alignment methods, local alignments and global alignments. The former try to identify similar segments between two sequences while the latter try to align the entire length of two sequences. Methods for aligning two sequences are called pairwise alignment methods. BLAST and FASTA are two widely used pairwise alignment methods. BLAST (Basic Local Alignment Search Tool) is a rapid sequence database search tool which is more efficient than FASTA. The output of BLAST is a list of high-scoring segment pairs (HSPs), each with an “E value”, which is an estimate of the number of HSPs with a score of at least S that would be expected to occur by chance in a search of that size. The E value is often used as a standardized measure for estimating the statistical significance of sequence similarity.
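To make the notion of local alignment concrete, the sketch below computes a Smith-Waterman local alignment score by dynamic programming. It is a teaching sketch, not how BLAST or FASTA work internally (both use heuristics to avoid filling the full matrix). The scoring values (match +2, mismatch -1, linear gap penalty -2) are arbitrary assumptions chosen for illustration.

```python
def smith_waterman_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between two sequences.

    Uses the Smith-Waterman recurrence with a linear gap penalty.
    Scores are floored at zero, which lets an alignment start anywhere
    and is what makes the alignment local rather than global.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                0,                        # start a new local alignment
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,    # gap in seq_b
                score[i][j - 1] + gap,    # gap in seq_a
            )
            best = max(best, score[i][j])
    return best


# The two toy sequences share only the segment "ACGT".
local_score = smith_waterman_score("GGGACGTCCC", "TTTACGTAAA")
print(local_score)
```

A global method would instead be forced to align the dissimilar flanks as well, which is why local methods are preferred for database searches.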

These methods can be extended to multiple sequences; however, multiple sequence alignment (MSA) is more complicated. ClustalW (Larkin et al. 2007; Thompson et al. 2002) is a widely used MSA method which is efficient for aligning protein sequences and short nucleotide sequences. However, it may fail for distantly related sequences (Lin et al. 2011). PSI-BLAST (Lee et al. 2008; Schäffer et al. 2001; Altschul et al. 1997) is a very successful method for detecting weak similarities. Two recently developed algorithms, MLAGAN (Brudno et al. 2003) and MAVID (Dewey 2007; Bray and Pachter 2003, 2004), are designed for global alignment of both evolutionarily close and distant megabase length genomic sequences. However, a phylogenetic tree is assumed to be known for use with MLAGAN. MAVID is a progressive global alignment program that works by recursively aligning the ‘alignments’ at ancestral nodes of the guide phylogenetic tree. MAUVE is used for comparing long genome sequences efficiently and takes into account possible large-scale evolutionary events among sequences (Darling et al. 2004).

11.2 Phylogenetic Analysis

The goal of phylogenetic analysis is to reconstruct the evolutionary history of a set of organisms. In molecular epidemiology, it helps to elucidate the mechanisms that lead to microbial outbreaks and epidemics. Phylogenetic analysis usually begins with a multiple sequence alignment of the sequences of a set of organisms. After obtaining an MSA, a number of different phylogenetic methods can be used to compute phylogenetic trees. These methods can be broadly classified into maximum parsimony, distance, and maximum likelihood methods (Stark et al. 2010; Takahashi and Nei 2000). The difference between these methods is how they define which tree is best among all possible trees. Maximum parsimony tries to find the evolutionary tree or trees that require the minimum number of changes from the common ancestral sequences. Maximum likelihood methods compute the probability of observing the MSA given a specific tree and mutation model, and the tree or trees with the highest likelihood are taken as the estimate of the evolutionary history. Distance-based methods construct a tree by hierarchical clustering, using a distance matrix for all organisms that is computed from the MSA. To use an MSA for phylogenetic analysis, it is necessary to assume an underlying mutation model. Of the ones that have been proposed, the Jukes-Cantor (JC) model (Som 2006; Takahashi and Nei 2000) is the simplest. In the JC model, every site in a DNA sequence mutates at the same rate and every substitution between any two of the four nucleotides A, T, C and G is equally likely. These assumptions are not realistic in practice, so many more complex models have been proposed and tried. Successful phylogenetic analysis requires a suitable model. Phylogenetic analysis of microbial strains is problematic because of the dynamic nature of their genomes (Wilmes et al. 2009). Different genes among strains may contain contradictory information about their evolution. Consensus trees have been suggested as a solution. An alternative is the introduction of networks that represent the evolutionary relationships between microbial strains.
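As a worked illustration of a mutation model feeding a distance-based method, the sketch below applies the standard Jukes-Cantor correction, d = -(3/4) ln(1 - (4/3) p), where p is the observed proportion of differing sites between two aligned sequences. The toy sequences are invented, and skipping gapped columns is an assumption made here for simplicity; other gap treatments exist.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor corrected distance (expected substitutions per site)."""
    p = p_distance(seq_a, seq_b)
    if p == 0:
        return 0.0
    if p >= 0.75:
        # The correction is undefined at saturation (p >= 3/4).
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


# Toy aligned sequences differing at 1 of 10 sites (p = 0.1).
toy_a = "ACGTACGTAC"
toy_b = "ACGTACGTAT"
jc = jukes_cantor_distance(toy_a, toy_b)
print(round(jc, 4))
```

The corrected distance is slightly larger than p because the model accounts for multiple substitutions at the same site. Computing it for every pair of sequences gives the distance matrix that a clustering method such as neighbor-joining takes as input.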

11.3 AGeS: A Software System for Microbial Genome Sequence Annotation

AGeS is a fully integrated, high-performance genome sequence annotation software system that analyzes DNA sequences and predicts the protein-coding regions of completed and draft bacterial genomes. It predicts genomic features using a number of bioinformatics methods and provides visualization based on the familiar genome browser.

11.4 SILVA: A Comprehensive Online Resource for Quality Checked and Aligned Ribosomal RNA Sequence Data

SILVA, a resource for analyzing ribosomal RNA sequence data, is available online at http://www.arb-silva.de/. Sequencing ribosomal RNA (rRNA) genes is currently the method of choice for phylogenetic reconstruction and for nucleic acid based detection and quantification of microbial diversity (Pruesse et al. 2007). To cope with the flood of data, the SILVA system was implemented to provide a central, comprehensive web resource for up-to-date, quality-controlled databases of aligned small and large subunit rRNA sequences from the bacteria and archaea domains. It integrates multiple taxonomic classifications and the latest validly described nomenclature, as well as additional information such as whether a sequence was derived from a cultivated organism or a type strain, or belongs to a genome project.

11.5 16S and 23S Ribosomal RNA Mutation Database

Access to the expanded versions of the 16S and 23S Ribosomal RNA Mutation Databases has been improved to permit searches of the lists of alterations for all the data from (1) one specific organism, (2) one specific nucleotide position, (3) one specific phenotype. The URL for the searchable version of the Databases is:  http://ribosome.fandm.edu.

11.6 5S Ribosomal RNA Database

The 5S Ribosomal RNA Database provides information on the nucleotide sequences of 5S rRNAs and their genes. The sequences for particular organisms can be retrieved as single files using a taxonomic browser or in multiple sequence structural alignments. The database is freely available at http://biobases.ibch.poznan.pl/5SData/.

11.7 Greengenes

Greengenes is an online full-length small-subunit (SSU) rRNA gene database, available at <http://greengenes.lbl.gov/>, that keeps pace with public submissions of both archaeal and bacterial 16S rDNA sequences (DeSantis et al. 2003). It addresses a number of limitations currently associated with SSU rRNA records in the public databases by providing automated chimera screening, taxonomic placement of unclassified environmental sequences using multiple published taxonomies for each record, multiple standard alignments, and uniform sequence-associated information curated from GenBank records. Greengenes also provides a suite of tools for manipulating sequences, including an alignment tool, and has been streamlined to interface with the widely used ARB program.

11.8 Ribosomal Database Project

The Ribosomal Database Project – II (RDP-II) (Maidak et al. 2001), available at <http://rdp.cme.msu.edu/>, provides data, tools and services related to ribosomal RNA sequences to the research community. It offers aligned and annotated rRNA sequence data, analysis services, and phylogenetic inferences derived from these data. Release 9.0, currently available on the RDP-II website as a beta release, provides over 50,000 annotated (eu) bacterial sequences aligned with a secondary-structure based alignment algorithm (Brown 2000). Data subsets are available for sequences of length 1,200 or greater and for sequences from type material. Annotation goals include up-to-date name, strain and culture deposit information, and sequence length and quality information. In order to provide a phylogenetic context for the data, RDP-II makes available over 100 trees that span the phylogenetic breadth of life. Web-based research tools are provided for comparing user-submitted sequences to the RDP-II database (Sequence Match), aligning user sequences against the nearest RDP sequence (Sequence Aligner), examining probe and primer specificity (Probe Match), testing for chimeric sequences (Chimera Check), generating a distance matrix (Similarity Matrix) and analyzing T-RFLP data (T-RFLP and TAP-TRFLP), together with a Java-based phylogenetic tree browser (Sub Trees), a sequence search and selection tool (Hierarchy Browser) and a phylogenetic tree building and visualization tool (Phylip Interface). The last of these has been enhanced to allow a choice of either the Phylip neighbor-joining (Felsenstein 1993) or the Weighbor weighted neighbor-joining (Bruno et al. 2000) program for tree construction.

11.9 RISSC – Ribosomal Internal Spacer Sequence Collection

This is a database of ribosomal 16S-23S spacer sequences intended mainly for molecular biology studies in typing, phylogeny and population genetics. It compiles more than 2,500 entries of edited DNA sequence data from the 16S-23S ribosomal spacers present in most prokaryotes and organelles. Ribosomal spacers have proven to be extremely useful tools for typing and identifying closely related prokaryotes because of their high variability in size and/or sequence, which is much greater than that of the flanking 16S and 23S rRNA genes. These genes are commonly used to establish molecular relationships among microbes at a taxonomic level of species or higher (e.g. genus, domain). However, their internal transcribed spacers (ITS) are much more useful for discriminating at the species or even strain level (Iwen et al. 2002). RISSC, available at <http://ulises.umh.es/RISSC>, provides the scientific community with a comprehensive set of ribosomal spacer sequences, fully edited and characterized, with the presence or absence of tRNA genes within them as a key feature, ready to be compared with users' own ITS sequences.

11.10 probeBase

probeBase is a curated database of annotated rRNA-targeted oligonucleotide probes and supporting information (Loy et al. 2003, 2007). Rapid access to probe, microarray and reference data is achieved by powerful search tools and via different lists that are based on selected categories such as functional or taxonomic properties of the target organism(s), or the hybridization system in which the probes were applied. Additional information on probe coverage and specificity is available through direct submissions of probe sequences from probeBase to RDP-II and Greengenes, two major rRNA sequence databases.

probeBase is available at <http://www.microbial-ecology.net/probebase>. Its entries increased from 700 to more than 1,200 during the past 3 years. Several options for submission of single probes or entire probe sets, even prior to publication of newly developed probes, should further help keep probeBase an up-to-date and useful resource.

11.11 RRNDB

The Ribosomal RNA Operon Copy Number Database (RRNDB), available at <http://rrndb.cme.msu.edu/>, contains annotated information on rRNA operon copy number among prokaryotes. Gene redundancy is uncommon in prokaryotic genomes; however, rRNA genes can vary from one to as many as 15 copies. Despite the widespread use of 16S rRNA gene sequences for identification of prokaryotes, information on the number and sequence of individual rRNA genes in a genome is not readily accessible. Each entry in RRNDB contains detailed information linked directly to external websites including the Ribosomal Database Project, GenBank, PubMed, and several culture collections.

12 Identification of New Species or Variant Strains of Known Species

Bioinformatics has made it easier for researchers to study microbial biodiversity because it directly supports molecular identification and data storage and retrieval, which were long-standing concerns of systematic research. Bioinformatics-driven approaches have enabled efficient work on microbial diversity, identification, characterization, molecular taxonomy and community analysis of both culturable and unculturable organisms. Descriptions of new species, genera and even molecular taxa increased dramatically in the literature after the 1990s, and these efforts are largely driven by advances in sequencing technologies. Phenotypic identification methods classically require a probability-based analysis to determine identity. Where the identification probability with a known species is ≥98%, the identification is generally considered acceptable. The lower the probability, however, the less reliable the identification becomes, frequently resulting in supplemental testing to resolve discrepancies among test results. It is not unusual for a laboratory to be unable to identify variant strains of known species using phenotypic methods. DNA sequencing now gives the laboratory a means to resolve those instances where phenotypic testing cannot differentiate among closely related organisms.

An isolate that does not match known schemes for phenotypic identification may represent a previously unrecognized species (Relman 2002). Sequencing of regions within the rDNA complex may suggest a new species when there is <98% sequence similarity with known species. Separating a new species from an atypical strain of a known species is, however, difficult. The first step in recognizing a new species is to determine the phylogenetic position of the suspected new species relative to closely related known species. Phylogenetic trees based on the 16S rRNA gene for bacteria and the 18S rRNA gene for fungi are commonly used for this type of analysis. The 16S rRNA approach relies on point mutations, which accumulate slowly in these genes. Before microbial genomes were sequenced, 16S rRNA databases were used, and bacteria, archaea, and eukaryotes were identified on that basis.
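
The 98% similarity criterion can be sketched in a few lines of Python. The example assumes that the query and reference rDNA sequences have already been aligned to equal length, that columns containing a gap are ignored, and that similarity is measured as simple percent identity. These choices, the function names and the toy data are assumptions for illustration and not a prescribed protocol.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


def may_be_new_species(query, references, threshold=98.0):
    """Return (flag, best_reference_name, best_identity).

    flag is True when even the closest known species falls below the
    threshold, which only *suggests* a new species; phylogenetic placement
    and further evidence are still needed.
    """
    best_name, best_identity = max(
        ((name, percent_identity(query, ref)) for name, ref in references.items()),
        key=lambda item: item[1],
    )
    return best_identity < threshold, best_name, best_identity


# Toy 100-base "alignment", invented for illustration only.
query = "ACGT" * 25
references = {
    "known_species_1": "T" + query[1:],      # 1 mismatch  -> 99% identity
    "known_species_2": "TTTTT" + query[5:],  # 4 mismatches -> 96% identity
}
print(may_be_new_species(query, references))
# (False, 'known_species_1', 99.0)
```

Because a value just below the threshold may equally reflect an atypical strain of a known species, the flag is best read as a prompt for the phylogenetic analysis described above rather than as a decision.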

A high degree of phenotypic consistency and rDNA sequence similarity, together with a significant degree of DNA-DNA hybridization, is suggestive of a new species.

13 Bioinformatics Challenges

Many bioinformatics tools have been borrowed from the fields of artificial intelligence, data mining, and statistics. However, the characteristics of biological data may differ significantly from those of the data for which the methods were originally developed. Although many computational methods based on these approaches have been introduced for genomic data analysis, several challenges remain. Public databases such as GenBank are useful, but the lack of quality sequences, the absence of sequence information for a large number of species, and the limited availability of computational tools to reliably analyze the results are drawbacks to this technology. High dimensionality is a further problem: a typical DNA microarray might have thousands of features (probes) for, at most, one hundred samples. Feature reduction is typically required before such data can be analyzed (Al-Khaldi et al. 2012; Bier et al. 2008; Yauk and Berndt 2007). Another challenge is integrating data from different sources. These datasets might show a high degree of heterogeneity and might also vary in quality. They might be generated using different experimental platforms or based on different molecular methods. Using these data together efficiently requires developing suitable bioinformatics methods. The simplest of these methods is to pool several datasets into a larger dataset and then analyze the larger dataset. However, this method will not work if the formats of the original datasets differ. Furthermore, the best processing methods for different datasets are not the same. For example, Dice coefficients work well for some PFGE data but do not work well for some VNTR data. Thus, it might be impossible to choose an optimal method for a combined dataset. An alternative is to process the different datasets separately and then combine the results to obtain the final result. The difficulties with this kind of method, however, are determining the extent to which the different sources of data should contribute and explaining the combined results.
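
The "process separately, then combine" strategy can be sketched as follows. The example assumes that a PFGE profile is represented as a set of band positions compared with the Dice coefficient, 2|A ∩ B| / (|A| + |B|), that a VNTR profile is a tuple of repeat counts compared by the fraction of identical loci, and that the two similarities are merged by a weighted average. The VNTR measure, the weights and all names are assumptions for illustration. Real PFGE analysis also needs a band-position tolerance, which is omitted here.

```python
def dice_coefficient(bands_a, bands_b):
    """Dice similarity between two band patterns given as collections of positions."""
    a, b = set(bands_a), set(bands_b)
    if not a and not b:
        return 1.0  # two empty patterns are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))


def vntr_similarity(repeats_a, repeats_b):
    """Fraction of VNTR loci with identical repeat counts (assumed measure)."""
    if len(repeats_a) != len(repeats_b) or not repeats_a:
        raise ValueError("profiles must cover the same, non-empty set of loci")
    same = sum(1 for x, y in zip(repeats_a, repeats_b) if x == y)
    return same / len(repeats_a)


def combine_similarities(similarities, weights):
    """Weighted average of per-source similarities; weights are a modelling choice."""
    total = sum(weights[source] for source in similarities)
    return sum(weights[source] * value for source, value in similarities.items()) / total


# Toy profiles for two isolates, invented for illustration only.
pfge = dice_coefficient({1, 2, 3, 4}, {3, 4, 5, 6})   # 0.5
vntr = vntr_similarity((3, 5, 2, 7), (3, 5, 4, 7))    # 0.75
combined = combine_similarities(
    {"pfge": pfge, "vntr": vntr},
    {"pfge": 1.0, "vntr": 1.0},  # equal weights, chosen arbitrarily
)
print(pfge, vntr, combined)  # 0.5 0.75 0.625
```

The explicit `weights` argument makes the open question from the text visible: the combined value changes with the weighting, and nothing in the data alone says which weighting is right.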

14 Conclusion

The development of computational methods based on organized algorithms, interpretational skills and high storage capacities has facilitated the comparison of entire genomes and thus permits biologists to study more complex evolutionary phenomena such as gene duplication and horizontal gene transfer, and to predict factors important in speciation (Nakashima et al. 2005). Bioinformatics researchers have extensively compared multiple genomes to correlate and classify them into various families and to study evolution. Many researchers have established that overall evolution combines point mutations with the restructuring of genomes through gene duplication, gene insertion, gene deletion, horizontal gene transfer, etc. The ultimate aim of such studies is to decipher the evolutionary lineages among groups of organisms in a quest to determine the tree of life and the last universal common ancestor. Progress in bioinformatics and in wet-lab techniques must remain interdependent and focused, each complementing the other, for their own advancement and for the future progress of biotechnology.

15 Some More Web Addresses for Bioinformatics Tools

| Name of tool/database | Web address |
| --- | --- |
| ASD | http://www.ebi.ac.uk/asd |
| AUGUSTUS | http://augustus.gobics.de/ |
| BLAST | http://www.ncbi.nlm.nih.gov/blast |
| CFSSP | http://www.biogem.org/tool/chou-fasman/ |
| Clustal W | http://www.ebi.ac.uk/Tools/clustalw2/index.html |
| Compute pI/Mw | http://web.expasy.org/compute_pi/ |
| CpG Island Searcher | http://www.uscnorris.com/cpgislands2/cpg.aspx |
| CpGPlot | http://www.ebi.ac.uk/Tools/emboss/cpgplot/index.html |
| DDBJ BLAST | http://blast.ddbj.nig.ac.jp |
| DNA tools | http://biology.semo.edu/cgi-bin/dnatools.pl |
| Entrez Gene | http://www.ncbi.nlm.nih.gov/sites/entrez |
| ESLPred2 | http://www.imtech.res.in/raghava/eslpred2/ |
| ExPASy | http://expasy.org/tools/ |
| FEX | http://www.softberry.ru/berry.phtml |
| FGENESH | http://www.softberry.ru/berry.phtml |
| GeneMark.hmm | http://www.itba.mi.cnr.it/webgene/ |
| GOR | http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_gor4.html |
| HMMgene | http://www.cbs.dtu.dk/services/HMMgene/ |
| MGI | http://www.informatics.jax.org/ |
| MultiLoc2 | http://abi.inf.uni-tuebingen.de/Services/MultiLoc2 |
| Myristoylator | http://web.expasy.org/myristoylator/ |
| NetAcet | http://www.cbs.dtu.dk/services/NetAcet/ |
| NetOGlyc | http://www.cbs.dtu.dk/services/NetOGlyc |
| NetPhos | http://www.cbs.dtu.dk/services/NetPhos/ |
| NetPhosK | http://www.cbs.dtu.dk/services/NetPhosK/ |
| NetSurfP | http://www.cbs.dtu.dk/services/NetSurfP/ |
| NMT | http://mendel.imp.ac.at/myristate/SUPLpredictor.htm |
| OligoCalc | http://www.basic.northwestern.edu/biotools/oligocalc.html |
| PSIPRED v3.0 | http://bioinf.cs.ucl.ac.uk/psipred/ |
| SherLoc2 | http://abi.inf.uni-tuebingen.de/Services/SherLoc2 |
| SIGSCAN | http://www-bimas.cit.nih.gov/molbio/signal/ |
| SMS | http://www.bioinformatics.org/sms/ |
| TermiNator | http://www.isv.cnrs-gif.fr/terminator3/index.html |
| TFBIND | http://tfbind.hgc.jp/ |
| TFSEARCH | http://www.cbrc.jp/research/db/TFSEARCH.html |