1 Introduction

Databases are convenient systems for storing, searching, and retrieving many types of data. A database makes it easy to handle and share large amounts of data and supports large-scale analysis through easy access and data updates (Liu and Özsu 2009).

Because of the vast amount of data now generated in genome, transcriptome, and exome sequencing experiments on various organisms, biological data are accumulating at an exponential rate. The availability of this enormous amount of biological data (sequence as well as structural data) has created a need for managing, storing, and retrieving it.

Biological databases have therefore come into existence as invaluable resources for the biological community. In a nutshell, they are libraries for the storage and representation of biological data obtained from the scientific community, and they help convert data into knowledge.

2 History

The first biological database was a book published in 1965, Atlas of Protein Sequence and Structure, by Margaret Dayhoff and colleagues. The first edition was limited to only 65 sequences, and further editions were published in the 1970s (Dayhoff and Foundation 1973, 1976; Foundation 1972).

With the invention of the integrated circuit, powerful and reliable third-generation computers became the storage medium of choice for biological databases. In 1989 the English scientist Tim Berners-Lee invented the “World Wide Web” (WWW), which is the primary tool people use to interact on the Internet and the means of access to all biological databases. The production of high-throughput sequencing machines led to data-rich science, which needs an interdisciplinary arena for developing software tools to understand biological data. The field of science that involves computer science, statistics, and engineering in the study of biological data is called bioinformatics.

3 Classification of Biological Databases

figure a

3.1 Databases Based on Data Types

Databases in this category are divided into several types; some of them are discussed below in detail.

figure b

3.1.1 Sequence Databases

Sequence databases contain both nucleic acid and protein sequences. Nucleotide sequence repositories are discussed first.

  (I) Nucleic Acid Sequence Database

There are three main nucleotide sequence repositories:

  (A) GenBank

  (B) European Molecular Biology Laboratory (EMBL)

  (C) DNA Data Bank of Japan (DDBJ)

Raw nucleic acid sequences are stored in these databases and made available through the Internet. Initially, these databases worked independently, but later the International Nucleotide Sequence Database Collaboration (INSDC, http://insdc.org) was formed to coordinate DDBJ, GenBank, and EMBL (Fig. 1.1). The teams at the collaborating organizations exchange data through constant communication, so that the same sequences can be accessed at all three databases, each in its own format.

Fig. 1.1
figure 1

The home page of International Nucleotide Sequence Database Collaboration (INSDC) (http://insdc.org)

  (A) GenBank

GenBank is a collection of raw and annotated nucleotide sequences as well as protein information. It is maintained by, and accessed through, the National Center for Biotechnology Information (NCBI) as part of the INSDC (Benton 1990). A new release is made every 2 months. GenBank release 188.0 of February 15, 2012, contained approximately 137,384,889,783 bases from 149,819,246 sequence records. Typing “insulin” in the search box on the GenBank home page lists partial and complete insulin gene sequences from different organisms (Fig. 1.2).

Fig. 1.2
figure 2

Using GenBank to query insulin sequences (http://www.ncbi.nlm.nih.gov/nuccore/?term=insulin)

  • Example of GenBank Format

figure c
  • Format Explanation

A GenBank record begins with the locus name, which is similar to the accession number and unique to the entry, followed by the sequence length. In our example the sequence length is 587 bp. The definition describes the source organism, the gene/protein name, and other details about the sequence.

  • Accession number is the unique identifier of the sequence (NM_013564).

  • Version consists of the accession number followed by a suffix that increases by 1 whenever the sequence data change. In our example, the version is NM_013564.7; this indicates the seventh version of the sequence.

  • GI (GenInfo Identifier) number runs parallel to the accession number and version system. A new GI is allotted whenever the sequence changes and the version increases by one. In our example, the GI is 365192585.

  • Keywords are words or expressions about sequence. The keyword field contains a dot if nothing is provided.

  • Source contains name of the organism from which the sequence has been derived.

  • Organism is a related sub-keyword of source and contains the scientific name of the organism along with the lineage as described in NCBI taxonomy database.

  • Reference contains the publication by the authors of the sequence.

  • Authors contains the list of authors in the same order as they appear in the publication.

  • Title shows the title of published/unpublished work.

  • Journal contains MEDLINE abbreviations of the journal name where the work is published.

  • PubMed field provides the PubMed identifier (PMID) of that article.

  • Comment points out changes that have occurred in the submitted sequence.

  • Features provides information about genes and their products, segments of biological significance in the submitted sequence, and other characteristics.

  • Gene provides the gene length, the gene name, its function, and its synonyms. CDS denotes the coding sequence, which codes for the protein sequence.

  • Origin contains the sequence data. Finally, a GenBank record ends with the // sign.
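The field layout described above can be illustrated with a minimal parsing sketch. The record below is hypothetical and heavily trimmed: the accession, version, GI, and length are the values quoted in the text, while the definition line and the sequence letters are placeholders. The function name `parse_genbank` is an assumption made for this illustration and is not part of any NCBI tool.

```python
# Hypothetical, heavily trimmed record: identifiers are the ones quoted in the
# text; the definition line and the (truncated) sequence are placeholders.
RECORD = """\
LOCUS       NM_013564                587 bp    mRNA    linear
DEFINITION  placeholder definition line.
ACCESSION   NM_013564
VERSION     NM_013564.7  GI:365192585
KEYWORDS    .
ORIGIN
        1 acgtacgtac gtacgtacgt
//
"""


def parse_genbank(text):
    """Collect top-level fields and the sequence from a flat-file record."""
    fields = {}
    sequence = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):      # end-of-record marker
            break
        if in_origin:
            # Sequence lines: a position number followed by blocks of bases.
            sequence.append("".join(line.split()[1:]))
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line and not line[0].isspace():
            # A field keyword starts in the first column.
            key, _, value = line.partition(" ")
            fields[key] = value.strip()
    fields["sequence"] = "".join(sequence)
    return fields


record = parse_genbank(RECORD)
# Split "NM_013564.7" into the accession and the version suffix.
accession, _, version = record["VERSION"].split()[0].partition(".")
print(accession, version, record["KEYWORDS"], record["sequence"])
```

The sketch only handles single-line top-level fields; a real record also has indented sub-keywords and continuation lines, which would need extra handling.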

  • Sequence Submission to GenBank

Sequences are submitted using different tools available at NCBI. A few of them are:

NCBI was established in 1988 as a part of the US National Library of Medicine (NLM), located at Bethesda, Maryland. It is a division of the National Institutes of Health and is directed by David Lipman. Since 1992 NCBI has been responsible for making the GenBank nucleotide sequence database available. NCBI plays a remarkable role for biological scientists by providing various public databases and software tools for sequence analysis (Table 1.1). GenBank coordinates with individual laboratories and with other sequence databases such as those of the EMBL and the DDBJ. Since 1992, NCBI has also grown to run other databases in addition to GenBank ((US) 2013). The home page of NCBI is shown in Fig. 1.3.

  • Databases and Tools of NCBI

Table 1.1 Various databases and software tools of NCBI for sequence analysis
Fig. 1.3
figure 3

The home page of National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/)

  • Database Retrieval Tool

Entrez (www.ncbi.nlm.nih.gov/Entrez/), shown in Fig. 1.4, is the primary text search engine of NCBI and covers 40 molecular and literature databases. It retrieves a wide range of information, from PubMed literature to DNA and protein sequences and structures, genes, genomes, genetic variation, and gene expression.

Fig. 1.4
figure 4

The home page of Entrez (www.ncbi.nlm.nih.gov/Entrez/)
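Entrez can also be queried programmatically. The sketch below only builds a search URL for the NCBI E-utilities `esearch` service, which is not described in this chapter; the helper name `build_search_url` and the default of 20 results are assumptions made for illustration. No network request is made.

```python
from urllib.parse import urlencode

# Base address of the E-utilities "esearch" service (not described in the text).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(database, term, max_results=20):
    """Return an esearch URL for a text query against one Entrez database."""
    params = {"db": database, "term": term, "retmax": max_results}
    return ESEARCH + "?" + urlencode(params)


# "nuccore" is the nucleotide database name used in the GenBank query URL above.
url = build_search_url("nuccore", "insulin")
print(url)
```

Opening the resulting URL in a browser, or fetching it with an HTTP client, should return the identifiers of matching records; heavy automated use is subject to NCBI's usage guidelines.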

  (B) European Molecular Biology Laboratory (EMBL)

The European Molecular Biology Laboratory (EMBL) (http://www.embl.org/), shown in Fig. 1.5, is a molecular biology organization supported by 20 European countries, with Australia as an associate member state. It is an intergovernmental organization created in 1974. It develops and maintains a large number of databases, and scientists can access the data free of cost. The laboratory operates from five different locations. The main laboratory is in Heidelberg, Germany, while the European Bioinformatics Institute (EBI), EMBL’s UK outstation, is a hub for bioinformatics research and services, directed by Dr. Rolf Apweiler and Dr. Ewan Birney. EMBL is a part of INSDC, together with DDBJ and GenBank. The result of searching for the insulin gene with the EMBL search engine is shown in Fig. 1.6.

Fig. 1.5
figure 5

The home page of European molecular biology laboratory (http://www.embl.org/)

Fig. 1.6
figure 6

Insulin gene search at European molecular biology laboratory website (https://www.ebi.ac.uk/ebisearch/search.ebi?query=insulin&db=allebi&requestFrom=searchBox)

  • EMBL File Format

figure d
figure e
  • Sequence Retrieval System (SRS)

SRS (http://srs.ebi.ac.uk/) (Fig. 1.7) is a powerful search tool for retrieving sequences (and other types of data) from EMBL and for performing various operations on the retrieved information. Like Entrez at NCBI, it is a search engine for extracting all sorts of information available at EMBL.

Fig. 1.7
figure 7

The home page of Sequence Retrieval System (http://srs.ebi.ac.uk/)

  • Sequence Submission at EMBL

There are mainly three tools available for submitting data at EMBL.

  1. Webin: for nucleotide sequence submission

  2. Sequin: a stand-alone tool, developed by NCBI, for submitting nucleotide sequences to GenBank, EMBL, and DDBJ

  3. Webin-Align: a tool for sequence alignment submission

  (C) DNA Data Bank of Japan (DDBJ)

DDBJ (http://ddbj.sakura.ne.jp/) (Fig. 1.8), a part of INSDC, was established in 1986 at the National Institute of Genetics (NIG), Japan, with the support of the Ministry of Education, Culture, Sports, Science and Technology, Japan.

Fig. 1.8
figure 8

The home page of DNA Data Bank of Japan (http://ddbj.sakura.ne.jp/)

  • SAKURA

SAKURA (http://sakura.ddbj.nig.ac.jp/top-e.html) is a WWW-based data submission system through which one can enter and submit nucleotide sequences and translated amino acid sequences. It has been open to the public and the scientific community since 1995.

  • DDBJ Format

figure f
  (II) Protein Sequence Databases

The different protein sequence databases available are the following:

  (A) Protein Information Resource

  (B) UniProt

  (A) Protein Information Resource (PIR)

The Protein Information Resource (PIR) originated from work begun by Margaret Dayhoff in the 1960s at the National Biomedical Research Foundation (NBRF) to investigate evolutionary relationships among proteins. PIR provides analysis tools for protein databases that are freely available to scientists (George et al. 1997).

In 2002, the Protein Information Resource and its international partners, the EBI and the Swiss Institute of Bioinformatics (SIB), received an award from the National Institutes of Health (NIH) to create UniProt by merging the PIR-PSD, SWISS-PROT, and TrEMBL databases (Fig. 1.9).

Fig. 1.9
figure 9

The home page of Protein Information Resource (http://pir.georgetown.edu/)

  (B) UniProt

UniProt comprises two sections:

  (a) SWISS-PROT

  (b) Translated EMBL (TrEMBL)

  (a) SWISS-PROT

SWISS-PROT (http://www.uniprot.org/) (Fig. 1.10), established in 1986, is the most widely used protein sequence database and was created collaboratively by the University of Geneva and the EMBL. After 1994, the collaboration moved to EMBL’s UK outstation, the EBI.

  • SWISS-PROT Format

Fig. 1.10
figure 10

The home page of UniProt (http://www.uniprot.org/)

Each line starts with a two-character line code, which specifies the kind of data contained in the line.
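The two-character line codes make such entries easy to process line by line. The sketch below groups the lines of an entry by their code. The entry is hypothetical: the line codes shown (ID, AC, DE, OS, SQ, and the // terminator) are standard, but every value is a placeholder, and the function name `group_by_line_code` is an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical entry: the line codes are standard, the content is a placeholder.
ENTRY = """\
ID   EXAMPLE_PROT   Reviewed;   10 AA.
AC   X00000;
DE   Placeholder protein name.
OS   Placeholder organism.
SQ   SEQUENCE   10 AA;
     MKTAYIAKQR
//
"""


def group_by_line_code(text):
    """Map each two-character line code to the list of its line contents."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        code, content = line[:2], line[5:].strip()
        if code == "//":          # terminator line ends the entry
            break
        if code == "  ":          # sequence data lines start with blanks
            code = "sequence"
        grouped[code].append(content)
    return dict(grouped)


entry = group_by_line_code(ENTRY)
print(entry["AC"], entry["sequence"])
```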

  (b) Translated EMBL

TrEMBL uses the SWISS-PROT format and comprises translations of all coding sequences (CDS) in EMBL. It has two core divisions, designated SWISS-PROT-TrEMBL and REM-TrEMBL.

3.1.2 Structure Databases

  • PDB (Protein Data Bank)

  • MMDB (Molecular Modeling Database)

  • VAST (Vector Alignment Search Tool)

  • CDD (Conserved Domain Database)

  • NDB (Nucleic acid Structure Database)

Some of the above databases are described below in detail.

  (I) Protein Data Bank (PDB)

The PDB (http://www.rcsb.org/pdb/home/home.do), shown in Fig. 1.11, is a source of three-dimensional structural data for large biological molecules, including proteins and nucleic acids. It was established in 1971 and is managed by the Research Collaboratory for Structural Bioinformatics (RCSB). The data submitted by scientists from different parts of the world are freely available through the Internet. The PDB is supervised by the Worldwide Protein Data Bank (wwPDB) (Berman 2008).

Fig. 1.11
figure 11

The home page of PDB with the query Hemoglobin (http://www.rcsb.org/pdb/home/home.do)

As of March 20, 2012, at 5 PM PDT, there were 80,264 structures. Each structure is assigned a PDB ID of four alphanumeric characters. The first character is a numeral, while the last three characters can be either numerals or letters. The search results and the structure for hemoglobin are shown in Figs. 1.11 and 1.12.
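The shape of a PDB ID described above can be checked with a short regular expression. This is a minimal sketch of the stated rule only: it tests the format and does not check whether an entry with that ID actually exists. The function name `is_pdb_id` and the candidate strings are illustrative assumptions.

```python
import re

# One numeral followed by three characters that are numerals or letters.
PDB_ID = re.compile(r"[0-9][0-9A-Za-z]{3}")


def is_pdb_id(text):
    """Return True when text has the four-character shape described above."""
    return PDB_ID.fullmatch(text) is not None


for candidate in ["4HHB", "1abc", "HHB4", "12345"]:
    print(candidate, is_pdb_id(candidate))
```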

Fig. 1.12
figure 12

Search result of Protein Data Bank (http://www.rcsb.org/pdb/results/results.do?qrid=57082E24&tabtoshow=Current)

  • PDB File Format

The file format originally used by the Protein Data Bank is known as the PDB file format. The PDB also distributes data on biological macromolecules in the “macromolecular crystallographic information file” (mmCIF) format, which began to be phased in in 1996. In 2005, an Extensible Markup Language (XML) version, PDBML, was described (Westbrook et al. 2005).

  • Data Deposition Tool of PDB

The Auto Dep Input Tool (ADIT) (http://deposit.rcsb.org/adit/) (Fig. 1.13) was developed by RCSB and is used to deposit structures in the PDB in an efficient manner.

Fig. 1.13
figure 13

The home page of Auto Dep Input Tool (http://deposit.rcsb.org/adit/)

  (II) Nucleic Acid Structure Database (NDB)

This database (http://ndbserver.rutgers.edu/) (Fig. 1.14) provides 3D structures of nucleic acids.

Fig. 1.14
figure 14

The home page of nucleic acid structure database (http://ndbserver.rutgers.edu/)

3.1.3 Literature Database

Literature databases provide a library of life science work done all over the world. The available literature databases include the following:

  • MEDLINE

  • CiteXplore

  • OMIM

  • Patent abstracts

  • FlyBase archives

3.1.4 Pathway Database

Pathway databases provide pathway maps that help in understanding molecular interactions and chemical reaction networks. The available pathway databases include the following:

  • BioCyc database collection comprising EcoCyc and MetaCyc

  • KEGG PATHWAY Database (www.genome.jp/kegg/)

  • MANET database

  • Reactome (Cold Spring Harbor Laboratory, EBI, Gene Ontology Consortium)

3.1.5 Chemical Database

A chemical database is a collection specifically designed to store chemical information. A few freely available chemical databases are:

  • Chemical Entities of Biological Interest (ChEBI)

  • PubChem

  • ZINC

  • eMolecules

  • DrugBank

3.1.6 Enzyme Database

Enzyme databases cover an extensive range of properties and functions, such as structure, occurrence, kinetics of enzyme-catalyzed reactions, and metabolic function. Various enzyme databases available are the following:

  • ExPASy

  • BRENDA

  • REBASE

  • EC enzyme database

3.1.7 Disease Database

The disease database provides all disease-related information; it is a cross-referenced index of diseases, symptoms, medications, signs, abnormal investigation findings, etc.

  • OMIM

  • OMIA

3.1.8 Domain Database

A domain database stores information on conserved (ancient) protein domains and on full-length proteins.

  • CDD (Conserved Domain Database)

3.1.9 Structural Classification of Protein Database

This type of database provides a hierarchical classification of protein structures that describes the evolutionary relationships between proteins.

3.1.10 Genome Database

Genome databases collect the genome sequences of many species, annotate and analyze them, and provide free public access.

  • Genome Databases at the National Center for Biotechnology Information (Index)

  • Genome Databases at the National Center for Biotechnology Information (Entrez)

  • Genome Databases at the National Center for Biotechnology Information (PMGif) Genome List in NIH

  • Mitochondrial DNA Database (MitBASE)

  • Mouse Genome Informatics

  • Plant Genome Project maintained by the National Science Foundation

  • Organelle Genome Sequences (PMGif)

3.2 Biological Databases Based on Database Source

Databases in this category are subdivided into two types, primary and secondary.

  1. Primary: databases comprising experimentally generated data, such as nucleotide sequences and 3D structures, are known as primary databases.

Examples are GenBank, DDBJ, EMBL, PIR, PDB, NDB, UniProt, TrEMBL, SWISS-PROT, etc.

  2. Secondary: databases whose content is derived directly from the primary databases are known as secondary databases.

Examples are PROSITE, Pfam, Blocks, Prints, SCOP, CATH, OMIM, KEGG, etc.

3.3 Composite Databases

A composite database combines several different primary database sources, so that a single query searches all of them. This makes searching more efficient and gives easy access to the underlying primary databases.

Examples are OWL, NRDB, MIPSX, SP, and TrEMBL.
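One way such a merge might work is sketched below: records from two sources are combined, and identical sequences are collapsed into a single entry that keeps every source identifier. This is an illustrative assumption, not the method used by any of the databases named above; the source names, identifiers, and sequences are all placeholders.

```python
def merge_sources(*sources):
    """Merge (identifier, sequence) records, keeping one entry per distinct sequence."""
    merged = {}
    for name, records in sources:
        for identifier, sequence in records:
            # Identical sequences from different sources collapse into one entry
            # that remembers every source identifier.
            merged.setdefault(sequence, []).append(f"{name}:{identifier}")
    return merged


# Hypothetical records; names, identifiers, and sequences are placeholders.
source_a = ("sourceA", [("A1", "MKTAYIAK"), ("A2", "MSLLTEVE")])
source_b = ("sourceB", [("B7", "MKTAYIAK"), ("B9", "MGDVEKGK")])

composite = merge_sources(source_a, source_b)
print(len(composite), composite["MKTAYIAK"])
```

Four input records yield three composite entries here, because one sequence occurs in both sources.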

3.4 Biological Databases Based on Database Design

Databases in this category are subdivided into two types, object-oriented and relational databases.

3.4.1 Object Oriented

An object-oriented database is a database management system in which information is represented in the form of objects. These databases differ from table-oriented relational databases.

Objects mainly consist of attributes and methods.

How Data Is Stored

There are two methods used for the storage of objects:

  • Each object has a unique ID and is defined as a subclass of a base class, inheriting attributes from it.

  • Virtual memory mapping is used for object storage and management.
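The idea of objects with a unique ID, inherited attributes, and methods can be sketched in a few lines. The class names `StoredObject` and `Sequence` are hypothetical and chosen only for illustration; the sketch keeps objects in ordinary program memory and does not model virtual memory mapping or persistent storage.

```python
import itertools


class StoredObject:
    """Base class: every stored object receives a unique ID."""

    _ids = itertools.count(1)

    def __init__(self):
        self.object_id = next(StoredObject._ids)


class Sequence(StoredObject):
    """Subclass that inherits the ID and adds its own attributes and a method."""

    def __init__(self, name, residues):
        super().__init__()
        self.name = name            # attribute
        self.residues = residues    # attribute

    def length(self):               # method
        return len(self.residues)


first = Sequence("example_1", "ACGT")
second = Sequence("example_2", "ACGTAC")
print(first.object_id, second.object_id, second.length())
```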

3.4.2 Relational Database

Relational databases can be thought of as comprehensive tables of data. Each record from a flat file can be implemented as a row in a table. Although a relational database can be implemented as a single large table or “relation,” it is often helpful to split the database up into multiple tables (Fig. 1.15).

Fig. 1.15
figure 15

Four tables are shown: plasmid, vector, DNA, and location. Fields that reference other tables are referred to as links. Numerous factors have to be considered when designing a relational database (http://home.cc.umanitoba.ca/)

A benefit of relational databases is that, because the database is broken up into several tables, in many circumstances only one table needs to be rewritten when fields are changed. In other cases, the addition of a record may require rewriting many or most tables.
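The benefit of splitting data across linked tables can be sketched with Python's built-in `sqlite3` module. Only two of the four tables from Fig. 1.15 are modeled, and the column names and row values are assumptions made for this illustration, not the schema of the figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vector (
    vector_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE plasmid (
    plasmid_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    vector_id  INTEGER REFERENCES vector(vector_id)  -- link to the vector table
);
""")
conn.execute("INSERT INTO vector VALUES (1, 'vector_A')")
conn.executemany(
    "INSERT INTO plasmid VALUES (?, ?, ?)",
    [(1, "plasmid_1", 1), (2, "plasmid_2", 1)],
)

# Renaming the vector touches one row in one table; both plasmids see the change.
conn.execute("UPDATE vector SET name = 'vector_B' WHERE vector_id = 1")

rows = conn.execute(
    "SELECT plasmid.name, vector.name FROM plasmid "
    "JOIN vector ON plasmid.vector_id = vector.vector_id "
    "ORDER BY plasmid.plasmid_id"
).fetchall()
print(rows)
```

Had the vector name been stored in every plasmid row of a single large table, the same change would have required rewriting each of those rows.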