1 Introduction

This chapter is designed to serve as a practical guide to using the GenBank nucleotide sequence database in biological research. The chapter is divided into five sections including this introduction. Subheading 2 provides a summary of the content of the database and describes various methods of access. Subheading 3 presents several common ‘methods’ that can be applied to the GenBank data. Subheading 4 provides additional details and examples regarding the described methods. Subheading 5 provides email addresses to use when submitting data to GenBank and when seeking help with using the data.

2 Materials

The sections below describe the GenBank database along with its release formats, release cycle, composition, methods of access and integration with other biological resources.

2.1 The GenBank Database

GenBank [1] is a comprehensive public database of nucleotide sequences and supporting bibliographic and biological annotation. GenBank is maintained and distributed by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM) at the US National Institutes of Health (NIH) in Bethesda, MD. NCBI builds GenBank from several sources including the submission of sequence data from authors and from the bulk submission of expressed sequence tag (EST), genome survey sequence (GSS), whole genome shotgun (WGS) and other high-throughput data from sequencing centers. The US Office of Patents and Trademarks also contributes sequences from issued patents. GenBank, the European Nucleotide Archive (ENA) [2], and the DNA Databank of Japan (DDBJ) [3] comprise the International Nucleotide Sequence Database Collaboration (INSDC) (www.insdc.org) whose members exchange data daily to ensure a uniform and comprehensive collection of sequence information.

2.1.1 The GenBank Release Formats and Release Cycle

NCBI provides free access to GenBank using either FTP or the web-based Entrez search and retrieval system [4]. The FTP release consists of a mixture of compressed and uncompressed ASCII text files, 2052 in release 202, containing sequence data and indices that cross reference author names, journal citations, gene names and keywords to individual GenBank records. For convenience, the GenBank records are partitioned into 19 divisions (see Note 1) according to source organism or type of sequence. Records within the same division are packaged as a set of numbered files, so that records from a single division may be spread across many files; for example, there are 70 files in the PLN division (containing plant and fungal sequences) in release 202. The full GenBank release is offered in two formats: the GenBank ‘flatfile’ format (see Note 2), and the more structured and compact Abstract Syntax Notation One (ASN.1) format used by NCBI for internal maintenance. Full releases of GenBank are made every 2 months beginning in the middle of February each year. Between full releases, daily updates are provided on the NCBI FTP site (ftp.ncbi.nlm.nih.gov/genbank/, ftp.ncbi.nlm.nih.gov/ncbi-asn1/). The Entrez system always provides access to the latest version of GenBank including the daily updates.

2.1.2 The Composition of GenBank

From its inception, GenBank has doubled in size about every 18 months. Release 202, in June 2014, contained 162 billion nucleotide bases from more than 173 million individual sequences. Contributions from WGS projects supplement the data in the traditional divisions to bring the total beyond 780 gigabases. The number of eukaryote genomes for which coverage and assembly are significant continues to increase as well, with 1200 such assemblies now available. Database sequences are classified by and can be queried using a comprehensive sequence-based taxonomy [5] developed by NCBI in collaboration with ENA and DDBJ with the assistance of external advisers and curators. Some 300,000 named species are now represented in GenBank and new species are being added at the rate of over 3000 per month. Detailed statistics for the current release may always be found in the GenBank release notes (ftp.ncbi.nlm.nih.gov/genbank/gbrel.txt).
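As a rough illustration of what an 18-month doubling time implies, the following Python sketch extrapolates from the release 202 figure quoted above. The function name is hypothetical, and the assumption that the historical doubling rate simply continues is made here only for illustration; the output is an extrapolation, not a measured release statistic.

```python
def projected_bases(current_bases: float, months_ahead: float,
                    doubling_months: float = 18.0) -> float:
    """Extrapolate database size, assuming a constant doubling time."""
    return current_bases * 2 ** (months_ahead / doubling_months)


# Release 202 (June 2014): about 162 billion bases in the traditional divisions.
RELEASE_202_BASES = 162e9

if __name__ == "__main__":
    for months in (18, 36):
        size = projected_bases(RELEASE_202_BASES, months)
        print(f"+{months} months: about {size / 1e9:.0f} billion bases")
```

For actual figures, the release notes cited above remain the authoritative source.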

2.1.3 Sources of Plant Sequences

In recent years high-throughput sequencing techniques, such as whole genome shotgun (WGS) sequencing, have become the dominant source of sequence data for many organisms, including green plants. More than 98 % of the sequence data for plants in GenBank release 202 (224 Gbp from some 120,000 plant species) were derived from a high-throughput method, leaving less than 2 % from the traditional PLN division of GenBank. About 80 % of the plant data were produced from WGS methods (Table 1). These data include both the WGS contigs as well as genomic scaffolds assembled from the WGS contigs, and such scaffolds are available in the CON division of GenBank. Another 17 % of the data are either expressed sequence tags (EST division), genome survey sequences (GSS division), other high-throughput sequences (HTG division) or sequences from transcriptome shotgun assembly projects (TSA division). A listing of prominent plant species in the PLN, EST, and TSA divisions is provided in Table 2.

Table 1 Distribution of plant sequences among the GenBank divisions in GenBank release 202
Table 2 Prominent plant species in GenBank Release 202

Another increasingly important source of sequence data for plants is next-generation sequencing projects that deposit data into the NCBI Sequence Read Archive (SRA). While SRA [6] is not formally part of GenBank, the sequence reads it contains may be assembled into larger sequences or alignments that can be deposited into GenBank (for example, into the TSA division).

In addition to the high-throughput sequences mentioned above, NCBI encourages the submission of sequencing data ranging in complexity from a transcript sequence annotated with a single coding region, to sets of aligned sequences supporting population or phylogenetic studies, or large scale genomic assemblies with detailed annotations.

2.1.4 Submitting Sequence Records to GenBank

Virtually all records enter GenBank as direct electronic submissions, with the majority of authors using the BankIt or Sequin programs described on the GenBank submission Web page (www.ncbi.nlm.nih.gov/genbank/submit/). Most journals require authors with sequence data to submit the data to a public database as a condition of publication. GenBank staff can usually assign an accession number (see Note 3) to a sequence submission within two working days of receipt. The accession number serves as confirmation that the sequence has been submitted and can be used to retrieve the data when they appear in the database. Direct submissions receive a quality assurance review that includes checks for vector contamination, proper translation of coding regions, correct taxonomy and correct bibliographic citations. A draft of the GenBank record is passed back to the author for review before it enters the database, and authors may ask that their sequences be kept confidential until the time of publication. Since GenBank policy requires that deposited sequence data be made public when the sequence or accession number is published, authors are instructed to inform GenBank staff of the publication date of the article in which the sequence is cited in order to ensure a timely release of the data. Although only the submitting scientist is permitted to modify sequence data or annotations, all users are encouraged to report lags in releasing data or possible errors or omissions to GenBank by writing to update@ncbi.nlm.nih.gov.

About a third of author submissions are received through BankIt (www.ncbi.nlm.nih.gov/WebSub/?tool=genbank), a Web-based data submission tool. Using BankIt, authors enter sequence information and biological annotations, such as coding regions or mRNA features, directly into a series of tabbed forms that allow the submitter to describe the sequence further without having to learn formatting rules or controlled vocabularies. Additionally, BankIt allows submitters to upload source and annotation data using tab-delimited tables. Before creating a draft record in the GenBank flat file format for the submitter to review, BankIt validates the submissions by flagging many common errors and checking for vector contamination using a variant of BLAST called Vecscreen. Help using BankIt, as well as example submission scenarios, is available in the GenBank Submissions Handbook (http://www.ncbi.nlm.nih.gov/books/NBK51157/).

NCBI also offers a standalone submission program called Sequin (www.ncbi.nlm.nih.gov/Sequin/index.html) that can be used interactively with other NCBI sequence retrieval and analysis tools. Sequin handles simple sequences such as a cDNA, as well as segmented entries, phylogenetic studies, population studies, mutation studies, environmental samples and alignments. Sequin offers complex annotation capabilities and contains a number of built-in validation functions for quality assurance. In addition, Sequin is able to accommodate large chromosome-scale sequences and read in a full complement of annotations from simple tables. Once a submission is completed, submitters can e-mail the Sequin file to gb-sub@ncbi.nlm.nih.gov. Versions of Sequin for common computer platforms are available via anonymous FTP (ftp.ncbi.nlm.nih.gov/sequin).

NCBI works closely with sequencing centers to ensure timely incorporation of bulk data into GenBank for public release. Submitters of large, heavily annotated genomes may find it convenient to use ‘tbl2asn’ (www.ncbi.nlm.nih.gov/Sequin/table.html) to convert a table of annotations generated from an annotation pipeline into an ASN.1 record suitable for submission to GenBank. Special procedures for the batch submission of EST and GSS sequences are described on the GenBank submission page (www.ncbi.nlm.nih.gov/Genbank/submit.htm). WGS and TSA projects can be uploaded to GenBank using the NCBI submission portal (submit.ncbi.nlm.nih.gov). The tbl2asn output file can be uploaded directly through the portal along with appropriate project and sample metadata. In addition, FASTA and AGP files can be submitted directly for WGS.

2.1.5 Annotations Found in GenBank Records

Each GenBank entry includes a concise description of the sequence, the scientific name and taxonomy of the source organism, bibliographic references, and a table of biological features (see Note 4). Annotation is best for GenBank records in the PLN division, while records in divisions such as EST, GSS and TSA contain either minimal annotation or no annotation at all.

2.1.6 Integration of GenBank Data with Other Resources

Given the increasing amount of data arising from high-throughput sequencing methods, NCBI has developed a suite of four related resources (BioProject, Genome, Assembly and BioSample) that aggregate these data around four central concepts. The BioProject database [7] contains records that represent funding initiatives for a wide variety of genomic projects. Each record contains information about the project itself and provides links to any data that the project has submitted to NCBI. For genome sequencing projects focused on a single species, the Genome database [4] collects all data, ranging from short reads to fully assembled chromosomes, produced by such projects for that species. The Genome record for a species also will contain sequence data for any sub-species or strains, along with organelle genome sequences for the species. Currently there are over 600 Genome records for plants, almost 50 of which represent complete genomes. Many modern genomic data sets, particularly from higher eukaryotes, represent the genome as a collection of individual chromosome sequences called an assembly. These assemblies are often updated over time, with each update labeled as a unique version. The NCBI Assembly database collects the sequences that comprise an individual version of a genome assembly, along with associated metadata for that assembly. There are currently over 170 assemblies for more than 100 plant species. Finally, the BioSample database [7] collects information about the specimen used as the source of data submitted to NCBI. Each of these four databases links directly to relevant GenBank data and, as discussed below, offers a unique path to search and retrieve these data dependent on the user’s goals.

For plant species with genome mapping data or genome annotations, NCBI may provide graphical views of the genomic maps in the NCBI Map Viewer or records in the Gene database. Currently over 120 plant species have graphical maps available and more than 450 have Gene records. When available, GenBank records are shown aligned to the genomic maps, and will be linked as supporting data to Gene records. Sequence variations may also be displayed in the Map Viewer, and over 25 million SNPs have been mapped to plant genomes, the vast majority to genomes of Glycine max, Oryza sativa, Sorghum bicolor and Arabidopsis thaliana. In addition, the more than 50 plant species that have more than 70,000 EST sequences in GenBank have been incorporated into the UniGene database [4], where these ESTs are combined with other transcript sequences in GenBank and partitioned into over 1.3 million gene-oriented clusters. Links from UniGene are also available to Gene, HomoloGene and Protein where possible. As a consequence, Entrez (Subheading 2.2.1) can be used to match a GenBank EST accession number (see Note 3) to a gene location, a protein sequence, and homologous genes in many organisms.

2.2 Accessing GenBank

GenBank data can be accessed in several ways. The Entrez system on the NCBI web site allows users to search, view and download any arbitrary subset of GenBank, while the entire database can be downloaded from the NCBI FTP site (ftp.ncbi.nlm.nih.gov). In addition, programmers can access GenBank data using the Entrez Programming Utilities (E-Utilities), the public API to the Entrez system (eutils.ncbi.nlm.nih.gov).

2.2.1 Interactive Access with Entrez

The sequence records in GenBank are accessible using Entrez [4], a robust and flexible database retrieval system that covers over 40 biological databases containing almost a billion individual records ranging from DNA and protein sequences to genome maps, literature abstracts in PubMed, full text articles in PMC, gene expression data in GEO [8], variations in SNP and dbVar [9, 10], the full NCBI taxonomy [5], protein domains and 3D structures [11, 12], chemicals in PubChem [13, 14] and many other data types. GenBank data are found in three Entrez databases: the EST and GSS databases contain all sequences in the EST and GSS GenBank divisions, respectively, while the Nucleotide database contains sequences from all other GenBank divisions (as well as sequences from databases other than GenBank). The GenBank data may be selectively accessed within Entrez using query limitations (see Note 5). Conceptual translations of coding regions annotated on GenBank sequences are available in the Protein database.

Records within the Entrez system are linked to other records both within and across databases. An example of a simple linkage is that between a GenBank sequence record and the PubMed abstract for the paper listed in the ‘Journal’ section (see Note 2) of the GenBank record. Computational linkages are also made between nucleotide and protein sequences, such as those based on sequence similarities discovered using BLAST [15, 16]. In addition, records available in Entrez may offer LinkOuts (www.ncbi.nlm.nih.gov/projects/linkout/) that lead to a variety of external databases. Queries on the Entrez databases are made with simple text words combined using boolean logic and either limited to a particular record field (see Note 5) or applied to all fields. A web-based service called Batch Entrez allows bulk sequence downloads specified by an arbitrary set of GenBank identifiers supplied from a local file.

2.2.2 Scripted Access Through Entrez with the Entrez Programming Utilities

Entrez queries of GenBank, and downloads of individual records or sets of records, may be made from scripts using a set of server-side utilities called the Entrez Programming Utilities (see Note 6). Full documentation for these utilities is available at eutils.ncbi.nlm.nih.gov.
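A minimal sketch of what a scripted Entrez query might look like is given below. It only constructs the request URL for the ESearch utility and sends nothing over the network. The base URL, the database name ‘nuccore’ for Entrez Nucleotide, and the parameters ‘db’, ‘term’, ‘retmax’ and ‘usehistory’ are taken from the public E-utilities documentation rather than from this chapter, and the function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed standard public endpoint of the E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20,
                use_history: bool = True) -> str:
    """Build an ESearch URL for an Entrez query (no request is sent)."""
    params = {"db": db, "term": term, "retmax": retmax}
    if use_history:
        # Ask the server to store the result set for later retrieval calls.
        params["usehistory"] = "y"
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Example: all Vitis vinifera records in Entrez Nucleotide.
    print(esearch_url("nuccore", "vitis vinifera[orgn]"))
```

The returned URL can be opened with any HTTP client; scripts that issue many requests should follow the usage guidelines in the E-utilities documentation.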

2.2.3 Bulk Downloads of GenBank via FTP

The full bimonthly GenBank release and the daily updates, which also incorporate sequence data from ENA and DDBJ, are available by anonymous FTP from NCBI in both flatfile and ASN.1 formats (ftp://ftp.ncbi.nlm.nih.gov/genbank, ftp://ftp.ncbi.nlm.nih.gov/ncbi-asn1). The GenBank data are also available at a mirror site at Indiana University (ftp://bio-mirror.net/biomirror//genbank). As described in Subheading 2.1.1, the full release in flatfile format is distributed as a set of compressed files. To download GenBank files using a command line FTP client, connect to ftp.ncbi.nlm.nih.gov, log in as ‘anonymous’ and give your email address as password.

For many purposes a download of the entire GenBank database is not required, since standard sequence-similarity searches, such as BLAST, may be performed remotely on the GenBank data at NCBI. However, if a local copy of GenBank is required, one major consideration is local storage space.

As of GenBank release 202, the uncompressed GenBank flatfiles require 642 gigabytes of disk space. An alternative option is to download the ASN.1 format data, which requires 538 gigabytes once uncompressed. Once a full release of GenBank has been saved locally, it can be kept current using incremental updates. For this purpose, NCBI provides a noncumulative set of updates at ftp.ncbi.nlm.nih.gov/genbank/daily-nc. A Perl script available at ftp.ncbi.nlm.nih.gov/genbank/tools/ converts a set of daily updates into a cumulative update.

3 Methods

Four general methods that are central to making use of GenBank include downloads of all or a subset of the database to support local analysis, the construction and use of local GenBank-derived databases for sequence similarity searches, and the execution of remote searches of the database using Web or command line clients. These methods are discussed in the sections that follow.

3.1 Download All GenBank PLN Division Sequences

3.1.1 Strategy

The GenBank PLN division is a source of well-annotated sequences that can serve as a compact, information-rich local database. For such simple, division-oriented bulk downloads, FTP transfers are the most convenient given that the division name forms part of the file names on the GenBank FTP site. Downloading these files is simply a matter of connecting to the NCBI FTP site and specifying files of the name ‘gbpln*.seq.gz’ (where the asterisk is a wild card that matches any set of characters). The download will result in a local set of compressed files for sequence records in the GenBank flatfile format. To view and use the records, they must be uncompressed.
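Once downloaded, the compressed files do not necessarily have to be expanded on disk before use. The Python sketch below reads a gzip-compressed flatfile and yields one record at a time, relying on the flatfile convention that each record begins with a LOCUS line and ends with a line containing only ‘//’. The function name is hypothetical, and a file name such as ‘gbpln1.seq.gz’ is assumed purely for illustration.

```python
import gzip
from typing import Iterator


def iter_flatfile_records(path: str) -> Iterator[str]:
    """Yield GenBank flatfile records from a gzip-compressed file.

    A record starts at a LOCUS line and ends at a line containing only '//'.
    Any file header text before the first LOCUS line is skipped.
    """
    record_lines = []
    in_record = False
    with gzip.open(path, "rt", encoding="ascii", errors="replace") as handle:
        for line in handle:
            if line.startswith("LOCUS"):
                in_record = True
                record_lines = []
            if in_record:
                record_lines.append(line)
                if line.rstrip() == "//":
                    yield "".join(record_lines)
                    in_record = False
```

For example, `sum(1 for _ in iter_flatfile_records("gbpln1.seq.gz"))` would count the records in one file without holding the whole file in memory.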

3.1.2 Execution

To begin, perform an anonymous FTP login to the NCBI FTP server using either a command line FTP client or a web browser. If using a command line FTP client, connect to ftp.ncbi.nlm.nih.gov and then issue the following commands:

  1. cd genbank

  2. mget gbpln*.seq.gz (most command line FTP clients expand wild cards only with ‘mget’, not ‘get’)

  3. Following the completion of the transfers, type ‘quit’ at the FTP prompt.
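The same transfer can be scripted. The following Python sketch uses the standard library ftplib module to log in anonymously, list the ‘genbank’ directory, and fetch every file matching the PLN wild card. The function names and the placeholder e-mail address are hypothetical, and the sketch assumes that the server still accepts anonymous FTP connections as described above.

```python
from fnmatch import fnmatch
from ftplib import FTP
from typing import Iterable, List


def select_division_files(names: Iterable[str],
                          pattern: str = "gbpln*.seq.gz") -> List[str]:
    """Return the file names matching a shell-style wild card, sorted."""
    return sorted(name for name in names if fnmatch(name, pattern))


def download_division(host: str = "ftp.ncbi.nlm.nih.gov",
                      directory: str = "genbank",
                      pattern: str = "gbpln*.seq.gz",
                      email: str = "user@example.org") -> List[str]:
    """Download all matching files into the current working directory."""
    with FTP(host) as ftp:
        ftp.login("anonymous", email)      # e-mail address serves as password
        ftp.cwd(directory)
        wanted = select_division_files(ftp.nlst(), pattern)
        for name in wanted:
            with open(name, "wb") as out:  # binary mode: files are compressed
                ftp.retrbinary(f"RETR {name}", out.write)
    return wanted
```

Calling `download_division()` starts the transfer; given the size of the PLN division, checking available disk space first is advisable.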

Although less convenient, these files can be downloaded using a web browser as follows:

  1. Navigate to ftp://ftp.ncbi.nlm.nih.gov/genbank

  2. Click on each gbpln*.seq.gz file in turn to download it.

As described in Subheading 2.1.1, the GenBank data are also available in the compact and versatile ASN.1 format. Besides their significantly smaller size, the ASN.1 format files offer other advantages over the compressed flatfiles. Using a suite of command line tools available from NCBI (see Note 7), ASN.1 files can be used to generate records in a variety of other formats. These formats include the GenBank flatfile (see Note 2), FASTA, 5-column Feature Table (see Note 8) and INSDC XML. In addition, both the nucleotide sequences and the protein sequences derived from their coding sequence (CDS) annotations are readily accessible. ASN.1 formatted GenBank records can also be used to generate databases, both nucleotide and protein, for local analysis using BLAST (Subheading 3.4).

3.2 Download a Set of GenBank Sequences for a Single Plant Species

3.2.1 Strategy

A method to download a complete set of sequences for a single plant species, such as Vitis vinifera, must retrieve all Vitis vinifera sequences regardless of GenBank division. In addition to the PLN division, divisions such as EST, STS, and GSS also contain plant sequences. Rather than downloading each division in its entirety merely to get the Vitis vinifera data, it is more practical to use the flexibility of the Entrez search and retrieval system. Using Entrez, one can specify the subset of GenBank to download, choose a download format, and then download the sequence records as a single batch.

3.2.2 Execution

As described in Subheading 2.2.1, GenBank sequences are contained in three Entrez databases: Nucleotide, EST and GSS. To retrieve all Vitis vinifera sequences in GenBank, one therefore needs to retrieve data from each of these three databases. To begin, retrieve all sequences for Vitis vinifera using the following query in the Nucleotide database:

  • vitis vinifera[orgn]

As shown in Fig. 1, links to the results from the EST and GSS databases are provided above the search results. Download these data by following those two links and using the ‘Send to’ menu. To download the data in the Nucleotide database, click on the INSDC limit in the upper right to restrict the set to GenBank data, and then use the ‘Send to’ menu as with EST and GSS. The following Entrez query accomplishes the same GenBank restriction:

  • vitis vinifera[orgn] AND srcdb ddbj/embl/genbank[prop]

Fig. 1 Portion of the search results page in Entrez Nucleotide with the query ‘vitis vinifera[orgn]’. Links above the list of results allow access to the records retrieved in the EST and GSS databases, while filters in the upper right (such as INSDC) allow the results to be further restricted

As of May 2014, these queries retrieved 247,000 Vitis vinifera sequences from Nucleotide, 447,000 from EST and 229,000 from GSS for a total of 923,000 sequences. While it is possible to download such a set using a web browser, an alternative approach is to use the Entrez Programming Utilities to accomplish the download (see Note 6).
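Result sets of this size are usually retrieved in batches rather than in a single request. The Python sketch below shows one way the batching might be planned: it splits a result count into (retstart, retmax) windows and builds the corresponding EFetch URLs. The EFetch parameter names (‘WebEnv’, ‘query_key’, ‘retstart’, ‘retmax’, ‘rettype’, ‘retmode’) come from the public E-utilities documentation; the function names, the batch size of 500 and the placeholder WebEnv string are assumptions made for illustration, and no request is actually sent.

```python
from typing import Iterator, Tuple
from urllib.parse import urlencode

# Assumed standard public endpoint of the E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def batch_windows(total: int, batch_size: int = 500) -> Iterator[Tuple[int, int]]:
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


def efetch_url(db: str, web_env: str, query_key: int,
               retstart: int, retmax: int, rettype: str = "gb") -> str:
    """Build an EFetch URL for one batch of a stored search result."""
    params = {
        "db": db,
        "WebEnv": web_env,        # returned by ESearch when usehistory=y
        "query_key": query_key,   # likewise returned by ESearch
        "retstart": retstart,
        "retmax": retmax,
        "rettype": rettype,       # "gb" for flatfile, "fasta" for FASTA
        "retmode": "text",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical values; a real WebEnv string comes from an ESearch reply.
    for start, size in batch_windows(1200):
        print(efetch_url("nuccore", "WEBENV_PLACEHOLDER", 1, start, size))
```

The same planning applies to each of the three databases in turn, using whatever database names the E-utilities documentation currently lists for the EST and GSS collections.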

3.3 Download the Complete Genome of a Plant Species

3.3.1 Strategy

In recent years the number of plant species with sequenced genomes has continued to increase. As of this writing, about 60 plant species have a genome assembly that contains assembled chromosomes (with or without gaps), while another 80 have at least scaffold assemblies. As these data continue to mature, the NCBI interfaces will consequently continue to adapt to accommodate these changes; however, the Genome database should remain a central place for users to access genomic datasets for a given species. Another, and often more direct, approach to accessing these datasets is to download them from the NCBI FTP site, and this approach will be discussed here.

3.3.2 Execution

There are two primary areas of the NCBI FTP site in which eukaryotic genome datasets may be found: the GenBank genomes site (ftp.ncbi.nlm.nih.gov/genbank/genomes/) and the RefSeq genomes site (ftp.ncbi.nlm.nih.gov/genomes/). Each of these sites contains a list of subdirectories with names corresponding to the scientific names of each species. Once within the directory for the desired species, ‘readme’ files will describe the contents, which may vary somewhat from species to species depending on the nature of that species’ genome and the methods used in the sequencing project. In general, separate sequence files will be available for each chromosome, and therefore downloading the genome is simply a matter of downloading the individual chromosome data files.

3.4 Establish and Perform Sequence Similarity Searches on a Local Database of PLN Division Sequences

3.4.1 Strategy

The Basic Local Alignment Search Tool, or BLAST, is the most widely used program for sequence similarity searches. In addition to comparing nucleotide sequences, BLAST can also translate nucleotide sequences into all six reading frames at run time in order to compare protein coding regions. There are three basic approaches to performing BLAST searches against NCBI sequence databases: (1) using the NCBI BLAST web interface; (2) using a local installation of BLAST with the databases on NCBI servers; (3) using a local installation of BLAST with local databases. BLAST binaries for standard computing platforms are available for download on the NCBI FTP site (ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/). While a complete discussion of the BLAST algorithm and the interpretation of results is beyond the scope of this chapter, details can be found elsewhere [15–19]. Comprehensive documentation on the BLAST executables is also available on the NCBI Bookshelf (http://www.ncbi.nlm.nih.gov/books/NBK1763/).

3.4.2 Execution

To search PLN division sequences on the NCBI web interface using BLAST (blast.ncbi.nlm.nih.gov), load the nucleotide blast page, restrict the default database (nr/nt) to the desired organism (or organisms) using the Organism selection box, and then add the Entrez query ‘srcdb ddbj/embl/genbank[prop]’ (see Subheading 3.2.2). If many such BLAST searches need to be performed, it may be advantageous to download local BLAST binaries so that the searches can be automated. The BLAST binaries allow searches to access the databases on NCBI servers, obviating the need for a local copy and ongoing updating of these (often large) files. The following command will run a standard nucleotide BLAST search limited to green plant sequences in the INSDC databases:

  • blastn -db nt -query myfile_nuc -out myoutfile -entrez_query "viridiplantae[orgn] AND srcdb ddbj/embl/genbank[prop]" -task blastn -remote

In the above command, the parameter ‘-db’ specifies the name of the database, and ‘-query’ is followed by the name of a file containing one or more FASTA formatted sequences to be used as queries. The ‘-out’ parameter is given the name of the desired output file, and ‘-entrez_query’ the Entrez query used to restrict the given database. The ‘-task’ option specifies the algorithm used; in this case ‘blastn’ invokes standard nucleotide BLAST. The default value is ‘megablast’, a faster but less sensitive version of nucleotide BLAST that is useful for finding matches within the same or closely related species [20]. Finally, the ‘-remote’ flag causes blastn to access databases on the NCBI servers rather than local databases.
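When many remote searches are to be automated, it can help to assemble the command as an argument list so that the Entrez query, which contains spaces and brackets, is passed to blastn as a single argument. The Python sketch below does this using only the options discussed above; the function name is hypothetical, and the command is printed rather than executed.

```python
import shlex
from typing import List


def remote_blastn_command(query_file: str, out_file: str, entrez_query: str,
                          db: str = "nt", task: str = "blastn") -> List[str]:
    """Assemble the argument list for a remote, Entrez-restricted blastn run."""
    return [
        "blastn",
        "-db", db,
        "-query", query_file,
        "-out", out_file,
        "-entrez_query", entrez_query,  # kept as one argument despite spaces
        "-task", task,
        "-remote",
    ]


if __name__ == "__main__":
    cmd = remote_blastn_command(
        "myfile_nuc", "myoutfile",
        "viridiplantae[orgn] AND srcdb ddbj/embl/genbank[prop]",
    )
    print(shlex.join(cmd))  # shell-quoted form, for inspection or logging
    # To run it (requires a local BLAST+ installation and network access):
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems that can arise when the query string is pasted into a shell by hand.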

To run BLAST searches independently of NCBI servers, both the BLAST executable and the desired sequences (such as the GenBank sequences downloaded using Subheadings 3.1 and 3.2 above) need to be downloaded to a local machine. Once this is done, the first task in executing a BLAST search is to convert the sequence data into a local BLAST database. This is accomplished by the makeblastdb program, which can create a local BLAST database from a file of concatenated FASTA format sequences, or from an ASN.1 format GenBank file. The following command line is used to create a local nucleotide database from a file of concatenated FASTA format sequences contained within the file ‘myfasta’:

  • makeblastdb -in myfasta -input_type fasta -dbtype nucl -parse_seqids

The ‘-parse_seqids’ flag causes makeblastdb to create indices that allow individual sequence records to be retrieved by another program in the BLAST package, blastdbcmd (see Note 9), on the basis of sequence identifiers found in the definition lines of the records.
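The two steps just described, building an indexed database and later retrieving a single record by identifier, might be scripted along the following lines. This Python sketch only assembles the argument lists. The function names are hypothetical, and the blastdbcmd options ‘-db’ and ‘-entry’ are not given in the text above but are assumed from the BLAST+ command line documentation.

```python
from typing import List


def makeblastdb_command(input_file: str, dbtype: str,
                        input_type: str = "fasta",
                        parse_seqids: bool = True) -> List[str]:
    """Assemble a makeblastdb call; dbtype must be 'nucl' or 'prot'."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    cmd = ["makeblastdb", "-in", input_file,
           "-input_type", input_type, "-dbtype", dbtype]
    if parse_seqids:
        cmd.append("-parse_seqids")  # index identifiers for later retrieval
    return cmd


def blastdbcmd_command(db: str, entry: str) -> List[str]:
    """Assemble a blastdbcmd call that retrieves one record by identifier."""
    return ["blastdbcmd", "-db", db, "-entry", entry]
```

Note that every option begins with a plain ASCII hyphen; typographic dashes introduced by word processors or PDF extraction are not recognized by the BLAST executables.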

To create a nucleotide sequence database with a set of binary ASN.1 format GenBank sequence files, use the following:

  • makeblastdb -in gbpln1.aso -input_type asn1_bin -dbtype nucl -parse_seqids

To create a protein sequence database from the corresponding translations of annotated CDS features on the nucleotide sequences contained in gbpln1.aso, use the following:

  • makeblastdb -in gbpln1.aso -input_type asn1_bin -dbtype prot -parse_seqids

By default makeblastdb will produce several files whose names consist of the input file name (-in) followed by one of various extensions. To search this database, simply supply the name of the input file that was given to makeblastdb. For example, to search the nucleotide database formatted above using the ‘blastn’ program with a nucleotide query sequence in FASTA format within a file named ‘myfile_nuc’, use the following:

  • blastn -query myfile_nuc -db gbpln1.aso -task blastn -out myoutfile

To search the protein translations arising from the CDS features on the records in gbpln1.aso, assuming that the protein sequence version of gbpln1.aso has been created using makeblastdb as described above, use:

  • blastp -query myfile_prot -db gbpln1.aso -out myoutfile -evalue 1e-6

A nucleotide query can also be used to query the protein translations using the blastx algorithm, which will translate the query into all six reading frames:

  • blastx -query myfile_nuc -db gbpln1.aso -out myoutfile -evalue 1e-6

The ‘-db’ parameter in these commands is followed by the name of the database formatted using makeblastdb. The quality of the alignments returned by BLAST can be controlled using the ‘-evalue’ parameter to set an ‘expect value’ limit. In this case, an expect value of 1e-6 (0.000001) has been specified, which should exclude those alignments expected to occur by chance more than 0.000001 times in a database of the size of gbpln1.aso.
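For orientation, the dependence of the expect value on database size can be made explicit with the Karlin-Altschul relation commonly used to describe BLAST statistics. This is standard background rather than something stated in this chapter, and the symbols are introduced here only for illustration:

```latex
E = K \, m \, n \, e^{-\lambda S}
```

Here m is the (effective) length of the query, n the (effective) total length of the database, S the raw alignment score, and K and λ are parameters that depend on the scoring system. Because E grows in proportion to n, the same alignment receives a larger expect value in a larger database, which is why a threshold such as 1e-6 must be read relative to the size of the database searched.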

A large number of parameters may be specified on the various BLAST executables, and these parameters determine the number, format, and quality of the alignments returned. To see them all, type ‘-help’ after any of the executable names. For detailed documentation on the parameters, see the online documentation at www.ncbi.nlm.nih.gov/books/NBK1763/.
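One of the output formats selectable through these parameters is a tab-delimited table (requested with ‘-outfmt 6’, an option taken from the BLAST+ documentation rather than from the commands shown above), which is convenient for downstream filtering. The Python sketch below parses such a table and keeps only hits at or below an expect value threshold. It assumes the default twelve-column layout, and the function name is hypothetical.

```python
import csv
from typing import Dict, Iterable, Iterator

# Default column order of BLAST+ tabular output (-outfmt 6).
TABULAR_FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch",
                  "gapopen", "qstart", "qend", "sstart", "send",
                  "evalue", "bitscore"]


def filter_hits(lines: Iterable[str],
                max_evalue: float = 1e-6) -> Iterator[Dict[str, str]]:
    """Yield hits (as dicts) whose expect value is at most max_evalue."""
    data_lines = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    for row in csv.reader(data_lines, delimiter="\t"):
        hit = dict(zip(TABULAR_FIELDS, row))
        if float(hit["evalue"]) <= max_evalue:
            yield hit
```

With an output file produced this way, `list(filter_hits(open("myoutfile")))` would return the retained hits as dictionaries keyed by column name.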

3.5 Identify Potential Coding Regions in TSA Datasets for Green Plants

3.5.1 Strategy

In the past few years the TSA division of GenBank has been one of the most rapidly expanding divisions. TSA data are more valuable than raw next-generation sequencing data, such as the reads in SRA, because they have been partially assembled and are therefore more likely to represent full transcripts. For plant species with little or no assembled genomic data, TSA data can therefore be a rich resource for identifying putative protein coding regions (CDS). By using the translated BLAST algorithm tblastn, one can search a protein query against a TSA database translated into all six reading frames, readily revealing these putative CDS sequences.

3.5.2 Execution

To search the TSA data for CDS regions corresponding to a given protein (or proteins), first assemble the query protein sequences either as accession numbers or FASTA files. As with Subheading 3.4 above, BLAST searches against TSA data can be performed on the NCBI web interface or using local BLAST installations, with or without local copies of the TSA database. On the NCBI web interface, the TSA database is one of the options on the tblastn search page, which also allows the database to be limited by organism and/or Entrez query (see Subheading 3.4.2). Figure 2 shows the best hits resulting from such a search using a soybean protein (NP_001236858) as query. To run tblastn locally but against the databases on NCBI servers, use the following command:

  • tblastn -db tsa_nt -query myfile_prot -out myoutfile -entrez_query "viridiplantae[orgn]" -remote

Fig. 2 Portion of the results page of a tblastn search against the TSA database restricted to green plants (viridiplantae[orgn]). The query used was NP_001236858, the sequence for ACC synthase from soybean. The results contain hits from a variety of plant species; as shown, the hits share 64 to 90 % sequence identity with the query, and all but two cover at least 97 % of the query sequence

As mentioned above, this approach has distinct advantages: (1) there is no need to download and update a local copy of a large database (as of this writing the TSA collection requires 17 GB of disk space); and (2) organism and Entrez limits on the database can be applied at runtime. On the other hand, the method is limited by network and NCBI traffic, so it may be desirable to download the preformatted TSA database to local machines. Because of its size, the TSA database on the NCBI FTP site is split into several volumes, each approximately 1 GB of compressed data. To facilitate downloading such sets of files, NCBI provides a utility script named update_blastdb.pl as part of the BLAST software package. This script downloads all of the component files of the TSA database (or other preformatted databases) to local disk, where they can then be uncompressed and extracted. Because these files are preformatted, running makeblastdb is unnecessary, and the files can be used immediately in a search:

  • tblastn -query myfile_prot -db tsa_nt -out myoutfile -evalue 1e-6

As with all local BLAST executables, it is possible to restrict the TSA database to only those sequences matching a GI list in a local file (with one GI per line in the file). For example, running the following query in the Nucleotide database and then downloading the data as a GI list will create such a file that can accomplish an organism restriction to green plants:

  • tsa[keyword] AND viridiplantae[orgn]

If this file is named greenplantTSA.gi, then the following search will be restricted to green plants:

  • tblastn -query myfile_prot -db tsa_nt -out myoutfile -evalue 1e-6 -gilist greenplantTSA.gi

While the above approach for restricting the search is valid at the time of this writing, these functions will be maturing over time, and users are encouraged to visit the BLAST web site regularly for updates and improvements to these methods.
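
When many queries are to be run, the tblastn command lines shown in this subheading can be assembled programmatically. The following is a minimal sketch; the function name and its default values are hypothetical, while the flags are those used in the commands above. The function only builds the argument list and does not run BLAST.

```python
def tblastn_args(query, db="tsa_nt", out="myoutfile", evalue=None,
                 entrez_query=None, gilist=None, remote=False):
    """Build an argument list for the tblastn executable (does not run it)."""
    # BLAST+ documents -entrez_query as requiring -remote.
    if entrez_query is not None and not remote:
        raise ValueError("-entrez_query must be used together with -remote")
    args = ["tblastn", "-query", query, "-db", db, "-out", out]
    if evalue is not None:
        args += ["-evalue", str(evalue)]
    if entrez_query is not None:
        args += ["-entrez_query", entrez_query]
    if gilist is not None:
        # Path to a local file with one GI per line.
        args += ["-gilist", gilist]
    if remote:
        args.append("-remote")
    return args
```

On a machine with the BLAST+ package installed, the returned list could be passed to subprocess.run(args, check=True); for example, tblastn_args("myfile_prot", evalue="1e-6", gilist="greenplantTSA.gi") reproduces the local GI-restricted search.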

4 Notes

  1.

    The files in the GenBank releases are partitioned into 19 ‘divisions’ that correspond roughly to taxonomic groups such as bacteria (BCT), viruses (VRL), primates (PRI), and rodents (ROD). Additional divisions have been added over time to support specific sequencing strategies. These include divisions for expressed sequence tag (EST), genome survey (GSS), high throughput genomic (HTG), high throughput cDNA (HTC), environmental sample (ENV) and transcript shotgun assembly (TSA) sequences. To facilitate downloads, most divisions are partitioned into multiple files for the bimonthly GenBank releases on NCBI’s FTP site. In addition, three special classes of records exist that do not appear within the usual 19 divisions of GenBank: WGS, TPA and TSA records. Over 600 billion bases of WGS sequence appear in GenBank as sets of WGS contigs, many of them bearing annotations, with each set originating from a single sequencing project. Third Party Annotation (TPA) (www.ncbi.nih.gov/Genbank/TPA.html) records support the reporting of published, experimentally confirmed sequence annotation by a scientist other than the original submitter of the primary sequence. TSA records are assembled from short reads (from SRA or the Trace Archive) or from ESTs. The content of the GenBank divisions is summarized in Table 3.

    Table 3 Division codes and content of the 19 GenBank divisions
  2.

    The GenBank flatfile is the standard data format used for GenBank records and is the format of the data in the GenBank FTP files. Each record begins with a LOCUS line followed by a header containing the database identifiers, the title of the record, references, and submitter information. The header is followed by the feature table (see Note 4) and then by the sequence itself, which begins on the line following the ‘ORIGIN’ field. The ‘//’ symbol in the FTP files marks the boundary between successive records. The GenBank flatfile is described in detail in the GenBank release notes (ftp.ncbi.nlm.nih.gov/genbank/gbrel.txt). In the Entrez system, the GenBank format is the default display for records in the traditional divisions. An interactive sample record is linked from the GenBank home page (www.ncbi.nlm.nih.gov/genbank/).
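
The record layout described in this note (a LOCUS line first, the sequence after the ORIGIN line, and ‘//’ between records) is regular enough to be read with a few lines of code. The following is a minimal sketch, not a full flatfile parser; the function names and the two sample records are hypothetical.

```python
def split_records(text):
    """Yield each flatfile record as a list of lines, using '//' as the boundary."""
    record = []
    for line in text.splitlines():
        if line.strip() == "//":
            if record:
                yield record
            record = []
        else:
            record.append(line)


def locus_name(record):
    """Return the first token after the LOCUS keyword on the record's first line."""
    fields = record[0].split()
    if not fields or fields[0] != "LOCUS":
        raise ValueError("record does not begin with a LOCUS line")
    return fields[1]


def sequence(record):
    """Concatenate the sequence lines that follow the ORIGIN line."""
    for index, line in enumerate(record):
        if line.startswith("ORIGIN"):
            # Each sequence line starts with a base position, then blocks of bases.
            return "".join("".join(l.split()[1:]) for l in record[index + 1:])
    return ""


# Hypothetical, heavily abbreviated records used only for illustration.
SAMPLE = """\
LOCUS       TEST0001      12 bp    DNA     linear   PLN 01-JAN-2000
DEFINITION  Hypothetical example record.
ORIGIN
        1 atggccaaat ga
//
LOCUS       TEST0002       6 bp    DNA     linear   PLN 01-JAN-2000
ORIGIN
        1 ttgaca
//
"""
```

For production work, an established parser that follows the release notes in full is a safer choice than this sketch, which ignores the header and feature table entirely.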

  3.

    Each GenBank record, consisting of both a sequence and its annotations, is assigned a unique identifier called an ‘accession number’ that is shared across the three INSDC members (GenBank, DDBJ, ENA). The accession number appears on the ACCESSION line of a GenBank record and remains constant over the lifetime of the record, even when there is a change to the sequence or annotation. Changes to the sequence data itself are tracked by an integer extension of the accession number, and this Accession.version identifier appears on the VERSION line of the GenBank flat file. An entry appearing in the database for the first time has an ‘Accession.version’ identifier equivalent to the ACCESSION number of the GenBank record followed by ‘.1’. In addition, each version of the sequence is assigned a unique integer identifier called a ‘GI number’ that also appears on the VERSION line of GenBank flatfile records:

    • ACCESSION AF000001

    • VERSION AF000001.1 GI:987654321

    When a change is made to a sequence in a GenBank record, a new GI number is issued to the updated sequence and the version extension of the Accession.version identifier is incremented. The accession number for the record remains unchanged, and will always retrieve the most recent version of the record; the older versions remain available under the old Accession.version identifiers and their original GI numbers. The Revision History report, available from the ‘Display Settings’ menu on the sequence record view, summarizes the various updates for that GenBank record.

    A similar system tracks changes in the corresponding protein translations. These identifiers appear as qualifiers for CDS features in the FEATURES portion of a GenBank entry, e.g., /protein_id="AAA00001.1". Protein sequence translations also receive their own unique GI number, which appears as a second qualifier on the CDS feature:

    /db_xref="GI:1233445"
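
The relationship between the accession number, the version extension, and the GI number can be made concrete with a small parsing sketch. The function names are hypothetical, and the GI is treated as optional so that VERSION lines without one still parse.

```python
import re

# Accession, version extension, and an optional GI (with or without a space after 'GI:').
VERSION_LINE = re.compile(r"^VERSION\s+(\S+?)\.(\d+)(?:\s+GI:\s*(\d+))?\s*$")


def parse_version_line(line):
    """Return (accession, version, gi) from a VERSION line; gi may be None."""
    match = VERSION_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a VERSION line: {line!r}")
    accession, version, gi = match.groups()
    return accession, int(version), int(gi) if gi else None


def split_accession_version(identifier):
    """Split an Accession.version identifier such as 'AF000001.1'."""
    accession, dot, version = identifier.rpartition(".")
    if not dot or not version.isdigit():
        raise ValueError(f"no version extension in {identifier!r}")
    return accession, int(version)
```

Two identifiers refer to the same record when their accession parts are equal, and the one with the larger version number carries the more recent sequence.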

  4.

    The feature table is the portion of the GenBank record that provides information about the biological features annotated on the nucleotide sequence. These features include coding regions and their protein translations, noncoding regions, genes, variations, sequence tagged sites, transcription units, repeat regions, and sites of mutations or modifications. The International Nucleotide Sequence Database Collaboration (www.insdc.org) produces a document describing and identifying the features allowed on GenBank, DDBJ and ENA records (http://www.insdc.org/documents/feature-table).

  5.

    The GenBank database and protein sequences arising from coding sequence annotations on GenBank records can be searched at NCBI with BLAST through either a web interface or a command-line client. In either case, subsets of the data may be selected for searching using Entrez (Subheading 2.2.1) queries. A query used to limit a search to sequences from a particular organism has the form ‘organism[orgn]’ where ‘organism’ is an organism name and the search is limited to terms indexed within the Entrez ‘organism’ field by specifying ‘orgn’ within square brackets. For example, to specify only sequences from Arabidopsis thaliana, use ‘Arabidopsis thaliana[orgn]’. Some Entrez queries involving terms indexed in the Entrez ‘properties’ field are listed in Table 4. Entrez queries can be combined using boolean operators such as ‘AND’, ‘OR’, and ‘NOT’ (Subheading 3.2.2). On the BLAST web pages, Entrez queries are typed into the box labeled “Entrez query”.

    Table 4 Entrez queries that are useful in limiting BLAST searches
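
Entrez limits of the kind described in this note are plain strings, so they are easy to compose before being pasted into the “Entrez query” box or passed on a command line. A minimal sketch follows; the helper names are hypothetical, and only the field tags and operators mentioned in this chapter are used.

```python
def organism_limit(name):
    """Return a query restricted to the Entrez organism field."""
    return f"{name}[orgn]"


def combine(operator, *terms):
    """Join query terms with a boolean operator (AND, OR, or NOT)."""
    operator = operator.upper()
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(terms)


def group(expression):
    """Parenthesize a sub-expression so it is evaluated as a unit."""
    return f"({expression})"
```

For example, combine("AND", "tsa[keyword]", organism_limit("viridiplantae")) yields the green plant TSA query used in Subheading 3.5.2.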
  6.

    The Entrez Programming Utilities (E-utilities) are the public API for the Entrez system and consist of a set of nine server-side programs that allow automated access to the Entrez search and retrieval functions. The E-utilities (eutils.ncbi.nlm.nih.gov) accept a set of parameters that may be URL-encoded or transferred via the SOAP protocol. Searches of Entrez are performed using ‘esearch’; short record summaries are retrieved using ‘esummary’; full records may be downloaded using ‘efetch’; and linking between records may be performed using ‘elink’. Additional E-utilities are available for more specialized functions.

    The E-utilities may be used from within any programming language that supports the posting of a URL. Results are returned in XML for all E-utilities except ‘efetch’, which supports return modes of XML, HTML, text and ASN.1 as well as return formats such as the GenBank flatfile, FASTA, and the INSDC XML format. Several sample E-utility URLs are shown in Table 5. Additional examples, including sample Perl scripts, are provided in Chapter 4 of the online documentation (eutils.ncbi.nlm.nih.gov).

    Table 5 Representative URLs for Entrez Programming Utility calls
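
As an illustration of the URL-encoded form of an E-utility call, the following sketch builds esearch and efetch URLs. The base path, the helper name, and the particular parameter values (db, term, id, rettype, retmode) are assumptions made for this example and are not taken from Table 5; the online documentation remains the authority on valid parameters.

```python
from urllib.parse import urlencode

# Assumed base path for the E-utilities; the text above names only the host.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def eutils_url(utility, **params):
    """Return a URL for one E-utility call, e.g. 'esearch' or 'efetch'."""
    # urlencode escapes brackets and spaces in Entrez queries.
    return f"{EUTILS_BASE}/{utility}.fcgi?{urlencode(params)}"


# Search the nucleotide database with an Entrez query.
SEARCH_URL = eutils_url("esearch", db="nucleotide",
                        term="tsa[keyword] AND viridiplantae[orgn]")

# Fetch one record as a GenBank flatfile in text mode.
FETCH_URL = eutils_url("efetch", db="nucleotide", id="M81884",
                       rettype="gb", retmode="text")
```

The resulting strings can be opened with urllib.request.urlopen or any other HTTP client; no request is made by the sketch itself.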
  7.

    NCBI offers command line utilities for working with ASN.1 formatted data. These utilities are available for several platforms and may be downloaded from ftp.ncbi.nlm.nih.gov/asn1-converters/. To see a complete list of command line parameters for any of the programs, run the program with a trailing dash and no parameter. A list of several of these programs with brief descriptions is given in Table 6. One particularly useful program is asn2all, and some examples of using it follow.

    Table 6 Selected NCBI utility programs for conversion of data from and to the ASN.1 format

    The program asn2all is primarily intended to generate reports from the binary ASN.1 Bioseq-set GenBank release files (ftp://ftp.ncbi.nlm.nih.gov/ncbi-asn1).

    The following command will generate GenBank flatfile records for the nucleotide sequences as well as GenPept flatfile records for the protein sequences contained within gbpln1.aso (one of the uncompressed ASN.1 files from the PLN division). These two sets will appear in the files “gbpln1.nuc” and “gbpln1.prt”, respectively.

    asn2all -i gbpln1.aso -a t -b T -f g -o gbpln1.nuc -v gbpln1.prt

    Additional formats can be obtained by changing the value of the “-f” parameter (Table 7). The “-a t” parameter value invokes batch processing of a GenBank release file and “-b T” indicates that the input file is binary ASN.1.

    Table 7 Output format options for asn2all

    A remote fetching option, “-r T”, allows the download of an ASN.1 record from NCBI over a network connection using an accession number or NCBI ‘gi’ identifier (see Note 3). For example, to perform a remote fetch of the feature table within the GenBank record for the Epifagus virginiana chloroplast genome (accession number M81884) use the following:

    asn2all -r T -A M81884 -f t

    This produces output in the 5-column Feature Table format described in Note 8.

  8.

    When submitting sequences to GenBank that have annotations, submitters have the option to upload these annotations using a file format commonly referred to as the “5-column Feature Table.” This format specifies a simple text file where the annotation data are entered in tab-delimited columns. Details about this format are provided in the GenBank Submission Handbook (www.ncbi.nlm.nih.gov/books/NBK63592/).
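
The tab-delimited layout of the 5-column Feature Table can be sketched as follows. The layout assumed here is a ‘>Feature’ header line, then start, stop and feature key in columns 1 to 3, with each qualifier and its value in columns 4 and 5 on a separate line; this reflects a common reading of the format and should be checked against the Submission Handbook. The function name and the example features are hypothetical.

```python
def feature_table(seq_id, features):
    """Render features as tab-delimited 5-column feature table text.

    features: iterable of (start, stop, feature_key, qualifiers), where
    qualifiers is a list of (qualifier_key, value) pairs.
    """
    lines = [f">Feature {seq_id}"]
    for start, stop, key, qualifiers in features:
        # Columns 1-3: start, stop, feature key.
        lines.append(f"{start}\t{stop}\t{key}")
        for name, value in qualifiers:
            # Columns 4-5 sit on their own line after three empty columns.
            lines.append(f"\t\t\t{name}\t{value}")
    return "\n".join(lines) + "\n"


# Hypothetical annotation: one gene and its coding region.
EXAMPLE = feature_table("Seq1", [
    (1, 300, "gene", [("gene", "acs1")]),
    (1, 300, "CDS", [("product", "ACC synthase")]),
])
```

Writing EXAMPLE to a text file gives a table that can be inspected in any editor that displays tabs.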

  9.

    The program ‘blastdbcmd’ is part of the standalone BLAST package and is a tool for interacting with BLAST databases formatted by makeblastdb. For example, blastdbcmd can provide basic statistics about a BLAST database and download specific records from that database. Full details of how to use blastdbcmd are provided in the BLAST+ documentation (www.ncbi.nlm.nih.gov/books/NBK1763/).