14.1 Introduction

Bioinformatics has emerged as a fully fledged multidisciplinary field that integrates statistics and informatics for the analysis of biological data. Owing to advances in next-generation sequencing (NGS) technology, there has been a dramatic growth in studies of fish genomics (Kumar and Kocour 2017). Public databases now host a catalogue of complete genomes of biological species (mainly fish), which contain protein sequences, protein three-dimensional structures, metabolic pathways, and biodiversity-related information (Vera-Escalona et al. 2017; Adrian-Kalchhauser et al. 2017). Bioinformatics helps to solve biological problems using software and databases in areas such as functional genomics, biomolecular structure, proteome analysis, taxonomy, and pesticide molecule design (Cambiaghi et al. 2016).

Our earth harbors approximately 8.7 million species, of which around 2.2 million are marine (Mora et al. 2011). IUCN Red List version 2016–3 estimates that the number of described fish species is 33,400. The challenging part was to identify and classify this many species. Earlier methods employed to identify species relied mainly on morphology, protein electrophoresis, and chromatography (Yilmaz et al. 2007; Strauss and Bond 1990; Viswanathan and Pillai 1956). The barcoding technique is effectively utilized in fisheries and has been used to identify recently radiated megadiverse fauna from neotropical areas. The mitochondrial gene encoding cytochrome c oxidase subunit I (COI) is used as a marker in phylogeny, phylogeography, and population genetics studies (Pereira et al. 2012; Sbordoni 2010). It has been used for systematic study of native freshwater fish, to monitor the geographic distribution of species (Hubert et al. 2008), and to monitor threatened shark species (Velez-zuazo et al. 2015). These applications facilitate authentication of commercially important species and thereby enhance transparency and fair trade in the domestic fisheries market (Cawthorn et al. 2012). Recent developments include meta-barcoding, in which DNA released by organisms into the environment (eDNA) via cells, excreta, gametes, and decaying materials can effectively be used for species identification. A study conducted in the English Lake District described fish communities in large lakes, both quantitatively and qualitatively (Hanfling et al. 2016). The DNA meta-barcoding approach is considered a next-generation tool for biodiversity monitoring in aquatic ecosystems (Valentini et al. 2016). Mini-barcode primer pairs of length 127–314 bp were developed for authentication of fish food products (Shokralla et al. 2015).

In 2004, an international initiative by the Consortium for the Barcode of Life (CBOL) was taken to make DNA barcoding a standard method or tool for identification of species (http://www.barcodeoflife.org/content/about/what-cbol) (Group et al. 2009). The Barcode of Life Data System (BOLD) is the central informatics platform for DNA barcoding (ibol.org). The Fish Barcode of Life (FISH-BOL) and Shark Barcode of Life (Shark-BOL) initiatives are two important fish barcoding projects at the global level. In India, the Fish Barcode Information System (FBIS), a DNA barcode database on fish, was developed by the National Bureau of Fish Genetic Resources (NBFGR). The overall process of DNA barcoding in fish exploits both molecular and computational methods. A unique region of the specimen is considered as a barcoding marker. In the case of fish, the marker is the gene encoding cytochrome c oxidase I (COI) (Hebert et al. 2003).

The general strategy of barcoding involves DNA extraction from the specimen, amplification of a unique marker region using the polymerase chain reaction (PCR), and sequencing. Computational steps such as editing and aligning sequences are carried out using software such as BOLD v 3.0 (Pereira et al. 2012), TaxI (Steinke et al. 2005), MEGA (Kumar et al. 2008), MEGA 5.05 (Landi et al. 2014), CodonCode Aligner 3.7.1.1 (Shokralla et al. 2015), and Geneious Pro 5.4.2 (Henriques et al. 2015). Results are later submitted to the GenBank or BOLD databases. Hence, once sequencing is completed, the computational aspect plays a key role not only in identification but also in addressing questions related to evolution, diversity (Shen et al. 2016), and taxonomy (Hebert and Gregory 2005).

14.2 Molecular and Computational Approaches for Fish DNA Barcoding

The tissue sample collected from the fish specimen is subjected to DNA extraction. PCR amplifies the target COI gene using a universal primer cocktail (Ivanova et al. 2007). Sequencing of amplified PCR products by BigDye Terminator v.3.1 Cycle Sequencing Kit (Cawthorn et al. 2012) gives both forward and reverse strand sequences. Subsequent important steps are editing, alignment, and sequence submission.

A full-length sequence is made up of aligned reverse and forward strand sequences for all samples of a species (http://mail.nbfgr.res.in/fbis/protocol.php). All the aligned sequences are translated into amino acids to verify the quality of the sequence and to identify the presence of nuclear DNA pseudogenes, insertions, deletions, or stop codons (Shen et al. 2016). Edited sequences are submitted to the BLAST tool of the National Center for Biotechnology Information (NCBI) to obtain the closest matching sequences and are later submitted to GenBank or BOLD (http://mail.nbfgr.res.in/fbis/protocol.php). Available editing packages are DNASTAR multiple packages (Chen et al. 2015), Sequencher 4.8 (Gene Codes) (Velez-zuazo et al. 2015), GAP 4 (Shirak et al. 2016; Baxevanis and Ouellette 2004), MEGA version 4.1 (Costa et al. 2012), and MEGA 5.05 (Landi et al. 2014). Useful software packages, alignment tools, databases, and web pages pertaining to barcoding and other related analyses are listed in Tables 14.1 and 14.2.
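
The translation check described above can be illustrated with a minimal sketch. It assumes the vertebrate mitochondrial genetic code (NCBI translation table 2), which is the usual choice for fish COI, and that the reading frame of the fragment is already known. The function names `translate` and `has_stop_codon` are illustrative and do not belong to any of the packages cited in this chapter.

```python
# Vertebrate mitochondrial genetic code (NCBI translation table 2).
# Codons are enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CCWW"
    "LLLLPPPPHHQQRRRR"
    "IIMMTTTTNNKKSS**"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(seq, frame=0):
    """Translate a DNA sequence codon by codon; unknown codons become 'X'."""
    seq = seq.upper()
    protein = []
    for pos in range(frame, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[pos:pos + 3], "X"))
    return "".join(protein)


def has_stop_codon(seq, frame=0):
    """Return True if the translated fragment contains a stop codon ('*')."""
    return "*" in translate(seq, frame)
```

A stop codon inside a COI barcode fragment suggests a sequencing error, a frameshift caused by an insertion or deletion, or amplification of a nuclear pseudogene, so such sequences are normally inspected again before submission.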

Table 14.1 Fish DNA barcoding databases
Table 14.2 Software used for DNA barcoding

Sequence alignment is a method for finding commonality and conserved sequence regions between two or more sequences using a statistical algorithm. It is an important step in identifying the functional, structural, and evolutionary roles of a molecular sequence. A number of sequence alignment packages are available, among which BLAST (Altschul et al. 1990; Madden 2013), MUSCLE (Henriques et al. 2015), ClustalX 2.0 (Chen et al. 2015), ClustalW (Velez-zuazo et al. 2015), SeqScape v. 2.1.1 (Applied Biosystems, Inc.) (Zhang and Hanner 2012), BOLD v.3.0 (Pereira et al. 2012), and CodonCode Aligner v 3.7.1.1 (CodonCode Corp., Dedham, MA, USA) (Shokralla et al. 2015) are routinely used.
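
To make the idea of pairwise alignment concrete, the following is a minimal sketch of the textbook Needleman–Wunsch global alignment algorithm. It is not the method implemented by any of the packages listed above (BLAST, for example, uses a heuristic local search), and the scoring values (match = 1, mismatch = -1, gap = -2) are arbitrary assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global pairwise alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against gaps only.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the dynamic programming matrix.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, `needleman_wunsch("ACGT", "AGT")` aligns the shorter sequence as `A-GT`, placing a gap opposite the missing base. Production tools use far more efficient algorithms and empirically derived scoring schemes.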

The usefulness of DNA barcode data in deciphering the phylogenetic relationship between and within species is well studied and involves a series of steps such as alignment, determination of substitution model, and tree building. The latter includes either distance-based tree building or character-based tree building. The distance-based method utilizes the distance between two aligned sequences to generate phylogenetic trees, whereas character-based methods use the composition of oligonucleotide frequencies (e.g., di-, tri-, tetra-, penta-, hexa-, heptanucleotides) in the sequences (Baxevanis and Ouellette 2004; Higgs and Manchester 2001). The most commonly employed distance-based methods are neighbor-joining (Saitou and Nei 1987), the Fitch–Margoliash method, the unweighted pair group method with arithmetic mean (UPGMA), and minimum evolution (ME). Maximum parsimony (MP) and maximum likelihood (ML) are two major character-based methods used for phylogenetics (Felsenstein 1981). In addition, Bayesian analysis has been proposed for phylogeny (Huelsenbeck and Ronquist 2001). Tests for evaluating constructed trees include the skewness test, the permutation test, bootstrapping (which can be parametric or nonparametric), and the likelihood ratio test. Software packages for phylogenetic analysis include PHYLIP, PAUP, PUZZLE, FastDNAml, MACCLADE, and MOLPHY, along with internet-accessible phylogenetic software such as WEBPHYLIP, PhyloBLAST, BLAST 2, and Orthologue Search Server (Baxevanis and Ouellette 2004).
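
As an illustration of the distance step that precedes distance-based tree building, the sketch below computes the uncorrected p-distance and the Kimura two-parameter (K2P) distance between two aligned sequences. The choice of K2P is an assumption made for this example, since the text above does not name a specific substitution model; K2P is, however, widely used in barcoding studies. With P the proportion of transitions and Q the proportion of transversions, the K2P distance is d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)].

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def transition_transversion_fractions(seq1, seq2):
    """Return (P, Q): proportions of transitions and transversions."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    return transitions / sites, transversions / sites


def p_distance(seq1, seq2):
    """Uncorrected proportion of differing sites."""
    p, q = transition_transversion_fractions(seq1, seq2)
    return p + q


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned sequences."""
    p, q = transition_transversion_fractions(seq1, seq2)
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

A matrix of such pairwise distances is the input to methods such as neighbor-joining or UPGMA. The correction grows with divergence: one transition in ten sites gives a p-distance of 0.1 but a K2P distance of about 0.112.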

Noncoding internal transcribed spacer genes have also been suggested as candidate barcodes, along with the COI gene for animal and plant DNA barcoding (Gao et al. 2017; Yang et al. 2017). Two new approaches (DV-RBF and FJ-RBF) have been used to align the noncoding regions for DNA barcoding and showed a 100% success rate in identifying marine fish species (Zhang et al. 2012). On the other hand, alignment-free methods such as normalized compression distance (NCD) and information-based distance (IBD) have been utilized for taxonomic analysis of barcode sequences (La Rosa et al. 2013). Taxonomic classification methods are mainly categorized into (1) tree-based approaches, (2) composition-based approaches, (3) similarity-based approaches, and (4) hybrids. These methods require reference databases to predict the taxonomy (Tanabe and Toju 2013).
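
The normalized compression distance mentioned above can be sketched in a few lines. The general definition is NCD(x, y) = [C(xy) - min(C(x), C(y))] / max(C(x), C(y)), where C is the compressed size of a string. The use of zlib as the compressor is an assumption made here for convenience; the cited study may have used a different compressor, and results depend on that choice.

```python
import zlib


def compressed_size(seq):
    """Size in bytes of the zlib-compressed sequence (compressor is an assumption)."""
    return len(zlib.compress(seq.encode("ascii"), 9))


def ncd(x, y):
    """Normalized compression distance between two sequences."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Because no alignment is required, the two sequences may differ in length. Values close to 0 indicate that one sequence adds little new information to the other, whereas values close to 1 indicate unrelated sequences. For short barcodes the fixed overhead of a general-purpose compressor is noticeable, so the values are best used for ranking rather than as absolute distances.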

In a recent study, similarity-based methods such as nearest-neighbor-centric auto-k-NN (NNCauto) and query-centric auto-k-NN (QCauto) were proposed for barcoding studies (Tanabe and Toju 2013). A method of string kernel-based sequence analysis of barcode data sets was proposed that considerably improves species identification accuracy compared with traditional approaches (Kuksa and Pavlovic 2007). Sequence identification methods that use pairwise alignment (e.g., BLAST) are not able to discriminate species that have highly similar sequences, because only very few base pairs differ between the sequences. To address this issue, alignment-free methods (e.g., BRONX) were developed to identify species sequences (Little 2011). BRONX detects short subsequence regions and matching regions in reference sequences. Based on these regions, the algorithm generates a score without use of multiple sequence alignment to identify sequences at the genus level (Little 2011).

14.3 Public Domain Databases

Recent progress in next-generation sequencing (NGS) platforms has advanced the discipline of bioinformatics for the annotation of genome data. Public databases contain huge amounts of accessible data on whole genome sequences, which have improved research in applied fish science. Popular primary, secondary, and specialized databases include BOLD, FISH-BOL, GenBank, and FBIS.

14.3.1 Barcode of Life Data System

The Barcode of Life Data System (BOLD) (http://www.barcodinglife.org) facilitates a detailed collection of specimens deposited by researchers from different barcoding studies. This database holds three main categories of information. The first category is basic information on the specimen and sequence entries. The second maintains quality assurance and manages barcode data with all related information. The third category facilitates a detailed catalogue of specimen data entries from geographically different researchers. A user can store specimen information in the following sections:

  • Species name

  • Voucher data, institution storing, and catalogue number

  • Collection record, which includes collector name, location with GPS coordinates, and date of collection

  • Identifier of the specimen

  • COI sequence with minimum 500 bp

  • PCR primers used to generate the amplicon

  • Trace files
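
The data elements listed above can be pictured as a simple record with a completeness check. The sketch below is purely illustrative: the class name, field names, and validation logic are hypothetical and do not reflect the actual BOLD schema or its submission interface.

```python
from dataclasses import dataclass, field


@dataclass
class BarcodeRecord:
    """Hypothetical container for the data elements of a barcode submission."""
    species_name: str = ""
    voucher_data: str = ""        # institution storing and catalogue number
    collection_record: str = ""   # collector, location with GPS, date
    identifier: str = ""          # person who identified the specimen
    coi_sequence: str = ""
    pcr_primers: list = field(default_factory=list)
    trace_files: list = field(default_factory=list)

    def missing_elements(self):
        """Return the names of elements that are absent or insufficient."""
        missing = []
        for name in ("species_name", "voucher_data",
                     "collection_record", "identifier"):
            if not getattr(self, name).strip():
                missing.append(name)
        if len(self.coi_sequence) < 500:
            missing.append("coi_sequence (minimum 500 bp)")
        if not self.pcr_primers:
            missing.append("pcr_primers")
        if not self.trace_files:
            missing.append("trace_files")
        return missing
```

A record for which `missing_elements()` returns an empty list carries every element in the list above; any other result names what still has to be supplied.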

BOLD is a freely accessible informatics workbench used for the collection, storage, scrutiny, and publication of DNA barcode entries. It comprises more than 65,000 lines of combined code written in Java, C++, and PHP. To gain formal barcode status, certain criteria must be satisfied, including species name, voucher data, and collection record. BOLD employs many tools to identify data anomalies or low-quality records. All submitted sequences are translated into amino acids and are matched against a hidden Markov model (HMM) of the COI protein to confirm that they originate from the COI sequence. Sequences are then checked for stop codons, and also against a small set of possible contaminants. If any errors are detected, the submitter is informed and the sequence is flagged. Once a trace file is provided, BOLD determines a PHRED score for each nucleotide position and a mean value for the full sequence based on these results. Next, it assigns each sequence entry to one of four classes: failed (no sequence), low quality (mean PHRED < 30), medium quality (mean PHRED = 30–40), and high quality (mean PHRED > 40). The data stored in BOLD can be readily exported in FASTA format for use in other analytical packages. BOLD provides an examination utility that permits users to determine sequence coverage for a specific taxonomic or geographic region. It includes an integrated management and analysis system (MAS), which provides data analysis tools such as the taxon identification (ID) tree. Unknown sequences are identified by pasting their sequence record into the input box on the ID form. Core data element records in BOLD consist of a specimen page and a sequence page. Barcodes in the search archives are grouped into two categories. Species with three representatives and a maximum divergence of 2% are considered. An HMM method is used to align the query sequence with archive sequences. The HMM method is faster than BLAST because of its efficient data processing capability. BOLD identifies a species if the query sequence displays a close match, with less than 1% divergence, against the archive sequences (Ratnasingham and Hebert 2007).
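
The four quality classes described above follow directly from the mean PHRED score of a trace. The sketch below applies those thresholds to a list of per-base scores; the function name and the treatment of an empty list as a failed read are assumptions made for illustration and do not represent BOLD's actual code.

```python
def classify_trace(phred_scores):
    """Assign a quality class from per-base PHRED scores.

    Thresholds follow the text: mean < 30 is low quality,
    30-40 is medium quality, > 40 is high quality.
    """
    if not phred_scores:
        return "failed"  # no sequence was obtained
    mean_score = sum(phred_scores) / len(phred_scores)
    if mean_score < 30:
        return "low quality"
    if mean_score <= 40:
        return "medium quality"
    return "high quality"
```

A PHRED score Q corresponds to an estimated base-calling error probability of 10^(-Q/10), so a mean of 30 implies roughly one expected error per 1000 bases and a mean of 40 roughly one per 10,000.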

14.3.2 Fish Barcode of Life Campaign and Fish Barcode Information System

The FISH-BOL campaign was started in 2004 with the aim of generating tools for identifying all types of fish species. Its primary goal was to gather barcodes for all of the world’s fish. FISH-BOL comprises sequences, geographical information, and images for examined specimens, thereby creating a valuable public resource. Information organized and analyzed through the BOLD database is later delivered via a data feed to the FISH-BOL web portal. This repository utilizes taxonomic information derived from FishBase and maintains a catalogue of fish (Ward et al. 2008). The International Nucleotide Sequence Database Collaboration (INSDC) archives DNA sequences from the FISH-BOL campaign and annotates each sequence with the keyword “barcode” when it meets the barcode data standards. These standards require the bidirectionally sequenced 5′-end of the COI gene sequence, a valid species name, details concerning voucher specimens, coordinates of the collection locality, collection date, collector, and identifier. Also required are a list of the primers used to generate reference sequences and archiving of the underlying electropherogram trace files in a publicly accessible NCBI trace archive. All this information is useful for using barcodes in molecular diagnostics applications. BOLD provides an online workbench to FISH-BOL (Ward et al. 2008).

The FBIS web-based tool is designed for the fish of India. The database has a total of 2334 COI gene sequences belonging to 472 aquatic species. It works both as a local DNA barcode library and as an analysis system and contains valuable data regarding the phenotype, distribution, and IUCN Red List status of fish (Nagpure et al. 2012). This database enables data to be saved and extracted in a few simple steps. A user can submit species sequences through a submission protocol. Species identification is performed using similarity search programs, which find homologues with almost 99% similarity to the query sequence and thereby accurately assign the species (Nagpure et al. 2012).
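
Similarity-based assignment of this kind can be sketched as a search for the best-matching reference above a similarity threshold. The example below assumes that query and reference sequences are already aligned to the same length and uses a 99% identity cutoff taken from the figure quoted above. The function names and the dictionary-based reference library are hypothetical and are not part of FBIS.

```python
def percent_identity(seq1, seq2):
    """Percentage of identical positions between two aligned sequences."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(seq1.upper(), seq2.upper()) if x == y)
    return 100.0 * matches / len(seq1)


def identify_species(query, library, threshold=99.0):
    """Return (species, identity) for the best match at or above the threshold.

    `library` maps a species name to one aligned reference sequence.
    If no reference reaches the threshold, species is None.
    """
    best_species, best_identity = None, 0.0
    for species, reference in library.items():
        identity = percent_identity(query, reference)
        if identity > best_identity:
            best_species, best_identity = species, identity
    if best_identity >= threshold:
        return best_species, best_identity
    return None, best_identity
```

Returning `None` when no reference reaches the threshold avoids forcing an assignment for species that are missing from the library, which is a practical concern for any regional reference database.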

14.3.3 NCBI GenBank

GenBank is a comprehensive database that contains nucleotide sequences for more than 250,000 species (Benson et al. 2013). NCBI offers an online/offline sequence submission platform for depositing sets of barcode sequences in the GenBank database. Along with the barcode data, the submission platform collects other annotations such as specimen voucher, geographical information, sample collection date, primer data, and raw files to help recognize the sequence’s source organism and to maintain the accuracy of the sequence. The GenBank file format is easy for users to understand. It contains sequence data along with the accession numbers and gene names, taxonomy, references to published literature, and other meaningful information. The GenBank format comprises the locus, definition, accession, keywords, source, reference, and features fields for the gene. The user can download the nucleotide or amino acid sequence in FASTA format from the FASTA link on the record or from the “Send to” menu option (https://www.ncbi.nlm.nih.gov/genbank/barcode/). It is important to give the publication details related to barcodes and to provide sequences in FASTA format with reverse and forward primers. Protein sequence submission is optional.
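
The field structure of a GenBank flat file can be illustrated with a minimal parser that collects the top-level fields and converts the sequence to FASTA. The sample record below is entirely hypothetical (placeholder accession, invented 24 bp sequence), and the parser is a simplified sketch that handles only the basic layout in which field keywords start in the first column. For real work, an established library such as Biopython would be preferable.

```python
SAMPLE_RECORD = """
LOCUS       XX000000                 24 bp    DNA     linear   VRT 01-JAN-2000
DEFINITION  Hypothetical fish COI gene, partial cds; mitochondrial.
ACCESSION   XX000000
KEYWORDS    BARCODE.
SOURCE      mitochondrion Hypothetical species
FEATURES             Location/Qualifiers
     source          1..24
ORIGIN
        1 atggcaatca cccgatgatt attc
//
"""


def parse_genbank(text):
    """Return (fields, sequence) from a single GenBank flat-file record."""
    fields, sequence, key = {}, [], None
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if line and not line[0].isspace():
            # A keyword in column 1 starts a new top-level field.
            key = line[:12].strip()
            fields[key] = line[12:].strip()
        elif key == "ORIGIN":
            # Sequence lines: keep letters, drop position numbers and spaces.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
        elif key is not None and line.strip():
            # Indented lines continue the current field.
            fields[key] += " " + line.strip()
    fields.pop("ORIGIN", None)
    return fields, "".join(sequence).upper()


def to_fasta(fields, sequence, width=60):
    """Format the record as FASTA, using the accession as the header."""
    header = ">" + fields.get("ACCESSION", "unknown")
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines)
```

Calling `parse_genbank(SAMPLE_RECORD)` yields a dictionary keyed by `LOCUS`, `DEFINITION`, `ACCESSION`, `KEYWORDS`, `SOURCE`, and `FEATURES`, together with the uppercase nucleotide sequence.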

14.4 DNA Barcoding Repositories and Their Associated Tools

It is difficult to preserve the data integrity, interoperability, and utility of information relating to the “what”, “where”, and “when” of biodiversity data. Furthermore, DNA barcoding and other biodiversity information systems must maintain data standards so that appropriate metadata is efficiently included. Three main organizations, the International Barcode of Life Project (iBOL), CBOL, and BOLD, promote barcoding research with the aim of generating reference barcodes (Group et al. 2009; Ratnasingham and Hebert 2007). These organizations focus on the development of barcoding as a universal standard and offer an online workbench for the collection, management, analysis, and use of DNA barcodes.

iBOL (http://ibol.org) has a network of collaborators from about 150 countries, includes more than 190,000 marine species, and has identified 6000 potentially new species (flowering plants, ants, birds, butterflies, mammals, bees, fish, and fungi). It has collections in the form of ecosystems such as rain forests, kelp forests, poles, seas, and coral reefs. CBOL generated the BOLD system as a catalogue of living beings and has collections covering more than 790,000 sequences, corresponding to more than 67,000 correctly called “species.” The BOLD database entries contain barcode sequences and specimen information such as images, morphology, collection date, and geographical site. To provide practical utility for BOLD data, the mobile-based software DNA Barcoding Assistant efficiently maintains metadata for the gathering and management of specimen data for BOLD and other biodiversity information databases.

The DNA Barcoding Assistant (http://www.dnabarcodingassistant.org/) enables users to store and retrieve data such as provisional user-allocated taxonomic classification, geospatial data, digital images, and collection event information for specimens found in the field. Another web-based data-processing tool, BioBarcode (http://www.asianbarcode.org), focuses on the collection of Asiatic organisms and encompasses about 11,300 specimen entries (Lim et al. 2009). Along similar lines, a field information management system (FIMS) has been developed that provides information associated with tissues, collecting events, and specimens (Deck et al. 2012). Similarly, the Quick Response (QR) barcode system could be efficiently implemented to identify and track samples, together with relevant information such as site details, time of collection, and taxonomic identity (Diazgranados and Funk 2013). These developments indicate that continuous progress is being made in DNA barcoding.