Introduction

Tor mahseer, Tor tor (Hamilton 1822), is one of the most important food and game fishes of India. It inhabits rivers of the ‘Tor region’ (600–1200 m) and plains with rocky bottoms (Singh and Kumar 2000). This species has been reported from the Indus, Ganga (including the sub-Himalayan range), Brahmaputra, Godavari and Krishna riverine systems and forms a major fishery in the Narmada river (Desai 2003; Jayaram 2005; Lal et al. 2013). However, due to overfishing, habitat destruction and other anthropogenic activities, the natural populations of this species are declining and its conservation status is evaluated as ‘near threatened’ (IUCN 2015). To formulate effective management and conservation measures, organisms need to be identified and delimited accurately. The systematics of mahseer fishes, however, is highly ambiguous because phenotypic plasticity has led to taxonomic instability (Mohindra et al. 2007; Khare et al. 2014).

Taxonomically, mahseers are defined as carps with big head and scales, two pairs of barbels, and fleshy lips that are continuous at the angles of the mouth, with an interrupted fold or groove across the lower jaw (Desai 2003). Initially, Hamilton (1822) classified mahseers under the genus ‘Cyprinus’, but later Gray (1834) proposed the new genus ‘Tor’ for these species. Rainboth (1996) classified Tor species under the family Cyprinidae, subfamily Cyprininae, tribe Cyprinini. Thai et al. (2007) placed Tor species under the tribe Barbini (interchangeably referred to as subfamily Barbinae by various authors). The tribe name became more contentious after Yang et al. (2010) showed the polyphyletic nature of the tribe Barbini within Cyprininae.

In India, information on mahseer genetic resources is inconclusive, and several studies using a limited set of markers (COI and D-loop) showed that different species names had been misapplied to diverse morphotypes of conspecific individuals (Laskar et al. 2013; Khare et al. 2014). As a first step towards resolving the taxonomic ambiguity of mahseer species, several studies have briefly described the complete mitochondrial genomes of T. putitora (Sati et al. 2014), T. khudree (Raman et al. 2015) and T. tor (Kumar et al. 2015) with limited phylogenetic analysis. However, a molecular phylogeny based on a larger number of characters is expected to yield a more accurate tree than one based on a limited number of characters. Kumar et al. (2015) reported the complete mitochondrial genome of T. tor, but their study was limited to a brief description of the mitogenome and a partial phylogeny. Further, a molecular phylogeny that includes conspecific individuals from different geographical locations would give insight into speciation and the evolution of genetic lineages. Additionally, comparative analysis of mitogenomes could provide leads on the pattern of amino acid usage and gene evolution among mahseer fishes. Against this background, the present study was carried out with the objective of characterizing the complete mitochondrial genome of T. tor to infer its phylogenetic position within the mahseer group.

Materials and methods

Sample collection

Two samples of T. tor (Hamilton 1822) (figure 1 in electronic supplementary material at http://www.ias.ac.in/jgenet/) were collected from the Narmada river (Hoshangabad, 7774E, 2276N) during April 2015. The specimens were identified in the field by observing morphological and meristic characters (Desai 2003). Voucher specimens were prepared by preserving the specimens in absolute alcohol with proper labelling. Fins were collected under aseptic conditions, preserved in absolute alcohol and kept at −80°C for further analysis. For phylogenetic and comparative analyses, reported mitochondrial DNA sequences of mahseer species were downloaded from NCBI GenBank.

Figure 1

Plot of divergences among mtDNA sequences. (a) The sliding window analysis calculates the divergence between T. tor specimens (present study: KR868704 and reported specimen: KP795444). (b) Divergence plot between T. tor specimens (present study specimens: KR868704-05). The bar at the top illustrates the position of protein-coding genes and rRNAs, and the tRNAs are represented as black boxes.

DNA extraction and long PCR amplification

Total genomic DNA was isolated from the fins using the standard phenol/chloroform method. The whole mitochondrial genome was amplified as two amplicons using two sets of reported primers: L-1231-Leu: 5′-GGTCTTAGGAACCAAAAACTCTTGGTGCAA-3′ and S-LA-16S-H: 5′-TGCACCATTAGGATGTCCTGATCCAACATC-3′; H-12321-Leu: 5′-TTGCACCAAGAGTTTTTGGTTCCTAAGACC-3′ and S-LA-16S-L: 5′-GATGTTGGATCAGGACATCCCAATGGTGCA-3′ (Kim and Lee 2004). Long-range polymerase chain reaction (PCR) was carried out in a 20 μL reaction mixture containing 9.7 μL sterile distilled water, 2 μL of 10× clean α-Taq reaction buffer, 4 μL of 5× HQ buffer, 2 μL of dNTPs mix, 1 μL of each primer, 0.3 μL of clean α-Taq DNA polymerase (Krishgen Biosystems, USA) and 2 μL of template DNA, using a Bio-Rad iCycler PCR machine and 0.2 mL PCR tubes. The thermocycler was programmed for initial denaturation at 95°C for 3 min, followed by 40 cycles of 95°C for 20 s and combined annealing and extension at 68°C for 15 min. The PCR products were visualized on 1.0% agarose gels and the amplicons were purified with a gel extraction kit (Qiagen, Valencia, USA) following the manufacturer’s protocol.
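Because the two amplicons are primed from the same two sites in opposite directions, each primer is expected to be close to the reverse complement of its partner in the other set. The following is a minimal sketch of that check, using the primer sequences as printed above with the internal spaces removed; the helper names `reverse_complement` and `mismatches` are illustrative and not part of the original protocol.

```python
# Primer sequences as printed in the text (internal spaces removed).
PRIMERS = {
    "L-1231-Leu": "GGTCTTAGGAACCAAAAACTCTTGGTGCAA",
    "H-12321-Leu": "TTGCACCAAGAGTTTTTGGTTCCTAAGACC",
    "S-LA-16S-H": "TGCACCATTAGGATGTCCTGATCCAACATC",
    "S-LA-16S-L": "GATGTTGGATCAGGACATCCCAATGGTGCA",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Compare each primer with the reverse complement of its partner.
leu_mismatches = mismatches(
    reverse_complement(PRIMERS["L-1231-Leu"]), PRIMERS["H-12321-Leu"]
)
rrna_mismatches = mismatches(
    reverse_complement(PRIMERS["S-LA-16S-L"]), PRIMERS["S-LA-16S-H"]
)
print(leu_mismatches, rrna_mismatches)
```

With the sequences as transcribed here, the tRNALeu pair is an exact reverse complement, whereas the 16S pair differs at a single position; whether that reflects the original primer design or a transcription difference cannot be decided from this text alone.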

Amplicon sequencing by ion torrent NGS platform

DNA (amplicon) quality and quantity were measured using a Qubit 2.0 fluorometer. Around 500 ng of amplified product was fragmented by ultrasonication to obtain the desired fragment size of 400 bp. The quality and quantity of the fragments were verified on a Bioanalyzer, and the fragments were purified by AMPure bead purification. Adapter ligation, size selection, nick repair and amplification were performed as described in the Ion Torrent protocol. Library preparation was carried out using the Ion Xpress™ Fragment Library kit (Thermo Fisher Scientific, MA, USA) with 500 ng of DNA, and a different barcode adaptor was used for each library (barcode series 53 and 54 for T. tor specimens 1 and 2, respectively). After appropriate dilution, the two libraries were pooled in equimolar concentration. Emulsion PCR and enrichment steps were carried out using the Ion Xpress™ Template kit and associated protocol. Libraries were sequenced on a 314 chip, and the reads obtained from the Ion Torrent sequencer were grouped according to their barcodes into two separate files.

Data analysis

After quality checking, raw reads were trimmed and assembled into contigs, and the resulting contigs were mapped against the T. putitora reference mitogenome (GenBank: NC_021755.1) using CLC Genomics Workbench ver. 5.0. The assembled mitogenome was annotated using the MITOFISH (MitoAnnotator) online mitochondrial genome annotation server (Iwasaki et al. 2013). The ribosomal RNA (rRNA) gene boundaries were confirmed by alignment with other reported fish mitogenomes. The transfer RNA (tRNA) structures were predicted using the ‘tRNAscan’ web server (Schattner et al. 2005) in the search mode ‘tRNAscan only’ with the vertebrate mitochondrial genetic code. The origin of L-strand replication (OL) was predicted using the online ‘Mfold’ software (Zuker 2003). The sequences were analysed for base composition and variation using MEGA6 (Tamura et al. 2013). The DAMBE5 software was used to test for sequence substitution saturation (Xia 2013), and the DnaSP5 software was used to calculate the divergence between mitogenomes by sliding window analysis (Rozas 2009). Codon usage analysis was done using MEGA6, and the heatmap of codon usage was prepared using the online tool ‘CIMminer’ (http://discover.nci.nih.gov/cimminer/home.do). The evolutionary models were estimated using the jModelTest 2 software (Guindon and Gascuel 2003; Darriba et al. 2012). PAUP ver. 4.0 (Swofford 2003) was used to reconstruct phylogenetic trees using the maximum parsimony (MP), maximum likelihood (ML) and neighbour-joining (NJ) methods. Nonparametric bootstrap support for each node of the tree was estimated using 100 heuristic bootstrap replicates. Bayesian inference (BI) was implemented with MrBayes ver. 3.1.2 (Huelsenbeck and Ronquist 2003) with a flat Dirichlet prior distribution for the substitution rates and stationary nucleotide frequencies. The Bayesian analysis was run for five million generations with sampling every 1000 generations.
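The idea behind the sliding window analysis can be illustrated with a short sketch: for two aligned sequences, the proportion of differing sites is computed in successive windows along the alignment. This is not the DnaSP implementation; the function name, the window and step sizes, and the handling of gaps and ambiguous bases are assumptions made for illustration only.

```python
from typing import List, Tuple


def sliding_window_divergence(
    seq_a: str, seq_b: str, window: int = 100, step: int = 25
) -> List[Tuple[int, float]]:
    """Return (window midpoint, proportion of differing sites) per window.

    Both inputs must come from the same alignment (equal length).
    Columns with a gap ('-') or 'N' in either sequence are skipped.
    Window and step sizes are illustrative defaults, not values from the study.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    seq_a, seq_b = seq_a.upper(), seq_b.upper()
    results = []
    for start in range(0, len(seq_a) - window + 1, step):
        compared = 0
        differences = 0
        for a, b in zip(seq_a[start:start + window], seq_b[start:start + window]):
            if a in "-N" or b in "-N":
                continue
            compared += 1
            differences += a != b
        if compared:
            results.append((start + window // 2, differences / compared))
    return results


# Toy example: identical first half, fully divergent second half.
toy = sliding_window_divergence("A" * 20, "A" * 10 + "C" * 10, window=10, step=10)
print(toy)
```

Plotting the second element of each tuple against the first gives a divergence profile along the genome of the kind shown in figure 1.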

Results

The mitochondrial genome of T. tor was amplified as two overlapping amplicons of 8–9 kb (figure 2 in electronic supplementary material). These amplicons were purified and processed for sequencing on the Ion Torrent NGS platform. The Ion Torrent sequencer generated 347,885 and 137,605 reads with a mean length of 238 bp for T. tor specimens 1 and 2, respectively (figure 3 in electronic supplementary material). These sequences were assembled into contigs and mapped against the reference sequence (T. putitora, NC_021755.1) using CLC Genomics Workbench. The coverage depth was 4213× and 1669× for T. tor specimens 1 and 2, respectively (table 1).
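As a rough plausibility check on these coverage values, the mean depth that would result if every read mapped to the mitogenome can be estimated as (number of reads × mean read length) / genome length. The sketch below uses the read counts and mean length given above and the genome length of 16,571 bp reported for T. tor in this study; the function name is illustrative, and the estimate is only an upper bound because it assumes that all reads map.

```python
def idealized_depth(n_reads: int, mean_read_length: float, genome_length: int) -> float:
    """Mean depth if every read mapped once to the genome (an upper bound)."""
    return n_reads * mean_read_length / genome_length


GENOME_LENGTH = 16571  # bp, T. tor mitogenome length reported in this study

depth_specimen_1 = idealized_depth(347_885, 238, GENOME_LENGTH)
depth_specimen_2 = idealized_depth(137_605, 238, GENOME_LENGTH)
print(round(depth_specimen_1), round(depth_specimen_2))
```

Both idealized values lie somewhat above the reported depths of 4213× and 1669×, which is the expected direction if a fraction of the reads was removed during trimming or did not map to the reference.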

Figure 2

Heatmap of codon usage.

Figure 3

Transition (S) and transversion (V) saturation plots for dataset (protein-coding genes).

Table 1 Details of reference-guided sequence mapping for T. tor.

Genome content and organization

The total length of the T. tor mitochondrial genome was 16,571 bp, comprising 13 protein-coding genes, 22 tRNAs, two rRNAs and a noncoding control region (figure 4 in electronic supplementary material). The sequences were submitted to NCBI GenBank under accession numbers KR868704–KR868705. Most of the genes were encoded on the heavy strand (H-strand), whereas only ND6 and eight tRNA genes (glutamine, alanine, asparagine, cysteine, tyrosine, serine (UCN), glutamic acid and proline) were encoded on the light strand (L-strand). Gene overlaps were observed between ATPase 8 and ATPase 6 (7 bp); ND4L and ND4 (7 bp); ND5 and ND6 (4 bp); CytB and tRNAThr (17 bp); and tRNAIle and tRNAGln (2 bp). Around 15 intergenic spacer sequences were found, with a total size of 29 bp (table 2).

Figure 4

Phylogenetic analysis of mahseers and other cyprinids using concatenated mitochondrial protein-coding genes. The phylogenetic relationship was analysed by maximum parsimony (MP), maximum likelihood (ML) and Bayesian inferences (BI) methods. Tree topology produced by different methods is similar. Bootstrap values are in MP/ML/BI order.

Table 2 Organization of the complete mitochondrial genome of T. tor.

The overall base composition was found to be A: 31.91, C: 27.38, G: 15.64 and T: 25%, with a high A + T content. The overall GC-skew and AT-skew of the H-strand were −0.274 and 0.126, respectively, confirming a compositional bias of the heavy strand towards C over G and a slight excess of A over T.
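Strand skews of this kind are conventionally computed as AT-skew = (A − T)/(A + T) and GC-skew = (G − C)/(G + C). The sketch below shows the calculation for an arbitrary sequence string; the function names are illustrative, and the toy sequence is not taken from the T. tor mitogenome.

```python
from collections import Counter
from typing import Dict, Tuple


def base_composition(seq: str) -> Dict[str, float]:
    """Fraction of A, C, G and T among unambiguous bases."""
    counts = Counter(seq.upper())
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}


def strand_skews(seq: str) -> Tuple[float, float]:
    """Return (AT-skew, GC-skew) for one strand.

    Assumes the sequence contains at least one A or T and one G or C.
    """
    comp = base_composition(seq)
    at_skew = (comp["A"] - comp["T"]) / (comp["A"] + comp["T"])
    gc_skew = (comp["G"] - comp["C"]) / (comp["G"] + comp["C"])
    return at_skew, gc_skew


# Toy sequence: 4 A, 3 C, 2 G, 1 T.
toy_at_skew, toy_gc_skew = strand_skews("AAAACCCGGT")
print(toy_at_skew, toy_gc_skew)
```

A negative GC-skew indicates more C than G on the analysed strand, and a positive AT-skew indicates more A than T, which is the pattern described above for the H-strand.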

Protein-coding genes

As observed in all other vertebrates, 13 protein coding genes were also present in T. tor mitochondrial genome. These genes include seven subunits of the NADH ubiquinone oxidoreductase complex (ND1-6, ND4L), three subunits of the cytochrome c oxidase (COI–III), one subunit of the ubiquinol cytochrome b oxidoreductase complex (Cyt b), and two subunits of ATP synthases (ATP-6 and ATP-8). The total length of these genes was 11,408 bp, accounting for 68.8% of the whole mitogenome.

All the 12 genes encoded on the H-strand showed a marked similarity in nucleotide composition with an antiguanine bias, whereas ND6 encoded on the L-strand exhibited an opposite trend (33.9% for G). Among the protein-coding genes, A + T content varied from 52.1 (ND1) to 61.8% (ATPase 8). Comparative analyses of the base composition at each codon base position of the 12 protein-coding genes showed that the codon third base position possessed the highest A + T content (63%) with a strong bias against the use of G (6.8%).

The typical initiation codon ‘ATG’ was used by 12 of the 13 protein-coding genes, while the COI gene has ‘GTG’ as its start codon. Stop codon usage was more varied and included incomplete codons to terminate translation (Oh et al. 2007). Six of the 13 protein-coding genes terminate with the incomplete codon T−− (CytB, ND2, ND3, ND4, COII and COIII), six genes (ND1, COI, ATPase 6, ND4L, ND5 and ND6) have the TAA termination codon, and the ATPase 8 gene has the termination codon ‘TAG’. Among the protein-coding genes, ND5 (1820 bp) is the longest, while ATPase 8 (164 bp) is the shortest.
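The start and stop codon categories described above can be assigned programmatically from an annotated gene sequence. The following is a minimal sketch, assuming the gene is given 5′ to 3′ in its coding orientation and that complete stop codons follow the vertebrate mitochondrial code (NCBI translation table 2); the function names are illustrative, and the toy sequences are not T. tor genes.

```python
# Complete stop codons in the vertebrate mitochondrial code (NCBI table 2).
COMPLETE_STOPS = {"TAA", "TAG", "AGA", "AGG"}


def start_codon(gene_seq: str) -> str:
    """Return the first three bases of the gene."""
    return gene_seq[:3].upper()


def classify_termination(gene_seq: str) -> str:
    """Classify the 3' end of a gene as a complete or incomplete stop codon.

    Returns the stop codon ('TAA', 'TAG', ...), 'T--' or 'TA-' for truncated
    stops that would be completed by polyadenylation, or 'none'.
    """
    seq = gene_seq.upper()
    remainder = len(seq) % 3
    if remainder == 0:
        last = seq[-3:]
        return last if last in COMPLETE_STOPS else "none"
    tail = seq[-remainder:]
    if tail in ("T", "TA"):
        return tail + "-" * (3 - remainder)
    return "none"


# Toy sequences only, chosen to exercise each category.
print(start_codon("GTGAAAT"), classify_termination("GTGAAAT"))
print(start_codon("ATGAAATAA"), classify_termination("ATGAAATAA"))
```

The truncated categories rely on the gene length not being a multiple of three, so the result depends on how the annotation places the 3′ boundary of each gene.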

A total of 3799 codons were identified including all protein-coding genes. The most and least frequently used amino acids are leucine (16.48%) and cysteine (0.66%), respectively. Analysis of overall codon usage data indicated that codons ending with T or A were used more frequently than those ending with C or G. This observation is in agreement with the overall high A + T content (56%) of the entire mitogenome. In addition, the most frequently used codon is CTA (258/3799), while CGG is the least frequently used codon (4/3799; table 3).
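Codon usage tables of this kind are obtained by splitting each coding sequence into consecutive triplets and tallying them. The sketch below illustrates the counting step and the share of codons ending in A or T; it is not the MEGA6 procedure, the function names are illustrative, and the toy sequence is not a T. tor gene.

```python
from collections import Counter


def codon_counts(cds: str) -> Counter:
    """Count in-frame codons; a trailing incomplete codon is ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def third_position_at_fraction(counts: Counter) -> float:
    """Fraction of counted codons whose third base is A or T."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no codons counted")
    at_ending = sum(n for codon, n in counts.items() if codon[2] in "AT")
    return at_ending / total


# Toy coding sequence: ATG CTA CTA TAA.
toy_counts = codon_counts("ATGCTACTATAA")
toy_at3 = third_position_at_fraction(toy_counts)
print(toy_counts.most_common(1), toy_at3)

# Share of the most frequent codon reported above (CTA, 258 of 3799 codons).
cta_share = 258 / 3799
print(round(100 * cta_share, 2))
```

Counters from the individual genes can be summed with `+` to obtain genome-wide totals comparable to table 3.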

Table 3 Codon usage in T. tor mitochondrial protein-coding genes.

Transfer and ribosomal RNA genes

The mitogenome of T. tor contained 22 typical tRNA genes interspersed between the protein-coding genes and rRNA genes. Twenty of the tRNA genes correspond to the 20 different amino acids, and there is one additional tRNA gene each for serine (GCU) and leucine (CUN). The size of the tRNA genes varied from 65 (tRNACys) to 75 bp (tRNALeu). With the exception of tRNASer (UGC), all tRNAs showed the typical cloverleaf secondary structure. The majority of the predicted tRNA secondary structures shared a common architecture, with 7 bp in the amino acid stem, 5 bp in the TΨC stem, 5 bp in the anticodon stem and 4 bp in the DHU stem (figure 5 in electronic supplementary material).

Figure 5

Phylogenetic analysis of mahseers and other cyprinids using concatenated mitochondrial protein-coding genes. The phylogenetic relationship was analysed by NJ method based on LogDet distance values.

The lengths of the 12S and 16S rRNA genes were 955 and 1676 bp, respectively. The two ribosomal RNA genes are separated by tRNAVal, and this entire gene cassette is flanked by tRNAPhe and tRNALeu. The base composition of the 12S and 16S rRNA genes is A, 35; G, 20.6; C, 25.4; T, 19.0%. The A + T content of rRNA genes is 54%, which is slightly lower than the protein-coding gene (56%), but is higher in comparison with tRNA genes (55.7%).

Noncoding regions

Two noncoding regions, the origin of L-strand replication (OL) and the control region, were observed in the T. tor mitochondrial genome. The OL (36 bp) was located within the cluster of five tRNA genes (the WANCY region), between tRNAAsn and tRNACys. This region was predicted to fold into a stable stem–loop secondary structure with 13 bp in the stem and 10 bp in the loop (figure ?? in electronic supplementary material). Alignment of this region with other species of the family Cyprinidae showed an overrepresentation of pyrimidines (C, T) on the 5′ side of the sequence and a conserved ‘GCGGG’ motif on the 3′ side (figure ?? in electronic supplementary material). The AT content of OL was 38.8%.

Although the D-loop (control region) is highly variable, it contains a few conserved sequences such as the termination-associated sequence (TAS) and conserved sequence blocks (CSB). The TAS is identified by the presence of the characteristic sequence ‘TACATAAAAYYYAAT’ in the D-loop region. In the present study, however, T. tor showed only a partially conserved sequence, ‘TACATAATAT’, in the D-loop. Likewise, CSBI, CSBII and CSBIII were identified on the basis of conserved sequences. Additionally, a dinucleotide repeat, (TA)13, was identified in the control region (figure ?? in electronic supplementary material). The AT content of the D-loop region (68%) was higher than the average value for the complete mitochondrial genome.
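Motif and repeat searches of this kind can be expressed as regular expressions. The sketch below is an illustrative approach, not the procedure used in the study: it expands the IUPAC code Y (C or T) in the TAS consensus into a character class and reports the longest uninterrupted run of ‘TA’ units. The function name and the toy sequences are assumptions.

```python
import re
from typing import Tuple

# TAS consensus 'TACATAAAAYYYAAT' with IUPAC Y expanded to C or T.
TAS_PATTERN = re.compile(r"TACATAAAA[CT]{3}AAT")


def longest_ta_repeat(seq: str, min_units: int = 2) -> Tuple[int, int]:
    """Return (number of TA units, 0-based start) of the longest (TA)n run.

    Returns (0, -1) if no run with at least `min_units` units is found.
    """
    best = (0, -1)
    pattern = re.compile(r"(?:TA){%d,}" % min_units)
    for match in pattern.finditer(seq.upper()):
        units = len(match.group()) // 2
        if units > best[0]:
            best = (units, match.start())
    return best


# Toy control-region fragment with a (TA)13 run and a shorter (TA)3 run.
toy_region = "GGG" + "TA" * 13 + "CC" + "TA" * 3
print(longest_ta_repeat(toy_region))

# Toy sequence containing a full match to the TAS consensus.
print(bool(TAS_PATTERN.search("GGTACATAAAACTCAATGG")))
```

A partial match such as ‘TACATAATAT’ would not be found by the strict consensus pattern, so detecting it would require a relaxed pattern or an alignment-based comparison.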

Comparative analysis among mahseer fishes

The mitogenomes of seven mahseer species were compared to find the most variable gene fragment and the pattern of codon usage. Apart from the D-loop, the largest number of variable sites was observed in ND1 and the smallest in COIII (table 1 in electronic supplementary material). However, this variation was much lower at the protein level owing to the degeneracy of the genetic code.

Sliding window analysis showed an unexpectedly large pairwise nucleotide difference between T. tor (KP795444) and T. tor (present study: KR868704) (figure 1). Heat map analysis showed a remarkable difference in codon usage between Neolissochilus hexagonolepis and the species of ‘Tor’. Within Tor, T. tambroides and T. sinensis showed a similar trend, while a slight difference was observed between T. tor (present study: KR868704) and T. tor (KP795444) (figure 2). The mitochondrial D-loop region showed tandem repeats of ‘TA’ with varying repeat numbers. T. tambroides showed the minimum number of repeats (11), while the maximum number (14) was found in T. sinensis. The remaining species of Tor showed 13 repeats of ‘TA’ in the D-loop region.

Phylogenetic analysis

Twelve mitochondrial protein-coding genes, excluding ND6, were concatenated (~10,800 bp) and used for phylogenetic analysis. The base frequencies of the concatenated sequences were not homogeneous across taxa (P = 0.00). Sequence saturation analysis showed that the frequencies of both transitions and transversions increased linearly with the divergence value (figure 3). The index of substitution saturation (ISS = 0.3598) was significantly lower than the critical index of substitution saturation (ISSc = 0.7599), indicating little substitution saturation in the dataset. Based on the log-likelihood ratios and Akaike information criterion (AIC) values, the evolutionary model ‘general time reversible with invariant sites and gamma distribution (GTR + I + G)’ was selected for phylogenetic tree reconstruction using the maximum likelihood and Bayesian inference methods. Since the base frequencies were heterogeneous across taxa, a distance-matrix method based on the LogDet transformation was also used for phylogeny reconstruction with the neighbour-joining (NJ) method.

Except for NJ, all the character-based methods (MP, ML and BI) revealed similar tree topologies with significant bootstrap values (figure 4). In all tree topologies, mahseers (genera Tor and Neolissochilus) formed a separate clade within the subfamily Cyprininae. However, the closest relatives of this group could not be identified because taxon sampling was limited (not all species of the subfamily Cyprininae were included in the analysis). Within the mahseer group, T. putitora and T. tor (present study: KR868704) were recovered as sister species, as were T. tambroides and T. khudree. N. hexagonolepis was placed as a separate taxon within the clade of Tor species with a low bootstrap value.

The phylogenetic tree constructed by the NJ method showed a different topology, especially for the mahseer group (figure 5). In this topology, N. hexagonolepis was a basal and distinct taxon, sister to the species of Tor. Further, sister-group relationships were observed between T. putitora and T. tor (present study: KR868704) and between T. tambroides and T. tor (KP795444), with low bootstrap values. T. khudree was recovered as sister to the group of T. tor and T. tambroides with a significant bootstrap value.

Discussion

Traditionally, fish mitochondrial genomes were sequenced by Sanger sequencing after cloning different fragments of the mitochondrial genome. However, these methods are cumbersome and time consuming. High-throughput sequencing technologies combined with bioinformatics pipelines have revolutionized genomics research. Using next-generation sequencing technologies, it is now possible to sequence the mitochondrial genomes of multiple species in a single reaction through multiplexing (Tang et al. 2014). Several NGS platforms such as Roche 454 FLX, Illumina and Ion Torrent have been used to sequence mitochondrial genomes of fishes (Sahoo et al. 2015a; Kushwaha et al. 2015). In the present study, the amplicons (8–9 kb) of mitochondrial DNA were sequenced with high coverage using the Ion Torrent platform.

Mitochondrial genome size varies within the mahseer group. In the present study, the mitochondrial genome of T. tor is smaller than that of T. tambroides (16,690 bp) and larger than those of N. hexagonolepis (16,563 bp) and T. tor (16,554 bp; KP795444). This variation could be due to the presence or absence of intergenic regions and to the length of the control region (Chen et al. 2013; Marshall et al. 2013).

The gene size and arrangement are similar to that of other vertebrates (Campbell et al. 2014; Norfatimah et al. 2014). The nucleotide composition of the T. tor mitogenome is biased towards A and T, with A (31.91%) being the most abundant nucleotide while G (15.64%) is the least abundant. The A + T content at codon’s third base position of the protein-coding genes is relatively high (63%) and this observation is in congruence with other previous studies (Cheng et al. 2012; Norfatimah et al. 2014). Incomplete stop codons were observed for some of the protein-coding genes and the presence of such noncanonical stop codons is common in vertebrate mitogenomes (Wei et al. 2013). These truncated stop codons would be converted into a fully functional TAA stop codon via posttranscriptional polyadenylation (Ojala et al. 1981).

The most frequently used codon is CTA, and this codon usage bias might be directly proportional to the abundance of the available tRNAs (Lee and Kocher 1995; Xia et al. 2007; Cheng et al. 2010). The predicted tRNA structures are similar to those of other vertebrates. However, as reported previously for bony fishes, tRNASer (UGC) did not show the typical cloverleaf structure owing to the absence of a discernible DHU stem (Oh et al. 2007).

The OL showed conserved stem-loop structures, which are reported to be associated with the transition from RNA to DNA synthesis during mtDNA replication (Hixson et al. 1986). The TAS observed in mitochondrial D-loop region would play a key role in terminating the synthesis of the heavy strand (Cheng et al. 2010). The conserved sequence blocks (CSB-1, CSB-2 and CSB-3) are thought to be associated with positioning RNA polymerase for priming both replication and transcription (Clayton 1991; Shadel and Clayton 1997).

The comparative analysis of the mitochondrial genomes of mahseer fishes showed lower divergence at the cytochrome c oxidase subunit I locus than at the ND1 region. This observation suggests that the COI gene alone may not be sufficient to delineate the species of this group. High divergence values at the ND1 gene have been reported for both vertebrates and invertebrates (Rach et al. 2008; Pilgrim et al. 2012). Comparison of the mitogenomes of T. tor from the present study and the reported T. tor (KP795444, Kumar et al. 2015) showed a significant divergence value, which was supported by the sliding window and codon usage analyses. The systematics of mahseer fishes is ambiguous, and species identification based on morphological characters alone can lead to misidentification. We hypothesize that the reported specimen might belong to a different species or subspecies with morphology similar to that of T. tor.

Previously, most of the phylogeny studies of mahseer species have been based on limited genes/markers (Nguyen et al. 2009; Laskar et al. 2013). After the first description of the genus of Tor (Gray 1834), Rainboth (1985) erected new genus Neolissochilus on the basis of absence of median lobe, number of gill rakers and pharyngeal arches. Several studies have shown the monophyletic nature of Neolissochilus genus and sister group relationship with the species of Tor (Nguyen et al. 2009; Laskar et al. 2013; Khare et al. 2014; Raman et al. 2015). In the present study also, the codon usage analysis showed the distinct codon usage pattern for N. hexagonolepis compared to other species of Tor genus. However, in the phylogeny studies, except NJ, all other methods generated tree topologies, where the N. hexagonolepis formed as a distinct taxon within Tor species clade with low bootstrap value. Several previous studies on mitochondrial genome of mahseer fishes have also reported the same relationship with maximum likelihood method (Kumar et al. 2015; Sahoo et al. 2015b). It has been reported that sophisticated models with large number of parameters may not always generate accurate tree topology especially with large dataset. Sometimes simple tree building methods based on distance (logdet) can generate reliable tree topologies (Russo et al. 1996; Tamura et al. 2004).

In all tree topologies, a sister-group relationship was observed between T. putitora and T. tor, in accordance with previous studies (Laskar et al. 2013; Khare et al. 2014). Sahoo et al. (2015b) have also shown a sister-group relationship between T. tor (KP795444) and T. putitora, but with a low bootstrap value (<75%). Interestingly, T. tor (KP795444) reported by Kumar et al. (2015) did not cluster with either T. tor (present study) or T. putitora. The reason could be that the T. tor specimen (KP795444) is a misidentified species or a distinct genetic lineage (subspecies), and additional markers (nuclear and morphological) are required for further classification of this taxon. Several studies using complete mitochondrial genome sequences have reported the occurrence of different species or genetic lineages within a species complex (Morin et al. 2010). Cheng et al. (2012) found unexpected divergence between two Collichthys niveatus mitogenomes and proposed that species misidentification could be the reason for this high divergence value. As reported in previous studies, another sister-group relationship was observed between T. khudree and T. tambroides (Nguyen et al. 2009; Raman et al. 2015). Biogeographical analysis has also indicated that these two species originated during the lower Pleistocene through a sequence of dispersal and vicariance events (Nguyen et al. 2009; Khare et al. 2014).

In conclusion, the present study reports the complete mitochondrial genome sequence of T. tor obtained using the Ion Torrent platform. The comparative and phylogenetic analyses revealed a genetically distinct specimen of T. tor. The data from the present study could be useful for further exploration of the mitogenomics of other Tor species and for a better taxonomic understanding of the group.