Introduction

Mango (Mangifera indica L.), a member of family Anacardiaceae is an important fruit crop which is commercially grown in over hundred tropical and subtropical countries (Mukherjee and Litz 2009). According to Food and Agriculture Organization (FAO) of United Nations, after banana, mango is the dominant tropical fruit variety produced worldwide, followed by pineapples, papaya and avocado (http://www.fao.org/docrep/006/y5143e/y5143e1a.htm). Major mango growing countries are India, China, Thailand, Pakistan, Australia, Indonesia, Bangladesh, Philippines, Nigeria, Myanmar and Egypt.

The mango tree is considered to have evolved in the rainforests of South and South-east Asia (Krishna and Singh 2007). Full-grown mango trees reach a height of 40 m and can stay alive for several 100 years. Mango leaves are exstipulate, simple and usually alternate; depending on the cultivar, leaf morphology is highly variable (Mukherjee and Litz 2009). The inflorescence of mango flowers is rigid and erect, usually 30 cm long, widely branched and is always cymose. Mango fruit varieties have been known for attractive colours, savouring smell, delightful taste and high nutritional value (Mukherjee and Litz 2009). Mango is an ever green dicot angiosperm. Although several tetraploid individuals were reported, mango is usually a diploid tree (Mukherjee 1950; Duval et al. 2005; Viruel et al. 2005; Schnell et al. 2005, 2006). It has 2n = 40 chromosomes with estimated genome size of 441 mega basepairs (http://data.kew.org/cvalues/).

During last few years, a number of reports addressed the genetic diversity of mango for application in cultivar identification using PCR and sequencing based techniques viz. RAPD, ISSR, DAMD etc. (Srivastava et al. 2012; Chinag et al. 2012; Souza et al. 2011; Rocha et al. 2012; Ravishankar et al. 2011; Hirano et al. 2010; Khan and Azim 2011). Despite its global importance, genomic sequence resources available for the mango tree are scarce. As of October 2013, there are only 684 highly redundant sequence entries in the GenBank for mango. Large scale discovery and characterization of functional genes via genome sequencing or global exploration of the transcriptome are required for better understanding of fundamental molecular biology of mango.

Recently development of RNA sequencing (RNA-seq) methodology has facilitated the analysis of transcriptomes of a number of crop and medicinal plants (Xu et al. 2013; Duangjit et al. 2013; Sara et al. 2010). RNA-seq is characterized by sequencing of the transcriptome using massively paralleled next generation DNA sequencing technology. It is among the most popular techniques of NGS (Strickler et al. 2012). RNA-seq generate millions of short cDNA reads which either aligned to a reference genome or reference transcripts, or assembled de novo to produce a genome-scale transcription map that consists of both the transcriptional structure and/or level of expression for each gene (Mortazavi et al. 2008). Sequencing of RNA has long been recognized as an efficient method for gene discovery and remains the gold standard for annotation of both coding and non-coding genes (Adams et al. 1991; Haas and Zody 2010). Furthermore, the RNA-seq method offers a holistic view of the transcriptome, revealing many novel transcribed regions, splice isoforms, single nucleotide polymorphisms (SNPs) and the precise location of transcription boundaries (Li et al. 2010; Wilhelm et al. 2010). RNA-seq is expected to revolutionize the manner in which eukaryotic transcriptomes are analyzed (Wang et al. 2009).

Here we report the characterization of mango leaf transcriptome and chloroplast genome using next generation sequencing. We generated over 1.0 billion bases of high quality DNA sequence of mango using RNA-seq technology and demonstrated the suitability of short-read sequencing for de novo assembly and annotation of genes without prior genome information. Moreover, we also sequenced the mango chloroplast genome with the help of a blend of Sanger and next generation sequencing. The results provide a cost effective and efficient way to global discovery of new functional genes in mango.

Materials and methods

Plant materials

Mangifera indica cultivar Langra used in this study was grown in Botanical Gardens of University of Karachi, Karachi, Pakistan. The leaf specimen of the tree used is preserved in the Herbarium of Department of Botany, University of Karachi, Karachi, Pakistan. Healthy and mature leaves from same tree were taken for RNA-seq and chloroplast DNA sequencing.

RNA isolation, cDNA synthesis and sequencing

Total RNA was isolated from mango leaves using RNAesy kit (Qiagen GmbH, Hilden, Germany). RNA integrity was confirmed using a 2100 Bioanalyzer (Agilent Inc., USA). Beads with oligo(dT) were used to isolate poly(A) mRNA from total RNA (Qiagen GmbH, Hilden, Germany). The purified mRNA was fragmented into short fragments using divalent cations under elevated temperature. The cDNA was synthesized with random hexamer primers and mRNA fragments as templates using Superscript®III Reverse Transcriptase (Invitrogen, Carlsbad, CA, USA). Short fragments were purified with the PCR extraction kit (Qiagen GmbH, Hilden, Germany) followed by end repair and poly(A) addition. The short fragments were then connected with sequencing adapters. The paired-end library (with 200 bp insert size) was prepared following the manufacturer’s protocol (Illumina Inc., San Diego, CA, USA). Finally, the library was sequenced using Illumina HiSeq2000 (Illumina Inc., San Diego, CA, USA). The library was linked the flow-cell containing complementary adapters, and then bound fragments were amplified to create ‘clusters’. The adapters were designed to allow selective cleavage of the forward DNA strand after resynthesis of the reverse strand during sequencing. The copied reverse strand was then used to sequence from the opposite end of the fragment. The raw reads were cleaned by removing adaptor sequences, empty read and low quality sequences.

Transcriptome de novo assembly

Transcriptome denovo assembly was carried out with short reads assembling program—Trinity (Grabherr et al. 2011). Trinity combines three independent software modules: Inchworm, Chrysalis, and Butterfly. Trinity applies these programs one after the other to process large volumes of RNA-seq reads. Firstly, Inchworm assembles the RNA-seq data into the unique sequences of transcripts, often generating full-length transcripts for a dominant isoform, but then reports just the unique portions of alternatively spliced transcripts. Secondly, Chrysalis clusters the Inchworm ‘contigs’ into clusters and constructs complete de Bruijn graphs for each cluster. Each cluster represents the full transcriptional complexity for a given gene (or sets of genes that share sequences in common). Chrysalis then partitions the full read set among these disjoint graphs. Finally, Butterfly processes the individual graphs in parallel, tracing the paths that reads and pairs of reads take within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms, and teasing apart transcripts that corresponds to paralogous genes. The resultant sequences are termed as ‘unigenes’.

Annotation and classification of unigenes

Mango unigenes were analyzed by BLASTN (Zhang et al. 2000) against the NR database (NCBI non-redundant sequence database) with an E-value cut-off of 10−5. The coding sequences in mango unigenes were also analyzed using BLASTN against genomic sequence datasets of Citrus sinensis (sweet orange), Populus trichocarpa (black cottonwood), Vitis vinifera (Grapevine), Ricinus communis (castor bean), Glycine max (soya bean), Medicago truncatula (Barrel Medic) and Arabidopsis thaliana. Unigene sequences were further aligned by BLASTX to protein databases; SwissProt, KEGG (Kanehisa et al. 2008) and COG. This step retrieved proteins with the highest sequence similarity with the given unigenes along with their functional annotations. In case of disagreement between databases, a priority order of NR, Swiss-Prot, KEGG and COG was followed. For unigenes that did not align to any of the above databases, ESTScan software (Iseli et al. 1999) was used to predict their coding regions and decide sequence direction.

The Blast2GO and WEGO programs were used for GO functional annotations, KEGG and COG analysis of mango unigenes (Conesa et al. 2005; Ye et al. 2006). The analysis mapped annotated unigenes to GO terms and calculated the number of unigenes associated with every term.

Comparison with Genbank M. indica sequence entries

Six hundred and eighty four mango sequences (ESTs and nucleotide sequences) were downloaded from the GenBank and used for nucleotide BLAST search against 30,509 unigenes using an E-value cut-off of 10−5.

Mango chloroplast genome sequencing

A combination of Sanger-based and next-generation sequencing strategies were used for mango chloroplast DNA (cpDNA) sequencing. The mango leaves (5.0 grams) were used for isolation of total DNA using AxyGen multisource DNA miniprepration kit (Axygen Scientific, USA). Initially, a primer walking strategy termed as “ASAP: amplification, sequencing and annotation of plastomes” (Dhingra and Folta 2005) was used for amplification and Sanger-based sequencing of inverted repeat (IR) and large single copy (LSC) regions of cpDNA. Briefly, purified mango DNA was used for generation of 6.0 kb amplicons with consensus set of primers (Supplementary data) (Dhingra and Folta 2005). The 6.0 kb amplicons were then used for generation of 1.0 kb fragments using internal sets of primers (Supplementary data) corresponding to 6.0 kb amplicons. Later on, gap filling primers were designed to fill the gaps within the inverted repeat region (Supplementary data). The Sanger-based sequencing of the above mentioned fragments was carried out by CEQ8000 Genetic Analyzer (Beckman Coulter Inc., USA). For cycle sequencing reactions, the DTCS kit (Beckman Coulter Inc., USA) was used, with conditions as recommended by the suppliers. Further mango cpDNA sequencing was carried out by next-generation sequencing technology of GS FLX System (Roche Inc., USA) using Titanium Mini kit and 7.0 μg of purified DNA.

The sequences obtained from Sanger-based sequencing were assembled using the Lasergene package version 7.1 (DNASTAR Inc., Madison, WI, USA). The sequencing data from the GS FLX system was assembled using CLC Genomics Workbench version 3.5.1 (CLCbio, Denmark). The assembled sequences obtained from Sanger-based and next generation sequencing were combined using CLC Genomics Workbench (CLC bio, Denmark). Genome annotation was performed through the DOGMA server (Dual Organellar Genome Annotator; (Wyman et al. 2004), ORF Finder (http://www.ncbi.nlm.nih.gov/projects/gorf/), and BLAST (Altschul et al. 1990). Repeat analysis was performed using the REPuter program (Kurtz et al. 2001). A circular genome map of mango cpDNA was constructed using the GenomeVx tool (Conant and Wolfe 2008). Construction of multiple alignments and phylogenetic trees of complete cpDNA sequences was carried out by the mVISTA comparative genomics tool (Frazer et al. 2004).

Results and discussion

RNA-seq and de novo assembly of mango transcriptomic sequences

To characterize the mango transcriptome, total RNAs were isolated from leaves. After DNase treatment and confirmation of RNA integrity using bioanalyzer, total RNA was used for mRNA preparation, fragmentation and cDNA synthesis. After cleaning and quality checks, Illumina NGS sequencing generated 12,153,196 sequence reads, encompassing 1,093,787,640 nucleotides, with each sequence read averaging ca. 90 bp in length (Table 1). This dataset has been submitted to the NCBI Short Read Archive with accession number SRR947746.

Table 1 Output statistics of mango RNA-seq experiment

Transcriptome de novo assembly was carried out with short reads assembling program—Trinity (Grabherr et al. 2011). Using this method, initially 85,651 contigs with a mean length of 238 were generated (Table 2). As shown in the contig length distribution in Fig. 1, the contig count is inversely proportional to length. Later on, the contigs sequences were assembled into 30,509 unigenes. Mean size of unigenes was 536 bp with lengths in the range of 300 to >3,000 bp (Fig. 2). Sum of all unigenes length was 16,354,267 nucleotides with depth of coverage of 66.8× [depth of coverage was calculated by dividing number of clean nucleotides i.e. 1,093,787,640 by sum of nucleotides in 30,509 unigenes (16,354,267 nt)]. The unigenes were divided into two clusters. In cluster-1, unigenes with sequence homology >70 % were grouped (prefixed CL); whereas cluster-2 contained singleton unigenes (prefixed unigene). The assembled mango transcriptome sequences have been deposited in the Genbank with Transcription Shotgun Assembly number SUB363843.

Table 2 Statistics of assembly of NGS reads using Trinity
Fig. 1
figure 1

Length distribution of contigs obtained after assembly of mango RNA-seq reads

Fig. 2
figure 2

Length distribution of assembled mango Unigene sequences

Comparison of assembled unigenes with mango sequences in Genbank

As of October 2013, the Genbank contained 684 sequence entries of mango (both gene sequences and ESTs); many of them are redundant, analyzed in the frame of phylogenetic studies and yet unpublished. After removing the redundancy, Genbank mango sequences were divided into complete and partial gene sequences. Subsequently, 30 and 59 nonredundant complete and partial mango gene sequences were found respectively (indicating 89 nonredundant mango gene sequences in Genbank). These sequences were submitted to the BLAST searches against 30,509 unigenes in present mango dataset. Out of 89 nonredundant mango gene sequences in Genbank, 76 (85.4 %) matched with assembled unigenes with a cutoff E-value of 10−5. This analysis provided an evaluation of the quality of unigene sequences in present dataset. Further analysis showed that majority of matched sequences had percent identities more than 95 % and E-value lower than 10−40 indicating reliability of alignment.

Recently mass spectrometry based proteomic analysis identified 538 proteins in mango leaves (Renuse et al. 2012). This mango leave proteome data contained 151 nonredundant protein sequences (length of the peptides used for protein matching were in the range of 09–34 amino acids; mean = 17). Comparative analysis showed that the present mango leaf transcriptome dataset was in full agreement with proteomic data. The integration of proteomic and transcriptomic data further evaluate the quality of assembled unigenes.

Annotation of unigenes

BLAST searching of mango unigene sequences against nucleotide databases (NR and NT), protein database Swiss-Prot, as well as KEGG and COG was performed with Evalue cutoff 10−5. BLAST searches against NR database annotated 24,593 unigenes out of total 30,509 (80 %) unigenes (Table 3). Sequence analyses with NR database and several Viriplantae species genomic datasets showed that 37 % of C. sinensis coding region sequences matched with mango sequences, followed by P. trichocarpa (22.5 %), V. vinifera (18.3 %), R. communis (17.9 %), G. max (17.3 %), M. truncatula (10.5 %) and Arabidopsis thaliana (6.7 %) (Fig. 3; Table 4) (10.4 % of mango transcriptome sequences matched with sequences of other species).

Table 3 Summary of mango unigene annotation with the nucleotide and protein sequence databases NR, NT, Swiss-Prot, KEGG, COG and GO
Fig. 3
figure 3

Annotation of mango unigenes with nonredundant nucleotide database NR. Pie diagram showing a E-value distribution, b percentage identity distribution and c species distribution of mango unigenes

Table 4 Summary of BLASTN searches of mango coding region sequences (n = 24,642) with the coding sequences of seven Viridiplantae species

Based on sequence homology, 21,054 mango unigenes were categorized into 57 functional groups, belonging to three main GO ontologies: molecular function, cellular component and biological process (Table 5). The results showed a high percentage of genes from categories of “cellular process”, “metabolic process”, “cell/cell part”, “organelle”, “catalytic”, and “binding” with only a few genes related to “locomotion” and “nucleoid”. On the other side, genes were not grouped in the categories of “cell killing”, “extracellular matrix”, “metallochaperone activity”, “nutrient reservoir activity”, “protein tag” and “translation regulator” (Table 5).

Table 5 Gene ontology (GO) classification of mango transcriptome unigenes

To further evaluate the function of the assembled unigenes, we searched the annotated unigenes involved in Clusters of Orthologous Groups (COG). Out of 24,593 NR Blast hits, 7,594 unigenes had a COG classification (Fig. 4). Among the 25 COG categories, the cluster for “general function prediction” represented the largest group, followed by three categories related to transcription, translation and posttranslational modification (categories J, K and O; see Fig. 4). Other most predicted gene functions were replication, recombination, repair and signaling (categories L and T; see Fig. 4). The categories “cell motility”, “extracellular structures” and “nuclear structure” were least represented groups.

Fig. 4
figure 4

COG functional classification of mango Unigene sequences. A RNA processing and modification; B Chromatin structure and dynamics; C Energy production and conversion; D Cell cycle control, cell division, chromosome partitioning; E Amino acid transport and metabolism; F Nucleotide transport and metabolism; G Carbohydrate transport and metabolism; H Coenzyme transport and metalbolism; I Lipid transport and metabolism; J Translation, ribosomal structure and biogenesis; K Transcription; L Replication, recombination and repair; M Cell wall/membrane/envelope biogenesis; N Cell motility; O Posttranslational modification, protein turnover, chaperones; P Inorganic ion transport and metabolism; Q Secondary metabolites biosynthesis, transport and catabolism; R General function prediction only; S Function unknown; T Signal transduction mechanisms; U Intracellular trafficking, secretion, and vesicular transport; V Defense mechanisms; W Extracellular structures; Y Nuclear structure; Z Cytoskeleton

To identify active biochemical pathways in mango, the unigenes were mapped to the reference canonical pathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al. 2008). The KEGG database contains systematic analysis of inner-cell metabolic pathways and functions of gene products, which aid in studying the complex biological behaviors of genes. In total, 13,561 unigenes were assigned to 293 KEGG pathways. The pathways with most representation were “metabolic pathways”, “photosynthesis”, “transcription/translation”, “DNA repair/recombination”, “signal transduction” and “cell cycle”. Considerable numbers of unigenes were identified as “vitamin metabolism”, “N- and O-linked glycan biosynthesis”, “ubiquitin mediated proteolysis/proteasome”, “endocytosis” and “circadian rhythm”. KEGG analysis discovered an array of biochemical pathways involved in biosynthesis of secondary metabolites/natural products known to participate in plant hormonal, flavors, aromatic and other cellular processes.

Characterization of natural products biosynthetic pathways

Annotation of mango unigenes identified many genes involved in phenylpropanoid, flavonoid, flavone/flavonol, isoflavonoid, terpenoid and carotenoid biosynthetic pathways (Pandit et al. 2010; Andrade Jde et al. 2012).

Phenylpropanoids comprise a large group of plant based natural compounds, derived from phenylalanine (Michal 1999). First step in the phenylpropanoid biosynthesis pathways is formation of cinnamic acid from phenylalanine which leads the formation of cinnamoyl-CoA, p-coumaroyl-CoA, feruloyl-CoA and sinapoyl-CoA. These CoA activated compounds are starting metabolites for the synthesis of lignins (the second most frequent class of compounds in biosphere after cellulose), flavonoids, flavones and flavonols (Michal 1999). KEGG analyses of mango transcriptome sequences revealed presence of 13 genes involved in biosynthesis of above mentioned CoA activated compounds. Analysis showed active pathways for the synthesis of several lignins in mango including guaiacyl lignin, 5-hydroxyguaiacyl lignin, synringyl lignin, p-hydroxyphenyl lignin.

Several phenylpropanoid and flavonoid biosynthetic pathways genes found in the mango transcriptome dataset were methyltransferases. Mining of mango sequence data revealed >200 unigenes corresponding to different classes of methyltransferases. The methyltransferases transfer a methyl group from a donor to an acceptor. In plants, methylation occurs on nucleotide bases in DNA and amino acids in proteins (Lam et al. 2007). This process is involved in many cellular functions including epigenetic modification and gene regulation. Methyltransferase are also involved in methylation of plant secondary metabolites (such as salicyclic acid) that are important contributors to taste and aroma of many fruits and flowers (Tieman et al. 2010). Identification of >200 methyltransferases in mango provided a wealth of data related to this important class of enzymes. The high number of unigenes encoding methyltransferases indicated variability in structures and functions which in turn reflected the diversity in epigenetic mechanisms and secondary metabolites turnover in mango.

Moreover, mango transcriptome dataset contained 12 enzymes required for the synthesis of following flavonoids (including flavones and flavonols). Pinostrobin, pinobanksin, buetin, dihydrofisetin (futin), naringenin, luteolin, afzelechin, epiafzelechin, catechin, epicatechin, homoeriodictyol, eriodictyol, quercetin and garbanzol (Table 6). Several of these flavonoids are known to exhibit antioxidant, anti-inflammatory, antimutagenic etc. properties (see Table 6 for references). For instance, butein is proposed in the treatment of breast cancer due to its ability to inhibit aromatase in the human body (Wang 2005). Interestingly, two compounds in this list (i.e. homoeriodictyol, eriodictyol) are bitter-masking flavonones (Ley et al. 2005).

Table 6 Analysis of mango transcriptome sequences revealed active biosynthetic pathways for bioactive flavonoids mentioned in the table

The present dataset contained sequences encoding enzymes for biosynthesis of precursors in Isoflavonoid biosynthetic pathway (i.e. liquiritigenin and naringenin), Flavone and flavonol biosynthetic pathway (i.e. apigenin and kaemferol) and Anthocyanin biosynthetic pathway (i.e. cyanidin).

A number of terpene and benzenoid metabolism related genes including geranylgeranyl pyrophosphate synthase, Farnesyl pyrophosphate synthase and isochorismate hydrolase were also found in mango unigenes. Terpenoids and benzenoids are important natural products in mango. These genes are reported to involve in fruit ripening process (Pandit et al. 2010; Kulkarni et al. 2013). However, they are not limited to tissues of fruit ripening and transcribe in leaf tissues as well. Since most these genes are comprised of large families, most likely different genes would be expressed in leaf and fruit tissues.

Twenty-two enzyme sequences involved in terpenoid backbone biosynthetic pathways were present in mango unigenes. Mevalonate pathway for the synthesis of Isopentenyl-pyrophosphate was found active in mango. Moreover, enzymes required for syntheses of other key intermediates i.e. geranyl-pyrophosphate, farnesyl-pyrophosphate and geranyl-geranyl-pyrophosphate were also present. The farnesal biosynthetic pathway from farnesyl-pyrophosphate was also found active as all required enzymes were present. Moreover, enzymes for the synthesis of dehydrodelichol-pyrophosphate from farnesyl-pyrophosphate; phytyl-pyrophosphate and nona-prenyl-pyrophosphate from greanyl-geranyl-pyrophosphate were also present. Among several monoterpenoid biosynthetic pathways, enzyme for production of Linalool from geranyl-pyrophosphate was found. As a terpene alcohol, Linalool has a pleasant scent and found in many flowering and spice plants (Lewinshon et al. 2001). Triterpenoid biosynthetic pathway from farneyl-pyrophosphate was found active in mango. However, enzymes needed for sesqui terpenoid biosynthesis was not detected in present dataset.

Plant hormone signal transduction pathways in mango

The subsystem based gene annotation identified active presence of a number of plant hormone signal transduction pathways in the present dataset. These plant hormones included auxin, cytokinin, gibberillin, abscisic acid, ethylene, brassinosteroid, jasmonic acid, salicylic acid. Genes encoding receptors, enzymes and others proteins of these signaling pathways and found in the present mango dataset are given in Table 7.

Table 7 List of genes involved in plant hormone signaling found in mango mango transcriptome dataset

From work in model plants, it is known that the ethylene receptors (ER) are negative modulators of the plant hormone ethylene, and therefore likely to play an important part in plant cell physiology. In Arabidopsis, ER is perceived by a family of 05 receptors, divided into two subfamilies (Bleecker et al. 1998). The type-I subfamily include ETR1 and ERS1 and the type-II subfamily receptors include ETR2, ERS2 and EIN4. Mining of mango transcriptome dataset identified a total of five receptor genes. Multiple alignment and phylogenetic comparisons of mango ER sequences with representative ER sequences from other fruit species identified two ETR1 genes (CL3814 and CL5014), one ETR2 gene (CL3814), ETR2-like gene (Unigene15424) and EIN4 gene (Unigene2479) each (Fig. 5). Hence the analysis showed that both ER subfamily members were present in mango. This study demonstrates the potential for the whole genome sequence to be used as a resource for characterization of large multi-gene families in mango.

Fig. 5
figure 5

Phylogenetic tree generated based on multiple alignment of 12 representative ethylene receptor (ER) sequences and five mango ER sequences found in present RNA-seq dataset

Proteolytic enzymes in mango

Proteases or peptidases are involved in myriad important cellular functions. Few studies have been reported on proteolytic enzymes in mango (Mehrnoush et al. 2012). The current mango transcriptome analysis revealed 235 unigenes (0.8 % of all unigenes) corresponding to proteases. These unigenes were grouped in five catalytic classes of peptidases/proteases. Number of unigenes related to different classes of peptidases were as follows; serine peptidases (n = 89), metallo peptidases (n = 72), cysteine peptidases (n = 25), aspartic peptidases (n = 40) and threonine peptidases (n = 09) (Table 8). This list contains several plant-specific proteases including Xylem serine protease, thalakoidal processing peptidase, Chloroplast processing peptidase, and Germination-specific cysteine protease. Highest number of unigenes for a specific protease was estimated for ATP dependent protease ClpAP (n = 36). The ClpAP is a two component system energy dependent protease system composed of ClpP, the peptidase and ClpA, the ATPase. ClpAP is involved in intracellular proteolysis. We found 25 unigene sequences of ClpP (both cytosolic and chloroplast) which indicated presence of several isoforms of ClpP probably due to alternative splicing. The ClpP isoforms would be responsible for degradation of different proteins due to variation in substrate specificity as a result of sequence variation at the substrate binding cleft of ClpP. Mango contained several unigenes related to papain-like cysteine proteases. These include actinidin homologues with >65 % sequence similarity (CL3299 and Unigene4699) and cathepsins L/H homologues with 40 % sequence similarity (CL3398, CL3842 and CL4516).

Table 8 Proteolytic enzymes found in mango transcriptome dataset

Stress response genes

Several stress response genes identified in mango unigene sequences included metallothionein (04 unigenes), Ubiquitin-protein ligase (03 unigenes), cysteine proteinase inhibitor (03 unigenes), FtsJ-like methyltransferase (several unigenes), 14-3-3 protein (12 unigenes), small heat shock protein (04 unigenes) and chitinase (13 unigenes). Two out of four metalothionin (MTH) sequences (i.e. Unigene10572 and Unigene12274) were classified as plant MTH type 1. Sequence motif analysis showed that Unigene12272 codes for MTH type-2 and Unigene10535 codes for MTH type-3 protein. Therefore present data provided evidence of presence of type-1, -2 and -3 MTH in mango.

Mango chloroplast genome

We carried out chloroplast genome sequencing of mango using Sanger-based and next-generation sequencing methods. Chloroplast genome contains a pair of inverted repeat (IR) regions separated by small and large single copy regions (SSC and LSC). Initially, 22,918 bp of the inverted repeat (IR) region were sequenced using the ASAP protocol (Dhingra and Folta 2005). For this, primers reported by Dhingra and Folta (2005) were used. Additionally, gap filling primers was designed. Consequently, complete IR region of mango cpDNA was sequenced with size of 27,093 bp. Furthermore, 5,783 bp of LSC region was also sequenced using same strategy. Therefore, collectively 32,876 bp of complete IR region and partial LSC region of mango chloroplast genome was obtained using Sanger-based sequencing.

To get more sequence coverage, the mango cpDNA was subjected to pyrosequencing based 454 technology adopted in GS FLX genome sequencer. The raw data of GS FLX was analyzed to get finished sequences. GS FLX sequencer generated sequence data of 10.573 megabases containing 26,988 reads with average length of 400 nucleotides. This data along with the 32,876 bp sequence obtained from Sanger sequencing was used to generate a circular map of mango chloroplast genome containing 151,173 base pairs. However, a complete cpDNA sequence could not be obtained as the 151,173 bp mango cpDNA sequence contained 30 gaps when compared to Citrus cpDNA (17 gaps located in intergenic homopolymer repeat regions whereas 13 intragenic gaps). Preliminary sequence comparison revealed close relationship of mango and C. sinensis cpDNA sequences. The chloroplast genome size of C. sinensis is 160,129 base pairs. This comparison indicated that 95 % of mango chloroplast genome sequences have been obtained resulting from the current study. The draft sequence of mango chloroplast genome was submitted in GenBank with accession number FJ212316.

The available sequence of mango cpDNA (151,173 bp; GC 38.18 %) contains a pair of inverted repeats (IRA and IRB) of 27,093 bp separated by small and large single copy (SSC, LSC) regions, respectively (Fig. 6). The inverted repeat of mango chloroplast is larger as compared to several previously reported angiosperms for example length of cpDNA inverted repeat of A. thaliana is 26,264 bp (Sato et al. 1999), Solanum tuberosum is 25,595 bp (Chung et al. 2006), Nicotiana tabacum is 25,339 bp (Shinozaki et al. 1986) and Gossypium barbadense is 25,591 bp (Ibrahim et al. 2006). The length of mango cpDNA IR region is only 97 base pairs larger compared to its closest neighbour C. sinensis which has 26,996 bp (Bausher et al. 2006). This finding support the previous observations that contraction and expansion of IR region of chloroplast DNA is a major source of size variation of cpDNA among angiosperms (Chung, et al. 2006).

Fig. 6
figure 6

Circular map of mango chloroplast DNA indicating LSC, SSC and IR regions and the color scheme for gene indication. The map has been constructed by GenomeVx at http://wolfe.gen.tcd.ie/GenomeVx/ (Conant and Wolfe 2008)

A total of 139 genes were detected in mango cpDNA sequence (119 single copy genes while 20 duplicated genes in inverted repeat regions). 91 genes code for proteins, including nine duplicated genes in the inverted repeats. There were four rRNA genes and 29 distinct tRNAs, 7 of which are duplicated in the inverted repeat. Notably, M. indica cpDNA contains the infA gene which code for a translation initiation factor and not present in its closest neighbour Citrus genome (Bausher et al. 2006). The detailed list of gene contents is present in Table 9.

Table 9 Genes encoded by mango chloroplast DNA

Chloroplast genome-wide comparative analysis was carried by multiple alignment of cp DNA sequence from mango and 16 representatives of land plants and 2 representatives of Chlorophyta using VISTA web server (Asif et al. 2013; Khan et al. 2012; Frazer et al. 2004). The phylogenetic tree also indicated grouping of mango cpDNA sequences with C. sinensis (citrus sp.), Gossypium hirsutum (cotton) and V. vinifera (red grape). However, the most closely related sequence is C. sinensis cpDNA (Fig. 7).

Fig. 7
figure 7

Phylogenetic tree of cpDNA from 19 plant species (including mango) using VISTA comparative genomics server (http://genome.lbl.gov/vista/mvista/submit.shtml)

Repeats in M. indica chloroplast genome

Along with two large inverted repeats in chloroplast genome sequences i.e. IRA and IRB, a large number of relatively small repeats have been recently observed (Haberle et al. 2008). In mango chloroplast genome 51 direct repeats (Supplementary data) could be found using RePuter server http://bibiserv.techfak.uni-ielefeld.de/reputer/submission.html (Kurtz et al. 2001). The RePuter program provides software solutions to compute and visualize repeats in whole genomes or chromosomes. It provides interactive as well as static images of the results. Figure 8 indicates the position of repeats along with their sizes and orientation. RePuter analysis of mango cpDNA revealed 15 repeats of size >50 bp while rest were less than 50 bp. These repeats occur as the part of genes as well as intergenic spacers. It is interesting to note that presence of hundreds of repeats in Trachelium caeruleum chloroplast genome is supposed to be associated with extensive rearrangements (Haberle et al. 2008).

Fig. 8
figure 8

Diagrammatic representation of short repeats in mango chloroplast genome sequence. The graph outlines the length and location of repeats. The lines indicating repeats are colored according to length. Each part of a repeat is displayed on a separate strand to keep the starting position visible

Conclusion

Mango genomic research lags that of other crops of economic importance. To facilitate biochemical and molecular biological research in mango, we characterized mango leaf transcriptome and cpDNA. Most of the resultant unigenes were aligned with mango sequences deposited in the Genbank, whereas all of coding regions in unigenes were matched with proteins sequences identified in mango proteomic dataset. Subsystem based gene annotation provided information for the production of a number of bioactive flavonoids, carotenoids and terpenoids in mango. These bioactive natural products are known to have a range of health beneficial properties. The large number of unigenes identified in this study provides an important resource for future studies on mango biology. The advancements in transcriptomic, genomic, epigenomic, proteomic resources of non-model plants would greatly facilitate research in plant biology.