Introduction

Sweet potato (Ipomoea batatas, L.), a perennial dicotyledonous plant, is one of the most important economic food crops around the world. The tuberous root of the sweet potato has high-starch content and sweet taste, is largely used as a staple food, and is also used for animal feed and industrial feedstock and as raw material for biofuel production. Except for starch, sweet potato leaves are also rich in complex carbohydrates, dietary fibers, β-carotene, vitamin C, and vitamin B6 (Bovell-Benjamin 2007). Interestingly, scientific analyses have proposed that the sweet potato ranks very high among all other vegetables in terms of its anticancer effect, even higher than ginseng, due to the high content of secondary metabolites and anthocyanins in its leaves (Johnson and Pace 2010). Sweet potato leaves contain at least 15 biologically active anthocyanins that have significant medicinal value. Anthocyanins are the predominant group of visible polyphenols that comprise the red, purple, and blue pigmentation of many plants (Johnson and Pace 2010) and have therapeutic effects on a series of diseases and cancers. Therefore, apart from producing edible pigments and food colorants, the sweet potato has great potential as a candidate for developing drugs, such as antineoplastic agents, antiinflammatory agents, and antioxidant agents (Bovell-Benjamin 2007; Johnson and Pace 2010). According to data on anthocyanin biosynthesis in tomato (Butelli et al. 2008), transgenic tobacco (Aharoni et al. 2001), and red apple (Takos et al. 2006), at least three transcription factors including MYB, basic helix-loop-helix (bHLH) and WD40, are involved in regulating anthocyanin biosynthesis. To date, two MYB genes, IbMYB1 and IbMYB2, have been isolated from the purple sweet potato (Ipomoea batatas L. Lam. cv Ayamurasaki). IbMYB1 has been implicated to control anthocyanin biosynthesis specifically in the flesh of tuberous roots due to its absence in nontuberous roots, stems, leaves, or flowers (Mano et al. 2007). In addition, the sweet potato is recognized to be tolerant to drought and high moisture, as well as pests and disease. Overall, the sweet potato is essential to world food safety, development of renewable energy, and medical applications (FAO 2010).

Although entire genomes of model plant species, such as Arabidopsis, rice, and Medicago have been sequenced in the past decade, the genome of the sweet potato is still unavailable. The sweet potato has a complex genome (hexaploid, 2n = 6x = 90), therefore genome sequencing of the sweet potato is difficult and also expensive. As a result, research on sweet potato breeding and its application has been progressing slowly. Currently, there are few molecular resources for improved breeding of sweet potato cultivars such as microsatellite (SSR) markers (Hu et al. 2004), amplified fragment-length polymorphism (AFLP) markers (Cervantes-Flores et al. 2008), expression sequence tags (ESTs), and two types of transcriptome sequencing [454 pyrosequencing (Schafleitner et al. 2010) and Illumina paired-end sequencing (Wang et al. 2010)]. SSRs comprise a class of variable number of tandem repeats (VNTR) with a repeat unit of a few nucleotides (2 nt to 6 nt) and are widely dispersed in eukaryotic genomes. There are several advantages from using SSR technology, which include co-dominant inheritance, high abundance, random distribution in the genome, high information content, and reproducibility. Thus, polymorphic microsatellites have been frequently applied to the classification and systematics, parentage identification, germplasm conservation, breeding program, and study of gene duplication or deletion (Sanchez de la Hoz et al. 1996; Andersen and Lubberstedt 2003). To date, at least 450 SSRs have been found in sweet potato by different approaches, such as screening of a small-insert genomic library, construction of a microsatellite-enriched library, and mining of EST databases (Buteler et al. 1999; Hu et al. 2004; Schafleitner et al. 2010; Wang et al. 2010). According to the derivation of SSR, there are two classes of SSRs: genomic-originated SSRs and EST-originated SSRs. ESTs provide direct or indirect evidence on gene expression in tissues, development times, and environmental responses. Hence, ESTs contribute to the discovery and study of genes in sweet potato related to the biosynthesis of starch in tuberous roots and useful secondary metabolites in leaves (Wang et al. 2010). However, to date, only 23,982 ESTs can be retrieved from the GenBank of NCBI, therefore, it is essential to perform a transcriptome analysis on the sweet potato to find more transcriptome data.

Traditional EST sequencing through constructing an EST library is time consuming as well as expensive. Fortunately, RNA sequencing (RNA-seq), is an approach that has been recently developed to profile transcriptomes in a high-throughput way using deep-sequencing technologies, such as 454 pyrosequencing and Illumina paired-end sequencing. In addition to the low cost of this technique, RNAs-seq also allows the following (Wang et al. 2009): (1) cataloguing all kinds of transcripts including mRNAs, noncoding RNAs, and small RNAs; (2) investigating the transcriptional structure of genes, splicing patterns, and gene isoforms; (3) studying posttranscriptional modification and mutations; and (4) precisely quantifying gene expression in large-scale at the same time of sequencing. Moreover, RNA-seq is genome-independent and is especially useful for analyzing the transcriptome of a species without complete genome information (Wang et al. 2009, 2010). In recent years, RNA-seq has accelerated our understanding of the complexity of gene transcription patterns, gene structure variances, and gene regulation networks (Wang et al. 2010).

In this study, we used the Illumina paired-end sequencing technology to perform RNA-seq on the tuberous roots of the purple sweet potato. Our transcriptome analysis included de novo assembly, gene function identification, and gene function classification and enrichment. A large number of new genes were detected and some of them were further identified to play roles in regulating the biosynthesis of anthocyanins and starch. According to comparison with the RNA-seq dataset of the yellow tuberous root of the sweet potato, there are significant expression differences among genes involved in the biosynthesis of anthocyanins. In addition, novel genes were also identified and are inferred to be expressed specifically in the purple sweet potato. As a result, RNA-seq not only enriches transcriptome data of the sweet potato, but it also improves our understanding of the molecular mechanisms of sweet potato development, metabolism, and its application in human health, nutrition, and pharmacy.

Materials and methods

Plant materials, RNA extraction, and RNA sequencing

A purple sweet potato (Impomoea batatas L.) cultivar “Jingshu 6” was grown in the greenhouse of the Chinese Academy of Forestry. Jingshu 6 is a hybrid from ROSA (♀) and AfedJones (♂) created by Professor Han Wang of Beijing University of Agriculture. It is dark purple when the root is fully mature. The anthocyanin content is 22 mg/100 g fresh weight of tuberous roots.

Two fresh swelling tuberous roots (diameter 3.0–3.5 cm) of purple sweet potatoes were harvested and were subsequently cleaned with sterilized water. The collected samples were then immediately frozen in liquid nitrogen and stored in a −80°C freezer for later use. According to the manufacturer’s protocol, total RNA of the sample was extracted with the BIOZOL total RNA extraction kit (Bioer Technology, Hangzhou, China). The RNA was qualified and quantified using a Nanodrop ND-1000 (Nanodrop Technologies, Wilmington, DE, USA). A total of 10 μg of total RNA was used for Illumina RNA-seq.

To construct a cDNA library for use in Illumina RNA sequencing, mRNAs were first isolated from total RNAs using beads with oligo (dT). Next, mRNAs were broken down into short fragments by adding fragment buffer. These short fragments were used as templates and a random hexamer was used as a primer to synthesize first-strand cDNA. Then, second-strand cDNA was further synthesized. Short fragments were then purified with a QiaQuick PCR extraction kit (Qiagen, Valencia, CA, USA) and were resolved in EB buffer for end reparation and the addition of a poly(A) tail. Sequencing adapters were then connected to the short fragments. After agarose gel electrophoresis, suitable fragments (200 ± 25 bp) were isolated as templates for PCR amplification. Finally, the cDNA library was constructed for sequencing using Illumina HiSeq™ 2000 (BGI, Beijing, China). The RNA-seq dataset is available at the NCBI Short Read Archive (SRA) with the accession number: SRX090758.

Pipeline of bioinformatics analysis

Raw sequence processing and de novo assembly

In order to get clean reads for de novo assembly, all raw reads from RNA-seq were filtered with the following four steps: (1) removing the reads that failed the built-in Failed Chastity Filter in the Illumina software according to the relation “failed-chastity ≤1”, using a chastity threshold of 0.6, on the first 25 cycles; (2) discarding the reads with adaptor contamination; (3) masking low-quality reads with ambiguous sequences “N”; and (4) removing the reads with more than 10% Q < 20 bases (Q = −10 × lgE) (Li et al. 2010; Wang et al. 2010).

Using SOAPdenovo, we adopted three consecutive steps to assemble clean reads into unigenes (Li et al. 2010) (Fig. 1). First, reads that overlapped with each other were assembled into longer fragments called contigs; second, reads were mapped back to contigs. At the same time, according to paired-end reads, we could deduce which contigs were derived from the same transcript sequences and what the distances were between these contigs. In this way, contigs were further assembled into scaffolds and unknown sequences in scaffolds were marked with an “N”. Finally, scaffolds were repaired with paired-end reads to get less “N”. Assembly continued until sequences could no longer be extended. The final sequences were named unigenes.

Fig. 1
figure 1

Bioinformatics analysis pipeline of RNA-seq in sweet potato

Sequence direction and CDS analysis

In order to deduce the correct transcription direction and protein sequences coded by unigenes, a BLASTX search was performed against the databases of non-redundant proteins (nr), SwissProt, KEGG, and Clusters of Orthologous Groups (COG) with an E value of 1e−5, in that order. Once a unigene had hits within a certain database, the BLASTX search against its follow-up databases was then stopped. We selected the top BLASTX result as a clue to determine the plus direction of the unigene (5′ → 3′) and to annotate its function. Meanwhile, we translated unigenes into their protein sequences according to standard genetic codes. For those unigenes having no hits with any of the four databases, their coding regions were inferred by ESTscan (Iseli et al. 1999).

Unigene expression analysis

According to the RPKM method (reads per kb per million reads) described by Mortazavi et al. (2008), unigene expression levels were calculated with the formula: RPKM(A) = (10,00,000 × C × 1,000)/(N × L) (Assigns RPKM(A) to be the expression of gene A, C to be number of reads that uniquely aligned to gene A, N to be total number of reads that uniquely aligned to all genes, and L to be the number of bases on gene A). The RPKM method is able to eliminate the influence of different gene lengths and sequencing discrepancy within the calculation of gene expression.

Gene function analysis and enrichment

We performed comprehensive function annotations on unigenes including protein sequence similarity, GO term classification, and KEGG pathway enrichment. We searched unigene sequences against protein databases (nr, SwissProt, COG, and KEGG) using BLASTX (E value < 1e−5). Protein function can be predicted from the annotation of the most homologous protein in those databases. With nr annotation, the Blast2GO program was used to classify unigenes to GO terms including molecular functions, biological processes, and cellular components (Conesa et al. 2005). Meanwhile, unigenes were enriched to different KEGG pathways according to the KEGG pathway database and nr annotation on KEGG. In this way, we could cluster multiple unigenes to the same GO terms and the same KEGG pathway (Kanehisa and Goto 2000).

SSR detection and primer design

In order to develop more potential SSR markers for sweet potato breeding, we performed SSR analyses on unigenes using a SSR finder from GRAMENE software (ftp://ftp.gramene.org/pub/gramene/software/scripts/ssr.pl). The parameters were designed for identifying perfect di-, tri-, tetra-, penta-, and hexa-nucleotide motifs with a minimum of 6, 5, 4, 4, and 4 repeats, respectively (Zeng et al. 2010). Eprimer3 from EMBOSS bioinformatics software packages (http://emboss.sourceforge.net/) (Rychlik 1995) was used to design flanking primers for detected SSRs. The major parameters for primer design were set as follows: PCR products ranging from 100 to 300 nt; primer lengths ranging from 18 to 24 nt with an optimum of 20 nt, 60°C optimal annealing temperature, and GC content from 40 ~ 65% with an optimum of 50% (Zeng et al. 2010).

Comparison with the RNA dataset of the yellow sweet potato

The RNA-seq dataset of the yellow sweet potato released by Wang and co-workers was downloaded from the NCBI Sequence Read Archive (SRA, http://www.ncbi.nlm.nih.gov/Traces/sra, Accession number: SRA022988) (Wang et al. 2010) and was assembled using the same de novo assembly method mentioned above. To evaluate the read coverage of assembled contigs, all short reads were realigned to the contigs using SOAPaligner under default parameters (Li et al. 2008). We combined our unigenes and the assembled genes from the yellow sweet potato and used Orthomcl (Li et al. 2003) to analyze orthologous genes with an E value under 1e−25 and an identity of more than 90%. According to the ortholog information, we made further comparisons on genes involved in regulating the biosynthesis of anthocyanins.

Results

RNA sequencing and de novo assembly

According to Illumina’s RNA transcriptome sequencing (BGI, http://www.genomics.cn/), a sequence can generate 2 × 75 bp from each paired-end of a cDNA fragment. A total of 25,888,890 reads with the total of 2,330,000,100 nt and an average GC content of 48.92% yielded from fresh tuberous roots of the sweet potato. We defined the reads with Q ≥ 20 (see “Materials and methods”) and no ambiguous sequences “N” as high-quality reads. There were 24,501,246 of 25,888,890 (94.64%) reads with Q ≥ 20 and 259 of 25,888,890 (0.01%) reads with “N”. After cleaning raw reads, 24,501,246 high-quality reads were assembled to 473,238 contigs (Table 1) using SOAPdenovo (Li et al. 2010). The length of contigs ranged from 75 nt to 7,479 nt with an average length of 138 nt. 141,413 of 473,238 (30.0%) contigs were more than 100 nt in length (Table 1).

Table 1 Length distribution of assembled contigs, scaffolds, and unigenes

Considering initially repeated and overlapped short reads mapped to just one contig, we used the read mate pairs to join contigs step by step into scaffolds to get larger insert sizes (Li et al. 2010). Therefore, we obtained 121,084 scaffolds in the second round assembly, which ranged from 100 nt to 10,404 nt with a mean length of 303 nt (Table 1). Among the yielded scaffolds, scaffolds from 100 nt to 500 nt accounted for 88.9%. Most of the scaffolds (82.1%) had no gaps (Fig. 2). Repeated sequences composed of gaps inside the scaffolds were masked in scaffold construction. To disassemble the repeated copies and to shorten the gaps in the scaffolds, we used the paired-end information to repair gaps if one read was well aligned on the contig and another was located in the gap region (Li et al. 2010). Furthermore, we continued to do local assembly with unmapped ends to fill in the small gaps within the scaffolds. Finally, we harvested a total of 58,800 unigenes, which were neither extended on either end nor had the least “N”s (Table 1). The length of unigenes ranged from 200 nt to 10,380 nt with an average length of 476 nt (Table 1).

Fig. 2
figure 2

Gap distribution of assembled scaffolds and unigenes. Gap distribution (N/size) %: gap percentage (N amount/sequence length) distribution

The assembly quality had a similar tendency to the previously reported transcriptome sequencing by Wang et al. (2010). The overall sequencing depth varied from 1 to 6,424-fold with a mean depth of 137.1-fold (Fig. 3). Sequencing depth of most unigenes ranged from 11 to 100 reads and 26,591 (45.2%) unigenes were covered. About 35.6% unigenes had sequencing depth of more than 100 reads.

Fig. 3
figure 3

Assessment of assembly quality. Distribution of unique-mapped reads and RPKM (reads per kb per million reads) of the assembled unigenes

Using the RPKM approach, we could estimate the expression of each unigene by reading the depth of deep sequencing. According to Fig. 3, it showed that the expression of each unigene varied similarly with sequencing depth (Fig. 3). The expression of unigenes ranged from 0 to 1,935 RPKM with an average of 34 RPKM. 24,137 of 58,800 (41.0%) unigenes had a very low expression level of less than 10 RPKM.

Functionally annotating unigenes by BLASTX searching against protein databases

After obtaining unigenes, we asked if these unigenes were real genes and what genes they represented. Therefore, we performed a BLASTX search against protein databases of nr in NCBI, SwissProt, COG, and KEGG, in that order. According to the BLASTX results, 40,280 unigenes were detected to have homologous proteins in the nr protein database. Of 18,520 non-coding unigenes, 325 unigenes and 214 unigenes were hit in databases of SwissProt and COG, respectively. In all, 17,981 genes were identified to be non-coding genes. According to the sequence direction in the BLASTX result, we could be certain of the direction of at least 43,458 (73.9%) genes (5′ → 3′).

COG is a database where orthologous gene products are classified. Every protein in the COG database is assumed to be evolved from an ancestor protein, and the whole database is built on coding proteins with complete genomes as well as system evolution relationships of bacteria, algae and eukaryotes. 9,304 of 58,800 (15.8%) unigenes could be aligned to the COG database to predict and classify possible functions. According to the annotation of COG, we classified these genes into 26 different functional classes, such as RNA processing and modification, energy production and conversion, carbohydrate transport and metabolism, signal transduction mechanisms, and secondary metabolite biosynthesis, transport and catabolism (Fig. 4). The gene members in each functional class varied from 9 to 3,190 with an average of 824 ± 709. Except for predicted functions and unknown functions, the most abundant classes with more than 1,000 members were involved in transcription, DNA replication/recombination/repair, post-translational modification, single transduction, and amino acid metabolism.

Fig. 4
figure 4

Histogram presentation of clusters of orthologous groups (COG) classification. All unigenes were aligned to COG database to predict and classify possible function

GO classification and KEGG pathway enrichment

According to the BLASTX results against protein databases of nr, we could associate unigenes with GO classification from nr annotations. The GO classification provides an ontology of defined terms representing gene product properties, such as cellular components, molecular functions, and biological processes. It also enhances our understanding of gene classification in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner (Conesa et al. 2005). A total of 19,706 (33.5%) genes were categorized into 3,228 GO classes in terms of biological processes, cellular components, or molecular functions (Fig. 5). In order to better review GO classification involved by genes, each GO term of genes was further clustered to its parent term. The results showed that the three largest biological processes were cellular processes, metabolic processes, and biological regulation (Fig. 5). 12,574 genes belonged to 1,682 unique biological processes (data not shown). Meanwhile, most of the genes were classified into molecular functions of binding, catalytic activity, and transporter activity. In cellular components, the major locations for these gene products were classified as cell, cell part, or organelle. The results indicated that most of the sequenced genes were responsible for fundamental biological regulation and metabolism.

Fig. 5
figure 5

Gene ontology classification of assembled unigenes

KEGG pathway enrichment

KEGG is widely used as a reference database of pathway networks for integration and interpretation of large-scale datasets generated by high-throughput sequencing technology. To better understand the function of sequenced genes in sweet potato, a BLASTX search against KEGG protein database with a cutoff E value of 1e−5 was made on the assembled genes in sweet potato. 18,503 of 58,800 unigenes were assigned to 2,497 unique pathways, including biosynthesis of secondary metabolites, transcription and translation, protein processing and modification, sugar metabolism, amino acid metabolism, and fatty acid metabolism (Table 2). Our results showed that the largest six pathway groups were metabolic pathways [ko01100, 4,178 (22.58%)], biosynthesis of secondary metabolites [ko01110, 2,156 (11.65%)], plant-pathogen interactions [ko04626, 1,619 (8.75%)], spliceosome [ko03040, 886 (4.79%)], protein processing in the endoplasmic reticulum [ko04141, 664 (3.59%)], and starch and sucrose metabolism [ko00500, 553 (2.99%)] (Table 2). Considering the sweet potato is one of the most important tuberous crops rich in starch and also has a great application potential in food nutrition and drug development, we paid more attention to the pathways pertaining to starch and sucrose metabolism, anthocyanin biosynthesis, carotenoid biosynthesis, and vitamin biosynsthesis. Quite a number of genes were enriched to these promising pathways including anthocyanin biosynthesis [ko00942, 12 (0.06%)], starch and sucrose metabolism [ko00500, 553 (2.99%)], glycolysis/gluconeogenesis [ko00010, 312 (1.69%)], glycerophospholipid metabolism [ko00564, 198 (1.07%)], fatty acid metabolism [ko00071, 188 (1.02%)], carbon fixation in photosynthetic organisms [ko00710, 160 (0.86%)], ascorbate and aldarate metabolism [ko00053, 118 (0.64%)], carotenoid biosynthesis [ko00906, 94 (0.51%)], fatty acid biosynthesis [ko00061, 93(0.51%)], other glycan degradation [ko00511, 69 (0.37%)], photosynthesis [ko00195, 62 (0.34%)], glycosaminoglycan degradation [ko00531, 43 (0.23%)], vitamin B6 metabolism [ko00750, 24 (0.13%)], and biotin metabolism [ko00780, 8 (0.04%)]. This contributes to accelerating research and the application of starch yield and drug development in the sweet potato.

Table 2 KEGG pathway enrichment in sweet potato

Mining SSR from unigenes

Currently, the genome of the sweet potato is still unavailable; therefore much research on sweet potato breeding and genetic linkage maps relies on molecular genetic markers such as SSR and AFLP (Buteler et al. 1999). SSR is one of the most popular marker systems and is consistent by varying numbers of tandemly repeated di-, tri-, or tetra-nucleotide DNA motifs. To fully utilize the assembled unigenes, we employed SSR Finder (ftp://ftp.gramene.org/pub/gramene/software/scripts/ssr.pl) to mine potential SSRs in the identified unigenes. A total of 851 potential SSRs were identified from 821 unigenes. Out of the 851 SSRs, there were three SSR classes including di- SSR (349, 41.0%), tri- SSR (441, 51.8%), and tetra- SSR (61, 7.2%) (Suppl. Table. S1). There were practically no complex SSR motives, but not all unigenes had SSRs. The repeated unit numbers in SSRs varied from 5 to 31 with an average of 8.19. The SSRs with six repeated units were the largest and accounted for 34.7% (295).

In order to facilitate designing SSR markers, we employed eprimer3 (primer 3) to design primer pairs for each SSR under a series of primer-designing parameters (see “Materials and methods”). We could design 423 primer pairs in 851 (49.7%) SSRs (Suppl. Table. S1). However, these primers still need further experimental validation.

Discussion

Many genes are significantly expressed in the purple sweet potato. According to assembling the published RNA-seq dataset from the yellow sweet potato, a total of 130,262 contigs in length of more than 100 nt were obtained. The coverage of assembled contigs ranged from 0.28 to 996.52 RPKM with an average of 33.70 RPKM. Ortholog clustering results performed on two assembly sequences from the yellow sweet potato and our purple sweet potato shows that a total of 42,689 gene clusters were obtained. Amongst the 42,689 gene clusters, 631 genes from the purple sweet potato and 1,699 genes from the yellow sweet potato were identified to be specific genes in the two RNA-seq datasets, respectively. Of the 631 specific genes from the purple sweet potato, 253 genes are non-coding and 387 genes are protein-coding (Suppl. S2). The 387 protein-coding genes include the WD40 repeat family protein, UDP-glucosyltransferase, verticillium wilt disease resistance protein, S-receptor kinase-like protein 1, resistance protein PSH-RGH7, protein kinase, nematode resistance-like protein, nbs-lrr resistance protein, and the auxin efflux carrier component. More interestingly, at least 36 genes are shown that they are related to resistance to different abiotic and abiotic stresses. This indicates that the purple sweet potato may be more tolerant to environmental stress than other types of sweet potato.

The anthocyanin biosynthetic pathway has been well studied; many genes have been identified to participate in this pathway and have also been cloned. The pathway of flavonoid biosynthesis offers anthocyanin flavonoid as its substrate source. According to our results of KEGG pathway enrichment, a total of 183 assembled genes from our RNA-seq dataset were identified to be involved in flavonoid biosynthesis (data not shown). Chalcone synthase (CHS) and chalcone isomerase (CHI) are initially the two key enzymes controlling the metabolism of flavonoids and anthocyanins (Johnson and Pace 2010). Only chalcone synthase B was detected in the purple sweet potato and its expression was remarkably lower than its expression in the yellow sweet potato (0.03 to 0.1-fold) (Table 3). Of the eight genes predicted to code chalcone isomerase in the purple sweet potato, we found one gene (P45989) that was only detected in the purple sweet potato and two genes (P25784 and P21465) that were expressed 3.5 to 4-fold higher as compared to the yellow sweet potato (Table 3). The prototype pigment naringenin is formed through catalysis by the two enzymes and is subsequently oxidized by enzymes like flavanone hydroxylase (FHT or F3H), flavonoid 3′ hydroxylase (F3′H) and flavonoid 3′5′-hydroxylase (F3′5′H) (Johnson and Pace 2010). However, only three F3′H genes were found in the purple sweet potato and two of them have significantly higher expression (about 3 to 6 times) than that in the yellow sweet potato (Table 3). Oppositely, the expression of F3′5′H was obviously lower than that in the yellow sweet potato. Overall, three chalcone isomerase genes and two F3′H genes are indicated to play important roles in synthesizing substrates of anthocyanin pigments.

Table 3 Expression of genes involved in anthocyanin biosynthesis between two RNA-seq datasets

In the 12 assembled genes (P10687, P15973, P24884, P27764, P37171, P39309, P44029, P47461, P53247, P7192, P785, and P9837) participating in the pathway of anthocyanin biosynthesis, we could find 9 genes from the purple sweet potato that are orthologous to 20 genes from the yellow sweet potato (Table 3). Most of the orthologous genes were not significantly different in their expression. Interestingly, UDP-glucose-flavonoid 3-O-glucosyltransferase’s expression in the purple sweet potato were around 11 to 37-fold more than the its expression in the yellow sweet potato. UDP-glucose-flavonoid 3-O-glucosyltransferase (UFGT) is considered to finally catalyze anthocyanin to anthocyanin-3-glucoside (Johnson and Pace 2010). Thus, we infer that UFGT is a key enzyme in the pathway of anthocyanin biosynthesis and anthocyanin-3-glucoside is one of major components for anthocyanin biosynthesis. It is also possible that anthocyanin is not stable and is easy to be converted into other down-stream metabolites in the yellow sweet potato. UFGT from the wild potato was transformed in Arabidopsis and was proven to lead to the accumulation of anthocyanins (Lu et al. 2009). Down-regulation of the putative UDP-glucose: flavonoid 3-O-glucosyltransferase gene altered flower coloring in Phalaenopsis. The PeUFGT3-suppressed Phalaenopsis showed various levels of flower color fading due to the reduced level of PeUFGT3 transcriptional activity. There was a significant reduction in anthocyanin content in the PeUFGT3-suppressed Phalaenopsis flowers (Chen et al. 2011). Therefore, UFGT from our purple sweet potato is worthy of further investigation and has potential to be explored in future.

However, another gene isoform of glucosyltransferases is expressed about 11- and 17-fold higher in the yellow sweet potato than in the purple sweet potato. Glycosyltransferases are a ubiquitous group of enzymes that catalyze the transfer of a sugar moiety from an activated sugar donor onto saccharide or nonsaccharide acceptors (Coutinho et al. 2003). Glycosyltransferases could generate a potentially infinite number of glycoconjugates, oligo- and polysaccharides due to its remarkable diversity in their donor, acceptor, and product specificity (Coutinho et al. 2003). Thus, the glucosyltransferase may have different isoforms in the purple sweet potato and they may have different functions as compared to the yellow sweet potato.

Although a group of genes consist of a complex biosynthesis pathway for anthocyanins, at least three important transcription factors, MYB, bHLH, and WD40 repeat protein, have been validated to play key roles in anthocyanin biosynthesis by regulating anthocyanin structural genes during transcription. According to our annotation on assembled genes, at least 135 genes were predicted to code MYB proteins (Suppl. Table. S3). 128 of the 135 (94.8%) genes have orthologs in RNA-seq dataset of the yellow sweet potato. Similarly, 36 genes and 38 genes were identified to have functions of bHLH and WD40 in the purple sweet potato, respectively (Suppl. Table. S3). Most of the bHLH genes and WD40 genes have homologs in the yellow sweet potato. MYB proteins represent a diverse class of proteins with a DNA-binding domain, the MYB domain. Different categories of MYB proteins depend on the number of imperfect repeats of the MYB domain they contain. MYB proteins are versatile in regulating secondary metabolism, cellular morphogenesis, meristem formation, and the cell cycle (Jin and Martin 1999). Up to now, it is still unclear which type of MYB contributes to anthocyanin biosynthesis in the purple sweet potato. In addition, the MYB family in the two assembled sequences from the purple and the yellow sweet potato is hard to be categorized. Although the expression of MYB homologs in the purple and the yellow sweet potato could be compared, the expression level was too weak to tell which genes were really responsible for anthocyanin biosynthesis. This is also the same for bHLH and WD40. More research on these three transcription factors is needed.

In summary, we identified a total of 58,800 unigenes from the purple sweet potato using the high-throughput sequencing technology. A majority of these unigenes are protein-coding genes. The rest of the sequences may be miRNAs, siRNAs or other noncoding RNAs. Functional enrichment using GO and KEGG annotations showed that at least 3,553 genes might participate in many important biological and metabolic pathways, including pigment biosynthesis. Three chalcone isomerase genes and two F3′H genes might be key enzymes for synthesizing substrates of anthocyanin pigments in the pathway of flavonoid biosynthesis.