Abstract
Objective
To enrich ovine genomic resources and provide a basis for future molecular genetics and functional genomics analyses in sheep, we used de novo assembly to establish transcriptomes of skeletal muscle tissues of Dorper and Small-tailed Han sheep.
Results
A total of 103,058,824 clean Illumina paired-end sequencing reads from the two libraries were assembled de novo into 145,524 unigenes. In total, 5718 unigenes were differentially expressed between the two transcriptomes, and 7437 coding SSRs were identified. After further assembly, we obtained 70,348 all-unigenes with an average length of 863 bp; 35,201 of these all-unigenes could be annotated in the Nr database, and 12,219 were found in the clusters of orthologous groups database. Gene ontology searches indicated cell and binding as the main terms. Among 258 Kyoto Encyclopedia of Genes and Genomes database pathways, protein and amino acid metabolism pathways were the most commonly identified.
Conclusion
We analyzed the ovine muscle transcriptome using high-throughput sequencing technology. Many unigenes were assembled, and numerous molecular markers and differentially expressed unigenes were identified.
Introduction
Mutton and goat meat constitute a large percentage of the meat consumed in China, and the growing market demand for mutton has promoted the expansion of sheep breeding programs. Hence, understanding the genetic principles underlying muscle growth would be helpful for promoting sheep breeding.
High-throughput sequencing, also known as next-generation sequencing (NGS), has been widely used to analyze nucleotide sequences in many organisms because of its high throughput, high accuracy and low cost (Vera et al. 2008; Collins et al. 2008; Li et al. 2010). RNA-seq, a method for studying transcriptomes using NGS, measures differences in gene expression with greater sensitivity than traditional microarray hybridization methods (Wilhelm and Landry 2009). Even more importantly, RNA-seq can be used to perform de novo analysis of sequencing data without a reference genome. However, to our knowledge, studies on sheep transcriptomes using NGS technology have been limited thus far. Jäger et al. provided a de novo atlas of Merino sheep bones using the Illumina GA IIx sequencing platform and compared the differentially expressed genes between standard and delayed-healing groups at various times (Jäger et al. 2011). In addition, Miao and Luo conducted a de novo analysis of ovarian RNA-seq data from Small-tailed Han sheep (SH) and Surabaya fur sheep, which exhibit distinct fecundity patterns, and obtained more than 20,000 contigs with an average length of 300 bp (Miao and Luo 2013).
In a previous study, we performed RNA-seq on the Illumina platform and mapped the sequencing reads to the Ovis aries genome and reference genes to analyze and compare the muscle transcriptomes of Dorper (DP) and SH sheep. During mapping, we found many uncertain bases (N) in the genome and a relatively small amount of reference information. The mapping results revealed that more than half of the sequencing reads fell in intergenic regions (Zhang et al. 2013, 2014). This high proportion of reads that did not map to genes may be related to imperfections in the O. aries genome and gene annotation. Therefore, in the present study, we conducted a de novo analysis of the two sets of RNA-seq data to avoid errors arising from the incomplete O. aries reference.
DP is a foreign breed with rapid muscle growth, whereas the indigenous Chinese breed, SH, has a slower growth rate. By comparing unigenes with differential expression between the two transcriptomes, we expected to identify genes closely associated with muscle growth. In the present study, we used Illumina paired-end mRNA-Seq technology to sequence and de novo assemble the ovine skeletal muscle transcriptomes, and further developed cSSR markers. To the best of our knowledge, this study is the first to characterize the ovine skeletal muscle transcriptomes by large-scale transcript sequencing analysis. These analyses and cSSRs will serve as a valuable resource for the discovery of novel genes and for marker-assisted selective breeding in sheep.
Materials and methods
Animals, tissue collection, library construction and Illumina sequencing
The experimental animals and the procedures used to perform biceps brachii collection, construction of cDNA libraries, sequencing and data filtering to obtain clean reads were described in a previous report by our group (Zhang et al. 2013). All animal experiments were approved by the Institutional Animal Care and Use Committee of Shandong Agricultural University (2004006).
De novo assembly
Clean reads of the two libraries were de novo assembled to generate contigs, unigenes and non-redundant unigenes (all-unigenes) that were as long as possible. We used two methods for de novo assembly: SOAPdenovo 63 mer-V1.05 (Li et al. 2010), with an optimized k-mer length of 41, and the Trinity method (Grabherr et al. 2011), with an optimized k-mer length of 25.
Expression annotation and analysis of differential expression
The expression level of each unigene was measured using the number of clean reads mapped to its sequence. We counted the mapped clean reads and then normalized the count to FPKM (Fragments Per Kb per Million fragments) (Mortazavi et al. 2008) using the following formula:
$$\mathrm{FPKM}(A) = \frac{10^{6}\,C}{N L / 10^{3}}$$
where FPKM(A) indicates the expression of unigene A, C represents the number of fragments that uniquely aligned to unigene A, N is the total number of fragments that uniquely aligned to all unigenes, and L is the number of bases in unigene A.
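The normalization can be illustrated with a minimal Python sketch. It assumes the standard FPKM definition, FPKM(A) = 10^6 C / (N L / 10^3), with C, N and L as defined in the text; the function name and the example counts are hypothetical and are not taken from the study's pipeline.

```python
def fpkm(c_fragments: int, n_total_fragments: int, length_bp: int) -> float:
    """Fragments per kilobase per million fragments for one unigene.

    c_fragments       -- C, fragments uniquely aligned to the unigene
    n_total_fragments -- N, fragments uniquely aligned to all unigenes
    length_bp         -- L, number of bases in the unigene
    """
    if n_total_fragments <= 0 or length_bp <= 0:
        raise ValueError("N and L must be positive")
    # 10^6 * C / (N * L / 10^3)
    return (1e6 * c_fragments) / (n_total_fragments * length_bp / 1e3)


# Hypothetical example: 500 fragments on a 1000 bp unigene,
# out of 20 million uniquely aligned fragments in the library.
example_value = fpkm(500, 20_000_000, 1_000)
print(example_value)
```

Dividing by both library size and unigene length is what makes values comparable between the two libraries and between unigenes of different lengths.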
The methods for identifying differentially expressed unigenes (DEGs), calculating the corresponding P values and performing other statistical analyses were described in our previous work (Zhang et al. 2013).
Gene annotation
Unigenes were aligned to the NCBI nonredundant protein (Nr) and Swiss-Prot protein databases, as well as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Clusters of Orthologous Groups (COG) databases, using BLASTx with an E-value of less than 1e−5 (Altschul et al. 1997). Based on the Nr annotation, Gene ontology (GO) annotations of the all-unigenes were obtained using Blast2GO (Conesa et al. 2005). After the GO annotation was obtained for each all-unigene, we used WEGO software (Ye et al. 2006) to perform GO functional classification. The KEGG database was used for annotating the pathways of these all-unigenes (Kanehisa et al. 2008).
cSSR exploitation
Coding microsatellites (cSSRs), a type of molecular marker located in coding sequences, contain repeat units of a few nucleotides (usually 2–6) that form tracts of up to dozens of tandem repeats. We performed a general screen for cSSRs in the all-unigenes using the software MIcroSAtellite (MISA, http://pgrc.ipk-gatersleben.de/misa/). This analysis was conducted as follows: (1) cSSRs were defined as containing motifs of two to six nucleotides in size and showing a minimum length of 12 bp. In other words, two-nucleotide motifs had to be repeated at least six times, three-nucleotide motifs at least four times, and so on. We selected cSSRs with ≥150 bp of flanking sequence both before and after the cSSR. (2) Primer pairs flanking each cSSR locus were designed using the software Batchprimer3 (Kortt et al. 1991). The optimal primer size was 24 bp (range 20–28 bp), and the optimal Tm of the primer pairs was 63 °C (range 60–65 °C). (3) We aligned these primers to the all-unigenes (Blast parameters: blastall -p blastn -F F -e 10). Up to 3 mismatched bases were permitted at the 5′ end of the primers and 1 mismatched base at the 3′ end. (4) cSSRs in the product sequences were identified using SSR Finder software.
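The repeat definition in step (1) can be sketched with a regular expression in Python. This is not MISA itself, only a simplified illustration. The minimum repeat counts for two- and three-nucleotide motifs come from the text; the values for four-, five- and six-nucleotide motifs are assumptions derived from the 12 bp minimum length, and the function names are hypothetical.

```python
import re

# Minimum repeat counts per motif size. 2 -> 6 and 3 -> 4 are stated in the
# text; 4 -> 3, 5 -> 3 and 6 -> 2 are assumptions based on the 12 bp minimum.
MIN_REPEATS = {2: 6, 3: 4, 4: 3, 5: 3, 6: 2}


def is_primitive(motif: str) -> bool:
    """True if the motif is not a repeat of a shorter unit (e.g. not ACAC)."""
    n = len(motif)
    return not any(
        n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n)
    )


def find_cssrs(seq: str, min_flank: int = 150):
    """Return (motif, repeats, start, end) for SSRs with enough flanking bases."""
    seq = seq.upper()
    hits = []
    for unit, min_rep in MIN_REPEATS.items():
        # One captured motif followed by at least (min_rep - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if not is_primitive(motif):
                continue  # e.g. ACAC is counted as a dinucleotide repeat
            if m.start() < min_flank or len(seq) - m.end() < min_flank:
                continue  # not enough flanking sequence for primer design
            hits.append((motif, (m.end() - m.start()) // unit, m.start(), m.end()))
    return sorted(hits, key=lambda h: h[2])


# Hypothetical sequence: an (AC)8 tract with 150 bp of flank on each side.
toy = "G" * 150 + "AC" * 8 + "T" * 150
print(find_cssrs(toy))
```

Skipping non-primitive motifs keeps a long dinucleotide tract from also being reported as a tetra- or hexanucleotide repeat. Compound and interrupted repeats are ignored in this sketch.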
Results
De novo sequence assembly
Using Illumina paired-end sequencing technology, totals of 50,264,608 and 52,794,216 clean reads with a length of 2 × 90 bp were generated from the SH and DP libraries, respectively (Zhang et al. 2013). From these clean reads, we assembled 127,516 and 138,145 contigs from the SH and DP libraries (see supporting information), respectively, and the average sizes of these contigs were 326 and 309 bp (Table 1). Only contigs that exceeded 200 bp in length were analyzed further. Using the Trinity method, we next assembled these contigs into unigenes. The 70,632 and 74,892 assembled unigenes showed average lengths of 735 and 706 bp, totaling approximately 51.94 and 52.84 Mb for the SH and DP libraries (see supporting information), respectively. The two sets of contigs and unigenes showed similar length distributions (Fig. 1a). To evaluate how consistent the assembled unigenes were with the O. aries genome and reference genes in NCBI, we analyzed them with BLASTN, using an E-value of 1e−10 and a matching ratio >95 % as the criteria for a match. For the SH and DP libraries, 99.4 and 99.3 % of unigenes matched the genome, and 79.9 and 79.6 % of them were located in the reference genes, respectively.
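The matching criteria can be applied to BLASTN results with a short Python filter. The sketch below assumes the hits were written in the standard 12-column BLAST tabular format (query ID in column 1, percent identity in column 3, E-value in column 11) and that the "matching ratio" corresponds to percent identity; both are assumptions, since the text does not state the output format or define the ratio. The function name and example hits are hypothetical.

```python
def matched_unigenes(tabular_lines, max_evalue=1e-10, min_identity=95.0):
    """Collect query IDs having at least one hit that passes both thresholds."""
    matched = set()
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query_id = fields[0]
        identity = float(fields[2])   # percent identity (assumed matching ratio)
        evalue = float(fields[10])    # expectation value of the hit
        if evalue <= max_evalue and identity > min_identity:
            matched.add(query_id)
    return matched


# Hypothetical hits: only u1 passes both thresholds.
hits = [
    "u1\tchr1\t99.2\t700\t5\t0\t1\t700\t1000\t1699\t1e-150\t1200",
    "u2\tchr2\t90.0\t650\t65\t0\t1\t650\t200\t849\t1e-120\t800",
    "u3\tchr3\t98.0\t40\t1\t0\t1\t40\t10\t49\t1e-5\t60",
]
print(matched_unigenes(hits))
```

The fraction of matched unigenes is then the size of the returned set divided by the number of unigenes in the library.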
To obtain longer and more reliable sequences, we pooled all of the unigenes from the two libraries and assembled them into a set of all-unigenes (non-redundant unigenes). We obtained a total of 70,348 all-unigenes, with an average length of 863 bp and an N50 of 1641 bp (see supporting information). The total length of the all-unigenes was 60,725,229 bp (Fig. 1a).
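Assembly summaries of this kind usually report the mean length together with the N50, the length at which sequences of that size or longer contain at least half of the assembled bases. As a check on the figures above, 60,725,229 bp / 70,348 all-unigenes ≈ 863 bp. The Python sketch below shows how both statistics are computed; the function names and the toy length list are hypothetical.

```python
def mean_length(lengths):
    """Arithmetic mean of sequence lengths."""
    return sum(lengths) / len(lengths)


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Hypothetical toy assembly: one long sequence dominates the total length.
toy_lengths = [200, 300, 400, 500, 2000]
print(mean_length(toy_lengths), n50(toy_lengths))
```

Because long sequences contribute more bases, the N50 of a transcriptome assembly is typically larger than its mean length. A median length, by contrast, cannot exceed the mean by this much when every sequence is at least 200 bp long.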
DEGs and GO classification
Comparisons of unigene expression indicated that 5711 unigenes were differentially expressed between the two muscle libraries, with a false discovery rate (FDR) ≤ 0.001 and |log2(FPKM of SH/DP)| ≥ 1 (see supporting information), among which 3053 unigenes were up-regulated and 2658 were down-regulated in the SH library compared to the DP library (Fig. 1b).
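The two thresholds can be expressed as a small Python filter. This is a sketch of the selection rule only, not of the statistical test that produces the FDR. The record type, the function name, and the floor of 0.01 used to avoid division by zero are assumptions for illustration.

```python
import math
from typing import NamedTuple


class Unigene(NamedTuple):
    name: str
    fpkm_sh: float  # expression in the SH library
    fpkm_dp: float  # expression in the DP library
    fdr: float      # false discovery rate from the differential test


def classify(u: Unigene, fdr_max: float = 0.001, min_abs_log2: float = 1.0,
             floor: float = 0.01) -> str:
    """Label a unigene as up or down in SH relative to DP, or not significant."""
    if u.fdr > fdr_max:
        return "not_significant"
    # The floor is an assumption that keeps the ratio defined when FPKM is 0.
    log2_ratio = math.log2(max(u.fpkm_sh, floor) / max(u.fpkm_dp, floor))
    if log2_ratio >= min_abs_log2:
        return "up_in_SH"
    if log2_ratio <= -min_abs_log2:
        return "down_in_SH"
    return "not_significant"


# Hypothetical records.
records = [
    Unigene("u1", 8.0, 2.0, 0.0001),
    Unigene("u2", 2.0, 8.0, 0.0001),
    Unigene("u3", 3.0, 2.0, 0.0001),
]
print([classify(u) for u in records])
```

A threshold of |log2 ratio| ≥ 1 corresponds to at least a two-fold difference in FPKM between the libraries.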
Annotation using the GO database revealed that these DEGs were allocated to the categories “biological process” (16,833 unigenes, 24 items), “cellular component” (10,651 unigenes, 16 items) and “molecular function” (3971 unigenes, 18 items) (see supporting information), with the secondary categories cellular process (GO: 0009987, 1957, 11.6 %), cell (GO: 0005623, 2026, 19.02 %), cell part (GO: 0044464, 2026, 19.02 %) and binding (GO: 0005488, 1900, 47.9 %) accounting for the greatest number of DEGs in each of the primary categories (Fig. 1c).
Functional annotation of all-unigenes against public databases
To validate and annotate the assembled all-unigenes, sequence similarity searches were conducted against public databases. Using the BLASTx algorithm with an E-value cutoff of 1e−5, 35,201 of the all-unigenes were successfully annotated using the NCBI Nr database (see supporting information). As shown in Fig. 2a, 60 % of the all-unigenes showed 95–100 % similarity to known proteins in the Nr database, and 63 % of them showed high similarity to proteins encoded in Bos taurus. The E-value distribution revealed that 57 % of the all-unigenes showed significant homology (E-value < 1e−60) in the Nr database. Interestingly, only 2 % of the all-unigenes could be mapped to the present O. aries genome, indicating that the present sheep genome should be further improved.
A total of 28,489 all-unigenes were categorized into 62 functional groups (Fig. 2b) within the GO database (see supporting information). In the three major GO categories (biological process, cellular component and molecular function), “cellular process” (22,774 members), “cell” (24,182 members), “cell part” (24,182 members) and “binding” (21,529 members) were the most frequently indicated terms. We also observed a high percentage of genes from the “metabolic process” (18,204 members) and “organelle” (19,296 members) categories of the GO database. Few genes were allocated to the categories “virion” (5 members), “virion part” (5 members), “chemorepellent activity” (7 members), “morphogen activity” (5 members), “protein tag” (3 members) and “nutrient reservoir activity” (1 member).
The COG database can be used to classify the products of orthologous genes. The all-unigenes were aligned to the COG database to predict and classify their possible functions (see supporting information). Within this group, 12,219 all-unigenes were assigned to 25 COG classifications (Fig. 3a). Among the 25 COG categories, the “general function prediction only” cluster (5244, 42.9 %) represented the largest group, followed by “translation, ribosomal structure and biogenesis” (2970, 24.3 %), “replication, recombination and repair” (2857, 23.4 %) and “cell cycle control, cell division and chromosome partitioning” (2595, 21.2 %). However, only a few all-unigenes were assigned to the “nuclear structure” (7, 0.06 %) and “extracellular structures” (63, 0.52 %) categories.
According to the KEGG database, 26,311 all-unigenes were grouped into 258 KEGG functional pathways (see supporting information). The highest number of all-unigenes was attributed to metabolic pathways (2348, 8.9 %), followed by focal adhesion (1485, 5.6 %) and regulation of the actin cytoskeleton (1282, 4.9 %). In addition, we found 765 all-unigenes involved in protein digestion and absorption and 526 in protein processing in the endoplasmic reticulum. Muscle is closely associated with the metabolism of amino acids. The KEGG metabolic pathways contained many amino acid synthesis and degradation pathways, among which arginine metabolism, proline metabolism, glutathione metabolism and N-glycan biosynthesis were the most prominent. In addition to metabolic pathways, 22 signaling pathways were identified, particularly the MAPK (752, ko04010), chemokine (561, ko04062), insulin (534, ko04910), Wnt (519, ko04310) and calcium (468, ko04020) pathways (Fig. 3b).
Analysis of cSSRs
We performed a general screen of the all-unigenes for cSSRs and analyzed their nature and frequency. A search for di-, tri-, tetra-, penta- and hexa-nucleotide repeats across the all-unigenes (total size 60,725,229 bp) yielded 7437 potential cSSRs within 6247 all-unigenes (see supporting information). That is, 8.9 % of the 70,348 all-unigenes contained at least one of the cSSR motifs considered, corresponding to roughly one cSSR per 8.2 kb of sequence.
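The counts reported in this section (7437 cSSRs in 6247 all-unigenes) can be combined with the 70,348 all-unigenes and 60,725,229 bp total length from the assembly results to give two density measures. The following Python sketch works through the arithmetic; the variable names are hypothetical.

```python
total_all_unigenes = 70_348      # all-unigenes in the assembly
total_length_bp = 60_725_229     # summed length of all-unigenes
cssr_count = 7_437               # potential cSSRs detected
unigenes_with_cssr = 6_247       # all-unigenes containing at least one cSSR

# Share of all-unigenes that carry at least one cSSR, in percent.
percent_with_cssr = 100 * unigenes_with_cssr / total_all_unigenes

# Average spacing: assembled bases per detected cSSR.
bp_per_cssr = total_length_bp / cssr_count

# Average number of cSSRs in an all-unigene that contains any.
cssrs_per_positive_unigene = cssr_count / unigenes_with_cssr

print(f"{percent_with_cssr:.1f} % of all-unigenes contain a cSSR")
print(f"one cSSR per {bp_per_cssr:.0f} bp on average")
print(f"{cssrs_per_positive_unigene:.2f} cSSRs per cSSR-containing all-unigene")
```

These are averages over the whole assembly and say nothing about how the cSSRs are distributed among individual all-unigenes.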
The lengths of the cSSRs ranged from 12 to 54 bp. Among these motifs, the most common repeat motifs were dimers (2189, 46.4 %), followed by trimers (2180, 46.2 %) (Fig. 4a). AC repeats (1530, 32.4 %) were the most abundant type of motif detected in our cSSRs, followed by AGC (638, 13.5 %), CCG (531, 11.3 %), AGG (495, 10.5 %) and TG (423, 9 %) motifs (Fig. 4b).
In addition, we developed forward and reverse primers for each cSSR using Batchprimer3 software (see supporting information).
Discussion
After assembly, the average lengths of the unigenes (706 and 735 bp) were longer than those reported for Aegilops (500 bp) (Xu et al. 2012), peanut (619 bp) (Zhang et al. 2012) and Adriatic sturgeon (Vidotto et al. 2013). These differences suggest that our de novo transcriptome of O. aries provided abundant unigene information, without gaps in the sequences. Furthermore, when we evaluated the consistency of these assembled unigenes with the O. aries genome, more than 99 % of unigenes matched the genome at a matching ratio above 95 %, which is much higher than the roughly two-thirds of sequencing reads that could be mapped to the genome in our previous report (Zhang et al. 2013). This high matching ratio suggests that the assembly was successful. When we realigned all of the usable sequencing reads to our unigenes, 82.9 % of the 70,348 all-unigenes showed a coverage depth that exceeded 80 %. This percentage is much higher than that obtained for sweet potato (48.5 %) (Wang et al. 2010a, b) or goat (46.3 %) (Xu et al. 2013). This high coverage indicates that we obtained longer all-unigenes without gaps, providing genetic data that enrich the available genomic resources for sheep.
The differentially expressed genes in the two breeds of sheep are likely associated with their muscle growth and development. Recent studies identified 2815 genes that were differentially expressed between the longissimus dorsi and liver of male Mule × Charollais sheep using Affymetrix Human GeneChips (Graham et al. 2011) and 635 annotated genes or ESTs that were differentially expressed in five developmental phases between Texel and Ujumqin sheep longissimus dorsi using a sheep oligo DNA microarray (Ren et al. 2011). In the present study, we identified 5718 unigenes that were differentially expressed with an FDR ≤ 0.001 and an absolute value of the log2 ratio ≥ 1. NGS is clearly able to reveal more DEGs than can be found using microarrays, but these DEGs require further confirmation.
In the present work, 60 % of the all-unigenes showed more than 95 % similarity to known proteins in the Nr database, whereas in whitefly (Wang et al. 2010a, b), sesame (Wei et al. 2011) and Aegilops (Xu et al. 2012), only 16.2, 54 and 42.6 % of unigenes, respectively, showed homologs in the Nr database. The higher percentage of hits found in our study was partially due to the larger number of long sequences in our unigene database.
After mammals are born, the number of muscle fibers does not undergo any further changes, and muscle growth is mainly driven by increases in the muscle fiber diameter (Amthor et al. 2009; Jakson et al. 2012). When the identified DEGs were annotated using the GO database, the cell, cellular process and cell part categories contained the most unigenes within their primary categories, suggesting that muscle growth and development are closely associated with cell proliferation or differentiation. We also noted that many unigenes were annotated as belonging to metabolic processes in the KEGG database.
SSRs are neutral molecular markers that are widely distributed in genomes. In this study, approximately 9 % of the all-unigenes contained at least one of the cSSR motifs evaluated. This percentage was higher than in previous studies involving EST databases, in which approximately 3–7 % of the expressed sequences contained putative SSR motifs (Thiel et al. 2003; Guo et al. 2010; Blanca et al. 2011). Di-nucleotide repeats were the main SSRs identified, similar to results in humans (Lander et al. 2001) and pigs (Nie et al. 2011). In addition to (AC)n di-nucleotide repeats, (AGC)n, (CCG)n and (AGG)n also appeared at high frequencies. This result differs from findings in human (Lander et al. 2001), pig (Nie et al. 2011) and barley (Thiel et al. 2003). Based on these identified markers, we can design PCR primers, evaluate gene polymorphisms and provide a valuable source of genetic markers for future research in sheep.
References
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
Amthor H, Otto A, Vulin A, Rochat A, Dumonceaux J, Garcia L, Mouisel E, Hourdé C, Macharia R, Friedrichs M, Relaix F, Zammit PS, Matsakas A, Patel K, Partridge T (2009) Muscle hypertrophy driven by myostatin blockade does not require stem/precursor-cell activity. Proc Natl Acad Sci USA 106:7479–7484
Blanca J, Cañizares J, Roig C, Ziarsolo P, Nuez F, Picó B (2011) Transcriptome characterization and high-throughput SSRs and SNPs discovery in Cucurbita pepo (Cucurbitaceae). BMC Genomics 12:104
Collins LJ, Biggs PJ, Voelckel C, Joly S (2008) An approach to transcriptome analysis of non-model organisms using short-read sequences. Genome Inf 21:3–14
Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, Robles M (2005) Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21:3674–3676
Grabherr MG, Haas BJ, Yassour M et al (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29:644–652
Graham NS, May ST, Daniel ZC, Emmerson ZF, Brameld JM, Parr T (2011) Use of the Affymetrix Human GeneChip array and genomic DNA hybridisation probe selection to study ovine transcriptomes. Animal 5:861–866
Guo S, Zheng Y, Joung JG, Liu S, Zhang Z, Crasta OR, Sobral BW, Xu Y, Huang S, Fei Z (2010) Transcriptome sequencing and comparative analysis of cucumber flowers with different sex types. BMC Genomics 11:384
Jäger M, Ott CE, Grünhagen J, Hecht J, Schell H, Mundlos S, Duda GN, Robinson PN, Lienau J (2011) Composite transcriptome assembly of RNA-seq data in a sheep model for delayed bone healing. BMC Genomics 12:158
Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, Itoh M, Katayama T, Kawashima S, Okuda S, Tokimatsu T, Yamanishi Y (2008) KEGG for linking genomes to life and the environment. Nucleic Acids Res 36:480–484
Kortt AA, Caldwell JB, Lilley GG, Higgins TJ (1991) Amino acid and cDNA sequences of a methionine-rich 2S protein from sunflower seed (Helianthus annuus L). Eur J Biochem 195:329–334
Lander ES, Linton LM, Birren B et al (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921
Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, Li S, Yang H, Wang J, Wang J (2010) De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 20:265–272
Miao X, Luo Q (2013) Genome-wide transcriptome analysis between small-tail Han sheep and the Surabaya fur sheep using high-throughput RNA sequencing. Reproduction 145:587–596
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat Methods 5:621–628
Nie Q, Fang M, Jia X, Zhang W, Zhou X, He X, Zhang X (2011) Analysis of muscle and ovary transcriptome of Sus scrofa: assembly, annotation and marker discovery. DNA Res 18:343–351
Ren H, Li L, Su H, Xu L, Wei C, Zhang L, Li H, Liu W, Du L (2011) Histological and transcriptome-wide level characteristics of fetal myofiber hyperplasia during the second half of gestation in Texel and Ujumqin sheep. BMC Genomics 12:411
Thiel T, Michalek W, Varshney RK, Graner A (2003) Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L). Theor Appl Genet 106:411–422
Vera JC, Wheat CW, Fescemyer HW, Frilander MJ, Crawford DL, Hanski I, Marden JH (2008) Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Mol Ecol 17:1636–1647
Vidotto M, Grapputo A, Boscari E, Barbisan F, Coppe A, Grandi G, Kumar A, Congiu L (2013) Transcriptome sequencing and de novo annotation of the critically endangered Adriatic sturgeon. BMC Genomics 14:407
Wang XW, Luan JB, Li JM, Bao YY, Zhang CX, Liu SS (2010a) De novo characterization of a whitefly transcriptome and analysis of its gene expression during development. BMC Genomics 11:400
Wang Z, Fang B, Chen J, Zhang X, Luo Z, Huang L, Chen X, Li Y (2010b) De novo assembly and characterization of root transcriptome using Illumina paired-end sequencing and development of cSSR markers in sweetpotato (Ipomoea batatas). BMC Genomics 11:726
Wei W, Qi X, Wang L, Zhang Y, Hua W, Li D, Lv H, Zhang X (2011) Characterization of the sesame (Sesamum indicum L) global transcriptome using Illumina paired-end sequencing and development of EST-SSR markers. BMC Genomics 12:451
Wilhelm BT, Landry JR (2009) RNA-Seq quantitative measurement of expression through massively parallel RNA-sequencing. Methods 48:249–257
Xu DL, Long H, Liang JJ, Zhang J, Chen X, Li JL, Pan ZF, Deng GB, Yu MQ (2012) De novo assembly and characterization of the root transcriptome of Aegilops variabilis during an interaction with the cereal cyst nematode. BMC Genomics 13:133
Xu T, Guo X, Wang H, Du X, Gao X, Liu D (2013) De novo transcriptome assembly and differential gene expression profiling of three capra hircus skin types during anagen of the hair growth cycle. Int J Genomics 2013:269191
Ye J, Fang L, Zheng H, Zhang Y, Chen J, Zhang Z, Wang J, Li S, Li R, Bolund L, Wang J (2006) WEGO: a web tool for plotting GO annotations. Nucleic Acids Res 34:293–297
Zhang J, Liang S, Duan J, Wang J, Chen S, Cheng Z, Zhang Q, Liang X, Li Y (2012) De novo assembly and characterisation of the transcriptome during seed development, and generation of genic-SSR markers in peanut (Arachis hypogaea L). BMC Genomics 13:90
Zhang C, Wang G, Wang J, Ji Z, Liu Z, Pi X, Chen C (2013) Characterization and comparative analyses of muscle transcriptomes in Dorper and small-tailed Han sheep using RNA-Seq technique. PLoS One 8:e72686
Zhang C, Wang G, Wang J, Ji Z, Dong F, Chao T (2014) Analysis of differential gene expression and novel transcript units of ovine muscle transcriptomes. PLoS One 9:e89817
Acknowledgments
This work was supported by funds of Innovation Research of Agriculture and Biology Resources (No. 2011186125), the Shandong Provincial Modern Agriculture Industry Technology System Sheep Industry Innovation Team (No. SDAIT-09-011-01), Weifang science and technology development Projects (No. 201301009) and the Doctoral scientific research fund project of Weifang university (No. 2015BS04).
Supporting information
The supplementary data are available at https://mega.co.nz/#fm/KgBCQJba.
Cite this article
Zhang, C., Wang, G., Hou, L. et al. De novo assembly and characterization of the skeletal muscle transcriptome of sheep using Illumina paired-end sequencing. Biotechnol Lett 37, 1747–1756 (2015). https://doi.org/10.1007/s10529-015-1854-9
DOI: https://doi.org/10.1007/s10529-015-1854-9