Introduction

Mutton and goat meat constitute a large percentage of the meat consumed in China, and the growing market demand for mutton has promoted the expansion of sheep breeding programs. Hence, understanding the genetic principles underlying muscle growth would be helpful for promoting sheep breeding.

High-throughput sequencing, also known as next-generation sequencing (NGS), has been widely used to analyze nucleotide sequences in many organisms because of its high throughput, high accuracy and low cost (Vera et al. 2008; Collins et al. 2008; Li et al. 2010). RNA-seq, which applies NGS to the study of transcriptomes, measures differences in gene expression with greater sensitivity than traditional microarray hybridization methods (Wilhelm and Landry 2009). Even more importantly, RNA-seq data can be analyzed de novo without a reference genome. However, to our knowledge, studies of sheep transcriptomes using NGS technology have been limited thus far. Jäger et al. (2011) provided a de novo atlas of Merino sheep bone using the Illumina GA IIx sequencing platform and compared differentially expressed genes between standard and delayed-healing groups at various times. In addition, Miao and Luo (2013) conducted a de novo analysis of ovarian RNA-seq data from Small-tailed Han sheep (SH) and Surabaya fur sheep, which exhibit distinct fecundity patterns, and obtained more than 20,000 contigs with an average length of 300 bp.

In a previous study, we performed RNA-seq using the Illumina system and mapped the sequencing data to the Ovis aries genome and reference genes to analyze and compare the muscle transcriptomes of Dorper (DP) and SH sheep. During the mapping, we found many uncertain bases (N) in the genome and a relatively small amount of reference gene information. The mapping results revealed that more than half of the sequencing reads were assigned to intergenic regions (Zhang et al. 2013, 2014). This high proportion of reads not mapped to genes may be related to imperfections in the O. aries genome and gene annotations. Therefore, in the present study, we conducted a de novo analysis of the two sets of RNA-seq data to minimize the influence of errors in the O. aries genome.

DP is a foreign breed with rapid muscle growth, whereas the indigenous Chinese breed, SH, has a slower growth rate. By comparing unigenes with differential expression between the two transcriptomes, we expected to identify genes closely associated with muscle growth. In the present study, we used Illumina paired-end mRNA-Seq technology to sequence and de novo assemble the ovine skeletal muscle transcriptomes, and further developed cSSR markers. To the best of our knowledge, this study is the first to characterize the ovine skeletal muscle transcriptomes by large-scale transcript sequencing analysis. These analyses and cSSRs will serve as a valuable resource for the discovery of novel genes and for marker-assisted selective breeding in sheep.

Materials and methods

Animals, tissue collection, library construction and Illumina sequencing

The experimental animals and the procedures used to perform biceps brachii collection, construction of cDNA libraries, sequencing and data filtering to obtain clean reads were described in a previous report by our group (Zhang et al. 2013). All animal experiments were approved by the Institutional Animal Care and Use Committee of Shandong Agricultural University (2004006).

De novo assembly

Clean reads of the two libraries were de novo assembled to generate contigs, unigenes and non-redundant unigenes (all-unigenes) that were as long as possible. We used two methods for de novo assembly: SOAPdenovo 63 mer-V1.05 (Li et al. 2010), with an optimized k-mer length of 41, and the Trinity method (Grabherr et al. 2011), with an optimized k-mer length of 25.

Expression annotation and analysis of differential expression

The expression level of each unigene was measured using the number of clean reads mapped to its sequence. We calculated the number of clean reads and then normalized it to the FPKM (Fragments Per Kb per Million fragments) (Mortazavi et al. 2008) using the following formula:

$$ {\text{FPKM}} = \frac{{10^{6} C}}{{{\text{NL}}/10^{3} }} $$

where FPKM is the expression level of unigene A, C is the number of fragments that uniquely aligned to unigene A, N is the total number of fragments that uniquely aligned to all unigenes, and L is the number of bases in unigene A.
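The normalization above can be written as a short function. The following is a minimal sketch in Python, not the pipeline used in this study; the function name, argument names and example counts are hypothetical and chosen only to mirror the symbols C, N and L.

```python
def fpkm(c_fragments: int, n_total_fragments: int, length_bp: int) -> float:
    """FPKM = 10^6 * C / (N * L / 10^3), following the formula in the text.

    c_fragments       -- C, fragments uniquely aligned to the unigene
    n_total_fragments -- N, fragments uniquely aligned to all unigenes
    length_bp         -- L, number of bases in the unigene
    """
    if n_total_fragments <= 0 or length_bp <= 0:
        raise ValueError("N and L must be positive")
    return 1e6 * c_fragments / (n_total_fragments * length_bp / 1e3)


# Hypothetical counts, for illustration only (not data from this study):
# 500 fragments on a 1,000-bp unigene out of 20 million aligned fragments.
example_value = fpkm(500, 20_000_000, 1_000)
print(example_value)
```

Because the count is divided by both the library size and the unigene length, values are comparable between libraries of different sequencing depth and between unigenes of different length.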

The methods for identifying differentially expressed unigenes (DEGs), calculating the corresponding P values and performing other statistical analyses were described in our previous work (Zhang et al. 2013).

Gene annotation

Unigenes were aligned to the NCBI non-redundant protein (Nr) and Swiss-Prot protein databases, as well as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Clusters of Orthologous Groups (COG) databases, using BLASTx with an E-value of less than 10⁻⁵ (Altschul et al. 1977). Based on the Nr annotation, Gene Ontology (GO) annotations of the all-unigenes were obtained using Blast2GO (Conesa et al. 2005). After the GO annotation was obtained for each all-unigene, we used WEGO software (Ye et al. 2006) to perform GO functional classification. The KEGG database was used to annotate the pathways of these all-unigenes (Kanehisa et al. 2008).
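The E-value cutoff can be illustrated with a small filter over tabular BLAST output. This is a sketch under stated assumptions rather than the annotation pipeline used here: it assumes the standard 12-column tabular layout, in which the E-value is the 11th column, and the function name and sample rows are hypothetical.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-5  # hits must have an E-value below this threshold


def best_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Return {query: (subject, evalue)} keeping the lowest-E-value hit per query.

    Assumes 12 tab-separated columns per line, with the E-value in column 11.
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue >= cutoff:
            continue  # not significant under the chosen cutoff
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Invented rows for illustration only (not results from this study).
sample = "\n".join([
    "unigene1\tprotA\t98.0\t200\t4\t0\t1\t600\t1\t200\t2e-50\t180",
    "unigene1\tprotB\t80.0\t150\t30\t0\t1\t450\t1\t150\t1e-20\t90",
    "unigene2\tprotC\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.5\t20",
])
hits = best_hits(sample)
print(hits)
```

In this toy input, the second query has no hit below the cutoff and would therefore remain unannotated.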

cSSR exploitation

Coding microsatellites (cSSRs), a type of molecular marker located in coding sequences, contain repeat units of a few nucleotides (usually 2–6) arranged in up to dozens of tandem repeats. We performed a general screen of cSSRs in the obtained all-unigenes using the software MIcroSAtellite (MISA, http://pgrc.ipk-gatersleben.de/misa/). This analysis was conducted as follows: (1) cSSRs were defined as containing motifs of two to six nucleotides in size and showing a minimum length of 12 bp. In other words, two-nucleotide motifs had to be repeated at least six times, three-nucleotide motifs at least four times, and so on. We selected cSSRs with ≥150 bp of flanking sequence on both sides of the cSSR. (2) Primer pairs flanking each cSSR locus were designed using the software Batchprimer3 (Kortt et al. 1991). The optimal primer size was 24 bp (range 20–28 bp), and the optimal Tm of the primer pairs was 63 °C (range 60–65 °C). (3) We aligned these primers to the all-unigenes (BLAST parameters: blastall -p blastn -F F -e 10), allowing up to 3 mismatched bases at the 5′ end of a primer and 1 mismatched base at the 3′ end. (4) cSSRs in the product sequences were identified using SSR Finder software.
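The definition in step (1) can be sketched with regular expressions. The code below is an illustrative Python approximation, not MISA itself; the function names are hypothetical, and the rule "minimum repeats = ceil(12 / motif size)" is our reading of "a minimum length of 12 bp".

```python
import math
import re

MIN_SSR_LENGTH = 12  # minimum total repeat length in bp, as defined in the text


def build_patterns(min_len=MIN_SSR_LENGTH):
    """One regex per motif size (2-6 nt): a motif followed by enough copies of itself."""
    patterns = {}
    for k in range(2, 7):
        min_repeats = math.ceil(min_len / k)  # e.g. 6 for dimers, 4 for trimers
        patterns[k] = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
    return patterns


def find_ssrs(sequence, min_flank=150):
    """Return sorted (start, motif, repeat_count) tuples for SSRs in a sequence.

    min_flank is the required flanking sequence on each side (150 bp in the text).
    """
    seq = sequence.upper()
    hits = []
    for k, pattern in build_patterns().items():
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "ACAC" = "AC" x 2),
            # so that each locus is reported once under its shortest motif.
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            if m.start() < min_flank or len(seq) - m.end() < min_flank:
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


# Invented sequence for illustration; flank filtering is disabled for brevity.
toy_seq = "GGT" + "AC" * 6 + "TTG" + "AGC" * 4 + "CAT"
toy_hits = find_ssrs(toy_seq, min_flank=0)
print(toy_hits)
```

With the default 150-bp flank requirement, the same short toy sequence yields no hits, which mirrors the selection of cSSRs that leave enough sequence for primer design.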

Results

De novo sequence assembly

Using Illumina paired-end sequencing technology, totals of 50,264,608 and 52,794,216 clean reads with a length of 2 × 90 bp were generated from the SH and DP libraries, respectively (Zhang et al. 2013). From these clean reads, we assembled 127,516 and 138,145 contigs from the SH and DP libraries, respectively (see supporting information), with average sizes of 326 and 309 bp (Table 1). Only contigs that exceeded 200 bp in length were analyzed further. Using the Trinity method, we next assembled these contigs into unigenes. The 70,632 and 74,892 assembled unigenes showed average lengths of 735 and 706 bp, totaling approximately 51.94 and 52.84 Mb for the SH and DP libraries, respectively (see supporting information). The two sets of contigs and unigenes showed similar fragment distributions (Fig. 1a). To evaluate how consistent the assembled unigenes were with the O. aries genome and reference genes in NCBI, we aligned them using BLASTN, taking an E-value of 1e−10 and a matching ratio >95 % as the matching criteria. For the SH and DP libraries, 99.4 and 99.3 % of unigenes matched the genome, and 79.9 and 79.6 % were located in reference genes, respectively.

Table 1 Statistics on the assembled contigs and unigenes
Fig. 1

Size distribution of the assembled unigene data and comparison of unigene expression levels between the SH and DP libraries. a Size distribution of unigenes and all-unigenes. b Comparison of unigene expression levels between the SH and DP libraries. The horizontal and vertical axes indicate expression levels of the unigenes in the two libraries. Unigenes up-regulated and down-regulated in the SH library are shown in red and green, respectively. Blue dots represent unigenes with similar expression levels. c Gene ontology classification of differentially expressed unigenes. The results are summarized into three main categories: biological processes, cellular components and molecular functions. In total, 5711 differentially expressed unigenes were assigned to gene ontologies

To obtain longer and more reliable sequences, we pooled all of the unigenes from the two libraries and assembled them into a set of all-unigenes (non-redundant unigenes). We obtained a total of 70,348 all-unigenes, with an average length of 863 bp and a median length of 1641 bp (see supporting information). The total length of all-unigenes was 60,725,229 bp (Fig. 1a).
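Summary statistics of this kind are straightforward to compute from a list of sequence lengths. The following Python sketch is illustrative only; the function name and the toy lengths are hypothetical, and the final lines simply check that the reported total length and number of all-unigenes are consistent with the reported average length.

```python
from statistics import mean, median


def summarize_lengths(lengths):
    """Return count, total, mean and median for a list of sequence lengths (bp)."""
    if not lengths:
        raise ValueError("no sequence lengths supplied")
    return {
        "count": len(lengths),
        "total_bp": sum(lengths),
        "mean_bp": mean(lengths),
        "median_bp": median(lengths),
    }


# Toy lengths, invented for illustration only.
toy_summary = summarize_lengths([200, 300, 400, 1100, 3000])

# Consistency check using the figures reported in the text:
# total length 60,725,229 bp divided by 70,348 all-unigenes.
reported_mean = 60_725_229 / 70_348
print(toy_summary, round(reported_mean))
```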

DEGs and GO classification

Comparisons of unigene expression indicated that 5711 unigenes were differentially expressed between the two muscle libraries, with a false discovery rate (FDR) ≤ 0.001 and |log2(FPKM of SH/DP)| ≥ 1 (see supporting information), among which 3053 unigenes were up-regulated and 2658 were down-regulated in the SH library compared to the DP library (Fig. 1b).
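The two thresholds can be expressed as a simple classification rule. This is a minimal Python sketch of the filtering step only, assuming FDR values have already been computed; the function name and example values are hypothetical, and returning no call for unigenes with zero expression in either library is our assumption rather than the method of the study.

```python
import math

FDR_CUTOFF = 0.001          # FDR <= 0.001
MIN_ABS_LOG2_RATIO = 1.0    # |log2(FPKM of SH / DP)| >= 1


def classify_unigene(fpkm_sh, fpkm_dp, fdr):
    """Return 'up' or 'down' (SH relative to DP), or None if not a DEG."""
    if fpkm_sh <= 0 or fpkm_dp <= 0:
        # Assumption: the log ratio is undefined here, so no call is made.
        return None
    log2_ratio = math.log2(fpkm_sh / fpkm_dp)
    if fdr > FDR_CUTOFF or abs(log2_ratio) < MIN_ABS_LOG2_RATIO:
        return None
    return "up" if log2_ratio > 0 else "down"


# Invented values for illustration only (not data from this study).
calls = [
    classify_unigene(20.0, 10.0, 0.0001),  # two-fold higher in SH
    classify_unigene(5.0, 20.0, 0.0005),   # four-fold lower in SH
    classify_unigene(15.0, 10.0, 1e-6),    # fold change too small
    classify_unigene(40.0, 10.0, 0.01),    # FDR too high
]
print(calls)
```

A log2 ratio threshold of 1 corresponds to at least a two-fold difference in FPKM between the two libraries.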

Annotation using the GO database revealed that these DEGs were allocated to the categories “biological process” (16,833 unigenes, 24 items), “cellular components” (10,651 unigenes, 16 items) and “molecular function” (3971 unigenes, 18 items) (see supporting information), with the secondary categories cellular processes (GO: 0009987, 1957, 11.6 %), cell (GO: 0005623, 2026, 19 %), cell parts (GO: 0044464, 2026, 19.02 %) and binding (GO: 0005488, 1900, 47.9 %) accounting for the greatest number of DEGs in each of the primary categories (Fig. 1c).

Functional annotation of all-unigenes against public databases

To validate and annotate the assembled all-unigenes, sequence similarity searches were conducted against public databases. Using the BLASTx algorithm and specifying E-values of less than 10⁻⁵, 35,201 of the all-unigenes were successfully annotated using the NCBI Nr database (see supporting information). As shown in Fig. 2a, 60 % of the all-unigenes showed 95–100 % similarity to known proteins in the Nr database, and 63 % of them showed high similarity to proteins encoded in Bos taurus. The E-value distribution revealed that 57 % of all-unigenes showed significant homology (E-value less than 1e−60) in the Nr database. Interestingly, only 2 % of the all-unigenes could be mapped to the present O. aries genome, indicating that the present sheep genome requires further improvement.

Fig. 2

Functional annotation of all-unigenes against Nr and GO databases. a Characteristics of the similarity search performed on all-unigenes against the Nr database. E-value distribution of BLAST hits for each all-unigene with a cutoff E-value of 1e−5. In this search, 60 % of the all-unigenes showed 95–100 % similarity to known proteins, and only 2 % of the all-unigenes could be mapped to the present O. aries genome. b Gene ontology classification of the assembled all-unigenes. The results are summarized into three main categories: biological processes, cellular components and molecular functions. In total, 28,489 all-unigenes were assigned to 62 functional groups

A total of 28,489 all-unigenes were categorized into 62 functional groups (Fig. 2b) within the GO database (see supporting information). In the three major GO categories (biological process, cellular component and molecular function), “cellular process” (22,774 members), “cell” (24,182 members), “cell part” (24,182 members) and “binding” (21,529 members) were the most frequently indicated terms. We also observed a high percentage of genes from the “metabolic process” (18,204 members) and “organelle” (19,296 members) categories of the GO database. Few genes were allocated to the categories “virion” (5 members), “virion part” (5 members), “chemorepellent activity” (7 members), “morphogen activity” (5 members), “protein tag” (3 members) and “nutrient reservoir activity” (1 member).

The COG database can be used to classify the products of orthologous genes. The all-unigenes were aligned to the COG database to predict and classify their possible functions (see supporting information). Within this group, 12,219 all-unigenes were assigned to 25 COG classifications (Fig. 3a). Among the 25 COG categories, the “general function prediction only” cluster (5244, 42.9 %) represented the largest group, followed by “translation, ribosomal structure and biogenesis” (2970, 24.3 %), “replication, recombination and repair” (2857, 23.4 %) and “cell cycle control, cell division and chromosome partitioning” (2595, 21.2 %). However, only a few all-unigenes were assigned to the “nuclear structure” (7, 0.06 %) and “extracellular structures” (63, 0.52 %) categories.

Fig. 3

Functional annotation of all-unigenes against the COG and KEGG databases. a Clusters of orthologous groups (COG) classification of the assembled all-unigenes. The all-unigenes were aligned to the COG database to predict and classify their possible functions. In total, 12,219 sequences were assigned to 25 COG types. b Pathway assignment based on the KEGG database of assembled all-unigenes. The results showed 24 classifications associated with protein and amino acid metabolism and 22 classifications associated with signaling pathways

According to the KEGG database, 26,311 all-unigenes were grouped into 258 KEGG functional pathways (see supporting information). The highest number of all-unigenes was attributed to metabolic pathways (2348, 8.9 %), followed by focal adhesion (1485, 5.6 %) and regulation of the actin cytoskeleton (1282, 4.9 %). In addition, 765 all-unigenes were involved in protein digestion and absorption and 526 in protein processing in the endoplasmic reticulum. Muscle is closely associated with the metabolism of amino acids. The KEGG metabolic pathways contained many amino acid synthesis and degradation pathways, among which arginine metabolism, proline metabolism, glutathione metabolism and N-glycan biosynthesis were the most highly represented. In addition to metabolic pathways, 22 signaling pathways were identified, notably the MAPK (752, ko04010), chemokine (561, ko04062), insulin (534, ko04910), Wnt (519, ko04310) and calcium (468, ko04020) pathways (Fig. 3b).

Analysis of cSSRs

We performed a general screen of the all-unigenes for cSSRs and analyzed their nature and frequency. A search of the all-unigenes (total size 60,725,229 bp) for di-, tri-, tetra-, penta- and hexa-nucleotide repeats yielded 7437 potential cSSRs within 6247 all-unigenes (see supporting information). That is, approximately 1288 bp of the all-unigenes contained at least one of the cSSR motifs considered.

The lengths of the cSSRs ranged from 12 to 54 bp. Among these motifs, the most common repeat motifs were dimers (2189, 46.4 %), followed by trimers (2180, 46.2 %) (Fig. 4a). AC repeats (1530, 32.4 %) were the most abundant type of motif detected in our cSSRs, followed by AGC (638, 13.5 %), CCG (531, 11.3 %), AGG (495, 10.5 %) and TG (423, 9 %) motifs (Fig. 4b).

Fig. 4

Statistics of cSSR motifs. a Distribution of the number of cSSRs according to motif sequence type. b Distribution of the number of cSSRs based on the number of repeat units. Among the 4721 analyzed cSSRs, AC di-nucleotide repeats were the most abundant motif among the 183 types of motifs

In addition, we developed forward and reverse primers for each cSSR using Batchprimer3 software (see supporting information).

Discussion

After assembly, the average lengths of the unigenes (706 and 735 bp) were longer than those reported for Aegilops (500 bp) (Xu et al. 2012), peanut (619 bp) (Zhang et al. 2012) and Adriatic sturgeon (Vidotto et al. 2013). These differences suggest that our de novo transcriptome of O. aries provides abundant unigene information, without gaps in the sequences. Furthermore, when we evaluated the consistency of the assembled unigenes with the O. aries genome, more than 99 % of unigenes exhibited a matching ratio exceeding 95 %, which is much higher than the previously reported proportion of about two-thirds of sequencing reads that could be mapped to the genome (Zhang et al. 2013). This high matching ratio indicates that our assembly was successful. When we realigned all of the usable sequencing reads to our unigenes, 82.9 % of the 70,348 all-unigenes showed a coverage exceeding 80 %. This percentage is much higher than that obtained for sweet potato (48.5 %) (Wang et al. 2010a, b) or goat (46.3 %) (Xu et al. 2013). This high coverage indicates that we obtained longer all-unigenes without gaps, providing genetic data that enrich the available genomic resources for sheep.

The differentially expressed genes in the two breeds of sheep are likely associated with their muscle growth and development. Recent studies identified 2815 genes that were differentially expressed between the longissimus dorsi and liver of male Mule × Charollais sheep using Affymetrix Human GeneChips (Graham et al. 2011) and 635 annotated genes or ESTs that were differentially expressed in five developmental phases between Texel and Ujumqin sheep longissimus dorsi using a sheep oligo DNA microarray (Ren et al. 2011). In the present study, we identified 5711 unigenes that were differentially expressed with an FDR ≤ 0.001 and an absolute value of the log2 ratio ≥ 1. NGS appears able to reveal more DEGs than can be found using microarrays, but these DEGs require further confirmation.

In the present work, 60 % of the all-unigenes showed more than 95 % similarity to known proteins in the Nr database, whereas in whitefly (Wang et al. 2010a, b), sesame (Wei et al. 2011) and Aegilops (Xu et al. 2012), only 16.2, 54 and 42.6 % of unigenes, respectively, had homologs in the Nr database. The higher percentage of hits found in our study may be partly due to the greater number of long sequences in our unigene database.

After mammals are born, the number of muscle fibers does not undergo any further changes, and muscle growth is mainly driven by increases in the muscle fiber diameter (Amthor et al. 2009; Jakson et al. 2012). When the identified DEGs were annotated using the GO database, the cellular process, cell, cell part and binding categories contained the most unigenes in their respective primary categories, suggesting that muscle growth and development are closely associated with cell proliferation or differentiation. We also noted that many unigenes were annotated as belonging to metabolic processes in the KEGG database.

SSRs are neutral molecular markers that are widely distributed in genomes. In this study, approximately 10 % of the all-unigenes contained at least one of the cSSR motifs evaluated. This percentage was much higher than in previous studies involving EST databases, in which approximately 3–7 % of the expressed sequences contained putative SSR motifs (Thiel et al. 2003; Guo et al. 2010; Blanca et al. 2011). Di-nucleotide repeats were the main SSRs identified, similar to results in humans (Lander et al. 2001) and pigs (Nie et al. 2011). In addition to (AC)n di-nucleotide repeats, (AGC)n, (CCG)n and (AGG)n also appeared at high frequencies. This result differs from findings in human (Lander et al. 2001), pig (Nie et al. 2011) and peanut (Thiel et al. 2003) genomes. Based on these identified markers, PCR primers can be designed and gene polymorphisms evaluated, providing a valuable source of genetic markers for future research in sheep.