Introduction

The endosperm of rice (Oryza sativa) is a major storage tissue for starch and proteins, which are essential for embryo growth and postembryonic development prior to the start of seedling photosynthesis (Xu et al. 2008). Because the endosperm is both a major storage tissue and a determinant of rice yield, the mechanisms of endosperm development have become an active research topic (Girke et al. 2000; Laudencia-Chingcuanco et al. 2007; Verza et al. 2005; Luo et al. 2011). However, the molecular mechanisms underlying endosperm development remain largely unknown.

Previous studies have provided insights into the morphological and cellular features of developing rice seeds (Berger 1999; Ishimaru et al. 2003). Rice-seed development is triggered by a double fertilization process, in which the endosperm is initiated by the fusion of a sperm nucleus and two polar nuclei and the fertilized egg develops into the embryo (Goldberg et al. 1994). In monocots such as cereal crops, the endosperm is a persistent structure and represents the main part of the mature seed (Le et al. 2007). A general model for the development of cereal endosperm recognizes four major stages: syncytial, cellularization, differentiation and maturation (Olsen et al. 1992). In rice endosperm development, the fertilized polar nuclei undergo numerous cycles of mitosis without cellularization until 3 DAP (days after pollination). Then mitosis in the peripheral syncytium ceases and cellularization proceeds until 6–7 DAP. Once the endosperm has gone through the syncytial and cellularization stages, the cells of the starchy endosperm are specialized in storage material accumulation (Morrison et al. 1975). Starch and protein accumulation in the endosperm peaks between 4 and 12 DAP (Gaur et al. 2010). Active starch accumulation in the endosperm is maintained until 20 DAP, after which the seed enters the desiccation phase (Ishimaru et al. 2003; Xu et al. 2008).

Reserve accumulation is the key part of endosperm development for determination of seed weight and yield. Due to the importance of the endosperm for rice, understanding the metabolic and developmental control of reserve accumulation constitutes a major focus of plant research. Mutant and transgenic analyses have identified some important genes essential for starch synthesis and endosperm quality. For example, constitutive overexpression of the OsNAS gene family has shown enormous potential for Fe and Zn biofortification of the rice endosperm (Cavel et al. 2011), the rice HGW gene has been implicated in heading date and grain weight (Johnson et al. 2011), and the transcription factor (TF) pBRP1 has been shown to be essential for rice endosperm growth control (Li et al. 2012). However, the molecular biology of the storage process is complex and not completely understood.

Besides the storage process, programmed cell death (PCD) is another important feature of endosperm development in rice, wheat and maize (Wei et al. 2002; Young and Gallie 1999; Young et al. 1997). Plants have evolved various means for controlled and organized cell destruction without releasing harmful substances into the remaining tissues (Gilchrist 1998). Most plant systems have two main forms of PCD, developmentally regulated and environmentally induced; however, PCD is part of normal development and growth (Lord and Gunawardena 2011). PCD is characterized by nuclear chromatin condensation, cytoplasmic shrinking, dilatation of endoplasmic reticulum, membrane blebbing and apoptotic body formation (Hatsugai et al. 2004). Endosperm cell death is an example of PCD in plants, a process that begins with nucleus degeneration (Wei et al. 2002). Following germination, aleurone cell development is another form of PCD within the endosperm (Wu et al. 2011). Some molecular and cellular mechanisms of PCD have been identified, such as a balance between ABA and ethylene in the progression of PCD (Bethke et al. 1999; Steffens and Sauter 2005), the role of the API5 gene in rice tapetum degradation (Li et al. 2011), pollen sterility in photoperiod-sensitive rice caused by decreased expression of the OsPDCD5 gene (Wang et al. 2010b), and the essential function of the mitochondrial hsp70 gene in PCD suppression in rice protoplasts (Qi et al. 2011). Although analyses of these mutants have facilitated the identification of several genes, this method cannot be used to identify all gene interactions and networks.

In the past few years, various powerful techniques, such as microarrays, cDNA or expressed sequence tag (EST) sequencing, and serial analysis of gene expression (SAGE), have enabled visualization of global changes in transcript abundance. These techniques have been used to describe the transcriptome of the endosperm and other compartments in Arabidopsis, wheat (Triticum aestivum) and maize (Zea mays) (Girke et al. 2000; Laudencia-Chingcuanco et al. 2007; Verza et al. 2005). However, they have some drawbacks, such as low throughput, high cost, low sensitivity, and cloning bias. Most importantly, the information collected using these methods is incomplete, providing only a fragmented picture of transcriptome patterns.

The advent of next generation high-throughput sequencing is drastically altering biological study, showing us a more thorough and comprehensive view of biological development (Lister et al. 2009; Marguerat and Bahler 2010). The RNA sequencing method (RNA-Seq) represents the latest and most powerful tool for characterizing the transcriptome (Wang et al. 2009). RNA-Seq has been applied mainly to humans, mammals and yeast, and is only now being applied to plant species (Bleeker et al. 2011; Xu et al. 2012). With a typical sequencing depth and sufficient sensitivity, RNA-Seq is more suitable and affordable for comparative gene expression studies than microarrays, and generates ultrahigh-throughput data including many low-abundance genes (Bellin et al. 2009; Wang et al. 2009). RNA-Seq methodologies are expected to supersede DNA microarrays. Several studies have demonstrated that RNA-Seq data represents integrated networks that more closely resemble the cellular biology of many plants (Bleeker et al. 2011; Xu et al. 2012). However, this technology has not been applied to endosperm development in rice. To analyze the molecular mechanism of rice endosperm development from an interaction and network perspective, we utilized deep RNA sequencing (the Illumina RNA-Seq method) to investigate rice endosperm development.

Endosperm samples from three early and important stages (3, 6, and 10 DAP) were subjected to genome-wide transcriptome analysis. From 3 to 10 DAP, the endosperm undergoes two important stages, cellularization and differentiation (Olsen et al. 1992). Moreover, this period is associated with a series of coordinated cellular and metabolic events, including the cell death of the endosperm nucleus, starch synthesis, and starch granule packaging, which lead to efficient accumulation of starch (Wei et al. 2002; Xu et al. 2008). For this reason, in this report we applied the RNA-Seq method to rice endosperm development, generating a large endosperm transcriptome data set. Bioinformatics analyses were used to obtain greater insight into gene networks and interactions. Finally, we report here a comprehensive analysis of the transcriptome that provides information regarding endosperm development at a molecular level and provides a foundation for future studies of endosperm development in rice and other plants.

Results

Illumina sequencing and de novo assembly

In this work, we sequenced three cDNA libraries taken from rice endosperm at 3, 6, and 10 DAP, named R1, R2, and R3, respectively. As a result, we generated 36.3 million sequence reads 42–50 bp in length. After removal of adaptor sequences, duplicate sequences, ambiguous reads and low-quality reads, 35.9 million high-quality clean reads (98.9 % of the raw data) remained. Clean reads were mapped to the rice (Oryza sativa L. ssp. indica) genome (http://www.gramene.org/) using SOAPaligner/soap2, with no more than two base mismatches allowed in the alignment. Of the total clean reads from the three samples, 72.3–75.6 % were a perfect match, and 11.2–11.9 % had no more than two base mismatches, while unmapped reads were 12.7–16.5 % (Table 1).
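
The mapping categories reported in Table 1 (perfect match, no more than two mismatches, unmapped) can be illustrated with a toy classifier. This is only a sketch of the counting rule, not a description of how SOAPaligner/soap2 works internally: a real aligner searches the whole genome, whereas the hypothetical functions below compare a read with a single, equally long reference window and ignore gaps.

```python
def count_mismatches(read: str, ref_window: str) -> int:
    """Count mismatched positions between a read and an equally long reference window."""
    if len(read) != len(ref_window):
        raise ValueError("read and reference window must have the same length")
    return sum(1 for a, b in zip(read, ref_window) if a != b)


def classify_alignment(read: str, ref_window: str, max_mismatches: int = 2) -> str:
    """Classify one ungapped read/window comparison.

    max_mismatches=2 mirrors the "no more than two base mismatches" rule in the text;
    the category labels are illustrative names, not aligner output.
    """
    mismatches = count_mismatches(read, ref_window)
    if mismatches == 0:
        return "perfect match"
    if mismatches <= max_mismatches:
        return "within mismatch limit"
    return "not aligned"
```

Counting reads per category over a whole library and dividing by the number of clean reads would give percentages of the kind listed in Table 1.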

Table 1 Reads number based on the RNA-Seq data in three libraries of rice endosperm development

One of the most basic tasks in RNA-Seq analysis is alignment of reads to the reference genome. After comparison with the rice genome and trimming, the 35.9 million clean reads were assembled into 23,836 genes. Among the three stages, the number of genes found was 21,596 (R1), 20,910 (R2) and 19,459 (R3), providing a large data set for endosperm development analysis. These numbers show an apparent decline in the number of expressed genes during endosperm development.

Gene coverage, defined as the ratio of the number of bases in a gene covered by uniquely mapping reads to the total number of bases in that gene, can reflect the quality of sequencing (Fig. 1). The figure shows that gene coverage was highly reproducible and quite uniform across the three RNA-Seq libraries. In addition, the size distribution of these genes is shown in Table 2. The most abundant size group was 500–1,000 bp (6,597; 27.7 % of all 23,836 genes), followed by 1,000–1,500 bp (5,982; 25.1 % of all genes) and no less than 2,000 bp (4,077; 17.1 % of all genes). A Venn diagram shows the overlap in expressed genes among the three stages (Fig. 2). The three developmental stages showed a large overlap (17,617). Some of these genes exhibited little variation or a low expression level, and were thought to be housekeeping genes or genes expressed at low abundance during endosperm development. We therefore focused our analysis on genes that had relatively high abundance and apparent changes.
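
The gene coverage ratio defined above can be sketched in a few lines. The function below is a minimal illustration, assuming 0-based half-open coordinates and a list of uniquely mapped read intervals; the function name and the coordinate convention are assumptions made for this example and are not taken from the analysis pipeline.

```python
def gene_coverage(gene_start: int, gene_end: int,
                  read_intervals: list[tuple[int, int]]) -> float:
    """Fraction of bases in [gene_start, gene_end) covered by at least one read."""
    gene_length = gene_end - gene_start
    if gene_length <= 0:
        raise ValueError("gene must have positive length")
    # Clip each read to the gene and drop reads that do not overlap it.
    clipped = sorted(
        (max(start, gene_start), min(end, gene_end))
        for start, end in read_intervals
        if end > gene_start and start < gene_end
    )
    covered = 0
    current_end = gene_start  # right edge of the stretch already counted
    for start, end in clipped:
        if end <= current_end:
            continue  # read lies entirely inside bases already counted
        covered += end - max(start, current_end)
        current_end = end
    return covered / gene_length
```

For a 100-bp gene with reads at (0, 50), (40, 60) and (90, 120), overlapping bases are counted once and the part of the last read beyond the gene is ignored, giving a coverage of 0.7.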

Fig. 1

Gene coverage of the three stages mapped to the rice genome. cDNA libraries were prepared from rice endosperm at 3, 6 and 10 DAP and named R1, R2 and R3, respectively. Gene coverage is the percentage of a gene covered by reads, equal to the ratio of the number of bases in a gene covered by uniquely mapping reads to the total number of bases in that gene. Results are shown as percentages of the total number of genes

Table 2 Distribution of all 23,836 genes detected in rice endosperm development via RNA-Seq technology

Fig. 2

Venn diagram showing the genes expressed in each of the three stages of rice endosperm development. Among all genes, 17,617 are expressed at all three developmental stages, 1,811 are co-expressed in R1 and R2, 625 are co-expressed in R2 and R3, and 459 are co-expressed in R1 and R3. The number of stage-specifically expressed genes is 1,709 (R1), 1,543 (R2), and 857 (R3), respectively
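
The regions of a three-set Venn diagram such as Fig. 2 follow from plain set operations on the lists of expressed gene identifiers. The sketch below assumes each stage is available as a Python set of gene IDs; the function name, the region labels and any IDs passed to it are hypothetical.

```python
def venn_regions(r1: set[str], r2: set[str], r3: set[str]) -> dict[str, int]:
    """Count the genes in each of the seven regions of a three-set Venn diagram."""
    return {
        "all three": len(r1 & r2 & r3),
        "R1 and R2 only": len((r1 & r2) - r3),
        "R2 and R3 only": len((r2 & r3) - r1),
        "R1 and R3 only": len((r1 & r3) - r2),
        "R1 only": len(r1 - r2 - r3),
        "R2 only": len(r2 - r1 - r3),
        "R3 only": len(r3 - r1 - r2),
    }
```

The seven counts sum to the size of the union of the three sets, which is a convenient consistency check on any diagram built this way.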

RNA-Seq global data analysis and evaluation of differential gene expression

Gene expression levels were calculated using the RPKM (reads per kb per million reads) method. Detailed expression values for the three developmental stages are shown in Supplemental Table 1. There was an obvious decline in the total RPKM quantity from 3 to 10 DAP, in line with the decline in the number of expressed genes at each developmental stage (Supplemental Table 1).
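
As a reminder of what the RPKM value expresses, the standard definition is RPKM = 10^9 × C / (N × L), where C is the number of reads mapped to the gene, N is the total number of mapped reads in the library and L is the gene length in bp. The short function below is a sketch of that formula only; the parameter names are chosen for this example, and the exact read-counting rules of the pipeline are not reproduced.

```python
def rpkm(gene_reads: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of gene per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # 1e9 = 1e3 (bp per kb) * 1e6 (reads per million)
    return gene_reads * 1e9 / (gene_length_bp * total_mapped_reads)
```

For instance, 500 reads on a 2,000-bp gene in a library of 10 million mapped reads correspond to an RPKM of 25. Normalizing by both length and library size is what makes values comparable between genes and between R1, R2 and R3.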

Analysis of differentially expressed genes (DEGs) between samples is a central aim of RNA-Seq. A gene was defined as a DEG when its normalized (RPKM) expression changed at least twofold in either direction (|log2Ratio| ≥1) with a false discovery rate (FDR) ≤0.001. We also identified the changes in gene expression between the three stages (Table 3). Using ERANGE (version 4.0) (http://woldlab.caltech.edu/gitweb/) we identified 10,371 DEGs in three comparisons (R1-v-R2, R2-v-R3 and R1-v-R3) (Supplemental Table 2). To gain further insight, we separated the DEGs into three groups according to fold change: 1–5 fold (1 ≤log2Ratio <2.32, FDR ≤0.001), 5–10 fold (2.32 ≤log2Ratio <3.32, FDR ≤0.001) and ≥10 fold (log2Ratio ≥3.32, FDR ≤0.001) (Table 3). Differences among the three stages were marked, with many genes differentially expressed at 5–10 fold (2,623) and ≥10 fold (2,951), indicating marked expression changes during the three stages of endosperm development. The overall trend in all the comparisons was down-regulation of gene expression; only a minority of genes were up-regulated during rice endosperm development.
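
The DEG thresholds and fold-change groups described above can be written out as a small sketch. It assumes that the log2Ratio is compared by absolute value, and it replaces zero RPKM values with a small floor to avoid division by zero; both the floor value and the function names are assumptions of this example, and the FDR calculation itself is not shown.

```python
import math


def log2_ratio(rpkm_a: float, rpkm_b: float, floor: float = 0.001) -> float:
    """log2(rpkm_b / rpkm_a); values below `floor` are raised to it (an assumed convention)."""
    return math.log2(max(rpkm_b, floor) / max(rpkm_a, floor))


def fold_change_group(log2ratio: float, fdr: float, fdr_cutoff: float = 0.001):
    """Return the Table 3 group label for a gene, or None if it is not a DEG."""
    magnitude = abs(log2ratio)
    if fdr > fdr_cutoff or magnitude < 1:
        return None
    if magnitude < 2.32:   # log2(5) is approximately 2.32
        return "1-5 fold"
    if magnitude < 3.32:   # log2(10) is approximately 3.32
        return "5-10 fold"
    return ">=10 fold"
```

The sign of the log2Ratio distinguishes up-regulation from down-regulation, while the absolute value selects the group.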

Table 3 All 10,371 DEGs divided according to the folds of RPKM value between two stages in rice endosperm development

On the basis of RPKM, gene expression could be separated into six significant expression patterns, suggesting different roles in the three stages of rice endosperm development. Patterns 1 and 2 contained genes that were up-regulated or down-regulated, respectively, throughout all three stages of development. Genes that were up-regulated at 3–6 DAP and down-regulated at 6–10 DAP were classified into Pattern 3. Genes in Pattern 4 were down-regulated from 3 to 6 DAP and then up-regulated until 10 DAP. Among all 10,371 DEGs, Pattern 2 was the most abundant, containing 8,563 (82.57 %) genes. The second most abundant was Pattern 1, which contained 621 (5.99 %) genes. Pattern 3 consisted of 616 (5.94 %) genes, and Pattern 4 of 505 (4.87 %) genes. There were two other expression patterns. Fifty-four (0.52 %) genes, expressed only at 3 DAP, were defined as Pattern 5; and 12 (0.11 %) genes, expressed only at 10 DAP, were classified as Pattern 6. These expression patterns suggest diverse and complex interactions in endosperm development and regulation.
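
The six patterns can be expressed as a simple rule on the three RPKM values of a gene. The sketch below uses the sign of the change between consecutive stages; how ties and non-significant changes were treated is not stated in the text, so returning None for them is an assumption of this example, as is the function name.

```python
def expression_pattern(r1: float, r2: float, r3: float):
    """Return the pattern number (1-6) for RPKM values at 3, 6 and 10 DAP, or None."""
    if r1 > 0 and r2 == 0 and r3 == 0:
        return 5  # expressed only at 3 DAP
    if r1 == 0 and r2 == 0 and r3 > 0:
        return 6  # expressed only at 10 DAP
    first, second = r2 - r1, r3 - r2
    if first > 0 and second > 0:
        return 1  # up-regulated throughout
    if first < 0 and second < 0:
        return 2  # down-regulated throughout
    if first > 0 and second < 0:
        return 3  # up at 3-6 DAP, then down at 6-10 DAP
    if first < 0 and second > 0:
        return 4  # down at 3-6 DAP, then up at 6-10 DAP
    return None   # flat segment: not covered by the six patterns
```

Patterns 5 and 6 are tested first because a gene expressed at a single stage would otherwise be absorbed into Pattern 2 or Pattern 1.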

Hierarchical clustering of DEGs among the three developmental stages

Genes with similar expression patterns are often functionally correlated. To identify clusters with functional enrichment, we performed hierarchical clustering of the expression patterns of all DEGs and found that there was a similarity relationship in gene expression among three stages (Fig. 3). We performed a cluster analysis of gene expression patterns with functional enrichment using the Pearson correlation as the distance metric. The familiar log ratio values were used for gene expression analysis. Transformation of the data allowed the hierarchical clustering plot to display expression patterns for a subset of DEGs in each of the three comparisons (R1-v-R2, R2-v-R3 and R1-v-R3) among the three stages. Separate plots are shown for each cluster with the means and standard deviations of the RPKM expression values. Groups of genes with similar expression patterns may have the same function in endosperm development, and the plots illustrate the similarities and differences among the three developmental stages of the rice endosperm. The hierarchical clustering analysis indicates that the majority of genes were down-regulated in all three comparisons, and that a minority of genes were up-regulated.
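
To make the distance metric concrete, the sketch below computes a Pearson-correlation-based distance between two expression profiles (for example, the log ratios of two genes across the three comparisons). Converting the correlation r into a distance as 1 − r is a common convention and an assumption here; the text does not state the exact transformation or the linkage method used for the clustering.

```python
import math


def pearson_correlation(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def pearson_distance(x: list[float], y: list[float]) -> float:
    """Distance in [0, 2]: 0 for perfectly correlated, 2 for perfectly anti-correlated."""
    return 1.0 - pearson_correlation(x, y)
```

With this metric, two genes whose profiles rise and fall together are close even if their absolute expression levels differ, which is why it suits grouping genes by the shape of their expression pattern.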

Fig. 3

Hierarchical cluster analysis of gene expression based on log ratio RPKM data. The clusters display expression patterns for a subset of DEGs in each of the three comparisons (R1-v-R2, R2-v-R3 and R1-v-R3) among the three stages. The color key represents RPKM normalized log2 transformed counts. Red represents high expression, green represents low expression. Each column represents an experimental condition (e.g., R1-v-R2), each row represents a gene

Gene ontology (GO) analysis of RNA-Seq data

The large amount of data generated by RNA-Seq demands efficient tools for data visualization. To acquire complete functional information, GO terms were assigned to each assembled gene. As an international standardized gene functional classification system, GO offers both a dynamically updated controlled vocabulary and strictly defined concepts to comprehensively describe the properties of genes and their products. It can assist understanding the distribution of gene functions at the macro level. Through the GO analysis, all genes were separated into three major parts (i.e., biological process, molecular function, and cellular component). Using BLAST2GO (version 2.3.5) (http://www.blast2go.org/), a total of 157,650 GO terms were associated with all 23,836 genes and classified into 41 functional subcategory annotations (Fig. 4). Cellular components comprised the majority of the GO annotations (27,675; 50.4 %), followed by biological processes (17,649; 32.2 %) and molecular functions (9,551; 17.4 %). The major sub-categories are shown in Fig. 4a together with an analysis of all genes among the three development stages of endosperm development. In the three categories, the six sub-categories; cell (GO:0005623), organelle (GO:0043226), binding (GO:0005488), catalytic activity (GO:0003824), cellular process (GO:0009987) and metabolic process (GO:0008152) were included in the top six most-abundant sub-groups. All genes were associated with at least one GO term. These GO annotations demonstrate that the rice endosperm expressed genes encoding diverse structural, regulatory and stress proteins. We also analyzed some other GO patterns. GO analysis of the three comparisons (R1-v-R2, R2-v-R3 and R1-v-R3), in which all of the 10,371 DEGs were associated with 17,767 GO terms in rice, is shown in Fig. 4b. GO analyses of up-regulated and down-regulated genes in R1-v-R2 and R2-v-R3 are shown in Fig. 4c and d. 
Between R1 and R2, 400 up-regulated genes were associated with 1,871 GO terms and 4,103 down-regulated genes were associated with 17,138 GO terms. Moreover, 433 up-regulated genes were associated with 1,854 GO terms and 4,737 down-regulated genes were associated with 22,367 GO terms between R2 and R3. GO analysis also showed that several terms differed significantly among the three comparisons. A difference was considered statistically significant if the p value based on gene percentage was below 0.05; terms such as organelle (GO:0043226), binding (GO:0005488) and cellular process (GO:0009987) differed significantly in all three comparisons (R1-v-R2, R2-v-R3 and R1-v-R3). Catalytic activity (GO:0003824), multicellular organismal process (GO:0032501) and three other sub-categories exhibited statistically significant differences between up-regulated and down-regulated genes between R1 and R2. Analysis of GO terms for the DEGs between R2 and R3 showed that 11 sub-categories, including macromolecular complex (GO:0032991), structural molecule activity (GO:0005198) and response to stimulus (GO:0050896), differed significantly in terms of gene percentage. The GO annotation assignment determined that the genes expressed in this study encode diverse structural, regulatory and metabolic proteins.

Fig. 4

GO classifications of genes. The results are summarized in three main categories: biological process, molecular function and cellular component by GO analysis. a GO classifications of all genes between three stages. b GO classifications of all 10,371 DEGs among three contrasts. c GO analysis of the up-regulated and down-regulated genes in R1-v-R2. d GO analysis of the up-regulated and down-regulated genes in R2-v-R3 (*P < 0.05)

KEGG and MapMan pathway analyses

The interaction of genes with one another plays an important role in biological functions. Pathway-based analysis helps to further understand the biological functions of genes. To perform functional classification and pathway assignment of genes during rice endosperm development, all DEGs were analyzed using the KEGG (Kyoto Encyclopedia of Genes and Genomes) database and the MapMan tool.

KEGG is the major public pathway-related database, providing classification that is valuable for research on genetically and biologically complex behaviors. It contains systematic analysis of inner-cell metabolic pathways and functions of gene products, which aid in studying the complex biological behaviors of genes. By mapping to the reference canonical pathways, all DEGs were assigned to 123 KEGG pathways (Supplemental Table 3). On the basis of RPKM, we found four significant expression types among all pathways. There were 84 down-regulated pathways in Type 2, indicated by a negative slope during endosperm development. Pathways in Type 3 were the second-most abundant group, containing 25 that were up-regulated between 3 and 6 DAP and then down-regulated until 10 DAP. Eight pathways in Type 4 began to be down-regulated between 3 and 6 DAP then up-regulated at 6–10 DAP. Finally, six pathways that were positively modulated throughout all three stages were classified as Type 1. A summary of KEGG analyses is shown in Table 4. A considerable number of pathways were altered, providing a macroscopic view of the KEGG pathways analysis. The profound differences between physiological processes in rice endosperm development suggest that the transcript profile varies substantially among the stages analyzed.

Table 4 The considerably changed KEGG pathways in the RNA-seq analysis of endosperm development

In recent years, there has been an explosion in the number of biological databases available. These different classification systems could provide distinct insights into gene function. We used a more comprehensive tool, the MapMan package (http://MapMan.gabipd.org), to visualize more thoroughly the pathways involved in rice endosperm development. The MapMan tool uses input from a number of experts to curate specific biological processes using the information from the TIGR database (http://compbio.dfci.harvard.edu/tgi/). MapMan classifies genes into 35 major pathways and 211 branch pathways, and we used this classification to visualize the pathways of rice endosperm development. An overview of all DEGs between R1 and R3 in terms of cellular metabolism and regulation pathways is shown in Fig. 5. We customized the figure to depict the biological processes of interest and to display log2 RPKM-normalized expression counts on pictorial diagrams. On the basis of RPKM, 159 pathways were down-regulated in endosperm development (R1-v-R3). The downward trend in overall transcription is apparent. Only 14 pathways, such as biotic stress and storage protein development, were up-regulated. The numbers of pathways in Types 3 and 4 expression patterns were 23 and 15, respectively. A more detailed list of all DEGs corresponding to MapMan functional categories is provided in Supplemental Table 4. These visual annotations are a valuable resource for investigation of pathways in endosperm development and suggest that many genes act cooperatively.

Fig. 5

MapMan overview of cellular metabolism (a) and regulation (b) showing all DEGs between R1 and R3. Individual genes are represented by small squares. The color key represents RPKM normalized log2 transformed counts. Red represents up-regulation and dark blue represents down-regulation between R1 and R3

Transcription factor (TF) expression during rice endosperm development

Plant gene expression regulation is a complicated network. Through specific interactions with cis-acting target elements, proteins called TFs regulate a series of relevant downstream targets, playing an important role in plant development, evolution and the response to environmental stress. To reveal the molecular mechanisms of events involving TFs in rice endosperm development, we used the Rice TF database (http://drtf.cbi.pku.edu.cn/) to search for TFs. In total, 1,118 putative TF genes of 55 TF families were identified in 23,836 genes from RNA-Seq analysis (Supplemental Table 5). This provided a deeper understanding of the role of TFs during rice endosperm development. On the basis of RPKM, most TF family classes from the most abundant transcripts (in three libraries) were significantly down-regulated; however, some families were up-regulated. For example, the expression pattern of genes encoding the Dof family and five other families were of Pattern 1, those that were up-regulated throughout the entire developmental process. The GRAS family and two other families were Pattern 4, those that were down-regulated from 3 to 6 DAP and up-regulated until 10 DAP. The NF-YA family showed up-regulation from 3 to 6 DAP and then down-regulated; this was classified as Pattern 3. Some TF families, such as the Dof and the NAC family, will be discussed later in detail. A more in-depth analysis is needed to determine how the above gene families are linked to the molecular and cellular changes that occur during rice endosperm development.

Real-time quantitative PCR validation of RNA-Seq results

To verify the RNA-Seq results, an appropriate alternate strategy was chosen. Fifteen genes were randomly selected for validation using Real-time qPCR with gene-specific primers designed using Primer Premier software (version 5.0) (Supplemental Table 8). The results are shown in Fig. 6. Expression patterns determined by Real-time qPCR were consistent with those by RNA-Seq, confirming the accuracy of the results of the latter.

Fig. 6

Results of Real-time qPCR validation, showing the relative expression of the fifteen genes tested. The columns denote the mean; the bars denote the standard deviation. The Real-time qPCR data are expressed as the mean ± SD after normalization

Discussion

Storage-related genes are up-regulated during endosperm development

Storage of sugars, amino acids, and other important metabolites is critical for rice endosperm development and has the greatest effect on milling yield and quality (Xu et al. 2008). These metabolites are allocated to various biosynthetic pathways, primarily starch metabolism and protein biosynthesis and storage, which are responsible for synthesizing starch and proteins in precise quantities and ratios (Venu et al. 2011). According to the RNA-Seq results, genes and pathways related to storage processes for the accumulation of starch and sucrose were markedly up-regulated during endosperm development (R1-v-R3). The top 100 up-regulated genes between R1 and R3 (Supplemental Table 6) were analyzed. Each of the top 10 up-regulated genes was related to starch metabolism and storage protein biosynthesis. Furthermore, 29 storage genes were found in the top 100 up-regulated genes. Storage during rice endosperm development is also associated with complex pathways. In the KEGG analysis, the pathways up-regulated to the greatest extent were starch and sucrose metabolism (20861.2) in the endosperm at 3–10 DAP. Amino sugar and nucleotide sugar metabolism (8551.1) and carbon fixation in photosynthetic organisms (4624.8) comprised the remaining top three up-regulated pathways. In addition, the MapMan analysis suggested that storage-related pathways were markedly up-regulated during endosperm development. Dof transcription factor proteins participate in seed storage protein synthesis in the developing endosperm (Mena et al. 1998); indeed, we found 11 genes that encoded Dof proteins. The expression of the Dof family was up-regulated during all three stages, indicating that Dof proteins play a role in the storage process during endosperm development. Other genes may also play important roles in storage processes. For example, BGIOSGA006533 was up-regulated by 423-fold during endosperm development. 
To increase the storage and nutritional value of rice seeds, further analysis of such candidate genes and pathways is necessary. In summary, the up-regulated expression of storage-related genes suggests a role in the accumulation of sucrose and the mobilization of starch reserves at the beginning of storage mobilization.

Programmed cell death-related genes during endosperm development

Using transmission electron microscopy, programmed cell-death (PCD) was identified in rice starchy endosperm cells during development (Wei et al. 2002). The endosperm cells initiated nucleus degeneration at 5 DAP. Several PCD morphological features were observed in the nucleus of endosperm cells, such as nucleus deformation, chromatin condensation, disruption of the nuclear envelope and leakage of the nuclear matrix. Almost the entire nucleus of starchy endosperm cells was degraded by 13 DAP (Wei et al. 2002; Lan et al. 2004). We analyzed endosperm development from an integrated and network perspective and found evidence of many molecular aspects of this process. First, the total number of genes expressed during the three stages and the total expression levels reflected an obvious decline in expression over time. Compared to the previous stage, the quantity of genes expressed in R2 and R3 decreased by 3 and 16 %, respectively. Also, in terms of gene expression levels, the RPKM of R2 decreased by 14 % compared with R1, whereas R3 decreased by 7.7 % compared with R2. Second, 91.1, 91.6 and 93 % of genes were down-regulated in the three comparisons (R1-v-R2, R2-v-R3 and R1-v-R3), respectively, constituting the largest portion of the total. Third, only 13.8 % of the 123 KEGG pathways (mainly storage pathways) were up-regulated. MapMan analysis suggested that only 31 of 211 pathways were up-regulated. Moreover, the expression of TF families exhibited a declining trend. Of the 55 TF families, only 7 were up-regulated during endosperm development (R1-v-R3). Thus the majority of genes and pathways were down-regulated, with the exception of those associated with storage material accumulation, such as starch and sucrose metabolism. This down-regulated expression was closely related to cellular events during endosperm development, agreeing with previous reports of nucleus degeneration in rice endosperm development (Wei et al. 2002; Lan et al. 2004). 
Our results represent molecular evidence of this phenomenon and an expansion of current knowledge. The most likely reason for the dramatic changes in the endosperm transcriptome is PCD within the endosperm.
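
The first two down-regulation percentages quoted above can be reproduced from the up- and down-regulated gene counts reported in the GO analysis (400 up and 4,103 down for R1-v-R2; 433 up and 4,737 down for R2-v-R3). The helper below is a minimal arithmetic sketch with a name chosen for this example.

```python
def percent_down_regulated(up: int, down: int) -> float:
    """Percentage of DEGs in one comparison that are down-regulated."""
    total = up + down
    if total <= 0:
        raise ValueError("at least one DEG is required")
    return 100.0 * down / total


r1_v_r2 = percent_down_regulated(up=400, down=4103)   # rounds to 91.1
r2_v_r3 = percent_down_regulated(up=433, down=4737)   # rounds to 91.6
```

The counts for R1-v-R3 are not given in the text, so the third percentage is not recomputed here.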

PCD is initiated and regulated by known genes, some of which are also important in rice development. For example, knockout of the rice API5 gene delayed degeneration of the tapetum due to inhibition of the tapetal PCD process and consequently caused defects in male gametophyte formation (Li et al. 2011). We detected up-regulation of the API5 gene beginning at 3 DAP, followed by down-regulation at 6 DAP. On the other hand, over-expression of mitochondrial heat shock protein 70 suppresses PCD in rice protoplasts (Qi et al. 2011), and the hsp70 gene was down-regulated in this analysis. We also detected 11 other PCD-related genes, which have been identified in previous plant research (Li et al. 2008; Chen et al. 2009; Wang et al. 2010b; Li et al. 2011; Qi et al. 2011; Jiao et al. 2012) (Supplemental Table 7). Not all PCD-related genes showed an obvious change in expression during endosperm development. This indicates that the process of PCD in the rice endosperm is complex and not identical to that in other tissues. Interestingly, two genes (OsBIRH1 and OsLOL2) related to the biotic stress response were up-regulated. These may be correlated with strong stress resistance during rice seed germination.

PCD plays an essential role in plant development and survival in response to abiotic and biotic factors such as pathogens, drought and high salinity (Lee and Luan 2011). Here, MapMan analysis identified up-regulation of 81 members of the biotic stress-response pathway. Within the top 100 up-regulated genes, we found five genes encoding allergenic proteins, suggesting an allergenic stimulus. Among TFs, the NAC family has been reported to participate in plant responses to pathogens, viral infection, and environmental stimuli (Kim et al. 2007). We found 90 members of the NAC family, and the overall expression of the family was up-regulated, suggesting their involvement in the biotic stress-response process. OsBIRH1 (BGIOSGA024643), which encodes a DEAD-box RNA helicase that functions in modulation of responses to pathogen infection and oxidative stress (Li et al. 2008), was up-regulated during the early stages (R1-v-R2). We suspect that the first step of PCD in endosperm development is the biotic stress response, especially the allergenic process.

The plant cystatins comprise a family of specific inhibitors of cysteine proteinases, which are thought to be involved in the regulation of several endogenous processes and in defense against pests and pathogens (Martinez et al. 2005). Thus we evaluated cystatin gene expression to validate the up-regulation of the biotic stress-response process. Eleven cystatin genes were found, most of which were up-regulated at 3 DAP and down-regulated at 6 DAP. This suggests a role for cystatins in the process of PCD during endosperm development.

Previous studies have suggested that PCD can be induced by ethylene and gibberellic acid (GA) (Steffens and Sauter 2005), likely due to the increased sensitivity of epidermal cells to GA. Abscisic acid (ABA) delays this promoted cell death, possibly due to the decreased sensitivity of epidermal cells to GA. During barley (Hordeum vulgare) seed development, GA promotes cell death in aleurone cells while ABA inhibits it (Bethke et al. 1999). The MapMan analysis indicated down-regulation of 20 members of the abscisic acid metabolism pathway. Conversely, six components of the gibberellin pathway were up-regulated at 6 and 10 DAP. This suggests involvement of these hormones in PCD during rice endosperm development.

There have been a number of earlier reports of gene expression analysis during endosperm and seed development in rice. A microarray comparison of gene expression profiles in wild type rice and an endospermless mutant found that, in the wild type, genes for defense against diseases or response to stress, as well as genes for transporter proteins, were strongly expressed at 2–3 DAP, whereas genes for storage proteins and enzymes of starch synthesis were strongly expressed at 4–6 DAP (Kondou et al. 2006). A further transcriptome comparison between a near-isogenic line with high chalkiness (a chromosome segment substitution line) and a parental line with normal endosperm chalkiness found 623 differentially expressed genes, which could be classified into 18 major categories (Liu et al. 2010). Pathway analysis demonstrated that expression of genes involved in sucrose and starch synthesis was up-regulated in the near-isogenic line, whereas expression of those involved in non-starch polysaccharide synthesis was down-regulated. A deep transcriptional analysis of developing rice seeds differing in milling yield and eating quality (Venu et al. 2011) found that many DEGs were involved in the biosynthesis of starch, aspartate family amino acids, and storage proteins. A DNA microarray analysis of the rice endosperm transcriptome reported that many genes related to starch and protein metabolism were highly expressed (Xue et al. 2012). Our results share some similarities with these studies; for example, a common feature is that many storage-related genes were strongly expressed during endosperm development. Here we present for the first time the use of the RNA-Seq technique for this analysis, which represents the latest and most powerful tool for characterizing the transcriptome of rice endosperm development. We also provide an integrative view of gene expression in endosperm development, supported by a detailed analysis of gene expression and a comprehensive bioinformatics study. Significantly, our RNA-Seq results provide additional molecular evidence supporting PCD in rice endosperm development, which has previously been shown by microscopic analysis.

In addition, the new RNA-Seq data could be combined with the embryo investigations (Xu et al. 2012) to provide an integrated view of gene regulation in rice seed development. We found that gene expression and regulation differ between rice endosperm and embryo development in two respects. First, both the number of expressed genes and the overall level of gene expression were steady during embryo development, whereas in the endosperm both showed an obvious downward trend. Second, pathway analysis showed both similarities and differences. The expression of genes for starch and sucrose metabolism was up-regulated gradually in both endosperm and embryo development. On the other hand, the up-regulated gene categories in endosperm development were related to defense against disease, response to stress, and inositol phosphate metabolism, while the up-regulated gene category in embryogenesis corresponded to protein modification. These results show a distinct time course of expression patterns during seed development, and a divergence in transcriptional programs between endosperm and embryo development.

In conclusion, the transcription analysis by RNA-Seq provided integrative information for understanding rice endosperm development. First, a subset of genes that were differentially expressed during the three developmental stages was identified, suggesting the occurrence of various metabolic and cellular processes during endosperm development. A large number of genes related to purine metabolism, ribosomes, the spliceosome, and oxidative phosphorylation were detected in the early and middle stages. Expression of some genes increased significantly during the middle stage, such as those related to plant hormones, glycolysis, galactose metabolism and carbon fixation in photosynthetic organisms. During the middle and late stages, genes related to defense against disease or the response to stress, inositol phosphate metabolism, and starch and sucrose metabolism were expressed strongly. Second, some pathways were positively modulated throughout endosperm development, such as auxin metabolism and biotic stress. The majority of pathways were down-regulated, e.g., regulation of transcription and cell organization. Some pathways, however, had low expression during the middle stages of endosperm development, including calcium signaling and lipid degradation. Pathways for amino acid synthesis, alkaloid biosynthesis and major CHO metabolism, as well as 13 other pathways, were induced at 6 DAP. In addition, some potential regulators of endosperm development, particularly TF families, were identified. This integrative information suggests that RNA-Seq analysis facilitates elucidation of the molecular mechanisms of endosperm development, and indicates the complex regulation of rice seed development.
In summary, we present here the application of RNA-Seq to early rice endosperm development and report a comprehensive analysis of the transcriptome that may serve as a blueprint of the gene expression profile during rice endosperm development.

Materials and methods

Plant materials

Endosperm was obtained from rice (O. sativa L. ssp. indica cv. 9311) plants grown in a greenhouse at Wuhan University, China. Caryopses were tagged at the initiation of pollination. The samples consisted of three replicates arranged in a randomized complete block design. For each replicate, developing seeds were collected from 15 different plants at 3, 6, and 10 DAP (Fig. 7) during the growing season. Endosperm at each of the three stages was dissected from the caryopses with microdissection needles under a dissection microscope (Olympus, Tokyo, Japan). The materials were immediately frozen in liquid nitrogen and kept at −80 °C until required for RNA extraction.

Fig. 7 Plant material. Caryopses were tagged at the initiation of pollination, and the developing seeds were harvested at 3, 6, and 10 DAP. Endosperm is the major part of the developing seed

Preparation of RNA-Seq libraries

Total RNA was prepared using Trizol according to the manufacturer’s protocol (Invitrogen, Burlington, ON, Canada). The yield and purity of each RNA sample were determined from the absorbance (Abs) at 260 and 280 nm, and the integrity of all RNA samples was monitored on 1 % agarose gels. The total RNA samples were pooled by stage into three samples, and 10 μg of total RNA from each sample was used to isolate poly(A) mRNA and to prepare a nondirectional Illumina RNA-Seq library with an mRNA-Seq 8 Sample Prep Kit (Illumina). Each library had an insert size of 200 bp, and 42–50 bp sequences were generated on an Illumina HiSeq™ 2000.

Mapping reads to the reference genome and annotated genes

The rice genome and gene information were downloaded from the Rice Genome Annotation Project (http://rice.plantbiology.msu.edu). Raw image data from sequencing were transformed by base calling into sequence data. Prior to mapping to the reference database, the raw reads were cleaned by removing adaptor sequences, empty reads and low quality sequences. Transcriptome de novo assembly was carried out with the assembling program SOAPdenovo (http://www.soap.genomics.org.cn/soapdenovo.html) (Li et al. 2009), and reads were aligned to the rice genome. Digital gene expression analysis was performed on reads that mapped uniquely to the reference genome with no more than two mismatches (Balwierz et al. 2009). Reads that failed to be mapped were progressively trimmed, one base at a time from the 3′ end, and mapped to the genome again until a match was found (unless the read had been trimmed by 27 bases). For single-end reads, the insert between paired reads was set as 1 base to 5 kilobases, allowing them to span introns of various sizes in the genome. The same strategy was used to align single-end reads to the non-redundant gene set, except that the insert was changed to 1 base to 1 kilobase.

Evaluation of genes from RNA-Seq and GO analysis

Gene expression was evaluated with ERANGE software (version 4.0), which assigns reads to their site of origin and counts them (http://woldlab.caltech.edu/gitweb/). Raw digital gene expression counts were normalized by a variation of the RPKM (reads per kb per million reads) method, which eliminates the influence of differences in gene length and in the amount of sequencing data on the calculation of gene expression. The calculated gene expression could therefore be used directly to compare gene expression among samples. Across all RPKM values for each gene, the cutoff for determining gene transcriptional activity was based on a 95 % confidence threshold. The Blast2GO program (version 2.3.5) (http://www.blast2go.org/) was used to obtain GO annotations for all genes (Conesa et al. 2005). The WEGO software was then used with default parameters to perform GO functional classification of all genes, giving a macro-level view of the distribution of gene functions in the species, with a robust FDR correction to obtain adjusted P values (Ye et al. 2006).
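
For readers unfamiliar with RPKM, the following Python sketch shows the standard definition, RPKM = 10^9 × C / (N × L), where C is the number of reads assigned to a gene, N is the total number of mapped reads in the sample, and L is the gene length in bases. The variation applied in this study is not specified here, so the function below is only the textbook form, and the example numbers are hypothetical.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Standard RPKM: reads per kilobase of gene per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

if __name__ == "__main__":
    # Hypothetical gene: 500 reads, 2,000 bp long, in a sample of 10 million mapped reads
    print(rpkm(500, 2000, 10_000_000))
```

Dividing by both gene length and sample size is what makes values comparable between genes of different lengths and between libraries of different sizes.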

Analysis of differentially expressed genes (DEGs) during rice endosperm development

We applied the R package DEGseq to identify DEGs with the random sampling model based on the read count for each gene at different developmental stages (Wang et al. 2010a). For multiple testing, the false discovery rate (FDR) was used to determine the P value threshold. We used FDR <0.001 and an absolute value of log2Ratio ≥1 as the threshold for judging the significance of each difference in gene expression. For pathway analysis, we mapped all DEGs to terms in the KEGG and MapMan databases, and then looked for pathway terms that were significantly enriched compared to the genome background. KEGG pathway analyses were performed using Cytoscape software (version 2.6.2) (http://www.cytoscape.org/) with the ClueGO plugin (http://www.ici.upmc.fr/cluego/cluegoDownload.shtml) (Bindea et al. 2009). The MapMan package (http://MapMan.gabipd.org) was used to visualize more thoroughly the pathways involved in rice endosperm development.
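
The two-part significance threshold (FDR <0.001 and |log2Ratio| ≥1) can be expressed as a simple filter. The Python sketch below assumes that an FDR value has already been computed for each gene by an upstream tool, and that both expression values are positive. The class name, field names and example values are hypothetical, and the handling of zero expression values is deliberately left out because it is not described here.

```python
import math
from dataclasses import dataclass

@dataclass
class GeneComparison:
    gene_id: str   # hypothetical identifier
    expr_a: float  # normalized expression at the earlier stage (must be > 0)
    expr_b: float  # normalized expression at the later stage (must be > 0)
    fdr: float     # FDR-adjusted P value supplied by the upstream test

def log2_ratio(g: GeneComparison) -> float:
    """log2 of later-stage over earlier-stage expression."""
    return math.log2(g.expr_b / g.expr_a)

def classify(g: GeneComparison, fdr_cutoff: float = 0.001,
             min_abs_log2: float = 1.0) -> str:
    """Return 'up', 'down' or 'not significant' under the stated thresholds."""
    if g.fdr >= fdr_cutoff:
        return "not significant"
    ratio = log2_ratio(g)
    if ratio >= min_abs_log2:
        return "up"
    if ratio <= -min_abs_log2:
        return "down"
    return "not significant"

if __name__ == "__main__":
    for gene in (GeneComparison("geneA", 10.0, 40.0, 1e-5),
                 GeneComparison("geneB", 40.0, 10.0, 1e-5),
                 GeneComparison("geneC", 10.0, 15.0, 1e-5)):
        print(gene.gene_id, classify(gene))
```

A log2Ratio of 1 corresponds to a two-fold change, so the filter keeps only genes that at least double or halve in expression between the two stages.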

Real-time quantitative PCR validation of RNA-Seq data

Fifteen genes were randomly selected for validation using real-time qPCR. Primer sets were designed with the Primer5 software (Supplemental Table 8), and primer pairs were designed to span exon–exon junctions. Each 20 μl reaction contained 1 μl of diluted cDNA, 0.6 μl each of the forward and reverse primers, 7.4 μl of ddH2O, 0.4 μl of ROX and 10 μl of the PCR master mixture (Thunderbird SYBR qPCR Mix, Toyobo). The following PCR program was used: 95 °C for 10 min, followed by 40 cycles of 95 °C for 15 s and 60 °C for 1 min. A heat dissociation curve (60–95 °C) following the final PCR cycle was examined to test the specificity of the PCR amplification. Using the ABI StepOne™ Real-Time qPCR analysis system (Applied Biosystems), relative quantification was performed by the relative standard curve method according to the threshold cycle (CT) values generated. UBQ5 was used as the internal control to standardize the results. All samples were run in triplicate in separate tubes and each cDNA sample was run in duplicate. All data were expressed as the mean ± SD after normalization.
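
As an illustration of the relative standard curve approach, the Python sketch below fits a line of CT against log10 of template quantity from a dilution series, converts sample CT values back to relative quantities, and normalizes the target gene to the internal control (UBQ5 in this study). The dilution series, CT values and function names are hypothetical, and the instrument software may differ in detail.

```python
from statistics import mean
from typing import Sequence, Tuple

def fit_standard_curve(log10_quantities: Sequence[float],
                       ct_values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of CT = slope * log10(quantity) + intercept."""
    mx, my = mean(log10_quantities), mean(ct_values)
    sxx = sum((x - mx) ** 2 for x in log10_quantities)
    sxy = sum((x - mx) * (y - my) for x, y in zip(log10_quantities, ct_values))
    slope = sxy / sxx
    return slope, my - slope * mx

def quantity_from_ct(ct: float, slope: float, intercept: float) -> float:
    """Invert the standard curve to obtain a relative quantity."""
    return 10 ** ((ct - intercept) / slope)

def normalized_expression(target_ct: float, control_ct: float,
                          target_curve: Tuple[float, float],
                          control_curve: Tuple[float, float]) -> float:
    """Target quantity divided by internal-control quantity."""
    return (quantity_from_ct(target_ct, *target_curve)
            / quantity_from_ct(control_ct, *control_curve))

if __name__ == "__main__":
    # Hypothetical ten-fold dilution series: log10 quantity vs. measured CT
    curve = fit_standard_curve([0, 1, 2, 3], [30.0, 27.0, 24.0, 21.0])
    print(curve)
    print(normalized_expression(24.0, 27.0, curve, curve))
```

Replicate values obtained this way would then be summarized as the mean ± SD for each gene and stage.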