Introduction

The presence of introns is a characteristic feature of eukaryotic genes, although genes with introns have also been reported at very low frequency in bacteriophage and bacterial genomes (Edgell et al. 2000). On the other hand, many eukaryotic genes lack introns. The number of introns per gene varies drastically among eukaryotes. Most vertebrates have several introns per gene, whereas only two characterized introns have been found in Giardia lamblia, a flagellated protozoan parasite (Aparicio et al. 2002; Nixon et al. 2002). There is no simple phylogenetic pattern of intron-rich and intron-poor species across the eukaryotic tree (Jeffares et al. 2006). Such differences largely reflect stronger genomic streamlining in unicellular organisms than in multicellular species (Gilbert 1987). The efficiency of selection on introns differs among species with different population sizes, but none of the models predict high intron densities in early eukaryotes and unicellular species (Lynch and Conery 2003; Roy and Gilbert 2005; Loftus et al. 2005). Thus, intron number is controlled by sensitive natural selection, which implies an important role for the mutational mechanisms of intron gain and loss. In general, intron sequences are under lower selection pressure than exons; consequently, introns have a higher rate of gain and loss than exons (Lin et al. 2006).

The presence of introns before the divergence of prokaryotes and eukaryotes is yet to be confirmed, but there are definite instances of both loss and gain of introns later in the evolution of species (Jeffares et al. 2006). There is as yet no conclusive theory on the mechanisms and forces that underlie the gain and loss of introns, but the evolution of spliceosomal introns has broad implications for many fundamental evolutionary questions. Presently there are two opposing views on the origin of introns. The intron-early hypothesis suggests that introns were present in the genes of the common ancestor of all presently living organisms and that the splicing mechanism is ancient (Gilbert et al. 1997). The intron-late theory holds that introns were inserted into their present locations in genes and that the splicing mechanism evolved late, in eukaryotes, because of its selective advantage (Rzhetsky and Ayala 1999). Research on the mechanisms and causes of spliceosomal intron evolution has been very active over the past few years of the post-genomic era, resolving some old controversies and sparking some new ones (Roy and Gilbert 2006).

The completion of the high-quality rice genome sequence has provided an excellent opportunity to study the evolution of introns, the possible mechanisms involved in their loss and gain, their distribution in genes and their relationship with other structural and functional features of genes. The aim of the present study was, in particular, to analyze the relationship between intron number and gene expression in rice.

Materials and methods

The complete set of 62,820 CDS sequences of predicted rice genes distributed over the twelve rice chromosomes was downloaded in batches from the TIGR build 4.0 database (http://www.tigr.org/tdb/e2k1/osa1/overview.shtml). Of these, a subset of 27,330 expressed genes was extracted manually. ESTs and full-length cDNA information were used as evidence of expression (Kikuchi et al. 2003). The number of introns per gene was determined individually for all predicted rice genes, in both the expressed and non-expressed categories, by a semi-automated procedure using the gffutils tools (https://pythonhosted.org/gffutils/). The outputs were arranged chromosome-wise and grouped into classes based on the number of introns per gene. The average CDS length, average exon length and percentage of expressed genes were calculated for each group of expressed rice genes and plotted against the number of introns contained in these genes.
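The intron-counting and grouping step can be illustrated with a minimal sketch. It is not the semi-automated procedure used in the study: it assumes GFF3-style input with one CDS feature per coding exon, takes the intron count of a transcript to be the number of CDS segments minus one, and uses hypothetical names (introns_per_transcript, class_sizes) and a made-up two-transcript annotation.

```python
from collections import Counter, defaultdict


def introns_per_transcript(gff3_text):
    """Count introns per transcript as (number of CDS segments - 1).

    Assumption: each coding exon appears as one CDS feature whose
    attribute column carries a single Parent=<transcript id> entry.
    """
    segments = defaultdict(int)
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and GFF3 directives/comments
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue  # only CDS features are counted in this sketch
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        parent = attrs.get("Parent")
        if parent:
            segments[parent] += 1
    return {tx: n - 1 for tx, n in segments.items()}


def class_sizes(intron_counts):
    """Number of transcripts in each intron-number class."""
    return Counter(intron_counts.values())


# Hypothetical annotation: tx1 has three coding exons, tx2 has one.
EXAMPLE_GFF3 = "\n".join([
    "##gff-version 3",
    "Chr1\tsketch\tmRNA\t100\t900\t.\t+\t.\tID=tx1;Parent=gene1",
    "Chr1\tsketch\tCDS\t100\t250\t.\t+\t0\tID=c1;Parent=tx1",
    "Chr1\tsketch\tCDS\t400\t520\t.\t+\t2\tID=c2;Parent=tx1",
    "Chr1\tsketch\tCDS\t700\t900\t.\t+\t0\tID=c3;Parent=tx1",
    "Chr1\tsketch\tCDS\t2000\t2600\t.\t+\t0\tID=c4;Parent=tx2",
])

counts = introns_per_transcript(EXAMPLE_GFF3)
print(counts)               # intron count per transcript
print(class_sizes(counts))  # transcripts per intron-number class
```

Counting CDS segments ignores introns that lie entirely within untranslated regions, which matches an analysis based on CDS sequences but would need adjusting if exon features were used instead.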

The first 200 genes (starting from the short-arm terminal of the chromosome) in each of the 0 to 30 intron categories of expressed genes were extracted manually, and all the ESTs and cDNAs for the individual genes were downloaded in batches from the TIGR site. The average number of ESTs was calculated separately for each category of genes with 0 to 20 introns; for genes with 21 to 30 introns, however, it was calculated on the pooled categories because few genes had such high intron numbers. The number of ESTs was plotted against the number of introns in these genes. As some rice genes have multiple splice forms, the average over all alternative isoforms was used in the analysis. The same set of 200 genes in each category was also used for the functional classification of genes. The predicted function of each rice gene was extracted from the TIGR website and classified according to the Plant GOSlim ontologies (http://www.geneontology.org/).
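The averaging scheme described above, with separate categories for 0 to 20 introns and a single pooled category for 21 to 30 introns, can be sketched as follows. The function names and the toy records are hypothetical and are not data from the study.

```python
from collections import defaultdict

POOL_START, POOL_END = 21, 30  # pooled range stated in the text


def category_label(n_introns):
    """Map an intron count to its category; 21 to 30 share one label."""
    if POOL_START <= n_introns <= POOL_END:
        return f"{POOL_START}-{POOL_END}"
    return str(n_introns)


def mean_ests_by_category(records):
    """records: iterable of (intron_count, est_count) pairs, one per gene."""
    totals = defaultdict(lambda: [0, 0])  # label -> [sum of ESTs, gene count]
    for n_introns, n_ests in records:
        entry = totals[category_label(n_introns)]
        entry[0] += n_ests
        entry[1] += 1
    return {label: s / c for label, (s, c) in totals.items()}


# Made-up records for illustration only.
toy_records = [(0, 10), (0, 20), (15, 50), (22, 40), (29, 60)]
means = mean_ests_by_category(toy_records)
print(means)
```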

Results and discussion

Intron frequency in the rice genes

The 62,820 predicted rice genes contained a total of 229,556 introns, of which 136,792 were present in the 27,330 expressed genes with cDNA evidence. In the rice genome there is a preponderance of genes with no introns or with few introns, contrary to the popular perception that most eukaryotic genes have introns (Fig. 1). About 60 % of the rice genes have fewer than four introns per gene, and only 91 genes had more than 30 introns per gene (Fig. 1, Table S1). The genes with higher numbers of introns coded for structural proteins and high molecular weight proteins, for instance a kinesin motor domain containing protein (316.6 kD, 37 introns) and a HEAT repeat family protein (250 kD, 53 introns). Genes with protein binding function were the largest class (8.94 %) of intron-rich genes, followed by those with hydrolase activity (7.82 %) and motor activity (6.15 %) (Fig. S1). Genes involved in response to abiotic stimulus and in secondary metabolic processes were mainly intronless. Genes for transferase activity were a major class of intronless genes, followed by those for transcription factor activity (3.93 %) (Fig. S1). Genes for mitochondrion, response to endogenous stimulus, kinase activity and catalytic activity showed diverse numbers of introns (Fig. S1).

Fig. 1

Frequency of genes with different numbers of introns per gene in the rice genome. Expressed genes are those supported by a cDNA match in dbEST

Relationship of intron frequency with exon length and CDS length in expressed rice genes

The average exon length showed an overall negative correlation (r = −0.377) with the number of introns in a gene. As the number of introns increased, the average exon size initially fell, but there was no further reduction in average exon size in genes with more than ten introns (Fig. 2). The correlation between the number of introns and average exon size was much stronger (r = −0.79) for genes with up to ten introns. Exons of less than 50 bp are too short for the spliceosome to operate on, and exons that are too long (greater than 300 bp) are difficult for the spliceosome to locate (14). This may be why genes with larger exons have very few or no introns, which avoids difficulty in splicing. The intron-early theory hypothesizes that the very first genes and exons encoded small polypeptide chains of 15–20 amino acids and that large genes then evolved by fusion of these smaller genes (Gilbert et al. 1997). In our results, however, the average exon size in the expressed rice genes is 215 bp (about 72 codons). The intron-early theory assumes that the exons of today are the result of, on average, two to three acts of fusion of the original 15–20 amino acid long exons (Gilbert et al. 1997).
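The comparison between the overall correlation and the correlation restricted to genes with up to ten introns can be reproduced in outline with a plain Pearson coefficient. The sketch below is not the study's analysis: the helper names are hypothetical and the exon lengths are invented to mimic a fall followed by a plateau.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def pearson_up_to(xs, ys, max_x):
    """Pearson r restricted to points whose x value is <= max_x."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x <= max_x]
    return pearson_r([p[0] for p in pairs], [p[1] for p in pairs])


# Invented data: mean exon length falls linearly up to ten introns,
# then stays flat for higher intron numbers.
intron_classes = list(range(15))
mean_exon_bp = [600 - 40 * n if n <= 10 else 200 for n in intron_classes]

r_all = pearson_r(intron_classes, mean_exon_bp)
r_first = pearson_up_to(intron_classes, mean_exon_bp, 10)
print(round(r_all, 3), round(r_first, 3))
```

With a plateau at high intron numbers, the coefficient over the full range is weaker than over the falling part, which is the qualitative pattern reported above.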

Fig. 2

Relationship between exon size and number of introns per gene based on 27,330 expressed genes in the rice genome

The average CDS length of the expressed rice genes was 1292 bp, with significant variation between genes. There was a strong positive correlation (r = 0.89) between intron number and CDS length in the expressed genes (Fig. 3). Because there is a lower limit to the average size of exons, an increasing number of introns results in an increase in CDS length.
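This reasoning can be written as a simple relation, offered as a reading aid rather than as a formula used in the analysis. The symbols n (introns in the coding region), \bar{e} (mean coding exon length) and e_{\min} (its lower limit) are introduced here for illustration, and untranslated exons are ignored.

```latex
L_{\mathrm{CDS}} = (n + 1)\,\bar{e} \;\ge\; (n + 1)\,e_{\min}
```

As a rough check using the averages reported here, 1292 bp / 215 bp is about 6.0 coding exons, or about five introns, which is close to the 136,792 introns found in 27,330 expressed genes (about 5.0 per gene). A ratio of averages is only an approximation, so this agreement should be read as indicative.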

Fig. 3

Relationship between average size of the predicted coding sequence (CDS) and number of introns per gene in 27,330 expressed genes in the rice genome

Relationship between intron frequency and gene expression

The present analysis revealed a strong positive correlation between the number of introns per gene and the percentage of expressed genes in that category. The percentage of expressed genes increased with the number of introns up to about 13 introns, after which it stabilized at about 80 % (Fig. 4). This is more than twofold higher than the 32.74 % observed for intronless genes and the 28.13 % for genes with a single intron. This finding supports the hypothesis that more introns per gene lead to higher gene expression.

Fig. 4

Relationship between percentage of expressed genes and number of introns per gene in the rice genome. Percentage of expressed genes was calculated by dividing number of genes with EST support by total number of predicted genes in that category
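The percentage defined in the caption can be computed per intron class as in the following minimal sketch. The function name and the toy gene list are hypothetical, and EST support is assumed to be available as a simple true/false flag per gene.

```python
from collections import Counter


def percent_expressed(genes):
    """genes: iterable of (intron_count, has_est_support) pairs.

    Returns, for each intron-number class, the number of genes with EST
    support divided by the number of predicted genes, as a percentage.
    """
    total = Counter()
    expressed = Counter()
    for n_introns, has_support in genes:
        total[n_introns] += 1
        if has_support:
            expressed[n_introns] += 1
    return {n: 100.0 * expressed[n] / total[n] for n in sorted(total)}


# Made-up genes: one of four intronless genes and four of five
# 13-intron genes have EST support.
toy_genes = [(0, True), (0, False), (0, False), (0, False),
             (13, True), (13, True), (13, True), (13, True), (13, False)]
pct = percent_expressed(toy_genes)
print(pct)
```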

To estimate quantitatively the relationship between expression level and the number of introns per gene, we performed a correlation analysis between the average number of ESTs representing a particular gene locus and the number of introns in that gene. A significant positive correlation (r = 0.45) was observed between the average number of ESTs for a gene and the average number of introns in the gene. The rice EST database was the second largest after human, with a total of 1,220,261 entries in NCBI on April 25th 2008 (Release 042508) and an average of 102 ESTs per gene locus analyzed in the present study. The average number of EST matches increased with the number of introns per gene, but not in a strictly linear pattern (Fig. 5). The category of genes with 15 introns was the most highly expressed, with an average of 236 EST matches per gene in the database. In contrast, intronless genes showed very low expression, with an average of only 32 ESTs per locus. These observations rest on the assumption that genes with a low expression level or very specific expression conditions accumulate fewer ESTs in the database. Abundance within the EST database has recently been proposed as a method for estimating expression levels (Marais and Piganeau 2002). It has been reported that highly expressed genes have more and longer introns in rice and Arabidopsis, which is consistent with our results (Ren et al. 2006). The presence of introns is also known to enhance gene expression in transgenic plants, and examples of intraspecific intron presence/absence polymorphism also support a role for introns in enhanced expression (Llopart et al. 2002). The enhancement of gene expression by intron sequences has been verified in a wide range of plant species, including monocots and dicots (Morita et al. 2012; Patil et al. 2010). Recently, an intron of the Gmubi gene was found to contribute to very high levels of expression in transiently and stably transformed soybean tissues (Carola and Finer 2015).

Fig. 5

Average number of EST matches in the TIGR database for each category of genes with different numbers of introns. The average number of ESTs was calculated for the first 200 genes identified in each category of genes with 0 to 30 introns. Categories 21 to 30 have fewer than 200 genes each; hence the average was calculated on the combined frequency

In humans and Caenorhabditis elegans, highly expressed genes have fewer and shorter introns. This compact nature of highly expressed genes has been explained by transcriptional efficiency, regional mutation bias or genomic design (Vinogradov 2005; Urrutia and Hurst 2003; Sanderson et al. 2004). This hypothesis, however, does not fit well with the observations in plants. Whatever selection is responsible for the presence of more introns in expressed than in non-expressed rice genes, the difference from animals might be due to the divergence of animals and plants about 1600 million years ago (Sanderson et al. 2004).

The theory of intron gain with joining of adjacent exons can be evaluated by analyzing the open reading frame across all exons of a gene. Accordingly, the phase information for all rice exons was retrieved in gff3 format from the Phytozome database (ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v8.0/Osativa/). The phase indicates where an exon starts relative to the reading frame. Depending on the codon position at which they start, exons are classified as phase 0, 1 or 2. The phase analysis revealed a predominance of exons starting in the proper reading frame in the rice genome: 60.15 % of exons were in phase 0, and only 33.28 % were in either phase 1 or phase 2. A further 6.5 % of exons were entirely non-coding and represent only untranslated regions. The results of the exon phase analysis strongly support the theory of intron gain with joining of adjacent exons.
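A phase tally of this kind can be sketched directly from GFF3 text, where the phase is the eighth tab-separated column. This is an illustration under stated assumptions, not the procedure used in the study: the function name is hypothetical, the feature type to count is left as a parameter because the Phytozome file layout is not described here, and the annotation lines are invented.

```python
from collections import Counter


def phase_percentages(gff3_text, feature_type="CDS"):
    """Percentage of features of the given type in each phase.

    Assumption: the phase is read from column 8 of each GFF3 line; a "."
    in that column would be reported as its own category.
    """
    counts = Counter()
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and directives
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        counts[cols[7]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {phase: 100.0 * n / total for phase, n in counts.items()}


# Invented annotation with five coding segments: three in phase 0,
# one in phase 1 and one in phase 2.
TOY_GFF3 = "\n".join([
    "##gff-version 3",
    "Chr1\tsketch\tCDS\t100\t250\t.\t+\t0\tID=c1;Parent=tx1",
    "Chr1\tsketch\tCDS\t400\t520\t.\t+\t1\tID=c2;Parent=tx1",
    "Chr1\tsketch\tCDS\t700\t900\t.\t+\t2\tID=c3;Parent=tx1",
    "Chr1\tsketch\tCDS\t2000\t2600\t.\t+\t0\tID=c4;Parent=tx2",
    "Chr1\tsketch\tCDS\t5000\t5300\t.\t+\t0\tID=c5;Parent=tx3",
])

phases = phase_percentages(TOY_GFF3)
print(phases)
```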

Intron gain also occurs by the insertion of transposable elements into existing coding sequences, leading to progressively lower frequencies of genes with more introns (Ren et al. 2006). In contrast, loss of introns during evolution will lead to the accumulation of genes with few or no introns. There are two models for intron loss. In the classical model, either the genomic copy of a gene undergoes gene conversion by double recombination with a reverse-transcribed copy, leading to the loss of one or more adjacent introns, or a new gene is created by reverse transcription of the (intronless) mRNA followed by insertion of this cDNA into the genome (a retrotransposon-like mechanism). In the genomic deletion model, introns are lost by exact or nearly exact genomic deletion (Hu 2006). The two models make several distinct predictions. For instance, recombination with RT-mRNAs should excise introns exactly, whereas genomic deletion should be less tidy, sometimes deleting adjacent coding sequence and sometimes leaving residual intron sequence (Roy and Gilbert 2006). Formation of a new gene copy as per the classical model was rare in Drosophila (Betrán et al. 2002). We assume, however, that this mechanism might be predominant in plants, as retrotransposons are particularly abundant in plants, where they are often a principal component of nuclear DNA. For instance, about 90 % of the wheat genome is made up of retrotransposons, compared with 49–78 % in maize (Li et al. 2004; Sanmiguel and Bennetzen 1998).

Intron loss may lead to the formation of defective copies of genes alongside the functional copy. It has been reported that evolutionarily conserved genes in rice have more introns than newly evolved genes (25). Further, these results agree with our presumption that newly evolved genes with few or no introns may be faulty. While the intron loss theory can explain the formation of defective genes, there is no evidence of any selective advantage for such genes during the evolution of eukaryotic genes. In particular, our results point towards a selective advantage of intron-rich genes owing to their high expression level.

Intron features such as length and abundance in a gene have been routinely used to study the evolution of gene families (Patil and Nicander 2013; Deshmukh et al. 2013). Gene families mostly expand through segmental or whole-genome duplications. Many other modes of gene duplication are also known, involving reverse transcription of RNA, horizontal gene transfer and uneven recombination events. These different mechanisms impose different levels of selection pressure on introns. Xu et al. (2012) analyzed 612 pairs of sibling paralogs from seven representative gene families and 300 pairs of one-to-one orthologs from different species, and their results suggested that structural divergence has a more important role in the evolution of duplicate genes than of non-duplicate genes.

Despite the increasing number of available genome sequences and advanced analytical tools, very little effort has been devoted to understanding intron evolution at the genomic scale. The present study focused on the rice genome, and other plant genomes, particularly those of dicots and primitive plant species, may show different patterns of intron distribution. This study will help to verify existing observations and enrich the understanding of intron distribution in genomes. We have shown here a typical genome-wide distribution pattern of introns in rice genes and their correlation with exon length, CDS length and gene expression. The results presented here should be useful in understanding the structural organization of genes with respect to the presence of introns.