Introduction

Spliced leader (SL) trans-splicing is an mRNA maturation process found in a subset of metazoan phyla (Davis 1996; Stover and Steele 2001; Vandenberghe et al. 2001; Ganot et al. 2004) and in Euglenozoa (Agabian 1990; Ebel et al. 1999; Frantz et al. 2000). During SL addition, a short exon from the 5′ end of a small nuclear RNA, the SL RNA, trans-splices to a site in the 5′ region of a pre-mRNA. The portion of the pre-mRNA upstream of the trans-splice site is deleted and replaced by the SL sequence, which includes a methylated guanosine cap at the 5′ end. SL addition thereby allows cells to cap otherwise uncapped RNAs, provided they contain a 3′ splice site and the requisite splicing signals (Nilsen 1993). In nematodes (Evans et al. 1997), like Caenorhabditis elegans (Zorio et al. 1994), as well as in other well-characterized SL trans-splicing species (trypanosomes and flatworms) (Blumenthal 1995), this process is used to cap the downstream genes of operons following the 3′ mRNA cleavage of the gene immediately upstream. The addition of the SL and its attached cap allows for the proper transport and translation of these messages (Nilsen 1993). A large number of operons have been identified in the C. elegans genome: a total of 1052 operons, comprised of 2727 genes, or approximately 15% of the genome (Blumenthal et al. 2002; Blumenthal and Gleason 2003).

Recently Lercher et al. (2003) analyzed expression of C. elegans genes and found that the majority of coexpressed genes fall into two classes: genes within the same operon and duplicated genes. Genes in these two classes share either identical promoters (for genes in operons) or evolutionarily related promoters (for duplicated genes, assuming that the attached promoter was duplicated), providing a straightforward explanation for coexpression. In addition to this finding, the authors note that genes in operons are duplicated less often than monocistronic (nonoperonic) genes. To explain this unanticipated finding, the authors proposed that the lack of autonomous promoter elements for the downstream genes prevents otherwise routine gene duplication events from resulting in expressed functional copies.

In this study, we test this hypothesis by determining the duplication rates of genes found at the three different positions within operons: genes at the 5′ end, which lie adjacent to the operon’s promoter and should thus be equivalent to genes outside operons—at least regarding duplication of promoter elements; genes at the 3′ end of the operon, which are adjacent to untranscribed DNA; and internal genes, which are flanked on both sides by other genes in the operon. Since the formation and extension of eukaryotic operons is likely a dynamic process (Lawrence 1999), we also compare the ages of the duplication events (as represented by their KS values) for genes inside and outside operons. This allows us to determine if duplication events are constrained by incorporation in an operon or whether the genes in operons are predisposed against duplication for some independent reason. We also determine whether specific mechanisms of gene duplication (especially inverted, tandem gene duplication, which is most common in C. elegans) are less frequent among genes in operons. We show that intrachromosomal duplications are more rare in operons, suggesting that a mechanism responsible for many gene duplications in C. elegans is hindered by operons. This difference may largely account for the discrepancy between the observed frequency of duplicated genes in operons and the rest of the genome.

Materials and Methods

Data Download

All sequences and annotation correspond to Wormbase 121 (Harris et al. 2004) and were downloaded from ftp://ftp.wormbase.org/pub/wormbase/elegans/WS121 on April 20, 2004. Operon data were obtained by parsing the gff annotation file for the start and end coordinates of each annotated operon. For each operon we retrieved all genes starting or ending inside the annotated coordinates that were coded on the same strand as the operon itself. Mitochondrial chromosomes and genes were excluded from our analyses.
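The operon membership rule described above can be sketched in a few lines. This is a minimal illustration, not the original parsing script: the `Feature` record, the function name `genes_in_operon`, and the toy coordinates are all assumptions, and reading the actual WS121 gff file into such records is left out.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    """Hypothetical record for one annotated feature (operon or gene)."""
    name: str
    chrom: str
    start: int
    end: int
    strand: str


def genes_in_operon(operon, genes):
    """Return genes on the operon's chromosome and strand whose start or end
    coordinate lies inside the annotated operon coordinates."""
    members = []
    for gene in genes:
        if gene.chrom != operon.chrom or gene.strand != operon.strand:
            continue
        starts_inside = operon.start <= gene.start <= operon.end
        ends_inside = operon.start <= gene.end <= operon.end
        if starts_inside or ends_inside:
            members.append(gene)
    return sorted(members, key=lambda g: g.start)


# Toy coordinates, invented for illustration only.
operon = Feature("op1", "I", 1000, 5000, "+")
genes = [
    Feature("g1", "I", 900, 2000, "+"),    # ends inside the operon
    Feature("g2", "I", 2500, 4800, "+"),   # fully inside
    Feature("g3", "I", 3000, 3500, "-"),   # opposite strand, skipped
    Feature("g4", "II", 1200, 1800, "+"),  # other chromosome, skipped
    Feature("g5", "I", 6000, 7000, "+"),   # outside the operon
]
member_names = [g.name for g in genes_in_operon(operon, genes)]
```

With these toy features, only `g1` and `g2` are assigned to the operon.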

Data Filtering

Groups of alternatively spliced genes, or isoforms, were collapsed, and hits to any member of an isoform group were counted as hits to this combined “gene.” Genes derived from repetitive elements were removed using the RepBase9.02 database (http://www.girinst.org/; downloaded on April 21, 2004), which contains all known repetitive elements in the C. elegans genome. We used the sequences from this database to query the C. elegans proteins using Fastx34 (E-value ≤ 10⁻⁵) (Pearson and Lipman 1988). If more than 60% of the sequence of a given protein was hit by a repetitive element, we excluded the protein and its corresponding gene from the analysis (Gu et al. 2002; Cavalcanti et al. 2003).
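The 60% coverage filter can be illustrated as follows. This sketch assumes that coverage is measured as the union of all repeat hits on a protein, in 1-based inclusive residue coordinates; the text does not state whether hits were merged or considered one at a time, so that choice, like the function names, is an assumption.

```python
def covered_fraction(protein_length, hits):
    """Fraction of residues covered by the union of 1-based inclusive
    (start, end) hit intervals on a protein."""
    covered = 0
    current_end = 0  # last residue already counted
    for start, end in sorted(hits):
        start = max(start, current_end + 1)  # skip residues counted earlier
        end = min(end, protein_length)
        if end >= start:
            covered += end - start + 1
            current_end = end
    return covered / protein_length


def is_repetitive(protein_length, hits, threshold=0.6):
    """Apply the 'more than 60% of the sequence is hit' exclusion rule."""
    return covered_fraction(protein_length, hits) > threshold


# Two overlapping hits cover residues 1-70 of a 100-residue protein.
example_fraction = covered_fraction(100, [(1, 50), (40, 70)])
```

Merging overlapping hits before measuring coverage avoids counting the same residues twice when several repeat sequences match the same region.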

Duplicated Genes

In their paper, Lercher et al. (2003) showed that only 6.7% of the gene pairs that hit each other with a BLAST E-value smaller than 10⁻⁵⁰ contain at least one operon gene, while the expected value is 21.5%. The use of BLAST E-values alone provides only a rough definition of duplicated genes, however, and more precise methods for determining gene duplications exist. Furthermore, the presence of large families of duplicated genes can bias the estimation of the number of pairs of duplicated genes. As an illustration, the genes composing a family of size 242 (the largest gene family identified by Gu et al. [2002]) could potentially have 58,322 hits (242 × 241) if all the members of the family hit each other, whereas a family of size 5 could have at most 20 hits.
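The arithmetic behind this bias is easy to check. The short sketch below (function names are ours) reproduces the two figures quoted above and shows, for a toy family of four genes, how counting hits inflates a family's weight relative to counting the genes themselves.

```python
def max_pairwise_hits(family_size):
    """Maximum number of hits if every member hits every other member
    (each ordered pair counted once)."""
    return family_size * (family_size - 1)


def genes_with_paralogue(pairs):
    """Set of distinct genes appearing in at least one duplicated pair."""
    return {gene for pair in pairs for gene in pair}


# Toy family of four genes in which every member hits every other member.
toy_pairs = [(a, b) for a in "ABCD" for b in "ABCD" if a != b]
toy_hits = len(toy_pairs)                       # 12 hits
toy_genes = len(genes_with_paralogue(toy_pairs))  # but only 4 genes
```

The number of hits grows roughly with the square of family size, whereas the number of genes with a paralogue grows linearly.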

In this study, we expand the observations of Lercher et al. (2003) by using the methodology described by Gu et al. (2002) and Cavalcanti et al. (2003) to detect duplicated genes (for an explanation of why this particular definition was used, see Gu et al. [2002]). This method uses a more stringent definition of duplicated genes and also groups the detected paralogous gene pairs in families, allowing us to test if bias introduced in the number of duplicated pairs by large gene families can partially explain Lercher and coworkers’ (2003) observations.

First, an all-against-all protein search was performed using Fasta34 (Pearson and Lipman 1988). The pairs of proteins that scored with an E-value smaller than 10 were kept for further analysis. Of these genes, we considered a pair of genes to be duplicated if they fulfilled the following requirements.

  1. The length of the alignable region (L) between the two sequences was ≥ 80% of the length of the longer protein.

  2. The similarity of the alignable region was ≥ I, where I = 30% if L ≥ 150 aa and $I = 0.01n + 4.8\,L^{-0.32\,(1 + e^{-L/1000})}$ otherwise, with n = 6 (Rost 1999; Gu et al. 2002; Cavalcanti et al. 2003).
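The two requirements can be written as a small predicate. This is a sketch under stated assumptions: the exponent of L is read as −0.32(1 + exp(−L/1000)), following the form of the Rost (1999) curve, and the function and argument names are hypothetical. With this reading the length-dependent threshold is close to 30% at L = 150, so the two branches join smoothly.

```python
import math


def identity_threshold(align_len, n=6):
    """Minimum similarity (as a fraction) for an alignable region of
    align_len residues; 30% for regions of 150 aa or more."""
    if align_len >= 150:
        return 0.30
    exponent = -0.32 * (1 + math.exp(-align_len / 1000))
    return 0.01 * n + 4.8 * align_len ** exponent


def is_duplicate_pair(align_len, similarity, len_a, len_b):
    """Apply both requirements: alignable region covering at least 80% of
    the longer protein, and similarity at or above the threshold."""
    longer = max(len_a, len_b)
    if align_len < 0.8 * longer:
        return False
    return similarity >= identity_threshold(align_len)


threshold_100 = identity_threshold(100)  # stricter than 30% for short regions
```

Shorter alignable regions therefore need a higher similarity to qualify, which guards against short, spurious matches.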

Gene families were determined by clustering the resulting pairs using a single linkage algorithm, i.e., if gene A hits gene B and gene B hits gene C, we grouped these three genes into a family regardless of whether gene A hits gene C (Gu et al. 2002). To avoid bias in the number of hits introduced by the existence of large gene families, we conducted most of our analysis using the number of genes with at least one paralogue in the genome instead of duplicated gene pairs. When using gene pairs, we only considered those pairs belonging to families of size 2 and thus duplicated only once.
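Single linkage clustering of gene pairs amounts to finding the connected components of the graph whose edges are the duplicated pairs. The sketch below uses a union-find structure for this; it illustrates the technique and is not the original implementation, and the gene labels are placeholders.

```python
from collections import defaultdict


def single_linkage_families(pairs):
    """Group genes into families: two genes share a family whenever a chain
    of duplicated pairs connects them."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    families = defaultdict(set)
    for gene in parent:
        families[find(gene)].add(gene)
    return list(families.values())


# A hits B and B hits C, so A, B, and C form one family even without an A-C hit.
families = single_linkage_families([("A", "B"), ("B", "C"), ("D", "E")])
family_sizes = sorted(len(f) for f in families)
```

Families of size 2, which are the only ones used in the pair-based analyses, are simply the components with exactly two members.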

To determine the age of duplicated pairs we used the proportion of synonymous site changes (KS); because selection on synonymous sites is very weak, KS is generally a good approximation for the time of divergence between two homologous sequences (Graur and Li 2000). For each pair of duplicated proteins we realigned the Fasta34 alignable region using ClustalW (Thompson et al. 1994), aligned the genes based on the protein alignment, and calculated KS values for the gene alignment using the maximum likelihood method in the program CODEML, part of the PAML package (Yang 1997). The algorithm used by CODEML can sometimes get trapped in suboptimal local maxima and give an incorrect result. To avoid this, all the KS calculations were repeated four times and the result with the highest likelihood was retained. All these steps were automated using BioPerl scripts, and all the programs were run with the default parameters of the BioPerl modules used.
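The "repeat and keep the best" step can be sketched generically. Here `run_once` is a hypothetical stand-in for one CODEML run that returns a (log-likelihood, KS) tuple; the actual calls were made through bioperl, and the numbers below are invented purely to exercise the function.

```python
def best_of_runs(run_once, n_runs=4):
    """Call run_once(i) for i in 0..n_runs-1 and keep the result with the
    highest log-likelihood. Each result is a (log_likelihood, ks) tuple."""
    results = [run_once(i) for i in range(n_runs)]
    return max(results, key=lambda result: result[0])


# Invented outcomes: two runs stuck in a poorer local maximum, two in a better one.
fake_runs = [(-1205.3, 0.82), (-1201.7, 0.64), (-1205.3, 0.82), (-1201.7, 0.64)]
best_lnl, best_ks = best_of_runs(lambda i: fake_runs[i])
```

Because log-likelihoods are negative, the "highest" value is the one closest to zero.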

Results

Operons in C. elegans

Version 121 of Wormbase contains 22,254 protein coding genes; 12 of these belong to the mitochondrial chromosome and were excluded. If we collapse each isoform group (described above) to a single “gene,” there are 19,874 genes; 356 of these genes belong to repetitive elements and were excluded, and 19,518 genes were further analyzed.

There are 1054 operons in Wormbase 121; however, operons CEOP1452 and CEOP2074 are annotated twice (as CEOP1894 and CEOP2692, respectively) and only one of each pair was kept in the analysis. Two other operons, CEOP2590 and CEOP3765, contain only one gene followed by a noncoding transcript and were also excluded. Ten genes in operons belong to repetitive elements and were also excluded from the analysis. This left us with 1050 unique operons comprising 2715 genes.

The mean distance between consecutive genes in operons is much smaller than the mean distance between consecutive genes in the genome. While the mean distance between two consecutive genes in an operon is ∼610 bp, this distance is ∼2.6 kb between consecutive genes in the genome (we excluded from the analyses the distances between the first and the last genes in an operon and the gene immediately preceding or following the operon). The median distance of genes within an operon is 424 bp, while that for genes outside operons is 1234 bp.
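For illustration, distances between consecutive genes can be computed as below. The sketch assumes 1-based inclusive gene coordinates, counts the bases strictly between neighbours, and clips overlapping neighbours to zero; none of these conventions is stated in the text, and the coordinates are invented.

```python
import statistics


def intergenic_distances(genes):
    """Distances between consecutive genes, given (start, end) tuples on one
    chromosome; overlapping neighbours yield a distance of 0."""
    ordered = sorted(genes)
    return [
        max(0, next_start - prev_end - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


# Toy coordinates, invented for illustration only.
toy_genes = [(100, 500), (900, 1500), (1600, 2000), (9000, 9500)]
gaps = intergenic_distances(toy_genes)
mean_gap = statistics.mean(gaps)
median_gap = statistics.median(gaps)
```

A mean well above the median, as in the genome-wide figures above, indicates a distribution skewed by a minority of very long intergenic regions.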

Duplicated Genes and Operons

Using the method detailed above to detect duplicated genes, we identified 40,894 pairs of duplicated genes; 2235 of these pairs contain at least one gene in an operon. Because in the case of very similar gene families each gene can belong to multiple pairings, these pairs only contain 6845 unique genes, 415 of which belong to operons. Table 1 summarizes these results. For all KS brackets the genes within operons have significantly fewer paralogues than genes in the genome as a whole.

Table 1 Frequency of duplicated genes in the genome and in operons

Table 1 also gives the results of 1000 runs of simulations in which we randomly assigned 2715 genes to operons and calculated the expected number of duplicated genes in these samples. The p-values were calculated as the number of samples that presented fewer duplicated genes than observed for the genes in operons. As can be seen, for all KS brackets none of the random samples presented fewer duplicated genes than the set of operon genes (p < 0.001).
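The simulation can be sketched as a simple resampling test. The gene identifiers, set sizes, and observed count below are toy values chosen only to make the example run quickly; the function name and the fixed random seed are likewise assumptions.

```python
import random


def simulate_operon_duplicates(all_genes, duplicated, n_operon_genes,
                               observed, n_runs=1000, seed=0):
    """Randomly assign n_operon_genes genes to operons n_runs times.
    Returns the mean number of duplicated genes per sample and the fraction
    of samples with fewer duplicated genes than observed (the p-value)."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_runs):
        sample = rng.sample(all_genes, n_operon_genes)
        counts.append(sum(1 for gene in sample if gene in duplicated))
    n_fewer = sum(1 for count in counts if count < observed)
    return sum(counts) / n_runs, n_fewer / n_runs


# Toy genome: 1000 genes, 350 of them duplicated, 100 "operon" genes,
# of which only 5 are observed to be duplicated.
toy_genome = list(range(1000))
toy_duplicated = set(range(350))
expected, p_value = simulate_operon_duplicates(toy_genome, toy_duplicated, 100, 5)
```

When none of the 1000 samples shows fewer duplicated genes than observed, the estimated p-value is 0, which is reported as p < 0.001.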

The last column in Table 1 shows the ratio of the number of duplicated genes in operons versus the total number of duplicated genes. If genes in operons duplicated at the same frequency as the genes in the genome, all the values in this column should be 0.139 (2715 ÷ 19,518), which is the proportion of the genome found in operons. Even when we consider all duplicated genes, this ratio is 0.066, much lower than expected. Furthermore, this ratio decreases with KS, indicating that the relative frequency of duplicated genes in operons compared to that of the genome decreases for more recent duplications.

To test Lercher et al.’s (2003) hypothesis to explain the observed paucity of duplicated genes in operons, we determined whether the position of a gene inside an operon has an effect on its duplication pattern. According to this hypothesis, genes at the 5′ end of operons (that lie adjacent to their promoter) should have paralogues more often than downstream genes in operons (that lack their own promoters). The filtered dataset comprises 1048 genes at the 5′ ends of operons, 621 genes internal to operons, and 1046 genes at the 3′ ends of operons. Table 2 shows the number of duplicated genes in each position within operons, the results of 1000 runs of simulation in which we randomly assigned 1048 of the genes in operons to the 5′ class, 621 to the internal class, and 1046 to the 3′ class and the expected number of duplicated genes for each class.
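Assigning genes to the three position classes only requires ordering the members of each operon in the direction of transcription. The following sketch assumes non-nested genes identified by their genomic start coordinate; the function name and labels are ours.

```python
def classify_positions(gene_starts, strand):
    """Label each gene of one operon as 5', internal, or 3'.
    gene_starts maps gene name to genomic start coordinate; on the minus
    strand the gene with the highest coordinate is the 5'-most gene."""
    ordered = sorted(gene_starts, key=gene_starts.get, reverse=(strand == "-"))
    labels = {}
    for index, gene in enumerate(ordered):
        if index == 0:
            labels[gene] = "5'"
        elif index == len(ordered) - 1:
            labels[gene] = "3'"
        else:
            labels[gene] = "internal"
    return labels


plus_labels = classify_positions({"a": 100, "b": 500, "c": 900}, "+")
minus_labels = classify_positions({"a": 100, "b": 500, "c": 900}, "-")
```

An operon of two genes has no internal gene, which is why the internal class is smaller than the two terminal classes.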

Table 2 Frequency of duplicated genes in different regions of operons

The position of the genes in the operons does not appear to affect their duplication frequency. Indeed, the only value significantly different from that expected by chance is for genes internal to operons with KS ≤ 0.6, which seem to have more duplicated copies than expected, and even this difference is only marginally significant (p = 0.03).

Next, we checked whether, besides being less frequent, the duplicated genes in operons show a different gene family distribution compared to the genome as a whole. We divided the gene families into three groups based on their sizes: small (fewer than 6 members), medium (more than 5 but fewer than 21 members), and large (more than 20 members). We then counted the number of genes in each family category. Among the duplicated genes in operons, 55.7% belong to small families, while only 41.5% of all duplicated genes in the genome fall in this category. Only 17.5% of the duplicated genes in operons belong to large families, compared to 32.5% of all duplicated genes in the genome. These differences are highly significant (p < 0.001 in both cases).
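The size classes and the per-class gene counts can be expressed directly. In this sketch the function names and the toy family sizes are assumptions; each family contributes as many genes as it has members.

```python
from collections import Counter


def family_category(size):
    """Size class of a gene family: small (2-5), medium (6-20), large (>20)."""
    if size < 2:
        raise ValueError("a family of duplicated genes has at least 2 members")
    if size <= 5:
        return "small"
    if size <= 20:
        return "medium"
    return "large"


def genes_per_category(family_sizes):
    """Number of genes falling in each family size class."""
    counts = Counter()
    for size in family_sizes:
        counts[family_category(size)] += size
    return counts


# Toy family sizes, invented for illustration only.
toy_counts = genes_per_category([2, 2, 5, 6, 20, 21, 40])
```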

Another important pattern of gene duplications is the relative position of the genes in the duplicated pair. To avoid the problem of multiple duplicated genes, we only used duplicated pairs belonging to families of size 2. Our dataset contained 738 families of size 2. Table 3 shows the number of duplicated pairs in the same or different chromosomes for all the duplicated pairs in the genome and for the duplicated pairs in which at least one gene is in an operon, together with the expected value if the genes in operons were randomly selected from the genome (1000 replicates).

Table 3 Frequency of different types of duplication in the genome and in operons

Table 3 shows that the number of duplicated pairs on different chromosomes for genes in operons (84) is not significantly different from the expected number based on the frequency of such duplications in the genome (88.6; p = 0.272). The number of duplicated pairs on the same chromosome for genes in operons (44), on the other hand, is significantly smaller than expected (103; p < 0.001). Among the pairs on the same chromosome, those in which the duplicates lie close together (≤ 10 kb apart) are rarer relative to expectation than those in which the duplicates lie far apart: their counts fall ∼5.5 SD and ∼4.3 SD below the expected values, respectively. Although it would be highly desirable, we could not perform the same analysis for recently duplicated genes, as the number of such duplicated pairs for which one of the genes belongs to an operon is very small (only 12 cases with KS ≤ 1).
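The classification of pairs and the comparison with the simulated expectation can be sketched as follows. The representation of a gene as a (chromosome, position) tuple, the use of a single coordinate per gene to measure distance, and the toy simulated counts are all assumptions made for illustration.

```python
import statistics


def pair_class(gene_a, gene_b, close_bp=10_000):
    """Classify a duplicated pair by relative position. Each gene is a
    (chromosome, position) tuple."""
    (chrom_a, pos_a), (chrom_b, pos_b) = gene_a, gene_b
    if chrom_a != chrom_b:
        return "different chromosomes"
    if abs(pos_a - pos_b) <= close_bp:
        return "same chromosome, close"
    return "same chromosome, distant"


def deviation_in_sd(observed, simulated_counts):
    """Distance of the observed count from the simulated mean, in units of
    the standard deviation of the simulated counts."""
    mean = statistics.mean(simulated_counts)
    sd = statistics.stdev(simulated_counts)
    return (observed - mean) / sd


close_pair = pair_class(("I", 1_000), ("I", 9_000))
toy_deviation = deviation_in_sd(4, [8, 10, 12])  # toy simulated counts
```

A negative deviation means that the observed count lies below the simulated expectation.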

Discussion

We first confirmed Lercher et al.’s (2003) results showing that genes in operons have fewer duplicated copies than genes in the genome as a whole. In their paper, Lercher et al. (2003) propose that this is caused by the lack of promoter elements in genes within operons, because duplications of these genes would result in nontranscribed copies that would soon be lost. Table 2 shows that genes at the 5′ ends of operons are not duplicated more often than genes in other operon positions, indicating that the data do not support the original hypothesis.

If this hypothesis is incorrect, then why do genes within operons have fewer duplicated copies in the genome? We propose and discuss two alternative hypotheses.

First, as genes in operons tend to be coregulated (Lercher et al. 2003), duplication of such genes may disrupt the stoichiometric relationships between gene products, making the duplications deleterious (Lynch and Conery 2000). This is known as the Gene Dosage Balance hypothesis (GDBH) (Veitia 2004). Papp et al. (2003) showed in yeast that imbalances in the concentration of subunits of a protein complex have deleterious effects; furthermore, members of protein complexes seem to be duplicated less frequently in the yeast genome.

More generally, if genes that for any reason duplicate less often are more prone to incorporation into operons, the paucity we observe might not be caused by incorporation into an operon. Rather, it could be an intrinsic characteristic of the genes that are incorporated. Notably, proteins that catalyze steps of gene expression, like transcription, splicing, and translation, are abundant in operons (Blumenthal and Gleason 2003). Since these genes are rarely duplicated in eukaryotes (Maere et al. 2005), this may in part account for the paucity of duplicated genes in operons.

However, if this hypothesis were entirely responsible for the observed paucity, we would expect the values in the last column in Table 1, which shows the duplication rate of genes in operons with respect to age of the duplication, to be roughly the same, because the incorporation of genes into operons would not affect their duplication frequency. We observe a decrease in these values instead, suggesting that genes currently in operons duplicated more often before their incorporation. This can be explained if their duplication rates became constrained after inclusion in an operon, and could not be explained by the inherent duplication rate of the genes.

Our second hypothesis is that duplication of genes in operons disrupts expression of genes in the operon, resulting in a lower number of duplicated operon genes over time. Katju and Lynch (2003) observed that the most prevalent recent gene duplicates in C. elegans are in tandem. Furthermore, the majority of the recent duplications observed in their study were not only in tandem, but also in an inverted orientation, with the two genes encoded on opposite DNA strands. The authors proposed that the inversions are part of the duplication event and not caused by secondary rearrangements. Tandem, noninverted gene duplications can be created by illegitimate recombination during DNA replication; however, duplicated genes created in this manner are easily lost by the same mechanism (Katju and Lynch 2003). More complex mechanisms have been described explaining the origin of tandem, inverted duplications: illegitimate recombination during DNA replication mediated by DNA polymerase strand switching (Cohen et al. 1994; Bi and Liu 1996; Lin et al. 2001; Katju and Lynch 2003; Woollard 2005) and strand misalignment-realignment (Gordon and Halliday 1995). Whatever the mechanism is that leads to inverted tandem duplications in C. elegans and its ancestors, or its frequency relative to other mechanisms of gene duplication, duplicates in this arrangement are the majority of recent gene duplications in this species. The inversion may promote retention of the new gene by inhibiting loss of the gene by illegitimate recombination (Katju and Lynch 2003).

Inverted duplication of a gene within an operon is likely to have a detrimental effect on expression of the downstream genes. The average distance between the polyadenylation site of an upstream gene and the trans-splicing acceptor of the downstream gene in C. elegans is only 126 nucleotides, with very few separated by more than 350 nucleotides (Blumenthal et al. 2002). In an unpublished experiment where heterologous DNA was inserted to artificially increase the intercistronic distance between two C. elegans genes (gpd-2 and gpd-3) from approximately 100 to 300 bp, nearly all trans-splicing of the downstream mRNA was abolished (reported in Huang et al. 2001). Insertion of sequence from the noncoding strand of a duplicated gene between two operon genes may disrupt expression of the downstream operon genes in other ways as well. Insertion of a long enough stretch of DNA could cause RNA polymerase to terminate transcription before reaching the downstream genes. Operon genes have presumably become optimized over time for factors that affect their expression, such as intercistronic distance and proportion of SL2 trans-splicing; perturbing these balances with insertions of noncoding sequence may be detrimental. If insertion of an inverted gene disrupts expression of operon genes by any of the effects listed above, it would be deleterious and selected against.

While the deleterious effects of inverted tandem gene duplications can easily be imagined for genes at any position in an operon (the 5′ and 3′ ends, as well as the middle genes), other mechanisms of gene duplication may be disfavored in operons as well. An illegitimate recombination event that introduces an exceptional amount of flanking DNA next to a noninverted tandem gene duplication may disrupt the operon in ways similar to those described above.

One other point to note is that, even if the insertion of an inverted operon gene does not destroy the biology of the operon, identification of such an operon may be difficult. Genes are annotated together in an operon based in part on their being collinear and consecutive. Operon membership is typically confirmed by SL2 trans-splicing, because the majority of downstream genes in C. elegans are trans-spliced to SL2; however, genes downstream of long intercistronic regions are often trans-spliced to SL1 instead (Blumenthal et al. 2002). An operon interrupted by an inverted duplicate might therefore escape annotation, and the observed paucity could thus be an artifact of the methods currently available to annotate operons.

Our second hypothesis predicts that the patterns of duplications should be very different for genes inside or outside operons: tandem duplications of genes in operons should be rare, while interchromosomal duplications should have the same frequency for genes inside or outside operons. The ideal test of this prediction is to compare only recently duplicated genes with very low KS values. However, by this criterion, the number of duplicated pairs in operons becomes too low (only 12 duplicate pairs exist with KS lower than 1 and families of size 2), and we performed our analysis using all duplicated pairs belonging to families of size 2. Although most of the recent duplications observed in C. elegans occur in tandem and inverted orientation, pairs with high KS values do not display the excess of tandem inverted duplications seen for those with low KS values (Katju and Lynch 2003; Lercher et al. 2003).

We found that intrachromosomal gene duplications occur less frequently for genes in operons (44 cases) than expected (103; p < 0.001), while interchromosomal duplications occur (84 cases) as often as expected statistically, based on the frequency of this type of duplication in the genome (89; p = 0.272). In fact, 65.6% of the duplications involving operon genes were interchromosomal, compared to only 47.3% of all duplications in the genome (Table 3). Furthermore, for genes in operons, intrachromosomal duplications less than 10 kb apart are more rare than duplications separated by greater chromosomal distances (Table 3), which again is in accord with the predictions of the second hypothesis. Thus it appears that genes in operons are duplicated less frequently not because the resulting genes are defective but, rather, because duplication of a gene by typical mechanisms would disrupt either adjacent genes or the structure of the operon.