Introduction

Alu and L1 retrotransposons are the most abundant transposable elements (TEs) in our genome, with approximately 1.1 and 0.5 million copies respectively (IHGSC 2001) and, together, contribute ∼30% of our genome sequence. The majority of both Alu and L1 insertions are nonfunctional “fossils” that either were already incapable of replication at the time of insertion or became inactive later due to mutations, insertions, or deletions. Only 80–100 L1 copies are active at present (Brouha et al. 2003), and the number of active Alu master sequences is approximately 3%–15% of the new insertions (Cordaux et al. 2004).

The two groups of elements rely on different strategies for survival. L1 elements are autonomous and encode proteins with chaperone (ORF1p), endonuclease, and reverse transcriptase activity (ORF2p), which enables them to replicate and insert independently of host functions. A full-length L1 is ∼6500 bp long, but most copies are truncated at their 5′ ends, and the mean insertion length of L1s is only 900 bp (IHGSC 2001). The currently active primate-specific L1s target gene-poor, AT-rich parts of the genome, which can be explained by the cleavage site of the L1 endonuclease: TTTT/A (Feng et al. 1996), but insertions of old, inactive L1 families are present at similar frequencies across the genome regardless of local GC frequency (IHGSC 2001; Yang et al. 2004).

In contrast to L1s, Alus are short (∼300-bp) sequences that do not encode their own proteins but parasitise L1s (Dewannieux et al. 2003; Jurka 1997; Smit et al. 1995). Their replication and insertion depend on the ORF2 protein of L1, and it has been shown that young Alus also target AT-rich parts of the genome but accumulate (more rapidly than L1s) in regions with a high GC content (Gu et al. 2000; IHGSC 2001; Pavlicek et al. 2001; Yang et al. 2004). The consequence of this relatively rapid accumulation is that the chromosomal distributions of Alus and L1s differ, with Alus present mainly in GC-rich regions of the genome (IHGSC 2001; Soriano et al. 1983).

Currently, there is no consensus on the mechanism responsible for the temporal enrichment of Alus and L1s in the GC-rich regions. Gu and colleagues (2000) have argued that the different insertion patterns of Alus and LINEs might be caused directly by Alu/LINE interactions. These authors proposed that Alus could switch their insertion preference to avoid competition with LINEs for the ORF2 protein. This hypothesis does not require selection to act on the insertions, but recent findings (Hackenberg et al. 2005), and also our results (see below), do not support it. Since GC-rich regions are also gene-rich, it has been proposed that the accumulation of Alus in these regions may be adaptive (IHGSC 2001). However, Brookfield (2001) pointed out that the accumulation of Alus in GC-rich regions takes longer than the time necessary for the fixation of neutral alleles; positive selection is therefore unlikely to be the cause. Interestingly, in rodents even the youngest SINE insertions show a strong GC preference (Yang et al. 2004). Bailey et al. (2003) proposed that duplications might contribute to the enrichment of Alus, because duplications occur more frequently in Alu- and GC-rich regions of the genome than elsewhere. However, the frequency of duplications in the human genome is not high enough to explain the strong pattern observed (Jurka et al. 2004). Recently, Belle et al. (2005) demonstrated that degradation of Alus by short indels in GC-poor regions is also not the cause of selective Alu loss.

Several authors have proposed that the enrichment of repeats in GC-rich regions is the consequence of illegitimate recombination between the repeats (Batzer and Deininger 2002; Brookfield 2001; Hackenberg et al. 2005; IHGSC 2001; Lobachev et al. 2000; Medstrand et al. 2002; Stenger et al. 2001), which can cause deletions and duplications (Fig. 1). Deletions are likely to be more deleterious in gene-rich (and GC-rich) regions than in gene-poor (and AT-rich) regions, because they may delete entire genes or exons. This mechanism can thus eliminate repeats from AT-rich (gene-poor) regions without as many harmful effects, leaving a higher abundance in GC-rich regions. It has been shown (MGSC 2002) that in the mouse and human genomes, LINEs and SINEs are found in different chromosomal locations, but human- and mouse-specific LINE (and also SINE) insertions accumulate in orthologous segments of the genomes; repeat densities in one species are actually correlated more strongly with the density in the other species in the same region than with the local GC content of the insertions. This finding led the authors to the conclusion that GC content is not the direct cause of the different distribution of LINEs and SINEs but is merely correlated with the true cause. A relationship between recombination rate and GC content was observed more than a decade ago (Eyre-Walker 1993). More recently, Meunier and Duret (2004) argued that the regional variation of GC content is driven by recombination.

Fig. 1.

Chromosomal rearrangements caused by repeats. Open bars mark nonrepetitive regions that undergo duplication. a Ectopic recombination between heterozygous repeats leads to a deletion on one chromosome and duplication on the other; either is likely to be deleterious in gene-rich regions. This process is likely to be prevalent in organisms with a low TE abundance but high TE polymorphism, like Drosophila (Langley et al. 1988). b In mammals most TEs are fixed, and recombination between homozygous TEs probably has minor consequences on fitness, but recombination between nonhomologous repeats leads to duplications and deletions as well. Bailey et al. (2003) have proposed that Alu-mediated duplications could cause the enrichment of Alus in gene-rich regions, because duplications are more frequent in GC-rich regions. However, Jurka et al. (2004) recently concluded that the number of duplications in the genome is not high enough to explain the GC shift of Alus. c Intrachromosomal exchange between repeats. This process removes one repeat with a fragment of surrounding DNA. Since deletions are more likely to be deleterious in gene-rich regions where they can remove exons, the process predicts faster loss of repeats from gene-poor, AT-rich regions of the genome than from GC-rich regions.

Chromosomal rearrangements caused by repeats occur primarily during meiosis, at least in Drosophila (Montgomery et al. 1991). Therefore, ectopic recombination between repeats predicts that, in regions with low recombination rates or with a very low density of selectively important sequences, repeats will not accumulate in GC-rich regions over time. We test this hypothesis using gene-poor regions from human chromosomes 4, X, and Y. At present the Y chromosome experiences no meiotic recombination (with the exception of pseudoautosomal regions), and Medstrand et al. (2002) have already noticed a delay in the accumulation of Alus on the Y. The Y chromosome evolved from an X-like ancestor and originally paired with the ancestor of the X chromosome (reviewed by Charlesworth et al. 2005). The cessation of meiotic recombination between the two chromosomes was gradual and involved several steps leading to “evolutionary strata” on the sex chromosomes, the oldest formed 240–320 million years ago (mya), while the most recent ones formed 30–50 mya (Lahn and Page 1999; Ross et al. 2005; Skaletsky et al. 2003). However, since some Alu and, particularly, L1 families are older than the youngest “evolutionary strata,” at the time of their insertion they could also have inserted into parts of the chromosome which were still recombining. In addition, 10.2 Mb of the Y chromosome sequence was acquired from several autosomes by transposition events in the last 300 million years (Skaletsky et al. 2003), which certainly moved repeats to the Y chromosome. For these reasons we expected reduced enrichment of TEs in GC-rich regions of the Y chromosome, rather than their complete absence.

In this paper we address the following questions.

  1. We test whether the change of GC distribution of non-LTR retroelements follows the same temporal pattern on the X chromosome and the male-specific part of the Y chromosome as on the autosomes.

  2. Since Alus and L1s use the same protein (the ORF2p of L1s) for replication and insertion, we predict that, in the absence of selection against deleterious deletions, their abundances on the Y chromosome should be positively correlated. We test whether Alus and LINEs accumulate at the same chromosomal locations on the Y and compare the results with those from other chromosomes.

  3. If ectopic recombination is the cause of the GC shift of repeats, then the important factors are its frequency and the gene density of the chromosome, which determines the magnitude of deleterious effects of deletions, and both factors are weaker on the Y. They can be separated by comparing regions of the genome that, unlike the Y, do experience meiotic recombination but have a very low gene density, with regions of high gene density but low recombination rate. We test whether the distribution of Alus and L1s shows a shift toward GC-rich regions, using genomic regions which are extremely gene-poor but recombine (Myers et al. 2005), one on chromosome 4 (26.7–38.5 Mb) and one on the X chromosome (86.8–95.7 Mb), and a fragment of the Y chromosome (19–26.7 Mb) which has a higher than average density of genes.

  4. It has been shown that mammalian sex chromosomes contain a higher abundance of TEs than autosomes (Baker and Wichman 1990; MGSC 2002) and the proportion of full-length and long L1 insertions is also higher (Boissinot and Furano 2001; Erlandsson et al. 2000; MGSC 2002). In addition, insertions of different lengths are present in regions of different GC content, with AT-rich regions of the genome containing more long L1 insertions (Medstrand et al. 2002; MGSC 2002). Since ectopic recombination is more likely to occur between long insertions than short ones (Hasty et al. 1991; Petrov et al. 2003), the loss of long insertions due to recombination results in an enrichment of LINEs in GC-rich regions, relative to AT-rich regions. We test whether the selective loss of long L1 insertions is a significant factor in the accumulation of old repeats in GC-rich regions.

Methods

The sequence and repeat annotation files (RepeatMasker) of the human genome (hg17; May 2004 assembly) were downloaded from the UCSC genome browser site at http://genome.ucsc.edu (Karolchik et al. 2003). The pseudoautosomal regions (PARs) and the X transposed regions (XTRs) were excluded from the sequence of the Y chromosome. Both Alus and LINEs were grouped into cohorts according to their consensus sequences (Jurka and Milosavljevic 1991; Jurka and Smith 1988; Smit et al. 1995). The cohorts were active during different periods of mammalian evolution (IHGSC 2001): AluY (currently active, 0–30 mya); AluS (30–60 mya); AluJ (60–100 mya); L1PA (currently active, 0–65 mya); L1PB (50–80 mya); L1MA (50–100 mya); L1MB (100–150 mya); and L1MC, L1MD, and L1ME (80–150 mya). With the exception of the youngest L1PA and AluY cohorts, in humans all Alu and L1 insertions are “fossils,” and there is no evidence of their activity for millions of years; L1P* cohorts are present only in primates, while L1M* cohorts can be found in most mammals. Analyses of the age, phylogeny, insertion preference, and copy numbers of L1 cohorts can be found in IHGSC (2001), Ohshima et al. (2003), and Smit et al. (1995).

The insertion preference of repeats was calculated using the method of Yang et al. (2004). This takes into account the differences in absolute repeat densities and GC distribution between the chromosomes. First, the frequency distribution of the GC content of the chromosomes (autosomes, X, Y) was calculated by dividing their sequences into nonoverlapping 30-kb windows. The frequency of the G and C nucleotides (GCchr) was calculated in each window. Repetitive sequences were excluded from the windows; consequently the average nucleotide count in the windows was 15 kb. Next, the local GC content of the sequences adjoining individual repeat insertions (GCrep) was calculated. For each insertion, the frequency of G and C nucleotides was counted in adjoining 15-kb windows up- and downstream of the insertions (before the first and after the last position of the repeat), excluding repetitive sequences, to ensure the independence of GCrep from GCchr and local repeat density. Fragmented repeats were counted as one insertion. When calculating the frequency distributions of GCchr and GCrep, bins of 2% were used for the X chromosome, and bins of 3% for the Y chromosome, due to its lower abundance of repeats. The insertion preference of the repeat cohorts is represented by relative frequencies: the frequency of GCrep in a bin divided by the frequency of GCchr in the same bin. For statistical analyses of Alu distributions, a modified version of the method outlined above was used: the GCrep of each repeat was divided by the median of the GCchr distribution of the sequence analyzed. The resulting distribution was dichotomized; 1 was assigned to each repeat with GCrep equal to or larger than the median, and 0 to repeats with smaller GCrep. The proportions of repeats with a local GC content larger than or equal to the median GCchr on each chromosome or genomic region were compared with Tukey tests (Zar 2004). This method allows statistical comparison of the shifts in repeat distributions when the underlying GC distributions of the sequences compared are very different.
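
The relative-frequency and dichotomization steps described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the function names are hypothetical, the GC values are assumed to be given as fractions (0 to 1), and repeats are assumed to be soft-masked in lowercase (as in UCSC sequence files) so that only uppercase bases are counted.

```python
from collections import Counter
from statistics import median


def gc_fraction(seq):
    """GC fraction of the unmasked (uppercase) bases; None if none are present.

    Assumption: repetitive sequence is soft-masked in lowercase and is skipped.
    """
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else None


def relative_frequencies(gc_rep, gc_chr, bin_width=0.02):
    """Frequency of GCrep in each bin divided by the frequency of GCchr there.

    Returns a dict keyed by bin index (bin i covers [i*bin_width, (i+1)*bin_width)).
    Bins that contain no chromosomal windows are skipped.
    """
    rep_bins = Counter(int(g / bin_width) for g in gc_rep)
    chr_bins = Counter(int(g / bin_width) for g in gc_chr)
    result = {}
    for b, n_chr in sorted(chr_bins.items()):
        f_chr = n_chr / len(gc_chr)
        f_rep = rep_bins.get(b, 0) / len(gc_rep)
        result[b] = f_rep / f_chr
    return result


def fraction_at_or_above_median(gc_rep, gc_chr):
    """Dichotomize GCrep at the median of GCchr and return the proportion of 1s."""
    m = median(gc_chr)
    return sum(1 for g in gc_rep if g >= m) / len(gc_rep)


# Toy values only, not data from the paper.
window_gc = [0.35, 0.35, 0.41, 0.41]   # GCchr of four chromosomal windows
flank_gc = [0.35, 0.41, 0.41, 0.41]    # GCrep of four insertions
preference = relative_frequencies(flank_gc, window_gc)
shifted = fraction_at_or_above_median(flank_gc, window_gc)
```

A relative frequency above 1 in a bin means the cohort is overrepresented at that GC content relative to the chromosome as a whole; the dichotomized proportion is the quantity that is compared between chromosomes.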

Repeat densities on the chromosomes were calculated by dividing the chromosomes into nonoverlapping windows of 200 kb. In each window, the numbers of Alu, L1PA, and L1PB insertions were counted (L1PA and L1PB being the primate-specific L1 cohorts); fragmented repeats were counted only once. Repeats lying on the border, or having fragments in two windows, were counted in the window that contains their first nucleotide. Besides the sex chromosomes, the densities of the repeats were also calculated on two autosomes, chromosome 7 and chromosome 21, which have a size and GC content similar to those of the sex chromosomes (chromosome 7 to the X, chromosome 21 to the Y [ICGSC 2004]). The length of L1 insertions was calculated as in ICGSC (2004), as the difference between the first and the last position on the matching consensus sequence.
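
The counting rules above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the original implementation: the input layout (an insertion identifier shared by all fragments of one insertion, with 0-based start coordinates) and the function names are hypothetical.

```python
from collections import Counter

WINDOW = 200_000  # window size in bp, as stated in the text


def window_counts(fragments, window=WINDOW):
    """Count insertions per nonoverlapping window.

    fragments: iterable of (insertion_id, start, end); fragments of the same
    insertion share an id (hypothetical layout). Each insertion is counted
    once, in the window that contains its first nucleotide.
    """
    first_start = {}
    for ins_id, start, _end in fragments:
        if ins_id not in first_start or start < first_start[ins_id]:
            first_start[ins_id] = start
    return Counter(s // window for s in first_start.values())


def consensus_length(consensus_spans):
    """Insertion length as last minus first matched position on the consensus."""
    first = min(s for s, _ in consensus_spans)
    last = max(e for _, e in consensus_spans)
    return last - first


# Toy coordinates only, not data from the paper.
toy = [
    ("a", 199_990, 200_300),   # straddles a border: counted in window 0
    ("b", 10, 300),            # fragmented insertion, first fragment
    ("b", 500, 800),           # second fragment of the same insertion
    ("c", 400_000, 400_100),
]
counts = window_counts(toy)
length = consensus_length([(100, 900), (1200, 6000)])
```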

Results

On the X chromosome and autosomes, both Alus and L1s increase in frequency in GC-rich regions over time (Figs. 2a and b, Supplementary Fig. 1). In contrast, on the Y chromosome old LINEs (L1MB–L1ME cohorts) are as frequent in AT-rich regions as the currently active L1PA cohort (Fig. 2c). Unlike LINEs, Alus on the Y chromosome do increase in frequency in GC-rich regions over time (Figs. 2d and 3), but the GC shift is delayed (Fig. 2d) and weaker than on the autosomes and X chromosome (Fig. 3). No significant change in repeat distributions was observed in regions with low gene densities on chromosomes 4 and X (Figs. 3 and 7).

Fig. 2.

Frequency of occurrences of different LINE and SINE families in regions of different GC content. The pattern on the X chromosome is similar to the pattern observed on autosomes (Supplementary Fig. 1; IHGSC 2001); at the time of insertion Alus show a similar GC preference to LINEs, but accumulate rapidly in GC-rich regions of the genome, while the accumulation of LINEs in GC-rich regions characterizes only the older clades. On the Y chromosome LINEs show no accumulation in GC-rich regions, and Alus show only very weak enrichment in GC-rich regions. Dotted lines show the distribution of the youngest Alu insertions of the currently active AluY family, which diverged from their consensus by 1%, 2%, and 5%. Divergence was calculated as in MGSC (2002). On the Y chromosome the higher, 5% cutoff is necessary due to the low abundance of Alus.

Fig. 3.

The shift of Alu cohorts toward GC-rich regions on the analyzed chromosomes and chromosomal regions. The numbers above the whiskers give the frequency (%) of Alus having a local GC content higher than the median of GCchr. The shift toward GC-rich regions is highly significant on every chromosome (Tukey’s test for proportions, p ≪ 0.001), with the exception of the gene-poor regions of chromosomes 4 and X (p > 0.5 for both; n chr4 = 724, n chrX = 640). The Y chromosome is significantly different from both the X chromosome and the autosomes (p ≪ 0.001).

Alu and LINE densities are negatively correlated on autosomes and the X chromosome (Spearman rank correlations; see legend to Fig. 4), while on the Y chromosome they are positively correlated (Fig. 4a). The strongest negative relationship between Alu density and LINE density is on the X chromosome, where LINEs are particularly abundant (Bailey et al. 2000; Ross et al. 2005) and are thought to be involved in its inactivation (Bailey et al. 2000; Chow et al. 2005; Lyon 1998; but see Ke and Collins 2003). On the two autosomes the trend is similar to the trend observed on the X chromosome, but less pronounced (Fig. 4b).
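
For readers who want to reproduce this kind of comparison, Spearman’s rank correlation between per-window Alu and L1 counts can be computed as the Pearson correlation of the ranks. The sketch below is a plain-Python illustration with hypothetical function names and toy counts; it does not compute p-values and is not the software used for the analyses reported here.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy per-window counts only, not data from the paper.
alu_counts = [12, 30, 45, 80]
l1_counts = [40, 22, 15, 3]
rho = spearman(alu_counts, l1_counts)  # perfectly inverse ranks in this toy case
```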

Fig. 4.

The abundances of primate-specific L1s (L1PA–L1PB cohorts) and Alu elements in 200-kb nonoverlapping windows. Alu and L1P abundances are strongly negatively correlated on the autosomes and the X chromosome, where the trend is the strongest, but are positively correlated on the Y chromosome. Spearman’s rank correlations: Y chromosome, R = +0.25, n = 100, p = 0.010; X chromosome, R = –0.61, n = 761, p ≪ 0.001; chromosome 7, R = –0.45, n = 711, p ≪ 0.001; chromosome 21, R = –0.45, n = 150, p ≪ 0.001.

The analysis of the length distribution of L1s in the finished portions of the sex chromosomes (PARs and the X-transposed region excluded from the Y chromosome) shows a qualitatively similar enrichment of full-length L1PA (the currently active LINE cohort) elements on the sex chromosomes (Fig. 5) to that reported by Boissinot et al. (2001). The frequency of full-length (5600- to 6400-bp-long) L1PA insertions is 1.7–2 times higher on the Y chromosome, and 1.08–1.46 times higher on the X chromosome, than on autosomes (Fig. 5). The frequency of full-length elements among the youngest L1PA1 (L1HS)–L1PA5 families is 29.3% on the Y chromosome and 17.2% on the X chromosome (not shown).

Fig. 5.

Higher frequencies of LINEs on sex chromosomes. a Frequency distribution of insertion sizes of the currently active L1PA cohort on the X and Y chromosomes. The insertions were grouped into bins of increasing insertion size, differing by 80 bases. b The frequency of insertions longer than 1000 bp is higher on the sex chromosomes than on autosomes: Mann–Whitney U-test (n = 15), p < 0.001 for both the X and the Y chromosomes. Dotted lines are best-fit (OLS) logarithmic functions. Repeats were grouped into bins differing by 400 bases.

Although long insertions are much less frequent in old L1 cohorts than in younger ones, their selective disappearance is not the main cause of the shift of old L1 cohorts toward GC-rich regions (Fig. 6): the variance explained by the negative correlation between the length of the insertions and their local GC content is far less important than the length-independent shift toward a high local GC content (see statistics in Fig. 6). However, in the relatively gene-rich (compared to the chromosome average) region of the Y chromosome, long L1 insertions (>1000 bp) show a weak but significant shift toward regions with a high GC content (Fig. 7).

Fig. 6.

Relationships between the local GC content and the length of L1 insertions. We used ANCOVA to separate the effect of cohort (categorical predictor) from the effect of insertion length (continuous predictor), but for clarity (∼300,000 insertions) the data are shown as box-plots. Lengths and local GC contents were log transformed prior to analysis. On autosomes the effect of cohort is almost two times stronger than the effect of length (insertion length, β = –0.09, p ≪ 0.001; cohort, β = +0.164, p ≪ 0.001; n = 293,783); the pattern is similar on the X chromosome (insertion length, β = –0.076, p ≪ 0.001; cohort, β = +0.173, p ≪ 0.001; n = 21,224). On the Y chromosome we found a significant relationship between insertion length and GC preference (β = –0.091, p < 0.001; n = 1992) but no significant effect of cohort (β = +0.038, p = 0.101).

Fig. 7.

Relationships between the insertion length of LINEs and their local GC content, using chromosomal regions with a low gene density (chromosomes 4 and X) and a relatively gene-dense (compared to the chromosomal average) region from the Y chromosome. We used GLM for statistical analyses. No significant (p > 0.1; n chr4 = 966, n chrX = 1,289) effects of any factors or their interactions are detected on chromosomes 4 and X, but a weak effect (p = 0.0131; n = 785) of the interaction between insertion length and cohort is present on the Y chromosome, indicating that the slopes of the regressions of old and young L1 cohorts are different.

Discussion

Our results support the hypothesis that ectopic recombination between repeats drives the accumulation of L1s and Alus in GC-rich regions of the genome. This can explain most observed patterns, both on autosomes (and X) and on the Y chromosome. Its implications are: (1) the Y chromosome has the original, approximately unchanged GC distributions of LINEs and Alus, which reflects the insertion patterns of the repeats; (2) the AT preference of the recent TE insertions is primarily the result of the target specificity of the ORF2 protein; and (3) the insertion preferences of old LINE (L1MB–L1ME) and Alu (AluS, AluJ) cohorts were similar to those of the currently active cohorts. In agreement with the predictions of ectopic recombination, the gene-poor regions of chromosomes 4 and X show no change in Alu and L1 distributions over time, suggesting that the presence of genes is necessary for the GC shift; in their absence recombination alone does not change the distribution of repeats (Figs. 3 and 7). The observation that on the Y chromosome Alus and primate-specific L1s accumulate in the same chromosomal regions (Fig. 4) is the expected pattern for repeats which use the same protein for insertion, and the negative correlation between Alu and L1 abundances on other chromosomes is likely to be the result of subsequent processes (ectopic exchange). The weak enrichment of Alus in GC-rich locations on the Y could be due to the gradual cessation of recombination or even ectopic recombination, because roughly 25% of the euchromatic region of the Y chromosome consists of palindromes, which are older than the human–chimpanzee split and undergo frequent gene conversion (Rozen et al. 2003).

Ectopic recombination occurs more often between long sequences than short ones (Hasty et al. 1991; Petrov et al. 2003; but see Cooper et al. 1998) and this predicts the overrepresentation of long insertions on sex chromosomes. This is clearly the case: the X and, especially, the Y chromosome have higher frequencies of insertions longer than 1000 bp compared with autosomes (Fig. 5). The loss of long L1 insertions is not, however, the main factor that shifts the distribution of old repeats toward GC-rich regions, since on autosomes and the X, all length classes show a comparable shift (Fig. 6). This apparent paradox can be explained by at least two processes: incomplete deletions of repeats may change the length but not necessarily the local GC content of the insertion, and such repeats may be classified among the short insertions; also, the frequency of ectopic recombination is likely to increase with increasing density of repeats, and short L1 repeats (<1000 bp) are much more abundant than long ones on every chromosome. The weak but significant shift of long LINEs toward GC-rich regions on the relatively gene-rich (ampliconic) fragment of the Y chromosome (Fig. 7) indicates that recombination between repeats occurs on the Y chromosome as well, but it is either not frequent or not deleterious enough to shift the distribution of the shortest insertions toward GC-rich regions.

An alternative possibility is that the pattern on the autosomes and X chromosome represents the original distribution of repeats and was generated by an unknown process, and it is the Y chromosome that underwent subsequent changes. The Y chromosome has experienced degeneration: it has lost the majority of its genes, and its overall length has decreased to one-third the size of the X chromosome. Its euchromatic region is even smaller, comprising approximately one-half of the chromosome (Skaletsky et al. 2003). The underlying causes of shrinking are not fully understood, but most explanations are related to lack of recombination (reviewed by Charlesworth and Charlesworth 2000). Recombination enables outcrossing parents to produce offspring with fewer deleterious mutations than they have themselves. Therefore in nonrecombining organisms or regions of the genome the accumulation of deleterious mutations is inevitable and can lead to the loss of deleterious and nonfunctional genomic material. The degeneration of the Y chromosome did not affect its entire sequence similarly; GC-rich regions disappeared at a higher rate than AT-rich regions, due to either sequence turnover or deletions, and this has skewed the overall GC distribution of the chromosome compared to X and autosomes (Supplementary Fig. 2). The distribution of L1s shows that most of the oldest cohorts (L1MC, L1MD, L1ME, which were active in the ancestors of primates 80–150 mya) have already disappeared from the Y; the chromosome is dominated by the youngest, primate-specific repeats (Supplementary Fig. 3). If repetitive sequences with a high local GC content were deleted on the Y chromosome at a higher rate than euchromatic sequences with the same GC content, it would have led to the depletion of repeats in these regions. Complete deletions leave no signs behind, but partial deletions should result in a more pronounced negative correlation between GC content and insertion length on the Y chromosome than on recombining chromosomes. In the case of LINEs we found no such effect; the standardized slope (β) of the partial regression between insertion length and GC content on the Y chromosome is similar to that of the autosomes (Fig. 6).

Taken together, although the shrinking of the Y chromosome is likely to have influenced the GC distribution of its retroelements, our data do not indicate that this is a major force, while selection against deleterious deletions caused by ectopic recombination explains the observed pattern both on autosomes and sex chromosomes. The two repeat classes behave in a qualitatively similar way; only the speed of the GC shift is different. For heterozygous repeats (Fig. 1a), theory predicts that the likelihood of TEs participating in ectopic recombination events is proportional to the square of the copy numbers of TEs (Langley et al. 1988), and experimental work on Drosophila (Montgomery et al. 1991) suggests that ectopic exchange between repeats is most frequent when repeats are heterozygous. It has been proposed (Charlesworth and Charlesworth 1995; Morgan 2001) that in selfing species, where homozygosity is high, TE abundances should be higher due to a reduced frequency of ectopic exchange between repeats. In mammals, similarly to selfing species, the vast majority of L1 and Alu insertions are homozygous (Bennett et al. 2004), in this case because of fixation, and unlike in Drosophila, intrachromosomal recombination may be the main force removing TEs from the genome (Fig. 1c). It is unclear how the frequency of intrachromosomal recombination scales with the density of repeats, but again the relationship is likely to be positive and nonlinear. The majority of L1 insertions in the genome predate the appearance of Alus; only approximately 130,000 L1s date from the time of Alu proliferation. At the peak of their activity (∼40 mya) Alu insertions were 25 times more frequent than the coexisting L1s (∼180,000 vs. ∼7,200 [Abrusan and Krambeck, submitted]). Although a typical LINE insertion is four to five times longer than an Alu, the large difference in their abundances alone may be sufficient to explain the faster shift of Alus toward GC-rich regions.
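
The abundance argument above can be made concrete with a short calculation. The copy numbers are those quoted in the text; the quadratic exponent is the scaling predicted by Langley et al. (1988) for heterozygous repeats and is used here purely as an illustrative assumption, since the scaling for intrachromosomal recombination is unknown. The function name is hypothetical.

```python
ALU_INSERTIONS = 180_000  # Alu insertions at the peak of activity (from the text)
L1_INSERTIONS = 7_200     # coexisting L1 insertions (from the text)


def relative_exchange_opportunity(copy_ratio, exponent=2.0):
    """Relative opportunity for ectopic exchange, assuming rate ~ copies**exponent.

    exponent=2.0 is the heterozygous-repeat prediction; other values can be
    substituted because the true scaling in mammals is not known.
    """
    return copy_ratio ** exponent


copy_ratio = ALU_INSERTIONS / L1_INSERTIONS              # 25.0
quadratic = relative_exchange_opportunity(copy_ratio)    # 625.0 under the assumption
linear = relative_exchange_opportunity(copy_ratio, 1.0)  # 25.0 if scaling were linear
```

Even under linear scaling, the 25-fold excess of Alus exceeds the four- to fivefold length advantage of a typical LINE insertion, which is consistent with the faster shift of Alus described above.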