Introduction

Transposable elements (TEs) are genomic sequences able to replicate themselves, and to move from one site to another within genomes. They are present in almost all the organisms in which they have been looked for, and can make up a large proportion of a genome: ~45% of primate genomes (The International Human Genome Sequencing and Analysis Consortium 2001; The Chimpanzee Genome Sequencing and Analysis Consortium 2005; The Rhesus Macaque Genome Sequencing and Analysis Consortium 2007), 38.5% of the mouse genome (The Mouse Genome Sequencing and Analysis Consortium 2002), 5% of euchromatin in Drosophila melanogaster (The Drosophila 12 Genomes Consortium 2007), and in some plants, such as maize, they can account for 80% of the genome (Schnable et al. 2009). Several classes of TE can be distinguished on the basis of their sequence and structure (Wicker et al. 2007), however, they are usually divided into two main classes based on their replication system (Finnegan 1989). The class II elements, or DNA transposons, use a DNA intermediate and move by a “cut and paste” mechanism; the class I elements, or retrotransposons, use an RNA intermediate and move by a “copy and paste” mechanism. The retrotransposons are further divided into two subclasses, those that have “long terminal repeats” (LTRs) at their extremities (LTR retrotransposons), and those that do not (non-LTR retrotransposons). There are also two subfamilies of non-LTR retrotransposons: the long interspersed nuclear elements (LINEs) and the short interspersed nuclear elements (SINEs).

TEs can insert into the cis-regulatory regions of genes, thus acting as regulatory sequences that control the expression of these genes (Mariño-Ramírez et al. 2005). For example, Alu elements, a particular family of SINEs, contain binding sites for transcription factors, which allow these elements to control the activity of nearby genes (Polak and Domany 2006). The presence of SINEs in the gene neighborhood has been found to be associated with the deregulation of genes in tumoral conditions (Lerat and Sémon 2007). TE insertion and recombination between TE copies, either on the same or on different chromosomes, can also lead to duplication/deletion events, chromosomal translocations, and more complicated chromosomal rearrangements. These modifications are responsible for some human diseases (Kazazian and Moran 1998; Kazazian 2004; Deininger and Batzer 1999): Alu insertions have been linked to 16 diseases (Deininger and Batzer 1999) and L1 insertions to 15 diseases (Chen et al. 2005). For instance, one type of breast cancer is caused by an Alu insertion in exon 22 of the BRCA2 gene, and hemophilia A by the insertion of an L1 in exon 14 of coagulation factor VIII (Kazazian et al. 1988). Forty-nine diseases, including Tay-Sachs disease (Myerowitz and Hogikyan 1987), are known to be the consequence of Alu/Alu recombination (Deininger and Batzer 1999). L1/L1 recombination is known to be responsible for two diseases, glycogen storage disease (Burwinkel and Kilimann 1998) and duplication of the β-globin gene (Fitch et al. 1991). Despite their deleterious impact, TEs can also have positive effects on the host genome (Biemont and Vieira 2006), and some copies have been domesticated by the genome. The human protein SETMAR, which is a fusion between an H3 methylase and a transposase belonging to DNA transposons of the mariner family (Liu et al. 2007), has a role in both DNA methylation and DNA repair. The RAG1 and RAG2 proteins, which initiate V(D)J recombination, a site-specific somatic recombination necessary for the assembly of the variable region of B-cell receptor/immunoglobulin and T-cell receptor genes from different gene segments, have been shown to be derived from an ancient transposase (Roth and Craig 1998).

The distribution of Alus and LINEs in the human genome has been studied with regard to the characteristics of the genes into which or near to which these elements are inserted. Two categories of gene function have been distinguished according to their Alu content: Alu-poor genes are implicated in important functions such as regulation and transcription, whereas Alu-rich genes tend to be associated with transport or metabolism functions (Grover et al. 2003). This difference in Alu distribution could be explained by the involvement of Alus in regulatory processes. Human L1 elements tend to be inserted into gene-poor regions, not only because they can affect gene expression since they have their own promoter in their 5′-UTR region, but also because of their capacity to produce truncated transcripts by introducing more transcription termination signals (Perepelitsa-Belancio and Deininger 2003) or by reducing the amount of transcript produced, thus, reducing protein expression (Han et al. 2004). Moreover, the function and expression level of genes have been found to be associated with TE insertion/fixation in mammalian introns (Sironi et al. 2006).

Most of the works referred to above, which analyze the relationships between the TEs and the host genes, concern particular types of TE families, such as Alu and L1 elements, which constitute the largest proportion of the TEs in the human genome. Only a few studies have investigated the impact of DNA transposons or retrotransposons, including the inserted forms of retroviruses. In our study, we focused on the relationships between overall TE insertions and human genes. We have considered only the complete TE copies, compared to their reference, that could still have an intact and potentially active promoter region that could allow them to have a particular influence on neighboring genes. We show here that the functions of TE-rich genes differ from those of TE-free genes for all the TE families studied. This difference seems to be associated with the different selection pressures acting on the two categories of gene, with selection acting more strongly on TE-poor genes and their flanking regions. In addition, using microarray data, we show that the expression of TE-free and TE-rich genes differs in tumor and immune system tissues. This suggests that TE-rich genes are deregulated in tumor tissues, and that they are subject to specific regulation in immune system tissues.

Materials and Methods

The numbers of genes used in the different analyses are indicated in two summary figures provided as supplementary material.

The Data

We used the BioMart tool (http://www.biomart.org/; Smedley et al. 2009) to extract all human gene coordinates (gene start and end, chromosome location) from the Ensembl database (release 50, July 2008), which comprises 35,641 genes. We then downloaded the list of TE coordinates obtained from the corresponding human genome version (hg18) available on the RepeatMasker website (http://www.repeatmasker.org/PreMaskedGenomes.html). For each gene, we considered the 2- and 10-kb flanking regions located upstream and downstream of the gene. We chose these values because it has been demonstrated that the promoter regions of human genes can be located up to 10 kb from the gene start, with the majority of the promoters being within a 2-kb region (Kim et al. 2005).

We calculated the number of complete copies of TEs inserted into each gene and into its 2- and 10-kb flanking regions. We considered a TE copy to be complete if its length was at least 95% of that of the complete reference element, and if its divergence from the complete reference element was <20% (Lerat and Sémon 2007). We also considered as complete all solo-LTRs whose length was more than 95% of that of the full-length reference LTR and whose divergence from the reference LTR was <20%, because such sequences harbor regulatory regions. We then calculated the overall TE density for each gene as the ratio of the number of complete TEs within the gene and its flanking regions to the length of the gene plus its flanking regions minus the length of the TEs. The value found was multiplied by 10^4 to allow a more convenient representation of the data.

$$ \text{TE density} = \frac{\text{Number of TEs}}{\text{Gene length} - \text{TE length}} \times 10^{4} $$

Similarly, for each gene we calculated the density in DNA transposons, LTR retrotransposons, LINEs, and SINEs. We used the TE density to define four classes of genes, extending from TE-free genes to TE-rich genes. We defined all genes with a TE density >10 as TE-rich genes.
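The completeness filter, the density formula, and the class boundaries described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the function names are hypothetical, the boundaries of the two intermediate classes (0 < density ≤ 5 and 5 < density ≤ 10) are those given in the Results, and we assume that "TE length" refers to the summed length of the complete copies counted.

```python
def is_complete(copy_len, ref_len, divergence_pct):
    """A copy is 'complete' if it spans >= 95% of its reference
    element and diverges from it by < 20%."""
    return copy_len >= 0.95 * ref_len and divergence_pct < 20.0


def te_density(n_complete, region_len, te_len):
    """Number of complete TEs per 10^4 bp of non-TE sequence.

    region_len: gene length plus both flanking regions (bp).
    te_len: summed length of the complete TEs counted (bp); assumption.
    """
    non_te_len = region_len - te_len
    if non_te_len <= 0:
        raise ValueError("region is entirely covered by TEs")
    return n_complete / non_te_len * 1e4


def density_class(density):
    """Map a TE density onto the four gene classes."""
    if density == 0:
        return "TE-free"
    if density <= 5:
        return "0-5"
    if density <= 10:
        return "5-10"
    return "TE-rich"


# Hypothetical gene: 20,000 bp plus 2-kb flanks on each side.
copies = [(300, 310, 12.0), (150, 310, 8.0), (6000, 6100, 5.0)]  # (length, reference length, % divergence)
complete = [c for c in copies if is_complete(*c)]
density = te_density(len(complete), 20_000 + 2 * 2_000, sum(c[0] for c in complete))
print(round(density, 2), density_class(density))
```

In this invented example the truncated 150-bp copy is discarded, so two complete copies remain and the gene falls into the lowest non-zero density class.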

We used the web-based application FatiGO (http://babelomics.bioinfo.cipf.es/EntryPoint?loadForm=fatigo; Al-Shahrour et al. 2004, 2005) to compare the functions of the two sets of genes (TE-free and TE-rich) for the two sizes of flanking region. FatiGO was used to extract Gene Ontology (GO) terms that were significantly over- or under-represented in one of the two sets of genes. GO terms are classified into three non-overlapping domains (The Gene Ontology Consortium 2000, 2001): the molecular function, which describes the biochemical activity of the gene product (e.g., enzyme, transporter, ligand, etc.) without specifying either when or where it is produced; the biological process in which the gene product is involved (e.g., biosynthesis of cyclic AMP, cell growth, etc.); and the cellular component within which the gene product is located (e.g., nucleus, Golgi apparatus, etc.).

Evolutionary Analyses

We used BioMart to extract the orthologous gene sequences of the chimpanzee, and then computed the selection pressure ratio, ω (or Ka/Ks, where Ka represents the rate of non-synonymous substitutions and Ks the rate of synonymous substitutions), of TE-free and TE-rich genes. We eliminated any human genes without orthologous genes in chimpanzee and the genes with internal stop codons. To avoid saturation in the computation of the selective ratio and bias in the statistical analyses, we also removed the genes with almost identical sequences between human and chimpanzee (with a percentage identity between 99 and 100%). We performed pairwise alignments of each pair of human and chimpanzee genes using MUSCLE (Edgar 2004), and we used the codeml program from the PAML package (Yang 2007) to calculate the ω ratio. The same method was used to compute the ω ratio between human genes and their orthologous genes in mouse.
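The filtering rules applied before the ω computation can be illustrated with a short sketch. The function names are hypothetical, the inputs are assumed to be two aligned coding sequences with "-" as the gap character, and checking both species for internal stop codons is our assumption; genes lacking an ortholog are assumed to have been removed upstream.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_internal_stop(cds):
    """True if a stop codon occurs before the last complete codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return any(codon in STOP_CODONS for codon in codons[:-1])


def percent_identity(aln_a, aln_b):
    """Percentage of identical bases over columns without gaps."""
    pairs = [(a, b) for a, b in zip(aln_a.upper(), aln_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def keep_pair(aln_human, aln_other):
    """Apply the two sequence-based filters to an aligned ortholog pair."""
    for aln in (aln_human, aln_other):
        if has_internal_stop(aln.replace("-", "")):
            return False
    # Pairs with 99-100% identity are discarded.
    return percent_identity(aln_human, aln_other) < 99.0


print(keep_pair("ATGGCATGA", "ATGGCTTGA"))  # toy sequences, not real genes
```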

We compared the ω ratios of orthologous genes along the phylogenetic tree of four primates: human, chimpanzee, orangutan, and rhesus macaque. Using BioMart, we searched for the “one-to-one” orthologous protein coding genes present just once in each of the four genomes. The orthologous genes obtained were aligned using MUSCLE (Edgar 2004). Ten different scenarios were thus defined depending on the neighboring TE status of the genes (see Fig. 1). For example, scenario 1 indicates that the genes in all four species are TE-rich, whereas scenario 10 indicates that human genes are TE-free while the other primate genes are TE-rich. In order to evaluate the contribution of selection pressure to the difference in TE density observed between the four primate species, different models were studied for the different scenarios considered. The M0 model assumes the same ω ratio for all branches in the tree; the M1 model supposes an independent ω ratio on each branch of the phylogeny; the M2 model assumes a ω1 ratio for a specific branch, which differs from the background ω0 ratio of the tree. These computations were done using the codeml program from the PAML package (Yang 2007). Two models were compared for each scenario: M0 versus M1 when the orthologous genes in the four primate genomes had the same TE density, and M0 versus M2 when the TE density of orthologous genes was different in at least one species. The statistical significance was assessed by comparing twice the difference in the log-likelihood scores (2Δℓ) with a χ² distribution with a number of degrees of freedom equal to the difference in the number of parameters between the models.
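The likelihood ratio test used to compare nested models can be sketched as follows. This is an illustrative standard-library implementation, not the code used in the study: the function names are hypothetical, the log-likelihood values in the example are invented, and the degrees of freedom must be supplied by the user (for M0 versus M2, the alternative model has one extra ω parameter, hence one degree of freedom).

```python
from math import erfc, exp, gamma, sqrt


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) of a chi-square distribution
    with integer df >= 1, built from the df = 1 or df = 2 closed form."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2:
        k, q = 1, erfc(sqrt(half))   # df = 1
    else:
        k, q = 2, exp(-half)         # df = 2
    while k < df:
        # Recurrence for the regularized upper incomplete gamma function.
        q += half ** (k / 2.0) * exp(-half) / gamma(k / 2.0 + 1.0)
        k += 2
    return q


def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return (2 * delta lnL, P value) for two nested models."""
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return stat, chi2_sf(stat, df)


# Invented log-likelihoods for one gene: M0 (null) versus M2 (alternative).
stat, p = likelihood_ratio_test(-2345.10, -2343.00, df=1)
print(f"2*dlnL = {stat:.2f}, P = {p:.3f}")
```

When the P value exceeds the chosen threshold, the simpler M0 model is retained for that gene.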

Fig. 1

Phylogeny of the four primates studied (branch length ignored). The tree is rooted by the mouse. The date of divergence in million years (My) is indicated on each branch. The table on the right shows the 10 gene evolution scenarios associated with the ω comparative analysis (see “Materials and Methods” section). F in white cells: TE-free genes; R in gray cells: TE-rich genes

The percentage identities of the flanking regions upstream and downstream of each orthologous gene pair of human and chimpanzee, human and orangutan, and human and macaque, were calculated using the dnadist program of the phylip package (Felsenstein 1989). To align the flanking sequences of each orthologous gene pair, the 2- and 10-kb flanking region sequences were submitted to RepeatMasker in order to mask the incomplete TEs present in these sequences. Sequences containing more than 60% of TE sequences were not taken into consideration in our analysis, because once the TEs had been masked, the DNA sequences were too short to be aligned.
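The exclusion rule for heavily masked flanking sequences can be illustrated with a minimal sketch. We assume here that masked bases are written as N (or X) in the RepeatMasker output and that the 60% threshold applies to the masked fraction of each flanking sequence; the function names are hypothetical.

```python
def masked_fraction(seq, mask_chars="NX"):
    """Fraction of the sequence made of masking characters."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return sum(seq.count(ch) for ch in mask_chars) / len(seq)


def alignable(seq, max_masked=0.60):
    """Keep a flanking sequence unless more than 60% of it is masked."""
    return masked_fraction(seq) <= max_masked


flank = "ACGT" + "N" * 6  # toy 10-bp sequence, 60% masked
print(masked_fraction(flank), alignable(flank))
```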

Expression Data

The gene expression levels of human genes were retrieved from the data of Su et al. (2004), which were obtained by high-density oligonucleotide arrays. In this study, Su et al. (2004) surveyed the expression levels of almost all protein-encoding genes in 79 human tissues including six tumor tissues. Two determinations of the expression level were available for each tissue, and we used the average value as the expression level per tissue for each gene. In this way, we obtained the expression level for 158 TE-rich genes and 239 TE-free genes.

Results

Distribution of TEs in the Neighborhood of Genes in Four Primate Species

The number of human genes in the four classes of full-length TE density, for the 2- and 10-kb flanking regions, is shown in Fig. 2. The distribution of the number of genes follows the same pattern whatever the size of the flanking region, with the number of genes decreasing as the TE density increases. Only a few genes were entirely devoid of complete TEs, with TE-free genes accounting for 17.4% (6,185 genes) and 1.3% (462 genes) of the total number of genes (35,641 human genes) for the 2- and 10-kb flanking regions, respectively. Genes with a TE density between 0 and 5 constituted 50% (17,828 genes) and 62.2% (22,155 genes) of the total number of genes for the 2- and 10-kb flanking regions, respectively, while genes with a TE density between 5 and 10 constituted 26.1% (9,307 genes) and 30% (10,693 genes), respectively. Finally, genes with a TE density of more than 10 (TE-rich genes) corresponded to 6.5% (2,321 genes) and 6.5% (2,331 genes) of the total number of genes for the 2- and 10-kb flanking regions, respectively.
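The percentages above follow directly from the gene counts, as the short check below illustrates. The counts are those given in the text; the dictionary layout and function name are ours.

```python
TOTAL_GENES = 35_641

# Gene counts per TE-density class, as reported in the text.
counts = {
    "2 kb": {"TE-free": 6_185, "0-5": 17_828, "5-10": 9_307, "TE-rich": 2_321},
    "10 kb": {"TE-free": 462, "0-5": 22_155, "5-10": 10_693, "TE-rich": 2_331},
}


def class_shares(class_counts, total):
    """Percentage of genes in each class, rounded to one decimal."""
    return {name: round(100 * n / total, 1) for name, n in class_counts.items()}


for flank, class_counts in counts.items():
    # The four classes partition the full gene set.
    assert sum(class_counts.values()) == TOTAL_GENES
    print(flank, class_shares(class_counts, TOTAL_GENES))
```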

Fig. 2

Distribution of the number of human genes according to their TE density, for the 2- and 10-kb flanking regions. The numbers of genes are indicated at the top of each bar

We determined the proportions of LTR retrotransposons, LINEs, SINEs, and DNA transposons among the complete TEs located in and near human genes for both flanking region sizes (Table 1 and Supplementary Table 1). We found that SINEs were the most frequent elements, corresponding to 74.6 and 75.8% of complete TE copies considering the 2- and 10-kb flanking regions, respectively. The LTR retrotransposons constituted 13.7 and 14% of complete TE copies considering the 2- and 10-kb flanking regions, respectively, which was quite similar to the proportion of DNA transposons, which constituted 11.2 and 9.8% considering the 2- and 10-kb flanking regions, respectively. Finally, complete LINE elements were the least frequently occurring elements, accounting for only 0.5 and 0.4% considering the 2- and 10-kb flanking regions, respectively. These percentages are in agreement with the observations made globally for the complete human genome sequence (The International Human Genome Sequencing and Analysis Consortium 2001). We determined the distribution of the number of genes according to TE density for each TE family (see Supplementary Figs. 1, 2, 3, and 4). The numbers of genes per class of DNA transposon, SINE, or LINE showed the same distribution pattern as for all TEs, i.e., the number of genes decreased as the TE density increased. However, there were more DNA transposon-free and LINE-free genes than genes harboring these TEs in their vicinity. This tendency was less obvious for LTR retrotransposons, although there were fewer LTR retrotransposon-rich genes than genes with low LTR retrotransposon density. Overall, these results show that all TE families contribute to the trend observed for TEs as a whole.

Table 1 Proportion of each TE class and subfamily represented inside and in the neighborhood of genes

In order to find out whether the density of complete TEs observed in human TE-free and TE-rich genes is conserved in closely related species, we looked at the orthologous genes in three species, chimpanzee, macaque, and orangutan, of which the genomes have been completely sequenced and TE lists are available. We selected human genes with only a single ortholog in each of the other three species, i.e., human genes that had only one orthologous gene in chimpanzee, one in macaque, and one in orangutan (see “Materials and Methods” section for details). We identified 14,744 human genes with one and only one ortholog in each of the other three species. We determined the density of TEs present in and near these genes for the 2- and 10-kb flanking regions in the other three primates. Among the 897 genes that were TE-free in the human genome, 67% (606 genes) were also TE-free in the chimpanzee, macaque, and orangutan, and among the 1,496 human TE-rich genes, 17% (263 genes) were also TE-rich in all of the other three species when we considered 2-kb flanking regions. When we considered the 10-kb flanking regions, we found that 56% of the 89 human TE-free genes (50 genes) were also TE-free in the other three primate species, and 15% (289 genes) of the 1,849 human TE-rich genes were also TE-rich in the other species. This indicates that TE-free genes seem to be more prone to conserve this characteristic throughout evolution than TE-rich genes to conserve the contrasting characteristic.

Gene Functions According to Their TE Neighborhood

We first compared the functions of the human genes displaying extreme TE densities, i.e., the TE-free genes and the TE-rich genes (TE density of more than 10). For the 2-kb flanking regions, we compared 6,185 TE-free genes to 2,321 TE-rich genes, and for the 10-kb flanking regions we compared 462 TE-free genes to 2,331 TE-rich genes.

Gene functions were assigned by the FatiGO software (Al-Shahrour et al. 2004, 2005) to seven levels, from level 3 to level 9, ranging from general to more specific functions. For clarity’s sake, only the levels for which clear and statistically significant differences between the TE-free and TE-rich genes were found for either the 2- or the 10-kb flanking regions are shown in Fig. 3. In the case of the 2-kb flanking regions, we observed for the three chosen levels that TE-free genes were more often involved than TE-rich genes in multicellular organismal development, nervous system development, and the regulation of transcription (Fig. 3). In contrast, TE-free genes were less often involved in biosynthetic processes, protein transport, and intracellular protein transport than TE-rich genes (Fig. 3). The same trends were observed for the 10-kb flanking regions. In order to verify if the effect we observed is related to the presence or absence of complete TEs, we compared the genes free of any TEs, both complete and incomplete (genes corresponding to those included in transposon free regions (TFRs) as defined by Simons et al. (2006, 2007)) (971 TFR genes), the genes that are complete TE-free and TE-partial rich (356 TCF genes), and the genes that are TE-partial free and TE-complete rich (386 TPF genes). No significant difference in the gene function was observed between the TFR genes and the TCF genes, whereas TFR genes and the TPF genes did display significant function difference in biological processes (see Supplementary Table 2). These results confirm that the function of these genes is mainly linked to the presence or absence of complete TE sequences inside the gene or in their flanking regions, with no influence of the presence or absence of partial TE sequences.

Fig. 3

Distribution of the percentage of TE-free (black bars) and TE-rich (white bars) genes involved in biological processes at gene ontology levels 3 (a), 5 (b), and 7 (c), considering 2-kb and 10-kb flanking regions

We then compared the functions of the genes belonging to the intermediate classes of TE density (0 < TE density ≤ 5 and 5 < TE density ≤ 10) to those of TE-free and TE-rich genes, for both sizes of flanking region. Overall, we observed the same tendency as when the TE-free genes were compared to the TE-rich genes, although it was less marked, except that the functions of the TE-rich genes did not differ significantly from those of the genes in the intermediate class with 5 < TE density ≤ 10 (data not shown). We will therefore focus the rest of our analysis on the genes belonging to the extreme classes, i.e., the TE-free genes and the TE-rich genes.

We compared the functions of TE-free and TE-rich genes in orthologous genes in the chimpanzee, orangutan, and macaque. Since orthologous genes are likely to be involved in the same functions, we looked for the ontologies of human genes and compared the functions of human TE-free genes that were also TE-free in the other three species (606 genes) to those of the human TE-rich genes that were also TE-rich in the other species (263 genes). We found the same degree of difference of function as when the functions of human TE-free genes were compared to those of human TE-rich genes, showing that this feature is conserved in primates.

Selective Pressures Acting on the Coding Regions of Genes Depending on Their TE Neighborhood

To test whether TE density could be linked to selective pressures acting on the genes, we computed the ω ratio of the human and chimpanzee TE-free and TE-rich orthologous genes. We analyzed 1,377 TE-free genes and 824 TE-rich genes for the 2-kb flanking regions, and 121 TE-free genes and 982 TE-rich genes for the 10-kb flanking regions. The median of the ω ratio appeared to be higher for TE-free genes than for TE-rich genes for both groups of flanking regions (0.33 vs. 0.28 for 2 kb, 0.37 vs. 0.26 for 10 kb), but this difference was statistically significant only for the 2-kb flanking region category (Wilcoxon rank sum test = 522,711.5, P = 0.002). This indicates that selection pressure on the coding part of the genes is not sufficient to explain the absence of TE insertions in and near the TE-free genes. This could be due to the fact that the two species are too closely related for any significant difference to be detected. To boost the power of the detection, we performed a comparative evolutionary analysis using the genomes of these four primates: human, chimpanzee, orangutan, and macaque. We considered 10 different scenarios (Fig. 1). In scenarios 1 and 2, all orthologous genes belonged to the same class (either TE-free or TE-rich), and two models were compared: the M0 model, which assumes the same selective ω ratio for all branches of the tree (Fig. 1), and the M1 model, which assumes an independent ω ratio for each branch of the phylogeny. The likelihood ratio test indicated that the M0 model fitted the data significantly better for both scenarios 1 and 2 (92% of genes fitted scenarios 1 and 2). This means that all branches of the tree have the same ω ratio, and so genes belonging to the same TE density class have been subjected to the same selection pressure.

In scenarios 3–10, orthologous genes can belong to different TE-density classes, in one or more species. We tested whether this could reflect the fact that different selection pressures were acting on these genes. For example, in scenario 3, the human genes were TE-rich, whereas the orthologous genes in the other three species were TE-free. This could result from exposure to more relaxed selection pressure in the human genome than in those of the other species. We then tested the M2 model, which assumes that the ω1 ratio for a specific branch (the human branch in scenario 3) differs from the background ratio ω0 of the tree, and compared it to the M0 model. Despite their differing TE densities, the selection pressures exerted on the orthologous genes showed no significant difference. In scenarios 3–10, the M0 model was always the one that fitted our data best, which indicates that the selection pressure is the same in all branches of the tree.

The genomic proximity of the human genome to those of the other three primates (the divergence between humans and macaques is dated to around 25 million years (My) ago) could explain the absence of obvious differences, as a saturation phenomenon is possible. To overcome this difficulty, we compared the ω ratio of TE-free and TE-rich orthologous genes in the human and mouse genomes, species which diverged 75 My ago. We analyzed 991 TE-free orthologous genes and 745 TE-rich genes (2-kb flanking regions), and 91 TE-free genes and 874 TE-rich genes (10-kb flanking regions). The median of the ω ratio values was significantly lower for TE-free genes than for TE-rich genes for both sizes of flanking regions (0.15 vs. 0.17 for 2 kb, Wilcoxon rank sum test = 334,503, P = 0.0008; 0.09 vs. 0.16 for 10 kb, Wilcoxon rank sum test = 25,977, P = 4.9e−8). This indicates that the selection pressure on the coding part of the gene can explain the differences in TE density between TE-free and TE-rich genes, with stronger negative selection pressure on the TE-free genes being evident when human and mouse genes were compared.
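For readers unfamiliar with the rank sum comparisons reported in this section, the sketch below shows how two sets of ω values can be compared. It is an illustration only: the function name is hypothetical, the ω values are invented, the statistic is given in its Mann-Whitney form (rank sum of the first sample minus n1(n1 + 1)/2), and the P value uses a normal approximation with tie correction but no continuity correction, so it will not exactly reproduce the values reported here.

```python
from math import erfc, sqrt
from statistics import median


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test (normal approximation).

    Returns (W, P) where W is the Mann-Whitney statistic of sample x.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x, tie_term = 0.0, 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0          # mean of ranks i+1 .. j
        t = j - i                             # size of the tie group
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    w = rank_sum_x - n1 * (n1 + 1) / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return w, 1.0
    z = (w - n1 * n2 / 2.0) / sqrt(var)
    return w, erfc(abs(z) / sqrt(2.0))


# Invented omega values for two small gene sets.
te_free = [0.05, 0.08, 0.09, 0.12, 0.15]
te_rich = [0.11, 0.16, 0.17, 0.21, 0.30]
w, p = rank_sum_test(te_free, te_rich)
print(median(te_free), median(te_rich), w, round(p, 3))
```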

Selective Pressures Acting on the Flanking Regions of Genes Depending on Their TE Neighborhood

Since selection pressure can also act on non-coding regions, we next analyzed the sequence identity of the 2- and 10-kb regions downstream and upstream of the genes. For each human gene, we retrieved the sequences of the 2- and 10-kb regions upstream and downstream of its orthologous gene in chimpanzee, orangutan, and macaque. An example of the distribution of the percentage identity of the flanking regions for the different categories of genes in the human–chimpanzee comparison is shown in Supplementary Fig. 5. For the human–chimpanzee comparison, a Wilcoxon rank sum test showed that the mean percentage identity of the regions flanking TE-free genes is significantly higher than that of the regions flanking the TE-rich genes, for both the downstream and upstream regions, for 10-kb but not for 2-kb (Table 2). The percentage identity values are congruent with the global nucleotide divergence observed between the two genomes (from 0.5 to 3.0%, The Chimpanzee Genome Sequencing and Analysis Consortium 2005). The mean percentage identity is lowered by a minority of orthologous sequences that have a high divergence (see Supplementary Fig. 8). When the human and orangutan genomes were compared, the mean percentage identities of the 2- and 10-kb upstream flanking regions, and the 10-kb downstream flanking regions, were significantly higher for regions flanking TE-free genes than for regions flanking TE-rich genes (Table 2). In the human/macaque comparison, the upstream and downstream 2- and 10-kb flanking regions of TE-free genes displayed significantly higher identity percentages than those flanking TE-rich genes (Table 2). These results indicate that the regions flanking TE-free genes seem to be better conserved through evolutionary time than those flanking TE-rich genes.

Table 2 Mean percentage identities of the regions flanking orthologous genes in the human, chimpanzee, orangutan, and macaque genomes

Gene Expression

Figure 4 shows the expression level variations of TE-free and TE-rich genes for 19 of the 79 human tissues analyzed by microarray (Su et al. 2004). The variation of the expression level of the remaining tissues is shown in Supplementary Fig. 9. A Wilcoxon rank sum test revealed that the level of expression of the TE-free genes was significantly lower than that of the TE-rich genes in 13 of the 79 tissues analyzed (indicated by the asterisks in Fig. 4). This difference was observed in three of the six tumor tissues, and in all 10 immune system tissues investigated.

Fig. 4

Variation of the expression level of TE-free (black bars) and TE-rich genes (white bars) among 19 tissues considering 2-kb flanking regions. The asterisks indicate significant differences between the two categories of genes

Discussion

In this study, we analyzed the characteristics of human, chimpanzee, orangutan, and macaque genes in terms of the presence of complete TE copies either within their sequence or in their vicinity (within 2- or 10-kb flanking sequences). We showed that TE-free and TE-rich genes tend to have different types of function. Genes devoid of TEs have a role in development and in the regulation of transcription, while genes having a high proportion of TE insertions within their sequence and in their vicinity tend to be involved in metabolism and transport functions. This tendency was observed for all the TE families investigated (DNA transposons, LTR retrotransposons, LINEs, and SINEs). Consequently, we can extrapolate the findings of a previous study of Alu elements and human chromosomes 21 and 22, which had shown that Alu-rich genes and Alu-poor genes have different functions (Grover et al. 2003), to the whole set of TE families and indeed to the entire human genome. Similarly, genes with no TE in their introns were usually involved in morphogenesis and development as well as in transcription, and displayed extremely well-conserved intronic regions (Sironi et al. 2006). We show the same tendency in our present work in which only complete TEs are considered. This strongly suggests that the differences in gene function are likely to be related to the presence or absence of complete TE sequences.

Despite the high proportion of TEs in mammalian genomes, some regions, as much as 100 kb in length, are devoid of TEs (Simons et al. 2006). These regions, which are known as TFRs, have also been identified in amphibian and fish genomes (Simons et al. 2007). Most of these regions are associated with genes involved in important functions such as the regulation of transcription and development. The existence of such genomic regions could be due to strong evolutionary selection preventing their interruption by TE-derived sequences. The maintenance of retroelement-free regions in the human genome has indeed been shown to be due to selection acting against transcriptional interference from TEs, i.e., against the transcriptional activity of TEs that could interfere with the transcriptional activity of the genes in their vicinity (Mourier and Willerslev 2008). Selection would thus tend to eliminate any TEs inserted in the vicinity of genes with important functions in order to prevent potential interference. About 14% (875 of 6,185) of our TE-free genes in the 2-kb flanking regions were included in these TFRs. This low proportion could be attributable to the fact that we only considered complete TEs, but these TE-free genes all had the same types of function as the genes located in TFRs.

The difference in the amount of TEs in TE-free and TE-rich genes could be attributable to greater purifying selection acting on TE-free than on TE-rich genes. This hypothesis is supported by the significantly lower ω ratio of TE-free genes than of TE-rich genes, when human and mouse orthologs were compared. This suggests that stronger purifying selection tends to eliminate any complete TEs inserted within or in the vicinity of the TE-free genes. However, selection pressure is difficult to identify when orthologous genes of closely related species, such as human and chimpanzee, are compared. This could be due either to the fact that selection has not yet had time to act on all the genes that have undergone TE insertions because of the relatively recent divergence between human and chimpanzee (6 My), or that the species are too closely related to make it possible to identify such selective constraints. The same reasoning explains the values of the ω ratio in the multispecies comparisons. If selection pressure is the factor that explains the difference in TE density between genes, we would expect to find different ω ratios between these genes. However, the M0 model, which assumes that all branches of the tree have the same ω ratio, was always the model that fitted our data best, even when the orthologous genes had different TE densities. This suggests that these primate species are too closely related to make it possible to detect any difference in selection pressures, whereas this could be detected in the human/mouse comparison.

Non-coding regions can also be the target of selection pressure (Lowe et al. 2007). When we compared the sequence conservation of the regions flanking TE-free and TE-rich genes, both upstream and downstream, between human and chimpanzee, human and orangutan, and human and macaque, we observed that for the 10-kb flanking regions the mean percentage identity of regions surrounding TE-free genes was significantly higher than that of regions surrounding TE-rich genes. This finding implies that the regions surrounding TE-free genes are more highly conserved than those surrounding TE-rich genes. The same analysis for the 2-kb flanking regions did not always confirm higher sequence conservation around the TE-free genes. We can assume that gene promoters, which are probably located in the 2-kb flanking regions, are subject to strong purifying selection because of their functional importance, regardless of the presence or absence of TEs. The difference did become significant in the human and macaque comparison, where the global sequence identity between the two genomes is lower [about 93.5% (The Rhesus Macaque Genome Sequencing and Analysis Consortium 2007)]. This shows that when the global sequence identity between genomes is high, as it is between human and chimpanzee, and between human and orangutan, it becomes difficult to detect any difference in sequence conservation related to a factor such as TE density. We conclude that the regions flanking TE-free genes are better conserved, and thus likely to be exposed to stronger purifying selection, than the regions flanking TE-rich genes. These results confirm previous observations suggesting, although without quantification, that the difference in TE density between TE-free and TE-rich genes could be due to different selection pressures. We showed that these different selection pressures act not only on coding regions, as is often assumed, but also on non-coding regions.
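
The conservation measure used above can be illustrated with a minimal Python sketch. It assumes that each flanking region is available as a pair of already aligned sequences of equal length, and that alignment columns containing a gap are excluded from the count; both choices, as well as the function names `percent_identity` and `mean_percent_identity`, are illustrative assumptions and not a description of the pipeline actually used.

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percentage of identical bases among the ungapped columns of a pairwise alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Assumption: columns with a gap in either sequence are ignored.
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared


def mean_percent_identity(aligned_pairs) -> float:
    """Mean percentage identity over a collection of aligned flanking regions."""
    values = [percent_identity(a, b) for a, b in aligned_pairs]
    if not values:
        raise ValueError("no aligned pairs given")
    return sum(values) / len(values)
```

With such a function, the mean identity of the regions flanking TE-free genes could be compared with that of the regions flanking TE-rich genes for each species pair, using whatever statistical test is appropriate for the data.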

Gene expression can be regulated transcriptionally by promoters and cis-regulatory sequences, post-transcriptionally at the level of the untranslated regions (UTRs) of mRNA, and at the higher-order level of chromatin. It has been shown that almost 25% of human promoter regions contain TE-derived sequences, and that TEs provide about 2.5% of all human cis-regulatory sequences (Jordan et al. 2003). For example, SINEs, which possess RNA polymerase III promoters, have been shown to promote the transcription of RNA polymerase II genes (Oliviero and Monaci 1988), and to carry internal transcription factor binding sites (Polak and Domany 2006) that can account for the control of the activity of genes located in their vicinity. However, it has been observed that the enrichment of a particular family of SINEs, the Alus, in and around broadly expressed genes could simply be a by-product of a preferential insertion bias of these elements near housekeeping genes (Urrutia et al. 2008). Transcription factor binding sites have also been shown to be more abundant in ancient Alu subfamilies than in young ones when tested on elements from human chromosome 22 (Shankar et al. 2004). These findings could account for our results, as the most abundant subfamily of SINEs in our analysis corresponds to AluS (Table 1). Analyses of the binding regions of seven mammalian transcription factors showed that five of them are associated with distinct families of TEs, such as TP53 with ERV1, indicating that TEs play an important role in expanding the repertoire of binding sites in mammals (Bourque et al. 2008). Moreover, L1, LTR retrotransposons, and DNA transposons displayed a higher affinity for nucleosome binding than Alus, which would result in differences in chromatin conformation and thus in expression (Huda et al. 2009). The different TE classes may thus influence genes differently.
When all TE families are considered, human genes with TE-enriched promoters display, on average, greater and broader expression than genes with promoters devoid of TEs (Huda et al. 2009). This effect of TEs on gene expression was also found in rodents, in which a correlation between gene expression and recent TE insertions was observed, indicating that these insertions do significantly alter gene expression patterns (Pereira et al. 2009). The comparison of the genomic neighborhoods of human and chimpanzee genes has also shown that the expression of genes with a conserved genomic neighborhood was different from that of genes with TEs inserted in their vicinity (De et al. 2009). Since a TE sequence can affect gene regulation, selection pressure can be expected to be tighter for genes with crucial functions, such as development and regulation, making them more likely to be TE-free, whereas selection is expected to be more relaxed for genes with other, less vital functions, allowing TE insertions to be maintained in their vicinity. In our analysis, we found different levels of overall expression between TE-free and TE-rich genes across 79 human tissues, with the expression of TE-rich genes being greater than that of TE-free genes. It is striking that this difference in expression was found in three of the six tumor tissues, and in all 10 of the immune system tissues analyzed. These findings are consistent with the study of Lerat and Sémon (2007), in which levels of gene expression differed between tumoral and normal conditions depending on the SINE neighborhood, suggesting that SINEs could be involved in gene deregulation under tumoral conditions, whereas they are silent in normal tissues. DNA methylation or other epigenetic mechanisms, which are known to regulate and even silence TE activity, could be associated with this gene deregulation in tumor tissues, because the loss of methylation in such tissues is known to affect TEs (Szpakowski et al. 2009).
This is illustrated by the increase in the activity of retrotransposons that follows the loss of DNA methylation in tumoral conditions, which has been reported for human endogenous retroviruses in breast (Wang-Johanning et al. 2001) and ovarian cancers (Menendez et al. 2004; Wang-Johanning et al. 2007), and in leukemia cell lines (Patzke et al. 2002). Could the epigenetic gene deregulation postulated in tumor tissues also explain the difference in gene expression according to TE density observed in immune system tissues? This question warrants investigation, because it has recently been shown that the transcriptomes of immune system cells and tumor tissues are comparable (Yang et al. 2008). We can thus hypothesize that the same kind of “deregulation” of TEs via epigenetic mechanisms could influence the silencing of the TEs near genes and change the expression of these genes in both tumor and immune tissues.

From this study, we can conclude that gene function is an important factor in determining the distribution of TEs in the human genome, and also in the genomes of the closely related species chimpanzee, orangutan, and macaque. Selection tends to eliminate TE insertions within and near genes with important functions. In addition, the insertion of TEs seems to be associated with a marked difference in gene expression in human tumor and immune system tissues. It would therefore be of great interest to analyze other primate species, as well as more phylogenetically distant species such as the mouse, to assess more precisely the effect of TEs on gene expression, especially under tumoral conditions.