Introduction

A palindrome, i.e., an inverted repeat without a spacer, is a double-stranded DNA sequence that reads the same in the 5′-to-3′ direction on both strands. Palindromes are ubiquitous in natural plasmids, viral and bacterial genomes, and eukaryotic genomes and organelles. They play various roles in the regulation of cellular processes, such as gene expression, DNA replication, and recombination (Thukral et al. 1991; McClarin et al. 1986; Hiratsu et al. 2000). For example, many palindromes contained in cis-acting elements are binding sites for regulatory proteins (Thukral et al. 1991); the recognition sites for most restriction endonucleases are short palindromes of 4–8 bp in prokaryotic genomes; and palindromes located in telomeric regions can initiate DNA replication (Hiratsu et al. 2000). Another important property of palindromes is their potential to fold back on themselves to form secondary structures, known as hairpins or cruciforms, which are putative substrates for several structure-specific nucleases and mismatch repair enzymes; the action of such an enzyme may result in a double-strand break (DSB) in the DNA. Hairpin structures formed by palindromes immediately preceding terminators could increase RNA polymerase transcriptional activity by promoting more efficient template clearance (Chu et al. 1997). Palindromic sequences can also form hairpin structures in RNA transcripts, which may inhibit gene expression, probably by impeding the progress of the ribosome along the mRNA transcript (Zhang et al. 2002).

Giant palindromes, which are scarce in the genome, may play important roles and evolve in a unique way in eukaryotic Y chromosomes. Giant palindromes on the Y chromosome show remarkable divergence in content between chimpanzee and human (Hughes et al. 2010). Abundant gene conversion has occurred between the arms of palindromes in human and ape Y chromosomes (Rozen et al. 2003). A large palindrome is involved in interchromosomal gene duplications in the pericentromeric region of the Drosophila melanogaster Y chromosome (Méndez-Lago et al. 2011).

The abundance of palindromes in the genome has been a focus of research owing to their diverse functions in the cell. This abundance varies with palindrome size. Short palindromes are under-represented in some bacteriophages and their bacterial hosts, as well as in some archaea, in comparison with random sequences (Gelfand and Koonin 1997; Rocha et al. 2001). In Saccharomyces cerevisiae, short palindromes (<14 bp) are also under-represented, while long palindromes are over-represented (Lisnić et al. 2005). Long, AT-rich palindromes are also over-represented on chromosomes III and X of Caenorhabditis elegans (LeBlanc et al. 2000). In the human genome, palindromic sequences are over-represented in introns and gene upstream regions (Lu et al. 2007). Several possible explanations for the avoidance of short palindromes have been proposed. First, the avoidance of short palindromes (<8 bp), which usually serve as recognition sites for type II restriction/modification systems (RMSs) in prokaryotes, was thought to be caused by the activity of RMSs (Rocha et al. 2001). Such an interpretation, however, cannot explain the paucity of short palindromes in the S. cerevisiae genome, where no RMS has been detected. Second, a strong bias against palindromic dinucleotides, which is probably caused by intrinsic mutational bias associated with neighbor-dependent mutations, replication/repair, and recombination, may explain the observed short palindrome avoidance (Lisnić et al. 2005; Karlin et al. 1997).

The over-representation of long palindromes (10–50 bp) in genomes can be explained by their selective advantages (or functional importance), such as roles in the initiation of DNA replication, meiotic recombination, and transcription termination. Relatively long palindromes, however, are also known to generate genomic instability by forming hairpin structures (Nag and Kurst 1997; Nasar et al. 2000; Lobachev et al. 1998; Lisnić et al. 2009; Leach 1994). For example, palindromes stimulate deletions during DNA replication (Lobachev et al. 1998); they stimulate interchromosomal recombination between homologous sequences (Nag and Kurst 1997; Nasar et al. 2000); and they can also stimulate intrachromosomal recombination between homologous sequences, leading to the loss of the intervening sequence (Lisnić et al. 2009). Hence, the over-representation of long palindromes may indicate that, at the genome level and across the population, their advantages outweigh their disadvantages.

The evolutionary forces acting on the abundance of palindromes in the genome are far from fully understood, although several factors, such as the activity of RMSs and palindromic dinucleotide bias, are thought to affect palindrome abundance. In this paper, we investigated the possible effects of base composition and recombination on the abundance and genomic distribution of palindromes in the D. melanogaster genome. The results show that compositional bias is a major determinant of the abundance and distribution of palindromes in the D. melanogaster genome.

Materials and Methods

Genomic Sequences

The complete genome sequence and several types of genomic sequences of D. melanogaster, including 5′ UTRs, 3′ UTRs, introns, intergenic regions, and gene coding sequences (CDS), together with their chromosomal positions, were retrieved from Flybase (ftp://ftp.flybase.net/releases/FB2006_01/dmel_r5.1/). To reduce redundancy, only the sequences corresponding to the first transcript were selected when a gene locus matched more than one transcript. The repeat-masked genome of D. melanogaster was downloaded from UCSC (http://www.genome.ucsc.edu). Repeat-masked intergenic sequences, introns, and CDS were extracted from the repeat-masked genome using their chromosomal positions. To ensure statistical reliability in palindrome counting and the associated correlation analyses, CDS, introns, and intergenic regions were retained only if their effective length, i.e., the sequence length excluding unknown nucleotides and repeats, was greater than 1,000 bp. The final data set includes 5,948 introns, 5,795 intergenic regions, and 7,963 CDS.
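The effective-length filter described above can be illustrated with a minimal sketch. It assumes that unknown nucleotides appear as N and that repeats are either hard-masked as N or soft-masked as lowercase letters; the function names and the masking convention are illustrative assumptions, not details taken from the original pipeline.

```python
def effective_length(seq: str) -> int:
    """Count bases that are neither unknown nor repeat-masked.

    Assumption: only uppercase A, C, G, T count; N and lowercase
    (soft-masked) letters are excluded.
    """
    return sum(1 for base in seq if base in "ACGT")


def is_retained(seq: str, min_effective: int = 1000) -> bool:
    """Keep a sequence only if its effective length exceeds the threshold."""
    return effective_length(seq) > min_effective


if __name__ == "__main__":
    example = "ACGTNNacgtAC"  # 6 countable bases, 2 unknown, 4 soft-masked
    print(effective_length(example))
```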

Palindrome Scoring

Palindromes were counted with a program named Spinnaker (Lisnić et al. 2005). As indicated in Fig. 1, palindromes in a given sequence can be scored in different ways. First, only the palindromes that do not share common base pairs are scored (Fig. 1a). Second, all palindromes, including non-overlapping palindromes, partially overlapping palindromes, and shorter palindromes entirely embedded within longer palindromes, are scored (Fig. 1b). Third, non-overlapping and partially overlapping palindromes are scored, but shorter palindromes entirely embedded within longer palindromes are not (Fig. 1c). Fourth, non-overlapping palindromes, partially overlapping palindromes, and shorter palindromes that are entirely embedded within longer palindromes but have a different center of symmetry are scored (Fig. 1d). The first approach would underestimate the actual number of palindromes. The second approach would identify too many short palindromes embedded within long palindromes; embedded palindromes that share the center of symmetry of the longer palindrome are unlikely to have a distinct functional potential. The third approach is unable to identify any short palindrome embedded within a long palindrome. Short palindromes that are embedded within long palindromes but have a different center of symmetry may have different functional potentials, so we counted palindromes by the fourth approach.

Fig. 1

Palindrome scoring. Different numbers of palindromes can be counted in the same sequence depending on the scoring criterion. a Non-overlapping palindromes do not share common base pairs. b All palindromes including non-overlapping palindromes, partially overlapping palindromes, and short palindromes contained within long palindromes. c Palindromes including non-overlapping and partially overlapping palindromes but not short palindromes contained within a long palindrome. d Palindromes including non-overlapping palindromes, partially overlapping palindromes, and the short palindromes contained within long palindromes but having a different center of symmetry
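The fourth scoring criterion amounts to reporting, for every center of symmetry, only the longest palindrome found there. The sketch below illustrates that idea; it is not the Spinnaker implementation, and the function name and return format are assumptions made for illustration.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def maximal_palindromes(seq):
    """Return (start, length) of the longest palindrome at each center.

    A center lies between seq[center - 1] and seq[center]. Shorter
    palindromes that share a center with a longer one are not reported,
    while embedded palindromes with a different center are.
    """
    seq = seq.upper()
    found = []
    for center in range(1, len(seq)):
        left, right = center - 1, center
        # Extend outward while the flanking bases are complementary.
        while (left >= 0 and right < len(seq)
               and COMPLEMENT.get(seq[left]) == seq[right]):
            left -= 1
            right += 1
        length = right - left - 1
        if length >= 2:
            found.append((left + 1, length))
    return found


if __name__ == "__main__":
    # GAATTC contains AT and AATT around the same center; only the
    # 6-bp palindrome is reported.
    print(maximal_palindromes("GAATTC"))
    # ATAT holds two 2-bp palindromes whose centers differ from that
    # of the 4-bp palindrome, so all three are reported.
    print(Counter(length for _, length in maximal_palindromes("ATAT")))
```

Counting all palindromes regardless of context (the second approach) would instead report three palindromes for GAATTC.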

Recombination Rate Estimation

The mapping of D. melanogaster cytogenetic bands to the genome (Release 5.1) was taken from Flybase (http://flybase.bio.indiana.edu). We used a procedure similar to that described by Kliman and Hey (1993). A fourth-order polynomial curve was fitted to the genetic position as a function of physical position for each chromosome arm. The midpoint of a sequence was chosen to represent its physical position, and the recombination rate (cM/Mb) of each genomic sequence was estimated as the derivative of the polynomial function at that position. The recombination rate on the fourth chromosome was considered to be zero.
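The fit-then-differentiate procedure can be sketched as follows. This is a standard-library illustration under stated assumptions: the least-squares fit is solved through the normal equations, the marker positions are made-up values, and the function names are hypothetical. For real data, a numerically more robust routine such as `numpy.polyfit` would be preferable.

```python
def fit_polynomial(xs, ys, degree=4):
    """Least-squares fit; returns c with y ~ c[0] + c[1]*x + ... + c[d]*x**d."""
    n = degree + 1
    # Normal equations A c = b with A[i][j] = sum(x**(i+j)), b[i] = sum(y*x**i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def recombination_rate(coeffs, position_mb):
    """Derivative of the fitted polynomial (cM/Mb) at a position given in Mb."""
    return sum(i * c * position_mb ** (i - 1)
               for i, c in enumerate(coeffs) if i > 0)


if __name__ == "__main__":
    # Hypothetical markers: physical position (Mb) and genetic position (cM).
    physical = [i * 0.5 for i in range(11)]
    genetic = [0.5 * x + 0.1 * x * x for x in physical]
    coeffs = fit_polynomial(physical, genetic)
    print(recombination_rate(coeffs, 2.0))
```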

Palindrome Density

The genome of D. melanogaster was split into non-overlapping fragments of 100 kb, and for each fragment a corresponding random sequence with the same length and base composition was generated by computer. The palindrome density, defined as the number of palindrome occurrences divided by the effective sequence length, was then computed for each 100-kb fragment of the D. melanogaster genome and for the corresponding random sequence. Sequences with an effective length smaller than 50 kb were excluded from the analysis. For the genomic sequences, including CDS, introns, and intergenic sequences, palindrome density was defined in the same way. The sample sizes of the data analyzed in this study are listed in Table 1.
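A minimal sketch of the density calculation and of a composition-matched control is given below. Two simplifications are assumed and are not taken from the original method: the random control is produced by shuffling the fragment, which preserves length and base composition exactly, and every window that equals its reverse complement is counted, which is simpler than the scoring criterion adopted in this study. All names are illustrative.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def count_palindromes(seq, size):
    """Count windows of an even size that equal their reverse complement."""
    half = size // 2
    count = 0
    for start in range(len(seq) - size + 1):
        window = seq[start:start + size]
        if all(COMPLEMENT.get(window[i]) == window[size - 1 - i]
               for i in range(half)):
            count += 1
    return count


def palindrome_density(seq, size):
    """Palindrome occurrences divided by the effective sequence length."""
    effective = sum(1 for base in seq if base in "ACGT")
    return count_palindromes(seq, size) / effective if effective else 0.0


def shuffled_control(seq, seed=0):
    """Random sequence with the same length and base composition as seq."""
    bases = list(seq)
    random.Random(seed).shuffle(bases)
    return "".join(bases)


if __name__ == "__main__":
    fragment = "GAATTC" * 50  # toy stand-in for a 100-kb fragment
    control = shuffled_control(fragment)
    for size in (2, 4, 6):
        print(size, palindrome_density(fragment, size),
              palindrome_density(control, size))
```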

Table 1 Sample size of the data analyzed in the present study

Results

Comparison of Palindrome Abundance in the D. melanogaster Genome and Random Sequences

To examine the possible non-random usage of palindromes in the D. melanogaster genome, we compared palindrome abundance in D. melanogaster with that in randomly generated sequences. As shown in Fig. 2, short palindromes of 2–4 bp are present at lower frequency in the natural D. melanogaster genome than in random sequences, whereas the reverse is true for palindromes larger than 4 bp. Moreover, for palindromes larger than 4 bp, the larger the palindrome, the more pronounced the difference in abundance between natural and random sequences. Palindromes are also more enriched on the fourth chromosome than on the autosomes, and long palindromes (larger than 10 bp) are distributed more densely on the X chromosome than on the autosomes. Palindromes in repeated sequences are of less interest in this study. When repeats were masked, the difference between natural and random sequences persisted, but palindrome occurrences decreased markedly, and the difference in the abundance of long palindromes between the autosomes and the X chromosome became less evident (Supplementary Fig. 1). This indicates that the over-representation of long palindromes on the X chromosome is largely attributable to the repeats distributed on the X chromosome.

Fig. 2

Comparison of palindrome abundance in the non-masked D. melanogaster genome and random sequences. The size of palindromes is given in parentheses after the title of the ordinate. The ordinate represents the average palindrome density for the non-overlapping 100-kb fragments of the D. melanogaster genome and for the corresponding composition-constrained random sequences (see “Materials and Methods” section for detail). Analysis of variance shows that the palindrome densities for natural sequences and random sequences are significantly different (\( 1 \times 10^{-46} < P < 1 \times 10^{-6} \))

Palindrome Distribution is Associated with Base Composition

To explore the possible effects of base composition and recombination on the genomic distribution of palindromes in D. melanogaster, we analyzed the correlations of palindrome density with local GC content and recombination rate for non-overlapping windows of 100 kb across the genome. Because patterns of molecular evolution differ between the autosomes and the X chromosome in D. melanogaster (Singh et al. 2005), the analysis was done separately for the two chromosome sets.

As shown in Table 2, palindrome density is negatively correlated with local GC content on both the autosomes and the X chromosome for all palindrome sizes except palindromic dinucleotides. The relationship between palindrome density and recombination rate differs between the X chromosome and the autosomes: palindrome density correlates negatively with recombination rate on the autosomes, but positively on the X chromosome. We also investigated the correlation of palindrome density with coding exon density. On the autosomes, palindrome density is negatively correlated with coding exon density for palindromes larger than 4 bp. On the X chromosome, palindrome density is positively correlated with exon density, particularly for short palindromes smaller than 8 bp (Table 2).

Table 2 Spearman correlations of palindrome density with several factors in the non-masked genome of D. melanogaster

It is worth noting that microsatellite repeats, such as (ATT)n and (CCG)n, could contain many palindromes but are likely to evolve under a different mutational model than most of the genome. To exclude this repeat-associated effect, we further analyzed the correlation between palindrome density and the variables (GC content and recombination rate) in the repeat-masked genome of D. melanogaster. As shown in Table 3, the results based on the repeat-masked genome are similar to those listed in Table 2, indicating that those correlations were not caused by repeats. Partial correlation analysis shows that most of the significant correlations between palindrome density and recombination in Table 3 become non-significant after controlling for genomic GC content, whereas the correlations between palindrome density and GC content are not affected by recombination rate, suggesting that the correlation between palindrome density and recombination is mediated by GC content.
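The partial correlation used to control for a third variable can be sketched with the first-order formula \( r_{xy \cdot z} = (r_{xy} - r_{xz} r_{yz}) / \sqrt{(1 - r_{xz}^{2})(1 - r_{yz}^{2})} \), applied here to Spearman rank correlations. Whether the original analysis used exactly this rank-based form is an assumption, and the function names and example values are illustrative.

```python
from math import sqrt


def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def partial_spearman(x, y, z):
    """Correlation between x and y after controlling for z."""
    rxy, rxz, ryz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


if __name__ == "__main__":
    # Made-up window values: palindrome density, recombination rate, GC content.
    density = [0.031, 0.029, 0.027, 0.030, 0.025, 0.026]
    recombination = [1.2, 2.0, 2.9, 1.5, 3.8, 3.1]
    gc = [0.39, 0.41, 0.43, 0.40, 0.45, 0.44]
    print(spearman(density, recombination))
    print(partial_spearman(density, gc, recombination))
```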

Table 3 Spearman correlations of palindrome density with several factors in the repeat-masked genome of D. melanogaster

In addition to the negative correlation between palindrome density and local GC content in the fly genome, a similar but stronger negative correlation between the two variables is observed for random sequences that have the same base composition as the 100-kb genomic fragments of D. melanogaster (data not shown), suggesting that the correlation is unlikely to be caused by selection. Could the compositional bias of sequences modulate palindrome abundance? The GC contents of both the 100-kb genomic fragments of D. melanogaster and the composition-constrained random sequences are lower than 50 %. Suppose that the negative correlation results from the compositional bias of the sequences, i.e., the more biased a sequence is, the more palindromes it contains, regardless of whether the bias is toward AT- or GC-richness. Palindrome density is then expected to be negatively correlated with GC content for sequences with GC content lower than 50 % and positively correlated for sequences with GC content higher than 50 %. To test this, we generated 40 random sequences of 100 kb, with GC content increasing from 31 to 70 % in steps of 1 %, and correlated their palindrome densities with their GC contents. As expected, strong negative correlations between palindrome density and GC content were observed when GC content was lower than 50 %, and strong positive correlations were observed when GC content was higher than 50 % (Table 4), indicating that the compositional bias of a sequence has a significant influence on its palindrome abundance. The richness of palindromes on the fourth chromosome shown in Fig. 2 is thus likely due to its low GC content.
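The random-sequence test can be mimicked with a small simulation. The sketch below makes several assumptions that differ from the experiment described above: bases are drawn independently so that each sequence has the stated GC content only in expectation, sequences are 20 kb rather than 100 kb to keep the run short, only 4-bp palindromes are scored, and every palindromic window is counted. Function names and the seed are illustrative.

```python
import random
from math import sqrt

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def random_sequence(length, gc, rng):
    """Draw bases independently with P(G) = P(C) = gc/2, P(A) = P(T) = (1-gc)/2."""
    weights = [(1 - gc) / 2, (1 - gc) / 2, gc / 2, gc / 2]
    return "".join(rng.choices("ATGC", weights=weights, k=length))


def count_palindromes(seq, size):
    """Count windows of an even size that equal their reverse complement."""
    half = size // 2
    return sum(
        all(COMPLEMENT[seq[s + i]] == seq[s + size - 1 - i] for i in range(half))
        for s in range(len(seq) - size + 1)
    )


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def gc_sweep(size=4, length=20000, seed=1):
    """Correlate palindrome density with GC content below and above 50 %."""
    rng = random.Random(seed)
    gcs = [g / 100 for g in range(31, 71)]  # 31 % to 70 % in 1 % steps
    dens = [count_palindromes(random_sequence(length, gc, rng), size) / length
            for gc in gcs]
    low = [(g, d) for g, d in zip(gcs, dens) if g < 0.5]
    high = [(g, d) for g, d in zip(gcs, dens) if g > 0.5]
    r_low = pearson([g for g, _ in low], [d for _, d in low])
    r_high = pearson([g for g, _ in high], [d for _, d in high])
    return r_low, r_high


if __name__ == "__main__":
    r_low, r_high = gc_sweep()
    print(f"GC < 50 %: r = {r_low:.2f}; GC > 50 %: r = {r_high:.2f}")
```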

Table 4 Pearson correlation between palindrome density and GC content for random sequences

The negative correlations between palindrome density and local GC content for introns and intergenic regions, and the positive correlations for CDS (Table 5), are also compatible with the aforementioned model of the compositional effect on palindrome abundance, as the majority of CDS have a GC content higher than 50 %, and the majority of introns and intergenic regions have a GC content lower than 50 %.

Table 5 Spearman correlation between palindrome density and local GC content for the repeat-masked genomic sequences of different type in D. melanogaster

As described in the Materials and Methods section, there are several ways of scoring palindromes, which differ in how overlapping palindromes are counted. In this study, we identified palindromes by the fourth approach. Is it possible that the negative correlation between palindrome density and sequence GC content was caused by the method of palindrome identification? To test this possibility, we identified palindromes in the above-mentioned 40 random sequences using the other approaches and correlated their frequency with sequence GC content. As shown in Table 4, the other approaches generated results similar to those of the fourth approach, with the only exception of 2-bp palindromes, indicating that the negative correlation between palindrome density and sequence GC content cannot be attributed to the method of palindrome identification.

The fourth approach for identifying palindromes excludes internal shorter palindromes that share a center of symmetry with a larger palindrome. Thus, it underestimates the frequencies of shorter palindromes and may generate a negative correlation between the densities of longer and shorter palindromes. By contrast, the second approach counts every palindrome regardless of context and does not underestimate shorter palindromes. However, when the second approach was used for palindrome identification, the results were similar to those of the fourth approach (Tables 4, 6), suggesting that the choice between the fourth and the second approach has no bearing on our conclusion.

Table 6 The second approach-based Spearman correlations of palindrome density with several factors in the repeat-masked genome of D. melanogaster

Palindrome GC Content is Negatively Associated with Recombination

As shown in Fig. 3, the GC content of palindromes decreases strongly with increasing palindrome size. Large palindromes located in repeats are AT-rich and may affect the relationship between the GC content of palindromes and their size. However, the decreasing trend of palindrome GC content with size persists when palindromes in repeats are excluded (Fig. 3).

Fig. 3

The GC content of palindromes decreases with increasing palindrome size

Since both the GC content of palindromes and the relationship between palindrome density and recombination rate depend on palindrome size (see Fig. 3 and Table 2), the correlation between palindrome GC content and recombination rate was analyzed separately for each palindrome size to exclude the size effect. The correlation between the mean GC content of palindromes in genomic 100-kb fragments of D. melanogaster and recombination rate is shown in Table 7. The correlation pattern differs between the autosomes and the X chromosome, although the correlations on the X chromosome are non-significant. Palindrome GC content is positively correlated with recombination on the autosomes and negatively on the X chromosome, which agrees with the observation that local GC content correlates positively with recombination rate on the autosomes and negatively on the X chromosome of Drosophila (Singh et al. 2005). Furthermore, the correlation between palindrome GC content and recombination rate on the autosomes weakens with increasing palindrome size.

Table 7 Spearman correlation between the GC content of palindromes and recombination rate in the repeat-masked genome of D. melanogaster

We also examined how palindrome GC content is affected by recombination in different sequence types, such as gene coding regions, introns, and intergenic regions. As shown in Table 8, the correlation differs between the autosomes and the X chromosome. Regardless of the sequence type, palindrome GC content is positively correlated with recombination on the autosomes and negatively on the X chromosome, which is consistent with the finding presented in Table 7.

Table 8 Spearman correlation between the GC content of palindromes and recombination rate for the repeat-masked genomic sequences of different type in D. melanogaster

Chromosome 4 is very AT-rich, and one may therefore suspect that it contributes strongly to the observed correlations between palindrome density and local GC content and between palindrome GC content and recombination rate. Our analysis, however, shows that neither these correlations nor the decrease of palindrome GC content with increasing palindrome size is much affected by chromosome 4 (data not shown).

Several major findings can be summarized as follows: (1) In the D. melanogaster genome, short palindromes (<6 bp) are under-represented, while relatively long palindromes (>6 bp) are over-represented as compared to those in random sequences; (2) The density of palindromes is negatively correlated with local GC content on both the autosomes and the X chromosome; (3) The correlation between the density of palindromes and recombination rate is mediated by local GC content; (4) Coding exon density is positively correlated with the density of short palindromes (<6 bp) and negatively with the density of long palindromes (>6 bp); (5) The GC content of palindromes decreases with increasing palindrome size, and it correlates with recombination rate positively on the autosomes and negatively on the X chromosome.

Discussion

Palindrome Abundance in Drosophila

Lisnić et al. (2005) proposed that a strong bias against palindromic dinucleotides may explain the observed short palindrome avoidance in S. cerevisiae. To test whether the under-representation of short palindromes of 2–4 bp in the D. melanogaster genome is caused by biased usage of palindromic dinucleotides, we counted the short palindromes according to the palindromic dinucleotide at their center of symmetry. Consistent with our previous findings (Liu and Li 2008), AT-, TA-, and CG-centered palindromes are significantly rare in D. melanogaster, while GC-centered palindromes are abundant (Table 9). These results indicate that the under-representation of short palindromes of 2–4 bp is, at least partially, caused by the biased usage of palindromic dinucleotides. This biased usage was probably shaped by intrinsic mutational bias associated with neighbor-dependent mutations, replication/repair, and recombination (Arndt et al. 2003; Karlin et al. 1997; Liu and Li 2008).

Table 9 Comparison of the abundances of short palindromes (2–4 bp) between the non-masked fly genome and random sequences

In contrast to short palindromes, long palindromes are over-represented in both S. cerevisiae and D. melanogaster. Considering both the beneficial effects of long palindromes in various cellular processes and their deleterious effects associated with genomic instability, we speculate that the actual occurrence of long palindromes in the genome reflects a balance between two selective forces acting in opposite directions. The force associated with beneficial effects tends to increase the number of long palindromes, while the force associated with deleterious effects tends to decrease it. Compared with very long palindromes (>50 bp), moderately long palindromes (10–50 bp) are less likely to result in DSBs (Nasar et al. 2000). We therefore propose that the over-representation of palindromes of moderate size (10–50 bp) detected in D. melanogaster may indicate that, at the genome level and across the population, their advantages outweigh their disadvantages. The negative correlation between the GC content of palindromes and their size may be caused by positive selection for high A + T content in long palindromes, because high A + T content may facilitate local DNA melting and the adoption of secondary structures. Long AT-rich palindromes may be generated through the insertion of AT or TA dinucleotides into AT-rich palindromes by slippage during DNA replication (Lisnić et al. 2005; Toth et al. 2000).

Why is Palindrome Distribution Correlated with GC Content and Recombination?

The strong correlation between palindrome abundance and local GC content in the Drosophila genome can be predicted by the following random expectation. The probability that two symmetrical bases of a palindrome are complementary is \( P_{2}=2p_{A}p_{T} + 2p_{G}p_{C} \), where \( p_{i} \) is the proportion of base i (\( i = A,G,C,T \)). Thus, the expected probability of k-base palindromes is

$$ P_{k} = (2p_{A} p_{T} + 2p_{G} p_{C} )^{\frac{k}{2}} $$
(1)

Then the expected frequency of k-base palindromes is \( F_{k} = P_{k} (L - k + 1) \), where L is the sequence length. An assumption that \( p_{A} = p_{T} \) and \( p_{G} = p_{C} \), combined with the normalization \( p_{A} + p_{T} + p_{G} + p_{C} = 1 \), generates \( p_{A} + p_{G} = \frac{1}{2} \). Then we have

$$ \begin{aligned} P_{k} &= (2p_{A} p_{T} + 2p_{G} p_{C} )^{\frac{k}{2}}\\ &= (2p_{A}^{2} + 2p_{G}^{2} )^{\frac{k}{2}}\\ &= 2^{\frac{k}{2}} \left[\left(\frac{1}{2} - p_{G} \right)^{2} + p_{G}^{2} \right]^{\frac{k}{2}}\\ &= 2^{k} \left[\left(p_{G} - \frac{1}{4}\right)^{2} + \frac{1}{16}\right]^{\frac{k}{2}}\\ \end{aligned} $$
(2)

It is clear that the probability of palindromes reaches its minimum when \( p_{G} = \frac{1}{4} \). Given the constraints \( p_{A} + p_{T} + p_{G} + p_{C} = 1 \), \( p_{A} = p_{T} \), and \( p_{G} = p_{C} \), the minimum probability of palindromes occurs when \( p_{A} = p_{T} = p_{G} = p_{C} = 25\,\% \). Accordingly, palindrome frequency should reach its minimum when GC content is 50 %. In other words, the probability of palindromes decreases as GC content approaches 50 % from either side. Given the overall AT bias in the Drosophila genome, a negative correlation between GC content and palindrome density is expected. Consistent with this expectation, we observed negative correlations between GC content and palindrome densities, as listed in Table 3. Although base composition has a great effect on palindrome abundance, it is worth noting that the compositional effect alone is not strong enough to explain the distribution of palindromes across the genome, as indicated by the low correlation coefficients in Table 3.
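The random expectation can be checked numerically with a short sketch that evaluates Eq. 1 and the closed form of Eq. 2 under the assumption \( p_{A} = p_{T} \) and \( p_{G} = p_{C} \). The function names and the example fragment length are illustrative.

```python
def palindrome_probability(gc, k):
    """Eq. 1 with p_G = p_C = gc/2 and p_A = p_T = (1 - gc)/2."""
    p_g = p_c = gc / 2
    p_a = p_t = (1 - gc) / 2
    return (2 * p_a * p_t + 2 * p_g * p_c) ** (k / 2)


def palindrome_probability_closed_form(gc, k):
    """Eq. 2: 2^k * [(p_G - 1/4)^2 + 1/16]^(k/2)."""
    p_g = gc / 2
    return 2 ** k * ((p_g - 0.25) ** 2 + 1 / 16) ** (k / 2)


def expected_frequency(gc, k, length):
    """F_k = P_k * (L - k + 1) for a sequence of the given length."""
    return palindrome_probability(gc, k) * (length - k + 1)


if __name__ == "__main__":
    for gc in (0.30, 0.40, 0.50, 0.60, 0.70):
        print(f"GC = {gc:.2f}: P_4 = {palindrome_probability(gc, 4):.4f}, "
              f"F_4 in 100 kb = {expected_frequency(gc, 4, 100_000):.0f}")
```

Both forms agree, and the probability is symmetric around a GC content of 50 %, where it is lowest.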

Our data indicate that, except for palindromic dinucleotides, the density of palindromes is negatively correlated with recombination rate on the autosomes and positively on the X chromosome. Can selection explain these correlations? Recombination accelerates the evolution of sexually reproducing organisms by segregating advantageous and disadvantageous genes (Felsenstein 1974; Kliman and Hey 1993; Lewin 2004). In other words, the efficiency of selection is elevated in regions of increased recombination rate. This effect could modulate the distribution of genomic elements, such as transposons and pseudogenes, in the genome (Bartolome et al. 2002; Hua-Van et al. 2005; Liu et al. 2010). Similarly, it is possible that beneficial palindromes are preferentially retained and deleterious palindromes are efficiently eliminated in regions of high recombination rate. Short palindromes might be selectively disadvantageous while longer ones might be selectively advantageous, as suggested by their abundance in the genome. If selection associated with recombination were responsible for the under-representation of short palindromes and the over-representation of long palindromes, recombination rate would be expected to correlate negatively with the density of short palindromes and positively with the density of long palindromes. No such consistent pattern was observed on both the autosomes and the X chromosome, implying that such selection cannot explain the correlation between palindrome distribution and recombination.

Partial correlation analysis suggested that the effect of base composition is stronger than the effect of recombination on palindrome abundance and the relationship between palindrome abundance and recombination is a mere side-product of the effect of compositional bias on palindrome density. One may claim that recombination affects palindrome abundance indirectly by influencing base composition, as GC content is affected by recombination and GC content in turn could affect the palindrome abundance. However, besides the recombination, there are other factors affecting GC content in the genome, and thus it cannot be concluded that recombination determines the palindrome abundance in a composition-associated manner.

Why is Palindrome GC Content Correlated with Recombination?

Recombination rate correlates positively with palindrome GC content, genomic GC content, and the density of dinucleotide palindromes composed of G and C on the autosomes, and negatively on the X chromosome. GC-biased gene conversion, the frequency of which is proportional to recombination rate, could be invoked to explain the positive correlations on the autosomes, but it cannot explain the negative correlations on the X chromosome.

There are several hypotheses that could explain the negative correlation between GC content and recombination rate on the X chromosome. First, a background substitutional bias toward decreased GC with increased recombination rate could be mediated by a recombination-associated mutational bias toward A and T on the X chromosome. However, this mutational explanation seems unlikely, as to date there are no empirical data to support such a difference in the underlying mechanisms of mutation between the autosomes and the X chromosome. Second, a background substitutional bias mediated by gene conversion could explain the observation. In general, recombination increases the GC content of genomic sequences, including palindromic sequences, through gene conversion, which may occur frequently in regions of high recombination rate (Birdsell 2002; Galtier et al. 2001; Marais et al. 2003; Meunier and Duret 2004; Singh et al. 2005). Nevertheless, if rates of gene conversion on the X chromosome were high in areas of low recombination, this could lead to an X-specific negative correlation between GC content and recombination rate. It does seem possible that rates of gene conversion and rates of recombination are inversely related on the Drosophila X chromosome (Langley et al. 2000). Third, it is also possible that the negative correlation between recombination rate and GC content on the X chromosome reflects historical rather than current recombination rates. The negative correlation on the X chromosome can be explained if the X chromosome underwent a recent rearrangement of the recombination map such that the currently highly recombining areas experienced historically low recombination rates and vice versa.

In this study, we also found that coding exon density tends to correlate positively with the density of palindromes shorter than 6 bp and negatively with the density of palindromes longer than 6 bp. The negative correlation is probably due to selection that keeps palindromes rare in or around genes to avoid detrimental effects on them. One simple explanation is that an excess of large palindromic motifs in coding sequences is harmful to gene function and is therefore counter-selected. For example, palindromes in coding sequences might be selected against because of their potential to induce DSBs or deletions. It is unclear why the density of palindromes shorter than 6 bp correlates positively with coding exon density.

Conclusion

To conclude, we explored the possible evolutionary forces shaping the distribution pattern and abundance of palindromes in the D. melanogaster genome. Our results show that the palindrome occurrence in the genome deviates from random expectation and the uneven distribution of palindromes across the genome is associated with local GC content, recombination rate, and coding exon density. Our data suggest that the correlation between palindrome abundance and recombination is a side-product of the effect of compositional bias on palindrome abundance, and the base composition is the main driver of the distribution pattern and abundance of palindromes in the D. melanogaster genome.