Introduction

Whole-genome analyses have revealed that the human genome contains 11,000–19,000 pseudogenes. The vast majority of human pseudogenes derive from a duplicated sequence, with ∼70% of them having a retrotranspositional origin and the rest having originated by a duplication event (Torrents et al. 2003; Zhang et al. 2003). Only a minority of pseudogenes did not appear after a duplication event, and thus their silencing could have produced a loss of function (Menashe et al. 2003). Although pseudogenes are defined as transcriptionally silent sequences, many cases of transcribed pseudogenes have been reported (Balakirev and Ayala 2003). This fact, together with the evolutionary conservation of the original sequence and the low level of nucleotide diversity, led to the proposal of a regulatory role for some pseudogenes (McCarrey and Riggs 1986). This hypothesis has been supported by several examples that demonstrate the functional potential of RNA transcribed from pseudogenes (Hirotsune et al. 2003; Korneev et al. 1999; Lee 2003). On the other hand, pseudogenes have also been shown to be involved in the generation of genetic diversity through gene conversion or recombination with functional genes, and they have the potential to become new genes (Balakirev and Ayala 2003).

The increasing evidence that RNA can be functional independently of its role in protein synthesis, which could extend to partial mRNAs, must be considered in any analysis of the pseudogenization process. Gene silencing is usually thought to result from a single inactivation event followed by neutral evolution, during which new mutations accumulate at a neutral pace and some of them are later viewed as further inactivating variants. Considering the functional potential of RNA, the process can alternatively be viewed as a series of inactivating variants that were successively selected because of an undesired remnant function of the partial or flawed mRNA (by itself, e.g., as a regulatory element, or through a partial protein).

In this study, we have analyzed the pseudogenization process of arpAT, one of the 33 recently silenced genes described in the analysis of the human genome (International Human Genome Sequencing Consortium 2004). arpAT is a member of the light subunits of heteromeric amino acid transporters (LSHAT) family and has a strong preference for aromatic amino acids, especially L-DOPA. It was shown to be expressed in enterocytes of the small intestine and in neurons from different brain areas, and was suggested to be an L-DOPA transporter related to the neurotransmitter function of this amino acid in the rodent brain. The gene is functional in rodents, dog, and chicken, whereas it is inactive in the human (and chimpanzee) genome, where ten frame-disrupting insertions/deletions, four in-frame stop codons, and one Alu insertion disrupting exon 1 were found. While the dN/dS ratios of 0.8 and 0.85 obtained for humans and chimpanzees, respectively, indicated that these sequences are under neutral evolution, the possible excess of frameshift mutations in the two primate species, compared with the low mutational rate of the genome, raised the possibility that the successive silencing of this gene had been driven by positive selection (Fernandez et al. 2005), thus opening the question of the pace of pseudogenization. To distinguish between these two hypotheses, we have analyzed arpAT in several primate species to trace back its pseudogenization history. The goal is to answer questions regarding the pseudogenization process of arpAT, such as when the inactivation occurred, what the rate of accumulation of inactivating mutations is, and whether there have been selective forces favoring the successive inactivation of this gene in any of the primate branches.

Materials and Methods

Samples

The following primate DNA samples from the ECACC (European Collection of Cell Cultures) Primate DNA Panel were used: MA104 (Chlorocebus aethiops), CYNOM-K1 (Macaca fascicularis), OMK (637–69) (Aotus trivirgatus), and B95–8 (Saguinus oedipus). The Microcebus murinus sample was provided by Christian Roos, Gene Bank of Primates (German Primate Center), along with other prosimian samples. The human DNA was obtained in our lab. Supplementary Table S1 reports the primers used. When amplification or the expected band was not obtained, PCRs were performed at lower annealing temperatures.

Obtaining Sequences

The first step in this work was to obtain genomic sequences of the arpAT coding region in several primates. The coding region includes six exons, and no evidence for splicing variants has been reported. To do this, available genomic sequences from different organisms (Homo sapiens, Pan troglodytes, Mus musculus, Rattus norvegicus, and Canis familiaris) were aligned, and primers were located in the more conserved regions close to the ends of the exons. No primers were designed in exon 5 because of its small size and low level of conservation (see Results and Discussion). As expected for a pseudogene, designing primers was more difficult than for functional exons, since a greater amount of sequence change is expected. PCR amplifications were carried out in Homo sapiens; two Old World monkeys, the cynomolgus monkey (Macaca fascicularis) and the African green monkey (Chlorocebus aethiops); two New World monkeys, the cotton-top marmoset (Saguinus oedipus) and the owl monkey (Aotus trivirgatus); and one prosimian species, the mouse lemur (Microcebus murinus). The exact location of these primers is reported in Supplementary Table S1. Due to sequence divergence, PCR amplifications were successful only when the primers were located in exonic sequences, and even then a PCR product was not obtained in some cases. Sequences of different parts of the arpAT gene were obtained as follows: exon 1 sequences from Homo sapiens, Chlorocebus aethiops, and Saguinus oedipus; exon 2, exon 3, and the intron between these two exons in Homo sapiens, Macaca fascicularis, Chlorocebus aethiops, and Saguinus oedipus; exon 4 in Homo sapiens, Macaca fascicularis, Chlorocebus aethiops, Saguinus oedipus, Aotus trivirgatus, and Microcebus murinus; and exon 6 in Homo sapiens, Macaca fascicularis, Chlorocebus aethiops, and Microcebus murinus.
Since genomic traces of the regions containing exons 4 and 6 in Microcebus murinus were detected through BLAST searches, it was possible to design primers in the intronic regions surrounding these two exons. Therefore, in the case of Microcebus murinus the complete sequences of exons 4 and 6 were obtained. No sequences homologous to exons 1, 2, 3, and 5 were found through BLAST searches.
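
The primer placement strategy described above (choosing conserved stretches of a multiple alignment) can be illustrated with a minimal sketch. This is not the procedure actually used here; the function names, the window size, the identity threshold, and the toy alignment are all hypothetical and serve only to show the idea of scanning an alignment for windows made up of identical, gap-free columns.

```python
def column_is_conserved(column):
    """A column counts as conserved if it has no gap and a single base."""
    return "-" not in column and len(set(column)) == 1


def conserved_windows(aligned_seqs, window=20, min_identity=0.9):
    """Return start positions (0-based) of alignment windows in which the
    fraction of fully conserved columns is at least min_identity."""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    flags = [
        column_is_conserved([seq[i] for seq in aligned_seqs])
        for i in range(length)
    ]
    hits = []
    for start in range(length - window + 1):
        if sum(flags[start:start + window]) / window >= min_identity:
            hits.append(start)
    return hits


# Hypothetical toy alignment (not real arpAT sequence).
toy_alignment = [
    "ACGTACGTAA",
    "ACGTACGTTA",
    "ACGTACGT-A",
]
print(conserved_windows(toy_alignment, window=8, min_identity=1.0))
```

In practice, candidate windows would still need the usual primer checks (melting temperature, self-complementarity, product size), which this sketch does not attempt.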

DNA Sequence Analysis

Nucleotide sequences were analyzed using the Lasergene software package (DNASTAR). Besides the sequences obtained here, other orthologous sequences were included in the analyses. Gene sequences of Mus musculus (ENSMUSG00000020600), Rattus norvegicus (ENSRNOG00000006119), and Canis familiaris (ENSCAFG00000003845) were retrieved from the Ensembl database (www.ensembl.org). Microcebus murinus sequences were obtained by means of similarity searches in the NCBI Trace Archive, carried out using MegaBLAST. Homo sapiens orthologous sequences described previously (Fernandez et al. 2005) were also used in some analyses. Multiple sequence alignments were performed with ClustalW (Thompson et al. 1994) and revised manually. Phylogenetic analyses were performed using the PAML software package, which excludes indels and stop codons from the analysis (Yang 1997).

Results and Discussion

The sequences of the corresponding coding regions of arpAT were newly obtained for some species (including a prosimian) and retrieved from databases for others; see Table 1 for the data and species analyzed and the Supplementary Information for details on how each sequence was obtained. For some newly sequenced species, it was not possible to amplify some exons. The species analyzed cover the main primate groups: human; Chlorocebus aethiops and Macaca fascicularis as Old World monkeys; Aotus trivirgatus and Saguinus oedipus as New World monkeys (all of these being haplorhines); and Microcebus murinus, the mouse lemur, as a prosimian or strepsirhine. Other species include mouse, rat, and dog.

As shown previously for humans and chimpanzees, which share nearly all the coding disablements (Fernandez et al. 2005), the arpAT gene appears to be silenced in all primate lineages analyzed, including the prosimian. All species contain several stop codons and frameshift mutations disrupting the putative coding sequence (Table 1). To evaluate whether the number of indels is higher than expected, as suggested (Fernandez et al. 2005), we have used two estimates of the neutral indel substitution rate: 1.01 × 10⁻⁴ per site per million years (Britten 2002; Podlaha et al. 2005) and 8.83 × 10⁻⁵ (Silva and Kondrashov 2002), a rate found to be quite similar across primate species in different studies. We mapped all the described indels onto the phylogeny of the analyzed species (Fig. 1 and Supplementary Table S2) and compared them to the expected number of indels in each branch of the tree, according to the length of the studied region and the divergence time (Table 2). We considered the mutations produced after the initial strepsirhine-haplorhine split, since the first inactivating mutations occurred after the rodent-primate split and before the initial primate split. This split is not much more recent than the rodent-primate split, calculated to have been between 75 and 90 million years ago (MYA) (Hedges 2002; Murphy et al. 2001). The last common ancestor of haplorhines and strepsirhines has been estimated at 77.5 MYA (Steiper and Young 2006). Given that, hereafter we consider the split between primates and rodents to have been at ∼90 MYA. The probabilities of the observed numbers of indels have been calculated considering that the number of indels, under a neutral model, follows a Poisson distribution (Table 2). In general, the observed and expected numbers of indels are quite similar, and the differences are not statistically significant. Only in the Saguinus oedipus branch are the differences statistically significant.
However, this branch also shows a higher mutation rate, with the highest rate of synonymous and nonsynonymous changes in the primate species (see Table 4), which could explain the excess of indels. When the full coding sequence including the six exons is compared between humans and rodents, the observed number of frameshift mutations (13) is also very similar to the expected numbers (13.34 and 11.66, depending on the mutation rate) considering that they have all appeared in the branch leading to the primates after the split with rodents (∼90 MYA). Similar results were obtained in all comparisons of the primate species to the mouse sequence included in this work.
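
The neutral expectation used above is the product of the indel rate, the length of the region, and the elapsed time, and the observed count is then evaluated against a Poisson distribution with that mean. The following minimal sketch illustrates the arithmetic for the human-rodent comparison. The sequence length of 1,467 sites is an assumption: it is not stated in the text and was back-calculated here so that the two rates reproduce the reported expectations (13.34 and 11.66). The upper-tail probability is one possible way to express the Poisson probability; the exact calculation behind Table 2 may differ.

```python
import math


def expected_indels(rate_per_site_per_myr, sites, myr):
    """Expected neutral indel count: rate x sequence length x elapsed time."""
    return rate_per_site_per_myr * sites * myr


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k))


RATES = {
    "Britten 2002; Podlaha et al. 2005": 1.01e-4,
    "Silva and Kondrashov 2002": 8.83e-5,
}
SITES = 1467          # ASSUMPTION: back-calculated, not given in the text
DIVERGENCE_MYR = 90   # primate-rodent split used in the text
OBSERVED = 13         # frameshift mutations observed, human vs. rodents

for label, rate in RATES.items():
    lam = expected_indels(rate, SITES, DIVERGENCE_MYR)
    p_upper = poisson_upper_tail(OBSERVED, lam)
    print(f"{label}: expected = {lam:.2f}, P(X >= {OBSERVED}) = {p_upper:.3f}")
```

With either rate, the observed count lies close to the Poisson mean, which is the sense in which the observed and expected numbers are described as very similar.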

Fig. 1 Phylogenetic tree of arpAT, including the species with available sequences from exons 1 to 4. dN/dS ratios are shown above each branch. The number of frameshift mutations is shown below the branch. Topology of the tree and branch lengths were calculated considering all changes

Table 1 Stop codons and frameshift mutations described in the sequenced regions of each exon
Table 2 Observed and expected numbers of indels among different primate species

Alternatively, the action of natural selection favoring the silencing of arpAT could also be reflected in a heterogeneous rate of indel fixation across the different exons of the gene, since a similar rate of indel fixation is expected in all exons if the gene is evolving neutrally. Table 3 reports the observed numbers of indels, and the numbers expected under a random distribution, in the comparison of the full coding sequence between mice and humans. The number of indels described in exon 5 is significantly higher than expected (χ² = 19.21 after Yates' correction, p < 0.001). Although this result could suggest the existence of some selective pressure favoring the accumulation of disrupting mutations in this exon, a deeper analysis shows that the rate of synonymous changes is also much higher in this exon (Table 3). This suggests a higher mutation rate in this region as the most parsimonious explanation (Hardison et al. 2003; Kvikstad et al. 2007; Wetterbom et al. 2006).
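
As an illustration of the test mentioned above, the sketch below computes a Yates-corrected chi-square for one exon against the rest of the coding sequence, with expected counts proportional to length. The exon length, total length, and indel counts are hypothetical placeholders, since the values of Table 3 are not reproduced here; only the statistic χ² = 19.21 comes from the text. The p-value uses the closed form for one degree of freedom, which is an assumption about how the comparison was set up.

```python
import math


def yates_chi_square(observed, expected):
    """Chi-square statistic with Yates' continuity correction."""
    return sum(
        (abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected)
    )


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(max(x, 0.0) / 2.0))


# HYPOTHETICAL numbers, for illustration only (not taken from Table 3).
exon_length, total_length = 100, 1500
indels_in_exon, total_indels = 5, 13

expected_in_exon = total_indels * exon_length / total_length
observed = [indels_in_exon, total_indels - indels_in_exon]
expected = [expected_in_exon, total_indels - expected_in_exon]

statistic = yates_chi_square(observed, expected)
print(f"chi-square (Yates) = {statistic:.2f}, p = {chi2_sf_1df(statistic):.2g}")

# The statistic reported in the text, evaluated with 1 df:
print(f"p for chi-square = 19.21: {chi2_sf_1df(19.21):.2g}")
```

Evaluated this way, the reported statistic gives a p-value well below 0.001, in line with the text.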

Table 3 Exon rate of evolution in arpAT between Mus musculus and Homo sapiens

The specific rates of synonymous (dS) and nonsynonymous (dN) substitutions were estimated through maximum likelihood models (see Materials and Methods). The dN/dS ratios were estimated in the external and internal branches of three different phylogenetic trees (differing in the number of sequences and species included), using the Canis familiaris sequence as an outgroup (Fig. 1). The first analysis (Fig. 1), for exons 1–4, indicates that a relaxation of the purifying selection on the gene occurred after the split between rodents and primates, since in these branches the dN/dS ratio is closer to 1, which suggests that these sequences are under neutral evolution. In contrast, in the branches leading to rodent and dog, dN/dS values are lower, as expected for coding sequences under purifying selection. We tested this by comparing the likelihoods of each branch evolving under conserved evolution with ω = 0.25, since the average KA/KS ratio for the human-chimpanzee lineage has been estimated to be ∼0.23 (Chimpanzee Sequencing and Analysis Consortium 2005), and under neutral evolution with ω = 1 (Table 4). To do this, we compared the log-likelihood value of a model assuming one fixed ω (0.25 or 1) for the branch of interest to that of a model that estimates a free ω value for each branch of our phylogenetic tree. We then compared the models using a likelihood ratio test, whose statistic is compared with a chi-square distribution with as many degrees of freedom as the difference in the number of estimated parameters. All branches leading to primate species show likelihood values that are not compatible with conserved evolution but are compatible with neutral evolution. The extremely high dN/dS value obtained in the branch leading to Old World monkeys, although not significantly different from 1, is probably due to the short length of this branch and to the fact that, by chance, it contains an extremely low number of synonymous substitutions.
On the other hand, the likelihood values obtained in the branches leading to nonprimate species are mostly in agreement with conserved evolution. Only in the case of the branch leading to rat is the obtained dN/dS value significantly different from 0.25, but also less than 1, suggesting that some kind of relaxation has occurred in this species.
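
The model comparison described above can be sketched as a likelihood ratio test in which the statistic is twice the difference in log-likelihood between the free-ω model and the model with ω fixed on one branch. The log-likelihood values below are hypothetical placeholders (they are not PAML output from this study), and one degree of freedom is assumed because a single ω is constrained.

```python
import math


def lrt_statistic(lnl_constrained, lnl_free):
    """Likelihood ratio statistic: 2 * (lnL_free - lnL_constrained)."""
    return 2.0 * (lnl_free - lnl_constrained)


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(max(x, 0.0) / 2.0))


# HYPOTHETICAL log-likelihoods for one branch of interest.
lnl_free = -2500.00           # free omega estimated on every branch
lnl_fixed = {
    "omega = 0.25 (conserved)": -2504.10,
    "omega = 1 (neutral)": -2500.35,
}

for label, lnl in lnl_fixed.items():
    stat = lrt_statistic(lnl, lnl_free)
    p = chi2_sf_1df(stat)
    verdict = "rejected" if p < 0.05 else "not rejected"
    print(f"{label}: 2*dlnL = {stat:.2f}, p = {p:.3f} ({verdict})")
```

With these illustrative values the branch would be incompatible with ω = 0.25 but compatible with ω = 1, which is the pattern described for the primate branches.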

Table 4 Phylogenetic analysis of arpAT

Similar results are obtained in the second and third trees (Supplementary Fig. S1). When we included all the species with sequences of exons 1, 2, 3, 4, and 6 (Supplementary Fig. S1A), dN/dS values close to 1 also appear in all the branches leading to primate species after the split with rodents. Again, a relatively high dN/dS value is obtained in the branch leading to the rat, which is significantly different from 0.25 (Supplementary Table S3). The obtained probabilities are also in agreement with conserved evolution in the nonprimate species and neutral evolution in primate species (Supplementary Table S3). However, in this case the dN/dS ratio calculated in the rat branch is significantly different from 0.25. Finally, in the third tree (Supplementary Fig. S1B) we have included the prosimian species for which complete sequences of exons 4 and 6 have been obtained. Although it is based on a lower number of sites (309 after deleting gaps), this tree is in agreement with an inactivation and subsequent neutral evolution of arpAT after the split between rodents and primates. Although the dN/dS ratio obtained in the prosimian species branch is quite high, the possibility of positive selection is excluded given the stop codons and frameshift mutations described in this lineage (see above), indicating that this excess of nonsynonymous mutations is due to the fact that this sequence is evolving under neutrality. The absence of synonymous changes does not allow calculation of the dN/dS ratio in the primate branch before the split of strepsirhines and haplorhines, due both to the relatively short length of the alignment and the short length of this branch. The probabilities obtained are again in agreement with conserved evolution for nonprimate species and neutral evolution for primates, since their split with rodents (Supplementary Table S4).

In conclusion, this deeper analysis of the pseudogenization process of arpAT did not reveal the existence of selective pressures favoring its inactivation in primates, suggesting that the large number of frameshift mutations described in its coding sequence in several primate species is a consequence of the antiquity of this inactivation. Genome dynamics are complex, and stochastic factors, besides the directional effects of selective forces, have a strong effect on the evolutionary outcome in specific genome regions. Selective forces, although acting at a very fine scale in the genome, are not easy to detect in many of the results of a molecular evolutionary process, and random processes may account for large amounts of the existing variation, as in the case of the neutral arpAT pseudogene in primates.