Introduction

The field of comparative genetics has undergone a recent boom due to its power as a tool for the understanding of the evolutionary and functional factors shaping a given genome region and for the search of the genetic basis of species uniqueness. Despite the large effort made to obtain comparative genetic information, little is still known about similarities and differences between the human genome and that of our closest phylogenetic relatives, the apes. In recent years, several human-specific genetic traits have been detected by a comparative analysis of human and primate species, mostly from the analysis of genetic regions in chimpanzees (for reviews, see Gagneux and Varki 2001; Hacia 2001). Comparative information has mainly been obtained for noncoding regions, and such data have been the key to the understanding of hominoid phylogeny (Chen and Li 2001; see Ruvolo 1997 for a previous review). Comparative analyses of functional regions usually aim to understand the functional constraints on a particular genetic region, including purifying, positive, and balancing selection. In some cases, analysis may help to explain the appearance of a new genetic variant in a particular species, such as lysozyme enzymes in primates (Messier and Stewart 1997) or in the FOXP2 gene in humans (Lai et al. 2001; Enard et al. 2002).

We focused our study on the comparative analysis of a special group of genetic elements: functional CAG/CTG repetitive tracts. These are functional CAG/CTG short tandem repeats (STRs), mostly coding for polyglutamine tracts; they are found in genes that are highly expressed in the brain and their expansion in repeat number causes neurodegenerative disorders. Many of the genes in this study cause spinocerebellar ataxia (SCA) and are named, accordingly, SCAn loci, where n denotes a locus defining number.

These types of loci share the mutation dynamics of the other STRs in the genome, that is, they mutate by adding or subtracting one (or rarely more than one) repeat unit, but differ by expanding into abnormally long alleles that produce ataxia, dystrophy, or similar diseases. This group of genetic diseases has not been detected in nonhuman species, a fact that could simply be attributed to the greater knowledge of human disease but might also reflect a unique human pathogenic trend. Beyond mere description, the comparison of the patterns of variability in the normal range of humans and apes allows the testing of hypotheses on the causes of expansion and disease. This study can therefore be viewed in the context of evolutionary medicine, where the comprehension of the natural history of a disease may lie in the particular characteristics (either of the locus or of the species) of disease predisposition.

The variability of expanding loci has been studied in humans in order to determine population-specific disease risk factors and to understand the evolutionary forces shaping this variability (Watkins et al. 1995; Jodice et al. 1997; Andrés et al. 2003). On the other hand, interspecific comparative studies on these loci have focused on allele length comparison between humans and other primate species and have dealt mostly with a few species used as references in a single locus approach. The special interest in allele length differences among species is based on the observation that long alleles have an increased mutation rate and higher probability of very long leaps (Webster et al. 2002) and the possibility of expansion into the pathogenic range (as shown [Fu et al. 1991; Nolin et al. 2003] for fragile X).

We have analyzed, in four species (human, chimpanzee, gorilla, and orangutan), nine STR loci including those present on the SCA1 (spinocerebellar ataxia 1 locus), SCA2, SCA3 (or Machado–Joseph disease locus), SCA6, SCA8, SCA12, DRPLA (dentatorubral–pallidoluysian atrophy), KCNN3 (potassium intermediate/small conductance calcium-activated channel, subfamily N, member 3), and NCOA3 (nuclear receptor coactivator 3) genes (Table 1). They all are functional, have different genomic locations and functions, and share a high central nervous system expression and the presence of a CAG/CTG repetitive tract. Seven are expanding disease-related loci, while the remaining two are coding but do not seem to expand into pathogenic alleles. Variation at KCNN3, other than expansion, has been proposed as being associated with mental diseases and ataxia (Dror et al. 1999; Figueroa et al. 2001), although as a predisposing factor rather than as a single direct cause.

Table 1 Genetic characteristics of the nine loci examined in this study

A strong correlation between expansion and variability in the normal range has been demonstrated by the observation that expanding STRs are the most variable group of STRs of the human genome (Chakraborty et al. 1997; Jodice et al. 1997; Deka et al. 1999). As these expansions have not been detected in apes, our aim is to determine whether these loci show similar levels of variability in apes to those observed in humans (and therefore diversity and expansion potential would not be directly related) or whether the high levels of polymorphism are exclusive to the species for which the expanding disease has been detected.

We have also determined whether shared factors among loci (such as the existence of a coding poly [CAG] tract) or locus-specific ones (such as differential mutation patterns or selective events) led to the diversity of observed allele distribution of expanding loci.

Ascertainment bias may be a relevant problem when trying to infer general patterns from a group of loci selected in one of the species or populations compared. Nevertheless, this problem does not affect our study, as we are interested in determining the evolution of this specific group of loci (where the observed tendencies are clear in terms of expansion and disease) but do not try to generalize our observations to infer traits for the rest of STRs, which would produce a strong ascertainment bias.

Materials and Methods

Allele Typing

Twenty common chimpanzees (Pan troglodytes subspecies troglodytes [one individual] and verus), 13 gorillas (Gorilla gorilla subspecies gorilla), and 4 to 6 orangutans (Pongo pygmaeus subspecies abelii [two individuals] or unknown) were typed for the number of repeats at the nine loci shown in Table 1. Data for SCA1 and SCA3 in chimpanzees (16 individuals) were obtained from the literature (Limprasert et al. 1996, 1997). In order to avoid sampling a single primate population, we obtained the samples from very different sources: Coriell Cell Repositories (USA), European Collection Cell Cultures (UK), Barcelona Zoo (Spain), Kumamoto Primate Park (Japan), Institute of Zoology, London (UK), and Dr. Takafumi Ishida, Tokyo (Japan). Although it is not possible to assume that this sample set is representative of genetic variation of all apes, care was taken to try to obtain a diverse sample set, whose heterogeneous origin reduces the limitations of studying a small population sample. Subspecies of ape specimens were determined by amplification by PCR of both hypervariable regions of the mtDNA D-loop region and direct sequencing with internal primers. Subspecies identification may not be crucial for our study, however, as variation at nuclear loci does not cluster by subspecies, at least for chimpanzees (Kaessmann et al. 1999).

All DNA samples were amplified for the region containing the repetitive element by PCR, with primers and conditions previously described for human analysis (Table A1). In the KCNN3 gene, which contains two CAG repetitive segments, the most variable repetitive tract (nucleotides 513 to 569 in NCBI Reference Sequence NM_002249.3) was amplified in a fragment that did not include the other repetitive tract. Lengths of the amplified fragments were subsequently typed with GeneScan software version 3.7 (Applied Biosystems) after electrophoresis on a 6% denaturing gel performed on an ABI Prism 377 automatic sequencer (Applied Biosystems). A human sample of known genotype, previously sequenced and typed, was used as a standard size control for all GeneScan runs.

Table A1 Primer sequences used in this study

At least one individual per species and locus was directly sequenced to verify the correspondence between amplified fragment length and number of repeats. Sequences were determined using BigDye sequencing kits, versions 2.0 and 3.0 (Applied Biosystems), on ABI Prism 377 and 3100 automatic sequencers (Applied Biosystems). The only exception is DRPLA, for which ape sequences from GenBank were used (accession numbers AJ133270 [Pan paniscus], AJ133271 [Gorilla gorilla], AJ133272 [Pongo pygmaeus]). Sequences were assembled and analyzed with the Seqman II program (Lasergene 1999 package; DNASTAR, Inc.). Homology between human and mouse sequences was obtained by BLAST, additional alignment of sequences (Seqman II and Clustal programs), and manual determination of the conserved regions.

Allele Frequencies and Statistical Analysis

The correspondence between allele size and number of repeats for every species allowed the estimation of allele frequencies by direct allele counting. Human distributions for healthy individuals result from pooling African, European, Indian, and East Asian samples for all loci; the distributions were obtained from the literature or the typing of healthy individuals when necessary (Andrés et al. 2003). As they come from normal individuals, we do not expect to find a significant proportion of premutated alleles (those with an intermediate allele length, i.e., between normal and pathogenic, and which have high expansion probability). Allele frequencies were averaged across populations without weighting by sample size, as the number of chromosomes varied greatly among populations. All subsequent analyses were performed with the pooled human distribution.
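The unweighted pooling described above can be illustrated with a minimal Python sketch. The function names, population labels, and repeat counts below are hypothetical and are not taken from the study; the sketch only shows that each population contributes equally to the pooled distribution regardless of how many chromosomes it contains.

```python
from collections import Counter


def allele_frequencies(repeat_counts):
    """Relative frequency of each allele (repeat number) by direct counting."""
    counts = Counter(repeat_counts)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def pooled_unweighted(populations):
    """Average per-population frequencies, giving every population equal weight."""
    freqs = [allele_frequencies(chroms) for chroms in populations.values()]
    alleles = sorted(set().union(*freqs))  # all alleles seen in any population
    return {a: sum(f.get(a, 0.0) for f in freqs) / len(freqs) for a in alleles}


# Hypothetical repeat numbers per chromosome (illustrative only).
example = {
    "pop_A": [20, 20, 21, 22],
    "pop_B": [20, 22],
}
pooled = pooled_unweighted(example)
```

In this toy case allele 21 is seen in only one of the two populations, so its pooled frequency is half of its frequency in that population, whatever the sample sizes.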

The different parameters of allele size distributions for every locus and species were determined. Mean, standard deviation, variance, and variation coefficient of repeat number were calculated with the SPSS statistical package. Expected heterozygosity was calculated with the Arlequin 2.000 package (Schneider et al. 2000). In order to analyze the possibility that variance was influenced by differences in sample size, we computed it on an increasing sample size, for every locus and species, with a program that takes a pseudosample of 8–40 chromosomes (the range in size between our smallest and our largest ape samples) from the original distribution and calculates its variance. After 1000 random extractions for each pseudosample size, the average variance for every pseudosample size is plotted in a graph that shows how variance relates to pseudosample size.
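The first resampling procedure can be sketched as follows. This is an illustration under stated assumptions, not the program used in the study: the text does not say whether pseudosamples were drawn with or without replacement, or which variance estimator was used, so the sketch assumes sampling with replacement and the unbiased (n - 1) sample variance. The function name and the example data are hypothetical.

```python
import random
import statistics


def mean_variance_by_size(chromosomes, sizes=range(8, 41), n_draws=1000, seed=0):
    """Average variance of repeat number for each pseudosample size.

    Assumption: pseudosamples are drawn with replacement from the observed
    chromosomes, and variance is the unbiased (n - 1) sample variance.
    """
    rng = random.Random(seed)
    result = {}
    for k in sizes:
        variances = [
            statistics.variance(rng.choices(chromosomes, k=k))
            for _ in range(n_draws)
        ]
        result[k] = statistics.fmean(variances)
    return result


# Hypothetical repeat numbers for 40 chromosomes (illustrative only).
observed = [18] * 5 + [19] * 10 + [20] * 15 + [21] * 6 + [22] * 4
curve = mean_variance_by_size(observed)
```

Plotting the values of `curve` against its keys gives the kind of graph described above, relating average variance to pseudosample size.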

To determine whether variances observed in species with a smaller sample size could be obtained from a hypothetical human sample of eight chromosomes (our smallest sample size, that of orangutans), we performed a second resampling experiment. In this case, a pseudosample of eight chromosomes was obtained for every human distribution and its variance was calculated; after 10,000 sample extractions, we obtained a distribution of variances from pseudosamples of eight human chromosomes. To determine the significance of our results, the variance was compared to the 95% confidence interval of the pseudosample variance distribution.
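A minimal sketch of this second resampling experiment is given below. It assumes sampling with replacement and a central (two-sided) 95% interval taken from the empirical percentiles; the text does not specify either choice. The function names, the human repeat counts, and the ape variance are hypothetical.

```python
import random
import statistics


def pseudosample_variances(human_chromosomes, k=8, n_draws=10_000, seed=0):
    """Sorted variances of n_draws pseudosamples of k human chromosomes.

    Assumption: pseudosamples are drawn with replacement.
    """
    rng = random.Random(seed)
    return sorted(
        statistics.variance(rng.choices(human_chromosomes, k=k))
        for _ in range(n_draws)
    )


def central_interval(sorted_values, level=0.95):
    """Empirical central interval (assumed here to be two-sided)."""
    n = len(sorted_values)
    tail = (1.0 - level) / 2.0
    lower = sorted_values[int(tail * n)]
    upper = sorted_values[min(n - 1, int((1.0 - tail) * n))]
    return lower, upper


# Hypothetical human repeat numbers and ape variance (illustrative only).
human = [18] * 10 + [19] * 25 + [20] * 40 + [21] * 15 + [23] * 7 + [27] * 3
variances = pseudosample_variances(human)
lower, upper = central_interval(variances)
ape_variance = 0.3
outside_interval = not (lower <= ape_variance <= upper)
```

If `outside_interval` is true, the ape variance would be unlikely to arise from a sample of eight human chromosomes under these assumptions.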

The four species were compared in terms of mean and variance of repeat number. Two permutation tests were performed for mean repeat number: the comparison between the four species for every individual locus (considering every species as a different category) and the comparison between humans and the rest of the species for every locus (considering “apes” a category, which included all ape species, and comparing it with the “human” category). The permutation test was performed as follows: Individual chromosomes were randomly shuffled between classes (species or species groups), maintaining the original sample sizes. For every permuted data set, the difference in the average number of repeats between the two classes was computed. This process was repeated 1000 times. The test is significant if the probability of obtaining a difference in average repeat number in the permutations as large as in the observed data set, in a one-tailed test, is <0.05 for the human–ape comparison or <0.0056 (after Bonferroni correction) for the four-species comparison.
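The two-class (human versus ape) version of this permutation test can be sketched as follows. The function name and the data are hypothetical, and the p value is taken as the simple proportion of permutations whose difference is at least as large as the observed one, which is one reasonable reading of the description above.

```python
import random
import statistics


def permutation_test_mean(group_a, group_b, n_perm=1000, seed=0):
    """One-tailed permutation test for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)
    observed = statistics.fmean(group_a) - statistics.fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)  # original sample sizes are maintained
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign chromosomes to classes at random
        diff = statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
        if diff >= observed:
            hits += 1
    return observed, hits / n_perm


# Hypothetical repeat numbers (illustrative only).
humans = [30] * 20
apes = [20] * 20
observed_diff, p_value = permutation_test_mean(humans, apes)

# Bonferroni threshold for nine loci, matching the <0.0056 quoted above.
bonferroni_alpha = 0.05 / 9
```

With 1000 permutations the smallest nonzero p value that can be resolved is 0.001, which is below the Bonferroni threshold of about 0.0056.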

Variance comparison among the four species, for every locus, was performed with Scheffé–Box (log ANOVA) test for homogeneity of variances (Sokal and Rohlf 1995, p. 397); as this test is not available in statistical packages, an ad hoc program has been written and is available on request (oscar.lao@upf.edu). In order to obtain a single overall significance value for all loci, the individual Scheffé–Box p values for every locus were combined with Fisher’s test for probability combination (Sokal and Rohlf 1995, p. 795). When a species showed a single allele for a given locus the test was performed using the rest of the species.
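The combination step can be illustrated with a short sketch of Fisher's method, in which the statistic -2 Σ ln p is compared with a chi-square distribution with 2k degrees of freedom for k independent p values. The sketch covers only this combination step, not the Scheffé–Box test itself; the function name and the input p values are hypothetical. Because the degrees of freedom are always even, the chi-square tail probability has a closed form and no external library is needed.

```python
import math


def fisher_combined_p(p_values):
    """Fisher's method: combine independent p values (each must be > 0).

    Returns the statistic -2 * sum(ln p) and its chi-square tail probability
    with 2k degrees of freedom, using the closed form for even df:
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total


# Hypothetical per-locus p values (illustrative only).
statistic, combined_p = fisher_combined_p([0.04, 0.01, 0.03, 0.20])
```

A single p value is returned unchanged by this procedure, which is a convenient sanity check on the implementation.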

Results

Data on nine CAG repeat loci, whose characteristics are shown in Table 1, were obtained for humans (of African, European, Indian, and East Asian origin), for 20 chimpanzees, 13 gorillas, and 4 to 6 orangutans; all chromosomes analyzed were in the “normal,” nonexpanded range. Allele size distributions are plotted by locus in Fig. 1, and statistical parameters of the distributions and expected heterozygosity are shown in Table 2. The amplified fragment was sequenced in at least one individual per species and locus; sequences of all repetitive regions are shown in Fig. 2, and their accession numbers are available in Table A2, including compared mouse loci. Comparative locus analysis was carried out on different levels, searching for general species trends and for locus-specific trends.

Table 2 Sample size, statistical parameters, and expected heterozygosity for each species and locus
Table A2 New nucleotide data accession numbers
Figure 1

Graphical representations of individual allele size distributions and average length and variability parameters for species and loci. 1.1–1.9 Allele length distributions in the four species. The distributions are the result of typing, for repeat number, chimpanzee (in red), gorilla (blue), and orangutan (yellow) chromosomes, and pooling human distributions from African, European, Indian, and East Asian origin (black). 1.10 Mean allele size for species and locus. 1.11 Variance of repeat number for species and locus.

Figure 2

Sequence of the repetitive region for the expanding loci in the four species. The line over the sequences marks the STR sequence region, and the total length of the overlined sequence region is shown. Repetitive segments are indicated without detailing the exact number of repeats, and segments containing six or more repeat units, with higher probabilities of slippage events than shorter segments, are marked in boldface.

General species-specific trends can be detected by comparing the parameters of the distribution for every species over all loci (Table 3). Humans show a higher mean number of repeats than the other species, reaching statistical significance (permutation test humans vs. apes, p < 0.001; see Materials and Methods for the groups considered), showing that a trend exists in allele length among the different species. Nevertheless, the trend is not followed by all loci, as discussed below.

Table 3 Parameters of allele size distributions for species: Pooled sample size, statistical parameters, and expected heterozygosity for each species in the different loci

Variance and coefficient of variation of repeat number show a decreasing trend from humans to orangutans, as shown in Table 3. Statistically significant differences in variance exist between species, with all individual loci showing significant differences among species (p < 0.05) and the combined p value also being statistically significant. Interestingly, for the seven expanding loci humans presented the largest variance.

The small sample size in the ape species may bias the estimation of the dispersion parameters. As stated in Materials and Methods, the samples for nonhuman species are of very heterogeneous origin, which reduces the possibility of underestimating variance by sampling from a single, localized population or inbred zoo collection. Furthermore, to explore to what extent human variance is larger than that of the rest of the species as a consequence of its larger sample size, we performed two independent tests.

First, we studied whether variance increased with sample size by a resampling test on pseudosamples of 8–40 chromosomes for the four species in every locus. Results showed that reduction of sample size does not lead to a reduction of variance (data not shown) and that the resampling average value remains within the range of the values obtained from the original distribution. Therefore, as expected, a reduced sample does not by itself produce a lower variance value. A second resampling test was performed, in which we tested whether variances similar to those obtained for apes could be obtained from pseudosamples of eight human chromosomes from our human distributions (see Materials and Methods). Of the seven loci with larger variance in humans than in the rest of the species, four loci (SCA3, SCA8, SCA12, and DRPLA) showed large and statistically significant differences (p < 0.05) between humans and orangutans (the species with the lowest sample size), and two of them (SCA3 and DRPLA) showed significantly larger variances in humans than in any other species. This analysis suggests that, beyond some sample size influence, our results cannot be exclusively explained by differences in sample size, and variance divergence for this group of loci seems to be a species characteristic.

Heterozygosity is high and very similar between humans and chimpanzees and low in gorillas and orangutans (Table 3). This observation is due to the large influence of SCA2 and NCOA3, with higher heterozygosities in chimpanzees than in humans; without these two loci, mean heterozygosity is higher in humans than in chimpanzees.

In order to determine which loci are mostly responsible for the species trends, mean repeat number (Fig. 1.10) and variance of repeat number (Fig. 1.11) were compared for each species and locus. Humans do not always show larger allele sizes at all loci or at all expanding loci, and only three (SCA1, SCA8, and NCOA3) show statistically significant larger mean allele length in humans than in any other species. On the other hand, variance is higher in humans than in the rest of species for all expanding loci but not for the two nonexpanding loci. Nonetheless, the low number of nonexpanding loci analyzed prevents us from generalizing this interesting difference in variance.

The individual comparison among loci distributions shows a strong heterogeneity in allele size distribution (Fig. 1), and sequences of the STR alleles over the four species (Fig. 2) illustrate that repeat regions are very complex and that different loci and lineages show heterogeneous amounts and patterns of divergence. The sequence distribution comparison clearly shows a lack of general patterns (in STR sequence and interrupting complexity) that could explain the observed allele size and variability of expanding loci.

Interestingly, when comparing human and mouse, only very short tracts were detected for almost all mouse loci, from the complete absence of the tract (in SCA6, SCA8, and SCA12 the repetitive region could not be detected) to very short CAG/CAA repetitive regions (four CAG repeats in SCA1, two in SCA2, five in SCA3, four in DRPLA, and one in NCOA3). Only KCNN3 keeps the repetitive tract over phylogenetically distant species, as a repetitive CAG/CAA region exists in the mouse sequence, with the interrupting codons CAA (Gln) and TCG (Ser), which are absent in primates.

Discussion

Our study shows that the “flexible conservation” that exists in human functional CAG/CTG tracts (high polymorphism in number of repeats combined with sequence conservation) is also found in apes. The conservation of the variable tracts in all species strongly supports the presence of the repeat regions in the common ancestor of humans and apes in all nine loci studied, and the presence of polymorphism in almost all species suggests the existence of ancestral polymorphism.

Previous studies comparing human and mouse CAG/CAA tracts found a relationship between interrupting levels and conservation between very distant species (Albà et al. 1999), suggesting that older tracts would be more frequently interrupted by non-CAG codons. We failed to find this relationship in our set of repetitive tracts, as the most conserved locus (KCNN3, with a long repetitive region in both humans and mice) is totally uninterrupted in primates, and many tracts showing low conservation between distant species are profoundly interrupted in primates.

Species-Specific Characteristics

The combined analysis of all loci points to species-specific trends. In allele length comparison, larger alleles in humans than in other species were reported for nonfunctional STRs (Rubinsztein et al. 1995a; Crouau-Roy et al. 1996; Cooper et al. 1998), a controversial conclusion (Ellegren et al. 1995, 1997). Previous studies found longer alleles in humans than in other species for SCA1, SCA3, AR, and HD CAG repeats and in the FA (GAA) locus (Rubinsztein et al. 1995b; Djian et al. 1996; Limprasert et al. 1997; Choong et al. 1998; Gonzalez-Cabo et al. 1999; Justice et al. 2001). This suggests a general increase in the number of repeats from monkeys to apes and humans for expanding loci (although similar allele length was found by Limprasert et al. [1996] in the SCA3 locus). On the other hand, a study on functional nonexpanding CAG STR shows shorter alleles in humans than in apes (Saleem et al. 2001). Therefore, a clear picture of the human specific characteristics in functional STRs had not emerged beyond single locus comparisons.

The data presented in this paper for nine functional trinucleotides show significantly higher number of repeats in humans, a trend that is unique to this lineage and shows that, beyond locus heterogeneity, a specificity in allele length exists in this set of STRs. The trend is not present in all loci and is mainly (but not exclusively) due to SCA1, SCA8, and NCOA3.

In our set of samples and loci, we found differences in variance among the four species, with the highest values in humans. In the genomic regions studied so far, DNA sequence diversity is lower in humans than in chimpanzees and other apes (Crouau-Roy et al. 1996; Kaessmann et al. 2001; Noda et al. 2001), possibly due to a demographic bottleneck in the human lineage (Jorde et al. 2000). Surprisingly, the present results show that humans are more diverse (measured as variance of allele distribution) than any of the other species studied for an ample set of expanding functional trinucleotide tandem repeats. No demographic factor (which would affect the whole genome) or ascertainment bias effect for the selection of the STRs in humans could explain this finding, which is not a general STR trend but is exclusive to functional STRs that can expand and produce disease. This is not found in noncoding STRs (and previous results [Crouau-Roy et al. 1996; Garza et al. 1995; Wise et al. 1997] are not concordant) or in other similar STRs (such as the nonexpanding KCNN3 and NCOA3, analyzed in this study).

The possible sampling error in apes has been reduced by choosing individuals of different origins as much as possible (see Materials and Methods), and the effects of different sample size for the different species have been proven to be small through resampling procedures (see Results). Moreover, variability levels may be investigated to test whether our primate samples present lower variability levels than humans at other genetic loci. Seventeen of the chimpanzees typed in this study were previously analyzed for 16S rRNA (Noda et al. 2001); the subset of individuals analyzed in both studies show variability levels (π = 0.0014 ± 0.0003) comparable with those existing for the 16S rRNA of the whole human species (π = 0.0016 ± 0.0004) (Ingman et al. 2000). These results show that a part of our sample of chimpanzees is as variable as humans at a global scale, and thus the expected diversity of chimpanzees in CAG repeats should be higher than humans. Therefore, our results are not the consequence of a small sample size or the selected sample of individuals, and locus-specific factors (related to CAG loci) acting in different ways in different species are needed to explain the observation that the loci that can expand in humans are more variable in humans than in any other species.

Human expanding STRs have previously been shown to be more variable than other di-, tri-, or tetranucleotides in the human genome (Chakraborty et al. 1997; Jodice et al. 1997; Deka et al. 1999), suggesting a relationship among variability, expansion, and disease. Moreover, our results show that these loci are more variable in humans than in any ape, suggesting that this pattern could be related to a human-specific capacity for expansion. The relationship suggests that loci with increased variance may be more likely to expand to pathogenic alleles, and thus lead to disease.

Different scenarios would be compatible with the increase in variance in one species. The first is a high mutation rate in the absence of strong selective constraints, which would increase variability of the STR (Di Rienzo et al. 1998), leading to new alleles that, if long enough, would increase slippage probability to expanded alleles. Humans do not have larger mean allele sizes at all loci, and therefore allele length differences do not seem to explain the observation. Nevertheless, we observed a statistically significant relationship between the longest allele and variance, even correcting for the influence of mean allele size (r = 0.6552, p < 0.0005). The observed relationship between variance and longest allele can be explained by the nature of the variance calculation (as alleles far away from the mean allele size will strongly influence variance). Thus normal but long alleles might affect the STR dynamics: as mutation rate increases with allele size, very long alleles can contribute to the high mutation rate and high variability levels. But this factor, although important for the production of pathogenic alleles, would have a minor effect on the overall variance, as such alleles are rare in the population, and their increased mutation rate will probably not be large enough to explain the huge differences observed in variance for the whole distribution. When computing variances without the very long alleles (eliminating the five longest alleles in a range of less than 40 repeats), variance decreases only 20% on average, far from the differences with other species, showing that very long alleles are not mainly responsible for the large human variances in expanding loci. No other factors seem to exist to explain differences in mutation rates.

A second possibility, given the functionality of the loci, is the existence of some kind of balancing selection that, as in the case of other human genes like HLA (Hedrick and Thomson 1983; Hughes and Nei 1988) or CCR5 (Bamshad et al. 2002), would maintain the high diversity by favoring the existence of different alleles in the population. This is a truly speculative explanation, as no external evidence exists for the presence of balancing selection acting on these loci, but the fact that they are functional requires consideration of a nonneutral explanation for our findings.

A detailed knowledge both of the mutation rate and pattern for each locus and of the functional behavior of the different alleles may be necessary to interpret these findings and to infer whether mutation or selection is the main cause of the increase of variability in humans.

Locus-Specific Characteristics

Differences in a single functional locus among species may influence its function, and these differences are especially interesting as these genes are expressed in brain and their variation outside the normal range has pathogenic effects. These characteristics (functionality and variability) make poly(CAG) regions attractive candidates for having brain-related functions, especially in the human lineage, which shows many brain-specific phenotypic traits compared to the rest of species.

The analysis of each locus demonstrates the existence of a strong evolutionary heterogeneity, showing that specific evolutionary forces have been acting on each gene, governing its diversification and evolution. No important differences in allele length exist between humans and apes, the only exception being SCA1 and SCA8, the two candidate loci to present functional differences between humans and the other species if the length of the repetitive tract influenced protein function. This possibility has been suggested for SCA1 (Yue et al. 2001) and SCA8 (see Andrés et al. 2003 for a detailed explanation of SCA8 interspecific differences). The variability levels are also highly heterogeneous among loci, ranging from loci with similar diversity in all species (as in SCA6 and SCA12) to those with extreme differences (SCA3 and DRPLA).

As in allele length distributions, there is a strong heterogeneity in the repetitive sequence conservation among loci for the different species: in contrast to the total conservation of the CAG tract in some loci (SCA6, SCA12, and KCNN3), others show important sequence differences among species, both in the presence of interruptions and in the composition of the repetitive segment (as in SCA2, SCA3, SCA8, or DRPLA), which results in differences in repeat number. Nevertheless, the complexity in sequence interruption patterns could not be related to the amount of allele size diversity, and any simple pattern of STR mutation depending on the repetitive sequence that could be inferred from the analysis of one functional STR locus does not seem to apply to others. Heterogeneity is the main rule.

The comparative study of seven expanding and two nonexpanding brain-expressed functional STRs has shown the overall maintenance of polymorphism with a higher mean in humans for some loci and a larger amount of variability in humans for expanding loci. Comparative studies focused on the search of species specificities on functional regions often do not deal with intraspecific variability, trying to find out fixed changes among species. Here we suggest that differences in genetic variability in important brain-expressed genes should also be considered. Species variability levels on phenotypic traits can also be a species-unique characteristic, and the study of diversity in specific genes with important gene function may give a clue toward understanding the species-specific evolution of these genes.