Introduction

Microsatellites are short tandem DNA repeats present in variable numbers in the genomes of eubacteria, archaebacteria, and eukaryotes (reviewed in Charlesworth et al. 1994; Richard et al. 1999). Their size polymorphism was used to construct maps of the human, wheat, and trout genomes (Dib et al. 1996; Röder et al. 1998; Sakamoto et al. 2000). Microsatellites were also extensively used to type subpopulations belonging to the same species. Molecular typing was achieved by PCR amplification of selected microsatellite loci in Saccharomyces cerevisiae (Hennequin et al. 2001; Richard and Dujon 1996), Candida albicans (Lunel et al. 1998), Arabidopsis thaliana (Innan et al. 1997), or Vitis vinifira (Bowers et al. 1999).

Microsatellite tract length changes are believed to occur by replication slippage of the newly synthesized strand on its template during S phase replication or during recombination, promoted by the repetitive nature of the sequence (reviewed in Richard and Pâques 2000; Richards 2001). It was shown that perfect (uninterrupted) trinucleotide repeats were polymorphic among S. cerevisiae strains of different origins, whereas imperfect repeats were stable, supporting the notion that perfect repeats can be used to type populations differing at the molecular level (Richard and Dujon 1997).

Trinucleotide repeats are equally distributed in coding and non-coding regions, whereas other microsatellites (except hexanucleotide repeats) are underrepresented in coding regions (Debrauwère et al. 1997; Richard and Dujon 1997; Toth et al. 2000). This is most certainly due to the fact that trinucleotide repeat tract alterations within an ORF do not abolish its function by creating frameshift mutations. The most striking observation related to trinucleotide repeat-encoding genes is their propensity to encode nuclear products (Alba et al. 1999; Karlin and Burge 1996; Richard and Dujon 1997; Young et al. 2000). In S. cerevisiae, the most frequent aminoacid encoded by trinucleotide repeats is glutamine, but other charged aminoacids are also frequently encountered. Long poly-Gln stretches are supposed to be involved in a variety of nuclear processes, including protein–protein interactions and transcription activation (Karlin and Burge 1996).

During the course of the “Génolevures” sequencing project, 49,199 Random Sequenced Tags (RSTs) have been obtained from 13 hemiascomycetous yeast species (Souciet et al. 2000). Using these data (available at http://cbi.labri.u-bordeaux.fr/Genolevures) , we identified all dinucleotide, trinucleotide, and tetranucleotide repeats and compared their distribution in each of the 13 species with that found in S. cerevisiae. We show that the frequency of occurrence of each type of microsatellite varies among the different species studied, but that species close to S. cerevisiae exihibit a similar distribution. We confirm that trinucleotide repeats are more frequent in genes encoding nuclear factors, particularly transcription factors, and show that they are over-represented among the class of hemiascomycete-specific genes. We propose that trinucleotide repeats play a role in designing new regulatory domains by promoting rapid sequence diversity among the more flexible nuclear proteins.

Materials and Methods

Microsatellites were detected in the sequences of the Génolevures project (EMBL accession numbers from AL392203 to AL441602) using the software of Benson and Waterman (1994). The following parameters were used: match weight, + 1; mismatch weight, −2, −3, −4 (respectively, for di-, tri-, tetranucleotide repeats); insertion/deletion weight, −6, −9, −12; threshold to report, 10, 15, 20; pattern size, 2, 3, 4; lookcount, 2, 3, 4; no shortperiod, 1. This allowed us to detect microsatellites containing at least five perfect (uninterrupted) repeat units. Out of 49,199 RSTs sequenced during the Génolevures project, 4,814 contained rDNA, mitochondrial, Ty, or plasmid sequences and were excluded from the search.

The equation used to analyze the length distribution of di- and trinucleotide repeats is N = a × n −b, with n being the length class and N the number of occurrences in each length class. The resulting curves have been compared to the observed numbers using a Chi2 test (Chi2 > 6.63; p < 0.01).

Results

Sequence and Length Distributions of Microsatellites in the 13 Hemiascomycetous Species

In order to detect microsatellites in the 13 hemiascomycetous genomes, we used the same approach as the one described in former works (Richard and Dujon 1996, 1997; Richard et al. 1999). Altogether, we analyzed the 44,385 RSTs corresponding to nuclear DNA, totalling 41.2 Mb from 13 yeast species. We compared the results with the complete genome of S. cerevisiae. All di-, tri-, and tetranucleotide repeats that were at least five units long were scored. Their frequencies of occurrence are shown in Table 1. Three species, Saccharomyces exiguus, Kluyveromyces marxianus var. marxianus, and Candida tropicalis, exhibit a significantly higher number of di-, tri-, and tetranucleotide repeats as compared to S. cerevisiae. Only two species, Pichia sorbitophila and Pichia angusta, exhibit a lower number of both di- and trinucleotide repeats as compared to S. cerevisiae. The other species are either comparable to S. cerevisiae or show an excess or a deficiency in only one or two of the three classes. No species is entirely comparable to S. cerevisiae (Table 1).

Table 1 Total number of microsatellites identified in hemiascomycetous genomes

Then, we compared the frequencies of occurrence of each of the four possible non-monotonous dinucleotide repeats found in each species to the frequency observed in S. cerevisiae. Results are shown in Table 2. ACn and AGn repeats are over-represented in five and nine species, respectively, and both are underrepresented only in D. hansenii var. hansenii. The situation is opposite for ATn repeats which are underrepresented in eight species (including five of the previous ones) and over-represented only in D. hansenii var. hansenii. Finally, CGn repeats are rare in all species except K. thermotolerans and P. angusta, where they are significantly over-represented. Interestingly, the three closest species to S. cerevisiae (S. bayanus var. uvarum, S. exiguus, and S. servazii) show similar distributions of dinucleotide repeats, except for the excess of AGn repeats in S. bayanus var. uvarum. The most frequent dinucleotide repeats for each species are represented in Fig. 1.

Table 2 Frequencies of occurrence of each non-monotonous dinucleotide repeat in the hemiascomycetous species
Figure 1
figure 1

Distribution of di- and tri-nucleotide repeats in each hemiascomycetous species, and their respective position in the phylogenetic tree. Left: cladogram of the 14 species based on 25S rDNA sequence (Souciet et al. 2000). Right: Distribution of di- and trinucleotide repeats. For each species, di- and trinucleotide repeats whose frequencies of occurrence exceeds 25% or 10%, respectively,of the total number of their class are indicated (see Tables 2 and 3).

We also compared the frequency of each dinucleotide’s occurrence to the frequency of the corresponding dinucleotide repeat. There is no obvious correlation between dinucleotide frequencies and the corresponding dinucleotide repeat frequencies, except for K. thermotolerans which exhibits the highest frequency of CG dinucleotide and of CG dinucleotide repeats (Table 2). This shows that the frequency of each dinucleotide repeat is not determined by the genome average composition, suggesting that active mechanisms regulate the distribution of dinucleotide repeats in each genome.

Similarly, each of the 10 non-monotonous trinucleotide repeats was scored, and results are detailed in Table 3. The distribution of each kind of trinucleotide repeat is more homogeneous, as compared to S. cerevisiae. However, it is worth noting that AATn repeats are over-represented S. exiguus, S. servazii, Z. rouxii, and D. hansenii var. hansenii and that ATCn repeats are underrepresented in K. marxianus var. marxianus, P. angusta, and Y. lipolytica. The two more distantly-related species, P. angusta and Y. lipolytica, show the greatest divergence as compared to S. cerevisiae, with four trinucleotide repeats out of ten exhibiting significantly different frequencies of occurrence. The most frequent trinucleotide repeats are represented in Fig. 1.

Table 3 Frequencies of occurence of each non-monotonous trinucleotide repeat in the hemiascomycetous species

We have also examined the length distribution of di- and trinucleotide repeats. Mean lengths of di- and trinucleotide repeats in the 13 hemiascomycetous species range 8–12 repeat units and are not significantly different from that of S. cerevisiae (11 repeat units). Long repeats are less frequent than short ones in all species. If dinucleotide repeat lengths were not subject to any selection, we would expect to observe an exponential decrease of the number of dinucleotide repeats belonging to each length class. By fitting the observed distribution to an exponential curve, we calculated the expected number of dinucleotide repeats of each length class and for each of the 14 yeast species. We found that the most frequently over-represented length class is eight repeat units (11 species out of 14), followed by nine and 10 repeat units (seven species out of 14). For trinucleotide repeats, we did not observe an overrepresentation of a particular length class.

Amino Acids Encoded by Trinucleotide Repeats

Trinucleotide repeats located within a protein-coding sequence encode repetitions of amino acids. For each hemiascomycetous species, we analyzed all trinucleotide repeats occurring within the 339 ORFs identified by sequence homology to a S. cerevisiae ORF. The corresponding amino acids were examined and their frequency of occurrence computed (Fig. 2). A total of 15 distinct amino acids were found, the most common ones being Glutamine, Aspartic acid, Asparagine, and Glutamic acid. Phenylalanine, Tyrosine, Tryptophane, Isoleucine, and Methionine are never found.

Figure 2
figure 2

Amino acids encoded by trinucleotide repeats occurring within ORFs of the 13 hemiascomycetes. Amino acids are indicated by the single letter code. Number of occurrences of each homopolymer is shown on the vertical axis.

Because the reading frames for each yeast species were determined by comparing the hemiascomycete sequence to the S. cerevisiae genome, we had more matches between S. bayanus var. uvarum and S. cerevisiae than with other, more distant species. Therefore, in order to have numbers high enough to do statistical tests, we pooled the other 12 species (Tables 4 and 5). First, we noticed that the three possible frames of the same triplet (e.g., AAC, CAA, and ACA) are not used with the same frequency. AAC is more frequently encountered in S. bayanus var. uvarum and CAA is more frequently encountered in other species than ACA. We observe that GAA, AAT, CAG, GCT, GAC, and GAT are more represented than the other frames (Table 4). This phenomenon probably reflects an effect of amino acid selection on some of the frames because, in terms of DNA sequence, the three possible frames are equivalent. Second, a given amino acid is preferentially encoded by one of its synonymous codons. For example, poly-Glu are preferentially encoded by GAA repeats, poly-Asp by GAT (although GAC is also frequently found), and poly-Gln by CAA in all species except S. bayanus var. uvarum (Table 4). Interestingly, poly-Asn are preferentially encoded by AAC in S. bayanus var. uvarum and by AAT in other species. We then compared these numbers to the Relative Synonymous Codon Usage (RSCU) of each triplet (Lloyd and Sharp 1992) for S. bayanus var. uvarum only, because the numbers were high enough to perform statistical tests (Table 5). Asparagine is preferentially encoded by AAC and Glutamine by CAG in trinucleotide repeats, opposite to what is observed in the overall genome. This is to be compared with human neurological disorders where poly-Glutamine repeats are always encoded by CAG repeats and never by CAA (Orr 2001). We concluded that for a given amino acid, the repeated codon is not determined by its absolute frequency in the genome, but by cis- or trans-acting factors which favor its presence within a repeated DNA sequence.

Table 4 Number of occurrences of each amino acid encoded by trinucleotide repeats in S. bayanus var. uvarum (Su) or in the other 12 species (others)
Table 5 Comparisons between the expected number of codons (Exp.) (calculated by dividing the RSCU of a codon by the number of synonymous codons and then multiplying by the total number of occurrence of each amino acid), and the observed number (Obs.) of such codons, in S. bayanus var uvarum

Conservation of Amino Acid Stretches Among the 13 Hemiascomycetes

From the alignments of each of the different hemiascomycetous sequences with S. cerevisiae sequences we could determine whether the amino acid repeats were conserved or not between the two yeast species. We found conservation in 64% of the cases and absence of conservation in 36% of the cases. Figure 3 shows some examples. In NOG1 and SSN3 gene products, the poly-Glu and poly-Ala stretches are perfectly conserved at the same position in all other yeast species where the gene was sequenced. In NAP1 and RPP2B gene products, the amino acid repeats are still detectable in all yeast species where the genes were sequenced but are not very well conserved and the length of the stretch changes between species. The YLR104w ORF product shows an interesting example of conservation of the repeat in four species out of five. Finally, one of the most intriguing cases we found was the PAB1 (YER165w) gene product. In all five species, this protein contains a poly-Ala. A poly-Gln repeat is only present in K. lactis, but two glutamines are present at the same position in the other species. The total length of the two repeats is similar in the different species, suggesting that the length of the repeat on the function and/or stability of the protein.

Figure 3
figure 3

Alignments of sequences containing trinucleotide repeats in orthologs of S. cerevisiae genes. Yeast species are designed by their initials (Table 1). Trinucleotide repeats are in red and in blue. Stars represent identical residues. The microsatellite in NAP1 was initially only detected in P. angusta and S. exiguus. Alignment with the two other species was necessary to see that the repeated sequence was also present in S. cerevisiae and S. bayanus var. uvarum but in a degenerate form. Similarly, the Poly-Gln repeat in PAB1 was initially only detected in K. lactis. Alignment showed that two glutamine were present at the same position in the other species.

Trinucleotide Repeats are Over-Represented in Ascomycete-Specific Genes

In S. cerevisiae, trinucleotide repeats are often found in genes encoding products localized at the nucleus (Richard and Dujon 1997; Young et al. 2000), particularly transcription factors (Alba et al. 1999). In order to determine whether this phenomenon is specific to one or a few yeasts or is widespread throughout evolution, we examined the distribution of the 574 S. cerevisiae ORFs previously defined as containing trinucleotide repeats (Richard and Dujon 1997). We used the set of 1571 ascomycete-specific genes, as defined by Malpertuy et al. (2000), the set of 770 genes encoding nuclear products, as defined in the MIPS database, and the set of 101 transcription factors from the Yeast Protein Database (Fig. 4). This analysis confirmed that nuclear factors are over-represented among genes containing trinucleotide repeats, with twice as many observed occurrences (142) than expected (71). The bias is even stronger for the subset of genes encoding transcription factors (nine expected; 27 observed). The novelty is that genes containing trinucleotide repeats are also more frequently observed than expected among the set of ascomycete-specific genes (145 expected; 200 observed). Note that genes encoding transcription factors are themselves found more often than expected in ascomycete-specific genes, as previously observed by Malpertuy et al. (2000). The functions of the 27 transcription factors containing trinucleotide repeats are detailed in Table 6. Out of 27 proteins, 25 have been experimentally shown to interact directly with DNA in the yeast nucleus. By contrast, we did not find an over-representation of trinucleotide repeats in genes encoding the nuclear pore components or involved in nuclear export/import functions, nor in genes encoding DNA replication and/or DNA repair functions (data not shown).

Table 6 Function of the 27 transcription factors containing trinucleotide repeats, according to the Yeast Protein Database
Figure 4
figure 4

Comparisons between expected and observed frequencies of genes containing trinucleotide repeats in genes encoding nuclear products, transcription factors, and in ascomycete-specific genes. Arrows connecting circles represent genes belonging to the two categories. Their expected number is calculated from the product of frequencies of each class (total number of S. cerevisiae genes is 6213). The 99% confidence intervals are indicated in parentheses. Observed numbers significantly higher than expected values are bolded in red. Subcellular localizations and putative functions of S. cerevisiae ORF products are defined according to the MIPS database, available at http://mips.gsf.de/proj/yeast/catalogues/subcell/index.html , and theSGD database, available at http://genome-www.stanford.edu/Saccharomyces .

Discussion

Microsatellites Are De Novo Created and Lost During Evolution

In the present work, we compared the distribution of microsatellites in the genomes of 13 hemiascomycetous yeast species. Although the sequence data used are only partial (0.2–0.4 X coverage only; see Souciet et al. 2000), we believe that they represent a statistically valid sample of the entire genomes because the sequenced libraries are essentially unbiased (Tekaia et al. 2000). In addition, results previously obtained with 30% of the S. cerevisiae genome (Richard and Dujon 1996), gave the same distribution as the whole genome sequence (Richard and Dujon 1997).

It is clear that the four species that are closest to S. cerevisiae are strongly biased toward ATn repeats, like two less-related species, K. lactis and D. hansenii var. hansenii (Table 2 and Fig. 1). If we take di- and trinucleotide repeats into consideration, only four species show similar distributions for both: S. cerevisiae, S. bayanus var. uvarum, Z. rouxii, and K. lactis. The other ten species all exhibit unique distributions, a sort of molecular signature of their genome. This suggest not only that microsatellites evolve at a rapid pace, but also that they are created de novo and lost during evolution of living organisms. Mechanisms that alter microsatellite size involve replication slippage during S-phase DNA synthesis unrepaired by the mismatch repair system (reviewed in Kolodner 1996; Modrich and Lahue 1996), or repair slippage during gene conversion of a microsatellite (reviewed in Richard and Pâques 2000). It is not clear what kind of mechanism could de novo create microsatellites. Zhu et al. (2000) suggested that replication slippage could be important even earlier in a microsatellite lifetime, expanding even very short microsatellites, but at least two tandem repeats are needed to get an expansion due to slippage, nevertheless. Gene conversion can lead to insertion of a microsatellite at a locus previously devoid of microsatellite (Richard et al. 1999; Richard et al. 2000), and could be a possible mechanism to spread microsatellites at new locations in the genome. Among the possible mechanisms responsible for loss of microsatellites, point mutations leading to degeneration of the repeat has been suggested (Zhu et al. 2000), but other mechanisms could also be involved. Figure 3 shows a possible example of a microsatellite loss by degeneration of the repeat, occurring in the NAP1 gene of S. cerevisiae and S. bayanus var. uvarum. Once point mutations occur within a microsatellite they will tend to stabilize the repeat, as was shown in mismatch repair-deficient human cells (Bacon et al. 2000). Size changes will thus become less frequent and accumulation of additional mutations will finally make the microsatellite undetectable.

Some dinucleotide-repeat length classes are over-represented in a majority of the analyzed species. If most, if not all, microsatellite length changes are produced by unrepaired slippage of DNA polymerases during replication or repair, there should be an equilibrium depending on: i) repeat length (long repeats have a higher mutation rate than short repeats; Tran et al. 1997; Wierdl et al. 1997); ii) mutations in replication and mismatch repair factors that destabilize microsatellites (Kokoska et al. 1999; Kokoska et al. 1998; Strand et al. 1993; Tran et al. 1997). Xu et al. (2000) showed that the rate of tetranucleotide repeat expansions in humans is constant for all alleles, whereas the rate of contractions increase exponentially with repeat length. Overrepresentation of eight-unit dinucleotide repeats suggests that, at equilibrium, forces driving expansions and contractions of dinucleotides would be equally balanced, leading to the accumulation of repeats of the critical length. Shorter repeats could correspond to young repeats, before equilibrium, or alternatively, old degenerated repeats.

Charged Amino Acids are Over-Represented in Trinucleotide Repeats Contained in Transcription Factors

Four amino acids (D, E, N, and Q) are most frequently found in ORFs containing trinucleotide repeats in all 13 hemiascomycetes studied. The same four amino acids were found in S. cerevisiae in addition to Serine being found frequently in short trinucleotide repeat arrays (Alba et al. 1999; Richard and Dujon 1996, 1997; Young et al. 2000). As previously observed (Richard and Dujon 1996), the three possible frames of a given repeat have distinct different frequencies of occurrence suggesting structural (or functional) contingencies at the protein level and not at the DNA level. Poly-Methionine are probably counter-selected to avoid possible interference with the translation machinery and the start codon.

From database compilation, Karlin and Burge (1996) first observed that proteins containing long homopeptide stretches were often playing a role in developmental functions, particularly in Drosophila. They encode long homopolymers of Gln, Asn, and His. Proteins containing long poly-Gln or poly-Asn tracts tend to aggregate. The best known examples are proteins involved in human neurodegenerative disorders, like Huntingtin (Scherzinger et al. 1997), and prions like the yeast Sup35p (Santoso et al. 2000) or the mammalian PrP (Prusiner et al. 1998). In the 13 hemiascomycetous yeasts studied, like S. cerevisiae, transcription factors are more often than expected encoded by genes with trinucleotide repeats, especially in ascomycete-specific genes, that were shown to diverge more rapidly during evolution (Malpertuy et al. 2000). Transcription factors might be more flexible than other proteins. By adding homopolymers of charged amino acids, biochemical properties of a transcription factor can be changed, modifying its interaction with DNA, with other DNA binding proteins, or with other transcription factors. This modified protein can then be selected for its new function, allowing the cell to increase diversity among its transcription factors, to specialize them, to adapt to a new environment, and eventually to speciate. The overrepresentation of hemiascomycete genes among these transcription factors containing trinucleotide repeats supports this hypothesis of fast evolving genes.