Introduction

There has been considerable recent interest in the mutational processes leading to length polymorphisms within microsatellite loci, with several studies using microsatellite marker loci. A mutation model is sometimes assessed on the basis of its agreement with observed allele frequencies of microsatellite markers, although it is typical to use relatively small numbers of loci and modestly sized populations (e.g., Chakraborty et al. 1997; Di Rienzo et al. 1994; Estoup and Cornuet 1999; Shriver et al. 1993; Valdes et al. 1993). Mutation rates may also be estimated more directly from the observation of marker allele transmissions between generations using parent–child pedigree data (Brinkmann et al. 1998; Brohede et al. 2002; Ellegren 2000; Huang et al. 2002; Whittaker et al. 2003; Xu et al. 2000) or by genotyping sperm from a few individuals of known genotype (Leeflang et al. 1995). In addition, phylogenetic approaches based on microsatellite markers have been applied (Harr et al. 2000). The alternative strategy adopted by many workers is to extract many microsatellite loci from across the genome, an approach known as genomic data mining (e.g., Bell and Jurka 1997; Kruglyak et al. 1998; Rose and Falush 1998; Sibly et al. 2003). This approach, which has become possible thanks to the availability of genomic sequences, involves recording relevant features of all the microsatellites of a given type across the entire genome. The observed data for a selected motif type are then assessed for goodness of fit against a model of microsatellite slippage mutation.

There are currently inconsistencies between the results of marker-based and non-marker-based studies. For instance, the former suggest that the expansion rate exceeds the contraction rate for short microsatellites (e.g., Whittaker et al. 2003; Xu et al. 2000), although this thesis appears incompatible with observations made from genomic mining (e.g., Kruglyak et al. 1998; Sibly et al. 2003). The reasons for this discrepancy are unresolved. One possibility, pursued here, is that the differences arise from the way that microsatellite markers have been selected for their utility as genetic markers, which involves a complex ascertainment process. For example, for the AC microsatellites in the CEPH database (http://www.cephb.fr/cephdb/), which we concentrate on here, potential markers had to satisfy the requirements that (i) the sequence including flanking sequences had to reliably identify a long AC microsatellite; (ii) the microsatellite and its flanking region had to allow reliable amplification by PCR, and (iii) the microsatellite had to show high allelic polymorphism in the CEPH reference families (Weissenbach et al. 1992).

It is therefore possible that the inconsistencies between the results of marker-based and non-marker-based studies result from structural differences between markers and genomic microsatellites. Thus far, however, no quantitative comparison of the two classes of loci has been performed. Accordingly, we here compare the structure of marker and genomic microsatellites in terms of total length, base composition of the regions flanking the microsatellite, interruptions that divide the microsatellite into segments of perfect repeats, and the numbers and lengths of perfect segments. It turns out that the structures of marker and genomic AC microsatellites differ in several important respects. This is perhaps not surprising, but it is important: proper modeling of the effects of these differences on mutation rates is essential if we are to resolve the current inconsistencies in our understanding of microsatellite evolution.

Methods

The Fondation Jean Dausset CEPH database contains genotypes for all the genetic markers that have been used in the human linkage mapping project (Dib et al. 1996). On 9/4/2003, 5282 AC microsatellite sequences were retrieved automatically using the 5761 (AC)n microsatellite records in the CEPH database version 9.0. Some records lacked accession numbers or had invalid ones, so the corresponding sequences were not retrieved. For the purpose of analysis, an operational definition of the boundary between a microsatellite and its flanking sequence is needed, although it is not clear how this should be formulated (Santibanez-Koref et al. 2001). The retrieved sequences were tabulated according to the following criteria: (1) the microsatellite consists of a series of uninterrupted (AC)n sequences, which we call segments, in which two consecutive segments are separated by an interruption, here meaning a dinucleotide different from AC; (2) at least one of the segments comprises five or more repeats; and (3) the 10-base flanking sequences do not contain either an AC or a CA dinucleotide. These criteria are a minor variation of those of Sibly et al. (2003). Of the 5282 available sequences evaluated in this way, 940 were accepted. For each accepted microsatellite, we recorded its total length, number of segments, length of each segment, base sequence of the 10-base flanking regions, and interrupting dinucleotide motifs.
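The three acceptance criteria can be illustrated with a minimal sketch. It assumes that the repeat tract has already been delimited and is read in dinucleotide frame, which sidesteps the boundary problem noted above; the function names (split_segments, flank_is_clean, accept) are hypothetical and are not taken from the original analysis. Total length is taken here as the number of AC repeats summed over segments.

```python
from typing import List, Tuple

REPEAT = "AC"
MIN_LONGEST = 5   # criterion (2): at least one segment of five or more repeats
FLANK_LEN = 10    # criterion (3): 10-base flanking sequences


def split_segments(tract: str) -> Tuple[List[int], List[str]]:
    """Split a tract read in dinucleotide frame into perfect (AC)n segment
    lengths and the interrupting dinucleotides between them."""
    if len(tract) % 2 != 0:
        raise ValueError("tract must be a whole number of dinucleotides")
    units = [tract[i:i + 2] for i in range(0, len(tract), 2)]
    segments: List[int] = []
    interruptions: List[str] = []
    run = 0
    for unit in units:
        if unit == REPEAT:
            run += 1
        else:
            # Any dinucleotide other than AC closes the current segment.
            segments.append(run)
            interruptions.append(unit)
            run = 0
    segments.append(run)
    return segments, interruptions


def flank_is_clean(flank: str) -> bool:
    """Criterion (3): the 10-base flank contains neither AC nor CA."""
    return len(flank) == FLANK_LEN and "AC" not in flank and "CA" not in flank


def accept(tract: str, left_flank: str, right_flank: str) -> bool:
    """Apply criteria (1) to (3) to an already delimited tract (an assumption)."""
    segments, _ = split_segments(tract)
    single_interruptions = all(n >= 1 for n in segments)  # criterion (1)
    return (single_interruptions
            and max(segments) >= MIN_LONGEST
            and flank_is_clean(left_flank)
            and flank_is_clean(right_flank))
```

For instance, the tract ACACACACACGTACAC splits into segments of five and two repeats separated by the interruption GT, giving a total length of seven repeats.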

The complete sequence of the human genome was downloaded from ftp.ncbi.nih.gov/genomes/H_sapiens/April_14_2003/ on 16/04/2003 and the AC microsatellites within it were tabulated using the criteria above. This resulted in the extraction of the features of 18,308 AC microsatellites. In the following these are called “genomic microsatellites,” as opposed to the “marker microsatellites” obtained from the CEPH database.

Results

As expected, marker microsatellites show a distinct tendency to be longer than genomic microsatellites, with the majority of markers containing between 11 and 25 repeats compared with fewer than 10 repeats in most genomic microsatellites (Fig. 1). The shifted length distribution of marker microsatellites is an inevitable result of the way the markers were chosen [see requirement (i) in the Introduction]. The length distribution of genomic microsatellites agrees with previously published results (Sibly et al. 2003).

Figure 1

Log frequency distribution of the lengths of marker and genomic microsatellites.

Microsatellites are sometimes broken into segments by interruptions. Attention in this paper is restricted to the case in which the interruptions are dinucleotides. The frequencies of the 15 possible interruptions did not differ between marker microsatellites and the length-corrected genomic microsatellites ($\chi^2_{14} = 16.7$; n.s.). It follows that there has been neither conscious nor unconscious bias for (or against) a particular interruption motif during the process of designating microsatellite marker loci. We now consider analysis of the number and length of segments.

Number and Length of Segments

The length differences between genomic and marker microsatellites could confound further analyses. To circumvent this problem we controlled for length by creating a “length-corrected genomic” dataset, in which genomic microsatellites were matched by length to marker microsatellites. This was achieved by randomly selecting, for each marker microsatellite locus, two genomic microsatellite loci of the same length. Two was the largest number of genomic loci available for sampling since there are only six genomic microsatellites of length 33 but there are three markers of that length. Sampling was carried out without replacement.
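The matching procedure can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the analysis: loci are represented only by their lengths, and the function name length_matched_sample and its parameters are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def length_matched_sample(marker_lengths: Sequence[int],
                          genomic_lengths: Sequence[int],
                          per_marker: int = 2,
                          seed: int = 0) -> List[int]:
    """Return indices of genomic loci, per_marker for every marker locus,
    matched on length and drawn without replacement."""
    rng = random.Random(seed)
    # Group the genomic loci by length so each stratum is sampled separately.
    pool: Dict[int, List[int]] = defaultdict(list)
    for index, length in enumerate(genomic_lengths):
        pool[length].append(index)
    chosen: List[int] = []
    for length, n_markers in Counter(marker_lengths).items():
        needed = per_marker * n_markers
        candidates = pool.get(length, [])
        if len(candidates) < needed:
            raise ValueError(
                f"only {len(candidates)} genomic loci of length {length}, "
                f"but {needed} are required")
        # random.Random.sample draws distinct elements, i.e. without replacement.
        chosen.extend(rng.sample(candidates, needed))
    return chosen
```

With three markers and six genomic loci of length 33, a ratio of two uses every available genomic locus of that length, whereas a ratio of three would raise an error; this is the constraint that fixes the ratio at two.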

Intriguingly, there are dramatic differences in the number of segments in marker and genomic microsatellites after controlling for total microsatellite length (Fig. 2) ($\chi^2_4 = 48.4$, $p \approx 10^{-9}$). Marker microsatellites possess fewer segments than genomic microsatellites; for instance, there are six times as many four-segment microsatellites among the genomic set as among the markers. In each case, frequency declines approximately geometrically with number of segments, as demonstrated previously by Sibly et al. (2003) for genomic microsatellites, although it should be noted that estimates for five-segment microsatellites have a large standard error because there are few such loci. It is also noteworthy that the fraction of microsatellites lacking interruptions is higher among marker microsatellites than it is in genomic microsatellites. These results suggest that there was bias against the presence of multiple interruptions when selecting the microsatellite loci to be used as markers.
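A comparison of this kind can be sketched as a Pearson chi-square test of homogeneity on a table with one row per data set and one column per category; five segment-count classes give 4 degrees of freedom, and the 15 interruption motifs considered earlier give 14. The sketch below assumes that this is the form of the test used, which is not stated explicitly. The function names are hypothetical, and the tail probability uses the closed form that holds only for even degrees of freedom.

```python
import math
from typing import Sequence, Tuple


def pearson_chi2(table: Sequence[Sequence[float]]) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for a
    rows-by-columns table of counts (all marginal totals must be positive)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variate for even df:
    exp(-x/2) * sum_{k=0}^{df/2-1} (x/2)^k / k!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires positive, even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Reported statistics: 48.4 on 4 d.f. and 16.7 on 14 d.f.
p_segments = chi2_sf_even_df(48.4, 4)
p_interruptions = chi2_sf_even_df(16.7, 14)
```

Under these assumptions the tail probability for 48.4 on 4 degrees of freedom is of the order of 10^-9, and that for 16.7 on 14 degrees of freedom is well above 0.05, in line with the values reported in the text.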

Figure 2

Fraction of microsatellites with one to five segments for markers and a genomic data set corrected for length.

Further evidence of bias comes from comparing the segment lengths of the two types of microsatellite for each possible total length (Fig. 3). The results for the marker data set are most reliable for microsatellites containing between 11 and 24 repeats. As expected from Fig. 2, marker microsatellites within this range possess significantly fewer and longer segments than genomic microsatellites. Given that microsatellite markers were selected partly on the basis of exhibiting high levels of polymorphism between individuals, it is difficult to establish whether these differences reflect increased slippage rates, relative to point mutation rates, among the marker microsatellites or if the difference is entirely attributable to bias against the selection of “imperfect” microsatellite arrays for marker purposes. Note that the left-hand side in Fig. 3 is difficult to interpret because our definition of microsatellite requires that one segment be at least five repeats long. Hence microsatellites of length 5 have exactly one segment by definition, those of length 6 have at most two, length 7 at most three, and so on.
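The constraint on short microsatellites follows from simple arithmetic: a microsatellite with k segments needs at least five repeats in one segment and at least one in each of the others, so its total length is at least 5 + (k − 1). A minimal sketch, with hypothetical function names and with total length assumed to be the sum of segment lengths in repeats:

```python
from typing import Sequence

MIN_LONGEST = 5  # one segment must contain at least five repeats


def max_segments(total_repeats: int) -> int:
    """Largest number of segments compatible with a given total length."""
    if total_repeats < MIN_LONGEST:
        return 0  # no microsatellite of this length meets the definition
    # Beyond the five-repeat segment, every extra repeat can at most
    # open one additional single-repeat segment.
    return total_repeats - MIN_LONGEST + 1


def mean_segment_length(segments: Sequence[int]) -> float:
    """Mean length of the perfect segments of one locus."""
    return sum(segments) / len(segments)
```

For lengths 5, 6, and 7 the bound is one, two, and three segments, respectively, so the mean segment length at the left-hand side of Fig. 3 is constrained by the definition rather than by the mutational process.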

Figure 3

Mean length of segments in microsatellites of each length. “Genomic data” refers to the whole genomic data set. Vertical bars indicate standard errors.

Lastly, we show that the longest segment in marker microsatellites is disproportionately long. We saw in Fig. 3 that for a given length, marker microsatellites have longer segments than genomic microsatellites. For both microsatellite types, much of the length of the microsatellite is contained within the longest segment (Fig. 4). The proportion of total length accounted for by the longest segment naturally declines as the number of segments increases but surprisingly remains above 50% of the length on average in all cases. This pattern is noticeably more pronounced among the marker microsatellites, where the longest segment comprises a significantly greater proportion of the total sequence for all numbers of segments. This difference may result from direct (probably unconscious) bias for asymmetry among imperfect microsatellites for use as markers so that they contain at least one long segment. Alternatively, it may simply be a correlative association caused by experimenter selection for increased polymorphism, if long uninterrupted segments show a greater tendency for slippage.
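The summary plotted in Fig. 4 can be sketched as below. The representation of a locus as a list of segment lengths and the function name are assumptions made for illustration, and the standard error is taken to be the sample standard deviation divided by the square root of the number of loci.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev
from typing import Dict, List, Optional, Sequence, Tuple


def longest_fraction_by_segment_count(
        loci: Sequence[Sequence[int]]
) -> Dict[int, Tuple[float, Optional[float]]]:
    """For each number of segments, return the mean fraction of the total
    length held by the longest segment and its standard error
    (None when only one locus falls in the class)."""
    groups: Dict[int, List[float]] = defaultdict(list)
    for segments in loci:
        groups[len(segments)].append(max(segments) / sum(segments))
    summary: Dict[int, Tuple[float, Optional[float]]] = {}
    for n_segments, fractions in groups.items():
        if len(fractions) > 1:
            se: Optional[float] = stdev(fractions) / sqrt(len(fractions))
        else:
            se = None
        summary[n_segments] = (mean(fractions), se)
    return summary
```

Applying the same function separately to the marker set and to the length-corrected genomic set would give the two series compared in the figure.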

Figure 4

Fraction of microsatellite sequence (±standard error) in the longest segment, by number of segments.

Composition of Flanking Sequences

Table 1 shows the base composition of the flanking regions for the marker microsatellites (M) and length-corrected genomic microsatellites (CG). Intriguingly, there are significant differences in the base composition of these flanking regions (p < 0.001), as shown by a randomization test controlling for length. Overall, marker microsatellite flanking regions are far richer in C and G content, and so poorer in A and T. It is difficult to see how unconscious bias during the designation of loci for use as markers could account for this discrepancy. It seems rather more plausible that the base composition of the flanking region in some way influences the likelihood of slippage and/or interruptions such that the loci selected on the basis of being polymorphic tend also to have a shared base composition in the flanking regions. A very similar result, albeit on disease loci, has been obtained by Brock et al. (1999). Specifically, they found an association between the GC content of flanking regions and the “relative expandability” of the locus.
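The details of the randomization test are not given here, so the following is only one plausible form of a test that controls for length: data-set labels are permuted within each length class, and the statistic is the difference in mean flank GC fraction between markers and genomic loci. The choice of statistic, the labels "M" and "CG" as used in Table 1, and all function names are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def gc_fraction(flank: str) -> float:
    """Fraction of G and C bases in a flanking sequence."""
    return sum(base in "GC" for base in flank) / len(flank)


def mean_difference(values: Sequence[float], labels: Sequence[str]) -> float:
    """Mean of the marker values minus mean of the genomic values."""
    marker = [v for v, lab in zip(values, labels) if lab == "M"]
    genomic = [v for v, lab in zip(values, labels) if lab != "M"]
    return sum(marker) / len(marker) - sum(genomic) / len(genomic)


def stratified_permutation_p(lengths: Sequence[int],
                             labels: Sequence[str],
                             values: Sequence[float],
                             n_perm: int = 999,
                             seed: int = 0) -> float:
    """Two-sided permutation p-value, shuffling labels only among loci
    of the same total length."""
    rng = random.Random(seed)
    strata: Dict[int, List[int]] = defaultdict(list)
    for index, length in enumerate(lengths):
        strata[length].append(index)
    observed = abs(mean_difference(values, labels))
    hits = 0
    for _ in range(n_perm):
        shuffled = list(labels)
        for indices in strata.values():
            permuted = [labels[i] for i in indices]
            rng.shuffle(permuted)
            for i, lab in zip(indices, permuted):
                shuffled[i] = lab
        if abs(mean_difference(values, shuffled)) >= observed:
            hits += 1
    # Counting the observed arrangement keeps the estimate above zero.
    return (hits + 1) / (n_perm + 1)
```

Here each value would be computed as, for example, gc_fraction(left_flank + right_flank) for one locus; a per-base statistic could be substituted without changing the permutation scheme.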

Table 1 Base composition of the flanks in the marker and the corrected genomic data sets

Discussion

We have identified three major differences between markers and genomic microsatellites: markers are longer (Fig. 1); controlling for length, markers contain fewer interruptions (Fig. 2); and segment lengths are longer (Fig. 3).

The reasons for these differences are readily understood in terms of the way markers were designed. They probably all result from scientists selecting for high polymorphism. Thus, length increases mutation rate (Brinkmann et al. 1998; Ellegren 2000; Kruglyak et al. 1998; Sibly et al. 2001; Whittaker et al. 2003; Wierdl et al. 1997). Controlling for overall length, markers have fewer interruptions, and this increases slippage rates (Petes et al. 1997; Rolfsmeier and Lahue 2000). And the longest segment of markers is disproportionately long (Fig. 4). If slippage rate increases with segment length faster than linearly as argued (Brinkmann et al. 1998; Lai and Sun 2004; Rose and Falush 1998; Sibly et al. 2001), then the longest segment will have an overriding effect on the overall slippage rate of the microsatellite. Unconscious preference for long perfect arrays at the marker selection stage could also account for this difference. To sum up, the internal characteristics of marker microsatellite loci produce higher mutation rates in marker than in genomic microsatellites and, hence, are likely to produce higher rates of polymorphism. These differences are therefore plausibly the result of the marker design process.
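The dominance of the longest segment can be illustrated numerically. The sketch assumes, purely for illustration, that the slippage rate of a segment is proportional to its length raised to a fixed exponent; the cited studies argue for a faster-than-linear increase but this particular functional form, the exponent value, and the function name are hypothetical.

```python
from typing import List, Sequence


def slippage_shares(segments: Sequence[int], exponent: float) -> List[float]:
    """Share of the total slippage rate contributed by each segment,
    assuming rate is proportional to length ** exponent (an assumption)."""
    weights = [length ** exponent for length in segments]
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical locus with segments of 12, 3, and 2 repeats.
linear = slippage_shares([12, 3, 2], exponent=1.0)
quadratic = slippage_shares([12, 3, 2], exponent=2.0)
```

For this hypothetical locus the longest segment holds about 71% of the repeats (12 of 17), but under a quadratic dependence it would account for about 92% of the slippage rate (144 of 157).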

While the internal structure of markers seems to be associated with higher mutation rates, this is not true of their flanking sequences. The flanking sequences of marker microsatellites contain more C and G, and fewer A and T, than do those of genomic microsatellites. C–G pairings contain an extra hydrogen bond and so should be more stable than A–T pairings and therefore less prone to slippage. A possible reason for the GC-rich flanks of markers is that added stability in the flanking region stabilizes the microsatellite during PCR amplification and so renders it less prone to forming artifactual “stutter” bands. Certainly, loci generating such bands are less likely to be selected as markers because of difficulty in allele designation after fractionation by electrophoresis.

In conclusion, our results indicate that marker and genomic microsatellite loci differ profoundly in terms of their length and composition and in the sequence of their flanking regions. Future analyses should take account of these differences. A unified theory is needed that can account both for the genomic distribution of interrupted microsatellites as shown in Figs. 1–4 and for experimental observations of mutation frequencies and step sizes in relation to microsatellite structure. A unified model might perhaps write mutation rate for a given step size $= f(l_1, m_1, l_2, m_2, l_3, m_3, \ldots)$, where $l_i$ is the length of the $i$th segment and $m_i$ is the identity of the $i$th interruption. Models of this type would ideally be fitted in all types of analysis, to provide a unified account of the effects of structural characteristics on the mutational mechanism.