Introduction

The highly selective splicing of pre-mRNAs is a critical step in the production of functional proteins. Approximately 15% of inherited genetic disorders in humans are caused by deleterious mutations that interfere with mRNA splicing (Cooper and Mattox 1997; Blencowe 2000; Steiner et al. 2004). Furthermore, splicing is an important source of protein diversity since a large proportion (∼60%) of human genes is alternatively spliced (Maniatis and Tasic 2002), enabling the production of >90,000 different proteins from only 20,000–25,000 protein-coding genes (Harrison et al. 2002; International Human Genome Sequencing Consortium 2004). During mRNA splicing, introns are removed from pre-mRNAs by the spliceosome, a large complex consisting of five small nuclear RNAs (snRNAs) and many different protein factors (Staley and Guthrie 1998; Rappsilber et al. 2002). Certain features of mRNAs allow the spliceosome to distinguish exons from introns, including conserved motifs at exon/intron boundaries (i.e., splice sites) and motifs within the introns such as the polypyrimidine tract and the branch point. However, apart from the highly conserved GU and AG dinucleotides at the 5′ and 3′ ends of introns, respectively, splice site consensus sequences are quite degenerate. In large introns it is possible to find many pseudo splice sites that resemble the consensus sequence more closely than do the true splice sites (Sun and Chasin 2000). Since these features alone cannot support the observed precision with which splicing occurs, additional information must be required to ensure the proper splicing of mRNAs (Cartegni et al. 2002).

Recently, attention has focused on features within exons that may affect splice site recognition (Berget 1995; Liu et al. 1998; Cartegni et al. 2002, 2003; Zhang et al. 2003; Fairbrother et al. 2002, 2004). The identification of exonic features that affect splicing is complicated by the fact that exons also specify the amino acid sequence of the encoded proteins. As opposed to conserved motifs in introns, conserved regions in exons usually reflect selection to maintain protein function. Experimental and computational approaches have been applied to identify exonic splicing enhancers (ESEs), cis elements in exons that are involved in constitutive and alternative splicing. Fairbrother et al. (2002) developed a computational approach they termed RESCUE (Relative Enhancer and Silencer Classification by Unanimous Enrichment), which predicted 238 hexamers that function as ESEs in the human genome based on statistical analyses of exon-intron and splice site base composition. The 238 RESCUE-ESEs fell into 10 clusters based on sequence composition. In vitro tests of splicing efficiency for one hexamer from each cluster provided strong evidence that the 10 hexamers functioned as ESEs (Fairbrother et al. 2002). However, in vitro assays are time-consuming and expensive, so another method of validating the remaining putative ESEs was desirable.

Toward that end, Fairbrother et al. (2004) used the human single nucleotide polymorphism (SNP) database to confirm the functionality of the 238 ESEs, focusing on biallelic SNPs in internal exons. They distinguished the variant allele from the ancestral allele by comparing the exon sequence from humans to that from the chimpanzee, and the allele shared with the chimpanzee was designated the ancestral allele. For each biallelic SNP they surveyed, they could determine whether the ancestral or variant allele contained a motif that matched 1 or more of the 238 ESEs. This analysis allowed them to divide SNPs into four categories with respect to ESE function: ESE neutral, ESE disruption, ESE creation, and ESE alteration. ESE neutral SNPs were defined as those where neither allele matched a RESCUE-ESE. ESE disruption SNPs were defined as those where only the ancestral allele matched a RESCUE-ESE. Conversely, ESE creation SNPs were defined as those where only the variant allele matched a RESCUE-ESE. ESE-alteration SNPs were defined as those where both alleles matched ESEs. The latter category was classified as deleterious based on the reasoning that each ESE is likely recognized by a unique SR protein, so that any change in an ESE might negatively affect splicing. In comparison with simulated SNPs in human exons, Fairbrother et al. (2004) found an excess of ESE neutral SNPs, providing evidence that purifying selection removed alleles that negatively affect splicing and supporting the functionality of the 238 RESCUE-ESEs.

Surprisingly, Fairbrother et al. (2004) did not find a statistically significant difference between synonymous and nonsynonymous sites in the selective pressure to remove SNPs that disrupt ESEs, although they did observe a higher frequency of ESE neutral mutations in synonymous versus nonsynonymous SNPs. Nonsynonymous SNPs are less likely to be selectively neutral than synonymous SNPs; it has been estimated that ∼80% of human nonsynonymous mutations are deleterious (Fay et al. 2001). Deleterious nonsynonymous SNPs are of more recent origin and tend to occur at lower frequencies than synonymous SNPs (Fay and Wu 2003). A fraction of nonsynonymous SNPs is adaptive, and the beneficial alleles are in the process of sweeping to fixation. In either case, selection at the protein level would often negate selection for ESE function at the mRNA level. Therefore, due to interference from selection at the protein level, mutations that are deleterious to ESE function are predicted to be more common in nonsynonymous SNPs than in synonymous SNPs.

We adopted a different strategy for using the SNP database as an independent means of validating the 238 RESCUE-ESEs and also for determining whether synonymous and nonsynonymous SNPs differ with respect to purifying selection for ESE function. Here we compare the distribution of RESCUE-ESEs in biallelic and triallelic SNPs within five different functional classes of the genome (coding synonymous, coding nonsynonymous, intron, UTR, and nongenic SNPs). Since ESE motifs are not expected to function outside of exons, we hypothesized that SNPs that disrupt ESEs should be more common in these regions than in exons. Because the base composition of the different functional classes of the genome varies (e.g., exons tend to have a high GC content), we compared the observed differences with expected differences generated from analysis of 1000 bootstrap replicates of 238 randomly sampled hexamers. Our results provide additional support for the conservation of the 238 RESCUE-ESEs. However, in contrast to Fairbrother et al. (2004), we observed a significantly higher proportion of ESE-neutral mutations in synonymous SNPs than in nonsynonymous SNPs.

Materials and Methods

Collection of SNP Data

Build 121 of the SNP database was downloaded from the NCBI FTP site for the SNP genotype data in dbSNP: ftp://ftp.ncbi.nih.gov/snp/human/. The NCBI Entrez system (http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Snp&cmd=Limits ) was used to query the SNP database and to partition the SNP data. Only the fully annotated variations (= refSNPs) were included in the analyses. RefSNPs are annotated by the NCBI on reference genome sequence contigs. For each chromosome, the SNP data were partitioned into the following five functional classes using the NCBI function codes: coding synonymous, coding nonsynonymous, intron, mrna utr, and nongenic. Nongenic SNPs were identified as SNPs not annotated as any of the other four function codes, and SNPs within 2 kb 5′ or 500 bp 3′ of a gene feature (function code = “Locus region”) were also excluded from the nongenic function class. To minimize the impact of nonindependence, only SNPs with a map weight of 1 were included in the search query where the map weight corresponds to the number of times a SNP maps to the genome contig. Only true SNPs (SNP class = snp) were included. Each query generated a file containing a list of rs numbers (refSNP identifiers) that satisfied the conditions of the query. The set of SNPs in each functional class for each chromosome was then analyzed using the methods described below. Pooled autosomes and sex chromosomes were analyzed separately.

Searching SNPs for ESEs

For any given allele, there are six hexamers that must be surveyed for the presence of an ESE because the polymorphic base may occur at any of six positions. Thus, each biallelic SNP contains 12 hexamers (=6 hexamers per allele × 2 alleles), and each triallelic SNP contains 18 hexamers. For each SNP, all hexamers were examined for the presence of an ESE, defined as any hexamer from the list of 238 hexamers of Fairbrother et al. (2002). The density of ESEs in biallelic SNPs was calculated as the number of hexamers that match 1 of the 238 putative ESEs divided by the total number of hexamers. For example, in a survey of 10 biallelic SNPs there are 120 hexamers. If 6 of those hexamers match 1 of the 238 ESEs, then the density of ESEs in that set of biallelic SNPs will equal 5% (=100 × 6/120). Five hundred forty-three SNPs containing fewer than 5 bp of flanking sequence on either side of the polymorphism were excluded from the analysis because it was not possible to survey all possible hexamers in such cases. These excluded SNPs were proportionally distributed among the five functional classes examined and constituted only 0.007% of all SNPs surveyed.
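The hexamer survey and density calculation described above can be sketched as follows. This is a minimal illustration, not the authors' program: the function names, the tuple layout for a SNP (left flank, alleles, right flank), and the toy ESE set are assumptions introduced here.

```python
def allele_hexamers(left_flank, allele, right_flank, k=6):
    """Return the k overlapping k-mers that contain the polymorphic base."""
    if len(left_flank) < k - 1 or len(right_flank) < k - 1:
        # Mirrors the exclusion of SNPs with fewer than 5 bp of flanking sequence.
        raise ValueError("need at least k-1 bp of flanking sequence on each side")
    context = left_flank[-(k - 1):] + allele + right_flank[:k - 1]
    return [context[i:i + k] for i in range(k)]


def ese_density(snps, ese_set):
    """Percentage of surveyed hexamers that match a hexamer in ese_set.

    snps: iterable of (left_flank, alleles, right_flank) tuples (hypothetical layout).
    """
    total = 0
    hits = 0
    for left, alleles, right in snps:
        for allele in alleles:
            for hexamer in allele_hexamers(left, allele, right):
                total += 1
                hits += hexamer in ese_set
    return 100.0 * hits / total
```

With this layout, a biallelic SNP contributes 12 hexamers and a triallelic SNP contributes 18, matching the counts given in the text.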

For each of six positions at biallelic SNPs, there are three possible patterns: neither allele is an ESE (=type 0), only one allele is an ESE (=type 1), or both alleles are ESEs (=type 2). A java script was written to determine the number of positions in each SNP falling into each of the three categories (source code available from the authors upon request). Type 0 SNPs are neutral with respect to ESE function because neither allele is an ESE. Type 1 SNPs, in which one allele is an ESE and the other is not, are deleterious for one of the following two reasons. First, if the common (or ancestral) allele is an ESE and the mutant (or derived) allele is not an ESE, then the mutation represents the loss of an ESE, which would abrogate ESE function and negatively impact the splicing efficiency of the associated transcript. Alternatively, if the common (or ancestral) allele is not an ESE and the mutant (or derived) allele is an ESE, then the mutation represents the gain of an ESE. In the majority of cases, this would probably negatively affect fitness because it would lead to enhanced splicing efficiency in regions where enhanced splicing might be detrimental. In some instances such mutations could create ectopic splicing by activating cryptic splice sites (Cartegni et al. 2002). The classification of type 2 biallelic SNPs is not as straightforward as the classification of types 0 and 1. Type 2 SNPs could be considered neutral because the replacement of one ESE by another with a highly similar sequence may not significantly affect splicing efficiency. Although there is a variety of SR proteins that recognize different ESEs, the consensus recognition sequences for individual SR proteins are surprisingly degenerate (Graveley 2000), such that exchanges between ESEs that differ by a single mutation may not result in a significant loss (or gain) of splicing activity.
On the other hand, it is possible that ESEs that differ at a single position might be recognized by entirely different SR proteins, or might be recognized with different affinities by the same SR protein, such that the effects of a point mutation might be deleterious. For these reasons, we opted to treat type 2 SNPs as distinct from type 0 and type 1 SNPs, and analyzed the biallelic SNPs using 2 × 3 contingency tables (see Statistical Analysis, below).
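The per-position classification can be illustrated with a short sketch. It is not the authors' program; the function name, argument layout, and the toy ESE set in the usage note are assumptions made for illustration. For each of the six window positions it counts how many alleles yield an ESE, which gives the type (0, 1, or 2 for a biallelic SNP).

```python
from collections import Counter


def position_types(left_flank, alleles, right_flank, ese_set, k=6):
    """For each of the k window positions, count the alleles whose hexamer is an ESE.

    Assumes at least k-1 bp of flanking sequence on each side.
    """
    types = []
    for i in range(k):
        n_ese_alleles = 0
        for allele in alleles:
            context = left_flank[-(k - 1):] + allele + right_flank[:k - 1]
            if context[i:i + k] in ese_set:
                n_ese_alleles += 1
        types.append(n_ese_alleles)
    return types


# Toy example with a made-up ESE set (not the RESCUE-ESE list).
toy_ese = {"AAAAAG", "AAAAAT", "AAAAGC"}
types = position_types("AAAAA", "GT", "CCCCC", toy_ese)
tally = Counter(types)  # number of positions of each type for this SNP
```

Passing three alleles instead of two yields counts from 0 to 3, so the same sketch would cover the four patterns of a triallelic SNP.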

For triallelic SNPs, there are four possible patterns at each of the six positions: none of the three alleles are ESEs (type 0); one allele is an ESE and the other two are not (type 1); two alleles are ESEs and the other is not (type 2); or all three alleles are ESEs (type 3). In statistical comparisons between functional classes of SNPs, we treated each category independently using 2 × 4 contingency tables. The distribution of ESEs in rare tetra-allelic SNPs could not be analyzed in this study due to insufficient sample size in Build 121 of dbSNP.

Statistical Analysis

Comparisons between functional classes of SNPs were conducted using contingency tables. The results from contingency tables were tested for statistical significance using the G-test of independence in conjunction with Williams’s correction (Sokal and Rohlf 1995). Because the nucleotide composition of the different functional classes varies, it was necessary to compare the p values from contingency tests on the 238 RESCUE-ESEs to p values obtained from analysis of random hexamers. The frequencies of ESEs in different functional classes of SNPs are not directly comparable. The probability of encountering any given hexamer is highly dependent on base composition. For example, by virtue of how they were identified, RESCUE-ESEs are more common in exons than introns (Fairbrother et al. 2002). Therefore, the probability of a random mutation eliminating an ESE is higher in exons than in introns. The nucleotide composition of coding sequence and introns differs from noncoding sequence, potentially adding an additional source of bias. Bootstrapped sets of 238 hexamers were generated to determine whether the observed departure from expectation was due to differences in nucleotide composition and/or ESE density, or whether those differences were due to natural selection. One thousand sets of 238 random hexamers were generated and analyzed in tandem with the set of 238 hexamers from Fairbrother et al. (2002). Each set of 238 random hexamers conformed to two criteria: (i) each contained a single copy of each hexamer (i.e., each of the 238 hexamers was unique), and (ii) the 238 random hexamers did not overlap with any of the 238 RESCUE-ESEs. These criteria ensured that each set of 238 hexamers was appropriate for comparison to the set of 238 putative ESEs. The frequency of bootstrap replicates with a p value less than or equal to that for the 238 ESEs was then used as the measure of significance for each pairwise test among functional classes of SNP. 
The Dunn-Sidak method of sequential Bonferroni tests was employed to correct for spurious statistical significance arising from multiple tests.
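As a rough sketch of the test statistic, the G-test of independence with Williams's correction can be computed for an r × c table of counts as shown below. The function names are introduced here for illustration, and the closed-form tail probability applies only to 2 degrees of freedom (the 2 × 3 case); a 2 × 4 table would need a general chi-square routine.

```python
import math


def g_test_williams(table):
    """Return (G adjusted by Williams's correction, degrees of freedom).

    table: list of rows, each a list of non-negative counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # G = 2 * sum(O * ln(O / E)), with E = row total * column total / n.
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:
                expected = row_totals[i] * col_totals[j] / n
                g += observed * math.log(observed / expected)
    g *= 2.0

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    # Williams's correction factor q; the adjusted statistic is G / q.
    q = 1.0 + (
        (n * sum(1.0 / t for t in row_totals) - 1.0)
        * (n * sum(1.0 / t for t in col_totals) - 1.0)
    ) / (6.0 * n * df)
    return g / q, df


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variate with exactly 2 df."""
    return math.exp(-x / 2.0)
```

For a 2 × 3 table (two functional classes by types 0, 1, and 2), the raw p value would be `chi2_sf_df2(g_adj)`.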

Results

Overall Density of ESEs in SNPs

Biallelic SNPs in the five functional classes differed in overall ESE density (Table 1). Among the autosomes, nongenic SNPs contained the largest proportion of ESE hexamers, at 7.68%, whereas synonymous SNPs contained the fewest, at 6.69%. The proportion of nonsynonymous SNPs containing ESEs, 7.46%, was much closer to the fraction at nongenic SNPs than at synonymous SNPs. The corresponding proportions at introns (7.25%) and UTRs (7.20%) were intermediate between those at the synonymous and nongenic classes. The differences in ESE density among the different functional classes were highly significant (χ2 = 5956.82, p = 0). The observed number of ESEs within autosomal synonymous SNPs was 26,793, approximately 11% fewer than the 30,088 expected based on the overall frequency of ESEs within SNPs of 7.52%.

Table 1 Frequency of ESEs in biallelic SNPs surveyed

Biallelic SNPs on the X and Y chromosomes had lower frequencies of ESEs than the autosomal SNPs (Table 1). As was the case for the autosomes, synonymous SNPs on the sex chromosomes had the lowest frequencies of ESEs at 6.86% and 3.15% for the X and Y chromosomes, respectively, while nongenic SNPs had the highest frequencies of ESEs at 7.90% and 6.55% for the X and Y chromosomes, respectively. For the X and Y chromosomes, the differences in ESE density among the different functional classes were highly significant (X chromosome: χ2 = 61.09, p = 1.71 × 10−12; Y chromosome: χ2 = 84.70, p = 1.76 × 10−17). Within the same functional classes, the frequencies of ESEs in SNPs on the Y chromosome were substantially lower than on the X chromosome or autosomes.

The frequency of ESEs in triallelic SNPs was slightly lower than that in biallelic SNPs (Table 2). The relative frequency among functional classes was quite similar to that for biallelic SNPs, with synonymous SNPs having the lowest frequency of ESEs and nongenic SNPs having the highest. The differences in autosomal ESE density among the different functional classes were significant (χ2 = 21.73, p = 2.27 × 10−4). The observed number of ESEs within autosomal synonymous triallelic SNPs was 91, approximately 30% fewer than the 130 expected based on the overall frequency of ESEs within triallelic SNPs of 7.20%. Very few triallelic SNPs on the X and Y chromosomes were available as of Build 121 of dbSNP, so it was not possible to conduct meaningful statistical comparisons between functional classes of triallelic SNPs on the sex chromosomes.

Table 2 Frequency of ESEs in triallelic SNPs surveyed

Frequency of Deleterious Versus Neutral SNPs

The proportion of SNPs in each of the three categories of autosomal biallelic SNPs differed among the different functional classes (Table 3). Synonymous SNPs contained the highest proportion of neutral type 0 variants, whereas nongenic SNPs contained the highest proportion of deleterious type 1 variants. Nonsynonymous SNPs had a smaller fraction of neutral variants, and a larger fraction of deleterious variants, than SNPs in either introns or UTRs. Type 2 variants were also the least common among synonymous SNPs and most common among nongenic SNPs. This observation is in accordance with Fairbrother et al. (2004), who considered exchanges among ESEs to be deleterious. Based on the overall frequencies of type 0, 1, and 2 SNPs in all functional classes combined, type 0 synonymous SNPs were 1.64% more common, and type 1 and 2 SNPs were 14.34% and 8.33% less common, respectively, than expected. Taken together, type 1 and 2 SNPs were 13.11% less common than expected.

Table 3 Frequency of biallelic SNP types in the five functional classes

The relative proportion of SNPs in each of the three categories of biallelic SNPs on the X and Y chromosomes roughly mirrored that observed for autosomal SNPs (Table 3). For both the X and the Y chromosomes, synonymous SNPs had the greatest proportion of neutral type 0 variants, whereas nongenic SNPs had the smallest proportion of type 0 variants. Likewise, nongenic SNPs had higher frequencies of type 1 and type 2 variants. SNPs on the Y chromosome had higher frequencies of type 0 variants than autosomal or X-linked SNPs for all five functional classes.

The proportion of SNPs in each of the four categories of autosomal triallelic SNPs differed among the different functional classes (Table 4). Synonymous SNPs contained the highest proportion of neutral type 0 variants, whereas nongenic SNPs contained the lowest proportion of neutral variants. Nongenic SNPs also had the highest proportion of deleterious variants (type 1 + type 2). Type 3 variants (all three alleles are ESEs) were most common among synonymous SNPs, although this unexpected observation may be due to sampling error since only 100 triallelic synonymous SNPs were available in Build 121 of the SNP database. There was a greater difference in the proportion of type 0 variants at synonymous SNPs versus type 0 variants at other functional classes of SNPs for triallelic SNPs (4.46–5.86%) than for biallelic SNPs (0.94–1.72%). Based on the overall frequencies of type 0, 1, and 2 SNPs in all functional classes combined, type 0 synonymous SNPs were 6.04% more common than expected, and type 1 and 2 SNPs were 61.98% and 61.64% less common than expected, respectively. Very few type 3 triallelic SNPs were observed, so the comparison of 8 observed versus 7.05 expected yields a relatively large proportional difference (11.81%) between observed and expected type 3 triallelic SNPs. Taken together, type 1, 2, and 3 SNPs were 52.37% less common than expected.

Table 4 Frequency of triallelic SNP types in the five functional classes

Pairwise Contingency Tests Between SNP Functional Classes

Pairwise 2 × 3 contingency tests were conducted to test if the frequencies of type 0, 1, and 2 SNPs were significantly different among functional classes. For example, synonymous SNPs had 88.89%, 8.84%, and 2.27% of type 0, 1, and 2 SNPs, respectively. The corresponding frequencies for nongenic SNPs were 87.17%, 10.29%, and 2.54%. If one were to assume that the frequency of SNP types did not differ between the two functional classes, then the expected number of SNPs of each type can be calculated. For synonymous SNPs, the observed number of type 0, 1, and 2 SNPs was 29,657, 2949, and 757, respectively. The expected number of type 0, 1, and 2 SNPs is 29,087, 3430, and 847. Therefore, 570 more type 0 SNPs were observed than were expected based on the expectation that the frequencies would not differ between synonymous and nongenic SNPs. Likewise, 480 fewer type 1 synonymous SNPs and 89 fewer type 2 synonymous SNPs were observed than expected.
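The expected counts in this example follow the usual rule for a contingency table under independence. As a sketch, with symbols introduced here for illustration (R_i is the total for functional class i, C_j is the total for SNP type j across both classes, and N is the grand total):

```latex
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
O_{\mathrm{syn},0} - E_{\mathrm{syn},0} = 29{,}657 - 29{,}087 = 570
```

Because nongenic SNPs dominate the pooled totals, the expected frequencies for synonymous SNPs lie close to the nongenic frequencies (for type 0, 29,087 of 33,363 is about 87.2%).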

The results of pairwise contingency tests strongly support functional constraints on the Fairbrother et al. (2002) set of 238 hexamers identified as ESEs (Table 5). However, due to differences in base composition among the different functional classes of SNPs, the raw p values resulting from pairwise contingency tests are not appropriate in assessing the statistical significance of the results. Therefore, 1000 bootstrap replicates of 238 random hexamers were analyzed in the same manner as the 238 ESEs. “Bootstrap p” represents the proportion of bootstrap replicates exhibiting a raw p value less than or equal to the p value for the ESE hexamers. Thus, a bootstrap p value of 0.05 corresponds to the case when 50 of the 1000 sets of 238 hexamers exhibited p values less than or equal to that for the observed data. After Bonferroni corrections for multiple tests, the following comparisons were statistically significant for the autosomal SNPs: synonymous vs. nonsynonymous, synonymous vs. nongenic, intron vs. nongenic, and UTR vs. nongenic (Table 5). The only significant comparison for the nonsynonymous class was that with synonymous SNPs, indicating that purifying selection to maintain ESE function is stronger in synonymous SNPs.
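The random hexamer sets and the bootstrap p value can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, and the raw p values for the replicates are assumed to come from running the same contingency test on each random set.

```python
import itertools
import random


def random_hexamer_set(excluded, size=238, rng=random):
    """Sample `size` distinct hexamers, none of which appears in `excluded`."""
    all_hexamers = ["".join(p) for p in itertools.product("ACGT", repeat=6)]
    candidates = [h for h in all_hexamers if h not in excluded]
    # Sampling without replacement guarantees that each hexamer is unique.
    return set(rng.sample(candidates, size))


def bootstrap_p(observed_p, replicate_ps):
    """Proportion of replicates whose raw p value is <= the observed raw p value."""
    return sum(p <= observed_p for p in replicate_ps) / len(replicate_ps)
```

For example, if 50 of 1000 random sets give a raw p value at or below that of the 238 ESEs, `bootstrap_p` returns 0.05, matching the definition in the text.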

Table 5 Results of pairwise 2 × 3 contingency tests among functional classes of biallelic SNPs

Because the X and Y chromosomes contained far fewer SNPs than the pooled autosomal SNPs, fewer of the pairwise comparisons were statistically significant, although differences in the frequencies of type 0 and type 1/2 SNPs were as pronounced as the corresponding frequency differences for the autosomes. For X-linked SNPs, only the synonymous vs. UTR and intron vs. nongenic comparisons remained significant after correction for multiple tests. For SNPs on the Y chromosome, the synonymous vs. nongenic, intron vs. UTR, and intron vs. nongenic comparisons were significant.

Pairwise 2 × 4 contingency tests on triallelic autosomal SNPs also provided strong evidence for functional constraint on the 238 ESEs in synonymous SNPs (Table 6). All comparisons between synonymous SNPs and SNPs of the other four functional classes were statistically significant. As was the case for biallelic SNPs, the only significant comparison for the nonsynonymous class was that with synonymous SNPs. The paucity of triallelic SNPs on the sex chromosomes precluded comparisons among functional classes.

Table 6 Results of pairwise 2 × 4 contingency tests among functional classes of triallelic SNPs

Discussion

Our comparison of ESEs in different functional classes of SNPs revealed that ESE frequencies differed among the functional classes, with synonymous SNPs containing the lowest frequency of ESEs and nongenic SNPs containing the highest (Tables 1 and 2). This is consistent with the expectation that SNPs in functional ESEs should be rarer than SNPs at nonfunctional positions due to the removal of deleterious variants by purifying selection (Li 1997). On average, synonymous mutations in the vicinity of elements that affect splicing would be deleterious, and therefore underrepresented, relative to mutations in the same context at positions that do not affect splicing, such as in the nongenic portions of the genome. The ∼11% underrepresentation of SNPs in ESEs at synonymous sites makes sense in light of the distribution of SNPs in exons and introns, which is lowest near the exon-intron boundaries and increases with distance from those boundaries (Majewski and Ott 2002; Fairbrother et al. 2004).

The diminished ESE densities in protein coding SNPs on the sex chromosomes relative to the autosomes are also indicative of purifying selection. Due to the fact that X-linked loci are hemizygous in males, and Y-linked loci are always hemizygous, deleterious alleles are removed from the population more efficiently on the sex chromosomes (Charlesworth et al. 1993). Thus, the reduced density of ESEs in synonymous and nonsynonymous SNPs on the sex chromosomes probably reflects a higher efficiency of selection on the sex chromosomes.

We found significant differences in the frequencies of the SNP patterns (i.e., types 0, 1, 2, and 3) among the different functional classes. Type 0 SNPs were most common at synonymous sites on the autosomes (bi- and triallelic) and on both sex chromosomes (biallelic). Mutations that eliminate or create an ESE hexamer are predicted to occur less frequently in regions where that ESE would be functional. Therefore, one would expect a paucity of type 1 and type 2 SNPs in transcribed mRNAs relative to nongenic regions because such SNPs would negatively affect splicing, either through reducing splicing efficiency when an ESE is eliminated or through enhancing splicing when an ESE is created. While there are numerous examples of the deleterious effects of ESE elimination (Shiga 1997; Fackenthal et al. 2002; Moseley 2002; Pfarr et al. 2005), the novel creation of an ESE through mutation is also expected to be deleterious in most cases (Cooper and Mattox 1997; Graveley 2001). The de novo creation of an ESE through mutation could either trigger splicing when unnecessary, e.g., within an exon, resulting in a drastically altered protein, or it could enhance splicing above the current level which has been optimized by natural selection for that particular mRNA. Relative to nongenic SNPs, we did observe a significantly reduced frequency of type 1 and 2 biallelic SNPs at synonymous sites, and within introns and UTRs. The reduction in type 1 and 2 biallelic SNPs at synonymous sites and in UTRs (which are also spliced) is straightforward, but such a reduction in introns is somewhat surprising because intronic splicing enhancers (ISEs) are thought to be distinct from ESEs in base composition (Yeo et al. 2004). The reduction of type 1 and 2 SNPs in introns suggests that there might be a significant overlap in splice control elements (SCEs). A SNP-based validation of ISEs might reveal that disruption of RESCUE-ISE hexamers in synonymous SNPs is avoided. 
This might be particularly true of short internal exons, which require special enhancing sequences in adjacent introns (Sterner and Berget 1993; Berget 1995; Carlo et al. 2000).

Relative to synonymous SNPs, or SNPs in introns and UTRs, nonsynonymous SNPs exhibited higher frequencies of type 1 and 2 SNPs. Because a large proportion of nonsynonymous SNPs is likely to have strong negative effects on the function of the encoded proteins (Sunyaev et al. 2000; Fay et al. 2001; Ng and Henikoff 2001), the difference in fitness between type 0 and type 1 or 2 SNPs at nonsynonymous sites is likely to be negligible, since in all three cases one allele is deleterious due to its effect on protein function. In contrast, at synonymous sites, both alleles of type 0 SNPs are predicted to be selectively neutral with respect to splicing and protein function, whereas a single allele of type 1 and 2 SNPs may be deleterious with respect to splicing. Thus, due to selective constraint, type 0 SNPs are overrepresented among synonymous SNPs relative to the other functional classes.

Fairbrother et al. (2004) provided evidence for purifying selection acting on the 238 RESCUE-ESEs. That study examined biallelic SNPs in internal exons, and the allele polarity of each SNP was determined through comparison with the chimpanzee orthologue. The use of genomic sequence data from the chimpanzee for outgroup comparison involves two assumptions. First, although the assumption that the allele matching the chimpanzee allele is ancestral is valid for most SNPs, it is most certainly not valid for all of them since evolutionary changes are not restricted to the human lineage. Any changes on the chimpanzee lineage after divergence from a common ancestor would incorrectly be classified as ancestral. Second, the chimpanzee allele was determined through unassembled reads from the chimpanzee genome project. As additional chimpanzee sequences become available, it is likely that a significant number of these internal exons will be polymorphic at positions homologous to those in human SNPs. In cases where the two chimp alleles match the two human alleles, designation of ancestral alleles would require comparisons with additional outgroup species.

Our approach does not require any assumptions about allele polarity or the dynamics of mutation. In effect, what we designate type 1 SNPs is equivalent to the sum of the ESE disruption and ESE creation categories of Fairbrother et al. (2004). On the whole, these two categories are expected to be deleterious (Cooper and Mattox 1997; Graveley 2001). Our data from analyses of biallelic SNPs follow this expectation: the frequency of type 1 SNPs was lowest in the synonymous class for the autosomes and sex chromosomes.

Perhaps the most important difference between the results of this study and that of Fairbrother et al. (2004) pertains to the level of selection at synonymous versus nonsynonymous sites. Fairbrother et al. (2004) did not detect a statistically significant difference in the strength of purifying selection to avoid disrupting ESEs at synonymous and nonsynonymous SNPs, whereas we demonstrate that synonymous SNPs are under stronger selective constraint. This difference may be due to methodological differences between the two studies, but it might also be due to differences in the data sets analyzed. We examined a much larger number of synonymous and nonsynonymous SNPs (77,047 versus 8408 biallelic SNPs), which provided us more power to detect subtle differences in the frequencies of ESEs at synonymous versus nonsynonymous SNPs. Unlike Fairbrother et al. (2004), we did not confine our analyses to internal exons. There may be a difference in ESE functional constraint in terminal versus internal exons, as internal exons are more likely to be alternatively spliced than terminal exons; conversely, terminal exons are probably more likely to undergo constitutive splicing (Sorek and Ast 2003; Phillips et al. 2004). If this is true, then ESE disruption in constitutively spliced exons would be more deleterious than ESE disruption in alternatively spliced exons because all mRNA isoforms would be affected by disrupted ESEs in constitutively spliced exons, whereas only a subset of mRNA isoforms would be affected by disrupted ESEs in alternatively spliced exons. Therefore, the difference in selective constraint on ESE function at synonymous versus nonsynonymous SNPs would be more profound in terminal exons than internal exons. Inclusion of terminal exons resulted in a larger proportion of constitutively spliced exons in our data set and could account for the significant difference that we observed in levels of selection for ESE function at synonymous versus nonsynonymous sites.

The splice sites of alternative exons are generally weaker than those of constitutive exons (Ast 2004), and alternative exons may contain more ESEs than constitutive exons (Cartegni et al. 2002; Graveley 2001). However, a systematic comparison of ESE density in constitutive versus alternative exons has not been conducted to date. Even if ESEs are more common in alternative exons, those ESEs that are present in constitutive exons should be under greater selective pressure to retain function for the reasons outlined above. This hypothesis can easily be tested by comparing the frequency with which SNPs disrupt ESEs in constitutive versus alternative exons. A SNP-based approach might also determine whether there is a difference in the types of ESEs used in constitutive versus alternative exons.