Introduction

It is commonly believed that human brain function, cognitive abilities and linguistic skills have emerged as a result of natural selection [4, 37, 38, 56], although the specific genes and genetic structures underlying this process have remained elusive. Intelligence is a highly heritable trait in humans with heritability estimates reaching 66% in young adulthood [24]. Therefore, it is conceivable that variants affecting cognition are still evolving adaptively in anatomically modern humans [4, 38], although no specific evidence for this has been reported to date. While several genes involved in neural development and cognitive performances have been identified as targets of natural selection in modern human populations [14, 25, 30, 53, 64], none of their polymorphisms has been reliably shown to influence intelligence, the only exception being a nonsynonymous SNP in BDNF [36]. Still, in this latter case the Met66 allele, which is thought to have been driven to high frequency by selection, has a negative impact on cognitive abilities [12].

SNAP25 (synaptosomal-associated protein, 25 kDa) encodes a central component of the exocytotic machinery necessary for vesicle docking and fusion, and regulation of neurotransmitter release [42, 46]. In neurons, SNAP25 is located at the presynaptic plasma membrane where it also modulates the activity of voltage-gated calcium channels [8, 39, 58]. Polymorphisms in SNAP25 have been associated with different neuropsychiatric conditions such as ADHD (reviewed in [9]), schizophrenia, bipolar disorders, and autism [9, 13, 23], and variants in the gene were reported to modulate cognitive performances [19, 20, 45]. In particular, SNPs located in a ~14 kb region in intron 1 were associated with the intelligence quotient (IQ) in Dutch cohorts [19, 20]. A subsequent report in a Swedish sample replicated the association of one of these variants (rs363039, A/G) with cognition by measuring working memory capacity [45]. Yet, this study indicated the A allele as increasing cognitive ability, the opposite of what was observed in the Dutch cohorts, where the G allele was associated with higher IQ [45].

We performed a population genetic analysis of SNAP25 and an analysis of its association with cognitive performances in different Italian populations. The data reported herein indicate that the region carrying rs363039 has been a target of balancing selection in human populations and that heterozygosity at this variant confers higher verbal skills in female individuals.

Materials and methods

HapMap subject resequencing

All analysed regions were PCR amplified and directly sequenced; primer sequences are available upon request. PCR products were treated with ExoSAP-IT (USB Corp., Cleveland, OH, USA), directly sequenced on both strands with a Big Dye Terminator sequencing Kit (v3.1 Applied Biosystems) and run on an Applied Biosystems ABI 3130 XL Genetic Analyzer (Applied Biosystems). Sequences were assembled using AutoAssembler version 1.4.0 (Applied Biosystems) and inspected manually by two distinct operators.

Data retrieval, linkage analysis, and haplotype inference

Data from the Pilot 1 phase of the 1000 Genomes Project were retrieved from the dedicated website (http://www.1000genomes.org/) [1]. Sliding window analysis was performed on overlapping 20-SNP windows moving with a step of 3 SNPs. For each window we calculated the estimate of the nucleotide diversity for each population [59] divided by the total number of fixed differences with Pan troglodytes. As a control set, genotype data for 5-kb regions from 238 resequenced human genes were derived from the NIEHS SNPs Program web site (http://egp.gs.washington.edu). In particular we selected genes that had been resequenced in populations of defined ethnicity including CEU, YRI, and EAS (NIEHS panel 2). Haplotypes were inferred using PHASE version 2.1 [47, 48]. Linkage disequilibrium analyses were performed using Haploview (v. 4.1) [3]. Data for LD analysis were derived from HapMap.
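The sliding-window procedure described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used in the study: the function and argument names are hypothetical, Watterson's estimator is taken as the diversity measure, and fixed differences are assumed to be counted between the first and last SNP of each window.

```python
from bisect import bisect_left, bisect_right


def harmonic(n_chrom):
    """a_n = sum_{i=1}^{n-1} 1/i, the denominator of Watterson's estimator."""
    return sum(1.0 / i for i in range(1, n_chrom))


def sliding_theta_over_divergence(snp_pos, segregating, fixed_pos, n_chrom,
                                  window=20, step=3):
    """Yield (start, end, ratio) for overlapping windows defined by SNP count.

    snp_pos     : sorted positions of all SNPs in the region
    segregating : parallel list of booleans, True if the SNP is polymorphic
                  in the population being analysed
    fixed_pos   : sorted positions of human-chimpanzee fixed differences
    n_chrom     : number of chromosomes sampled in the population
    """
    a_n = harmonic(n_chrom)
    for i in range(0, len(snp_pos) - window + 1, step):
        start, end = snp_pos[i], snp_pos[i + window - 1]
        # Segregating sites in this population within the window.
        s = sum(segregating[i:i + window])
        theta_w = s / a_n
        # Fixed differences falling inside the window boundaries (assumption).
        div = bisect_right(fixed_pos, end) - bisect_left(fixed_pos, start)
        # The ratio is undefined when no fixed difference falls in the window.
        ratio = theta_w / div if div > 0 else float("nan")
        yield start, end, ratio
```

Defining windows by SNP count rather than by physical length keeps the number of informative sites per window constant, at the cost of windows that span different numbers of base pairs.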

Population genetic analyses

Tajima’s D [49], Fu and Li’s D* and F* [15] statistics, as well as diversity parameters θ W [59] and π [33] were calculated using libsequence [51]. Calibrated coalescent simulations were performed using the cosi package [44] and its best-fit parameters for YRI, CEU, and EAS populations with 10,000 iterations. Coalescent simulations were conditioned on mutation rate and recombination rate. The maximum-likelihood-ratio HKA test was performed using the MLHKA software [65], as previously proposed [16]. In all analyses the chimpanzee sequence was used as the out-group. The median-joining network to infer haplotype genealogy was constructed using NETWORK 4.5 [2]. The time to the most recent common ancestor (TMRCA) was estimated using a maximum-likelihood coalescent method implemented in GENETREE [21, 22]. The method assumes an infinite-site model without recombination; therefore, haplotypes and sites that violate these assumptions need to be removed: for the analysis of SNAP25 we removed eight variants. The mutation rate μ was obtained on the basis of the divergence between human and chimpanzee, assuming that the species separation occurred 6 million years (MY) ago [18] and a generation time of 25 years. The migration matrix was derived from previously estimated migration rates [44]. Using this μ and the maximum-likelihood estimate of θ (θ ML), we estimated the effective population size parameter (N e), which was equal to 12,923. With these assumptions, the coalescence time, scaled in 2N e units, was converted into years. For the coalescence process, 10⁶ simulations were performed.
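For readers unfamiliar with the summary statistics named above, the following sketch computes the number of segregating sites, θ W, π, and Tajima's D from phased haplotypes using the textbook formulas. It is a plain re-implementation for illustration and does not reproduce the libsequence interface; the function name and the 0/1 string encoding of haplotypes are assumptions.

```python
from itertools import combinations
from math import sqrt


def summary_stats(haplotypes):
    """Return (S, theta_W, pi, Tajima's D) for equal-length 0/1 haplotype strings."""
    n = len(haplotypes)
    # Number of segregating sites: alignment columns carrying more than one allele.
    s = sum(1 for col in zip(*haplotypes) if len(set(col)) > 1)
    # pi: mean number of differences over all pairs of haplotypes.
    pairs = list(combinations(haplotypes, 2))
    pi = sum(sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs) / len(pairs)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    theta_w = s / a1
    # Constants of the variance term in Tajima's D.
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    # D is undefined without segregating sites or with a degenerate variance.
    d = (pi - theta_w) / sqrt(variance) if variance > 0 else float("nan")
    return s, theta_w, pi, d
```

A positive D indicates an excess of intermediate-frequency variants relative to neutral expectations, which is the direction expected under balancing selection; significance still has to be assessed against simulations or an empirical distribution, as described in the text.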

Child cohort

Subjects were drawn from a still-ongoing study on genetic and environmental influences upon language development in a general population sample. Children (n = 368) were recruited in Italy with the help of elementary schools and kindergartens in the greater Milan area and in the province of Lecco. The inclusion criteria were a mean score between vocabulary/information and block design sub-tests above 4 (i.e., >−2 SD), no disability certification, Caucasian ethnicity, and a self-report of Italian descent of at least one generation. Participants were grouped in two age groups: 3–8 (n = 185, mean age: 6.18, SD: 1.57) and 9–11 (n = 183, mean age: 9.68, SD: 0.68) years. Two sub-tests of the age-appropriate Wechsler Intelligence Scale (WISC-R; WPPSI-R) [60, 61], i.e. Vocabulary/Information and Block Design, were used to provide a proxy measure of verbal and performance IQ; these sub-tests are the most representative of, respectively, verbal IQ (r = 0.82/0.73) and performance IQ (r = 0.70/0.59) (WISC-R/WPPSI-R) [60, 61]. Scaled scores were compiled following manual instructions and were used in the statistical analyses. Mouthwash samples for DNA extraction were obtained from each subject. Ethics approval for this study was received from Eugenio Medea Scientific Institute Ethical Committee.

Neuromuscular cohort

Seventy patients with different types of neuromuscular disorders not associated with mental retardation were recruited. Details concerning disease type, male/female ratio, and age are available as Electronic Supplementary Material, Table 1. Inclusion criteria were Italian Caucasian origin, age below 60 years, and full-scale IQ (FIQ) ≥70. All subjects had a confirmed clinical, histological, and/or molecular diagnosis of LGMD, FSHD, congenital myopathy, SMA, or hereditary peripheral neuropathy, according to international criteria [5, 6, 32, 34, 35]. Cognitive performance was assessed using the Wechsler Preschool and Primary Scale of Intelligence (WPPSI), the Wechsler Intelligence Scale for Children-Revised (WISC-R), or the Wechsler Adult Intelligence Scale-Revised (WAIS-R), according to age [60–62]. FIQ, verbal IQ (VIQ), and performance IQ (PIQ) did not significantly differ between males and females. Genomic DNA was extracted from peripheral blood using standard procedures. Ethics approval for this study was received from Eugenio Medea Scientific Institute Ethics Committee.

SNP genotyping and statistical analyses

Genotyping of all SNPs was performed by direct resequencing, as described above; primer pairs are available upon request. All SNPs were in Hardy-Weinberg equilibrium in all cohorts, as assessed by application of Exact Tests [63] as implemented in PLINK [40]. One-way ANOVA was used to assess the difference in cognitive performances. A conservative Bonferroni correction for multiple tests was applied to account for the analysis of multiple SNPs. Post hoc tests were performed using the Tukey’s Honest Significant Difference method. All calculations were carried out in the R environment [41].
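The genotype comparison described above can be illustrated with a small sketch that computes the one-way ANOVA F statistic across genotype groups and applies a Bonferroni correction. The analyses in the study were run in R; the Python functions below are hypothetical helpers written for illustration only, and the p value itself would come from the F distribution with the returned degrees of freedom (for example via scipy.stats.f_oneway).

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of per-genotype score lists."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual scores around their own group mean.
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p value, capped at 1."""
    return min(1.0, p_value * n_tests)
```

With scores grouped by genotype (for example AA, AG, and GG), pooling the two homozygous groups into one list before calling the function gives the two-group comparison of heterozygotes against homozygotes.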

Results

Evolutionary analysis of SNAP25

In order to obtain an overall picture of SNAP25 intra-specific genetic diversity, we exploited data from the 1000 Genomes Pilot project deriving from the low-coverage whole genome sequencing of 179 individuals with different ancestry [1]. In particular, individuals from three distinct ethnic groups have been analysed: Europeans (CEU), Yoruba from Nigeria (YRI), and Japanese plus Chinese (East Asian, EAS). Using these data we calculated θ W [59], an estimate of the expected per site heterozygosity, and human-chimpanzee divergence in sliding windows moving along SNAP25. As shown in Fig. 1, for the three populations a peak in the ratio of θ W over divergence was observed in an intron 1 region where an SNP (rs363039) previously associated with cognitive performances is located. The low-coverage 1000 Genomes Project approach is estimated to have relatively low power to detect singleton SNPs or rare variants [1]; thus, we used a Sanger resequencing strategy to verify the sliding-window data and to apply classic population genetic tests. For this purpose we selected three regions: the one corresponding to the diversity peak (3.4 kb), which also carries rs363039, and two additional gene segments in intron 1 (2 kb each, Fig. 1), where two other SNPs associated with cognitive abilities (rs363043 and rs363016) have been described [19, 20]. We resequenced these regions in 20 HapMap CEU individuals and recalculated θ W, as well as π [33], the latter corresponding to the average number of pairwise nucleotide differences between haplotypes. In order to compare the values we obtained for the SNAP25 regions, we calculated θ W and π for 5 kb windows (hereafter referred to as reference windows) derived from 238 genes resequenced (Sanger method) by the NIEHS program in the same population; the percentile rank corresponding to the SNAP25 regions in the distribution of reference windows is reported in Table 1. No exceptional nucleotide diversity was observed for the gene portions encompassing rs363043 and rs363016. Conversely, a value of θ W corresponding to the 98th percentile was observed for the region where rs363039 is located, confirming the sliding-window results. π was also high in this region, although it did not reach statistical significance.

Fig. 1

Sliding window analysis of nucleotide diversity along SNAP25. The ratio of θ W over human/chimpanzee divergence was calculated in sliding windows moving along the gene for the three different populations resequenced by the 1000 Genomes Pilot Project, namely YRI (red), CEU (blue), and EAS (green). The exon-intron structure of SNAP25 is also shown, and shading indicates the three regions we resequenced. The LD plot in CEU for the region where rs363039, rs363016, and rs363043 are located is also shown (r² data were derived from the HapMap website)

Table 1 Nucleotide diversity and summary statistics for the three SNAP25 gene regions we resequenced

High nucleotide diversity might be suggestive of balancing selection, as neutral variants tend to be maintained with the selected alleles. Data in Fig. 1 suggest that the excess of nucleotide diversity we observe is not due to a higher local mutation rate as, in this case, an increase in human-chimpanzee divergence would also be expected. In order to formally rule out this possibility we applied a maximum likelihood HKA (Hudson-Kreitman-Aguadé) test [65]. This is based on the concept whereby, under neutral evolution, the ratio of within-species polymorphism to between-species divergence is expected to be similar at all loci in the genome [27]. Therefore, the test compares polymorphism and divergence levels for a region of interest with those calculated for other neutrally evolving genomic segments. The results for the SNAP25 region carrying rs363039 indicated a significant excess of polymorphism compared to divergence (selection parameter k = 4.46; p value = 4.74 × 10⁻⁴; Electronic Supplementary Material, Table 2).

Natural selection acting on specific gene regions can determine a distortion in the allele frequency spectrum (AFS). Common neutrality tests based on the AFS include Tajima’s D (DT) [49], and Fu and Li’s D* and F* [15]. Since population history, in addition to selective processes, is known to affect the AFS, we evaluated the significance of neutrality tests by performing coalescent simulations that incorporate demographic scenarios (see “Materials and methods”) [44]. As explained above, we also applied an empirical comparison by calculating the percentile rank of DT, F*, and D* for the three SNAP25 gene regions relative to reference windows. Neutrality tests failed to reject the null hypothesis for the regions covering rs363043 and rs363016. Conversely, significant values of D* were obtained for the gene segment where rs363039 is located (Table 1). In order to gain further insight into the evolutionary history of this gene region, we extended resequencing analysis to two additional HapMap populations, namely YRI and EAS. As reported in Table 1, high nucleotide diversity was observed in both populations, and most neutrality tests rejected the null hypothesis of selective neutrality in both YRI and EAS. Similarly, the MLHKA test indicated a significant excess of polymorphism compared to divergence in these populations (both p values <10⁻³; Electronic Supplementary Material, Table 2).

It has recently been shown that biased gene conversion (BGC) affects neutral substitution patterns, possibly mimicking the effects of natural selection (reviewed in [11]). The recombination rate is relatively high in the region where rs363039 is located (1.65 cM/Mb averaged over the three populations); therefore, we analysed the frequency of A/T → G/C substitutions in the region: out of a total of 36 segregating sites, only 38% were accounted for by A/T → G/C mutations, suggesting that BGC does not play a major role in shaping nucleotide variability at SNAP25 intron 1.

Additional insight into the evolutionary history of genomic regions can be gained by the analysis of haplotype genealogies. In order to estimate the coalescence time (time to the most recent common ancestor, TMRCA) of haplotypes in the SNAP25 region carrying rs363039, we used GENETREE, which is based on a maximum-likelihood coalescent method. The estimated TMRCA amounted to 2.08 million years (MY) (SD: 0.52 MY) (Electronic Supplementary Material, Fig. 1). We next constructed a median-joining network of the haplotype genealogy (Fig. 2) that shows the presence of two major haplotype clades both containing chromosomes from all three population samples. In agreement with the relatively high recombination rate in the region, the network shows some recurrent mutations that may result from recombination or gene conversion events.
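The conversion of the coalescence time into years follows from the quantities given in the Methods. The relations below are a sketch that assumes the standard diploid parameterisation θ = 4N e μ, with μ expressed per region per generation and a generation time g of 25 years; the time in coalescent units in the last line is back-calculated here from the reported TMRCA and is not a figure stated in the text.

```latex
\begin{align*}
N_e &= \frac{\theta_{ML}}{4\mu} \\
T_{\text{years}} &= T_{2N_e} \times 2N_e \times g \\
2N_e \times g &= 2 \times 12{,}923 \times 25 = 646{,}150 \ \text{years per coalescent unit} \\
T_{2N_e} &\approx \frac{2.08 \times 10^{6}}{646{,}150} \approx 3.2
\end{align*}
```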

Fig. 2

Network analysis for the SNAP25 region where rs363039 is located. Each node represents a different haplotype, with the size of the circle proportional to frequency. Nucleotide differences between haplotypes are indicated on the branches of the network. The position of SNPs we used in association analysis is shown. Circles are colour-coded according to population (green: YRI, blue: CEU, red: EAS). The most recent common ancestor (MRCA) is also shown (black circle). The relative position of mutations along a branch is arbitrary. For all haplotypes observed in CEU the allelic status at rs363039 (A/G) is reported

Association of SNAP25 variants with cognitive performances

In order to analyse the role of SNAP25 variants in the modulation of cognitive performance, we enrolled two populations of children divided by age range: 3–8 (n = 185) and 9–12 (n = 183) years. These subjects underwent two sub-tests of the Wechsler Intelligence Scale for Children (WISC), namely Vocabulary/Information and Block design. Five SNPs were selected for genotyping: in addition to rs363039, we analysed two variants (rs6039787 and rs3787297) located in the balancing selection region that define specific haplotypes (Fig. 2). Although no natural selection signature was identified in the regions where rs363016 and rs363043 are located, we included these two variants, as they were previously associated with IQ scores [19, 20]. Analysis was performed separately for each sex. No genotype effect was observed for rs6039787, rs3787297, or rs363043 on either Vocabulary/Information or Block design scores (Electronic Supplementary Material, Table 3). Conversely, in both age groups a significant genotype effect was observed for rs363039 on Vocabulary/Information scores in females (Table 2), but not on Block design in either sex. Analysis of the combined cohort showed a strong effect of this SNP on verbal performances (one-way ANOVA, Bonferroni corrected p = 0.00029) in the female sample. Tukey post hoc analysis indicated that in both age cohorts the genotype effect is driven by differences between heterozygotes and both homozygous genotypes (3–8 year group: AA vs. AG, p = 0.017; AG vs. GG, p = 0.032; 9–12 year group: AA vs. AG, p = 0.008; AG vs. GG, p = 0.044). Thus, females heterozygous for rs363039 display, on average, higher verbal performances compared to both homozygotes, while this effect is absent in males (Fig. 3). As for rs363016, a significant genotype effect in females was observed in the 3–8 age group, but was not confirmed in older children (Table 2). The combined cohort yielded a significant difference in females, but the association was weaker than that observed for rs363039 (Table 2). Therefore, the genotype effect for rs363016 is most likely accounted for by linkage disequilibrium with rs363039 (r² = 0.49, D′ = 0.83).

Table 2 Association analysis of rs363039 and rs363016 with cognitive performance in the child and neuromuscular cohorts
Fig. 3

Boxplot of Vocabulary/Information and VIQ scores. Vocabulary/Information and VIQ scores are shown according to rs363039 genotype. Results refer to two age cohorts combined for children (a, b). For the neuromuscular cohort, VIQ scores are reported based on genotype status at rs363039 with homozygous genotypes analysed separately (c and d) or together (e and f)

In order to replicate these findings in a third independent cohort, we recruited 70 subjects suffering from neuromuscular disorders that do not affect intellectual abilities (see Electronic Supplementary Material, Table 1 for demographic and clinical details). Cognitive performances were evaluated by recording FIQ, VIQ, and PIQ, and rs363039 was genotyped in all subjects. No genotype effect was observed for FIQ and PIQ (Table 2). As for VIQ, females showed a pattern for the three genotypes that recapitulates that observed for Vocabulary/Information scores in the child cohort (Fig. 3), although the difference only showed borderline significance (one-way ANOVA, p = 0.050, Table 2). Given the small sample size and the data obtained in the post hoc analysis above (showing a higher Vocabulary/Information score for heterozygotes compared to both homozygote genotypes), we pooled together AA and GG homozygotes and compared their VIQ with that obtained from heterozygotes (Fig. 3): a significant difference with higher VIQ for heterozygotes was observed in females but not males (one-way ANOVA, p = 0.014, Table 2).

Thus, the results of these three independent cohorts are consistent in showing that heterozygosity for rs363039, which is located in the balancing selection region, confers higher verbal performances to female subjects.

Discussion

The search for signatures of natural selection acting on human genes serves a double purpose: first, it helps to clarify the evolutionary history of our species, especially in relation to complex phenotypic traits such as cognition; second, the identification of gene regions subject to non-neutral evolution can be exploited to recognise functional variants, which are a prerequisite for selection to act. In particular, balancing selection signatures, due to the combined effects of mutation and recombination, are expected to extend over relatively short genomic regions [7] where the functional selection target(s) is located. Indeed, as shown in the sliding-window analysis (Fig. 1), the high-diversity peak is restricted to a relatively narrow region in SNAP25, which encompasses rs363039. Application of population genetic tests based on the allele frequency spectrum in most cases allowed rejection of the null hypothesis of selective neutrality for this gene region. It is worth noting that the failure to reach statistical significance for some of these tests is likely due to the fact that in non-African populations, the two major haplotype clades have extremely different frequencies, resulting in no marked skew towards intermediate frequency variants (to which these tests are sensitive). Support for the notion whereby the region carrying rs363039 represents a selection target also comes from the estimation of the coalescence time: we obtained a TMRCA dating back to 2 MY, which is deeper than the estimates obtained for most autosomal neutrally evolving genes, these ranging from 0.8 to 1.5 MY [52]. However, some caution should be used in interpreting this result, as inference of phylogenies in the presence of recombination is not very robust. This observation also leads to the question as to whether rs363039 represents the selection target or not, and, in turn, whether increased verbal abilities represent the selected trait. On one hand, theory predicts that the selected variant(s) should be located on the basal branches separating the major haplotype clades, and this is not the case for rs363039. On the other hand, we selected variants to be genotyped depending on their location in the haplotype genealogy (so as to allow association analysis of major haplotypes in Europeans) but only rs363039 yielded significant association with IQ scores, suggesting that it represents the causal variant in the region. Different explanations can reconcile these apparently conflicting results. Selection may have acted to maintain heterozygosity at rs363039 (due to its conferring higher verbal abilities), and historical recombination (or gene conversion) events may have resulted in the observed phylogeny. Alternatively, a variant different from, but in proximity to, rs363039 and a trait unrelated to cognition may have been targeted by selection. SNAP25 is a central component of the exocytosis machinery in pancreatic beta cells, where it interacts with several components, including syntaxin 1A (STX1A) and calpain-10 (CAPN10), to modulate insulin release [29, 50]. In line with this function, SNPs in both CAPN10 and STX1A have been associated with the predisposition to type 2 diabetes [43, 54, 55]. Genes associated with metabolism in general, and type 2 diabetes in particular, are thought to have represented selection targets during human evolutionary history, as a result of changes in diet following the advent of agriculture. This possibility has been verified in the case of CAPN10 [57]. Therefore, the balancing selection signature we identified in SNAP25 may be related to its role in insulin secretion and metabolism. Nonetheless, these two alternatives are not mutually exclusive as different selective forces, occurring across different time periods, may have affected the evolutionary fate of the genomic region.

Despite the generally held view that intelligence and verbal skills should have represented selected traits [4, 37, 38, 56], descriptions of genes/gene regions that are adaptively evolving in humans and that carry polymorphisms with an effect on cognitive performances have been virtually absent. Even the signatures of strong and recent positive selection at MCPH1 and ASPM [14, 30], two genes that result in microcephaly and cognitive impairment when mutated, seem to be driven by pressures unrelated to cognition [31]. In fact, both genes are expressed in tissues other than brain, and their variants may have pleiotropic effects. Similarly, the selective force responsible for the increase of the Met66 allele of BDNF in North African, European, and Asian populations is presently unknown [36], and this allele is associated with lower rather than higher cognitive abilities [12].

One possible reason for this failure to identify selection signatures for variants that affect intelligence is that most large-scale analyses of selective patterns in the human genome have focussed on relatively recent, positive selection events. This is mainly due to the difficulty in identifying long-standing balancing selection signatures at the genome-wide level. However, older (and possibly weaker) selective events may have operated over a time frame that is more relevant to the evolutionary history of cognitive abilities given that, as noted by Vallender [56], anatomically modern humans emerged between 100,000 and 200,000 years ago, and the “human” brain was already established by then. In this respect it is worth noting that heterozygote advantage is one of the possible reasons underlying the maintenance of genetic diversity in natural populations. Thus, the observation that heterozygotes for rs363039 display higher verbal abilities compared to homozygotes is consistent with the underlying balancing selection model. Besides this evolutionary consideration, the fact that heterozygosity at this variant is associated with higher cognitive performances may have a biochemical explanation as well. Typically, the first intron of genes contains sequences important for gene regulation, and different transcription factors (TF) have been predicted to bind the region where rs363039 is located, depending on its allelic status [19, 45]. Although these predictions should be taken with caution because of the largely degenerate consensus sequences of TF binding sites, it may be speculated that heterozygotes display a wider SNAP25 expression flexibility in terms of either response to specific stimuli or cell-type specificity.

As noted above, the association of rs363039 with cognitive skills has been reported previously, although different alleles have been associated with higher performances. The reason for this inconsistency has remained unexplained, although our data suggest that stratification by sex and application of non-allelic models might help reconcile those data. Indeed, we analysed three independent populations and obtained the same pattern in all cases, with heterozygote females displaying higher verbal performances compared to homozygotes. Thus, we consider this finding to be robust, although it would certainly benefit from replication in larger cohorts. Similarly, the possibility exists that the neuromuscular sample does not perfectly reflect the IQ score variation in healthy subjects. Nonetheless, we envisage no reason why allelic status at rs363039 should have a sex-driven interaction with any of these disorders, and we included a similar number of male and female subjects for each disease (Electronic Supplementary Material, Table 1). Moreover, the average FIQ, VIQ, and PIQ were not significantly different between male and female neuromuscular patients, and the VIQ pattern in relation to rs363039 closely paralleled the Vocabulary/Information scores obtained in the healthy children cohorts. As for the remaining SNPs, we failed to replicate the association of rs363016 and rs363043 with intelligence previously reported by Gosso and co-workers [19]. Yet, it should be noted that Gosso et al. used a family-based approach, which might have resulted in greater power to detect associations, and that large subject cohorts from the Netherlands may display higher genetic homogeneity and linkage disequilibrium compared to our Italian sample.

Finally, the reason why heterozygosity at rs363039 affects verbal performances in female subjects alone remains to be clarified. A sex-specific difference has been described in the brain expression of SNAP25, although few samples were used for the analyses and no genotypic information was incorporated in the study [10]. Previous works have also indicated that estrogen regulates SNAP25 expression in rat brain areas and in the pituitary [26, 28]. In this respect it is worth mentioning that several neuropsychiatric disorders that have been associated with variants in SNAP25 (e.g. ADHD, schizophrenia, autism) display well-known gender prevalence biases. Whether these differences also depend on a differential expression of SNAP25 in males and females [17] remains to be elucidated, and might provide further insight into the role of this protein in both normal cognitive development and in neuropsychiatric disease.