Introduction

Persistence of small-intestinal lactase production into adult life in humans is caused by genetic differences cis-acting to the lactase gene, LCT (Wang et al. 1995), which enable some alleles to escape the developmental down-regulation characteristic of the ancestral state, in which lactase expression declines and restricts the ability to digest lactose in milk after childhood. Lactase persistent individuals (lactose digesters) can readily consume large amounts of milk as adults, and there is considerable evidence to suggest that the trait is subject to strong positive selection in humans (reviewed in Ingram et al. 2009). To date, three different nucleotide polymorphisms have been identified that are associated with increased lactose digestion capacity. The first allele to be identified, –13910*T (rs4988235), was discovered in a Finnish sample (Enattah et al. 2002), and is tightly associated with high adult lactase expression throughout Europe. –13910*T is present on the background of a single very extended haplotype previously defined as ‘A’ (Hollox et al. 2001; Poulter et al. 2003; Bersaglieri et al. 2004; Coelho et al. 2005). More recently, –13915*G (rs41380347) was found to be associated with increased lactose digestion capacity (Ingram et al. 2007; Tishkoff et al. 2007) and greater lactase activity (Imtiaz et al. 2007) in Sudanese and Middle Eastern populations, respectively, and a third allele, –14010*C, was found to be associated with increased lactose digestion capacity in Kenyan and Tanzanian populations (Tishkoff et al. 2007). –13915*G and –14010*C are also present on single extended haplotypes that are distinct from the A haplotype and from each other (Tishkoff et al. 2007; Enattah et al. 2008). –13910*T, –13915*G and –14010*C are all located within an intron of MCM6 (the upstream neighbor of LCT), in a sequence that acts as an enhancer for lactase expression in vitro (Troelsen et al. 2003).
Each of the alleles increases transcription compared with the ancestral allele in promoter-reporter constructs (Troelsen et al. 2003; Tishkoff et al. 2007), although the effects are small and there is no simple change in transcription factor binding (Ingram et al. 2007; Enattah et al. 2008). The pattern of association of each of these alleles with lactose digestion status is not absolute (Ingram et al. 2007; Tishkoff et al. 2007), and although there is an intrinsic error rate in the lactose tolerance test (LTT) (Mulcare et al. 2004), many persistent individuals can be confidently identified who do not carry –13910*T, –13915*G or –14010*C. Two other putative functional alleles have been identified, –13913*C (rs41456145) and –13907*G (rs41525747), although these occur at very low frequencies in the groups tested so far, resulting in insufficient power to examine their association with lactose digestion (Ingram et al. 2007; Tishkoff et al. 2007; Enattah et al. 2008).

Since –13907*G gave suggestive evidence of function in vitro (Tishkoff et al. 2007) and had been found previously in Ethiopians (Ingram et al. 2007), we initially collected samples from an Ethiopian pastoralist population with a view to testing the association of this allele with lactase persistence. Resequencing of the enhancer region in this cohort revealed remarkable heterogeneity, including previously reported, as well as novel variant alleles, which show a very marked difference in distribution with respect to lactose digestion status. The significance of this diversity is evaluated by testing for population substructure at other loci and re-examining our previously tested phenotyped cohort which shows the same phenomenon. We report haplotype associations of each allele using a large pooled data set (n = 746) for increased accuracy of statistical inference.

Subjects and Methods

Test Cohort and Lactose Tolerance Testing

The ethnic group selected for testing was the Somali, of whom about 3 million are resident in Ethiopia. The Somali were selected because of their documented history as a traditional pastoralist population (Blench 1999). Volunteers were recruited in Shinile (9.6833 N, 41.8500 E; approximately 10 km from Dire Dawa), in the Somali region of eastern Ethiopia. Individuals over the age of 18 and of self-declared Somali ethnicity were invited to participate in the study on the day prior to testing, when the purpose and possible side effects of the lactose tolerance test were clearly explained by a local interpreter and nurse. Each person who consented to participate agreed to observe an overnight (8 h) fast in preparation for the test. A local interpreter with personal knowledge of the participants was employed to recruit volunteers who were unrelated at least at grandparental level. Each sample donor was asked to complete a questionnaire which recorded whether they had taken antibiotics recently or experienced any gastro-intestinal illness. We also recorded their milk-drinking habits and those of their parents and grandparents. The age range of sample donors was from 19 to 70 years old, with 80% being between 20 and 50. Buccal cell samples were collected as described by Freeman et al. (2003). DNA samples were linked to questionnaire and lactose digestion data by code, but names were not taken. Local approval for this study was obtained from Addis Ababa University and the genetic part of the work was conducted in London under UCLH 99/0196 and 01/0236 ethics approvals.

Lactose tolerance testing was conducted as follows: breath hydrogen baseline readings were obtained using a MicroH2 meter (Micro Medical Ltd) and all eligible individuals (i.e. breath hydrogen > 0 and ≤ 20 ppm) were given 50 g lactose dissolved in 250 ml water at room temperature and were requested to stay for the entire test duration (3 h). Breath hydrogen readings were taken at 30-min intervals during the test. In total, 107 samples were collected. Participants were classified into four mutually exclusive categories, two unambiguous (lactose digesters, D, and lactose non-digesters, ND) and two ambiguous (intermediate, I, and hydrogen non-producers, H2NP). Assignment to these categories used the following criteria: D, a rise in breath hydrogen of not more than 20 ppm for the entire test duration; ND, a rise in breath hydrogen of greater than 20 ppm for at least two consecutive readings; I, fluctuations in breath hydrogen throughout the test, giving neither a clear sustained rise above 20 ppm nor a clear sustained flat line; H2NP, a baseline reading of zero which remained at zero for the test duration. H2NP individuals do not release hydrogen in their breath following the lactose load either because they have an absence or low number of hydrogen-producing bacteria in their gut flora (Gilat et al. 1978) or because they are genuine lactose digesters. It is impossible to distinguish these two phenotypes without conducting further tests for the presence/absence of hydrogen-producing bacteria, and so here they have been included in an ambiguous category which reflects this uncertainty.
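The classification criteria above can be sketched as a small decision rule. This is a minimal illustration, not the procedure actually used in the study: the function name, the input layout (one baseline value plus the readings taken at 30-min intervals) and in particular the treatment of the intermediate category as "anything that is neither D, ND nor H2NP" are assumptions made for the example, since the intermediate group is described only qualitatively in the text.

```python
from typing import Sequence

RISE_THRESHOLD_PPM = 20  # rise over baseline used in the criteria above


def classify_ltt(baseline: float, readings: Sequence[float]) -> str:
    """Return 'H2NP', 'ND', 'D' or 'I' for one participant.

    baseline -- breath hydrogen (ppm) before the lactose load
    readings -- breath hydrogen (ppm) at each 30-min time point
    """
    # H2NP: baseline of zero that stays at zero for the whole test.
    if baseline == 0 and all(r == 0 for r in readings):
        return "H2NP"

    # Flag every time point whose rise over baseline exceeds the threshold.
    above = [(r - baseline) > RISE_THRESHOLD_PPM for r in readings]

    # ND: rise above the threshold for at least two consecutive readings.
    if any(a and b for a, b in zip(above, above[1:])):
        return "ND"

    # D: the rise never exceeds the threshold.
    if not any(above):
        return "D"

    # Assumption: isolated excursions above the threshold are treated
    # as the intermediate (I) category.
    return "I"
```

For example, `classify_ltt(5, [6, 8, 30, 45, 50, 40])` returns `"ND"`, because the rise exceeds 20 ppm at consecutive time points.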

Details of our earlier collection of DNA samples with associated lactose digestion phenotype, collected from the Jaali population residing in Shendi in central Sudan (n = 86), are described in Ingram et al. (2007).

Non-Phenotyped Cohorts Tested

Non-phenotyped samples used for distribution data and haplotype inference (n = 553) consisted of 89 European samples including 23 newly collected samples, 20 CEPHs and 46 samples previously described (Harvey et al. 1998); 89 Cameroonian samples including Fulani, Mambila and Shuwa Arabs; 96 Ethiopian samples including Amhara, Afar and Somali (from a separate collection in Jijiga); 96 Sudanese samples including Beni Amer, Donglawi and Shaigi; and 183 samples from the Middle East (including Bedouin and non-Bedouin Arabs).

Genotyping

Sequencing

PCR of the MCM6 enhancer sequence was carried out in 15 μl total volume of 1× reaction buffer IV (Abgene), 0.25 units of Taq DNA polymerase (Abgene), 0.2 mM dNTPs, approximately 10–20 ng genomic DNA and 0.5 μM of each primer (MCM6ex13 5′-ATTTCCAAAGAGTCAGAGGACTTC-3′ and MCM6778 5′-CCTGTGGGATAAAAGTAGTGATTG-3′). Cycling conditions were 95°C for 5 min, followed by 34 cycles of 95°C for 30 s, 58°C for 30 s and 72°C for 1 min. PCR products were cleaned by PEG precipitation and sequenced using Big-Dye Terminator chemistry (Applied Biosystems, Foster City, CA, USA) with the MCM6ex13 primer. Novel variants were confirmed by sequencing the reverse complement with the MCM6778 primer. Sequencing of the hypervariable segment 1 (HVS-1) of the mitochondrial DNA (mtDNA) was carried out as previously described (Thomas et al. 2002). All sequence fragments were electrophoresed on an ABI 3730xl genetic analyser, and chromatograms examined using ChromasPro software (Technelysium).

LCT Haplotypic Markers

–942/3TC>ΔΔ, –678A>G, 666G>A (rs3754689) and 5579T>C (rs2278544) were genotyped as described elsewhere (Ingram et al. 2007). rs309180, rs4954490, and rs3769005 were typed by PCR-RFLP and rs4954493 was typed by tetra-primer ARMS PCR. Primers and PCR conditions can be found in Supplementary Table 1. Genotypes were inferred from agarose gel phenotype assuming no silent alleles.

Y Chromosome

Typing of six unique event polymorphism (UEP) markers (92R7, SRY+465, SRY4064, sY81, Tat and YAP) and six microsatellite markers (DYS19, DYS388, DYS390, DYS391, DYS392, and DYS393) was carried out using the technique previously described in Thomas et al. (1999).

Autosomal Short Tandem Repeat (STR) Markers

Fifteen unlinked autosomal STRs, commonly used for forensic differentiation (Penta E, D18S51, D21S11, TH01, D3S1358, FGA, TPOX, D8S1179, vWA, Penta D, CSF1PO, D16S539, D7S820, D13S317 and D5S818), were genotyped using the PowerPlex 16 kit (Promega), modifying the manufacturer’s instructions only to reduce reaction volume to 10 μl and enzyme units to 0.5 U/μl. Amplified fragments were detected using a 3730xl genetic analyzer (ABI) and analyzed using GeneMapper v4.0 software (Applied Biosystems).

Statistical Analyses

Hardy–Weinberg Equilibrium

Tests for deviation from Hardy–Weinberg Equilibrium were conducted for each genotyped locus within each population using the exact test for HWE based on a Markov chain method implemented either in the program GENEPOP (Raymond and Rousset 1995b) or in Arlequin (Excoffier et al. 2005).

Haplotype Inference

Haplotypes (i.e. intron 13 variations plus LCT core haplotype tagging SNPs) were inferred using the Bayesian algorithm implemented in Phase (Stephens et al. 2001). Haplotypes were inferred for each individual twice: once within their own population group and once for the pooled data set of 746 individuals (providing greater power for inferring the haplotypic associations of rare alleles; Andres et al. 2007). Outputs were inspected for agreement, and in the few cases where discrepancies were present a decision was made based on visual inspection of the haplotype pairs and consideration of the Phase probabilities generated by the software. Such decisions were only necessary in the case of two individuals in the Somali data set. Haplotype inference within the Somali population assigned the single occurrence of –14010*C to a C haplotype, but when inferred for the pooled data set this allele was assigned to the B haplotype in all three individuals in whom it occurred. The second discrepancy involved assignment of –13915*G on to a recombinant haplotype both within the Somali population and in the pooled data set. Upon visual inspection, genotype data for this sample (DD-084) were resolved into common C and U haplotypes.

Population Differentiation

GENEPOP (Raymond and Rousset 1995b) was used to investigate differentiation in STR allele distribution between pairs of sample groups. Contingency tables of alleles were constructed for pairs of populations and tested for independence using Fisher’s exact test (Raymond and Rousset 1995a). GENEPOP uses Fisher’s combined probability test to generate an overall P value for differences in allele distribution between pairs of populations across all loci (Sokal and Rohlf 1994). Population differentiation at the intron 13, Y-chromosome and mtDNA loci was tested using permutation-based AMOVA of Wright’s FST statistic (Excoffier et al. 1992) and an exact test of population differentiation (Raymond and Rousset 1995a) using Arlequin software (Excoffier et al. 2005). Genetic diversity was calculated using the test_h_diff program written by M. Weale (Thomas et al. 2002). The program calculates Nei’s h (Nei 1987) for each population and tests for a significant difference in allele distribution between them, based on samples of haplotypes. P values are obtained using both a bootstrapping method and a Z test, and the reported P value is the larger of the two.
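Two of the quantities named above have simple closed forms, sketched here in plain Python for orientation only. The sketch assumes the usual unbiased estimator of Nei's gene diversity, h = n/(n − 1) × (1 − Σ xᵢ²), and the standard form of Fisher's combined probability test, in which −2 Σ ln Pᵢ is referred to a chi-square distribution with 2k degrees of freedom. The function names are hypothetical, and nothing here reproduces the internals of test_h_diff or GENEPOP.

```python
import math
from collections import Counter
from typing import Iterable, Sequence


def nei_h(haplotypes: Sequence[str]) -> float:
    """Unbiased gene diversity from a sample of haplotype labels."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two haplotypes are required")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


def fisher_combined_p(p_values: Iterable[float]) -> float:
    """Combine k independent P values (all > 0) into one overall P value."""
    ps = list(p_values)
    k = len(ps)
    if k == 0 or any(p <= 0 or p > 1 for p in ps):
        raise ValueError("P values must lie in (0, 1]")
    half = -sum(math.log(p) for p in ps)  # half of the chi-square statistic
    # Chi-square survival function for an even number (2k) of degrees of
    # freedom: exp(-x/2) * sum_{i<k} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total
```

A sample made up of a single haplotype gives h = 0, and a sample in which every haplotype is different gives h = 1.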

Results

Lactose Tolerance Test Results

DNA samples and lactose digestion data were collected from 49 females and 58 males of self-declared Somali ethnicity. Eleven were of intermediate lactose digestion status and eight showed no evidence of hydrogen production (Table 1). Unambiguous lactose digestion phenotypes were obtained for 88 of the 107. There were 21 lactose digesters, giving a frequency of 0.24. In this population, lactose digestion capacity did not correlate with milk-drinking behavior (P = 1.00 for a Fisher’s exact test of drinking 500 ml or more of milk per day), and 71% of the participants reported regularly drinking at least half a litre of fresh milk a day.

Table 1 Summary of lactose digestion results obtained from lactose tolerance testing in the Somali cohort

Resequencing

The intron 13 enhancer region (−14133 to −13684) was sequenced in all 107 people of the phenotyped Somali cohort. The sequence from exon 13 up to position −14010 was completely invariant, but downstream of −14010 eight polymorphic sites were identified (see Fig. 1 and Table 2). Three of these sites (–14009T>G, –13806A>G and –13779G>C) had not been previously reported.

Fig. 1 Map of chr2q21 showing LCT and MCM6 and the location of the SNPs identified within the LCT enhancer sequence within intron 13 of MCM6. Variant alleles observed in the Somali cohort are shown in black, and nucleotides previously reported to have variant alleles in other populations are indicated by a black dot. Nomenclature of the SNPs indicates position in number of nucleotides upstream of the LCT transcription start site (on the allele in the human golden path sequence). Note that the genome sequence (indicated by the black bar at the top of the figure) is in the opposite orientation from the direction of transcription of MCM6 and LCT (the genomic nucleotide positions correspond to those given by the UCSC Genome Browser, March 2006 assembly)

Table 2 Variant alleles observed in the lactase enhancer in the Somali cohort

Table 2 shows the distribution of each variant allele with respect to lactose digester status. There is a difference in allele distribution between the different groups. Seventeen of the twenty-one lactose digesters showed one or more derived alleles (ascertained by comparison with primate sequence) compared with only 16/67 non-digesters. Only three individuals within the data set are heterozygous for two different derived alleles, and no individual carried more than two derived alleles (Table 3). Each of the novel alleles is associated with different haplotypes, as seen below, so that there is reasonable evidence that the derived alleles are independent. Under this assumption of independence, the difference in prevalence of derived alleles between unambiguous digesters and non-digesters is highly significant (P = 4.3 × 10⁻⁶, Fisher’s exact test for a 2 × 2 table of variant/ancestral chromosomes and digester status; Table 3).
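For readers unfamiliar with the test, a two-sided Fisher's exact test on a 2 × 2 table can be sketched in a few lines of plain Python by summing the hypergeometric probabilities of all tables that are no more likely than the observed one. The function name and the example table are illustrative assumptions. The example counts individuals carrying at least one derived allele (17 of 21 digesters, 16 of 67 non-digesters), whereas the P value reported in the text is based on chromosome counts from Table 3, so the two results are not expected to coincide.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact P value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell,
        # with all margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum over every table that is as likely as, or less likely than,
    # the observed one (small tolerance for floating-point ties).
    return sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_obs * (1 + 1e-9)
    )


# Carriers vs non-carriers of any derived allele, by phenotype
# (individual counts taken from the text; illustrative only).
p_carriers = fisher_exact_two_sided(17, 4, 16, 51)
```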

Table 3 Contingency table of the unambiguous digester (D) and non-digester (ND) phenotype groups showing the distribution of individuals carrying either none, one or two variant alleles within the LCT enhancer sequence

Association of each allele with lactose digestion capacity was also tested by comparing allele counts in the two groups, lactose digesters and non-digesters (Table 2). The most frequent derived allele (–13730*G, rs4954492) at the 3′ end of the enhancer sequence shows no hint of association, and is, unlike the other derived alleles, widespread in several populations (B. Jones and D.M. Swallow, unpublished data, and Supplementary Table 2). Both –13915*G (previously shown to associate with lactase persistence; Ingram et al. 2007; Tishkoff et al. 2007; Imtiaz et al. 2007) and –13907*G (previously demonstrated to have function in vitro; Tishkoff et al. 2007) are individually highly significantly associated with the lactose digester phenotype in this cohort (P ≤ 1 × 10⁻⁴ for a 2 × 2 contingency table, Fisher’s exact test), and remain significant after Bonferroni correction for eight tests (corrected significance threshold = 0.006).

The –14010*C allele, previously shown to be associated with lactase persistence in Kenyan and Tanzanian populations (Tishkoff et al. 2007) and the European –13910*T allele each occur in single individuals in whom the LTT was inconclusive. Two of the newly identified loci (–13806A>G and –13779G>C) are rare and occur only in lactose non-digesters.

The third novel allele (–14009*G) is also rare and occurs in two lactose digesters and in one hydrogen non-producer. Its association with lactose digestion capacity in the Somali cohort is marginal (P = 0.056, Fisher’s exact test): it falls just short of the uncorrected significance threshold (0.05) and is far from the corrected threshold of 0.006.
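The corrected threshold quoted above is simple arithmetic, shown here as a short sketch. The variable and function names are illustrative, and the P values are those quoted in the text (the value for –13915*G and –13907*G is entered as the reported upper bound of 1 × 10⁻⁴).

```python
ALPHA = 0.05   # uncorrected significance threshold
N_TESTS = 8    # one test per polymorphic site in the enhancer


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


threshold = bonferroni_threshold(ALPHA, N_TESTS)  # 0.00625, quoted as 0.006

# P values as quoted in the text.
reported = {
    "-13915*G": 1e-4,   # reported as P <= 1e-4
    "-13907*G": 1e-4,   # reported as P <= 1e-4
    "-14009*G": 0.056,
}
significant = {allele: p < threshold for allele, p in reported.items()}
```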

The discovery of –14010G>C and –14009T>G led us to re-examine the Sudanese Jaali cohort in which –13915T>G was originally identified (Ingram et al. 2007). –14009*G was found at a frequency of 0.06 in this cohort (11/166 chromosomes). Eight of these eleven –14009*G alleles occurred in lactose digesters, six of whom carried no other variant alleles within the enhancer region (see Supplementary Table 2). Association of the SNP with lactase persistence status was, however, not statistically significant (one-tailed P = 0.08, Fisher’s exact test). Nevertheless, taken together with the Somali data, the excess of –14009*G alleles in persistent individuals is noteworthy.

Genetic Diversity and Differentiation

Our results show a significant difference in heterogeneity in the LCT enhancer locus between people of different inferred lactase persistence status. Since this might result from recent population admixture or other causes of structuring, we have examined the distribution of Y chromosome, mitochondrial and autosomal markers in the same samples. All should be sensitive to detecting population sub-structure; Y chromosome and mtDNA with their smaller effective population size, and/or high mutation rate, and the autosomal STRs because of their known power to differentiate closely related populations (Krenke et al. 2002).

Pairwise FST and Fisher’s exact tests of population differentiation between the digesters and non-digesters were calculated for HVS-1 mtDNA and Y chromosome haplotypes as well as for the LCT enhancer. Only the LCT enhancer showed a significant difference between the digesters and non-digesters (P < 0.001 for both FST and population differentiation). In order to compare the pattern of diversity observed at the LCT enhancer region with that observed for mtDNA and Y-chromosome, Nei’s h (Nei 1987) was calculated for the digester and non-digester groups, and tested for significant difference (Thomas et al. 2002). In both the Somali and Jaali populations, only the LCT locus revealed a difference in the apportionment of genetic diversity (Table 4). Table 5 shows the results of a combined Fisher’s exact test of the difference in allelic distribution of 15 unlinked autosomal microsatellite markers between the digesters and non-digesters in both the Somali and the Jaali cohorts. There was no evidence of population differentiation between the phenotype groups of either cohort, although the Somali were significantly different from the Jaali.

Table 4 Genetic diversity, Nei’s h, of digesters and non-digesters of the Somali and Jaali cohorts for LCT enhancer (haplotypes), mtDNA (HVS-1 haplotypes) and Y-chromosome (haplotypes composed of 12 loci)
Table 5 Outcome of contingency table comparison between the microsatellite allele distribution in the Somali and Jaali cohorts and the digesters/non-digesters

Association of Intron 13 SNPs with Haplotype

To infer LCT gene haplotypes for each of the enhancer alleles, four LCT haplotype tagging SNPs were typed, allowing haplotypes to be defined according to the nomenclature published by Hollox et al. (2001). Genotype data for the intron 13 variants (see Supplementary Table 3) as well as the haplotyping SNPs in the Somali and Jaali cohorts were pooled with the same data from an additional 553 individuals from a number of geographic locations including Europe, the Middle East, and east and west Africa. The total data set used for haplotype inference included 746 individuals. Table 6 shows the frequency with which a given intron 13 lactase persistence-associated allele was observed on different LCT haplotypes.

Table 6 Frequency with which a given intron 13 lactase–persistence associated allele was observed on different LCT haplotype backgrounds

All but one of the 127 –13910*T alleles were observed on an A haplotype (76 from European individuals, 46 in the Cameroon Fulani and singletons in populations from the Middle East, Sudan and Ethiopia). The single non-A haplotype –13910*T allele was found in a Fulani individual from Cameroon and was inferred to be part of the F haplotype, which may result from a recombination between haplotypes B and A (Hollox et al. 2001). –13915*G was predominantly associated with the C haplotype, and the novel –14009*G allele was found to associate with the X haplotype. –14010*C was very rare and confined to the Somali ethnic group where it occurs on the B haplotype. –13907*G shows more evidence of variation in its LCT haplotype association than the other alleles. While it is observed mostly on the A haplotype (as reported by Enattah et al. 2008), we found a relatively high proportion of alleles (20%) associated with other haplotypes.

Discussion

The most striking outcome of this study is the finding that the LCT enhancer sequence is significantly more heterogeneous in the lactase persistent Somali than in the non-persistent members of the cohort. Reanalysis of the previously collected Jaali cohort shows the same phenomenon. Analysis of other loci (Y chromosome and autosomal STRs) demonstrates that this difference is not attributable to population stratification, and the observation of equally high diversity of mtDNA in the non-digesters as in the digesters excludes the possibility that the non-digesters are genetically more homogeneous. These findings are in dramatic contrast to the situation in Europe, where a single allele causal of lactase persistence is found, at very high frequency, in a genomic region of reduced genetic diversity, and which is a “textbook example” of the classical signal for a positive selective sweep. Here we argue that the increased degree of genetic diversity seen in the lactase persistent group resulting from multiple advantageous mutations is also a consequence of selection.

The clustering of the lactase persistence associated variants in a single short sequence region, the fact that they occur on different haplotype backgrounds, and the substantial degree of genetic differentiation of this region between phenotypically distinct groups, taken together support the conclusion that these changes are of functional importance, but also suggest that the enhancer region affects LCT expression in a complex manner. Most of the persistence associated variants (–13910*T, –13915*G, and –13907*G) cluster around an Oct-1/GATA binding site (Lewinsky et al. 2005), where other rare SNPs have also been reported (–13913T>C/rs41456145 (Ingram et al. 2007); –13914G>A (Tag et al. 2007); and –13908C>T/rs4988236) and it now appears that a similar clustering of SNPs occurs upstream. The –14010*C lactase persistence associated allele is found in the centre of a run of three variable nucleotides, with –14009T>G and –14011C>T (rs4988233) on either side (Fig. 1). It is not yet known how these SNPs increase lactase expression; however, we speculate that while each of the persistence associated variants may have a different effect on protein interactions and/or chromatin structure of the enhancer region, all have the same effect of preventing the process of LCT down-regulation. In this sense, the lactase persistence associated SNPs can be regarded as ‘loss of function’ mutations that lead to a gain in activity. The situation with LCT may be directly analogous to the recently published study of the sonic hedgehog gene (Shh), in which a number of different point mutations in a cis-regulatory region located 1 Mb upstream act as ‘gain of function mutations’ activating ectopic Shh expression (Lettice et al. 2002, 2008), although in that case there is an associated pathology.

It is possible that the MCM6 gene region is susceptible to mutations due to having an open chromatin state because of expression in gametogenesis (Swiech et al. 2007). However, if this is the case and the enhancer alleles are selectively neutral, a similar level of nucleotide diversity should be observed in both lactose digesters and non-digesters.

In this study, we also report the unexpected finding that lactose digestion capacity is not necessarily correlated with milk consumption. This is contrary to our findings that individuals adapted their milk intake to reflect digester status in the Sudanese Jaali cohort (Ingram et al. 2007). The frequency of lactose digesters in this cohort is lower than might have been expected (24%), but agrees well with a large previously published study of the same ethnic group (Flatz 1987). It is possible that in this population, adaptation of the gut flora has occurred, allowing non-persistent individuals to consume lactose without symptoms. We did observe a general trend of lower starting breath hydrogen readings, and a large number of hydrogen non-producers in the Somali cohort, and this may signify increased colonic acidity (possibly due to dietary factors) which can prohibit colonization by hydrogen-producing bacteria (Perman et al. 1981; Vogelsang et al. 1988). Whatever the nature of the adaptation that allows lactose non-digesters to consume milk in large quantities, the observation has implications for interpretation of the genetic pattern observed. This observation and the low lactase persistence frequency together imply that the selective pressure has not acted to drive one particular lactase persistence allele to high frequency, and may indicate that selection has been either weak, or has fluctuated over time, which is possible if lactase persistence is more advantageous during periods of famine and drought. The underlying selective advantage of lactase persistence is still not clearly defined and further work is required in order to understand the circumstances under which selection for the phenotype increases, for example by studying two or three generations of a famine-exposed population.

Each of the alleles described here also occurs in other populations, and the distribution patterns of each allele suggest quite different origins. It is possible that –13907*G arose in Ethiopia, being most frequent in the nomadic camel milking Afar (Supplementary Table 3), but the other alleles probably did not (Ingram et al. 2007; Tishkoff et al. 2007; Imtiaz et al. 2007; Enattah et al. 2008). Therefore, the occurrence of –13910*T, –13915*G, –13907*G and –14010*C together in a single ethnic group may signify past contact between migratory milk-drinking peoples through shared cultural practices.

Haplotype associations of the intron 13 alleles show that each of them is primarily restricted to a single haplotype background. However, it is interesting to note the relatively high proportion of –13907*G alleles that are observed on haplotypes that differ from A (the predominant haplotype, and assumed to be the haplotype upon which –13907*G originally arose). Although our data are consistent with the allele arising on A, as reported by Enattah et al. (2008), who found evidence for extended A haplotypes carrying –13907*G in Middle Eastern populations, our data in the Ethiopian populations show more evidence of disruption of this haplotype, most likely by recombination, and may reflect a longer presence of –13907*G in Africa.

In this study, the breath hydrogen lactose tolerance testing was the most thoroughly conducted recent survey of lactose digestion within an African population, with full 3-h breath hydrogen readings obtained for nearly all participants. All were requested to observe an overnight fast and not smoke, and individuals with elevated starting breath hydrogen who might have deviated from these requirements were excluded. We also took care to exclude from the analysis individuals with ambiguous results, though the raw data are presented here. Despite these strict procedures, four lactose digesters carried no variants in the entire sequenced region. These observations suggest the presence of additional genetic changes outside the enhancer region, or modifying factors in the lactase persistence phenotype. Our studies of the Senegalese Wolof population seem to support these findings, as despite having a calculated persistence allele frequency of 0.30 (published lactose digestion frequency = 0.51, representing 2pq + q²; Arnold et al. 1980), no intron 13 variation has been identified in this group (Supplementary Table 3).
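The calculated allele frequency quoted above for the Wolof can be recovered from the published phenotype frequency under the two assumptions implied by the expression 2pq + q²: Hardy–Weinberg proportions and dominance of the persistence allele (frequency q) over the non-persistence allele (frequency p = 1 − q). Non-digesters then have frequency p², so q = 1 − √(1 − digester frequency). The function name in this minimal sketch is illustrative.

```python
import math


def persistence_allele_frequency(digester_frequency: float) -> float:
    """Frequency q of a dominant persistence allele under Hardy-Weinberg.

    Digesters make up 2pq + q^2 of the population, so non-digesters
    make up p^2 and q = 1 - sqrt(1 - digester_frequency).
    """
    if not 0.0 <= digester_frequency <= 1.0:
        raise ValueError("frequency must lie between 0 and 1")
    p = math.sqrt(1.0 - digester_frequency)
    return 1.0 - p


# Published Wolof lactose digestion frequency (Arnold et al. 1980).
q_wolof = persistence_allele_frequency(0.51)  # about 0.30
```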

The findings reported here provide an important example of multiple rare variants being responsible for a common phenotype. It is interesting to speculate on the reasons why very different patterns of genetic diversity in LCT are found in Europe. In Europe, selection on LCT appears to have been very strong but the reasons for this are still unclear. The calcium assimilation hypothesis, which proposed that the calcium in milk is more advantageous at high latitudes due to reduced incident sunlight and reduced vitamin D synthesis (Flatz and Rotthauwe 1973), was not supported in a recent study (Itan et al. 2009). Itan et al. (2009) propose that the lactase persistence associated allele homogeneity in Europe may be a consequence of underlying demographic processes in addition to strong selection, an interpretation consistent with that of others (Hollox et al. 2001; Gerbault et al. 2009). Demographic constraints are likely to have been different in Africa. Here, human settlement was longer standing, thus preventing population expansion in the same way as took place in Europe.

We propose that the contrasting pattern in Africa may be an example of a ‘soft’ selective sweep, as described by Pennings and Hermisson (2006a, b). Such sweeps can result from high mutation or migration rate, or large effective population size. The migration rate may be of particular importance in pastoralist groups where close regular contact with multiple other communities is to be expected. Pennings and Hermisson (2006a, b) describe soft sweeps involving a low but constant coefficient of selection. Another factor which may also result in allelic diversity is variable selection in time and/or space. The phenomenon of soft sweeps is poorly recognized in human genetics, but is of potential relevance to disease association studies in which independent mutations at the same locus may be involved. However, soft sweeps are more difficult to detect than hard sweeps given the statistical tools currently at our disposal (Pennings and Hermisson 2006b). Our findings illustrate the clinical value of phenotype/genotype research in multiple different groups across Africa and also the need to develop suitable statistical methods that would recognize multiple causative mutations located in close proximity to each other and having a similar effect, which will be increasingly important with the advent of high-throughput whole genome sequencing.