Introduction

South Africa is characterized by extensive population diversity with groups originating from African (79%), Asian (2.5%) and European (9.6%) populations (http://www.statssa.gov.za). This population diversity is the result of both a multi-faceted colonization history (Mountain 2003) and South Africa’s location with respect to major trade routes from the fifteenth to the nineteenth century. The contribution of these previously continentally divided population groups from Europe, Asia and the rest of Africa to South Africa’s diversity led to the establishment of a mixed ancestry population, predominantly in the Western Cape, known today officially as the South African Coloured population (SAC) (Adhikari 2005; Nurse et al. 1985; Van der Ross 1993). This population, which currently comprises approximately 9% of the South African population and 54% of the population of the Western Cape Province, has a complex genetic history, influenced by both the colonization history of South Africa and its historical legislation.

The South African Coloureds have their roots in the indigenous Khoesan [denoting Khoekhoe and San (Boonzaaier et al. 1996; Elphick 1985; Mountain 2003)] native to the Western Cape at the time of first colonization by European settlers of the Dutch East India Company (VOC) in 1652 (Mountain 2003; Shell 1994). After the establishment of a refreshment station at the Cape of Good Hope, now Cape Town, the VOC brought in small numbers of political exiles from Indonesia and Malaysia (Mountain 2004), and slaves from the Indian subcontinent (25.9%), the east coast of Africa (26.4%), Madagascar (25.1%) and Indonesia (22.7%) (Nurse et al. 1985; Shell 1994). These figures were calculated from the records of the slave trade (Shell 1994). The active trade in slaves began in 1658 and continued until the banning of the seaborne slave trade in 1806, with the last recorded illegal imports in 1822 (Shell 1994). In the early 1700s, the slave population in the Cape regularly outnumbered the European settlers (Mountain 2003), and men virtually always outnumbered women in both the slave and free populations (Shell 1994).

The indigenous Khoekhoe were not enslaved, but frequently served as indentured labourers or serfs on the farms (Mountain 2003; Shell 1994). A small, but significant number of women of Khoekhoe or of slave descent and their children were integrated into the colonial household, often by marriage (Mountain 2003; Shell 1994). Mixed marriages, usually between European men and women who were either Khoekhoe, manumitted (freed) slaves or of mixed parentage (Keegan 1996), and between Khoekhoe and slave (Mountain 2003) were socially acceptable in early Cape society. However, in the majority of cases, and particularly after 1700, the progeny of such mixed marriages and liaisons were assimilated into the growing group known as the “Cape Coloureds” (Keegan 1996; Mountain 2003; Nurse et al. 1985), a term used since the mid-nineteenth century (Keegan 1996). These unions were more common in the farming areas, but also occurred in the towns (Mountain 2003; Shell 1994). By the late 1700s, race-based restrictions were common, and these were formalised under the British administration from 1806 (Mountain 2003), when class was more easily overcome in society than race and ancestry (Keegan 1996).

The cohesion of the SAC population was further facilitated by both the establishment of early mission stations (from 1738) amongst Coloured and Khoekhoe populations (Mountain 2004), and by legislation. After emancipation by the British administration (1834–1838), large numbers of ex-slaves and other indigent people settled at mission stations (Mountain 2004), some of which formed the nucleus of a “Coloured group area” (Boonzaaier et al. 1996; Mountain 2003). Many of the Khoesan at these mission stations had European and/or African (particularly Xhosa) ancestry (Keegan 1996). The formalization of the racial order in society began in the late 1700s. From 1910, and particularly 1948–1994, the apartheid regime introduced legislation that outlawed inter-racial marriage and prescribed areas of residence (http://www.sahistory.org.za/pages/chronology/special-chrono/governance/apartheid-legislation.html). This separation of ethnic groups ensured further cohesion of the already established highly admixed SAC population in the Western Cape, the traditional centre of concentration of the Coloured people (Adhikari 2005; Cilliers 1963). The term “ethnic group” is used here according to Barth (1969). Briefly, the term denotes a culturally defined group which identifies itself and is identified by others as constituting a distinguishable category.

The majority of people who self-identify as Coloured are Afrikaans speaking. According to the 2001 census, 81.0% of the Coloured people in the Western Cape were Afrikaans speaking, and 18.6% English speaking, while in our study area of Ravensmead/Uitsig these figures were 90.1 and 9.3% respectively. The population of Ravensmead/Uitsig is 91% Christian, and only 1.5% Muslim (2001 SA census), which raises an important distinction with another population group in South Africa known as the Cape Malays. The latter have their origins in the political exiles brought from the Dutch East Indies (mainly Indonesia) in the 1700s (Mountain 2004). They brought the religion of Islam to South Africa, which served as a unifying force in the community and may have created a genetic subgroup, which is not the focus of our study. The term “Malay”, or alternatively “Cape Muslim”, is used by members of the group to denote their affiliation with Islam. The Malay form a minority group (10.3% of the SAC in the Western Cape) which has not been incorporated into the core structure of the South African Coloured people (Nurse et al. 1985).

The SAC population, characterized by extensive admixture of multiple population sources, provides a unique opportunity to investigate genomic patterns of population admixture. Given that Africa (Conrad et al. 2006; Tishkoff and Williams 2002; Tishkoff and Kidd 2004), and in particular South Africa (Tishkoff et al. 2009), has the most diverse human populations it is imperative that large-scale genome studies of both human demographic history and disease association are carried out using African samples (Campbell and Tishkoff 2008; Tishkoff et al. 2009). Knowledge of the nature of admixture in a population is also important when considering disease association studies on a population such as the SAC (Babb et al. 2007; Barreiro et al. 2006; Cooke et al. 2008; Hoal et al. 2004; Möller et al. 2007, 2009; Rossouw et al. 2003), and comparing these associations with the results found in other ethnic groups. We present the results of a large genome-wide analysis consisting of 959 individuals from the SAC group, genotyped with a panel of 500,000 single-nucleotide polymorphism (SNP) markers, of which nearly 75,000 markers are shared with both the International HapMap Consortium (Frazer et al. 2007; The International HapMap Consortium 2005) and the Human Genome Diversity Project (HGDP) (Cann et al. 2002). This is the first high-resolution SNP study of a large and representative sample of this unique population. Although understanding the demographic history of the SAC population is of interest in its own right, we anticipate that this highly diverse admixed population will provide opportunities for the identification of genes associated with complex diseases in this population and its ancestral source populations. Characterizing the pattern of genetic variation in this study population will provide valuable baseline data for subsequent analysis of disease association.

Materials and methods

Study site and subjects

Study subjects, self-identified as SAC, were enrolled from Ravensmead and Uitsig, two contiguous suburbs of Cape Town, which we subsequently found to be genetically indistinguishable (F_ST = 0.001). In 1962, these suburbs were declared an area for habitation by Coloureds only, under the Group Areas Act of the apartheid government. Although this act was repealed in 1991, 98% of people in these suburbs self-identified as “Coloured” in the 2001 South African census. Informed consent was obtained from all study participants. The study was approved by the Institutional Review Board of Stellenbosch University, Tygerberg, South Africa. Blood was taken and DNA extracted by standard methods.

Sampling, genotyping and genotype calling

All samples and CEU (Utah residents with ancestry from Northern and Western Europe) controls from the International HapMap Project (Frazer et al. 2007; The International HapMap Consortium 2005) were genotyped using the Affymetrix 500k genotyping platform. SNP genotypes were called using the Affymetrix Power Tools pipeline (V1.10.0). First, samples that had a reported NSP/STY concordance rate of <90% were discarded. The dynamic model (DM) algorithm’s call rate was used as an initial quality control measure. CEL files with a call rate of 93% or higher were selected, and used to train probe-specific models using the BRLMM algorithm (Affymetrix 2006). These models were then saved, and used to call all samples with STY and NSP call rates of 70% or higher. Genotype calling performance was determined by measuring concordance of the included HapMap cell line samples with the genotypes of these individuals from the HapMap project (The International HapMap Consortium 2005), and was found to be >99%. Furthermore, four duplicate SAC samples were included in the SNP genotyping experiment, which allowed for validation of the genotype-calling algorithm on SAC samples. Genotype concordance of these SAC samples was >97%. In addition to data generated in this study, we obtained genome-wide SNP data from two additional public data sources: the International HapMap Project (Frazer et al. 2007; The International HapMap Consortium 2005) (http://www.hapmap.org) and the Human Genome Diversity Project (Cann et al. 2002) (HGDP; http://hagsc.org/hgdp/files.html). Populations were chosen from these public data sources to represent putative ancestral populations that may have contributed through admixture to the SAC population. The populations chosen were representative of four major groups, namely (1) European (2) non-Khoesan African (including East African, Bantu and Pygmy populations) (3) Khoesan and (4) Asian (Table 1). 
We reduced the SNPs genotyped in this study to a subset (n = 74,889) shared between SAC and the public data sources (Table 1).
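
The tiered quality-control thresholds described above can be illustrated with a minimal sketch. The `Sample` fields, the function names and the idea of holding per-sample metrics in one record are assumptions made for illustration; they are not the output format of Affymetrix Power Tools, and the sketch does not reproduce the BRLMM calling step itself.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Thresholds quoted in the text above.
MIN_NSP_STY_CONCORDANCE = 0.90    # samples below this were discarded
MIN_DM_CALL_RATE_TRAINING = 0.93  # DM call rate needed to train the BRLMM models
MIN_CALL_RATE_CALLING = 0.70      # NSP and STY call rates needed for final calling


@dataclass
class Sample:
    """Hypothetical per-sample QC record (field names are illustrative)."""
    name: str
    nsp_sty_concordance: float
    dm_call_rate: float
    nsp_call_rate: float
    sty_call_rate: float


def qc_flags(sample: Sample) -> dict:
    """Return whether a sample would be used for model training and for calling."""
    passes = sample.nsp_sty_concordance >= MIN_NSP_STY_CONCORDANCE
    train = passes and sample.dm_call_rate >= MIN_DM_CALL_RATE_TRAINING
    call = (passes
            and sample.nsp_call_rate >= MIN_CALL_RATE_CALLING
            and sample.sty_call_rate >= MIN_CALL_RATE_CALLING)
    return {"train": train, "call": call}


def genotype_concordance(calls_a: Sequence[Optional[str]],
                         calls_b: Sequence[Optional[str]]) -> float:
    """Fraction of identical genotypes among sites called (not None) in both sets."""
    compared = [(a, b) for a, b in zip(calls_a, calls_b)
                if a is not None and b is not None]
    if not compared:
        raise ValueError("no sites were called in both genotype sets")
    return sum(a == b for a, b in compared) / len(compared)
```

The same concordance function would apply to both checks mentioned in the text: HapMap cell lines against their published genotypes, and duplicate SAC samples against each other.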

Table 1 Putative ancestral populations that were included in population structure analysis of South African Coloureds (SAC)

Population structure

Population structure analyses were performed to characterize the genetic contributions to the SAC population. We used STRUCTURE (Falush et al. 2003; Pritchard et al. 2000), which identifies population structure without prior assignment of individuals to populations. STRUCTURE has an upper limit on the number of SNPs that can be analysed, and assumes both Hardy–Weinberg equilibrium and complete linkage equilibrium between adjacent markers (Falush et al. 2003; Pritchard et al. 2000). The selection of highly informative markers reduces the number of genotypes required for the accurate inference of ancestry. Therefore, we selected SNPs from the set of shared markers (n = 74,889) that were ancestry informative for the putative contributions to the SAC, and that were putatively unlinked. We used Rosenberg’s Ancestry Informative Markers (AIMs) selection method (Rosenberg et al. 2003), taking potential linkage into account by selecting AIMs separated by a physical distance of at least 1 Mb. Alternative marker selection strategies, including random selection, random selection accounting for linkage disequilibrium and AIMs not accounting for linkage disequilibrium, were also tested (Table S1).
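
As an illustration of the marker selection described above, the following sketch computes the informativeness for assignment statistic (I_n) of Rosenberg et al. (2003) for a single locus and then thins markers so that selected SNPs on the same chromosome are at least 1 Mb apart. The greedy, highest-score-first thinning and all function and variable names are assumptions; the text does not specify the exact procedure used.

```python
import math


def informativeness(allele_freqs):
    """I_n for one locus.

    allele_freqs[i][j] is the frequency of allele j in source population i.
    I_n = sum_j ( -pbar_j * ln(pbar_j) + sum_i (p_ij / K) * ln(p_ij) ),
    where pbar_j is the mean frequency of allele j over the K populations
    and 0 * ln(0) is taken as 0.
    """
    k = len(allele_freqs)
    total = 0.0
    for j in range(len(allele_freqs[0])):
        mean_p = sum(pop[j] for pop in allele_freqs) / k
        if mean_p > 0:
            total -= mean_p * math.log(mean_p)
        for pop in allele_freqs:
            if pop[j] > 0:
                total += (pop[j] / k) * math.log(pop[j])
    return total


def select_spaced_markers(markers, min_distance_bp=1_000_000, max_markers=None):
    """Greedy thinning (an assumed strategy): take markers in descending score
    order, skipping any closer than min_distance_bp to one already kept.

    markers: iterable of (snp_id, chromosome, position_bp, score) tuples.
    """
    chosen = []
    kept_positions = {}  # chromosome -> positions already selected
    for snp_id, chrom, pos, _score in sorted(markers, key=lambda m: m[3], reverse=True):
        positions = kept_positions.setdefault(chrom, [])
        if all(abs(pos - p) >= min_distance_bp for p in positions):
            positions.append(pos)
            chosen.append(snp_id)
            if max_markers is not None and len(chosen) == max_markers:
                break
    return chosen
```

A locus fixed for different alleles in two populations reaches the two-population maximum of ln 2, whereas a locus with identical frequencies in all populations scores 0.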

We used the admixture model with correlated allele frequencies to investigate the number of populations evident in the combined SAC-HapMap-HGDP dataset (Table 1). Convergence of MCMC chains was assessed with five independent runs (burn-in = 1,000, chain length = 2,500) for each K between 1 and 8. The number of populations (K) was estimated as the number that maximized the probability of the data, and minimized the variance in this probability over successive iterations (Pritchard et al. 2000). For each SNP subset, we estimated the proportions of inferred ancestry for each individual using the optimal number of ancestral populations (K), and plotted these proportions using DISTRUCT (Rosenberg 2004). A potential limitation in estimating proportions of ancestry for SAC in these analyses is that the standard implementation of the admixture model used does not account for linkage disequilibrium due to admixture (Falush et al. 2003), known to be a feature of our study population (Nurse et al. 1985). An alternative model, accounting for linkage disequilibrium due to admixture, provides more accurate estimates of statistical uncertainty in admixed populations, but has runtimes that scale exponentially with the number of ancestral subpopulations (Falush et al. 2003). Therefore, we also estimated ancestral proportions using the linkage model in STRUCTURE, but only for the optimal number of ancestral subpopulations identified in the previous analyses. In this case, we used a larger sample of 10,000 SNPs, since linkage due to admixture is incorporated into the model, but we still maintained a physical distance of at least 10 kb between adjacent SNPs to limit the effect of background linkage disequilibrium. We performed principal component analysis using SMARTPCA in the EIGENSOFT package (Patterson et al. 2006; Price et al. 2006) and included all SNP markers shared between the populations analyzed (n = 74,889). Finally, we used FRAPPE (Tang et al. 2005; Li et al. 2008), which models background linkage equilibrium and thus allows for the inclusion of physically linked SNPs. All 74,889 SNPs were used in a FRAPPE analysis that comprised 10,000 EM iterations with a convergence threshold of 10,000.
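
The comparison of replicate runs across values of K can be sketched as follows. The input is assumed to be a mapping from K to the estimated ln P(D) values of the independent runs, collected by hand from the STRUCTURE output; the function names are hypothetical. The variance computed here is taken across replicate runs, which is a stand-in for the variance criterion in the text and may not be the same quantity.

```python
from statistics import mean, variance


def summarise_runs(log_prob_by_k):
    """Map each K to (mean, sample variance) of ln P(D) over its replicate runs."""
    return {k: (mean(values), variance(values))
            for k, values in log_prob_by_k.items()}


def best_k(log_prob_by_k):
    """Return the K with the highest mean ln P(D).

    The variance is reported by summarise_runs but not used to break ties
    here; weighing the two criteria against each other is left to the analyst.
    """
    summary = summarise_runs(log_prob_by_k)
    return max(summary, key=lambda k: summary[k][0])
```

In practice, a K with a marginally higher mean but a much larger variance across runs would warrant inspection of the individual runs before being preferred.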

Results

Population structure

In the STRUCTURE analyses, including both SAC data and data from potential ancestral populations derived from public sources, the number of populations was estimated as between 4 and 7 (Fig. S1). Not accounting for background linkage disequilibrium between SNPs, by including putatively linked markers (Fig. S1 A, C), resulted in higher estimates of K. The inferred major contributions to the South African Coloured population were consistent (Figs. 1, S2, S3, S4), although the estimate for the number of ancestral populations varied with each subset of SNPs. Consistent with historical data, the four major inferred contributions to SAC were Khoesan Africans, non-Khoesan Africans, Europeans and a smaller Asian contribution. Of these, the Khoesan contribution is the largest under both the linkage model in STRUCTURE and the FRAPPE analyses, followed by European, African and Asian (Fig. 2; Tables 2, S2). A large contribution from Khoesan was inferred despite the fact that data for only a small number of Khoesan individuals (n = 5) were present in the publicly available datasets. Although the Khoesan sample size is small, this is compensated for by the relatively large proportion of Khoesan ancestry in SAC individuals, and by the large sample size of SAC. The STRUCTURE method estimates ancestral population allele frequencies and further allows for individuals’ genomes to be drawn from multiple ancestral populations, thus accounting for admixture. Therefore, the estimation of ancestral population allele frequencies is not based on only a small sample of Khoesan, but on the entire sample of SAC that have Khoesan ancestry. Whilst the small sample of Khoesan assists in the clustering and allele frequency estimation, it is most useful in identifying which of the ancestral populations identified using STRUCTURE are Khoesan (Fig. 1). Indeed, a STRUCTURE analysis without putative parental populations and only using SAC individuals reveals three of the four ancestral contributions to SAC (results not shown). It is likely that the minor Asian contribution is difficult to detect without pure Asian samples. Estimates of Khoesan ancestry proportions in SAC, obtained from the STRUCTURE model which takes admixture linkage disequilibrium into account, were lower than from the admixture model without accounting for admixture linkage disequilibrium (Fig. 2; Table 2). FRAPPE, which uses all loci and accounts for background linkage equilibrium, provided estimates consistent with the admixture model in STRUCTURE. Therefore, differences between the admixture and linkage models in STRUCTURE may be due to the increased computational complexity of the latter.

Fig. 1 Proportion of each individual’s ancestry for the number of ancestral populations from K = 2 to the estimated number of ancestral populations with greatest probability (Fig. S2). Plots shown are for unlinked Ancestry Informative Markers and admixture model. Plots for additional datasets/models are available as Supplementary material (Figs. S2, S3, S4, S5). Population labels are as in Table 1

Fig. 2 Mean, range and 95% confidence limits (notches) on estimated proportions of ancestry for SAC individuals using either an admixture or linkage model. This figure is based on inclusion of all ancestral populations in Table 1

Table 2 Mean and standard error on proportion of ancestry for each of four populations contributing to South African Coloureds (SAC), for admixture and linkage models
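
The summary statistics reported in Table 2 can be derived from per-individual ancestry proportions along the following lines. This is a minimal sketch: the matrix layout (one row per individual, one column per inferred ancestral population) and the function names are assumptions, and any values passed to it in an example would be illustrative rather than study data.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_se(proportions):
    """Mean and standard error (sample SD / sqrt(n)) of one ancestry component."""
    return mean(proportions), stdev(proportions) / sqrt(len(proportions))


def summarise_ancestry(q_matrix, labels):
    """q_matrix: rows are individuals, columns are ancestry proportions.

    Returns {label: (mean, standard_error)} for each column.
    """
    columns = list(zip(*q_matrix))
    return {label: mean_and_se(col) for label, col in zip(labels, columns)}
```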

Principal component analyses showed SAC spanning the variation between Africans and non-Africans along the first pair of eigenvectors (Fig. 4). The PCA results suggest that the ancestral Asian population that contributed to SAC is more closely related to the contemporary Gujarati Indian population than to the Chinese (CHB) and Japanese (JPT) populations from HapMap 3 (Fig. 4), as has been shown previously (Tishkoff et al. 2009).

The 959 individuals investigated in our study have a greater proportion of Khoesan ancestry and lower proportion of both European and Indian ancestry than the 39 individuals genotyped by Tishkoff et al. (2009) who showed approximately equal ancestries of Khoesan, European, Black African and Indian (19–25%), with 8% attributed to East Asian. It is possible that their group of 39 contained a proportion of people from the Cape Malay group, who may have a genetic make-up higher in Indian and lower in Khoesan ancestry, due to greater Indonesian or Malaysian ancestry. The samples used by Tishkoff et al. (2009) were collected from volunteers and blood donors residing (some temporarily) in the Western Cape (MJ Kotze, personal communication), and not from a specific area. Individuals from the SAC group sampled in other areas of South Africa could have a different genetic make-up. An early study of blood group gene frequencies in Cape Town found similar ancestral contributions from European, Black and Asian, but the criteria for inclusion were not clear (Botha 1972).

Discussion

This is the first genome-wide analysis of a large, well-defined set of individuals from the SAC population. Our results illustrate the very high degree of admixture in the SAC population, comprising input from mainly four geographically distant populations. We have genotyped 959 individuals from the SAC population, and selected almost 75,000 markers for population structure analyses. The results that we inferred using STRUCTURE, which suggest the SAC population group to have four major ancestral components, are consistent with the historical record. As expected, Khoesan, European, African and Asian (Indian) populations have contributed to SAC, the proportions of which are dependent on the statistical model used in inference (Fig. 2; Table S2). Differences between the admixture and linkage models are to be expected, since each accounts for different components of linkage disequilibrium. The admixture model ignores linkage disequilibrium along chromosomes as a result of admixture, whereas the linkage model does not (Falush et al. 2003), and thus the latter is a better approximation of the population history of SAC. Nonetheless, the inferred ancestry proportions indicate a substantial contribution from the Khoesan, and considerable variation in ancestry proportions between individuals (Figs. 1, 3). The degree of Khoesan ancestry reflects the role of indigenous Khoesan in the early establishment of the SAC population (Mountain 2003). It could be argued that the rather small Khoesan sample size contributes to uncertainty with respect to estimating ancestral proportions. However, these results are consistent with an independent study with a slightly larger population of Khoesan (Tishkoff et al. 2009). A recent report by Quintana-Murci et al. (2010) comparing maternal and paternal contributions to the SAC has put the Khoesan contribution at over 70%, and about 40%, respectively.

Fig. 3 Proportion of each individual’s ancestry (K = 4) sorted (in ascending order from left to right) by the proportion of ancestry for each of the major contributions to the SAC

Some authors have proposed that the Khoesan people in South Africa are becoming extinct (Mountain 2003). The San in particular endured bouts of genocide from all other groups (Mountain 2003; Shell 1994), and the Khoekhoe society had collapsed completely before 1713 (Elphick 1985), the time of a devastating smallpox epidemic (Nurse et al. 1985). Although many members of the Khoesan existed on the fringes of colonial society (Mountain 2003), many others, particularly the women, were part of the household of the pioneer farmers, in a patriarchal societal system that had elements of slavery, indentured labour and authoritarian family life (Shell 1994). Often Khoekhoe men were bonded labourers on the farms (Keegan 1996; Shell 1994) and integrated into European colonial society (Elphick 1985). The near extinction of the Khoesan, however, is not apparent from our results, given that some SAC individuals harbour large proportions of Khoesan ancestry (Figs. 1, 2), and assuming that the HGDP Khoesan population is a sufficiently pure source of ancestral Khoesan diversity.

In addition to the strong Khoesan contribution to SAC, a large proportion of their ancestry is derived from non-Khoesan Africans (Fig. 1), in particular Bantu-speaking populations. The East African contribution expected was not detected, probably since the populations used here are Bantu-speaking (LWK), admixed with European (MKK as evident in Figs. 1, 4), or since the SNPs used provide insufficient resolution to resolve this contribution. Furthermore, many imported male slaves did not reproduce (Shell 1994), making this expected contribution minor in comparison to southern Bantu-speaking individuals. Although we did not have samples for southern African Bantu, these groups are themselves admixed with the Khoesan (Nurse et al. 1985; Thorp 2000; Tishkoff et al. 2009), which is also evidenced by click consonants in the Xhosa language. Khoesan ancestry was, therefore, assumed to be derived primarily from the Khoesan populations, and in addition from the Bantu-speaking ancestors who also had Khoesan ancestry. Furthermore, substantial input from the European settlers (mainly Dutch, German, British and French), and a smaller contribution from Asia, is evident in the SAC (Fig. 1). This Asian contribution is consistent with 26% of imported slaves originating in East India, mainly Bengal (Shell 1994), and the apparent shared ancestry between populations from East Asia and Gujarati Indians from the Indian subcontinent (Fig. 1). The use of HapMap Gujarati Indians as a proxy for the Indian populations that were the actual ancestral populations in the SAC is supported by the genetic homogeneity of Gujarati and Bengali populations (Tishkoff et al. 2009). One analysis we performed, which included a random subset of SNPs, did detect the Gujarati Indian contribution to SAC (Fig. S2). However, this result may be influenced by the larger proportion of linked SNPs in that analysis, and the inability of STRUCTURE to account for background linkage disequilibrium. Nonetheless, the Indian contribution to SAC is supported by PCA analysis of all 75,000 markers (Fig. 4). Low levels of ancestry from East Asia (CHB/JPT in HapMap) may be ascribed partly to the Chinese who formed part of the “free blacks” (Keegan 1996), a group forming 9% of the Cape Town population by 1821 (Shell 1994). Free blacks were free persons not of European origin, and comprised manumitted slaves, a few political exiles, and several hundred Chinese convicts (Mountain 2004; Shell 1994). Chinese, Indian and Cape-born slaves have also been found to contribute to the Afrikaner population, apart from the predominant European component (Greeff 2007; Heese 1971).

Fig. 4 Plot of the first four eigenvectors in the PCA analysis of SAC, HapMap 3 and HGDP populations selected as putative ancestral populations for the South African Coloured population

The limitations of this study include the use of the Affymetrix 500k SNP chip, containing markers primarily designed for use in Europeans. This could have led to a strong ascertainment bias that may well have influenced the quantitative details of the analyses performed. The resolution of inferred ancestral contributions could certainly be improved with the addition of more suitable ancestral population samples from Malaysia and Indonesia, an appropriate Bantu-speaking population, and a larger sample of San, currently not publicly available. The genotype results concur with the historical record, but in addition provide quantitative information on the extent of the contribution of putative ancestral groups, not obtainable by conventional historical research. The contributions of the parent populations to the present-day SAC population were made at different periods in the past. Our estimates, therefore, reflect the result of their past contributions (after drift or variance in reproductive success), and not the absolute contribution of these different source populations.

In addition to the results presented here being of historical interest, the inferred ancestral contributions are highly relevant for mapping of disease genes. The SAC population in the Western Cape suffers from one of the highest incidence rates of tuberculosis (TB) ever recorded (Kritzinger et al. 2009), and knowledge of their population structure and ancestry could be used to search for TB susceptibility loci through admixture mapping (McKeigue 1997; Montana and Pritchard 2004; Seldin 2007; Zhu et al. 2006, 2008). The SAC population in this study, of which we have genotyped approximately 3%, constitutes an excellent study population for the mapping of TB susceptibility genes, because their ancestral populations have substantially different rates of TB infection and disease (Stead et al. 1990). An essential requirement for admixture mapping is the elucidation of ancestral proportions of the populations involved. Thus, the results reported here will enable the investigation of the impact of admixture on TB susceptibility, for example, and potentially explain the apparent high vulnerability of this population to disease. Furthermore, given the unique composition of this population, novel susceptibility alleles to complex diseases could be identified.