Introduction

Recent advances in genotyping and sequencing technologies have facilitated genome-wide investigations of human genetic variation and provided new insights into population history and genotype–phenotype relationships. Large-scale international collaborative efforts such as the International HapMap Project (The International HapMap Consortium 2005, 2007) (http://www.hapmap.org) and the 1000 Genomes Project (The 1000 Genomes Project Consortium 2011) (http://www.1000genomes.org) have produced the most detailed catalogues to date of human DNA variation that may be useful for future evolutionary and medical studies. Although these projects covered world-wide populations, such as those located in or with ancestry from Europe, East and South Asia, West Africa, and the Americas, few Southeast Asian populations were included. Thus, to understand regional human demographic history and local adaptation, more Southeast Asia data are needed (Lu and Xu 2013).

Our earlier studies and those of other groups suggest that Peninsular Malaysia was critical in the peopling of East and Southeast Asia (The HUGO Pan-Asian SNP Consortium 2009). Thus, it is conceivable that differences in diet, climate, and exposure to pathogens among ethnically and geographically diverse Malaysian ethnic groups may have produced distinct selection pressures. Therefore, population genomic studies of Malaysian populations may advance our understanding of human population history and the genetic basis of adaptation to diverse environments.

Proto-Malay, Senoi and Negrito (Semang) are the three main Orang Asli (indigenous) groups consisting of 18 ethnic subgroups in Peninsular Malaysia (6 subgroups for each main group) (Ang et al. 2012). Among these three groups, Senoi is the largest, and Negrito is the smallest (Lim et al. 2010; Ang et al. 2011). Traits that best distinguish among groups are skin color, hair, and height. Traditionally, Negritos live as hunter-gatherers; Proto-Malays are predominantly farmers, fisherfolk, factory workers and small traders; whereas Senoi communities are involved in plantation and factory work, trading, and to a small extent, hunting and gathering (Fix 1995; Ang et al. 2012).

Previous investigations based on mitochondrial DNA (mtDNA) and the Y chromosome indicated that Negritos are genetically closer to Southeast Asian populations than to hunter-gatherers from other continents, e.g., African Pygmies (Thangaraj et al. 2003), although Negritos closely resemble Pygmies in appearance and living environment. Studies of African Pygmies suggested that their small body size arose from adaptation to tropical rain forests, including food limitation, high temperature and other challenges, and that a convergent adaptation occurred in Negritos (Perry and Dominy 2009; Jarvis et al. 2012; Mendizabal et al. 2012; Lachance et al. 2012); however, the complexity of this adaptation means that we cannot exclude the possibility that the two dark-skinned groups shared a common ancestor. Indeed, previous studies indicated that Negritos might be the earliest settlers in Southeast Asia (Endicott et al. 2003; Thangaraj et al. 2005; Reich et al. 2011).

To extend our knowledge and provide the first comprehensive landscape of population history and local adaptations in ethnically diverse Malaysians, we conducted a genome-wide study of the four Malaysian ethnic groups using the Affymetrix Genome-Wide Human SNP Array 6.0, which interrogates more than 900 K single nucleotide polymorphisms (SNPs) encompassing the entire genome. We first measured population genetic diversity among Malaysian ethnic groups, on both the haplotype (haplotype diversity) and SNP (heterozygosity) levels. We also investigated population demography by examining the effective population size (N e) against time. We next analyzed the population structure of the four ethnic groups together with the other world-wide population samples, to elucidate their genetic relationships and time of divergence with surrounding ethnic groups living in East Asia, Europe and Africa. We then examined the implications of genetic structure and admixture of Malaysian ethnic groups to understand the initial peopling of East Asia and the subsequent human migrations in Asia. Finally, we performed a genome-wide scan for the local adaptation signatures and identified genes related to immune response, human height, skin pigmentation, hair morphology and blood pressure in each ethnic group.

Materials and methods

Populations and samples

After obtaining approvals from all local university research ethics committees, the Ministry of Health and Department of Indigenous Development (Jabatan Kemajuan Orang Asli, JAKOA), and acquiring informed consent from each participant, blood samples of individuals from four Malaysian ethnic groups, containing 17 Malays (MLY, from Melayu Kelantan), 4 Proto-Malays (PML, from Temuan), 17 Senois (SNI, from Temiar), and 46 Negritos (NGO, from 6 subgroups: Lanoh, Kintak, Jehai, Kensiu, Mendriq and Bateq), were collected from different regions of Malaysia. When possible, as determined by national identity cards and interviews with participants and their chieftains, individuals selected were offspring of non-consanguineous marriages between members of the same ethnicity within three generations. These procedures were also in accordance with the ethical standards of the Helsinki Declaration of 1975 (revised in 2000). For the purpose of making comparisons, we examined recent published genome-wide data from HapMap 3 (Olivier 2003) including 116 YRI (Yoruba in Ibadan, Nigeria), 112 CEU (Utah residents with Northern and Western European ancestry from the CEPH collection), 137 CHB (Han Chinese in Beijing, China), 113 JPT (Japanese in Tokyo, Japan), and 88 GIH (Gujarati Indians in Houston, Texas); and unrelated samples from Human Genome Diversity Project (HGDP) (Li et al. 2008) (http://www.hagsc.org/hgdp/): 5 San from Namibia and 17 Papuan from New Guinea. In addition, we integrated samples from Southern Chinese ethnic groups, including 9 Dong, 8 Li, 8 Yao, and 6 Zhuang, of which the genotyping data were previously reported by Lou et al. (2011) and from two Kuwait populations including 42 individuals with Arabian ancestry (ARB) and 22 individuals with Persian ancestry (IRN).

Retrieval of genome-wide association studies data

A list of genes associated with complex traits was compiled from published genome-wide association studies (GWASs) via the National Human Genome Research Institute (NHGRI) GWAS catalog available online (http://www.genome.gov/26525384), including 619 genes in total, of which 480 genes are associated with 9 traits of particular interest: facial morphology, hair color, hair morphology, height, skin pigmentation, skin sensitivity to the sun, melanoma, immune response, and malaria. These complex traits are the most distinctive traits in Malaysian populations, especially in indigenous populations. In field work, we observed substantial differences in the incidence of hypertension between NGO and other groups, so we included 139 genes associated with blood pressure and cardiovascular disease in this compiled gene list, with the intent of exploring the possible mechanism of high incidence of high blood pressure in NGO.

Genotyping and data quality control

Genotyping of 84 Malaysian samples with the Affymetrix Genome-Wide Human SNP Array 6.0 was performed according to the “48 Sample Protocol” (Affymetrix, Genome-Wide Human SNP Nsp/Sty 6.0 User Guide, Rev. 3, 2008, P/N 702504). *CEL files containing raw intensity data were analyzed with Birdsuite version 1.5.3 (Korn et al. 2008). According to RS numbers, 2,973 of the 893,634 SNPs on the array are duplicates. From each duplicate pair, the SNP with fewer missing genotypes was kept for further data filtration. Finally, data for 827,204 autosomal SNPs common to all Malaysian ethnic populations were used for downstream analyses.

To minimize bias from the limited sample size, we combined all 6 NGO subgroups. Then two data filtration strategies were performed separately for each population. First, we excluded nearly 100,000 SNPs with missing rates >0.1, and 1 NGO individual with missing rates >0.1, yielding a first dataset of 721,082 SNPs in 83 individuals (17 MLY, 4 PML, 17 SNI, and 45 NGO). We generated the second dataset by another more rigid filtration, involving 353,440 SNPs with missing rates <0.05 and minor allele frequency >0.02, and 60 individuals (17 MLY, 4 PML, 17 SNI, and 22 NGO) with missing rates <0.05. Both datasets comprised autosomal SNPs meeting the criteria in all the 4 populations.
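The two filtration strategies can be illustrated with a minimal sketch. It assumes genotypes coded as 0, 1 or 2 copies of one designated allele, with None for a missing call; the function names are hypothetical, and computing each individual's missing rate over all SNPs is an assumption rather than a description of the pipeline actually used.

```python
from typing import List, Optional, Tuple

Genotype = Optional[int]  # 0, 1 or 2 copies of one allele; None = missing call


def missing_rate(calls: List[Genotype]) -> float:
    """Fraction of missing calls in a list of genotypes."""
    return sum(g is None for g in calls) / len(calls)


def minor_allele_freq(calls: List[Genotype]) -> float:
    """Minor allele frequency computed from the non-missing genotypes."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0
    p = sum(observed) / (2 * len(observed))
    return min(p, 1 - p)


def qc_filter(matrix: List[List[Genotype]],
              max_missing: float,
              min_maf: Optional[float] = None) -> Tuple[List[int], List[int]]:
    """matrix[i][j] is the genotype of individual i at SNP j.

    Returns (indices of kept individuals, indices of kept SNPs).
    SNPs and individuals whose missing rate exceeds max_missing are dropped;
    if min_maf is given, SNPs with MAF <= min_maf are dropped as well.
    """
    n_ind, n_snp = len(matrix), len(matrix[0])
    keep_snps = []
    for j in range(n_snp):
        column = [matrix[i][j] for i in range(n_ind)]
        if missing_rate(column) > max_missing:
            continue
        if min_maf is not None and minor_allele_freq(column) <= min_maf:
            continue
        keep_snps.append(j)
    keep_inds = [i for i in range(n_ind)
                 if missing_rate(matrix[i]) <= max_missing]
    return keep_inds, keep_snps
```

Under these assumptions, the first dataset would correspond to something like qc_filter(matrix, 0.1) and the second to qc_filter(matrix, 0.05, min_maf=0.02), applied to each population separately and followed by intersecting the retained SNPs across populations.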

A similar procedure was applied to the 31 Southern Chinese samples and 64 Middle East samples genotyped by Affymetrix Genome-Wide Human SNP Array 6.0 following the standard protocols. For each population, samples with missing rates >0.1 and SNPs with missing rates >0.1 were removed. Finally, 688,865 autosomal SNPs shared by 9 Dong, 8 Yao, 8 Li, and 6 Zhuang samples, and 725,896 autosomal SNPs shared by 41 ARB samples (one sample was removed due to a calling rate less than 85 %) and 22 IRN samples were kept for subsequent analyses.

Samples from the HGDP and HapMap 3 databases were also filtered separately. Finally, 1,451,219 SNPs (missing rates <0.05) shared by YRI, CEU, CHB, JPT and GIH, and 638,931 SNPs (missing rates <0.1) shared by Papuan and San were included. No individual had a missing rate >0.1, so none was excluded.

For different purposes of analysis, several combined sub-datasets were generated by employing different data filtration criteria (Table S1).

Haplotype diversity and population heterozygosity

Haplotypes for the 22 autosomes were inferred for each individual from its genotypes with BEAGLE (Browning and Browning 2009) version 3.3.2. We used 651,695 SNPs shared by YRI, CEU, CHB, JPT and 4 Malaysian ethnic groups to estimate the haplotype frequency with a sliding window ranging from 10 to 250 kb along the genome, and measured haplotype diversity based on these estimated haplotype frequencies following Nei (1987). This measure of haplotype diversity is analogous to the heterozygosity at a single SNP. To provide an overall measurement for each population, we averaged SNP-specific heterozygosity as the population heterozygosity.
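As an illustration of the diversity measure, the sketch below computes the sample-size-corrected gene diversity commonly attributed to Nei (1987), H = n/(n − 1) × (1 − Σ x_i²), where x_i is the frequency of the i-th haplotype among n sampled chromosomes within a window. The function name is hypothetical, and the window handling used in the study is not reproduced here.

```python
from collections import Counter
from typing import Hashable, Sequence


def nei_diversity(haplotypes: Sequence[Hashable]) -> float:
    """Unbiased gene diversity for a sample of haplotypes (or alleles).

    haplotypes: one entry per sampled chromosome, e.g. the string of
    alleles carried by that chromosome within a sliding window.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two chromosomes are required")
    counts = Counter(haplotypes)
    # Sum of squared haplotype frequencies (expected homozygosity).
    homozygosity = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - homozygosity)
```

Applied to the alleles at a single SNP instead of multi-SNP haplotypes, the same function returns the expected heterozygosity at that SNP, which is why the two measures are analogous.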

Effective population size and divergence time

In total, 214,852 SNPs were shared by YRI, CEU, CHB, JPT and samples from Malaysian ethnic groups. In the Malaysian samples, one of the 22 Jehai (JEH) individuals was included in the 22 NGO. We phased the genotypes of these 214,852 SNPs, all of which were biallelic, to estimate the effective population size (N e) and divergence time (McEvoy et al. 2011). To estimate N e, r 2 was calculated using PLINK (Purcell et al. 2007) version 1.07 to measure linkage disequilibrium (LD) between SNPs. In a finite population, genetic drift and recombination are the two main contributors to the LD pattern across the genome, with the former increasing LD and the latter causing LD to decay. Because recombination breaks down LD faster between distant SNPs, LD between distant SNPs reflects recent N e, whereas LD between adjacent SNPs chiefly reflects ancient N e. Knowing N e, we could estimate the divergence time between any two populations, because N e and divergence time jointly determine the amount of genetic drift under neutral evolution theory, which in turn determines the population genetic differentiation as measured by F ST.
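The relationships described above can be sketched numerically. The code below uses widely cited approximations that are assumed here and not confirmed as the exact formulas of this study: E[r 2] ≈ 1/(4 N e c + 2) for SNPs separated by recombination distance c (in Morgans), with LD at distance c reflecting N e about 1/(2c) generations ago, and F ST ≈ 1 − exp(−T/(2 N e)) for two populations that diverged T generations ago under pure drift. All function names are hypothetical.

```python
import math
from typing import Optional


def ne_from_ld(mean_r2: float, c_morgans: float,
               n_chromosomes: Optional[int] = None) -> float:
    """Effective population size from mean r^2 at recombination distance c."""
    if n_chromosomes is not None:
        # Remove the contribution of finite sample size to observed r^2.
        mean_r2 = mean_r2 - 1.0 / n_chromosomes
    if mean_r2 <= 0 or c_morgans <= 0:
        raise ValueError("r^2 and recombination distance must be positive")
    return (1.0 / mean_r2 - 2.0) / (4.0 * c_morgans)


def generations_ago(c_morgans: float) -> float:
    """Approximate age (in generations) of the N e reflected at distance c."""
    return 1.0 / (2.0 * c_morgans)


def divergence_generations(fst: float, ne: float) -> float:
    """Divergence time in generations implied by F ST under drift alone."""
    if not 0.0 <= fst < 1.0:
        raise ValueError("F ST must lie in [0, 1)")
    return -2.0 * ne * math.log(1.0 - fst)
```

With a generation time of 25 years, generations_ago gives 10,000 generations (250 KYA) for 0.005 cM and 200 generations (5 KYA) for 0.25 cM. For small F ST, divergence_generations reduces to the linear approximation T ≈ 2 N e F ST.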

Analysis of population structure

The 289,998 autosomal SNPs shared by YRI, CEU, CHB, JPT and Malaysian ethnic groups were used for calculating the global F ST (Weir and Hill 2002) to measure population differentiation, with standard deviations over loci calculated by bootstrapping with 1,000 re-sampling replications, and for constructing the phylogenetic tree with PHYLIP version 3.69 (http://www.phylip.com/) and MEGA (Tamura et al. 2011) version 5.05. Then, 25,347 independent SNPs randomly sampled from the overall 289,998 SNPs were used for principal component analysis (PCA) at the individual level with EIGENSTRAT (Price et al. 2006) version 2.0, and also for genetic structure analysis which was performed using a model-based clustering method with STRUCTURE (Evanno et al. 2005) version 2.3. For population structure analysis, we randomly selected 80 samples from YRI, CEU, CHB and JPT (20 for each), to create a comparable population size to Malaysian ethnic groups.
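The bootstrap over loci can be sketched as follows. For brevity the example uses a simple ratio-of-sums F ST estimator (the Hudson-type estimator) as a stand-in; it is not the Weir and Hill (2002) estimator used in this study, and the function names, the seed, and the use of allele frequencies as input are assumptions for illustration.

```python
import random
import statistics
from typing import List, Tuple


def hudson_components(p1: float, p2: float, n1: int, n2: int) -> Tuple[float, float]:
    """Per-SNP numerator and denominator of a Hudson-type F ST estimator.

    p1, p2: allele frequencies in the two populations;
    n1, n2: numbers of sampled chromosomes in each population.
    """
    num = ((p1 - p2) ** 2
           - p1 * (1 - p1) / (n1 - 1)
           - p2 * (1 - p2) / (n2 - 1))
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num, den


def global_fst(components: List[Tuple[float, float]]) -> float:
    """Genome-wide F ST as a ratio of sums over loci."""
    return sum(n for n, _ in components) / sum(d for _, d in components)


def bootstrap_fst_sd(components: List[Tuple[float, float]],
                     replicates: int = 1000, seed: int = 0) -> float:
    """Standard deviation of global F ST over bootstrap resamples of loci."""
    rng = random.Random(seed)
    k = len(components)
    estimates = []
    for _ in range(replicates):
        # Resample loci with replacement, keeping the number of loci fixed.
        sample = [components[rng.randrange(k)] for _ in range(k)]
        estimates.append(global_fst(sample))
    return statistics.stdev(estimates)
```

Resampling individual SNPs treats loci as independent; with dense arrays, resampling contiguous blocks of SNPs is a common alternative when LD between neighboring loci is a concern.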

Population admixture analysis

To detect population admixture, 145,852 SNPs shared by all Malaysian samples, and those from HGDP and the HapMap 3 database, were integrated for the 4 Population Test (Reich et al. 2009) which was conducted with TreeMix (Pickrell and Pritchard 2012) version 1.1. Then, to estimate population admixture time, we also incorporated samples from Southern Chinese ethnic groups and Kuwaitis but excluded samples from HGDP, and applied an LD-based method [ALDER (Loh et al. 2013)] to the shared 463,912 SNPs.

F ST, locus-specific branch length, integrated haplotype score calculation, and gene functional enrichment analysis

We took 651,695 SNPs shared by YRI, CEU, CHB, JPT and Malaysian ethnic groups for the unbiased estimate of SNP-specific pairwise F ST according to Weir and Hill (2002), and for computing locus-specific branch length (LSBL) (Shriver et al. 2004). We used phased genotypes of 700,359 SNPs with a known ancestral allele to calculate integrated haplotype scores (iHS) for each Malaysian ethnic group. Standardized iHS statistics (Sabeti et al. 2007) were calculated for each Malaysian group separately using the R package rehh (Gautier and Vitalis 2012). First, we identified a list of SNPs significantly different between each Malaysian ethnicity and other groups (by F ST and LSBL analyses) or within each Malaysian ethnic group (by iHS analysis). Then we focused on the top signatures, i.e. the top 0.1 % SNPs showing the most extreme differentiation in each comparing pair on F ST and LSBL statistics, and the top 1 % genomic regions in each single group on iHS. Correspondingly, genes covering those significant SNPs or falling within those genomic regions were regarded as candidate genes. Next, functional enrichment analysis was performed to identify putative signatures under local adaptation using DAVID Bioinformatics Resources 6.7 (http://david.abcc.ncifcrf.gov/). It was conducted independently on different lists of candidate genes identified in different pairs of populations, using 5 categories: functional category (SP_PIR_KEYWORDS), gene ontology (GOTERM_BP_FAT), pathways (KEGG_PATHWAY), protein domains (INTERPRO) and disease (OMIM_DISEASE).
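To make the LSBL statistic and the empirical cutoff concrete, the sketch below assumes the usual three-population definition, LSBL(A) = (F ST(A,B) + F ST(A,C) − F ST(B,C)) / 2, computed per SNP, and then selects the SNPs in the top fraction of a score distribution (0.001 for the top 0.1 %). The function names are hypothetical and ties at the cutoff are broken arbitrarily.

```python
import math
from typing import List, Sequence


def lsbl(fst_ab: float, fst_ac: float, fst_bc: float) -> float:
    """Locus-specific branch length of population A relative to B and C."""
    return (fst_ab + fst_ac - fst_bc) / 2.0


def top_fraction(scores: Sequence[float], fraction: float) -> List[int]:
    """Indices of the highest-scoring SNPs that make up the given fraction."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must lie in (0, 1]")
    k = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Genes overlapping the SNPs returned by top_fraction would then form the candidate list passed to the enrichment analysis.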

Results

Genetic diversity of Malaysian ethnic groups

Population genetic diversity differs greatly across continents. As expected, YRI harbored the most diverse haplotypes and the largest heterozygosity among all continental groups (Fig. 1; Table S2). Collectively, Malaysian ethnicities were genetically more diverse than East Asians with an identical sample size, reflecting the complex Malaysian population history. However, patterns differed among individual Malaysian populations. Unlike MLY, the 3 Orang Asli groups (PML, SNI and NGO) were not obviously more diverse than East Asians; in particular, SNI and PML were less diverse than any other world-wide population, which indicates longer isolation and inbreeding. To examine whether potential ascertainment bias could affect our analysis of genetic diversity, we randomly sampled 142,766 SNPs based on the minor allele frequency (MAF) spectrum of the full sequence data obtained from the 1000 Genomes Project (Fig. S1), and used the 142,711 SNPs with MAF larger than 0.05 in all the populations to re-calculate haplotype diversity and heterozygosity. We observed slight differences between the results based on the re-sampled data and those obtained from the original data, so ascertainment bias could exist. However, estimation of haplotype diversity was not significantly affected, since the relative diversity of populations showed the same trend as in the original data (Fig. S2). To correct possible overestimation of genetic diversity caused by population substructure in NGO, we performed an independent estimation for JEH, the NGO subgroup with the largest sample size in this study. The genetic diversity observed in JEH and that in the entire NGO group were very similar.

Fig. 1

Comparison of genetic diversity of populations based on haplotype diversity and average SNP heterozygosity. Haplotype diversity of a groups of populations and b single populations calculated through 10–250 kb sliding windows across the genome. Heterozygosity of c groups of populations and d single populations. MLS contains MLY, PML, SNI and NGO; EAS contains CHB and JPT. All groups are of equal sample size in each figure, e.g., 83 samples for each population in a, c, and 17 samples for each population, except PML (four samples) in b, d. (95 % confidence intervals error bars)

This genetic diversity pattern was also observed in estimates of N e based on the LD pattern (Fig. S3; Table S3). Great spatial and temporal variance of N e is depicted in Fig. 2a. Figure 2b shows the harmonic mean of the estimated N e, representing the long-term N e, for each population from 250 to 5 KYA. These ranged from 13,900 in YRI to 1,200 in PML, with Orang Asli groups apparently having a much smaller N e than all the other populations. Overall, our results supported the “Out of Africa” hypothesis, since all the non-African populations showed a significant decline of N e (Fig. 2b), suggesting that they experienced a shared bottleneck event, although some population-specific bottlenecks might also have occurred subsequently. As recently as 20,000 years ago, a rapid expansion of African (YRI), European (CEU), East Asian (CHB and JPT) and Malay (MLY) populations began and has continued until the present. However, a slight population decrease was observed in the Malaysian Orang Asli groups, whose N e was less than 2,500 around 10 to 5 KYA (Fig. 2a). The Malay group may have expanded due to gene flow from Orang Asli tribes or from other European or East Asian populations. Considering the harmonic mean of the N e estimates from 100 to 30 KYA, we observed an excess of N e in Negritos (NGO and JEH) compared to East Asians (CHB and JPT) (Fig. 2c). These results supported conclusions from archeological and anthropological studies that, among modern Asians, Negritos could be descendants of more ancient migrants to Asia. A long isolation and a reliance on hunting and gathering may have kept the Negrito population size relatively small.

Fig. 2

Estimation of effective population size (N e). a N e over time. N e of Malaysian Orang Asli populations decreased in the recent thousands of years. The harmonic mean of N e, b from 4–200 KYA and c from 25–100 KYA. For each population, N e is the average over recombination distance classes b from 0.005–0.25 cM and c from 0.015–0.035 cM. (Standard deviations error bars)

Genetic relationships and population structure

To understand causes of unique population genetic diversity patterns in Malaysia, we investigated genetic relationships and population structures of Malaysian populations. Initial PCA of YRI, CEU, CHB, JPT and Malaysian ethnic groups revealed continental genetic variations (Fig. 3a). The Asian cluster was divided into three main clusters (East Asian, Malays&Proto-Malays, and Senois&Negritos) on PC1, with MLY&PML lying between East Asians (CHB and JPT) and NGO&SNI, suggesting that Malays and Proto-Malays may have experienced admixture from East Asians (Fig. 3b). NGO was widely dispersed along PC2, suggesting possible substructure within it. Meanwhile, SNI was distinguished from NGO to some extent in Fig. 3b, but remained close to NGO even at higher level PCs excluding East Asians (Fig. S4), which clearly distinguished SNI&NGO and MLY&PML. Even though NGO had substructures that were perhaps caused by long subgroup isolation, on each PC level, they were clustered as a whole group with little overlap with other populations, suggesting that the difference between NGO and other populations was larger than that within NGO. Thus, combining all NGO subgroups together to compensate for the limited sample size in each subgroup was acceptable.

Fig. 3

Population structure and genetic relatedness of eight populations. a PCA of 538 individuals representing eight world-wide populations showing that Malaysian populations are genetically closer to East Asians. b PCA of 310 individuals representing 6 Asian populations showing a separation among East Asian, Malay&Proto-Malay, and Senoi&Negrito along PC1. c Population structure inferred from STRUCTURE shown in the plot adjacent to the corresponding branch in the tree. The number of clusters is indicated by the value of K

A neighbor-joining tree was constructed based on the pairwise global F ST between populations (Fig. 3c; Table S4). The tree revealed genetic relationships among these populations similar to those depicted in the PCA, and the geographic relationships were fully reflected in the phylogenetic relationships. According to the tree topology, MLY is closer to CHB than to PML, but the global F ST between MLY and CHB (F ST = 0.015) and that between MLY and PML (F ST = 0.016) were not significantly different (Student’s t test, T = 0.95, p > 0.1). The distance between CHB and PML (F ST = 0.027) was close to the sum of the distances from MLY to CHB and from MLY to PML (F ST = 0.031), suggesting admixture into modern MLY from East Asians and PML.

Next, we investigated the genetic makeup of Malaysian ethnic groups using hierarchical STRUCTURE analysis. At K = 3, all individuals were segregated into three continental groups with all Malaysian populations united with East Asians by a dominant component. At K = 4, a new component, accounting for a major proportion in NGO, appeared in all Malaysian ethnic groups, decreasing from SNI&NGO to MLY&PML, then to East Asians. From this gradient, we can infer a decrease of East Asian genomic contribution from MLY&PML to SNI&NGO in Malaysia, which was confirmed in the population structure analysis for only Asian populations (Fig. S5a), and only Malaysian ethnic groups (Fig. S5b). With increasing K (from K = 4 to K = 8, data not shown), additional components emerged in NGO. These observations suggest that Malaysian population shares a common origin with East Asians.

Population divergence time of Malaysian ethnic groups

To understand how a general genomic landscape of relationships among all sample groups was established, we estimated the divergence time of each pair of groups with an N e- and F ST-based method (see “Materials and methods”). Divergence time (T) estimates are shown in Table 1. After the “Out of Africa” event, non-Africans diverged into Europeans and Asians, with a divergence time of 60 KYA between Africans and Europeans, and 64 KYA between Africans and East Asians. Then various populations derived from the Asian group: the divergence times of Malaysian populations from Africans, Europeans and Asians decreased in that order (T YRI-Malaysian ≈ 60 KYA, T CEU-Malaysian ≈ 20 KYA, T CHB-Malaysian ≈ 10 KYA). Among the four Malaysian ethnic groups, the shortest divergence time was that between SNI and NGO (4 KYA). SNI also seemed to be closer to MLY&PML than NGO was, with T estimated as 5–8 KYA, shorter than the 6–10 KYA estimated between NGO and MLY&PML. Collectively, compared with the splits of different continental groups following the “Out of Africa” event, the divergence within the Malaysian population occurred more recently, and Malaysian ethnic groups shared an evolutionary branch with East Asians.

Table 1 Divergence time of each pair of populations

Population admixture

Southeast Asia is a region of great historical and genetic diversity, and it has undergone extensive population admixture events (The HUGO Pan-Asian SNP Consortium 2009). Consistent with this, the population structure analysis (Fig. 3c) revealed admixture patterns in Malaysian populations, even though Malaysians shared overall similarity with East Asians. To illustrate, at every level of K, Malaysian ethnicities, especially MLY and PML, harbored a small proportion of European and African components. To confirm the admixture in Malaysian groups, we applied the 4 Population Test based on the phylogenetic tree (Fig. 3c). In these analyses, to minimize the inter-population effect between our target populations and others, we selected San and Papuan, indigenous populations of southern Africa and New Guinea, respectively, as reference populations because they are both distant from other world-wide populations (Magalhães et al. 2012).

We performed the 4 Population Test on San, YRI/CEU, Malaysian ethnic groups and Papuan, which was denoted as f 4 (San, X; M, Papuan) with M representing one of the four Malaysian ethnic groups, and X as YRI or CEU. Inconsistent with the tree, the values were all negative for MLY, PML, SNI and NGO (p < 0.01) whether considering YRI or CEU (Table S5), implying possible gene flow from both Africans and Europeans to the four ethnic groups. So the phylogenetic tree reconstructed as such directly from modern genomic data did not provide an excellent fit to the evolutionary relationships of these populations. Some admixture events did happen between Malaysian populations and other continental groups as indicated by STRUCTURE analysis.
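The statistic behind the 4 Population Test can be sketched as follows, assuming the standard definition f 4 (A, B; C, D) = mean over SNPs of (p A − p B)(p C − p D), where p denotes the population allele frequency at each SNP. The standard error is estimated here with a simple delete-one-block jackknife over contiguous SNP blocks; the function name and the default number of blocks are assumptions for illustration, not the TreeMix implementation.

```python
import math
from typing import Sequence, Tuple


def f4_with_se(pa: Sequence[float], pb: Sequence[float],
               pc: Sequence[float], pd: Sequence[float],
               n_blocks: int = 10) -> Tuple[float, float]:
    """Return (f4 estimate, block-jackknife standard error).

    pa..pd hold the allele frequencies of populations A..D at the same SNPs,
    ordered along the genome so that blocks are contiguous.
    """
    terms = [(a - b) * (c - d) for a, b, c, d in zip(pa, pb, pc, pd)]
    n = len(terms)
    if n_blocks < 2 or n < n_blocks:
        raise ValueError("need at least 2 blocks and one SNP per block")
    total = sum(terms)
    estimate = total / n
    # Block boundaries; integer arithmetic keeps every block non-empty.
    bounds = [i * n // n_blocks for i in range(n_blocks + 1)]
    leave_one_out = []
    for i in range(n_blocks):
        block = terms[bounds[i]:bounds[i + 1]]
        leave_one_out.append((total - sum(block)) / (n - len(block)))
    mean_loo = sum(leave_one_out) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum(
        (x - mean_loo) ** 2 for x in leave_one_out)
    return estimate, math.sqrt(variance)
```

Dividing the estimate by its standard error gives a Z-score. If the tree grouping (A, B) and (C, D) holds without gene flow, the expected value is zero, so a value that departs significantly from zero indicates a violation of that tree.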

To estimate the admixture proportion and date, we examined the LD decay in the four Malaysian admixed groups. Because LD decays at a constant rate as recombination breaks down the contiguous chromosomal blocks inherited from the ancestral mixing populations, we could date the admixture events from the admixture LD between SNPs (Chakraborty 1988) (see “Materials and methods”). East Asians contributed more to the Malaysian populations than Africans and Europeans did, reaching 12–29 % in MLY, 12–17 % in SNI and 19–26 % in NGO. Southern Chinese ethnic groups, e.g., Dong, Li, Yao and Zhuang, had more interactions with Malaysian populations due to their proximity to Malaysia (Table 2). Assuming a generation time of 25 years, which was applied throughout this study, genetic admixture between East Asians and Malay occurred ~8–18 generations (200–450 years) ago. This was later than the admixture between Malay and Orang Asli groups, estimated at around 21–31 generations (525–775 years) ago, and also much later than that between Orang Asli groups and people with East Asian ancestry, which happened ~23–40 generations (575–1,000 years) ago. These results indicated that Malaysian Orang Asli groups interacted with people of East Asian ancestry even earlier than the ancestors of Malay people did, which was unexpected given that the Malay are geographically much closer to where the Orang Asli groups reside. However, it was unlikely that the East Asian ancestry in Orang Asli groups arrived via gene flow from Malay, since the estimated gene flow from East Asians to Orang Asli groups (about 20 %, Table 2) was much higher than that from East Asians to Malay (about 10 %, Table 2).
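The dating logic can be illustrated with a simplified sketch. It assumes that admixture LD decays as A × exp(−n × d), where d is the genetic distance between SNPs in Morgans and n is the number of generations since admixture, and recovers n by a log-linear least-squares fit. This is a simplification of what ALDER does (there is no affine term and no weighting), and the function names are hypothetical. The 25-year generation time is the value used throughout this study.

```python
import math
from typing import Sequence

GENERATION_YEARS = 25  # generation time assumed throughout the study


def fit_admixture_generations(distances_morgans: Sequence[float],
                              admixture_ld: Sequence[float]) -> float:
    """Generations since admixture from an exponential LD decay curve.

    Fits ln(LD) = ln(A) - n * d by ordinary least squares and returns n.
    All LD values must be positive for the logarithm to be defined.
    """
    xs = list(distances_morgans)
    ys = [math.log(v) for v in admixture_ld]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


def generations_to_years(generations: float) -> float:
    """Convert generations to years using the assumed generation time."""
    return generations * GENERATION_YEARS
```

For example, generations_to_years(8) and generations_to_years(18) give 200 and 450 years, the range reported above for admixture between East Asians and Malay.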

Table 2 Measurement of gene flow from East and South Asians, Europeans and Africans to Malaysian populations, and those among Malaysian populations

Taking YRI as the ancestral population, we observed some gene flow with a proportion around 1–2 % in MLY, SNI and NGO. The dates are 40, 28 and 46 generations (1,000, 700 and 1,150 years) ago, for MLY, SNI and NGO, respectively (Table 2). The European mixture proportions are larger than those of Africans, ranging from 3 % in SNI to 8 % in MLY. Our estimates for the dates are 36 generations (900 years) ago for MLY, 48 generations (1,200 years) ago for SNI, and 46 generations (1,150 years) ago for NGO. The European contribution to Malaysians could have mainly resulted from more recent colonization events, starting from 1600s beginning with the Portuguese, followed by the Dutch and eventually the British. Because ancient literature suggested interactions of Middle East and India with the Malay land (Andaya 2002), we replaced CEU with Arabs and Iranians from Kuwait and Indians, and obtained a relatively more obvious admixture estimate (Table 2). However, we failed to measure the gene flow to PML, which could not be large according to the STRUCTURE analysis. Overall, although Malaysian populations are closer to other Asian populations genetically, they harbored a fraction of African or European genomes, resulting presumably from recent admixture with Africans and Europeans.

Although we could not detect genetic interactions within Malaysia because of the lack of proper reference populations, we documented gene flow among Malaysian populations. On one hand, they are geographically close enough to create frequent admixture, and on the other hand, some admixture patterns were easily found in the PCA and population structure analyses (Fig. 3). Because ALDER is unsuitable for estimating the mixture between populations that are genetically too close, we simply measured the gene flow from MLY and PML to SNI and NGO. We found that MLY and PML have the largest mixture proportion (14–31 %) with SNI and NGO among all the analyzed populations, and that they mixed most recently, around 21–31 generations (525–775 years) ago.

Our estimation of admixture time indicated that the genetic contribution of western Eurasian ancestry (from Europeans) to Malaysian people could have started more than 1,000 years ago, which was much earlier than historical records suggest (Andaya 2002). However, this may not be surprising, since genetic admixture often occurs before it is observed and recorded. In summary, our analysis identified recent gene flow to Peninsular Malaysia from western Eurasian and African populations, and these results suggested a great impact of population admixture on the genetic makeup of human groups in Peninsular Malaysia and surrounding regions.

Signatures of local adaptation in Malaysian ethnic groups

Besides the unique genetic structure and human migration, local adaptation contributes to the genetic diversity of Malaysians, because it allows allele accumulation and thus provides competitive advantages in specific environments of different geographic regions. The four Malaysian ethnic groups in this study differ significantly in appearance (Ang et al. 2012), and the most distinctive phenotypes are height, skin color, and hair. More interestingly, the average blood pressure was higher in NGO compared to PML and SNI (p < 0.0001), with the highest average blood pressure reaching 140/91 mmHg in Bateq, an NGO subgroup. The average blood pressure of NGO was also higher than the world-wide normal blood pressure according to WHO guidelines (120/80 mmHg is normal, and ≥140/90 mmHg is high blood pressure; from http://www.euro.who.int/en/home).

To identify signatures of local adaptations in Malaysians, we applied both allele frequency-based (F ST and LSBL) and haplotype-based (iHS) methods to search for footprints of natural selection in genomic data (see “Materials and methods”). We took F ST as the primary statistic in our study to utilize the most intuitive and direct information in the data, since the accuracy of iHS may be affected by fluctuations in sample size and SNP density. The terms significantly enriched among the candidate genes identified by the F ST-based method (Benjamini FDR corrected p < 0.05) are summarized in Table S6. We found that candidate genes in all the comparisons were most significantly related to nervous system development, which mainly involved neuron differentiation and projection, and synapse organization and transmission, reflecting neural adaptation to various stimuli in different environments. Also, many genes were enriched in the annotation cluster for ion transport. For instance, all Malaysian ethnicities differ substantially from East Asians and Europeans with respect to calcium homeostasis, which could affect many aspects of human development and metabolism (formation of bones/teeth, muscle fiber contraction, and hormone secretion). In particular, when comparing NGO with other populations (PML, CEU, CHB and JPT), we found signatures in genes that function in sodium transport. Sodium transport, specifically the Na+–K+ pump, helps to maintain blood pressure, and an increase of sodium in the extracellular space may lead to hypertension (Llaurado 1983; The DASH-Sodium Collaborative Research Group 2001). This may explain the high blood pressure observed in NGO samples. Interestingly, signals at some immunoglobulin-related genes appeared in Malaysian ethnicities, probably due to exposure to a distinct microbial environment. In addition, genes associated with cell adhesion and epidermal growth factor (EGF) were also enriched in the Malaysian ethnic groups.

Nearly 70 % of the candidate genes identified in the FST-based analysis were confirmed by the LSBL-based method, a much higher proportion than for iHS: only 30 % of the candidate genes identified by the FST-based method were replicated in the iHS-based analyses (Fig. S6). Nevertheless, we obtained a list of genes possibly under natural selection that were identified by all three methods. Our functional annotation showed that these genes play various roles in different biological processes. Some candidate genes may support functional adaptations of indigenous populations to life in the rain forests. First, owing to the specific diets of Orang Asli groups, especially hunter-gatherers, several genes contributing to nutrient uptake, and even some related to type 2 diabetes and obesity, had significant natural selection signatures, such as LRP2 in SNI and NGO, DLC1 in NGO, ZMAT4 in SNI, and TCF7L2 in PML. Second, tropical forest environments, with their distinct temperatures and ultraviolet radiation (which can produce unique external injuries), contribute to differences among indigenous populations, especially with regard to genetic variants associated with responses to UV and temperature stimuli, wound healing, sensory and visual perception, and behavioral response. For example, among the candidate genes of SNI, PTPRK and XYLT1 are related to abiotic stimulus responses, and NRG1 is related to wound healing. In PML, KCNMA1 and TTC8 function in sensory organ development. We also identified genes related to some hunter-gatherer-specific phenotypes (height and facial morphology). Notable examples are PNPT1 and ETV6 in NGO and MAML2 in SNI, all of which are related to adult human height, and PLEKHA1, which is associated with facial morphology and was identified in SNI. Moreover, some genes we identified were candidate genes in previous genome-wide association studies, mainly for neurological control and cancer. For instance, RGL1 and A2BP1, which have been linked to attention deficit hyperactivity disorder (ADHD) (Anney et al. 2008; Sonuga-Barke et al. 2008), were identified here in MLY and NGO, respectively. Other significant genes repeatedly identified in both FST-based and LSBL-based analyses, and in both FST-based and iHS-based analyses, are summarized in Table S7 and Table S8, respectively.
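The replication proportions reported above amount to set overlaps between candidate gene lists. The sketch below shows one way such proportions could be computed; the helper `replication_rate` and the placeholder gene identifiers are hypothetical and are not the candidate lists of this study.

```python
def replication_rate(primary, other):
    """Fraction of genes in `primary` that also appear in `other`."""
    primary, other = set(primary), set(other)
    if not primary:
        return 0.0
    return len(primary & other) / len(primary)


if __name__ == "__main__":
    # Placeholder identifiers standing in for candidate genes (toy data).
    fst_genes = {"G%d" % i for i in range(10)}   # G0 .. G9
    lsbl_genes = {"G%d" % i for i in range(7)}   # G0 .. G6
    ihs_genes = {"G0", "G1", "G8", "G20"}

    print(replication_rate(fst_genes, lsbl_genes))   # share confirmed by LSBL
    print(replication_rate(fst_genes, ihs_genes))    # share confirmed by iHS
    # Genes supported by all three methods:
    print(sorted(fst_genes & lsbl_genes & ihs_genes))
```

Taking the FST list as the denominator mirrors the choice of FST as the primary statistic; using a different list as the reference would give a different proportion.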

Candidate genes for complex traits

Although we identified genes with significant natural selection signatures related to many important biological processes, background noise prevented de novo screening from identifying genes that contribute to the phenotypic diversity we observed in Malaysian ethnic groups. These promising signatures could have been masked by: (1) the long history of the Orang Asli groups; and (2) the complex demographic events, especially population admixture, in Peninsular Malaysia. Therefore, we applied a GWAS gene scan in each population to determine whether we could identify previously reported genes related to the distinct traits in Malaysian populations, including skin pigmentation, hair and facial morphology, height, immune response, and blood pressure. In the compiled GWAS gene list (see “Materials and methods”), 377 of the 619 genes were related to human height, comprising the majority of the list, followed by 107 genes related to blood pressure (Fig. 4a). The mapping comprised two parts: (1) mapping GWAS genes related to seven appearance traits, malaria, and immune response in all four Malaysian populations; and (2) mapping blood pressure-associated genes and cardiovascular genes in NGO.

Fig. 4 GWAS genes identified as signatures of local adaptation. a GWAS genes were compiled from published GWASs via the National Human Genome Research Institute (NHGRI) GWAS catalog available online and relate to 11 complex traits; the number of genes for each trait is shown in the pie chart. b Proportion of GWAS genes located in regions with evidence of local adaptation. The number of genes identified is shown below each bar

First, only a small proportion of the total candidate genes in each population were related to those complex traits, with the largest proportion (3.7 %) in MLY and the smallest (3.3 %) in SNI. Likewise, the GWAS genes identified as significant in our study comprised only a small proportion of the GWAS gene list; most were height-related genes, followed by immune response-related genes and a few genes related to hair, skin, and facial morphology in different populations (Fig. 4b). The top ten most significant GWAS genes identified in each population are summarized in Table 3. Despite their statistical significance, they did not rank high when compared with other candidate genes.

Table 3 Top 10 GWAS genes identified in 4 Malaysian ethnic groups by FST-based analysis

Of the 56 candidate GWAS genes in NGO, 58 in SNI, and 61 in PML, a height-related gene, SYN3, stood out in both FST and LSBL analyses (Tables 3, S9). In addition, SYN3 fell in a region in which ~30 % of SNPs showed |iHS| > 2 in NGO and SNI. In PML, the region encompassing it harbored 93.3 % significant SNPs. PNPT1, which is also associated with height, had an even stronger signature than SYN3 in NGO according to FST, LSBL and iHS analyses. For MLY, a notable example showing signatures of potential natural selection is RUNX1, which is involved in the immune response. Based on these observations, we suggest that genetic adaptations affecting appearance traits such as height might have occurred in Malaysian indigenous groups.

Next, we sought genes linked to high blood pressure and related cardiovascular disorders in NGO individuals. Screening the compiled gene list revealed that several blood pressure-related genes, such as CDH13 and PAX5, were significant in the FST analysis in NGO (Table 4). CDH13 encodes a member of the cadherin superfamily that protects vascular endothelial cells from apoptosis due to oxidative stress and is associated with resistance to atherosclerosis (Ivanov et al. 2004). PAX5 encodes a member of the paired box (PAX) family of transcription factors, and PAX5 proteins are important regulators of B cell differentiation, neural development and spermatogenesis (Adams et al. 1992; Emelyanov et al. 2002).

Table 4 Blood pressure-related genes detected by FST analysis in Negrito

Discussion

Southeast Asia covers an area of 1.6 million square miles and is home to 600 million people. A wide variety of ethnic groups exists, accounting for its diverse range of languages and religions. Peninsular Malaysia is located in the heart of Southeast Asia, and is very diverse in both geography and culture. The entire Malaysian population comprises numerous ethnic groups, including Malays, Chinese, Indians and different indigenous groups, and as such is a complex multilingual and multicultural society. The wealth of human diversity in Malaysia makes it a representative region in Southeast Asia. Taking advantage of the high-density genotyping data in four major and representative Malaysian ethnic groups (MLY, PML, SNI and NGO), we conducted the first comprehensive genome-wide study and provided a genomic landscape of populations in Peninsular Malaysia with respect to their evolutionary history and local adaptation.

Compared with European and East Asian populations, the overall genetic diversity in Peninsular Malaysia, as represented by the four major groups, is much higher, but the diversity within each single indigenous group is lower (Fig. 1; Table S2). This can be seen in the patterns of Ne. For example, the Ne of MLY reaches about 6,000, whereas it is less than 3,000 in the Malaysian indigenous groups, much lower than in any other world-wide population (Fig. 2). This is possibly due to: (1) the inbreeding and long isolation of indigenous groups; and (2) underestimation resulting from the limited samples of Malaysian ethnic groups.

The complex patterns of genetic diversity in modern populations result from demographic history and from selective effects that have adapted different populations to their environments (Balaresque et al. 2007). We therefore attempted to elucidate the unique genetic diversity patterns in Malaysians using two investigative approaches. First, we investigated population structure and history; then we focused on identifying signatures of local adaptation in the different Malaysian ethnicities in this study. Generally, all Malaysian ethnic groups are more closely related to East Asians than to Europeans and Africans, and Malays and Proto-Malays harbor more Asian ancestry than Senois and Negritos (Fig. 3). We inferred the divergence time of Africans and non-Africans as ~60 KYA (Table 1). Relative to the time of “Out of Africa”, the divergence of Malaysian populations from East Asians (~10 KYA) was recent, and it occurred later than the divergence of Europeans or Africans from East Asians. In particular, our estimation provided genetic evidence for a split between the ancestors of NGO and SNI nearly 4,000 years ago, as previously suggested by language analyses (Benjamin 1976). Although Malaysian populations share more recent common ancestry with East Asians, gene flow from Africans and Europeans was detected, implying some admixture events. A small proportion of gene flow occurred from Africans and Europeans to Malaysians, and correspondingly, we observed genetic components shared between Malaysians, especially Malays and Proto-Malays, and Europeans and Africans in the STRUCTURE analysis (Fig. 3c). In addition, we identified genes with signatures of natural selection related to the distinct appearance traits of different Malaysian ethnic groups (Fix 1995; Ang et al. 2012), especially height and skin pigmentation, and to the high blood pressure in NGO. Notable examples, such as SYN3 and PNPT1, both associated with human height, and CDH13 and PAX5, related to blood pressure, appear to play some role in the local adaptation of Malaysian indigenous populations. All these observations allow us to conclude that a long period of isolation, subsequent gene flow, and local adaptations jointly shaped the genetic architecture of the present-day communities living in Peninsular Malaysia.

This study provides a comprehensive picture of Malaysian ethnic groups and explains the reasons for the genetic diversity in the population. In a previous study, we examined a data set with much lower SNP density in the Malaysian Malay group and explored the population history of the Malay (Hatin et al. 2011). With respect to the genetic architecture and population history of the Malay, most of those results and conclusions are supported by this study, which is based on higher density data. For instance, we had previously identified gene flow to the Malay from their surrounding populations, including Chinese and Indians. Here, through admixture analysis, we confirmed the admixture between Malaysian populations and their neighboring populations, and furthermore, we could quantitatively estimate the gene flow to the Malay from East Asians (10–30 %) and Indians (14.9 %). In addition, we confirmed that, compared with Negritos, Proto-Malays were more closely related genetically to Malays and Chinese (Fig. 3c).

Although the HUGO Pan-Asia Initiative revealed the great genetic diversity of Southeast Asia and its important role in human evolutionary history (The HUGO Pan-Asian SNP Consortium 2009; Ngamphiw et al. 2011), the population histories accounting for this variance remained poorly understood. Our findings on population admixture are consistent with previous studies providing molecular evidence for the admixture of the three Malaysian indigenous groups with Europeans (Ang et al. 2012). On the one hand, the European gene flow mentioned in those studies is fully confirmed. On the other hand, our results might have been biased by the insufficient sample size and the limited number of SNPs integrated for this part of the analysis. Geographical location, colonization and trade are the main reasons for population admixture in Peninsular Malaysia. Chinese traders first visited Peninsular Malaysia in the mid-fifth century, followed by Arab and Persian (known as Iranian today) merchants in the early eighth century (Andaya 2002). In the fourteenth century, the Malacca Sultanate developed international trade with the Yuan Dynasty, attracting an increasing number of traders from other regions, such as Chinese, Persians, Arabs and Gujaratis. Since the 1600s, Malaysia has been colonized by several European countries, such as Portugal, the Netherlands and Britain, resulting in gene flow from Europeans to Malaysian populations. Since much higher gene flow was detected in Malaysian ethnic groups, especially the Malay (Table 2), from Middle East and Indian populations than from Europeans, it is likely that the European genetic component in Malaysians was brought into the Malay via people from the Middle East and India. Indeed, the average genetic difference between Malaysian populations and the Middle East population (FST = 0.09) is slightly smaller than that between Malaysian populations and Europeans (FST = 0.10). According to historical records, the first millennium and a half AD was the period of Indianization of Peninsular Malaysia (Andaya 2002). Later, in the early nineteenth century, Indians began migrating to Malaysia as laborers, forced by the British during colonial times. This had a huge impact on the demographics, culture and society of Malaysia, and the Indian community has become one of the main ethnic groups in Malaysia today (Hirschman 1975; Saw 1988). Taken together, we suggest that the Malaysian population is of great historical and genetic complexity, which might have resulted from extensive admixture events. Such complexity might cloud the estimation of divergence time, which was estimated under the assumption of no population admixture, and complicate the identification of local adaptation signatures.

Local adaptations in different populations are largely driven by environmental pressures and may confer fitness advantages. For example, a strong capacity for nutrient absorption, wound healing, and sensory and visual perception may help in adapting to a hunter-gatherer lifestyle. Some genes related to these phenotypes were already identified in our previous study, in which we compared Orang Asli populations or Southeast Asian populations with others (Qian et al. 2013), and some of them were confirmed in this study, such as DLC1, CDH13, RGL1 and A2BP1. The short stature of Negritos might be more suitable for hot tropical environments, offering better thermoregulation, reduced caloric requirements, and greater mobility (Migliano et al. 2007; Perry and Dominy 2009). Previous studies revealed the GH1-IGF1 pathway as a candidate for the short stature in hunter-gatherers from both Africa and Southeast Asia (Clavano-Harding et al. 1999; Dávila et al. 2002). Here, IGF1 and GHR, which participate in the GH1-IGF1 pathway, were identified in the FST or LSBL analyses. Other height-related genes had previously been identified only in African pygmies (Jarvis et al. 2012; Mendizabal et al. 2012; Lachance et al. 2012), and some of these were also significant in this study of Malaysian Orang Asli groups (Table S10). In addition, previous association studies suggested that some skin pigmentation-related genes, e.g., SLC45A2 and SLC24A5, may affect the skin color of different Orang Asli groups (Ang et al. 2012). As expected, these genes were significant in our FST analysis: both fell in the top 0.1 % of signatures in the four Malaysian populations compared with Europeans (Table S10). However, these signatures were not statistically significant in the iHS analysis, suggesting that the signals at SLC45A2 and SLC24A5 could actually have resulted from selection in Europeans rather than in Malay populations. Other reported genes, such as MC1R and ASIP, did not have observable signatures, likely because they were identified in association studies of populations with genetic backgrounds substantially different from those studied here. Specifically, TYR was identified as related to skin variation between Europeans and South Asians (Sulem et al. 2007).

Two challenges were apparent with this study. First, our estimations are based on small samples, and this may bias some analyses, such as allele frequency calculation and haplotype investigation. As such, the Ne and divergence time of these ethnic groups might be underestimated. To assess this, we randomly sampled 20 individuals from each of YRI, CEU, JPT and CHB to make them comparable to the Malaysian populations, and re-estimated Ne and divergence time for them. With these subsamples, the Ne estimated for the past few thousand years was much smaller than that obtained from the full samples, and divergence time was also underestimated (Fig. S7; Table S11). With small samples, we could not accurately calculate allele frequency and gene flow. For example, with four Proto-Malay samples, we could hardly observe an exponential decay of the weighted LD curve (Loh et al. 2013) when estimating population admixture, irrespective of the reference ancestral populations we used. Thus, we could not measure gene flow from other populations to Proto-Malay. However, the PCA and STRUCTURE analyses were accurate, as they are less sensitive to sample size fluctuations. To minimize this kind of bias, we standardized the FST values in each pairwise population comparison, accounting for the distinct sample sizes. A more comprehensive collection of larger numbers of samples across a wider geographic range will be necessary, especially to understand the genomics of Malaysian Orang Asli groups, but it must be appreciated that some of these groups are small and on the verge of extinction. Therefore, in reality, large samples per ethnic group are unavailable, which probably prevents improving statistical power by increasing sample size.
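The two safeguards described above, down-sampling reference populations to a comparable size and standardizing FST within each pairwise comparison, can be sketched as follows. This is a minimal illustration only: it assumes that standardization means a z-score across SNPs within one population pair and that candidates are taken from an empirical upper tail, neither of which is confirmed by the text; `subsample`, `standardize` and `top_fraction_indices` are hypothetical helper names.

```python
import random
import statistics


def subsample(sample_ids, n=20, seed=0):
    """Draw n individuals without replacement (reproducible via seed)."""
    rng = random.Random(seed)
    return rng.sample(list(sample_ids), n)


def standardize(values):
    """Z-score the per-SNP FST values of one pairwise comparison."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0.0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


def top_fraction_indices(scores, fraction):
    """Indices of the highest-scoring SNPs (empirical upper tail)."""
    k = max(1, int(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


if __name__ == "__main__":
    ids = ["ind%03d" % i for i in range(60)]   # toy sample identifiers
    print(len(subsample(ids)))
    z = standardize([0.01, 0.05, 0.02, 0.40, 0.03])  # toy FST values
    print(top_fraction_indices(z, 0.2))
```

Standardizing within each comparison puts population pairs with different sample sizes, and therefore different FST distributions, on a common scale before outliers are ranked.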

The second challenge was the lack of suitable methods for ethnic groups with complex historical backgrounds. Our estimation of Ne was based on the LD pattern of a population. However, the bottleneck effect in non-African populations might lead to overestimation of LD before the bottleneck event(s), thus resulting in underestimation of Ne and divergence time. For example, both archeological and genetic evidence suggests that the “out of Africa” migration occurred around 100,000 years ago (Armitage et al. 2011; Gronau et al. 2011), which is earlier than our estimate (~60 KYA), although the world-wide pattern of population genetic diversity and relationships would not be much affected. Although the absolute divergence time could be underestimated, our analysis supported a much more recent divergence between Malaysian ethnic groups and East Asians. Existing methods for admixture analysis also lack sufficient power to provide convincing results for such complex groups. In the admixture analysis, we did not detect gene exchange within Malaysia by the 4 Population Test, because these four ethnic groups are so close genetically and we could not find proper reference populations that were totally independent of them, yet not too distant from them, to form a definite tree topology. This challenge may lead to underestimation of the admixture time. Because our estimation is based on the hybrid isolation (HI) model (Jin et al. 2012), which might not be representative of complex real cases, and because the method we used did not take further information, such as admixture proportion, into account, it is likely that the real admixture events started earlier than we estimated. As for the study of local adaptation, the complex admixture events that happened in the Malaysian ethnic groups may mask genuine signatures of local adaptation, which could have confounded our identification of those signals. More sophisticated methods of constructing population histories and deciphering natural selection signatures will assist researchers and improve the understanding of human population diversity in the future.
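The link between inflated LD and a deflated Ne can be made concrete with a widely used approximation, E[r²] ≈ 1/(α + 4·Ne·c), where c is the genetic distance in Morgans, together with the rule of thumb that markers separated by c reflect Ne roughly 1/(2c) generations ago. The sketch below assumes this approximation with α = 2 and an optional 1/n correction for sampled chromosomes; it is not necessarily the exact estimator used in this study, and `ne_from_ld` and `generations_ago` are hypothetical names.

```python
def ne_from_ld(r2, c_morgans, n_chromosomes=None, alpha=2.0):
    """Effective population size from mean r^2 at genetic distance c.

    Assumes E[r^2] ~ 1 / (alpha + 4 * Ne * c). If n_chromosomes is given,
    r^2 is first adjusted for finite sample size by subtracting 1/n.
    """
    if n_chromosomes is not None:
        r2 = r2 - 1.0 / n_chromosomes
    if r2 <= 0.0 or c_morgans <= 0.0:
        raise ValueError("adjusted r2 and c_morgans must be positive")
    return (1.0 / r2 - alpha) / (4.0 * c_morgans)


def generations_ago(c_morgans):
    """Approximate age (in generations) of the Ne reflected at distance c."""
    return 1.0 / (2.0 * c_morgans)


if __name__ == "__main__":
    c = 0.001  # 0.1 cM between markers (toy value)
    print(ne_from_ld(0.10, c))   # baseline LD
    print(ne_from_ld(0.20, c))   # inflated LD gives a smaller Ne
    print(generations_ago(c))
```

Under this approximation Ne falls monotonically as r² rises, so any process that inflates LD, such as a bottleneck, pushes the estimate downward, in line with the underestimation discussed above.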

With respect to genomic data, next generation sequencing (NGS) has allowed genetic studies to address biological questions at a genome-wide scale (Koboldt et al. 2013). Whole-genome sequencing data can be obtained for future investigations into the basis of genetic structure and variance in Asia. The Malay sequence data currently in public repositories are from Singapore Malays (Wong et al. 2013), and the original locations of the samples are not available. Thus, integrating these data into our current analysis would be of limited value. Our microarray data provided information for understanding the population genomic landscapes of Malaysian ethnic groups, while NGS data can improve resolution in some particular respects in the future.

In summary, these preliminary data provide a comprehensive picture of the population genomics landscape in Malaysia, a strategic area with an important role in human migration history. Natural selection signatures identified here may offer a good candidate list for further replication and verification. Our findings advance the understanding of genetic diversity in Peninsular Malaysia and offer insight into the peopling and human migration history in Southeast Asia. These data may pave the way for future evolutionary and scientific studies.