Introduction

The number of confirmed DNA polymorphisms detectable directly in the DNA and defined in various databases has increased from fewer than 200 in 1981 (HGM6 1981) to more than nine million (of which more than four million have been validated) in 2004 (dbSNP build 121). The majority of these are single-nucleotide polymorphisms (SNPs). SNPs are clearly the most plentiful genetic variants in the human genome, and a large number of them have high heterozygosities, making them very useful DNA markers for researching the genetic structure of populations and the ethnic origins of individuals in a population (Frudakis et al. 2003; Rosenberg et al. 2003).

Determination of genetic relationships and genetic similarities among populations can be based on allele frequency similarities and differences of SNPs (Osier et al. 2002; Collins-Schramm et al. 2004; Fullerton et al. 2004; Kidd et al. 2004). Many different methods exist to analyze allele frequency data on populations and to represent the resulting relationships. Here we use many different analytic approaches on a highly selected dataset designed to quantify the relative similarities of East Asian populations with a specific focus on the relative similarities of Koreans, Japanese, and Chinese. Korea represents an important region for understanding population structure and origin of East Asians because of its location in Northeast Asia between China and Japan.

There are many arguments for the origin of East Asian populations (Yao et al. 2002). Major issues include the determination of the migration routes of ancestors for modern East Asians (Tajima et al. 2002; Karafet et al. 2001), and the nature of the genetic relationships among Chinese, Koreans, and Japanese (Kim et al. 2000; Rolf et al. 1998). Chu et al. (1998) and Su et al. (1999) examined some of the East Asian populations employing nuclear microsatellites and Y-chromosome haplotypes, respectively. They identified Northern and Southern clustering patterns among the populations they studied with indications of somewhat greater genetic variation in the Southern populations. This evidence supports the model of a northward migration of peoples from Southeast Asia. Kivisild et al. (2002) examined coding and control region variation from complete mtDNA sequence from East Asian samples and found the patterns generally consistent with the Y-chromosome analyses of Su et al. (1999) and supportive of the distinction between Northern and Southern populations, but they also note the complex regional specificities found in Northern groups such as the Koreans and Japanese that indicate other waves of migration that probably occurred in more recent millennia. Jin et al. (2003) studied 11 Y-chromosome markers in males from 11 ethnic groups and interpret their findings as supportive of a dual origin for the modern Korean population—genetic contributions from Northern Asian populations and an expansion of populations from the South.

Here, we explore whether significant clustering can be detected with 43 mostly independent autosomal SNPs, typed in eight East Asian populations and one Siberian population. Because we selected markers that show larger-than-average allele frequency variation among East Asian populations, this relatively small number of markers does identify a significant clustering pattern of individuals and populations.

Materials and methods

Samples

We analyzed 386 individuals from eight East Asian populations and one Siberian population. Descriptions of the populations and the specific samples are accessible online in ALFRED (Allele frequency database; http://alfred.med.yale.edu) under their UIDs: Chinese from San Francisco (UID=SA000009J), Chinese from Taiwan (UID=SA000001B), Hakka from Taiwan (UID=SA000003I), Koreans (UID=SA000936S), Japanese (UID=SA000010B), Ami (UID=SA000002C) and Atayal (UID=SA000021D) from Taiwan, Cambodians (UID=SA000022E), and Siberian Yakut (UID=SA000011C). Sample sizes ranged from 25 individuals in the Cambodian sample to 54 in the Korean sample, with a mean of 43 individuals per sample. The DNA was purified by phenol/chloroform extraction from Epstein-Barr Virus-transformed cell lines as described earlier (Kidd et al. 2000).

SNP markers

We searched the SNP database of Applied Biosystems (http://myscience.appliedbiosystems.com) to find 32 SNP markers with a large difference in allele frequencies (min. Δ>0.1) between the Chinese and the Japanese—the two population frequencies available in that database. We typed these SNPs on all of our samples: eight East Asian populations and Yakut. We selected a subset of 21 independent markers with the largest differences of allele frequencies among the three Chinese groups and the Korean and Japanese populations. We also searched the ALFRED database to find SNP markers from among the several hundred SNPs typed that have both a large difference of allele frequencies and high Fst among the seven of our East Asian populations for which data already existed. In this way we chose an additional 22 SNPs at independent loci and then typed the Korean samples for these markers. Thus, the combined dataset for our analyses consists of data on 43 mostly independent, diallelic loci on individuals in nine populations. These loci are listed in Table 1 along with links to their definitions in dbSNP (http://www.ncbi.nlm.nih.gov/SNP/) and ALFRED.

Table 1 Polymorphisms studied and descriptive statistics for nine populations

SNP typing

The SNPs were typed by using AB TaqMan assays. All of the typings used 100 ng of genomic DNA and TaqMan probes, following the manufacturer’s instruction, in 3-μl reactions. The reactions were analyzed and alleles called using an Applied Biosystems 7900HT Sequence Detection System.

Allele frequencies and statistics

Allele frequencies for the 43 SNPs were determined by gene counting assuming co-dominant inheritance and no silent alleles. For each SNP the value of Fst for the nine populations (Wright 1969) was calculated as \( \sigma^{2} / (\bar{p}\,\bar{q}) \). Hardy–Weinberg equilibrium was tested using the FENGEN program (Pakstis, unpublished). No significant departure from equilibrium was observed for any of the 43 markers in any of the nine populations under study. For markers less than 1 Mb apart, pairwise linkage disequilibrium values were calculated as Δ² (Devlin and Risch 1995).
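The gene-counting and Fst calculations described above can be illustrated with a minimal Python sketch. The genotype counts are hypothetical, and two details are assumptions not stated in the text: populations are weighted equally, and σ² is taken as the population (not sample) variance of the allele frequency across populations.

```python
from statistics import mean, pvariance


def allele_frequency(n_AA, n_Aa, n_aa):
    """Gene counting for a diallelic co-dominant marker with no silent alleles:
    each AA individual carries two copies of allele A, each Aa carries one."""
    total_alleles = 2 * (n_AA + n_Aa + n_aa)
    return (2 * n_AA + n_Aa) / total_alleles


def wright_fst(freqs):
    """Fst = sigma^2 / (p_bar * q_bar), where sigma^2 is the variance of the
    allele frequency across populations (assumed unweighted, population variance)."""
    p_bar = mean(freqs)
    q_bar = 1.0 - p_bar
    if p_bar <= 0.0 or p_bar >= 1.0:
        raise ValueError("Fst is undefined when the allele is fixed or absent everywhere")
    return pvariance(freqs) / (p_bar * q_bar)


# Hypothetical genotype counts (AA, Aa, aa) for one SNP in three populations.
counts = [(10, 20, 10), (30, 10, 0), (5, 10, 25)]
freqs = [allele_frequency(*c) for c in counts]
fst = wright_fst(freqs)
print(freqs, round(fst, 3))
```

A sample-size-weighted mean and variance would be an equally plausible reading of the formula; the unweighted form is used here only to keep the sketch short.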

Cluster analysis

Clustering individuals

We used a model-based clustering method implemented by the program STRUCTURE (Pritchard et al. 2000; Falush et al. 2003) to infer relationships among East Asian populations by assigning individuals to clusters. Previous applications of STRUCTURE were very successful in studies of humans (Rosenberg et al. 2002) and dogs (Parker et al. 2004). We ran STRUCTURE using a model with admixture, separate α for each population, and correlated allele frequencies.

Clustering populations

We used the computer program DISTANCE to calculate the pairwise τ (tau) genetic distances (Cavalli-Sforza and Edwards 1967; Kidd and Cavalli-Sforza 1974; Cavalli-Sforza et al. 1994). Principal components analysis (PCA) was based on the matrix of pairwise tau genetic distances. The relationships among populations were also represented by tree diagrams based upon those pairwise genetic distances. A neighbor-joining tree was computed using the NEIGHBOR program and drawn with the help of the DRAWTREE program; both of these programs are part of the PHYLIP software package (Felsenstein 1989, 1993). The LSSEARCH program (Kidd and Sgaramella-Zonta 1971) was used to calculate an exact least-squares solution for the population tree. Twenty-eight different trees were examined using a heuristic search to generate similar trees. The best tree from the least-squares method has the smallest length and no negative internal segments. For the bootstrap analysis, the PHYLIP SEQBOOT program was applied to generate 1,000 replicate data sets that were then used as input for the GENDIST program in order to compute Reynolds distance matrices. The CONSENSE program summarized the 1,000 neighbor-joining trees.
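The bootstrap step can be sketched in Python as resampling loci with replacement and recomputing a pairwise distance for each replicate. This is an illustration only, not the PHYLIP code: the allele frequencies are hypothetical, the function names are invented for the example, and the distance is assumed to take the Reynolds coancestry form, the sum over loci and alleles of squared frequency differences divided by twice the sum over loci of (1 minus the sum of allele frequency products).

```python
import random


def reynolds_distance(p1, p2):
    """Assumed Reynolds-type distance for diallelic loci.
    p1, p2: per-locus frequencies of the reference allele in two populations."""
    num = 0.0
    den = 0.0
    for a, b in zip(p1, p2):
        num += 2.0 * (a - b) ** 2                      # both alleles contribute (a - b)^2
        den += 1.0 - (a * b + (1.0 - a) * (1.0 - b))   # 1 - sum of allele frequency products
    return num / (2.0 * den)


def bootstrap_distances(freqs, n_reps, seed=0):
    """Resample loci with replacement n_reps times; return one dict of
    pairwise distances per replicate, keyed by (population, population)."""
    rng = random.Random(seed)
    pops = sorted(freqs)
    n_loci = len(freqs[pops[0]])
    replicates = []
    for _ in range(n_reps):
        idx = [rng.randrange(n_loci) for _ in range(n_loci)]
        resampled = {pop: [freqs[pop][i] for i in idx] for pop in pops}
        replicates.append({
            (a, b): reynolds_distance(resampled[a], resampled[b])
            for i, a in enumerate(pops) for b in pops[i + 1:]
        })
    return replicates


# Hypothetical reference-allele frequencies at three loci in three populations.
example_freqs = {"A": [0.1, 0.5, 0.9], "B": [0.1, 0.5, 0.9], "C": [0.4, 0.2, 0.6]}
reps = bootstrap_distances(example_freqs, n_reps=5)
```

In the analysis described above, each replicate distance matrix would then be passed to a tree-building step and the resulting trees summarized in a consensus; that part is not reproduced here.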

Results

Table 1 identifies the 43 polymorphisms studied, specifies their chromosomal locations, and supplies the average heterozygosity plus the value range and average allele frequencies for the nine populations. The Fst values calculated both for the eight East Asian populations and for the nine populations also appear in Table 1. Sample sizes and allele frequencies for all markers and populations can be found in ALFRED under the UIDs in Table 1. Frequencies of the reference allele in each of the population samples can also be found in Electronic Supplementary Material (ESM) Table 1 for each polymorphism. Expected heterozygosities in each population are given in ESM Table 2 for the 43 markers. ESM Table 3 shows how similar/different each of the 36 unique population pairs are for the 43 markers based on t-tests.

Linkage disequilibrium statistics for all markers less than one megabase apart are given in ESM Table 4. Of the 14 such pairs of markers, only three had consistent significant LD in most populations, and the LD was complete (Δ²=1) for only one marker pair in one population. Thus, except for that single instance, each marker is contributing some unique information on population relationships and most markers are completely independent.

Marker ascertainment

The method used to select SNPs has created a data set that is clearly biased relative to unselected markers. There is, for example, increased heterozygosity in our data set relative to random SNPs. For our 43 selected markers, 95% of the polymorphisms have an average heterozygosity higher than 30% among the eight East Asian populations (not including Koreans). If we compare these results with a set of 454 more randomly selected polymorphisms, excluding our 43-marker data set, we find that only 59% have an average heterozygosity higher than 30% among the same eight East Asian populations.
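The heterozygosity comparison can be made concrete with a short Python sketch. It assumes that heterozygosity is the expected value 2pq for a diallelic locus and that the average is an unweighted mean across populations; the marker names and frequencies are hypothetical.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2pq at a diallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def average_heterozygosity(freqs_by_pop):
    """Unweighted mean of expected heterozygosity across populations (an assumption)."""
    return sum(expected_heterozygosity(p) for p in freqs_by_pop) / len(freqs_by_pop)


def fraction_above(markers, threshold=0.30):
    """Fraction of markers whose average heterozygosity exceeds the threshold."""
    averages = [average_heterozygosity(f) for f in markers.values()]
    return sum(h > threshold for h in averages) / len(averages)


# Hypothetical allele frequencies for two markers in three populations.
example_markers = {
    "snp_a": [0.50, 0.40, 0.60],   # intermediate frequencies, high heterozygosity
    "snp_b": [0.90, 0.95, 0.85],   # near fixation, low heterozygosity
}
share_high = fraction_above(example_markers)
```

Applied to the real data, the same calculation would give the 95% and 59% figures quoted above for the selected and the more randomly chosen marker sets, respectively.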

The average Fst value among all 43 markers based on the eight East Asian populations is 0.060 and the median is 0.046 (Fst range: 0.017–0.238). Given the special selection procedure, it is not surprising that the mean and median Fst are both elevated as compared with a larger unselected set of 370 independent diallelic markers based on seven populations (same East Asian populations as previously introduced excluding the Koreans) and excluding the 43 sites studied here: average Fst=0.033 and median=0.026. Seven of the 43-SNP data set markers (CCR5, ADH7, TNFSF15, FADS2, CRIP1, DOCK9, and TST) showed very high Fst values among the eight East Asian populations. The allele frequency profiles for these polymorphisms are shown in Fig. 1 for the nine populations.

Fig. 1

Allele frequencies for the seven SNPs (from our 43-SNP dataset) with the highest Fst values in nine East Asian populations

All of these SNPs represent old global polymorphisms that have high average heterozygosities in East Asian populations. The average heterozygosities in other regions of the world are lower than in East Asia, as expected given the ascertainment bias, but none is fixed in any geographic region, such as Africa or Europe (ESM Fig. 1), although a few are occasionally fixed in a single population. Thus, these are all old SNPs that arose in Africa prior to the expansion of modern humans out of Africa. The simplest explanation for this pattern is that random genetic drift caused the increase in heterozygosity at these SNPs in East Asia and those were the ones ascertained.

Clustering individuals

To examine the genetic structure of the individuals in our dataset, we first applied a model-based clustering method that groups individuals into a specified number (K) of clusters, each one of them characterized by a unique set of allele frequencies. In the computer program STRUCTURE that implements this algorithm, K is chosen in advance and can vary from run to run of the program. Each individual’s genotype can have a proportion of membership in each one of the K clusters, summing to one across the K clusters. These proportions can be considered as proportions of membership in the different clusters or as relative probabilities of ancestry of the individual derived from the hypothesized clusters. By assigning our sampled individuals to predefined populations, STRUCTURE also prints out the average membership proportions for these predefined populations.
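How per-individual membership proportions relate to the population averages can be shown with a small Python sketch. It does not reproduce STRUCTURE itself; it only averages already-estimated proportions within predefined populations, and the data layout, labels, and values are hypothetical.

```python
def average_membership(individuals):
    """individuals: list of (population_label, [q_1, ..., q_K]) pairs, where the
    q values are one individual's membership proportions and sum to one.
    Returns the mean membership vector for each predefined population."""
    sums = {}
    counts = {}
    for pop, q in individuals:
        if abs(sum(q) - 1.0) > 1e-6:
            raise ValueError("membership proportions must sum to one")
        if pop not in sums:
            sums[pop] = [0.0] * len(q)
            counts[pop] = 0
        sums[pop] = [s + x for s, x in zip(sums[pop], q)]
        counts[pop] += 1
    return {pop: [s / counts[pop] for s in total] for pop, total in sums.items()}


# Hypothetical proportions for three individuals and K=2 clusters.
example = [
    ("PopA", [0.8, 0.2]),
    ("PopA", [0.6, 0.4]),
    ("PopB", [0.1, 0.9]),
]
pop_means = average_membership(example)
```

Because every individual vector sums to one, each population average also sums to one, which is what allows the averages to be drawn as pie charts.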

Using our 43-SNP dataset among 339 individuals from eight predefined populations, SF Chinese, TW Chinese, Hakka, Koreans, Japanese, Ami, Atayal, and the Cambodians, the best result using STRUCTURE was obtained assuming four clusters (K=4). Among all eight populations, individuals coming from the same predefined population show, most of the time, a similar pattern of membership proportions among the K clusters (Fig. 2a). The Dirichlet parameters (αi) obtained for each cluster [max. (α1, α2, α3, α4)<0.26, mean (αi)=0.17] indicate that each individual of the data set has been mainly assigned to one of the K clusters (Pritchard et al. 2000). Among the eight East Asian populations, at K=4 clusters, the variance among individuals of the same cluster is low for all the four groups, and especially low for the first cluster (Fig. 2a). All three Chinese populations present a very similar pattern: individuals in these populations are assigned mainly to the first two clusters for an average of, respectively, 48 and 29% of each individual’s genotype. On average, individual Cambodians are assigned 43% to the first cluster and 23% to the fourth one. The second cluster is mainly composed of individuals in two predefined populations: the Japanese and the Koreans, with individuals assigned on average 84 and 76% to this cluster, respectively. The Ami mainly define the third cluster, with 47% of their individuals’ genotypes assigned to this cluster. Finally, the fourth cluster is defined primarily by the Atayal individuals; on average, the individuals in this population are assigned 75% to this cluster. Thus, we did identify structure among the East Asian groups (K>1), but we are still unable to differentiate between Japanese and Koreans.

Fig. 2a, b

Estimated membership proportions in each of the K assumed populations. Each individual is plotted in a single vertical line, separated in K colored segments representing the proportion of membership in each one of the K clusters. Black lines separate individuals from two different predefined populations. a Forty-three independent diallelic loci, typed in eight East Asian populations; K=4 clusters assumed. b Forty-three independent diallelic loci, typed in nine East Asian populations; K=5 clusters assumed

If one considers the allocated percentages for an individual to be proportional to ancestry from each of the hypothesized clusters, it might be possible to include additional populations that would solidify the existence of an additional cluster and alter the optimal clustering of the existing individuals. Because we were interested in relationships of Koreans to other populations, it seemed reasonable to incorporate an additional population from Northern Asia into the analysis. Including the Yakut increased the overall sample to 386 individuals and substantially modified clustering: the best result now occurred for K=5 clusters. Among all nine populations, individuals coming from the same predefined population generally show a similar pattern of proportional membership among the five clusters (Fig. 2b). With K=5, the Dirichlet parameters (αi) for each cluster [max. (α1, α2, α3, α4, α5)<0.20, mean (αi)=0.11] indicate that STRUCTURE assigned each individual mainly to one of the five clusters more decisively than in the previous run with eight populations and four clusters. This analysis also showed low variances among individuals belonging to each one of the five clusters, especially low in the first cluster (Fig. 2b).

In Fig. 3, the three Chinese groups cluster together, with genotypes of individuals assigned 52%, on average, to the first cluster. Genotypes of the Atayal are assigned 77% to the second cluster, primarily defining this cluster. The Ami genotypes are assigned 24% to the second cluster, 20% to the first cluster, but 42% to the fifth cluster, primarily defining this cluster. Genotypes of individuals in the three Chinese groups and the Japanese and Korean groups only share an average of 22% for cluster 3. However, genotypes of individuals in the Korean and Japanese populations are assigned to this third cluster with respective membership proportions of 78 and 69%, on average. The Cambodians present an intermediate pattern among the East Asian populations: 47% of their individuals’ genotypes are assigned to the first cluster, 21% to the second, 16% to the third, and 12% to the fifth cluster. The Yakut essentially define the fourth cluster, with genotypes assigned to this cluster an average of 58% and with 27% of each individual’s genotype assigned to cluster 3, the predominant cluster for Japanese and Koreans.

Fig. 3

Average estimated membership proportions among each one of the nine predefined populations plotted in pie charts, for K=5 clusters assumed

Even though we improved the clustering by including the Yakut, especially for the Taiwanese populations, we are still not able to distinguish well between Koreans and Japanese. However, we observe four patterns for East Asia: a Southern (shown by the three Chinese populations, and the Cambodians); a Northern pattern (shown by the Japanese and the Koreans); a Siberian pattern (shown by the Yakut population), and distinct patterns for the two aboriginal Taiwanese populations. We note an increasing average assignment of individuals in the second cluster, going from China (10% on average among individuals from the three populations), south to Cambodia (on average 21%) and Northeast to Taiwan (25% on average for Ami and 77% on average for Atayal individuals) (Fig. 3).

Clustering populations

A PCA based on a tau genetic-distance matrix that includes all nine populations (Fig. 4) shows a clustering pattern consistent with the results shown by the STRUCTURE program. The first PC accounts for more than 81% of the total variation among all nine populations and maximally separates the Atayal from the Japanese. The second PC accounts for less than 14% of the total variation and maximally separates the Yakut and Japanese. The three Chinese populations and the Cambodians are clustered together in the PCA. The first PC separates these four populations from a group composed of the Japanese and Korean populations. We also observe that the distance between the Japanese–Korean grouping and the Yakut is along the axis defined by the second PC with only a small percentage of the total genetic variation accounted for by this component.

Fig. 4

The PCA plot based on a tau genetic-distance matrix for 43 independent diallelic markers typed in nine East Asian populations. This map presents the first two axes, accounting for more than 95% of the total genetic variation

These results reflect the same geographical pattern for continental East Asia as the results from STRUCTURE. We also can identify here a Northern and a Southern pattern separated by large genetic distances between Japanese–Koreans and the three Chinese populations. The Southern pattern is consistent with the Taiwanese aboriginal populations (Ami and Atayal) originating from an expansion out of Southeast Asia into the Pacific (Chu et al. 1998). It is interesting to note that we are still not able to distinguish clearly between Japanese and Koreans in the PCA. However, the Koreans do tend to be more intermediate between the Japanese and the three Chinese populations using this method. (In ESM Table 3, the Koreans are also intermediate relative to the Chinese and Japanese groups in how frequently they are significantly different from one another pairwise across the 43 SNPs.)

We evaluated by exact least-squares analysis 28 of the more than 135,000 different tree structures possible for the nine East Asian populations. The best tree found (Fig. 5) is consistent with the results shown by the PCA and STRUCTURE analyses. The Japanese and Korean populations group together apart from the Chinese and Taiwanese groups, and the three Chinese groups are very close along the same segment. Notice that the greatest distance between two populations along the tree is between the Yakut and the Atayal. The SF Chinese and the TW Chinese populations are very similar, while the PCA differentiated them somewhat more. Except for the branch between the SF and TW Chinese populations, this configuration is strongly supported by very high bootstrap values based on 1,000 replications. The bootstrap value supporting the separation between the Northern cluster (Japanese, Koreans, and Yakut), and the Southern is 99.2%. For randomly selected markers the branch lengths would be proportional to time—in generations divided by twice the effective population size (Kidd and Cavalli-Sforza 1974)—for the correct structure. That cannot be the case for this tree, however, because of the strong bias in selecting markers that were known to have a large allele frequency difference between the Chinese and the Japanese populations.
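The figure of more than 135,000 possible tree structures matches the standard count of unrooted bifurcating trees for n labeled populations, (2n-5)!!, the product of the odd integers from 3 to 2n-5. Reading the text this way is an assumption, since the counting rule is not stated; a short Python check:

```python
def unrooted_tree_count(n_taxa):
    """Number of distinct unrooted bifurcating trees for n labeled taxa:
    (2n - 5)!! = 3 * 5 * ... * (2n - 5), defined for n >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


n_trees = unrooted_tree_count(9)   # nine populations
share_examined = 28 / n_trees      # 28 trees were evaluated by least squares
print(n_trees, share_examined)     # prints 135135 and a fraction of about 0.0002
```

The 28 trees examined therefore cover only about 0.02% of the possible topologies, which is why the result is described as the best tree among those examined rather than a global optimum.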

Fig. 5

Least-squares tree for nine East Asian populations and 43 independent diallelic markers, based on a tau genetic-distance matrix. This best tree among those examined was the one with the shortest length and no negative internal segments. The SF Chinese and the TW Chinese cannot be distinguished along the central segment of the tree. Bootstrap values are based on 1,000 replications

Discussion

We found that with a small, carefully selected set of SNPs we can identify genetic substructure among our existing set of East Asian populations, consistent across three clustering methods: STRUCTURE, PCA, and a least-squares tree search. The northern/southern pattern in East Asia had already been observed using Y-chromosome haplotypes and autosomal microsatellite information (Chu et al. 1998; Su et al. 1999). We now have strong statistical support for the existence of this pattern using autosomal SNPs. Though the markers were selected to emphasize the difference between Chinese and Japanese (see ESM Table 3 for statistical confirmation), the clusterings of the additional populations confirm the North-South pattern. High Fst among East Asian populations could be another plausible criterion for selecting a marker set. Empirically, however, we observed that this criterion differentiated East Asian populations less well than the criterion of large allele frequency differences presented here (data not shown).

The SF Chinese population shows a STRUCTURE clustering pattern slightly closer to the Northern populations than to the rest of the Chinese groups. While probably not statistically significant, it may be due to a handful of individuals that, unlike the rest of the sample, originate in the Northern part of China.

It is difficult to choose the appropriate number of clusters (K) for modeling the data with the STRUCTURE program. We presented here the “best K”, considering the results of a series of independent runs of STRUCTURE for different values of K, and the other external information we had concerning our samples. Now that we have established that a small set of independent SNPs could significantly differentiate among some very similar populations, it would be interesting to extend our study to other populations (Africans, Europeans, and Amerindians), in order to estimate relative contribution from these populations to the East Asian genetic pattern and reconstruct the history of the peopling of East Asia. This set of markers, with readily available TaqMan assays, is an excellent set for others to study on additional East Asian populations.

This study also demonstrates that one can affect the outcome of population studies by selecting markers that show a specific pattern of allele frequencies. Thus, it is not a surprise that Chinese and Japanese are distinguished by these analyses since the markers were explicitly chosen to show that difference and we did verify the differences by replication in independent samples. The highly non-random markers studied give a different pattern of relationships than a random set of loci. For the eight populations for which prior data exist (this excludes Koreans) we can evaluate the relationships based on the 80 loci with over 600 independent alleles used for the 37 population tree shown in Tishkoff and Kidd (2004). The PCA of these data (ESM Fig. 2) places the Japanese centrally with the two Chinese population samples and the Hakka. The Yakut, Atayal, and Cambodians are the outliers. A least-squares analysis (not shown) likewise has different structure and bootstrap values than the tree in Fig. 5.

Though extensive SNP data on Koreans are not available, Fst analysis of 370 SNPs on the other eight populations in this study shows few loci with large frequency differences. Thus, we would expect a STRUCTURE analysis to give little clear discrimination among these populations unless the more variable loci are included.

What is especially relevant is how the other populations, especially the Koreans, relate to the populations used to select the markers, Chinese and Japanese. Unfortunately, we are still unable to distinguish clearly between Japanese and Koreans. The results in ESM Table 3 confirm this, showing that only three of the 43 gene-frequency differences are statistically significant (P≤0.010) when comparing the Japanese and the Koreans. Common ancestry and/or extensive gene flow between these two populations throughout history seem likely and make it very hard to find population-specific alleles that could help differentiate them. By increasing the number of markers used, we might improve the accuracy of the estimates of proportions assigned to clusters (Pritchard et al. 2000), and therefore increase the quality of our clustering with STRUCTURE. Increasing the number of populations studied might also improve our clustering. As larger numbers of SNPs are studied in both Koreans and Japanese, it should be possible to find markers that will help differentiate between them.