Introduction

China stands at a geographical crossroads between Southeast Asia, Central Asia, and Siberia, and constitutes a very important region for tracing the migration and expansion of anatomically modern humans into these areas. Although a number of recent studies using different genetic markers have shed some light on the peopling of East Asia, especially China (Ding et al. 2000; Karafet et al. 2001; Kivisild et al. 2002; Yao et al. 2000b, 2002a), the debates have not been settled. One major source of the controversy is that the samples considered were not well matched: populations from southern China were sampled much more heavily than populations from northern China (cf. Ding et al. 2000). Among the available mtDNA data of Chinese ethnic populations (Oota et al. 2002; Yao et al. 2000a, 2002a, 2002b, 2003a; Yao and Zhang 2002), relatively few populations from northern China have been analyzed compared with those from southern and central China. To better understand the migration pattern in East Asia, more data from northern China are thus necessary.

Based on a satisfactorily resolved East Asian mtDNA phylogeny that was constructed either by combining information from the control region and coding region (Yao et al. 2002a) or from complete sequences (Kivisild et al. 2002; Kong et al. 2003), we dissected the mtDNA lineages identified in 232 subjects from five ethnic groups sampled in northern China into different haplogroups, and then compared the genetic structure of these populations on the basis of the distribution frequency of each haplogroup. We followed the strategy for haplogroup classification described in Yao et al. (2002a, 2003a) and Kivisild et al. (2002). In brief, the mtDNAs were tentatively assigned to their respective haplogroups according to the specific mutations observed in hypervariable segment I (HVS-I) and by (near-)matching with reported data for which coding region information was available; other specific mutations in hypervariable segment II (HVS-II) and/or the coding region were then typed to further characterize the haplogroup status of the mtDNAs. In addition, the region 14576–16047 [numbered according to the revised Cambridge reference sequence (rCRS; Andrews et al. 1999)], which covers the complete cytochrome b gene (Cyt b), was sequenced in a total of 63 individuals [25 from this study and 38 from our previous studies (Yao et al. 2002a, 2002b, 2003a) as well as our unpublished data set], in order to learn whether this region is as informative as regions 10171–10659 and 14055–14590 in Yao et al. (2002a) for defining and/or supporting the haplogroups of East Asian mtDNAs. Our results showed that the maternal genetic structures of the five ethnic populations differed, and that the information provided by the region 14576–16047 was helpful for discerning 15 (sub-)haplogroups.

Material and methods

Sampling

A total of 232 individuals from five ethnic populations were sampled in Inner Mongolia, China: 45 Daurs from Ewenkizu Zizhiqi county, 47 Ewenkis from Xin Barag Zuoqi county, 48 Koreans from Arunqi county, 48 Mongolians from Xin Barag Zuoqi county, and 44 Oroqens from Oroqen Zizhiqi county (Fig. 1). All of the individuals were confirmed to be unrelated before sampling and gave informed consent.

Fig. 1. Geographic locations of the five ethnic populations in northern China

DNA amplification and sequencing

The mtDNA HVS-I sequence was amplified and sequenced in all the samples using the procedures described elsewhere (Yao et al. 2000a, 2002a, 2002b). In order to confirm the haplogroup status of some individuals, a number of characteristic mutations in HVS-II sequence and/or coding regions were further detected by direct sequencing or RFLP analysis using the same primer pairs and conditions as in Yao et al. (2002a). Moreover, all of the individuals were screened for the mtDNA 9-bp deletion in the COII/tRNALys intergenic region according to our previous studies (Yao et al. 2000b, 2001).

The region 14576–16047, which covers the whole Cyt b gene sequence and harbors two characteristic polymorphisms (14783 and 15043) of macro-haplogroup M, was amplified by using primer pair L14575 (5′-ACCCGACCACACCGCTAACA-3′)/H16048 (5′-GTCAATACTTGGGTGGTACC-3′). After being purified on spin columns (Watson BioTechnologies, Shanghai), the PCR products were directly sequenced for both strands by using the two primers for amplification and six internal primers (L14752, 5′-ACTACAAGAACACCAATGACC-3′; L14989, 5′-ATGGCTGAATCATCCGCTAC-3′; L15391, 5′-TAGGAATCACCTCCCATTCC-3′; L15598, 5′-ACACAATTCTCCGATCCGTC-3′; H15086, 5′-AGGAGGATAATGCCGATGTT-3′; H15400, 5′-TGTAGTAAGGGTGGAAGGTG-3′). The numbers in the primer names refer to the position of the 3′ end of the primer sequence relative to rCRS (Andrews et al. 1999). L and H stand for light and heavy strands, respectively.
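The primer-naming convention implies a simple arithmetic for the amplified interval: if the number in each name marks the 3′ end of the primer, the newly synthesized sequence starts one base beyond the L primer and ends one base before the H primer. The following Python sketch illustrates this under the assumption that the L primer extends toward higher rCRS positions and the H primer toward lower ones; the function names `parse_primer` and `amplified_region` are hypothetical and not part of any published tool.

```python
import re

def parse_primer(name):
    """Split a primer name such as 'L14575' into (strand, 3'-end position)."""
    match = re.fullmatch(r"([LH])(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized primer name: {name}")
    return match.group(1), int(match.group(2))

def amplified_region(l_primer, h_primer):
    """Return the (start, end) rCRS interval lying between the two 3' ends.

    Assumption: the L primer extends toward higher positions and the
    H primer toward lower positions, so the primers themselves are excluded.
    """
    l_strand, l_pos = parse_primer(l_primer)
    h_strand, h_pos = parse_primer(h_primer)
    if (l_strand, h_strand) != ("L", "H"):
        raise ValueError("expected an L primer followed by an H primer")
    if h_pos - l_pos < 2:
        raise ValueError("primers do not enclose any sequence")
    return l_pos + 1, h_pos - 1

print(amplified_region("L14575", "H16048"))  # (14576, 16047)
```

For the amplification pair named in the text, the sketch reproduces the stated region 14576–16047.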

Data analyses

The sequences were edited and aligned using DNAstar software and compared with the rCRS (Andrews et al. 1999). The length polymorphisms of the A and C stretches in region 16180–16193 (triggered by the 16189 T/C substitution) were disregarded in the analysis.

The mtDNAs were classified into the specific (sub-)haplogroups by using the strategy and haplogroup annotation system fully described in recent studies (Richards et al. 2000; Kivisild et al. 2002; Yao et al. 2002a, 2003a; Kong et al. 2003). After each mtDNA was assigned to the most-derived named haplogroup, the haplogroup distribution frequencies in each of the five populations were estimated. We also compared the haplogroup distribution pattern in our samples with that of the recently reported data from Siberia (Buryat and Yakut; Pakendorf et al. 2003). The haplotype diversity and nucleotide diversity (Nei 1987) in the populations were computed by using the DnaSP package (Rozas and Rozas 1999). In addition, using the information provided by the complete sequences of the Cyt b genes analyzed in this study and in the reported complete sequences (Kivisild et al. 2002; Kong et al. 2003), we reanalyzed the recently reported Cyt b data from Koreans (Lee et al. 2002) and tried to pinpoint the potential errors in their sequence data.
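As an illustration of the summary statistics named above, the sketch below computes haplogroup frequencies, haplotype diversity, and nucleotide diversity in plain Python. It assumes the commonly used estimators attributed to Nei (1987): haplotype diversity h = n/(n−1) × (1 − Σ p_i²), where p_i is the frequency of the i-th haplotype among n sequences, and nucleotide diversity as the mean number of pairwise differences per site. Whether DnaSP applies exactly these forms, and how it treats gaps or missing data, is not stated in the text; the function names and the toy input are hypothetical and are not data from this study.

```python
from collections import Counter
from itertools import combinations

def haplogroup_frequencies(assignments):
    """Percentage of individuals assigned to each haplogroup label."""
    n = len(assignments)
    return {hg: 100.0 * count / n for hg, count in Counter(assignments).items()}

def haplotype_diversity(haplotypes):
    """h = n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two sequences are required")
    sum_sq = sum((count / n) ** 2 for count in Counter(haplotypes).values())
    return n / (n - 1) * (1.0 - sum_sq)

def nucleotide_diversity(sequences):
    """Mean number of pairwise differences per site over all sequence pairs."""
    if len(sequences) < 2:
        raise ValueError("at least two sequences are required")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to equal length")
    pairs = list(combinations(sequences, 2))
    diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return diffs / (len(pairs) * length)

# Toy input (hypothetical; not data from this study)
toy_groups = ["D", "D", "G", "C"]
toy_seqs = ["AAAA", "AAAT", "AATT", "AAAA"]

print(haplogroup_frequencies(toy_groups))
print(round(haplotype_diversity(toy_seqs), 3))
print(round(nucleotide_diversity(toy_seqs), 3))
```

Identical strings are treated as the same haplotype, so the input to `haplotype_diversity` can be either aligned sequences or mutation-motif strings.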

Nomenclature

Gene mutation nomenclature used in this article follows the recommendations of den Dunnen and Antonarakis (2001). Gene symbols used in this article follow the recommendations of the HUGO Gene Nomenclature Committee (Povey et al. 2001). The authors have made every attempt to perform the study in accordance with the recommendations made by Cooper et al. (2002).

Results

Haplogroup identification

The mtDNA sequence variation in the 232 individuals is listed in Table 1. All of the samples, with the exception of two individuals, could be classified into the most-derived named mtDNA haplogroups (Fig. 2). The two mtDNAs that could not be assigned further were labeled M and N, respectively. The M mtDNA from the Korean sample (Kor92), with mutation motif 16145-16148-16188-16189-16223, might belong to a new (sub-)haplogroup of M that is not yet defined. This motif can also be found in the Tibetan samples from Yunnan (Yao and Zhang 2002). The N haplotype from the Oroqen sample (Oro16) near-matches a Han Chinese sample from Wuhan, Hubei (WH6976; Yao et al. 2002a).

Table 1. MtDNA variation in 232 individuals from five ethnic populations in northern China. Positions are numbered according to the revised Cambridge reference sequence (rCRS) of Andrews et al. (1999); the mtDNAs that have no mutations in a sequenced region compared with the reference sequence are labeled as CRS
Fig. 2. Classification tree of the mtDNA haplogroups identified in 232 northern Chinese samples. This tree is constructed with reference to the classification trees of Yao et al. (2002a), Kivisild et al. (2002), and Kong et al. (2003). The characteristic mutations (relative to the revised CRS; Andrews et al. 1999) considered here are indicated on the branches in arbitrary order. The suffix indicates a transversion, d indicates a deletion, and recurrent mutations are underlined. The revised CRS branches out from the R node by seven haplogroup-specific mutations at sites 73, 1438, 2706, 4769, 7028, 11719, and 14766, plus four private mutations at sites 263, 750, 8860, and 15326

Table 2 shows the sequence polymorphisms identified in region 14576–16047 in the 63 individuals. This region provides useful information in supporting the poorly characterized haplogroups that are defined only by control region motifs (Kivisild et al. 2002; Yao et al. 2002a, 2003a). For instance, haplogroup B5b, a sub-haplogroup of B5 that was formerly defined by HVS-I transitions at sites 16140, 16189, and 16243, is newly confirmed by five specific mutations (15223, 15508, 15662, 15851, and 15927) in this region. Similarly, haplogroup B5a is identified by the 15235 mutation, haplogroup B4b is supported by a transition at site 15535, haplogroup B4c is recognized by the 15346 mutation, and haplogroups M8a, C, and Z share a specific transversion at site 15487. Haplogroup M10, which was formerly defined by mutations 10646 and 16311 (Yao et al. 2002a), can be well recognized by mutations at sites 15040, 15071, and 15218. The newly described haplogroup G1 (Bandelt et al. 2003; Kong et al. 2003), a sub-haplogroup of haplogroup G, is characterized by mutations 15323 and 15497. Mutation 15860 and the HVS-I motif (16223-16325-16362) further define a sub-branch of G1, G1a. The region 14576–16047 also provides information for defining several European-specific haplogroups (Finnilä et al. 2001; Herrnstadt et al. 2002), such as JT (characterized by mutation 15452A), T (recognizable by mutations 14905, 15607, and 15928), and H4 (characterized by mutations 14766 and 14582). Note that a subset of J is also identifiable by a transition at site 14798 (Finnilä et al. 2001; Herrnstadt et al. 2002). As a result, 14 individuals from the Daur, Ewenki, and Mongolian samples considered here could be assigned to the western Eurasian-specific haplogroups R2, H4, J1, T1, and T2 (Table 1). Our extensive search of the reported Chinese data showed that haplogroups J and T also occur in samples from Liaoning, Shaanxi, Hunan, and Xinjiang Provinces (Oota et al. 2002; Yao et al. 2000a, 2002a). How these western Eurasian-specific lineages entered and spread across China remains unresolved, and further analysis is needed.
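The diagnostic sites listed in this paragraph can be read as a small lookup table. The Python sketch below encodes the East Asian entries and reports which haplogroups are fully supported by the variants observed in region 14576–16047 of one sample. It is a simplification under stated assumptions: only positions are compared (the specific base change is ignored), the G1a entry omits the HVS-I motif that the text also requires, and the names `DIAGNOSTIC_SITES` and `supported_haplogroups` are hypothetical.

```python
# Diagnostic positions in region 14576-16047, as listed in the text.
# Assumption: positions only; the base change itself is not checked.
DIAGNOSTIC_SITES = {
    "B4b": {15535},
    "B4c": {15346},
    "B5a": {15235},
    "B5b": {15223, 15508, 15662, 15851, 15927},
    "M10": {15040, 15071, 15218},
    "G1": {15323, 15497},
    # G1a additionally requires the HVS-I motif 16223-16325-16362,
    # which lies outside this region and is not checked here.
    "G1a": {15323, 15497, 15860},
    "M8a/C/Z": {15487},
}

def supported_haplogroups(observed_positions):
    """Return haplogroups whose diagnostic sites are all present in the sample."""
    observed = set(observed_positions)
    return [hg for hg, sites in DIAGNOSTIC_SITES.items() if sites <= observed]

# Hypothetical variant lists, for illustration only
print(supported_haplogroups([15040, 15071, 15218, 15326]))  # ['M10']
print(supported_haplogroups([15323, 15497, 15860]))         # ['G1', 'G1a']
```

Requiring the complete site set is a deliberately strict choice; a near-matching rule that tolerates one missing site would behave differently and would need its own justification.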

Table 2. Polymorphisms in region 14576–16047 of 63 samples. A sequenced mtDNA region that is identical to the revised reference sequence (Andrews et al. 1999) is labeled CRS. ND, not determined. All mutations are transitions unless a suffix (i.e., A, C, G, or T) is specified; d indicates a deletion and + an insertion. A mutation in parentheses indicates the absence of this mutation at the site in the sample when compared with rCRS, and boldface is used to highlight the specific mutations of a haplogroup relative to the roots of M and N, respectively

Errors in the reported Korean data

The mtDNA coding region information has recently been employed in forensic science (Tzen et al. 2001; Lee et al. 2002). The now available data with coding region and control region information (Kivisild et al. 2002; Yao et al. 2002a; Kong et al. 2003; this study) can be used as a benchmark to check for potential reading errors or artificial recombination in a reported data set. In the 98 Korean samples that were sequenced for the complete Cyt b sequence and the two hypervariable segments of the control region (Lee et al. 2002), several obvious errors, possibly caused by sample crossover, can be easily discerned: sample H84 is a crossover of M7a1 with D4a; F531.2 is a crossover of F1b and D4a; H81 is a crossover of A5 with D4a; H98 is a crossover of A5 and M. Besides these recombination errors, there are many overlooked polymorphisms: 16290 might have been missed in H98; 15326 might have been overlooked in H108 and F907.1; 249d might have been disregarded in SB41; of the three G1a samples, F916.2 and F844.1 both lacked 15497, while F408.2 lacked 15323. The transversion at site 15487, which is shared by haplogroups M8a, C, and Z, was neglected in sample F496.1. Moreover, oversight of mutation 16362 seems to be frequent for the D4a types that were identifiable via the transition at site 14979. Such apparently artificial errors, caused by sample crossover or other reasons, are not infrequent. The 8.8-kb mtDNA sequences of Native Americans reported by Silva et al. (2002) also contained such problems (Yao et al. 2003b, 2003c). Even with extreme caution during the bench work, such errors may occur (cf. Kong et al. 2003; Yao et al. 2003a). Thus, additional quality control measures, such as independent typing of the region by different individuals (Herrnstadt et al. 2002), matching or near-matching with reliable data sets, and detecting errors by phylogenetic analysis as described recently (Bandelt et al. 2001, 2002; Yao et al. 2003b, 2003d; Yao and Zhang 2003), should be extensively employed to avoid possible errors in the data.
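One of the checks described above, flagging samples that lack a mutation expected from their haplogroup assignment, can be sketched in a few lines of Python. The expected Cyt b sites are taken from the haplogroup definitions given in this article; the sample identifiers and observed variant sets are hypothetical placeholders and are not the data of Lee et al. (2002). This is a minimal illustration of the idea, not the phylogenetic procedure of the cited quality-control studies, and it does not detect sample crossover.

```python
# Cyt b region sites expected for a haplogroup, per the definitions in this article.
# Assumption: positions only; base changes are not checked.
EXPECTED_SITES = {
    "G1a": {15323, 15497, 15860},
    "M8a": {15487},
    "C": {15487},
    "Z": {15487},
}

def missing_expected(haplogroup, observed_positions):
    """Sorted list of expected sites that are absent from the observed variants."""
    expected = EXPECTED_SITES.get(haplogroup, set())
    return sorted(expected - set(observed_positions))

def audit(samples):
    """Map each flagged sample ID to its missing sites; clean samples are omitted.

    `samples` maps a sample ID to (assigned haplogroup, observed positions).
    """
    report = {}
    for sample_id, (haplogroup, observed) in samples.items():
        missing = missing_expected(haplogroup, observed)
        if missing:
            report[sample_id] = missing
    return report

# Hypothetical samples, for illustration only
toy_samples = {
    "toy_1": ("G1a", {15323, 15497, 15860}),
    "toy_2": ("G1a", {15323, 15860}),
    "toy_3": ("G1a", {15497, 15860}),
    "toy_4": ("C", {15326}),
}
print(audit(toy_samples))
# {'toy_2': [15497], 'toy_3': [15323], 'toy_4': [15487]}
```

A flagged site is only a prompt to re-inspect the raw sequence trace; a genuine back mutation would produce the same report.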

Haplogroup distribution

Table 3 shows the haplogroup distribution frequencies in the five northern ethnic groups, as well as in the samples reported by Pakendorf et al. (2003), from which several features can be discerned: (1) haplogroups D, D5, G, and A are distributed widely among the seven populations; (2) haplogroups G2 (including G2a), M9a, Y, and the sub-haplogroups of F1 (including F1a, F1b, and F1c) have limited distributions in these samples; (3) some north-prevalent haplogroups, namely D, G, C, and Z (Yao et al. 2002a, 2003a), have relatively high frequencies (altogether more than 45%) in these populations; (4) the frequency of haplogroup B is quite high (more than 7.1%) in Daur, Ewenki, Korean, Mongolian, and Buryat, but is lower in Oroqen (2.3%) and Yakut (0.0%); (5) haplogroup M7 shows a high frequency in Daur (20.0%) and Korean (8.4%), but is either absent or present at low frequency in the other populations.

Table 3. The haplogroup distribution frequencies (%) in the seven northern ethnic populations. Populations Daur, Oroqen, Ewenki, Korean, Mongolian, Buryat, and Yakut are abbreviated as DW, Oro, EWK, Kor, Mg, Bur, and Yak, respectively

Discussion

The emerging mtDNA phylogeny of East Asian mtDNAs and the available data set with coding region and control region information can serve as the foundation for the East Asian mtDNA haplogroup assignment, and this has been fully described in a series of recent studies (Kivisild et al. 2002; Yao et al. 2002a, 2003a; Kong et al. 2003). The dissection of the 232 mtDNAs from northern China in this study by the same strategy revealed that: (1) most of the mtDNAs could be classified into the most-derived named mtDNA haplogroups (Fig. 2), and (2) some haplogroups, such as D, G, C, and Z, were prevalent in these northern samples and the matrilineal genetic profile was consistent with the genetic pattern observed recently (Yao et al. 2002a).

Two mtDNA coding region segments (10171–10659 and 14055–14590) analyzed in Yao et al. (2002a) were found to be very informative for East Asian mtDNA haplogroup characterization. However, these segments provided little information in supporting the status of haplogroups G1, G1a, M8, M10, B4b, B4c, B5a, and B5b. Our analyses of the region 14576–16047 showed that it contains many characteristic polymorphisms for these haplogroups and fills this gap. We suggest that, when the status of the major East Asian mtDNA haplogroups is to be discerned from short coding-region segments, these three segments are the ideal choices.

A comparison of the matrilineal genetic structure of ethnic populations can, to some extent, reflect their ethnohistory (Yao et al. 2002a, 2002b; Yao and Zhang 2002). The Koreans in China are mainly the descendants of migrants from the Korean Peninsula (Du and Yip 1993). Our analysis of the Korean sample revealed that it contained the specific haplogroups A5 and M7a1, which are also prevalent in South Koreans (Kivisild et al. 2002, and references therein), thus revealing a common matrilineal genetic background of the Koreans in China and in the Korean Peninsula.

The Daurs are said to be the descendants of the Khitan people of the Liao Dynasty (916 AD–1125 AD). Other hypotheses suggest that their ancestors can be traced back to the local people in northern Heilongjiang Province and to some tribes of the Heishui region (same province) during the Sui and Tang Dynasties (581 AD–907 AD) (Du and Yip 1993). Our results demonstrated that the Daur sample contained a high proportion of the haplogroups prevalent in northern China (>46%), which is consistent with a northern origin.

According to historical documents, the Ewenkis trace their origin to the populations who lived around Lake Baikal and adjacent eastern regions more than 2,000 years ago, and were divided into three long-separated branches (Solon, Tungus, and Yakut; Du and Yip 1993). Our results support the suggestion that the Ewenki are a typically northern population, as more than 63% of their maternal components consist of haplogroups D, G, C, and Z. Furthermore, the high frequencies of haplogroups C and D but lower haplotype diversity (0.956±0.011; Table 4) observed in the Ewenki sample suggest that the Ewenki might have undergone recurrent genetic drift because of their small population size (about 26,000 in the 1990 census) and episodes of population fragmentation during their development (Du and Yip 1993).

Table 4. Genetic diversities in the seven northern ethnic populations. The genetic diversities in the populations were calculated according to the HVS-I sequence [relative to 16001–16400 in the revised reference sequence (Andrews et al. 1999)]

The Oroqens are regarded as the earliest inhabitants of the Heilong River valley (Du and Yip 1993). The Oroqen population decreased severely over the past century: it numbered about 18,000 in 1895 but had fallen to 2,256 by 1953. By 1990, the population size had increased to 6,965 (Du and Yip 1993). The genetic structure of our Oroqen sample is concordant with this small population size and recorded history: high frequencies (86.4%) of north-prevalent haplogroups, such as D, G, C, and Z, were found, but with low haplotype diversity (0.948±0.015; Table 4).

The formation and development of the Mongolian population was a complex process that involved the integration of many Turkic-speaking tribes and some ethnic groups such as Han, Manchu, and Daur. Although the south-prevalent haplogroups F, R9b (formerly R10 in Yao and Zhang 2002; cf. Kong et al. 2003), and N9a were found at low frequency in the Mongolian sample, the main maternal components of the population are the north-prevalent haplogroups, which account for more than 58% of the total sample.

In short, we identified another coding region segment (region 14576–16047) that, in addition to the previously reported ones (Yao et al. 2002a), is informative for discerning the haplogroup status of East Asian mtDNAs. Although the matrilineal genetic components of the five northern Chinese ethnic populations differed, the observed genetic profile was in general consistent with that of the Chinese Han regional samples (Yao et al. 2002a). The presence of the north-prevalent haplogroups D, G, C, and Z in these populations directly supports their northern origin. Therefore, the matrilineal structures of these five northern Chinese ethnic populations reflect both regional features and their ethnohistory.