Introduction

Analysis of the human mitochondrial DNA control region has become a powerful tool for forensic identity testing. Patterns of mutations that have accumulated over time along mtDNA lineages give rise to the differences among individuals, and have become the basis for forensic discrimination. Other properties of mtDNA that make it valuable for human identification studies include its high copy number, maternal inheritance, and the absence of recombination [1–3]. Current mtDNA forensic testing consists primarily of sequence analysis of portions of the control region, and most often targets the first and second hypervariable regions (HV1 and HV2). However, HV1/HV2 haplotypes show an uneven distribution; some mtDNA haplotypes are quite common, while others are quite rare [4]. Therefore, other regions within the control region and forensically useful variations within the mtDNA coding region have sometimes been targeted to better differentiate mtDNA [5–8].

On the other hand, the frequencies of certain mtDNA sequences in a given population can reflect geographic or historical origin. Recently, phylogenetic analyses based on the classifications of mtDNA lineages in East Asian populations using control region and coding region information have been reported [9–14], and a more reliable phylogenetic tree for mtDNA differentiation has been presented. A posteriori phylogenetic analysis of sequences can help to eliminate most systematic errors (e.g., a mix-up of site designations, base shifts or mistypings), and thus stringent quality control of mtDNA data by haplogroup-directed database comparisons has become available [15, 16].

In this study, we established a high-quality mtDNA control region sequence database for Koreans. We carried out a population study that examined sequence variations in the mtDNA control region, and evaluated the usefulness of various regions within the mtDNA control region by assessing the contribution of each region to reducing the random match probability. In addition, we classified mtDNA lineages (haplogroups) using control region information, and compared haplogroup frequencies with those reported by other East Asian studies.

Materials and methods

DNA samples

Buccal swabs or blood samples were obtained from 593 unrelated Koreans for the present study. DNA was extracted using a QIAamp DNA Mini Kit (Qiagen, Hilden, Germany) according to the manufacturer's instructions.

PCR amplification and mtDNA sequencing

To amplify the mitochondrial control region, PCR amplification was carried out using 1–2 ng of template DNA, and F15971 and R638 as primers (Table 1). Thermal cycling was conducted using a PTC-200 DNA engine (MJ Research, Waltham, MA, USA) under the following conditions: 95°C for 11 min; 35 cycles of 95°C for 30 s, 56°C for 30 s, 72°C for 90 s; and a final extension at 72°C for 7 min. The amplified PCR products were purified with exonuclease I (USB, Cleveland, OH, USA) and shrimp alkaline phosphatase (USB, Cleveland, OH, USA), and sequenced using a Big Dye Terminator Cycle Sequencing v2.0 Ready Reaction Kit (Applied Biosystems, Foster City, CA, USA). Sequencing reactions were analyzed using either an ABI PRISM 3100 or an ABI PRISM 310 Genetic Analyzer (Applied Biosystems). Primers used for sequencing, in addition to the amplification primers, were F16328, F015, F314, R16509, R240 and R569 (see Table 1). The results were analyzed using Sequencing Analysis Software Version 3.4 and Sequence Navigator 1.01 (Applied Biosystems). To ensure sequencing data quality, a redundant approach to data generation and analysis was used. Duplicate amplifications were sequenced in both the forward and reverse directions, and the resultant consensus sequences were used for further analyses.

Table 1 Amplification and sequencing primers for mtDNA control region

Identification of the major length variants of the C tracts in HV2 and HV3

To determine the major length variants of the C tracts (the major C-tract genotypes) in individuals with HV2 or HV3 length heteroplasmy, PCR amplification was carried out using a previously described method [17] with 50–100 pg of template DNA and the following sets of primers: F291 (5′-ATT TCC ACC AAA CCC CTC C) and R389 (5′-HEX-CTG GTT AGG CTG GTG TTA GG) for HV2 length heteroplasmy, and F447 (5′-CAT TAT TTT CCC CTC CCA CTC C) and R638 (5′-FAM-GGT GAT GTG AGC CCG TCT AAA C) for HV3 length heteroplasmy. The resultant PCR products were analyzed by capillary electrophoresis on an ABI PRISM 310 Genetic Analyzer (Applied Biosystems) using GeneScan analysis software version 3.1 (Applied Biosystems).

Pairwise comparisons

Pairwise comparisons within the population were performed using the program mtDNAmanager (Shin, unpublished data), which permits hypervariable length differences in C tracts of HV2 and HV3 to be either ignored or considered in comparisons. However, length variations in the C tract of the HV1 region were ignored in all comparisons.
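The comparison rule described above can be sketched in a few lines. This is only an illustration and not the mtDNAmanager implementation, whose internals are not described here. It assumes that haplotypes are written as sets of differences from the reference sequence, that C-tract length variants appear as insertions of the form "309.1C", and that the anchor positions in C_TRACT_ANCHORS (16193 for HV1, 309 and 315 for HV2, 573 for HV3) are the relevant ones; all function names and the two example profiles are hypothetical.

```python
import re

# Assumed anchor positions of poly-C insertions written as "<position>.<n>C".
C_TRACT_ANCHORS = {"HV1": {16193}, "HV2": {309, 315}, "HV3": {573}}
_INSERTION = re.compile(r"^(\d+)\.\d+C$")


def strip_length_variants(haplotype, regions):
    """Drop C-tract insertions anchored in the given regions."""
    anchors = set()
    for region in regions:
        anchors |= C_TRACT_ANCHORS[region]
    kept = set()
    for variant in haplotype:
        match = _INSERTION.match(variant)
        if match and int(match.group(1)) in anchors:
            continue  # length variant in an ignored C tract
        kept.add(variant)
    return frozenset(kept)


def differences(a, b, consider_hv2_hv3_length=False):
    """Return the variants that distinguish two haplotypes.

    HV1 C-tract length variants are always ignored; HV2 and HV3
    C-tract length variants are ignored unless explicitly considered.
    """
    ignored = ("HV1",) if consider_hv2_hv3_length else ("HV1", "HV2", "HV3")
    return strip_length_variants(a, ignored) ^ strip_length_variants(b, ignored)


# Illustrative profiles only (not taken from the database).
profile_a = {"16223T", "16193.1C", "73G", "263G", "309.1C", "315.1C"}
profile_b = {"16223T", "73G", "263G", "315.1C"}
print(differences(profile_a, profile_b))                                # no difference
print(differences(profile_a, profile_b, consider_hv2_hv3_length=True))  # 309.1C
```

Under these assumptions the two profiles match when C-tract lengths are ignored and differ only at 309.1C when HV2 and HV3 length variants are considered.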

Genetic characteristics of the control region

The genetic diversity, random match probability and average number of nucleotide differences were calculated as described by Tajima [18], Stoneking et al. [19] and Nei [20] using the program mtDNAstar (Shin, unpublished data). Major C-tract length variants of HV2 and HV3 were either considered or ignored in terms of distinguishing haplotypes to calculate the above statistical parameters.

Haplogroup determination of Korean mtDNA

Korean mtDNA haplotypes were classified into East Asian haplogroups and sub-haplogroups based on patterns of shared haplogroup-specific or haplogroup-associated polymorphisms as reported previously [9–12, 14], using the program mtDNAmanager (Shin, unpublished data). In addition, a phylogenetic analysis at the level of sub-haplogroups was performed using Network software version 4.1.0.8 (http://www.fluxus-engineering.com) [21, 22].

Results

Statistical parameters

We determined the nucleotide sequences of the complete mitochondrial DNA control region in 593 unrelated Koreans. A search of this mtDNA haplotype database is available at the authors' website (http://forensic.yonsei.ac.kr/). When complete control region sequences were compared with the major length variants of the C tracts in HV2 and HV3 considered in distinguishing haplotypes, 494 haplotypes (83.3%) defined by 285 variable sites were found. Among these, 424 sequences (71.5%) were observed only once, 50 twice (16.9%), 15 three times (7.6%), 2 four times (1.3%), 2 five times (1.7%) and 1 six times (1.0%). The most frequently observed haplotype was 16187T, 16223T, 16290T, 16319A, 73G, 235G, 263G, 309.1C, 315.1C, 523d, 524d in comparison to the revised Cambridge reference sequence (rCRS).

In the case of the major C-tract genotypes ignored for distinguishing haplotypes, 441 sequences (74.4%) were observed. From these, 362 haplotypes (61.0%) were observed once, 43 twice (14.5%), 17 three times (8.6%), 10 four times (6.7%), 6 five times (5.1%), and the other 3 sequences were observed six, eight, and ten times, respectively (1.0, 1.3, and 1.7%, respectively). The most abundant haplotypes (16187T, 16223T, 16290T, 16319A, 73G, 235G, 263G, 523d, 524d and 16223T, 16362C, 73G, 263G, 489C) were observed in 3.0% of the samples.
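The tallies in the two paragraphs above can be cross-checked with a short script. The sketch below takes the frequency spectra directly from the text and assumes that each percentage in parentheses is the share of the 593 individuals carrying haplotypes of that class; the function name is illustrative.

```python
# Frequency spectra copied from the text: {times observed: number of haplotypes}.
SPECTRUM_WITH_C_TRACT = {1: 424, 2: 50, 3: 15, 4: 2, 5: 2, 6: 1}
SPECTRUM_WITHOUT_C_TRACT = {1: 362, 2: 43, 3: 17, 4: 10, 5: 6, 6: 1, 8: 1, 10: 1}


def summarize(spectrum):
    """Return (distinct haplotypes, individuals, % of individuals per class)."""
    haplotypes = sum(spectrum.values())
    individuals = sum(times * count for times, count in spectrum.items())
    shares = {
        times: round(100 * times * count / individuals, 1)
        for times, count in spectrum.items()
    }
    return haplotypes, individuals, shares


for label, spectrum in [("C tracts considered", SPECTRUM_WITH_C_TRACT),
                        ("C tracts ignored", SPECTRUM_WITHOUT_C_TRACT)]:
    haplotypes, individuals, shares = summarize(spectrum)
    print(label, haplotypes, individuals, shares)
```

Both spectra account for all 593 individuals, with 494 and 441 distinct haplotypes, respectively.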

To better categorize the data obtained, several terms were used, as described previously [8]. The region spanned by 7S DNA during the replication of the H strand, situated between nucleotide positions (n.p.) 16366 and 72, was called the 7S DNA spanned region (7S-SP), and the region between n.p. 341 and 576 containing HV3 was called the HV3 extended region (HV3ex). In addition, we used the HV1 extended region (HV1ex) and the HV2 extended region (HV2ex), which refer to the regions spanning n.p. 16024 to 16400 and n.p. 41 to 340, respectively.
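The region definitions above can be written down as a small lookup. This is a sketch only: it assumes that the stated boundaries are inclusive and relies on the 16,569-bp length of the rCRS to handle 7S-SP, which wraps around the origin of the circular numbering. The names REGIONS, in_region and regions_of are illustrative.

```python
MTDNA_LENGTH = 16569  # length of the rCRS in base pairs

# (start, end) in rCRS numbering; boundaries assumed inclusive.
REGIONS = {
    "HV1ex": (16024, 16400),
    "HV2ex": (41, 340),
    "HV3ex": (341, 576),
    "7S-SP": (16366, 72),  # wraps around position 16569/1
}


def in_region(position, start, end):
    """True if position lies in [start, end] on the circular genome."""
    if not 1 <= position <= MTDNA_LENGTH:
        raise ValueError(f"position {position} is outside 1..{MTDNA_LENGTH}")
    if start <= end:
        return start <= position <= end
    return position >= start or position <= end  # region spans the origin


def regions_of(position):
    """List every defined region that contains the position."""
    return [name for name, (start, end) in REGIONS.items()
            if in_region(position, start, end)]


print(regions_of(16519))  # ['7S-SP']
print(regions_of(489))    # ['HV3ex']
print(regions_of(16390))  # ['HV1ex', '7S-SP']
```

Note that, with these boundaries, 7S-SP overlaps the end of HV1ex (n.p. 16366 to 16400) and the start of HV2ex (n.p. 41 to 72).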

Based on the observed frequencies of the mtDNA control region haplotypes and the data from partial areas of this region (i.e., HV1, HV2, HV3, 7S-SP, HV1ex, HV2ex, HV3ex, HV1+HV2, HV1+HV2+HV3, HV1ex+HV2ex, HV1ex+7S-SP+HV2ex, HV1ex+HV2ex+HV3CA, HV1ex+HV2ex+HV3, HV1ex+HV2ex+5SNPs, HV1ex+HV2ex+HV3CA+5SNPs, HV1ex+HV2ex+HV3+2SNPs), genetic diversity, random match probability, the average number of nucleotide differences, and the observed number of haplotypes were calculated (Table 2). Here, HV3CA refers to the CA dinucleotide repeats in HV3, five SNPs refer to n.p. 16497, 16519, 456, 489, and 499 located in 7S-SP and HV3ex with 3.0% or more transition rates, and two SNPs refer to n.p. 16497 and 16519 in 7S-SP.

Table 2 Genetic diversity, random match probability, average number of nucleotide differences and observed numbers of haplotypes in mtDNA of 593 Koreans

The genetic diversities of HV1+HV2, HV1ex+HV2ex+HV3 and the complete control region were estimated to be 0.9967, 0.9977, and 0.9983, respectively. Consideration of the major length variants of the HV2 and HV3 C tracts increased these genetic diversities to 0.9985, 0.9990, and 0.9992, respectively.

The probability of a random match between two unrelated individuals was calculated to be 1:200 (0.50%) for HV1+HV2, 1:256 (0.39%) for HV1ex+HV2ex+HV3, and 1:294 (0.34%) for the complete control region. When considering the major C-tract genotype in length heteroplasmy, this parameter decreased to 1:313 (0.32%), 1:370 (0.27%), and 1:400 (0.25%) for HV1+HV2, for HV1ex+HV2ex+HV3, and for the complete control region, respectively.
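As a worked check, the values for the complete control region can be recomputed from the haplotype frequency spectra given earlier in this section. The sketch assumes the conventional definitions, namely a random match probability equal to the sum of squared haplotype frequencies and a genetic diversity of n/(n − 1) times one minus that sum; the function names are illustrative.

```python
# {times observed: number of haplotypes}, complete control region, from the text.
SPECTRUM_WITH_C_TRACT = {1: 424, 2: 50, 3: 15, 4: 2, 5: 2, 6: 1}
SPECTRUM_WITHOUT_C_TRACT = {1: 362, 2: 43, 3: 17, 4: 10, 5: 6, 6: 1, 8: 1, 10: 1}


def sample_size(spectrum):
    return sum(times * count for times, count in spectrum.items())


def random_match_probability(spectrum):
    """Sum of squared haplotype frequencies."""
    n = sample_size(spectrum)
    return sum(count * (times / n) ** 2 for times, count in spectrum.items())


def genetic_diversity(spectrum):
    """Unbiased diversity estimate: n / (n - 1) * (1 - sum of p_i squared)."""
    n = sample_size(spectrum)
    return n / (n - 1) * (1 - random_match_probability(spectrum))


for label, spectrum in [("C tracts ignored", SPECTRUM_WITHOUT_C_TRACT),
                        ("C tracts considered", SPECTRUM_WITH_C_TRACT)]:
    rmp = random_match_probability(spectrum)
    print(f"{label}: RMP = {100 * rmp:.2f}%, diversity = {genetic_diversity(spectrum):.4f}")
```

Under these definitions the script gives 0.34% and 0.9983 when the major C-tract genotypes are ignored, and 0.25% and 0.9992 when they are considered, in agreement with the values reported for the complete control region.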

The average number of nucleotide differences increased as the combined partial control regions were expanded, up to 12.30, and in the case of the major C-tract genotype of length heteroplasmy considered in distinguishing haplotypes, this increased to 13.48.
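The average number of nucleotide differences is the mean, over all pairs of individuals, of the number of sites at which two sequences differ. The sketch below illustrates the calculation on difference-coded haplotypes. It assumes that each variant label ends in a single character giving the observed state (for example "16223T" or "523d"), that each deleted or inserted position counts as one difference, and that heteroplasmic calls are not present; the function names are illustrative, and the two haplotypes are the most abundant ones reported above with C-tract genotypes ignored.

```python
from itertools import combinations


def parse(haplotype):
    """Map each site label (e.g. '16223', '309.1', '523') to its observed state."""
    return {variant[:-1]: variant[-1] for variant in haplotype}


def pairwise_differences(a, b):
    """Count sites at which two haplotypes differ (a missing site means reference)."""
    sites_a, sites_b = parse(a), parse(b)
    return sum(1 for site in sites_a.keys() | sites_b.keys()
               if sites_a.get(site) != sites_b.get(site))


def average_differences(haplotypes):
    """Mean number of differences over all pairs of individuals."""
    pairs = list(combinations(haplotypes, 2))
    return sum(pairwise_differences(a, b) for a, b in pairs) / len(pairs)


HAPLOTYPE_1 = ["16187T", "16223T", "16290T", "16319A", "73G", "235G", "263G", "523d", "524d"]
HAPLOTYPE_2 = ["16223T", "16362C", "73G", "263G", "489C"]

print(pairwise_differences(HAPLOTYPE_1, HAPLOTYPE_2))
print(average_differences([HAPLOTYPE_1, HAPLOTYPE_1, HAPLOTYPE_2]))
```

In a real calculation the list passed to average_differences would hold one entry per individual, so haplotypes shared by several individuals contribute pairs with zero differences.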

Also, the observed number of haplotypes increased when regions were added to the calculation. The different sequences were counted as 408, 430, 438 and 440 for HV1+HV2, HV1ex+HV2ex+HV3, HV1ex+HV2ex+HV3+2SNPs, and the complete control region, respectively. The addition of the major C-tract length variant to the calculation resulted in 466, 483, 491, and 494 haplotypes for HV1+HV2, HV1ex+HV2ex+HV3, HV1ex+HV2ex+HV3+2SNPs, and the complete control region, respectively.

Variable sites in the mtDNA control region

The observed mutational events including nucleotide substitutions, insertions, and deletions compared to rCRS are shown in Table 3. The most polymorphic region was HV1, which contained 142 variable positions (49.82%), whereas the HV2 region showed 79 mutational sites (27.72%). In addition, 7S-SP and HV3ex displayed 29 and 35 variable positions, respectively (10.18 and 12.28%, respectively).

Table 3 Mutable sites observed in the sequence analysis of mtDNA control region with respect to revised Cambridge reference sequence

Substitutions were observed at 246 positions with a total of 5,487 differences. Of the 246 substitution sites, transitions and transversions were observed at 236 and 28 sites, respectively; 18 sites showed both transitions and transversions. Of the 5,487 differences, 5,164 occurrences (94.1%) were transitions, mainly of T to C and C to T (63.0%). Transversions were relatively rare (323 occurrences; 5.9%), but A to C and C to A predominated in the HV1 region (93.5%). The high rate of A to C transversion in Koreans (236 occurrences) was mainly due to a change at n.p. 16182 (71 individuals) or at 16183 (143 individuals).

Deletions were found at 21 sites with a total of 318 occurrences mainly in the HV3ex and HV2 regions. In particular, CA dinucleotide deletion at n.p. 523/524 in the HV3 region was observed in 219 individuals.

Insertions included two types of nucleotide addition, i.e., homoplasmic and heteroplasmic. Homoplasmic insertions were found mainly in HV2 (593 occurrences) and in HV3ex regions (17 occurrences). The high frequencies of homoplasmic insertion mutations in the HV2 and HV3ex regions were mainly due to the insertion of nucleotide C at n.p. 315 (589 individuals) and CA dinucleotide insertion at n.p. 524 (15 individuals). Other insertion sites with relatively high frequencies included 309.1 (375 individuals) and 573.1 (38 individuals), which were often found to be related to length heteroplasmy.

Length heteroplasmy also occurred in the poly-C tract of the HV1 region with a T-to-C transition at n.p. 16189 and in CA dinucleotide repeats of the HV3 region. One hundred eighty-eight individuals (31.7%) and 377 individuals (63.6%) showed length heteroplasmy in the poly-C tract of the HV1 and HV2 regions, respectively. In addition, three individuals (0.5%) and 39 individuals (6.6%) displayed length heteroplasmy in the CA dinucleotide repeats and the poly-C tract of the HV3 region, respectively. Point heteroplasmies were observed in 14 individuals: at 16093T/C twice, 16103A/G once, 16129G/A once, 16172T/C once, 16176C/T once, 16342T/C once, 146T/C once, 152T/C three times, 194T/C once, 195T/C once and 204T/C once.

In addition, all mutable nucleotide positions were ranked into three categories according to the observed mutation frequencies. In Table 3, data in italics indicate mutable sites observed in one to five individuals (<1%). Underlined data represent mutable sites observed in six to 17 individuals (1% to <3%), and those in bold indicate mutable sites observed in 18 or more individuals (≥3%). The nucleotide positions that showed high mutation percentages in Koreans (>20%) are 16183C (24.1%), 16189C (34.2%), 16193.pC (31.7%), 16223T (73.0%), 16362C (46.2%), 16519C (51.6%), 73G (99.5%), 152C (21.6%), 263G (99.0%), 309.1 (63.2%), 315.1 (99.3%), 489C (58.3%), 523d (37.1%) and 524d (36.9%).
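The ranking of mutable sites can be expressed as a simple threshold rule on the number of individuals carrying a change. The sketch below assumes that the three classes are one to five, six to 17, and 18 or more of the 593 individuals, with 18/593 (about 3.0%) taken as the lower bound of the bold class; the function names are illustrative.

```python
N_INDIVIDUALS = 593


def percent(count, n=N_INDIVIDUALS):
    """Share of individuals carrying a change, in percent (one decimal)."""
    return round(100 * count / n, 1)


def frequency_class(count, n=N_INDIVIDUALS):
    """Return the Table 3 typeface class for a site seen in `count` individuals."""
    if not 1 <= count <= n:
        raise ValueError(f"count must be between 1 and {n}")
    if count <= 5:
        return "italic"      # <1%
    if count <= 17:
        return "underlined"  # 1% to <3%
    return "bold"            # >=3%


for count in (5, 6, 17, 18, 143, 219):
    print(count, percent(count), frequency_class(count))
```

For example, the 143 individuals with a change at n.p. 16183 correspond to 24.1%, and the 219 individuals with the CA deletion at n.p. 523/524 correspond to 36.9%.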

Haplogroup determination and haplogroup distribution

We classified 592 Korean mtDNAs (99.8%) into East Asian haplogroups or sub-haplogroups by referring to the East Asian classification tree proposed by Yao et al. [9], Kivisild et al. [10], Kong et al. [11, 12] and Tanaka et al. [14]. The phylogenetic backbone used was in accord with Yao et al. [9], Kivisild et al. [10] and Kong et al. [11, 12], and the haplogroups D, G and N9 were subdivided by referring to the classification tree proposed by Tanaka et al. [14]. The basic informative SNPs in the control region used to identify haplogroups are listed in Table 4 (for more details, see ESM).

Table 4 List of important nucleotide positions that identify East Asian mtDNA haplogroups and haplogroup frequencies in Koreans

All of the present samples could be placed in macrohaplogroups M or N, where 489C in HV3 defined the haplogroup M. The observed haplogroup frequencies are shown in Fig. 1. The D4* haplogroup was observed at the highest frequency (15.7%), and other commonly observed haplogroups include D4a (5.1%), D4b2 (3.4%), M7b2 (3.7%), B4 (4.0%), B4a (3.7%), B4b1 (3.4%), B5b (3.0%), A4 (3.0%), A5 (3.2%) and N9a2 (4.7%).
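The first step of this classification, the split into macrohaplogroups M and N, can be sketched as a motif test. The sketch uses only the rule stated above (489C defines M; every other sample falls into N) and does not reproduce the full motif table or mtDNAmanager; a back mutation at n.p. 489 would mislead such a simple rule. The function names are illustrative, and the two example haplotypes are the most abundant ones reported in the Results with C-tract genotypes ignored.

```python
from collections import Counter


def carries_motif(haplotype, motif):
    """True if every variant of the motif is present in the haplotype."""
    return set(motif) <= set(haplotype)


def macrohaplogroup(haplotype):
    """Assign M when 489C is present, otherwise N (rule taken from the text)."""
    return "M" if carries_motif(haplotype, ["489C"]) else "N"


def macrohaplogroup_frequencies(haplotypes):
    """Percentage of samples per macrohaplogroup."""
    counts = Counter(macrohaplogroup(h) for h in haplotypes)
    total = sum(counts.values())
    return {group: round(100 * n / total, 1) for group, n in counts.items()}


HAPLOTYPE_1 = ["16187T", "16223T", "16290T", "16319A", "73G", "235G", "263G", "523d", "524d"]
HAPLOTYPE_2 = ["16223T", "16362C", "73G", "263G", "489C"]

print(macrohaplogroup(HAPLOTYPE_1), macrohaplogroup(HAPLOTYPE_2))
```

Finer haplogroups would be assigned in the same way, by testing longer motifs such as those listed in Table 4.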

Fig. 1 Haplogroup frequencies observed in 593 unrelated Koreans. Numbers and letters refer to the haplogroups defined in the text

Discussion

All statistical parameters favor analysis of the entire control region and consideration of the major C-tract genotype for distinguishing haplotypes. Low values of random match probability and high values for the average number of nucleotide differences were observed across the entire mtDNA control region in Koreans. However, the samples recovered from crime scenes are often highly degraded, which leads to poor PCR amplification of the larger-sized complete mtDNA control region. Therefore, the targeting and preferential amplification of mtDNA using small amplicon products and the selective scoring of highly informative variable sites is likely to be a more effective method for mtDNA analysis. The contributions of partial areas in the mtDNA control region to reduced random match probability can be assessed by comparing the values obtained from various combinations of the respective areas. Statistical parameters indicated that the analysis of HV1ex+HV2ex is more effective than that of HV1+HV2, and that the analyses of HV1ex+HV2ex+HV3+2SNPs and HV1ex+HV2ex+HV3CA+5SNPs are as efficient as entire control region analysis (Table 2). Accordingly, the analyses of HV1ex+HV2ex+HV3+2SNPs and HV1ex+HV2ex+HV3CA+5SNPs can be used as efficient alternatives to the analysis of the entire mtDNA control region in Koreans. This seems to be mainly due to the relatively high genetic diversity of the CA dinucleotide repeats in the HV3 region, and the relatively high number of changes at n.p. 16497 and 16519 in the Korean population. These findings imply that the analysis of some variable sites outside HV1 and HV2 can play an important role in forensic discrimination in certain populations.

On the other hand, the cumbersome determination of the major C-tract genotype by PCR amplification with fluorescence-labeled primers limits the use of length heteroplasmy in forensic discrimination, despite its capacity to increase the power of discrimination. Similarly, point heteroplasmy has the drawback that the number of point heteroplasmies observed tends to depend on the detection method, and the majority of heteroplasmic variants cannot be detected by direct sequencing of PCR products [23]. Nevertheless, 14 point heteroplasmies (at 11 heteroplasmic sites) were observed by direct sequencing of the mtDNA control region in the present study, and n.p. 16093 and n.p. 152, which showed point heteroplasmy in two and three individuals, respectively, are among the most frequently observed heteroplasmic positions described in previous reports [23–25].

Also, we classified Korean mtDNA into East Asian haplogroups (Table 4 and ESM). Reference to a worked-out phylogeny and comparisons with published data sets of East Asian populations were performed to avoid potential errors [9, 10, 14]. The basic informative SNPs in the control region used to identify haplogroups in the present study are similar to those used by Allard et al. [13] and Yao et al. [15], and more than 99% of the Korean mtDNA lineages can be allocated to specific mtDNA haplogroups according to their mutation motifs (Table 4). In addition to the HV1 and HV2 mutation motifs for haplogroup assignment, we collated the important mutational positions in HV3ex and 7S-SP. We also added mtDNA control region mutation motifs for the assignment of another 20 haplogroups and sub-haplogroups to those of Allard et al. [13] and Tanaka et al. [14]: D4b1, D4b2, D5b, D*, G1b, G2a1, G2a1a, G2a2, G3a, M7a, M8, M*, CZ, F1, F1ac, A4, A5, N9a1, N9a2 and N9b.

Compared with the mutation motifs presented by Yao et al. [15], we did not include 73G and 263G in any mutation motif used to define haplogroups because of their lack of discriminative power. Also, unlike Yao et al. [15], the A haplogroup mutation motif in this study did not include 152C. The site 152C was found to be variable within this haplogroup, which concurs with the findings of Allard et al. [13]. On the other hand, the M10 haplotypes in this study did not share 16519C, in contrast to the results of Allard et al. [13] and in agreement with the report of Yao et al. [15]. The F1a and F1b haplotypes found in this study shared the 16162G and 16182C–16183C mutations, respectively, in addition to the mutation motifs proposed by Yao et al. [15]. B4a haplotypes were found to share the 16182C mutation in addition to the mutation motif proposed by Allard et al. [13]. Both B4 and B4b1 haplotypes in this study were also found to share 16183C in addition to the mutation motifs described by Yao et al. [15], and the 16189d mutation observed in B4 in this study occasionally alternated with 16189C (confirmed by unpublished data). The 247d mutation in the Z haplogroup also alternated with 249d (confirmed by unpublished data). In addition to the mutation motifs proposed by Allard et al. [13] and Yao et al. [15], B5a haplotypes shared the 16129A–16187T–16519C–93G mutations. These sites possibly represent specific mutations of this haplogroup. In addition, the sequence 16093C–16188.1C–16193.1C–16362C–16390A–146C–150T–152C–182T–217C, which was reported to be found in some Japanese individuals in haplogroup D5, was observed in one Korean individual, and was assigned to haplogroup D5, as described by Maruyama et al. [26]. Also, in reference to Kong et al. [12], G4a in Maruyama et al. [26] corresponds to G1a in the present study.

The distribution pattern of Korean mtDNA haplogroup frequencies generally parallels that of the Japanese [13, 26], but shows slight differences from that of the Chinese [9, 13]. The D4* haplogroup occurred at the highest frequency in Koreans (15.7% in this study, 16.5% according to Allard et al. [13], and 31.9% by Maruyama et al. [26]) and in Japanese (19.6% Allard et al. [13] and 35.5% Maruyama et al. [26]), and was also common in Chinese (6.2% Allard et al. [13] and 14.2% Maruyama et al. [26]). However, the G haplogroup and its sub-haplogroups were observed at relatively high frequencies in Koreans (8.6% in the present study and 5.2% by Maruyama et al. [26]) and Japanese populations (10.4% Maruyama et al. [26]), but occurred sparsely in the Chinese (4% Allard et al. [13] and 3.4% Maruyama et al. [26]). Differences between the haplogroup frequencies of populations in the same geographic location shown in previous reports and in the present study are considered to be mainly due to different sample collection and classification categories. In addition, some haplogroups are defined on the basis of a very small number of mutation sites in the control region, which may make their precise determination difficult and occasionally lead to different haplogroup distributions.

In conclusion, all statistical and phylogenetic results in the present study indicate that our mtDNA data are relatively well suited for application to forensic casework and that they contribute to a better characterization of Korean mtDNA types. Our examination of sequence variations in the mtDNA control region, and the evaluation of the usefulness of various partial areas within the mtDNA control region in terms of reducing random match probability, suggest that variable sites lying outside HV1 and HV2 (e.g., HV3 and five highly informative SNPs, n.p. 16497, 16519, 456, 489 and 499) can play an important role in better differentiating Korean mtDNA. Also, stringent quality control of mtDNA data by haplogroup-directed database comparisons confirmed the absence of major systematic errors in our data (e.g., a mix-up of site designations, base shifts, or mistypings). However, the high frequencies of haplogroup D4* in the Korean and Japanese populations, but not in the Chinese population, suggest the need for finer subdivisions within this haplogroup to better characterize Korean and Japanese mtDNA lineages.