Introduction

The analysis of mitochondrial DNA (mtDNA) has been applied in molecular anthropology [1], evolutionary biology [2], medical genetics [3, 4], and human identity testing [5]. Strictly maternal inheritance, lack of recombination, and a high mutation rate make mtDNA a viable marker system for assessing genetic relationships among individuals or groups. Although mtDNA has less discrimination power than autosomal DNA markers, its high copy number per cell increases the success rate of DNA typing of damaged, degraded, and low-quantity samples that fail to yield nuclear DNA profiles.

Forensic scientists have focused mainly on hypervariable regions I and II (HVI and HVII, respectively) due to the high concentration of variants in those regions of the mitochondrial genome (mtGenome). However, it has been reported that 75 % of the total variation within the mtGenome resides outside the control region, and therefore sequencing of the entire mtGenome increases the discrimination power and the value of generated data [6]. Sanger-type sequencing (STS) is still the main mtDNA typing technique in casework laboratories, and sequencing beyond the control region is rarely attempted because the well-established methodology is labor intensive, costly, and time consuming. In recent years, massively parallel sequencing (MPS) has been shown to be a feasible alternative to STS. With bench-top sequencers such as the Illumina MiSeq and Ion Torrent PGM (Personal Genome Machine), generation of whole mtGenome data is feasible for the application-oriented laboratory [6–10].

Population data are essential for haplotype frequency estimation. While more than 25,000 forensic mtDNA sequences and almost 35,000 mtDNA sequences in total are currently available in the EMPOP database [11, 12], these data include information only from the mtDNA control region. In addition to STS, a number of other technologies permit practical access to mtDNA coding region data, including single nucleotide polymorphism assays, sequence-specific oligonucleotide probes, mass spectroscopy, and MPS, which emphasizes the importance of a whole mtGenome database [13]. Recently, 588 forensic-quality whole mtGenomes from three major US populations have been determined with Sanger sequencing and will be available for query [14]. However, with MPS, population data can be generated far more expeditiously and at a lower cost per nucleotide. Thus, more whole genome data can be generated to exploit the full power of mtDNA for forensic identity testing.

The current population size of Estonia is slightly above 1.3 million. Throughout history, the native Estonian population has been affected by migration from both east and west due to numerous conquests. Estonians have lived under German, Danish, Polish, Swedish, and Russian rule. Thus, there is some expectation of genetic admixture from these populations. A European genetic map based on ~270,000 single nucleotide polymorphisms typed in >1500 individuals divided the European population into four groups and placed Estonians into the group comprising the Baltic region, Poland, and Western Russia [15]. It has been reported that Estonians have a higher Y-haplotype diversity and, based on their mtDNA HVI sequence, higher mean pairwise differences compared with other populations in the Baltic region [16].

Despite the fact that Estonian population data have been used for migration studies based on the mtDNA HVI region and a number of coding region single nucleotide polymorphisms [17, 18, 16], to the best of our knowledge there are no published whole mtGenome data. The objective of this study was to describe the genetic variability of the mtGenome in an Estonian population sample and to compare the discrimination power of the mtGenome with that of HVI/HVII data alone.

Materials and methods

Sample preparation and target amplification

Buccal swabs were collected from 114 unrelated Estonian volunteers according to protocols approved by the Tallinn Medical Research Ethics Committee. All procedures were performed in accordance with the ethical standards laid down in the 1964 Declaration of Helsinki and its later amendments or comparable ethical standards. In addition, all samples were collected in accordance with the requirements of the University of North Texas Health Science Center Institutional Review Board. Buccal samples were collected with sterile Eurotubo® collection swabs (Deltalab, Rubí, Spain). The swabs were allowed to air dry and were stored at ambient temperature. DNA extraction was performed with the QIAamp® DNA Blood Mini Kit (QIAGEN, Hilden, Germany) according to the manufacturer’s protocol. The quantity of extracted DNA was determined using the Qubit® dsDNA HS (High Sensitivity) Quantification Kit and a Qubit® 2.0 Fluorometer (Life Technologies, Foster City, CA, USA). Amplification of the mtGenome was accomplished as described by King et al. [6].

Nextera XT library preparation

For library preparation, 1.0 ng of DNA was used. Libraries were prepared using the Nextera XT DNA Library Preparation Kit (Illumina, San Diego, CA, USA) according to the manufacturer’s protocol [19], except for the library normalization after bead cleanup and the library preparation for sequencing. Following PCR cleanup, the libraries were quantified using the Qubit dsDNA BR (Broad Range) kit (Life Technologies, Foster City, CA, USA) and evaluated for fragment size using the High Sensitivity D1000 ScreenTape and 2200 TapeStation (Agilent Technologies, Santa Clara, CA, USA). Library normalization and preparation for sequencing were performed according to an in-house protocol. Purified libraries were normalized to 2 nM and pooled. Illumina PhiX control v3 (Illumina) was diluted to 2 nM with resuspension buffer (RSB, Illumina), and 2 μl of diluted PhiX were mixed with 14 μl of pooled library (2 nM). Further, 10 μl of the PhiX-library pool were mixed with 10 μl of freshly made 0.1 N NaOH and vortexed. The library-PhiX-NaOH mix was incubated for 5 min at room temperature. Pre-chilled HT1 (980 μl, hybridization buffer, Illumina) was added to the mix, giving a 20 pM library. Then, 600 μl of this library were mixed with 400 μl of HT1 for a final 12 pM sequencing library.
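The dilution series above can be checked with simple arithmetic. The following Python sketch is only an illustration of that calculation, using the volumes and concentrations stated in the text; the helper name `dilute` is hypothetical and not part of any protocol or kit software.

```python
def dilute(conc, vol_in, vol_total):
    """Concentration after bringing vol_in of stock at conc up to vol_total."""
    return conc * vol_in / vol_total

pool_nM = 2.0  # pooled libraries and PhiX were both normalized to 2 nM

# 2 ul PhiX + 14 ul pooled library: the concentration stays at 2 nM,
# and PhiX makes up 2/16 of the mix.
phix_fraction = 2 / (2 + 14)

# 10 ul of the pool + 10 ul of 0.1 N NaOH
after_naoh_nM = dilute(pool_nM, 10, 10 + 10)

# The 20 ul denatured mix + 980 ul HT1 (converted from nM to pM)
after_ht1_pM = dilute(after_naoh_nM * 1000, 20, 20 + 980)

# 600 ul of that library + 400 ul HT1
final_pM = dilute(after_ht1_pM, 600, 600 + 400)

print(f"PhiX fraction: {phix_fraction:.1%}")
print(f"After NaOH: {after_naoh_nM} nM, after HT1: {after_ht1_pM} pM, final: {final_pM} pM")
```

Under these volumes the intermediate library is 20 pM and the loaded library is 12 pM, in agreement with the final concentration stated in the protocol.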

Sequencing and data generation

The 12 pM pooled library was sequenced with MiSeq v2 (2 × 250 bp) chemistry (Illumina). The MiSeq re-sequencing protocol for small genome sequencing was followed according to the manufacturer’s recommendations. Criteria used for variant calling (quality threshold, heteroplasmy threshold, coverage threshold) were as described by King et al. [6]. On-board software (i.e., Real-Time Analysis and MiSeq Reporter) converted raw data to Binary Alignment/Map (BAM) [20] and Variant Call Format (VCF) v4.1 files [21] using the Genome Analysis Toolkit (GATK) [22]. During this process, the sequenced regions of interest (ROIs) were aligned to the revised Cambridge Reference Sequence (rCRS) [23].
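To illustrate how the three kinds of criteria (quality, heteroplasmy, and coverage thresholds) can interact when a single position is evaluated, a minimal Python sketch follows. It does not reproduce MiSeq Reporter or GATK behavior, and the threshold values are hypothetical placeholders; the values actually applied in this study are those of King et al. [6]. The function name and the input format (per-base read counts) are likewise assumptions made for the example.

```python
# Hypothetical placeholder thresholds, NOT the values used in this study.
MIN_COVERAGE = 100          # minimum reads covering the position
MIN_QUALITY = 30            # minimum quality score for the call
MIN_MINOR_FRACTION = 0.10   # minimum minor-allele fraction to report heteroplasmy

def classify_position(ref_base, base_counts, quality):
    """Classify one position from per-base read counts and a quality score."""
    coverage = sum(base_counts.values())
    if coverage < MIN_COVERAGE or quality < MIN_QUALITY:
        return "no call"
    # Rank observed bases by read count, most frequent first.
    ranked = sorted(base_counts.items(), key=lambda kv: kv[1], reverse=True)
    major, _ = ranked[0]
    minor, minor_n = ranked[1] if len(ranked) > 1 else (None, 0)
    minor_fraction = minor_n / coverage
    if minor_fraction >= MIN_MINOR_FRACTION:
        return f"heteroplasmy {major}/{minor} ({minor_fraction:.1%} minor)"
    return "reference" if major == ref_base else f"variant {major}"

print(classify_position("A", {"A": 5, "G": 995}, 40))
print(classify_position("T", {"T": 700, "C": 300}, 40))
print(classify_position("A", {"A": 20}, 40))
```

With these placeholder thresholds, the three calls return a homoplasmic variant, a point heteroplasmy, and a no-call due to insufficient coverage, respectively.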

Data analysis

Generated VCF files were converted into haplotypes using MitoSAVE [24]. mtDNA variants were confirmed manually using BAM files and Integrative Genomics Viewer (IGV) software. Indels at positions 309, 315, and 16193 were not included in the analysis. Random match probability (RMP) and genetic diversity (GD) were calculated according to the methods described by Stoneking et al. [25] and Tajima [26], respectively. Mean pairwise differences were calculated using MEGA [27]. HaploGrep software based on Phylotree 16 [28, 29] was used for haplogroup assignment.
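As an illustration of the two summary statistics, the sketch below assumes the commonly used forms RMP = Σ p_i² and GD = n/(n − 1) × (1 − Σ p_i²), where p_i is the frequency of the i-th haplotype and n is the sample size. This is a minimal Python sketch, not the software used in the study; the function names are hypothetical and the haplotype labels are toy data, not study results.

```python
from collections import Counter

def random_match_probability(haplotypes):
    """Sum of squared haplotype frequencies in the sample."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    return sum((c / n) ** 2 for c in counts.values())

def genetic_diversity(haplotypes):
    """Unbiased diversity estimate: n / (n - 1) * (1 - sum of squared frequencies)."""
    n = len(haplotypes)
    return n / (n - 1) * (1 - random_match_probability(haplotypes))

# Toy sample: five individuals, one haplotype observed twice (labels are made up).
toy_sample = ["hap_A", "hap_A", "hap_B", "hap_C", "hap_D"]

print(f"RMP = {random_match_probability(toy_sample):.2f}")
print(f"GD  = {genetic_diversity(toy_sample):.2f}")
```

In practice each haplotype would be represented by its full list of differences from the rCRS (for example, as a sorted tuple of variants) so that identical profiles compare as equal.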

Results and discussion

All 114 samples resulted in whole mtGenome sequences obtained by MPS (Supplemental Table 1). Of these mtDNA profiles, 100 (87.7 %) were unique within the data set, and 12.3 % of sequenced mtGenomes were observed twice. Compared to the previous whole mtGenome population studies [6, 14], the number of shared haplotypes in the current study is relatively high. This might reflect lower mtGenome diversity in Estonia or may simply arise from sampling variance. The majority of the samples were collected at two educational institutions, and although an effort was made to ensure that the sampled individuals were unrelated, distant kinship between these individuals cannot be excluded.

In total, 2663 variants were reported relative to the rCRS. These variants were distributed across 512 mtDNA positions. Variants 263G, 750G, 1438G, 4769G, 8860G, and 15326G were seen in 111 of 114 samples. The detection of these variants in the majority of the samples reflects the reference sequence used. The remaining three samples (EST-9, EST-19, and EST-40) exhibited few differences with respect to the rCRS (≤3), of which 1–2 were local or global private mutations as defined by HaploGrep. The haplogroup assignment quality score for sample EST-19 was low. The limited number of variants within the sample could indicate the bias previously observed in reference to Phylotree and the rCRS [30]. Therefore, the true haplogroup for the EST-19 individual (assigned to haplogroup H5e by HaploGrep with a quality score of 53 %) is likely to be in the H2 lineage. Six variants (73G, 2706G, 7028T, 11719A, 14766T, and 16519C) were found in ≥50 % of the samples. Of these variants, 73G and 11719A were haplogroup nodes for haplogroup R, variants 2706G and 7028T for haplogroup H, and variant 14766T for haplogroup HV. Position 16519 is considered a hotspot and thus is not useful for haplogroup assignment.

Point heteroplasmy was detected in 14 samples (12.3 %). The observed point heteroplasmies, with position coverage and heteroplasmy percentage, are listed in Supplemental Table 2. Three samples (EST-33, EST-81, and EST-106) exhibited two point heteroplasmy positions each. A similar extent of heteroplasmy (16.2 %), along with heteroplasmy at multiple positions per sample, has been observed previously in buccal cell mtDNA [31].

Haplogroup assignment using HaploGrep software resulted in 11 major clades and 87 distinct haplogroups. Major clades were D, HV (including haplogroup H), I, J, M, N, R, T, U (including haplogroup K), W, and X. Two clades were dominant: 54 samples (47.4 %) belonged to haplogroup HV (including H) and 27 samples (23.7 %) belonged to haplogroup U (including K). Haplogroups D, R, and X were each seen once. Haplogroups HV and U are the most represented haplogroups in the European population, with estimated frequencies of ≥50 % and ≥20 %, respectively [17]. Our results are concordant with previous mtDNA control region reports confirming the prevalence of haplogroups H and U in the Estonian population [32, 33]. While haplogroup H was introduced to Europe from the Franco-Cantabrian region, haplogroup U5, which was observed in 13 of our samples (11.4 %), is thought to have evolved in situ [32, 17]. Haplogroup U4 has been reported in the Eastern Baltic Sea region with a frequency up to 8.8 % and has been associated with Volga-Ural influence [16]. While haplogroup D is the second most common haplogroup in Northern Asia, haplogroup D5 has been found at a very low frequency in several European populations including Estonians [34]. The rare subhaplogroup D4e4b, observed in one of our samples, has been reported in Tatars and Russians [35].

Out of the observed 2663 nucleotide variants, 607 (22.8 %) were identified in HVI/HVII regions; accordingly, 77.2 % of variation resided in the coding region. These results are in accordance with results reported by King et al. on 283 individuals from Caucasian, Hispanic, and African-American populations [6]. As in the case of whole mtGenome data, the proportionally smaller level of variation in HVI/HVII may reflect lower mtGenome diversity in Estonia or may simply arise from sampling variance. Whereas 100 unique mtGenome haplotypes and 87 distinct haplogroups were observed with whole mtGenome data, only 66 unique haplotypes and 79 distinct haplogroups were observed using HVI/HVII data. Haplogroup comparison between full mtGenome and HVI/HVII data resulted in a haplogroup clade change according to HaploGrep for one sample (EST-59), which changed from haplogroup U5b2a1a2 (HVI/HVII; quality score 86.8 %) to haplogroup H (mtGenome; quality score 81 %). Haplogroup assignment based on mtGenome data resulted in a quality score increase for the majority of the samples that yielded a quality score less than 100 % with HVI/HVII. The lower quality score of sample EST-59 can be explained by an abundance of local and global private mutations that were not present in HVI/HVII data. GD for mtGenome and HVI/HVII was 99.67 % and 95.85 %, respectively. RMP was 1.20 % for mtGenome data versus 4.99 % for HVI/HVII. Compared to RMP values of other populations [14, 6, 36], the RMP results presented herein for HVI/HVII data are higher. This finding might be explained by 24 HVI/HVII haplotypes having ≥4 identical matches in the population sample. Mean pairwise difference within the Estonian population was 27 ± 11 for mtGenome data, which is slightly lower than reported by King et al. [6] for Caucasians. Mean pairwise difference for HVI/HVII data within the Estonian population sample was 7 ± 3. These results support the greater power of discrimination of the entire mtGenome over HVI/HVII.
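The reported figures can be cross-checked with a short calculation. The Python sketch below assumes the relation GD = n/(n − 1) × (1 − RMP), with RMP taken as the sum of squared haplotype frequencies; it uses only the values reported above (n = 114, the two RMP values, and the variant counts) and is a consistency check rather than a re-analysis of the data.

```python
n = 114  # number of sequenced individuals

# RMP values reported in the text, expressed as fractions.
reported_rmp = {"mtGenome": 0.0120, "HVI/HVII": 0.0499}

# Diversity implied by each reported RMP under the assumed relation.
implied_gd = {label: n / (n - 1) * (1 - rmp) for label, rmp in reported_rmp.items()}

for label, gd in implied_gd.items():
    print(f"{label}: implied GD = {gd:.2%}")

# Share of all reported variants that fall within HVI/HVII.
hv_share = 607 / 2663
print(f"HVI/HVII share of variants: {hv_share:.1%}")
print(f"Remaining share: {1 - hv_share:.1%}")
```

Under this assumption the implied GD values are 99.67 % and 95.85 %, matching the reported figures, and the variant shares come out at 22.8 % and 77.2 %.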

Conclusion

In this study, 114 mtGenome profiles from the Estonian population were generated. The results show that the use of the entire mtGenome, compared to HVI/HVII data, substantially improves the discrimination power and the quality of the resulting haplotypes.