Introduction

PCR-based short tandem repeat (STR) polymorphism analysis has been routinely used in forensic genetic casework for individual identification in past decades [1, 2]. With recent advances in molecular technology, single nucleotide polymorphism (SNP) markers have also been explored for individual identification and parentage testing [3–5]. Compared with STRs, SNPs offer several important advantages: they can be analyzed with small amplicons (60–100 bp), are highly abundant and widespread in the genome, have low mutation rates, are stable for population genetic analysis, can be typed with high-throughput genotyping technologies, and require relatively simple data interpretation [6–8]. However, because SNPs are mainly bi-allelic markers, more than 40 SNP loci may be needed to match the discrimination power of the STR kits routinely used in forensic casework [9, 10]. In addition, SNP allele frequencies generally vary among population groups.

Among the millions of SNPs in the human genome, sets of potential markers for individual identification (IISNPs) have been investigated with the aim of reaching the discrimination power (DP) of commonly used STR marker systems such as the Combined DNA Index System (CODIS). Candidate IISNPs for global panels should be unlinked, with high heterozygosity and low F-statistics (Fst) [5, 11]. Several panels with various numbers of IISNPs have been established and validated as efficient for forensic individual differentiation, with low match probabilities in different populations around the world [6, 11–17].

Asia is the most populous continent, with heterogeneous genetic origins. The database of genetic variations and relationships among Asian populations is still limited. Kim et al. developed a SNP-based individual assignment system of 30 SNP loci for Korean individuals [5]. Lou et al. reported the performance of a 44-SNP individual identification assay for Chinese based on an IISNP panel established by Pakstis et al. [8, 14]. Other combinations of IISNPs have been reported for different Asian populations [18–20]. An IISNP panel for global forensic casework was established recently [21]. However, the heterozygosity of several IISNPs in this panel is relatively low in some Asian populations, based on the HapMap populations of NCBI. A robust universal IISNP panel with the smallest number of SNP loci is still under investigation. Furthermore, a database of more IISNPs may be helpful in developing panels for the examination of DNA mixtures.

Taiwan is an island located at the junction of the Western Pacific and the South China Sea. The population of Taiwan is heterogeneous: it is made up mainly of a Han population, together with several indigenous population groups and new immigrants or individuals from abroad currently living or working in Taiwan. Our intention in the present study was to identify additional efficient autosomal IISNPs and to develop an IISNP panel for forensic casework for people living in Taiwan. We tested several SNPs for individual differentiation in five population groups living in Taiwan. The allele frequency variation of the SNP loci among these population groups and their forensic genetic performance were evaluated.

Material and methods

Samples and DNA extraction

This retrospective study was approved by the institutional review board. A total of 502 DNA samples were collected from 208 apparently healthy unrelated Taiwanese Han (TWH; 119 males, 89 females), 83 Filipinos (FIL; 35 males, 48 females), 62 Thais (THA; 17 males, 45 females), 69 Indonesians (IND; 8 males, 61 females), and 80 individuals with European, Near Eastern, or South Asian ancestry (ENS; 34 males, 46 females, as an outlying control group). The blood samples and/or buccal swab samples were obtained from volunteer donors with informed consent. Standard phenol-chloroform-isoamyl alcohol extraction and the QIAamp blood kit (Qiagen, Hilden, Germany) were used for DNA extraction from peripheral whole blood samples, and the Viogene Blood and Tissue Genomic DNA extraction Miniprep system (Viogene, Taipei, Taiwan) was used for DNA extraction from buccal cells. A total of 13 cell-line DNA samples, including GM03657, GM06994, GM07000, GM07022, GM07056, GM07345, GM07346, GM07347, GM07357, GM09948, GM12861, GM12960 (obtained from the Coriell Institute for Medical Research; http://www.coriell.org/), and 9947A (Applied Biosystems, Carlsbad, CA, USA), were also analyzed. Nine sets of father-child-mother trio DNA samples and 15 degraded DNA samples (from putrefied bone remains) from real forensic casework were also genotyped for these SNPs.

Newly selected autosomal IISNP loci (group A)

To select candidate bi-allelic autosomal IISNPs, we used allele frequencies from four HapMap populations of NCBI (CEU, HCB, JPT, and YRI; http://www.ncbi.nlm.nih.gov/snp/). All the chosen SNPs had high average heterozygosity (≥0.49) in HapMap data, low Fst among the four populations (Fst <0.06), confirmed Hardy-Weinberg equilibrium (p value >0.05), a physical distance between SNP markers of >100 kb, and appropriate GC content (45–55 %) in their flanking sequences. Finally, for any two SNPs within a genetic distance of 1 Mb, the less heterozygous SNP was deleted. The nucleotide positions and approximate genetic map distances were examined in the process of selecting all the SNPs (NCBI dbSNP, http://www.ncbi.nlm.nih.gov/snp/). Based on these criteria, 127 SNPs were selected as group A for analysis in our sample set, and 63 of them could be easily integrated into the arrays. Primer design for Sequenom MassARRAY System genotyping imposes limitations: SNP loci with highly repeated sequences nearby, SNP loci with less specific primers, and SNP loci with amplicon size >100 bp were excluded from our panel.
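The screening criteria above can be read as a two-stage filter: per-SNP thresholds, followed by a spacing rule that keeps the more heterozygous of any two nearby SNPs. The sketch below is one possible reading and not the authors' actual pipeline. The `Candidate` record, its field names, the greedy visiting order, and the toy SNP identifiers are all assumptions made for illustration; the primer-design constraints of the genotyping platform are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one candidate SNP (field names are assumptions)."""
    snp_id: str
    chrom: str
    pos: int        # position in bp
    avg_het: float  # average heterozygosity over the four HapMap populations
    fst: float      # Fst among the four HapMap populations
    hwe_p: float    # Hardy-Weinberg p value
    gc: float       # GC fraction of the flanking sequence


def passes_thresholds(c: Candidate) -> bool:
    """Per-SNP thresholds quoted in the text."""
    return (
        c.avg_het >= 0.49
        and c.fst < 0.06
        and c.hwe_p > 0.05
        and 0.45 <= c.gc <= 0.55
    )


def select_candidates(cands, min_spacing=1_000_000):
    """Greedy spacing filter (an assumed interpretation of the 1 Mb rule):
    visit SNPs from most to least heterozygous and keep a SNP only if no
    already-kept SNP on the same chromosome lies within min_spacing bp."""
    kept = []
    passing = [c for c in cands if passes_thresholds(c)]
    for c in sorted(passing, key=lambda c: c.avg_het, reverse=True):
        if all(k.chrom != c.chrom or abs(k.pos - c.pos) > min_spacing for k in kept):
            kept.append(c)
    return kept


# Toy candidates, not data from this study.
toy = [
    Candidate("snpA", "1", 1_000_000, 0.500, 0.01, 0.40, 0.50),
    Candidate("snpB", "1", 1_500_000, 0.495, 0.02, 0.60, 0.48),  # within 1 Mb of snpA
    Candidate("snpC", "2", 2_000_000, 0.499, 0.01, 0.70, 0.60),  # GC content too high
    Candidate("snpD", "1", 5_000_000, 0.490, 0.03, 0.20, 0.52),
]
selected_ids = [c.snp_id for c in select_candidates(toy)]
```

In this toy set, `snpB` is dropped because it lies within 1 Mb of the more heterozygous `snpA`, and `snpC` fails the GC-content threshold.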

Well-known IISNP loci (group B)

To compare with the previously established universal IISNP panel of Pakstis et al., we also tried to genotype the samples based on the 45 IISNPs in Pakstis’s panel [14]. Thirty-three of them, which could be easily integrated into our genotyping system, were analyzed.

SNP genotyping

Genotyping of the 96 SNPs (63 in group A and 33 in group B) was performed using the Sequenom iPLEX matrix-assisted laser desorption/ionization time-of-flight mass spectrometry technology (MALDI-TOF MS, MassARRAY, Sequenom, San Diego, CA) according to the manufacturer’s instructions, at the National Center for Genome Medicine, Academia Sinica, Taiwan. The Sequenom iPLEX Gold assay (Sequenom, San Diego, CA), based on single base extension of an extend primer into the region of DNA variation, can detect insertions/deletions and single base substitutions in amplified DNA at multiplex levels of up to 36 DNA variant loci simultaneously. In brief, custom primers for locus-specific PCR and allele-specific extension were designed with MassARRAY Assay Design (version 3.1.2.5) (Sequenom). The DNA samples were amplified with primers flanking the targeted sequence, followed by dephosphorylation and allele-specific primer extension. The extension products were purified, spotted onto a 384-format SpectroCHIP, and subjected to MALDI-TOF MS (MassARRAY Compact Analyzer, Bruker, U.S.A.). Genotype clustering and calling were performed using the Sequenom MassARRAY TYPER (version 3.4). Quality control included repeat genotyping of 13 individual samples, which revealed 100 % agreement of genotype calls across all the SNPs we assayed. SNPs that did not yield clear genotype clusters or that had a call rate of less than 99 % were excluded from statistical analysis. The primers and amplicon sizes of the successfully analyzed SNPs are presented in Table 1.

Table 1 Primer sequences and amplicon sizes of the 75 informative SNPs

Statistical analysis

The allele frequencies for each SNP were calculated by gene counting within each population group studied, assuming each marker is a two-allele co-dominant genetic system. The genotypic data of the 13 cell lines were excluded from analysis. The polymorphisms were examined for consistency with Hardy-Weinberg ratios (in each population sample studied) by comparing the expected and observed numbers of individuals with each possible genotype in a simple chi-square test. The match probability, that is, the probability that two unrelated individuals would have the same multi-locus genotype, was calculated as described by Kidd et al. [12].
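The per-locus calculations described above can be sketched for a single bi-allelic SNP as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the genotype counts are toy values, and the per-locus match probability is computed here as the sum of squared genotype frequencies under Hardy-Weinberg proportions, which may differ in detail from the procedure of Kidd et al. [12].

```python
from math import erfc, sqrt


def allele_frequency(n_aa, n_ab, n_bb):
    """Gene counting: frequency of allele A from the three genotype counts."""
    n = n_aa + n_ab + n_bb
    return (2 * n_aa + n_ab) / (2 * n)


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test against Hardy-Weinberg proportions (1 df).
    Returns (chi2, p_value)."""
    n = n_aa + n_ab + n_bb
    p = allele_frequency(n_aa, n_ab, n_bb)
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    # Skip classes with zero expectation (monomorphic locus) to avoid dividing by zero.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


def locus_match_probability(p):
    """Sum of squared genotype frequencies, assuming Hardy-Weinberg proportions."""
    q = 1.0 - p
    return (p * p) ** 2 + (2 * p * q) ** 2 + (q * q) ** 2


# Toy genotype counts (AA, AB, BB), not data from this study.
p_a = allele_frequency(30, 40, 30)
chi2, p_value = hwe_chi_square(30, 40, 30)
mp_best_case = locus_match_probability(0.5)
```

For a bi-allelic marker the match probability is smallest when both alleles have frequency 0.5, where it equals 0.375; this is why many more SNPs than STRs are needed to reach a given discrimination power.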

In order to evaluate the statistical independence of the SNPs, linkage disequilibrium values (r²) [22] were calculated for unique pairings of the selected SNPs on the same chromosome using Haploview version 4.0 software [23].
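For reference, the r² statistic for a pair of bi-allelic loci can be written directly in terms of allele and haplotype frequencies. The sketch below assumes the two-locus haplotype frequency is already known; software such as Haploview estimates it from unphased genotype data, and that estimation step is not shown. The function name and the toy frequencies are illustrative only.

```python
def r_squared(p_ab, p_a, p_b):
    """r^2 between two bi-allelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and allele B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either locus is monomorphic")
    return d * d / denom


# Toy frequencies, not data from this study.
independent = r_squared(0.25, 0.5, 0.5)  # haplotype frequency equals p_a * p_b
complete = r_squared(0.5, 0.5, 0.5)      # alleles A and B always occur together
```

Values of r² near 0 indicate that genotypes at the two loci are approximately independent, which is the property required for multiplying per-locus match probabilities.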

Hardy-Weinberg ratios and the statistical independence of the loci were assumed. In order to evaluate the effectiveness of the SNPs, we first ranked the loci within each population according to their observed heterozygosity. Based on these per-population ranks, a new overall ranking was then developed using the sum of the ranks of the observed heterozygosity of each SNP, to represent the effectiveness of that SNP across these populations. SNPs with the same sum of ranks were ordered so that the marker with the better (lower) Fst for these five population samples obtained the better (lower) rank.
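The rank-sum procedure described above can be sketched as follows. The data layout (nested dictionaries), the function names, and the toy values are assumptions for illustration. The text does not say how ties in heterozygosity within a single population were handled, so this sketch simply assigns ordinal ranks in sorted order.

```python
def rank_within_population(het_by_snp):
    """Rank 1 = highest observed heterozygosity within one population."""
    ordered = sorted(het_by_snp, key=lambda snp: het_by_snp[snp], reverse=True)
    return {snp: rank for rank, snp in enumerate(ordered, start=1)}


def overall_ranking(het, fst):
    """Order SNPs by the sum of their per-population ranks; break ties by lower Fst.

    het: {population: {snp: observed heterozygosity}}
    fst: {snp: Fst across the populations}
    """
    rank_sum = {snp: 0 for snp in fst}
    for het_by_snp in het.values():
        for snp, rank in rank_within_population(het_by_snp).items():
            rank_sum[snp] += rank
    return sorted(rank_sum, key=lambda snp: (rank_sum[snp], fst[snp]))


# Toy values for three hypothetical SNPs in two hypothetical populations.
het = {
    "pop1": {"snpA": 0.50, "snpB": 0.48, "snpC": 0.40},
    "pop2": {"snpA": 0.45, "snpB": 0.50, "snpC": 0.42},
}
fst = {"snpA": 0.02, "snpB": 0.01, "snpC": 0.03}
ranking = overall_ranking(het, fst)
```

Here `snpA` and `snpB` have the same rank sum (3), so `snpB` is placed first because its Fst is lower.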

Results

In all, 96 IISNPs were analyzed. Among our 63 newly selected autosomal IISNPs (group A), 46 yielded clear genotype clusters (clearly identified genotypes), as did 29 of the 33 integrated universal IISNPs of Pakstis et al. (group B). As a result, 75 IISNPs with a call rate >99 % were chosen for further evaluation (Table 2). The chromosome nucleotide position shown in Table 2 for each SNP follows GRCh37.p13. Our additional 46 SNPs (group A) were distributed across 20 autosomes, that is, all autosomes except chromosomes 13 and 20. The heterozygosities of the 46 SNPs in group A are shown in Supplementary Table S1. The positions, call rates, and Fst values of these 75 informative SNPs based on all 502 samples are shown in Table 2. Table 1 presents the primer sequences and amplicon sizes of these 75 selected SNPs. The amplicon sizes ranged from 60 to 100 bp. All 75 SNPs met the criterion of Fst ≤ 0.05 among these five population groups (Table 2). The genotypes of these 75 IISNPs in the 13 cell-line DNA samples (GM03657, GM06994, GM07000, GM07022, GM07056, GM07345, GM07346, GM07347, GM07357, GM09948, GM12861, GM12960, 9947A) are presented in Supplementary Table S2.

Table 2 The position and average heterozygosity of the final list of 75 SNPs based on five population samples

All of the 502 DNA samples could be uniquely distinguished with these 75 SNPs. Supplementary Table S3 shows the allele frequencies and forensic efficiency parameters of each SNP in each population group. Thirty-two of the 46 newly selected SNPs of group A were polymorphic with observed heterozygosity ≥0.4 in each population, and 14 of the 29 SNPs of group B had observed heterozygosity ≥0.4 in each population (Supplementary Table S3).

Exact tests for Hardy-Weinberg equilibrium (HWE) were performed on all of the SNPs in each population. Most genotype distributions of the 75 SNPs were consistent with HWE (p > 0.05) in these population groups. However, 23 of the 375 tests (75 loci × 5 populations) gave p values below 0.05, suggesting possible deviation from HWE (Supplementary Table S3). The loci involved included eight (rs4364205, rs6955448, rs10092491, rs3780962, rs10488710, rs1736442, rs445251, rs221956) in group B, from the previous SNP panel of Pakstis et al. [14]. If the Bonferroni correction is applied, only p values below 0.00067 (0.05/75) would be considered significant [24, 25]. By this criterion, no meaningful deviation from Hardy-Weinberg ratios occurs for any of the 75 SNPs in these five population samples, and all loci tested in this study were deemed adequate for further evaluation.
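The Bonferroni threshold quoted above is the nominal significance level divided by the number of tests. The short sketch below reproduces the 0.05/75 figure from the text and, for comparison only, the stricter threshold that would result from correcting for all 375 locus-by-population tests; the function names and the list of p values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


def significant_after_correction(p_values, alpha, n_tests):
    """Return the p values that remain significant after correction."""
    threshold = bonferroni_threshold(alpha, n_tests)
    return [p for p in p_values if p < threshold]


per_locus = bonferroni_threshold(0.05, 75)   # threshold used in the text
all_tests = bonferroni_threshold(0.05, 375)  # stricter alternative, for comparison

# Hypothetical p values, all nominally below 0.05; not data from this study.
toy_p_values = [0.048, 0.021, 0.0034, 0.0009]
remaining = significant_after_correction(toy_p_values, 0.05, 75)
```

In the toy list, none of the nominally significant p values falls below the corrected threshold, which mirrors the situation reported for the 23 nominal deviations.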

The 111 pairwise LD values (r²) between SNPs located on the same chromosome among the 75 SNPs were calculated. The r² values ranged from 0 to 0.015, with the majority below 0.01. These results indicate that the 75 SNPs show no significant pairwise linkage disequilibrium, which supports treating them as statistically independent.

The new ranking of each SNP (based on the sum of the ranks of the observed heterozygosity of that SNP in each population) and the cumulative random match probabilities based on the observed heterozygosity of each locus in each population group are presented in Supplementary Table S4. The combined random match probability of the best 40 SNPs was 3.16 × 10⁻¹⁷, 6.31 × 10⁻¹⁷, 7.75 × 10⁻¹⁷, 3.26 × 10⁻¹⁷, and 3.98 × 10⁻¹⁷ in TWH, FIL, THA, IND, and ENS, respectively, and that of the best 45 SNPs was 2.33 × 10⁻¹⁹, 4.87 × 10⁻¹⁹, 7.00 × 10⁻¹⁹, 2.73 × 10⁻¹⁹, and 2.71 × 10⁻¹⁹ in TWH, FIL, THA, IND, and ENS, respectively. Our panel of the best 40 or 45 SNPs resulted in unique genotypes for every one of the individuals with complete typing.
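Under the assumption of independent loci, a combined random match probability is the running product of the per-locus match probabilities. The sketch below illustrates this and computes the theoretical floor for ideal bi-allelic markers (both alleles at frequency 0.5, Hardy-Weinberg proportions). The function names are hypothetical, and the per-locus formula is an assumption that may differ in detail from the calculation behind Supplementary Table S4.

```python
from itertools import accumulate
from operator import mul


def locus_match_probability(p):
    """Sum of squared genotype frequencies, assuming Hardy-Weinberg proportions."""
    q = 1.0 - p
    return (p * p) ** 2 + (2 * p * q) ** 2 + (q * q) ** 2


def cumulative_match_probabilities(per_locus):
    """Running product of per-locus match probabilities (independent loci assumed)."""
    return list(accumulate(per_locus, mul))


# Best case for bi-allelic markers: every locus has allele frequency 0.5.
best_case = cumulative_match_probabilities([locus_match_probability(0.5)] * 45)
floor_40 = best_case[39]  # product over 40 ideal loci
floor_45 = best_case[44]  # product over 45 ideal loci
```

With ideal markers the floor is roughly 9 × 10⁻¹⁸ for 40 loci and roughly 7 × 10⁻²⁰ for 45 loci, so the reported values for the best 40 and 45 SNPs lie within about one order of magnitude of the theoretical optimum.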

Nine paternity testing cases with father-child-mother trio samples were investigated using the best 45 SNP loci, with combined paternity indices ranging from 27,614 to 4,388,060. All the conclusions were consistent with the previous results obtained with STRs using the AmpFℓSTR® Identifiler kit (Applied Biosystems, Foster City, USA). The 15 degraded DNA samples gave incomplete STR profiles (1 to 11 detectable STRs, alleles with amplitude >50 RFU) for the 15 STRs of the AmpFℓSTR® Identifiler kit (Applied Biosystems) (Supplementary Table S5). For these 15 fragmented DNA samples, genotypes of 32 to 75 SNPs could be successfully identified.

Discussion

Several researchers have identified SNPs with high heterozygosity in diverse sets of populations [11, 13, 26]. Many research groups have found valuable universal IISNPs with good allele frequency distributions for many populations around the world. However, some of these recommended IISNPs have a low or relatively low heterozygosity in some major populations. For example, rs560681 and rs1335873 have a heterozygosity of 0.45 among Europeans and a low heterozygosity of 0.106 among sub-Saharan Africans (http://www.ncbi.nlm.nih.gov/snp, accessed on Feb. 9, 2015). Furthermore, rs1736442, rs9905977, rs1498553, rs6955448, and rs6444724 have a relatively low heterozygosity of 0.326, 0.356, 0.372, 0.372, and 0.395 among Chinese Han, respectively (http://www.ncbi.nlm.nih.gov/snp, accessed on Jul. 24, 2015) [11, 13, 27]. The reported panels of identity SNPs may therefore not be best suited for all major population groups, and about 95 SNPs may be necessary for the present global panel for individual identification [21]. For forensic application, more candidate IISNPs need to be identified and evaluated in more populations [13]. By selecting other SNP markers with high heterozygosity and low Fst, fewer SNPs may need to be placed in a global panel, which can reduce the complexity of assay design as well as the cost. Nevertheless, the existing panels should be tested first for global consistency. We tested both well-known IISNPs and newly selected IISNPs in this study.

In this study, we used a custom-designed MassARRAY system to evaluate the performance of 96 IISNPs in order to find additional candidate SNPs for individual identification in Asian populations. Because we intended to identify additional appropriate SNP markers, we used a physical distance of >100 kb between SNPs as the criterion at screening. For any two SNPs within a genetic distance of 1 Mb, we deleted the less heterozygous SNP, following the criteria of Kidd et al. and other previous reports [11, 28, 29]. Among the 96 IISNPs, 75 markers showed good performance, including 29 universal IISNPs reported by Pakstis et al. [14]. Among the 75 SNPs we tested, 71, 63, 69, 64, and 66 SNPs were found to be informative with high heterozygosity (≥0.4) in the TWH, FIL, THA, IND, and ENS population groups, respectively (Supplementary Table S3). All of these 75 SNPs showed low Fst (<0.05), and most of them showed uniformly high heterozygosity. The effectiveness of these 75 SNPs for individual identification varied among the populations. Our SNP panel can be considered in conjunction with SNPs in other panels to develop a universal IISNP panel with more markers.

The ranking of these 75 SNPs based on observed heterozygosity varies among population groups (Supplementary Table S3). Some of the IISNPs recommended by Pakstis et al. were among the IISNPs with the lowest observed heterozygosity in Taiwanese Han. Based on the sum of the ranks of the observed heterozygosity of the 75 IISNPs in the different populations, we chose the 45 best IISNPs among these populations, including 12 IISNPs reported by Pakstis et al. (group B) and 33 additional IISNPs in group A. These 33 additional IISNPs showed good forensic genotypic performance. Compared with the 45-IISNP panel recommended by Pakstis et al., which has match probabilities in the range of 10⁻¹⁵ to 10⁻¹⁸ for individual identification in 44 population samples, our panel presented a match probability of less than 10⁻¹⁸ when the 45 best IISNPs were analyzed [14, 16]. The combined random match probability (cRMP) values of the best 45 SNPs were 2.33 × 10⁻¹⁹ in the Taiwanese Han and 2.71 × 10⁻¹⁹ in the ENS population group. These values are comparable to those of a 15-STR set in Taiwanese (cRMP = 9.7 × 10⁻¹⁸) and in Caucasians (5.01 × 10⁻¹⁸) [30, 31]. The cRMP values of the best 45 SNPs were 4.87 × 10⁻¹⁹, 7.00 × 10⁻¹⁹, and 2.73 × 10⁻¹⁹ in Filipinos, Thais, and Indonesians, respectively. Thus, this set of 45 unlinked SNPs can serve as a panel for individual identification with low match probabilities in all five populations. The ranking result changed if expected heterozygosity was used instead, and the rank order of these markers may change as more population data are accumulated. In addition, genotypic data from small, isolated populations should be analyzed to test the usefulness of these IISNPs in those populations.

Based on previous reports by Pakstis et al. and Kidd et al., Lou et al. developed a 44-SNP SNaPshot assay and Wei et al. developed a 47-SNP multiplex assay for individual identification genotyping [14, 16]. In the study by Lou et al., the 44 amplicons ranged from 69 to 125 bp, with 36 % less than 100 bp [8]. In the report by Wei et al., the amplicon lengths ranged from 90 to 158 bp [18]. In our study, using this custom-designed MassARRAY system, the amplicon lengths of these 75 IISNPs ranged from 60 to 100 bp, with 58.7 % (44/75) ≤90 bp. Thus, our panel may be more efficient for genotyping degraded DNA in forensic casework. All 15 fragmented forensic DNA samples analyzed using the 75 IISNPs in this study provided informative genotypes. This panel can offer supplementary data when STR analyses yield incomplete profiles because of DNA degradation, such as in decomposed corpses and ancient human remains.

Conclusion

In this study, we demonstrated the population genetic characteristics and forensic performance of 75 IISNPs in five populations. The SNPs presented here will be a useful resource for the forensic genetic community. Using an array system, we developed an IISNP panel of SNPs with high heterozygosity and a low combined random match probability. This panel can provide informative human SNP genotypes in these populations and can be helpful for forensic casework.