Introduction

Y-specific short tandem repeat (Y-STR) markers are inherited as haploid markers through the paternal lineage, and these properties make Y-STRs a useful tool in deficiency paternity cases involving male offspring and in forensic genetic cases where the analysis of autosomal STRs has failed to yield sufficient probative information [1–3]. However, because the discriminating power of Y-STR haplotypes is lower than that of comparable autosomal STR panels, the collection and databasing of a large number of Y-chromosomal loci is essential for reliably estimating their frequencies. The International Y-STR User Group has recommended a set of Y-STRs for forensic application, defined as the minimal haplotype and consisting of the loci DYS19, DYS385a, DYS385b, DYS389I, DYS389II, DYS390, DYS391, DYS392, and DYS393. Despite the usefulness of these loci [4–6], additional Y-STR loci are required to improve the discriminatory capacity of the minimal haplotype. Although more than 200 STR loci have been described in the literature, only a limited number of additional Y-STR loci have been comprehensively evaluated in different ethnic groups and incorporated into multiplex systems to ascertain their suitability for potential forensic use.

In this study, 20 recently described Y-STR loci with compatible annealing temperatures and appropriate amplicon sizes were successfully incorporated into three novel multiplex systems. The loci include DYS434, Y-GATA-A10, Y-GATA-H4, DYS438, DYS439, DYS443, DYS444, DYS446, DYS447, DYS448, DYS456, DYS458, DYS460, DYS520, DYS531, DYS557, DYS622, DYS630, DYS635 (Y-GATA-C4), and DYS709 [6–16]. Seven loci (Y-GATA-H4, DYS438, DYS439, DYS448, DYS456, DYS458, and DYS635 (Y-GATA-C4)) are also covered by the AmpFLSTR® Yfiler™ polymerase chain reaction (PCR) amplification kit (Applied Biosystems, Foster City, CA) [17]. The results of this study are a first attempt to evaluate the discriminatory power and mutation rates of the 20 Y-STRs in the southeast China population, and thus help establish which loci are most polymorphic and most desirable to pursue in future population studies and forensic applications.

Materials and methods

Population sample

Chaoshan is a littoral area in the southeast of mainland China, facing the South China Sea to the south and bordering Fujian Province to the east, which lies opposite Taiwan across the Taiwan Strait. The area covers the cities of Shantou, Chaozhou, and Jieyang. A total of 158 unrelated, healthy, indigenous men living in the Chaoshan area were studied. Eighty father–son transmissions were investigated to estimate the mutation rate at these loci. Paternity was confirmed with a probability of more than 99.9% by autosomal STR analysis using the AmpFLSTR Identifiler PCR amplification kit (Applied Biosystems, Foster City, CA).

Extraction

DNA was extracted using the Chelex 100 [18] and proteinase K protocol. The quantity of recovered DNA was determined spectrophotometrically.

Multiplex amplification

A total of three multiplex PCR reactions for typing 20 Y-STRs were established. Primer sequences and concentrations are displayed in Table S1. Primers for DYS448 were redesigned using Primer3 software (http://www.genome.wi.mit.edu/cgi-bin/primer/), and primers for the other loci were retrieved from the genome database (GDB, http://www.gdb.org). Multiplex I includes nine STRs: DYS434, Y-GATA-A10, DYS438, DYS439, DYS531, DYS557, DYS448, DYS456, and DYS444; Multiplex II includes six STRs: DYS458, DYS460, DYS443, DYS447, DYS446, and DYS709; and Multiplex III includes five STRs: DYS622, DYS635, Y-GATA-H4, DYS449, and DYS630. The fluorescent dyes 6-FAM, HEX, or TAMRA were attached to the 5′-end of the forward primers to create blue-, green-, or yellow-labeled PCR products, respectively. The reverse primers included an extra six-base tail (shown in lower case letters) to promote nontemplate addition. Each PCR reaction contained 2–10 ng DNA, 1 × Taq buffer, 1.5 mM MgCl2, 200 μM each dNTP (Pharmacia Biotech), 1.5 U Taq polymerase (Promega), and 0.15–0.50 μM primers in a reaction volume of 37.5 μl. Thermal cycling was conducted on a PTC-200 DNA engine (MJ Research, Waltham, MA) using the following conditions:

  1. Multiplex system I

    95°C for 10 min followed by ten cycles at 94°C for 30 s, 55°C for 1 min, 72°C for 40 s, then 26 cycles at 90°C for 30 s, 56°C for 40 s, 72°C for 40 s, and a final extension at 60°C for 35 min.

  2. Multiplex system II

    95°C for 10 min followed by 32 cycles at 94°C for 30 s, 56°C for 1 min, 72°C for 1 min, and a final extension at 60°C for 35 min.

  3. Multiplex system III

    95°C for 10 min followed by 32 cycles at 94°C for 30 s, 59°C for 1 min, 72°C for 1 min, and a final extension at 60°C for 35 min.
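The three cycling programs above can be written down as data to compare cycle counts and programmed hold times. The following is a minimal sketch only: the structure and the names `PROGRAMS`, `hold_time_seconds`, and `total_cycles` are illustrative assumptions, and the totals ignore ramping between temperatures, so actual run times on a thermal cycler will be longer.

```python
# Cycling programs transcribed from the conditions listed above.
# Each entry: (initial denaturation in s,
#              [(number of cycles, [step durations in s]), ...],
#              final extension in s).
# Temperatures are omitted because only hold times are summed here.
PROGRAMS = {
    "Multiplex I": (10 * 60, [(10, [30, 60, 40]), (26, [30, 40, 40])], 35 * 60),
    "Multiplex II": (10 * 60, [(32, [30, 60, 60])], 35 * 60),
    "Multiplex III": (10 * 60, [(32, [30, 60, 60])], 35 * 60),
}


def hold_time_seconds(program):
    """Sum of all programmed holds, excluding temperature ramping."""
    initial, stages, final = program
    cycling = sum(cycles * sum(steps) for cycles, steps in stages)
    return initial + cycling + final


def total_cycles(program):
    """Total number of amplification cycles across all stages."""
    return sum(cycles for cycles, _ in program[1])


for name, program in PROGRAMS.items():
    minutes = hold_time_seconds(program) / 60
    print(f"{name}: {total_cycles(program)} cycles, {minutes:.1f} min of holds")
```

Under these assumptions, Multiplex I runs 36 cycles in two stages, whereas Multiplexes II and III each run 32 identical cycles and differ only in annealing temperature.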

Genotyping and sequencing

The amplified products were mixed with GeneScan-500 (ROX) size standard (Applied Biosystems, Foster City, CA) and separated by capillary electrophoresis using an ABI PRISM 310 Genetic Analyzer (Applied Biosystems, Foster City, CA) and GeneScan software 3.1 (Applied Biosystems, Foster City, CA). An appropriate matrix was established with matrix standards for the four dyes 6-FAM, HEX, TAMRA, and ROX. Fragment sizes were determined automatically using the GeneScan Analysis Software v.3.7 (Applied Biosystems, Foster City, CA) and by comparison with allelic ladders. Allele designation was carried out using Genotyper 2.5 software (Applied Biosystems, Foster City, CA). Sequenced allelic ladders were constructed for the 20 Y-STR loci by combining all observed alleles from each locus. Each allele was purified using a QIAquick PCR Purification Kit (Qiagen, Hilden, Germany) and sequenced on an ABI 310 automated sequencer using a BigDye Terminator Cycle Sequencing v2.0 Ready Reaction kit (Applied Biosystems, Foster City, CA). The results were analyzed using Sequencing Analysis Software Version 3.4 and Sequence Navigator 1.01 (Applied Biosystems, Foster City, CA). Allele nomenclature follows Gusmao et al. [8], and the Y-GATA-H4 locus is named according to the International Society for Forensic Genetics (ISFG) recommendations [8] and Mulero et al. [19].

Statistical calculations

Allele and haplotype frequencies were calculated by the gene counting method. The gene diversity and the haplotype diversity for these 20 loci were calculated in accordance with the formula [20]: \( h = \frac{n\left(1 - \sum x^{2}\right)}{n - 1} \), where n represents the number of individuals, and x is the allele or haplotype frequency in a given population sample. Standard error (S.E.) was calculated as: \( \text{S.E.} = \left\{ \frac{2}{n}\left[ \sum x^{3} - \left( \sum x^{2} \right)^{2} \right] \right\}^{1/2} \) [20]. Mutation rate was estimated as the number of mutations divided by the number of meioses analyzed, and statistical calculations were carried out using StatCalc 1.1 (http://www.ucs.louisiana.edu/kxk4695).
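As an illustration of the diversity and standard error formulas above, the following sketch computes both from a list of observed alleles (or haplotypes) by gene counting. The function name `gene_diversity` and the example allele calls are hypothetical and are not data from this study.

```python
from collections import Counter
from math import sqrt


def gene_diversity(observations):
    """Return (h, S.E.) for a list of observed alleles or haplotypes.

    h    = n * (1 - sum(x^2)) / (n - 1)
    S.E. = sqrt((2 / n) * (sum(x^3) - (sum(x^2))^2))
    where x are the frequencies obtained by gene counting.
    """
    n = len(observations)
    if n < 2:
        raise ValueError("at least two individuals are required")
    freqs = [count / n for count in Counter(observations).values()]
    sum_x2 = sum(x ** 2 for x in freqs)
    sum_x3 = sum(x ** 3 for x in freqs)
    h = n * (1 - sum_x2) / (n - 1)
    # Clamp at zero so floating-point rounding cannot give a negative radicand.
    se = sqrt(max(0.0, (2 / n) * (sum_x3 - sum_x2 ** 2)))
    return h, se


# Hypothetical allele calls at one locus for ten men (not data from this study).
alleles = [10, 10, 11, 11, 11, 11, 12, 12, 13, 11]
h, se = gene_diversity(alleles)
print(f"h = {h:.4f}, S.E. = {se:.4f}")
```

The same function applies to haplotype diversity if each observation is a whole haplotype rather than a single allele.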

Results and discussion

Description of Y-STR multiplex systems

A total of 20 loci with compatible annealing temperatures and appropriate amplicon sizes were identified and successfully incorporated into three multiplex systems, which are referred to as Multiplexes I, II, and III. The PCR reaction conditions were optimized by altering the concentrations of primer pairs and critical reagents and the thermocycling conditions. These systems proved to be useful methods for rapid, precise, and reproducible identification of each allele at the Y-STR loci over a wide range of conditions. The allele sizes of these 20 Y-STR loci found in our population range from 99 to 331 bp (see Fig. 1), allowing the efficient forensic detection of degraded DNA samples. Moreover, these loci are male-specific single-copy markers and thus exhibit only a single PCR product peak. These results indicate that the Y-STR multiplexes studied here will be of great benefit in analyzing mixed DNA samples, in investigating sexual assaults, and in paternity testing where the alleged father is not available but other patrilineal relatives are.

Fig. 1 Electropherogram of Multiplex I, Multiplex II, and Multiplex III

However, in a single sample, two PCR products were observed at DYS557 with both the monoplex and multiplex systems. Sequencing showed that both amplicons were consistent with other alleles at DYS557 apart from the number of repeats, which indicates that these products are two duplicated alleles at DYS557 in this sample. Duplications and triplications of this kind have been observed at other Y-STR loci and are described in the Y-STR Haplotype Reference Database (YHRD, http://www.yhrd.org, formerly http://www.ystr.org). Allele duplications at DYS19, DYS390, DYS385, and DYS391 have been reported previously [21–24], but no duplication at the DYS557 locus has been reported in other populations or ethnic groups until now. Although such additional allelic patterns occur against a background of rare (or unique) haplotypes at this moderately polymorphic locus, this sample warrants further study because such patterns are likely to be wrongly interpreted as a mixed DNA profile. DYS557 maps to 21.64 Mb on chromosome Y (see Fig. 2) and is located in band Yq11.223 within the AZFc region; rearrangement events occur in the AZF region. Sequencing of chromosome Y has revealed that its male-specific region is a mosaic of discrete sequence classes and includes massive palindromes [25, 26]. We think that the duplication at DYS557 is an isolated event: a duplication, or inverted duplication, of a small chromosomal region containing the DYS557 locus occurred in this individual or his father and, at the same time, a multistep mutation occurred in the duplicated sequence. These two duplicated alleles should be regarded as a new allele or a single haplotype at DYS557 (see Table 1; Table S3, H157).

Fig. 2 The positions of 20 Y-STR loci on chromosome Y

Table 1 Allele frequencies and gene diversity value at 20 Y-STR loci in a southeast China population

Allelic structure

Analysis of the allelic sequences indicated that the DYS434, DYS438, DYS439, DYS443, DYS446, DYS456, and DYS460 loci have simple repeat structures and that DYS444, DYS447, DYS449, DYS458, DYS520, DYS635 (Y-GATA-C4), and Y-GATA-H4 are complex repeat systems in our sample population. The sequences of these 14 previously published loci obtained in our study were identical to those contained in GenBank and to the results of Hou et al. [15] and Tang et al. [13], except for the number of core repeat motifs.

Allelic sequences of DYS448, Y-GATA-A10, and the novel Y-STR loci (DYS531, DYS557, DYS622, DYS630, and DYS709) are shown in Table S2. To increase the number of Y-STR loci that could be analyzed and keep the PCR product sizes reasonably short in each dye color, we adjusted the amplicon size of DYS448 by moving the primer positions in the flanking regions surrounding the repeat segment. The resulting DYS448 amplicon was 63 bp shorter than that obtained with the GDB primer sequences, and a total of three interrupting sequences and four repeat blocks with an AGAGAT motif were observed in its consensus sequence. The first and fourth blocks were variable in our population sample, as also found by Tang et al. [13]. Y-GATA-A10, DYS531, and DYS709 have simple repeat motifs, represented in GenBank as (GATA)n, (AAAT)n, and (CTTT)n, respectively. The loci DYS557, DYS622, and DYS630 have a nonvariant repeat block directly adjacent to the variant repeat motif, and the sequenced alleles of these loci showed that the variation was always in the largest repeat block in our population samples. Our nomenclature is therefore based on all repeats, including the adjacent nonvariant motifs, in accordance with the recommendations of the ISFG [8].

Population data

The allele frequency data and the forensic parameters of the 20 Y-STR loci in our population sample are presented in Table 1. The most common allele at each locus was: DYS434, 10; DYS438, 10; DYS439, 11; DYS443, 13; DYS444, 15; DYS446, 13; DYS447, 24; DYS448, 24; DYS456, 15; DYS458, 18; DYS460, 10; DYS520, 22; DYS531, 11; DYS557, 16; DYS622, 16; DYS630, 24; DYS635, 20; DYS709, 15; Y-GATA-A10, 13; Y-GATA-H4, 12. Almost all Y-STRs in this study had a gene diversity value of >0.5, the exceptions being DYS434 (0.2506) and DYS531 (0.4601). The highest diversity value was found at DYS447 (0.8034), followed by DYS622 (0.7903).

Most allele frequencies at the Y-STR loci studied here appeared to be different from those in Europeans and Africans [27–29], and for some Y-STR loci, distinct regional differences in the allele frequencies were observed between Chinese and several ethnic groups in east Asia [11, 13, 15, 30–32].

A total of four alleles were observed at DYS438 in our population; alleles 8 and 13 were not found, although they were observed by Berger et al. [33]. For DYS439, more men share allele 11 in our population, although the frequency of allele 12 is higher in the Austrian, Hu’s, Malaysian Malay, and Chinese ethnic populations [27–29]. In addition, there is a slight difference between our data and Hu’s data at DYS438 and DYS439, although the samples were collected from the same region [34]. The same phenomenon was also observed by Tang et al. [13] and Hou et al. [15] at DYS437(DYS457), and might result from sampling and sample size.

Compared with the data of Berger et al. [33] and Chang et al. [35], a total of seven alleles were detected at the Y-GATA-A10 locus, and allele 15 was found only in our population. Moreover, the distributions of allele frequencies were also different. Seven alleles were found at DYS448, more than the five observed by Berger et al. because of alleles 22 and 23; alleles 16, 18.2, and 19.4 were observed only by Chang et al.; and most men share allele 21 in our population and allele 20 in the Austrian population. Six alleles at DYS456 were found in both our population and the Malaysian Chinese ethnic population, and allele 15 is common in our population, similar to the result of Chang et al. Alleles 11, 12, and 19 at DYS456, detected by Tang et al. [13], were not detected in our samples. Eight alleles were found at DYS458, and allele 18 was the most frequent in our population. However, the allelic distributions differ from those reported by Berger et al. [33] and Chang et al. [35].

For DYS531, most men share allele 11 (0.7089), but in another Chinese Han ethnic population the frequency of allele 10 is higher [10]. Alleles 11 and 12 at DYS557, reported by Butler, were not detected in our population. Compared with the population studied by Gao et al. [9], genetic polymorphism at DYS622 is similar, but alleles 20 and 28 at DYS630 were observed only in our population. For DYS460, allele 13 was found only in our population, and allele 10 was the most frequent, whereas allele 11 is the most frequent in a southeast Iberian population [16]. Most men share allele 10 at DYS434 in our population, an allelic distribution similar to that reported by Hou et al. [15], whereas in a southeast Iberian population most men share allele 11 and allele 9 was not found [16]. The allelic distribution at DYS447 differs between our samples and those of Redd et al. [12]. Compared with the samples of Tang et al. [13], the allelic distributions were similar at DYS443 but different at DYS444. Six alleles were found at DYS709 in this population, with allele 15 the most frequent.

The combination of the allelic states of the 20 Y-STRs allowed us to construct highly informative haplotypes. A total of 157 haplotypes were observed in 158 individuals, of which 156 haplotypes were unique and only 1 haplotype was observed twice (Table S3). The overall haplotype diversity of the 20 Y-STR loci was 0.9997, which is higher than the values reported for Singaporeans [32], Chinese Han [36], and other East Asian population groups [31], although it should be noted that most of these studies used a smaller range of Y-markers. Many studies have revealed regional differences in the distributions of haplotypes and alleles, and these regional differences indicate that more subpopulations need to be typed to obtain reliable population data for Y-chromosome markers [36–40]. Therefore, our Y-STR data from southeast China could serve as a useful region-specific reference for forensic genetics.
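Counting haplotypes amounts to treating each man's alleles, taken in a fixed locus order, as one combined state. The sketch below shows the idea on toy data; the profiles, the three-locus subset, and the function name `haplotype_counts` are hypothetical and are not taken from Table S3.

```python
from collections import Counter

# Hypothetical profiles for five men at three of the loci (not study data).
LOCI = ("DYS438", "DYS439", "DYS456")
profiles = [
    {"DYS438": 10, "DYS439": 11, "DYS456": 15},
    {"DYS438": 10, "DYS439": 12, "DYS456": 15},
    {"DYS438": 11, "DYS439": 11, "DYS456": 16},
    {"DYS438": 10, "DYS439": 11, "DYS456": 15},  # same haplotype as the first man
    {"DYS438": 10, "DYS439": 11, "DYS456": 14},
]


def haplotype_counts(profiles, loci):
    """Count haplotypes; a haplotype is the ordered tuple of alleles at `loci`."""
    return Counter(tuple(p[locus] for locus in loci) for p in profiles)


counts = haplotype_counts(profiles, LOCI)
distinct = len(counts)                                # different haplotypes seen
unique = sum(1 for c in counts.values() if c == 1)    # seen in exactly one man
shared = distinct - unique                            # seen in two or more men
print(f"{distinct} haplotypes in {len(profiles)} men: "
      f"{unique} unique, {shared} shared")
```

Using a tuple in a fixed locus order keeps the comparison exact: two men share a haplotype only if they match at every locus included.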

Mutation rates

Knowledge of the mutation rates of Y-STRs used in paternity testing and forensic analysis is crucial for the correct interpretation of the resulting genetic profiles. In the 80 father–son pairs analyzed, 4 mutation events were identified, at the Y-GATA-H4 (allele 11 to allele 12), DYS439 (allele 11 to allele 10), DYS456 (allele 16 to allele 15), and DYS458 (allele 17 to allele 18) loci. Sequencing of the samples showing mutations confirmed that all length changes occurred in the variable repeat region of each locus, and no major difference was observed between the tendency for repeat loss or gain. For the four fathers with a mutation, age at the child’s birth was 25–41 years [average age 32 ± 4.10 (SD) years], while for the 76 fathers without mutations, age at the child’s birth was 20–56 years [average age 34.30 ± 8.12 (SD) years]. A nonparametric test (Mann–Whitney U test) showed that the difference in age between the two groups is not significant (p = 0.214). The average mutation rate was estimated at 0.25% per locus per generation (4/1600), with a 95% confidence interval of 0.9 × 10⁻³ to 0.54 × 10⁻². Although the data accumulated so far are limited, there were no statistically significant differences (p > 0.05) between our mutation rate and those reported in other Y-STR studies [41, 42], and the characteristics of the mutation events observed in this study were similar.
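The average mutation rate follows directly from the counts given above: 80 father–son pairs typed at 20 loci give 1600 meioses, in which 4 mutations were seen. The sketch below reproduces that arithmetic and adds one common way of attaching an interval to such a proportion. The Wilson score interval is an assumption made here for illustration; the method used by StatCalc 1.1 is not stated in the text, so the bounds printed by this code need not coincide with those reported.

```python
from math import sqrt


def mutation_rate(mutations, meioses):
    """Number of mutations divided by the number of meioses analyzed."""
    return mutations / meioses


def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion.

    This interval method is an illustrative assumption, not necessarily
    the one used in the original analysis.
    """
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return centre - half, centre + half


pairs, loci, mutations = 80, 20, 4
meioses = pairs * loci  # each father-son pair contributes one meiosis per locus
rate = mutation_rate(mutations, meioses)
low, high = wilson_interval(mutations, meioses)
print(f"rate = {rate:.2%} per locus per generation over {meioses} meioses")
print(f"Wilson 95% interval: {low:.2e} to {high:.2e}")
```

With only four events, any interval is wide relative to the point estimate, which is consistent with the caution expressed above about the limited amount of data.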

In conclusion, this study has shown that the 20 Y-STR loci in the three multiplex systems represent a valuable contribution to the analysis of Y chromosome variation for paternity analysis and forensic applications in the southeast China population. Further studies should establish the discrimination probability for different ethnic groups and construct new, larger multiplexes to enable simultaneous amplification of the better-performing loci.