Introduction

The entire human mitochondrial DNA (mtDNA) genome was first sequenced in 1981 [1]; the sequence was subsequently revised at a few base positions, and the corrected version is known as the revised Cambridge Reference Sequence (rCRS) [1, 2]. Maternal inheritance, high copy number per cell, a high mutation rate, and lack of recombination have made mtDNA a valuable tool in forensic identification [3]. The mtDNA control region, approximately 1.1 kb long, contains three hypervariable regions (HV1, HV2, and HV3) and is highly polymorphic in humans, providing a high degree of discrimination between unrelated individuals.

Peninsular Malaysia, or West Malaysia, is situated at the southeastern tip of the Asian mainland, bordering Thailand to the north and separated from East Malaysia (Sabah and Sarawak) by the South China Sea [4]. The Malays represent about 50.4% of the total Malaysian population and are regarded as one homogeneous group [5], although we have shown that they can differ considerably in their genetic makeup [6]. Mitochondrial DNA data on the Malay population are so far scarce. Zafarina and Goodwin [5] reported only HV1 data for the modern Malays of Peninsular Malaysia, while Wong et al. [7] published HV1 and HV2 data for Malays living in Singapore. Recently, Maruyama et al. [8] published an HV1, HV2, and HV3 mtDNA database for Malay individuals living in Kuala Lumpur. In this study, we report a more comprehensive mtDNA data set, with a larger number of samples than previously reported and with the ancestry of each individual carefully ascertained for at least three generations without mixed marriage.

Methods and materials

Samples

A total of 248 Malay individuals were sampled from several locations in Peninsular Malaysia: 60 from Kelantan, 34 from Negeri Sembilan, 61 from Johor, 56 from Perak, and 37 from Kedah (Fig. 1). To ensure that their parents were of Malay origin, their family history was taken prior to blood collection. Aboriginal populations were not included in this study.

Fig. 1 Map showing specific locations of the Malay samples collected in Peninsular Malaysia. Malaysia (modified from: commons.wikimedia.org/wiki/File:Malaysia_stat)

Genomic DNA extraction

DNA extraction was performed using the QIAquick Blood kit (Qiagen Inc.) according to the manufacturer's protocol.

PCR amplification of hypervariable region

Three sets of oligonucleotide primers were used to amplify the corresponding regions: L15997 (5′-CAC CAT TAG CAC CCA AAG CT-3′) and H16410 (5′-GAG GAT GGT GGT CAA GGG AC-3′) for HV1; L048 (5′-CTC ACG GGA GCT CTC CAT GC-3′) and H408 (5′-CTG TTA AAA GTG CAT ACC GCC A-3′) for HV2; and L316 (5′-GCT TCT GGC CAC AGC ACT TA-3′) and H619 (5′-GGT GAT GTG AGC CCG TCT AA-3′) for HV3. Polymerase chain reaction (PCR) amplification was performed in a 20 µl reaction mixture consisting of 1× PCR buffer ((NH4)2SO4 and 2.5 mM MgCl2), 200 mM of each dNTP, 10 pmol of each primer, and 5 U of Taq polymerase (Invitrogen). The thermal cycling conditions were 95°C for 3 min, followed by 30 cycles of 95°C for 30 s, 60°C for 30 s, and 72°C for 45 s, then a final extension at 72°C for 5 min and a final hold at 4°C.

PCR amplification was conducted on the GeneAmp PCR System 9700 (Applied Biosystems). Amplicons were purified using the QIAquick PCR purification kit (Qiagen Inc.). PCR products were quantified, and 20 ng was used in each sequencing reaction.

Direct sequencing

A 10 µl sequencing reaction was prepared, consisting of 20 ng of purified PCR product, 3.3 pmol of primer, and 1:8 ABI BigDye® Terminator version 3.1. Cycle sequencing was performed on the GeneAmp PCR System 9700 (Applied Biosystems). The thermal cycling conditions were 96°C for 1 min, followed by 25 cycles of 96°C for 10 s, 50°C for 5 s, and 60°C for 4 min, with a final hold at 4°C. The sequencing reactions were purified by ethanol precipitation prior to sequencing on the ABI 3130xl Genetic Analyzer.

Analysis of data

Each sample was sequenced in both directions (5′ and 3′) to avoid ambiguities in sequence determination. Data were analyzed using the ABI Prism Sequencing Analysis software version 5.3.1. The sample sequences were aligned and compared with the rCRS [2] using Vector NTI Advance™ 9.0 (Informax™, MD, USA) to determine the polymorphisms. The C-stretch regions in HV1 and HV2 were verified by additional sequencing.
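The comparison against the rCRS can be illustrated with a minimal Python sketch. It assumes the two sequences are already aligned (gaps written as `-`) and that only the bases A, C, G, and T occur. The naming convention is inferred from the haplotypes listed in this paper: a transition is written as the position alone (e.g. 16189), a transversion as position plus the new base (e.g. 16182C), an insertion as position.1 plus base (e.g. 315.1C), and a deletion as position plus `d` (e.g. 249d). The function name `call_variants` and the toy sequences are hypothetical and are not part of the software used in the study.

```python
def call_variants(ref_aln, sample_aln, start):
    """List differences between two pre-aligned sequences.

    ref_aln, sample_aln: equal-length strings over A/C/G/T, '-' marks a gap.
    start: reference position of the first reference base in ref_aln.
    """
    if len(ref_aln) != len(sample_aln):
        raise ValueError("aligned sequences must have equal length")
    purines = {"A", "G"}
    variants = []
    pos = start - 1   # position of the last reference base consumed
    ins_count = 0     # number of bases inserted after `pos` so far
    for r, s in zip(ref_aln.upper(), sample_aln.upper()):
        if r == "-":
            if s == "-":
                continue              # gap in both rows carries no information
            ins_count += 1            # base present only in the sample
            variants.append(f"{pos}.{ins_count}{s}")
            continue
        pos += 1
        ins_count = 0
        if s == "-":
            variants.append(f"{pos}d")            # deletion
        elif s != r:
            # purine<->purine or pyrimidine<->pyrimidine is a transition
            is_transition = (r in purines) == (s in purines)
            variants.append(str(pos) if is_transition else f"{pos}{s}")
    return variants


# Toy alignment (not real mtDNA): first reference base is position 100
print(call_variants("ACGT-AC", "GCTTCA-", 100))
```

The sketch does not decide where gaps are placed in homopolymeric C-stretches; that step depends on the alignment and on the nomenclature rules followed by the laboratory.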

The genetic diversity and probability of random match were calculated using the following formulae \( h = n\left( 1 - \sum X^2 \right)/(n - 1) \) [9] and \( \sum X^2 \) [10], respectively, where \( \sum X^2 \) is the sum of the square of the haplotype frequencies, and n is the sample size. Other statistical parameters such as nucleotide diversity and mean number of pair-wise differences were generated by Arlequin version 3.1 [11].
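As an illustration, the two statistics can be computed from a list of haplotype labels (one per individual). The sketch below reads the genetic diversity as h = n(1 − ΣX²)/(n − 1) and the random match probability as ΣX²; the function name `diversity_and_match_probability` and the toy labels are hypothetical, and the code is not the software used in the study.

```python
from collections import Counter


def diversity_and_match_probability(haplotypes):
    """Return (h, sum_x2) for a list with one haplotype label per individual."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two individuals are required")
    counts = Counter(haplotypes)
    # Sum of squared haplotype frequencies = random match probability
    sum_x2 = sum((c / n) ** 2 for c in counts.values())
    # Sample-size-corrected genetic diversity
    h = n * (1 - sum_x2) / (n - 1)
    return h, sum_x2


# Toy data: four individuals carrying three distinct haplotypes
toy_labels = ["A", "A", "B", "C"]
h, p = diversity_and_match_probability(toy_labels)
print(f"h = {h:.4f}, random match probability = {p:.4f}")
```

With the toy labels the haplotype frequencies are 0.5, 0.25, and 0.25, so ΣX² = 0.375 and h = 4 × 0.625 / 3 ≈ 0.8333.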

Results and discussions

The “Malaysian Constitution” defines a Malay as an individual who speaks the Malay language, professes Islam, and follows Malay customs [12]. In Malaysia, the majority of Malays reside in Peninsular Malaysia. Previous studies have shown that the “Constitutional” Malay race is an admixture of different populations, namely Arabs, Chinese, Indians, Siamese, and Proto-Malays. Available mtDNA data on the Malay population of Peninsular Malaysia are so far limited and represent Malay individuals from a particular area only (for example, Maruyama et al. [8] studied Malays from Kuala Lumpur only). In this study, we report on a more diverse sample of Malays, representing the East (Kelantan), the North (Kedah and Perak), and the South (Negeri Sembilan and Johor) of Peninsular Malaysia. The Malay individuals involved in this study were also carefully selected by establishing Malay ancestry for at least three generations without any mixed marriage. None of the previously published studies applied this inclusion criterion in selecting samples [7, 8].

Haplotypes were determined after excluding the C-tract polymorphisms in the HV2 region, following the procedure of previous studies [13, 14]. The nucleotide polymorphisms in the 248 Malay individuals are shown in Table 1. A total of 157 HV1 haplotypes, 73 HV2 haplotypes, and 28 HV3 haplotypes were obtained. The genetic diversity was 99.24% for HV1, 95.14% for HV2, and 79.93% for HV3. The random match probability was 1.16% for HV1.

Table 1 Mitochondrial DNA polymorphisms for HV1, HV2, and HV3 and haplogroups in 248 unrelated Malay individuals living in Peninsular Malaysia

The most frequent HV1 haplotype (16140, 16182C, 16183C, 16189, 16217, 16274, and 16335) was shared by 11 individuals (4.44%), the most common HV2 haplotype (73, 249d, 263, 315.1C) was shared by 27 individuals (10.89%), and the transition at position 489 was the most frequently observed polymorphism in HV3 (27.02%). When the HV1, HV2, and HV3 sequences were combined, a total of 180 haplotypes were identified, 149 of which were unique and 31 of which were present in more than one individual.
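The counting step behind these figures can be sketched as follows, assuming each individual is represented by three lists of variants (HV1, HV2, HV3). The function name `summarise_combined`, the sample IDs, and the toy variant lists are hypothetical; they only illustrate how combined haplotypes might be tallied into total, unique, and shared counts.

```python
from collections import Counter


def summarise_combined(samples):
    """samples: dict mapping sample ID -> (hv1, hv2, hv3) lists of variants.

    Returns (n_haplotypes, n_unique, n_shared, (most_common_haplotype, count)).
    """
    # Merge the three regions into one order-independent haplotype per sample
    combined = {
        sid: tuple(sorted(v for region in regions for v in region))
        for sid, regions in samples.items()
    }
    counts = Counter(combined.values())
    n_unique = sum(1 for c in counts.values() if c == 1)
    n_shared = len(counts) - n_unique
    return len(counts), n_unique, n_shared, counts.most_common(1)[0]


# Toy data: S1 and S2 carry the same variants listed in a different order
toy = {
    "S1": (["16189", "16217"], ["73", "263"], ["489"]),
    "S2": (["16217", "16189"], ["263", "73"], ["489"]),
    "S3": (["16223"], ["73", "263"], []),
}
total, unique, shared, top = summarise_combined(toy)
print(total, unique, shared, top[1])
```

Sorting the merged variants gives each haplotype a canonical form, so two individuals with the same polymorphisms are counted together regardless of the order in which the variants were recorded.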

For the combined haplotypes, the genetic diversity was 99.47%, and the probability of two random individuals sharing the same mtDNA haplotype was 0.93%. The nucleotide diversity and the mean number of pairwise differences for the combined mtDNA haplotypes were 0.036063 ± 0.020101 and 12.544022 ± 6.230486, respectively.
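As a quick consistency check, assuming the genetic diversity is computed as \( h = n\left( 1 - \sum X^2 \right)/(n - 1) \) with n = 248, the reported random match probabilities (0.93% for the combined haplotypes and 1.16% for HV1) reproduce the reported diversities to two decimal places:

\[
h_{\text{combined}} = \frac{248}{247}\,(1 - 0.0093) \approx 0.9947, \qquad
h_{\text{HV1}} = \frac{248}{247}\,(1 - 0.0116) \approx 0.9924
\]

Because the inputs are rounded percentages, agreement beyond the second decimal place should not be expected.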

The most frequent mtDNA haplotype in the combined analysis of the HV regions (73, 146, 150, 195, 263, 315.1C, 16140, 16182C, 16183C, 16189, 16217, 16274, and 16335) was shared by 11 individuals (4.44%). This haplotype was found at a frequency of 2.42% by Maruyama et al. [8]. It was not observed in Vietnamese [15], Korean [16], Hong Kong [17], or northeast Chinese Han [18] populations, but was relatively frequent in Island Southeast Asian populations (2.8%) [19] and Aboriginal Taiwanese (5.5%) [20]. In addition, polymorphisms at positions 73, 263, 315.1C, 489, and 16223 were observed in most of the samples.

The second most frequent haplotype observed in this study (73, 146, 263, 315.1C, 523d, 524d, 16182C, 16183C, 16189, 16217, and 16261), which was shared by seven individuals (2.82%), was found in only one individual by Maruyama et al. [8]. This haplotype is also widely distributed in East and Southeast Asian populations [19, 21], but was not reported in Vietnamese [15], Hong Kong [17], or northeast Chinese Han [18] populations. A comparison of the most common haplotypes observed in our study with previously published Malay data is shown in Table 2. Most of the frequent haplotypes of the combined HV regions in this study were found at lower frequencies in the previously published Malay data [7, 8]. In fact, one of our most common haplotypes (73, 199, 260, 263, 315.1C, 332, 489, 16129, 16183C, 16189, 16192, 16223, and 16297), which was shared by six individuals (2.42%), was not observed in other studies.

Table 2 Comparison of the most frequent Malay mtDNA haplotypes observed in this study with previously published data

A point heteroplasmy at nucleotide position 16230 (Fig. 2) was observed in sample 174RW. This heteroplasmy was a purine transition (A → G). Because the homopolymeric cytosine tract between nucleotide positions 16184 and 16193 affected the downstream sequence quality, this point heteroplasmy was detected by sequencing the heavy strand and was confirmed by re-sequencing [22].

Fig. 2 Electrochromatogram showing point heteroplasmy at nucleotide position 16230 for sample number 174RW. The point heteroplasmy was detected through the heavy strand sequencing due to the occurrence of poly-C stretch from nucleotide position 16182 to 16193 on the light strand

Conclusion

We have generated the mtDNA database for HV1, HV2, and HV3 in 248 unrelated Malay individuals of Peninsular Malaysia. The database can be used for reference in forensic identification purposes as well as in evolutionary studies.