Introduction

Pakistan is located in South Asia and has an estimated human population of two hundred million. The Pakistani population comprises diverse ethnicities. The region is characterized by the co-occurrence of several language families, including Indo-European and Sino-Tibetan, as well as linguistic outliers (Qamar et al. 2002). The genetic diversity of these ethnolinguistic groups from Pakistan is represented in the worldwide DNA variation project of the Human Genome Diversity Cell Line Panel (Cann et al. 2002). Endogamy is widely practiced in this region and has generated a high degree of genetic differentiation and structuring among different ethnic groups (Singh et al. 2016). A previous report on mitochondrial DNA (mtDNA) variation in this region sketched sharp boundaries and revealed that populations located west of the Indus basin of Pakistan mainly consist of western Eurasian mtDNA lineages with a limited South Asian contribution (Bhatti et al. 2017).

The Kho people live in the north-western Chitral district of Pakistan and speak Khowar, an Indo-Aryan language of the Dardic sub-branch. This ethnic group has not been represented in previous population genetic studies. Chitral district covers an area of 14,850 km² and is populated by approximately 0.45 million Kho inhabitants. The region is bordered by the Kunnar District of Afghanistan in the west, while on its northern and north-western sides Chitral district meets the Upper Dir, Swat, and Gilgit Baltistan regions of Pakistan (Fig. 1) (Nusser and Dickore 2002). Chitral has come under the influence of several invaders in the past, including Alexander of Macedonia, Tamerlane, the Tartars, Genghis Khan, and the Arabs. In addition to the long-standing practice of strong endogamy, the extremely isolated location of Chitral may have given the Kho group a genetic composition that differs from that of the surrounding populations.

Fig. 1 A Map of Pakistan showing Chitral district in red. B Detailed map of the study area, i.e. Chitral district, where the Kho people reside

mtDNA has been found suitable for studying population evolutionary genetics and molecular ecology (Grasgruber et al. 2017). In particular, the mtDNA control region, also known as the D-loop, has been found most suitable for profiling the history and geographic information of a population (Quintana-Murci et al. 2004). The D-loop is the large noncoding segment of the mitochondrial genome and is involved in the initiation and regulation of mtDNA transcription and replication. Polymorphic sites are present in the D-loop, and this mtDNA region has therefore been the object of many studies of the origin, history, and evolution of populations (Beha et al. 2008). Haplogroup typing based on this region remains a hallmark of studies of human evolutionary history and ancestry determination (Hagelberg et al. 1999). In the present study, we sequenced the mtDNA D-loop region of Kho population samples from Chitral to elucidate its maternal lineage distribution. In addition, we performed comparative population genetic analyses of the Kho samples with other north-western Pakistani groups to understand the Kho genetic structure in the context of the surrounding populations.

Materials and methods

Ethical approval and samples collection

Blood samples were collected from 16 unrelated Kho volunteers, both male and female, following the institutional ethical guidelines. Informed consent was obtained from all study participants. The sample donors were living in three different regions of Chitral, KP, i.e. Mastuj (MT), Torkhow (TK) and Mulkhow (MK). The participants were identified by maternal language, ethnicity and parents’ birthplace (Table S1; Supplementary Material). The blood samples were preserved in EDTA tubes.

DNA extraction

DNA was extracted from the blood samples using the TIANamp Blood DNA Kit (Cat. no. DP318) according to the manufacturer’s instructions.

PCR amplification and mtDNA sequencing

The target mtDNA region (1122 bp) was amplified by PCR using the primer set reported by Hayat et al. (2015). Each reaction used 1–2 ng of genomic DNA and a primer concentration of 0.05 µM. PCR was performed in a total volume of 30 µL using AmpliTaq Gold Master Mix. The amplification conditions followed the previously published report (Hayat et al. 2015). Sanger sequencing of the PCR products was carried out on an ABI 3130 genetic analyzer. The primer set used for PCR amplification was reused in the chain-termination cycle sequencing reaction. All samples were sequenced in triplicate (3x) to confirm the sequence variants.

Data analyses

DNA sequence reads quality check

The trace quality of the DNA sequence reads was checked using the Staden Package (Staden et al. 2000). The high-quality sequence reads were assembled using the Lasergene v. 7.1 package (DNASTAR Inc. USA).

Nucleotide sequence polymorphism and comparative population genetic analysis

The quality-filtered sequence data were aligned and compared with the same region of the revised Cambridge Reference Sequence (rCRS). The software package DnaSP 5.10 was used to estimate parsimony-informative sites and mutations (Rozas et al. 2017). The mtDNA sequences of the other north-western Pakistani ethnic groups (i.e. Pathan, Kashmiri, and Hazara) were obtained from GenBank, National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov) (Table S2; Supplementary Material). The sequences acquired from GenBank were trimmed to match the region covered by the Kho sequence data, and downstream comparative population genetic analyses were then performed.
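To illustrate what a count of parsimony-informative sites involves, the following minimal Python sketch classifies the columns of an alignment. It is not the DnaSP implementation: the function name `classify_sites` and the toy sequences are hypothetical, and the handling of gaps (skipping gap characters rather than excluding whole columns) is a simplifying assumption. A site is treated as parsimony-informative when at least two different nucleotides each occur at least twice.

```python
from collections import Counter


def classify_sites(alignment):
    """Classify each column of an equal-length alignment.

    Returns (segregating, informative): 0-based column indices that are
    variable, and the subset of those that is parsimony-informative.
    """
    lengths = {len(seq) for seq in alignment}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")

    segregating, informative = [], []
    for col in range(lengths.pop()):
        # Count nucleotides in this column, skipping gaps and ambiguous calls.
        counts = Counter(seq[col].upper() for seq in alignment)
        bases = {b: n for b, n in counts.items() if b in "ACGT"}
        if len(bases) < 2:
            continue  # monomorphic column
        segregating.append(col)
        # Informative: at least two different bases, each seen at least twice.
        if sum(1 for n in bases.values() if n >= 2) >= 2:
            informative.append(col)
    return segregating, informative


# Toy alignment (hypothetical sequences, not study data).
toy = ["ACGTA", "ACGTA", "ATGTC", "ATGTA"]
print(classify_sites(toy))  # ([1, 4], [1])
```

In the toy alignment, column 1 is informative (C and T each occur twice), whereas column 4 is variable but carries only a singleton change.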

Haplogroup and haplotype analysis

The bioinformatics tool MITOMASTER was used to assign haplogroups and haplotypes to the mtDNA sequence data of each population. MITOMASTER provides efficient and accurate analysis of human mtDNA sequences for determining haplogroups and haplotypes (Brandon et al. 2009).

Phylogenetic analysis

A neighbor-joining (NJ) phylogenetic tree of the Kho mtDNA sequences and those of the surrounding populations (i.e. Pathan, Kashmiri, and Hazara) was constructed in MEGA 7.0.9 with 1000 bootstrap replicates (Kumar et al. 2016).
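Neighbor-joining operates on a matrix of pairwise distances between sequences. The sketch below shows one simple way such a matrix could be computed, using the p-distance (the proportion of differing sites). The choice of p-distance is an assumption made for illustration, since the substitution model used in MEGA is not stated here, and the function names are hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites, ignoring columns with a gap or N."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        differences += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def distance_matrix(seqs):
    """Pairwise p-distances for a dict mapping sample name to sequence."""
    names = list(seqs)
    return {(x, y): p_distance(seqs[x], seqs[y]) for x in names for y in names}


# Toy sequences (hypothetical, not study data).
toy = {"s1": "ACGT", "s2": "ACGA", "s3": "AC-T"}
matrix = distance_matrix(toy)
print(matrix[("s1", "s2")])  # 0.25
```

A tree-building program would take such a matrix as input; bootstrap support is then obtained by resampling alignment columns with replacement and rebuilding the tree for each replicate.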

Haplotypes network analysis

The haplotype network of the Kho mtDNA sequences and those of the surrounding population groups was generated using NETWORK 5.0 (Gehring et al. 2016). The median-joining method implemented in NETWORK 5.0 was adopted, with pre-processing by star contraction (SC) and post-processing by maximum parsimony (MP) calculations. The star radius in SC was set to 5. The output of the MP calculations was then used to draw the network. The network was then refined, and each population was assigned a separate color.

Samples sequences submission to GenBank

The high-quality mtDNA draft sequences of the Kho samples were submitted to GenBank, NCBI, under the accession numbers MK124589-MK124604.

Results and discussion

Nucleotide sequence data of 706 bp were generated for 16 Kho individuals. The sequence data span nucleotide positions 16110-247 of the rCRS mtDNA D-loop region. The polymorphic sites observed in these mtDNA sequences are represented as haplotypes and haplogroups (Table 1). In total, 49 haplotypes and 13 haplogroups of mtDNA were observed in the Kho sample sequences. Of these haplotypes, 21 were found only once, while 28 were shared by more than one Kho individual. The haplotypes C16221d (30.61%), T16519C (26%), A73G (22.44%), and T16311C (10.20%) were the most frequent. All remaining haplotypes were present at a frequency of less than 2%. The mutations observed in these sequences relative to the rCRS were transitions (69.4%), transversions (5.34%), deletions (24.4%), and insertions (0.7%) (Table S3; Supplementary Material).
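The mutation classes reported above can be derived mechanically from variant labels of the form used here (e.g. T16519C, C16221d). The following minimal sketch assumes that a trailing "d" denotes a deletion and that substitutions are written as reference base, position, and variant base. Insertion labels are not handled because their notation is not shown in the text, and the function name is hypothetical.

```python
import re

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
# Reference base, rCRS position, then a variant base or "d" for deletion.
VARIANT = re.compile(r"^([ACGT])(\d+)([ACGTd])$")


def classify_variant(label):
    """Return the mutation class for a label such as 'T16519C'."""
    match = VARIANT.match(label)
    if match is None:
        return "unrecognized"
    ref, _position, alt = match.groups()
    if alt == "d":
        return "deletion"
    if ref == alt:
        return "unrecognized"  # not a change relative to the reference
    # Transition: purine to purine, or pyrimidine to pyrimidine.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


for label in ["C16221d", "T16519C", "A73G", "T16311C"]:
    print(label, classify_variant(label))
```

Applied to the four most frequent variants, this yields one deletion (C16221d) and three transitions (T16519C, A73G, T16311C).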

Table 1 The predicted haplotypes and haplogroups of the Kho population from Chitral district, north-western Pakistan

Western Eurasian haplogroups were predominant and mostly shared among the Kho samples, with an overall frequency of 50%. These include the HV8, H19, H57, H24, C, and C4a haplogroups. South Asian haplogroups and their subgroups, including U4, U4c, U6, U5a, and W, were also found in the Kho samples with an overall frequency of 37.5% (Table 2). Haplogroups and haplotypes indicate the origin and linkages of an individual and a population. The haplogroup analysis thus points to a western Eurasian ancestral origin of the Kho samples. A previous study also inferred a western ancestral origin for other Pakistani populations, including the Pathan and Hazara residing west of the Indus basin of Pakistan (Ilyas et al. 2015). Haplogroup W, found at a frequency of 12.5% in the Kho samples, has been reported in northern India at a frequency of 6%. Its coalescence age is relatively young, i.e. about 17,000 years. Although not predominant, this haplogroup is also observed at low frequency in European populations (Derenko et al. 2007). In addition, another South Asian haplogroup, i.e. M30, was identified in the Kho samples at a frequency of 6.2%. This indicates the genetic affiliation of the Kho with South Asian populations.
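With 16 samples, each individual contributes 6.25% to a haplogroup frequency, which makes the reported percentages easy to check. The short sketch below performs this arithmetic. The sample counts are inferred from the reported percentages and the sample size of 16 and are not taken from Table 2, and the helper name `percent` is hypothetical.

```python
def percent(count, total):
    """Frequency of a haplogroup as a percentage of all samples."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count / total


N_SAMPLES = 16  # Kho individuals sequenced

# Counts inferred from the reported frequencies (assumption, see lead-in).
print(percent(8, N_SAMPLES))  # 50.0  western Eurasian haplogroups
print(percent(6, N_SAMPLES))  # 37.5  U4, U4c, U6, U5a and W combined
print(percent(2, N_SAMPLES))  # 12.5  haplogroup W
print(percent(1, N_SAMPLES))  # 6.25  haplogroup M30 (reported as 6.2%)
```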

Table 2 The proportion of mtDNA haplogroups detected for Kho population samples

We performed an additional analysis to compare the partial mtDNA control region sequences of the Kho samples with those of nearby north-western Pakistani groups (i.e. Pathan, Kashmiri, and Hazara). Among the surrounding populations, the Kho samples shared a few haplotypes and haplogroups with the Kashmiri population but did not share a significant number of haplogroups and haplotypes with the Pathan and Hazara populations. This suggests that the Kho population has preserved a distinct western ancestry (Table 3). The comparative analysis revealed the Kho to be the least genetically diverse group among the north-western Pakistani populations compared (Table 4). This is consistent with a high rate of endogamy in the Kho tribe.
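One common way to express such a diversity comparison is Nei's haplotype diversity, H = n/(n-1) × (1 − Σ p_i²), where n is the sample size and p_i is the frequency of the i-th haplotype. The sketch below computes it for a list of haplotype labels. Whether Table 4 reports this particular statistic is an assumption, and the function name and toy labels are hypothetical.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Nei's haplotype diversity for a list of per-sample haplotype labels."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two samples are required")
    # Sum of squared haplotype frequencies.
    freq_sq = sum((count / n) ** 2 for count in Counter(haplotypes).values())
    return n / (n - 1) * (1.0 - freq_sq)


# Toy labels (hypothetical, not study data).
print(haplotype_diversity(["h1", "h1", "h1", "h1"]))  # 0.0, a single haplotype
print(haplotype_diversity(["h1", "h1", "h2", "h3"]))  # intermediate value
```

Values close to 1 indicate that almost every sampled individual carries a different haplotype, whereas lower values, as expected under strong endogamy, indicate that a few haplotypes dominate the sample.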

Table 3 The haplogroups and haplotypes composition of Kho and other north-western populations of Pakistan
Table 4 Population genetics comparison of the Kho ethnic group with north-western Pakistani groups whose mtDNA data are available in GenBank, NCBI

A neighbor-joining (NJ) phylogenetic tree analysis was carried out for the Kho mtDNA sequences together with those of the surrounding population groups to determine their phylogenetic relationships. The geographically distinct population groups formed separate tree clades. The Kho samples formed three different clusters in the phylogenetic tree. In addition, the Kho samples shared clades with Kashmiri mtDNA sequences with bootstrap values of 93%, 80% and 46% (Fig. 2). The present analysis shows that three individuals of the Kho population share maternal lineages with the Kashmiri population, although significant regional differences are present between these two populations in Pakistan. The presence of shared mtDNA haplogroups (i.e., C4a and W) between the Kho and Kashmiri reinforces their close phylogenetic relationship and suggests that the Kho hold a close maternal lineage relationship with the Kashmiri population (Table 3; Fig. 2).

Fig. 2 The neighbor-joining circular phylogenetic tree constructed on the basis of mtDNA D-loop sequences from the Kho and surrounding populations, i.e. Pathan, Hazara, and Kashmiri. The Kho samples are highlighted in red, Pathan in black, Kashmiri in green and Hazara in blue

We performed haplotype network analysis to further elucidate the genetic relationship of the Kho with the surrounding north-western Pakistani populations. The Kho haplotypes were found in three well-supported nodes shared with Kashmiri mtDNA haplotypes. A Kashmiri haplotype was found at the center as an ancestral haplotype for a few Kho nodes. The Hazara population’s haplotypes were found at a small distance from the Kho, while the Pathan haplotypes were scattered at the periphery of the network plot and showed no close relationship with the Kho samples (Fig. 3).

Fig. 3 The phylogenetic network of the Kho and nearby population groups. The haplotypes of each population are shown in different colors; Red: Kho; Green: Kashmiri; Black: Pathan; Blue: Hazara. Each node represents one mtDNA sequence, and the size of a node depends on the haplotype frequency. The length of the lines connecting the haplotypes reflects the distance of relatedness

The predominance of western mtDNA haplogroups in the Kho samples indicates a western Eurasian origin of the Kho group. However, the presence of a few South Asian haplogroups in a minor proportion suggests that the Kho might be an admixed population with South and western Asian genetic components. This is congruent with the geographical location of the Kho people, who currently reside at the crossroads of the South and Central Asian corridors. The shared haplogroups, haplotype network nodes and close phylogenetic tree branches of Kho and Kashmiri mtDNA indicate a common maternal lineage between the Kho and Kashmiri.

The analyses conducted here provide useful information about the genetic structure of the Kho ethnic group. The data generated and the analyses performed will provide a framework for future forensic investigations. Genome-wide population genetic analyses of Kho individuals in the context of major worldwide ethnic groups would yield useful information about the demography, origin, and natural selection events of this population.