Introduction

The history of Homo sapiens occupation of East Asia can be traced back to the late Paleolithic period, and anatomically modern humans (AMHs) permanently inhabited this region 50,000 years ago (Yang et al. 2017). Evidence from aboriginal Australian genomes suggested two waves of settlements in Asia, including an initial southern Out-of-Africa migration wave approximately 60,000 years ago and a later wave approximately 45,000 years ago (Rasmussen et al. 2011). However, this model of early, independent human dispersals into Asia was questioned due to its disregard of the archaic introgression of Denisovan genetic material into modern Asian Negrito, Australian, and New Guinean populations (Mallick et al. 2016). One well-fitted phylogenetic admixture graph with two archaic admixture events further suggested that New Guineans and Australians formed the eastern Eurasian clade together with mainland East Asians, represented by Austronesian Ami and Tai–Kadai Dai (Mallick et al. 2016). In East Asia, pioneering genomic work in the HUGO Pan-Asian project that genotyped 50,000 single-nucleotide polymorphisms (SNPs) in 1900 individuals from 73 populations was finished in 2009 and provided evidence that the main Southern Migration Route played an important role in the peopling of East Asia, with little genetic contribution from the Northern Migration Route (Consortium et al. 2009). However, other uniparental evidence (based on maternally inherited mitochondrial DNA and paternally inherited Y-chromosomal genomes) illuminated the possibility of genetic contributions from northern Eurasia into East Asia via the Northern Migration Route (Su et al. 1999; Wen et al. 2004).

The ethnolinguistic and cultural diversity in eastern Eurasia and its complex history of population admixtures have promoted the exploration of genomic information from people living on the East Asian subcontinent, and the results have been widely utilized in forensics, anthropology and medical genetics, especially for the exploration of the pathogenicity of genetic variants (medically relevant novel and rare loci) in the modern precision medicine era (Stoneking and Delfin 2010; Liu et al. 2018; Cao et al. 2020). However, early comprehensive population genetic studies have mostly focused on Central/South Asian genetic diversity (Damgaard et al. 2018; Narasimhan et al. 2019), and the pattern of genetic variations in East Asia has not been fully characterized, especially for China, which has multiple language- and agriculture-oriented centers. Chinese populations have been classically categorized according to the six main language families: Austronesian, Austroasiatic, Hmong–Mien, Tai–Kadai, Sino-Tibetan and Trans-Eurasian. Fortunately, recently obtained information from ancient genomes has reshaped our understanding of the peopling of East Asia (Raghavan et al. 2014; Jeong et al. 2016; Yang et al. 2017, 2020; Lipson et al. 2018; McColl et al. 2018; Bai et al. 2020; Gakuhari et al. 2020; Ning et al. 2020; Wang et al. 2020a, b; Zhang and Fu 2020). First, genetic variations in mitochondrial and autosomal DNA from paleolithic hunter-gatherers have demonstrated differentiated genetic relationships between Upper Pleistocene East Asians and modern people (Zhang and Fu 2020); 40,000-year-old Tianyuan people formed a deeply northern East Asian lineage but made limited genetic contributions to modern East Asians (Yang et al. 2017); and the Ancient Northern Eurasian clade (24,000-year-old Mal’ta) contributed one-third of its genome variations to indigenous Americans but also made only limited genetic contributions to modern East Asians (Raghavan et al. 2014). 
In contrast, the 11,000-year-old Longlin and Qingshuiyuan people exhibit maternal genetic continuity with modern southern East Asians (Bai et al. 2020).

Second, the analyses of Holocene East Asian or Southeast Asian ancient genomes have documented multiple waves of migration of Chinese farmers and subsequent admixture events with resident groups in Southeast Asia (Lipson et al. 2018; McColl et al. 2018), lowland East Asia (Ning et al. 2020; Yang et al. 2020), highland Qinghai–Tibet Plateau (Jeong et al. 2016), and the Japanese Archipelago (Gakuhari et al. 2020), among others (Wang et al. 2020a), which are also referred to as Holocene expansion events. McColl et al. explored the genetic prehistory based on the genome-wide polymorphisms of 25 late Neolithic-to-historic Southeast Asians and found that the first batch of southward migration of Yangtze rice farmers disseminated the proto-Austroasiatic language and mixed with the local Southeast Asian indigenous hunter-gatherers (represented by 7,950-year-old Hoabinhian genomes) to produce the modern Austroasiatic speakers distributed fragmentarily in Southeast Asia, such as the Mlabri and Htin (McColl et al. 2018). The genetic legacy of this first layer showed a strong genetic affinity with Andamanese and Japanese Jomon, which also possessed an ancient genetic connection between coastal Southeast Asia and the Japanese archipelago (Gakuhari et al. 2020). Later, multiple southward migrations from South China to the islands of Southeast Asia during the Bronze Age were associated with Austronesian expansion via the coastal expansion route and with mainland Southeast Asia associated with Tai–Kadai/Hmong–Mien speakers via the inland expansion route (McColl et al. 2018). Lipson and his colleagues at Harvard Medical School sequenced eighteen Neolithic-to-Iron Age Southeast Asians and reconstructed the phylogenetic lineage of these ancient populations (Lipson et al. 2018). 
They also found that indigenous Southeast Asian hunter-gatherers, a highly diverged Eastern Eurasian lineage, mixed with early farmers from South China to form Austroasiatic-related ancestral populations (Man_Bac). This study also recorded the ancient DNA signature of the southward migration of Tibeto-Burman ancestors in the late Neolithic/Bronze Age Oakaie population (Lipson et al. 2018). In China, 10 early Neolithic genomes from the Yellow River Basin and 16 Neolithic-to-historic genomes from Fujian and the surrounding region were used to reconstruct the landscape of North-to-South population stratification from the early Neolithic period. Stronger northern East Asian affinity in late Neolithic Fujian populations demonstrated that the main migration direction of East Asians in the Holocene period was from north to south (Yang et al. 2020). The closer genetic relationship between late Neolithic Tanshishan/Xitoucun populations, 3000-year-old Vanuatu samples and modern Austronesian Ami and Atayal suggested that the initial common origin of Austronesian speakers was located in the coastal region of South China. Coastal interactions between ancient populations from Vietnam, China, Japan and far eastern Russia showed that the coastal migration route was a convenient and important gateway for ancient population movement and mixture (McColl et al. 2018; Gakuhari et al. 2020; Wang et al. 2020a; Yang et al. 2020). In addition, 55 ancient genomes from the Yellow River, western Liao River and Amur River regions showed archeologically supported shifts in subsistence strategies that allowed northern millet farmers to adapt to climate change, which were achieved via population movements in the inland migration corridor between the three river regions (Ning et al. 2020). Comprehensive analysis of ancient and modern genome data from Tibetan individuals by Wang et al. illustrated that multiple Paleolithic and Neolithic migration events participated in the peopling of the Tibetan Plateau and revealed different population stratifications among culturally diverse Tibetans: a major influx of 2700-year-old Chokhopani ancestry into Ü-Tsang Tibetans, an additional western Eurasian influx into Amdo Tibetans, and a southern East Asian influx into Kham Tibetans (Wang et al. 2020b). Furthermore, three Holocene expansion events partially or completely associated with dissemination of the Trans-Eurasian, Sino-Tibetan, and southern language macrofamilies (Austronesian, Tai–Kadai, and Austroasiatic) have also been comprehensively documented via ancient population genomics (Wang et al. 2020a).

These intensified North-to-South population interactions in China have been well documented; however, east-to-west transcontinental communications need to be comprehensively characterized from a genetic perspective. Trans-Eurasian cultural exchange has recently been extensively documented via historic records and archeological findings (Dong et al. 2020). In addition, the corresponding transcontinental population movement during the Bronze Age to Iron Age has been confirmed in the core regions of Siberia (Damgaard et al. 2018). However, the mechanisms underlying the spread of the archeologically supported western Bronze Age package in China (i.e., assimilation of ideas or movement of people), as well as the association between the westward spread of millet agriculture and eastward spread of barley/wheat farming technology and human movement, need to be genetically explored, especially in the Hexi Corridor and the surrounding regions in northwestern China. Among the modern gene pool of northwestern East Asians, population genetic studies have focused on the topics of origin, migration, admixture, and substructure based on lower-density genetic markers or limited sample sizes (He et al. 2018b; Wang et al. 2018b; Li et al. 2020). However, a comprehensive survey of the genetic diversity and finer-scale structure of northwestern East Asians is in its infancy, and much work based on different genetic markers (single-nucleotide polymorphisms (SNPs), short tandem repeats (STRs) and so on) and denser anthropological sampling should be carried out. Here, we conducted a comprehensive population genetic survey based on the genetic variations of genome-wide STRs/SNPs in northwestern East Asia and investigated their forensic features, genetic diversity, population substructure, and phylogenetic relationships based on Paleolithic-to-modern East Asian genetic variations.

Materials and methods

Samples, DNA preparation and PCR amplification and profiling

We collected samples from 599 unrelated healthy individuals (152 males and 447 males) in Gansu Province in Northwest China (Figure S1) with written informed consent. Our study and corresponding protocols were reviewed and approved by the Medical Ethics Committee of Xiamen University (XDYX201909). In addition, we followed the recommendations of the Declaration of Helsinki (Nicogossian et al. 2014) and the regulations of the Human Genetic Resources Administration of China (HGRAC). We extracted genomic DNA using the QIAamp DNA Mini Kit (Qiagen) and quantified the extracted DNA with a NanoDrop-2000c instrument (Thermo Fisher Scientific). All prepared DNA templates were preserved at −20 °C until DNA amplification. We used the Huaxia Platinum PCR amplification kit and ProFlex 96-well PCR System (Thermo Fisher Scientific) following the kit recommendations to amplify the targeted 26 genetic loci in 549 individuals. We used the Applied Biosystems 3500XL Genetic Analyzer to separate the amplified DNA products by electrophoresis and GeneMapper ID-X v.1.4 (Thermo Fisher Scientific) to call and check the obtained genotype data. We used the Infinium® Global Screening Array (GSA) to genotype approximately 700 K SNPs across the whole genome in 50 male individuals. For quality control, we filtered on the missing rate per individual and per SNP, deviation from Hardy–Weinberg equilibrium and minor allele frequency with the following parameter settings (--mind 0.01, --geno 0.01, --hwe 0.001 and --maf 0.01) using PLINK 1.9 (Chang et al. 2015).
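
The missingness and allele frequency thresholds listed above can be illustrated with a minimal sketch. This is not the PLINK implementation: it assumes genotypes coded as 0/1/2 allele copies with None for a missing call, the function name qc_filter and the toy matrix are hypothetical, samples are filtered before SNPs as a fixed order chosen for the example, and the Hardy–Weinberg test is omitted for brevity.

```python
def qc_filter(genotypes, mind=0.01, geno=0.01, maf=0.01):
    """Return (kept sample rows, kept SNP column indexes).

    genotypes: list of per-sample lists; entries are 0/1/2 or None (missing).
    """
    n_snps = len(genotypes[0])
    # 1) drop samples whose missing-call rate exceeds `mind`
    kept_samples = [
        row for row in genotypes
        if sum(g is None for g in row) / n_snps <= mind
    ]
    if not kept_samples:
        return [], []
    kept_snps = []
    for j in range(n_snps):
        column = [row[j] for row in kept_samples]
        called = [g for g in column if g is not None]
        # 2) drop SNPs whose missing-call rate exceeds `geno`
        if 1 - len(called) / len(column) > geno:
            continue
        # 3) drop SNPs whose minor allele frequency is below `maf`
        freq = sum(called) / (2 * len(called))
        if min(freq, 1 - freq) < maf:
            continue
        kept_snps.append(j)
    return kept_samples, kept_snps


# Hypothetical toy matrix: 4 samples x 4 SNPs (loose thresholds for illustration)
toy = [
    [0, 1, 2, 0],
    [1, 1, 2, 0],
    [None, None, 2, 0],   # half of the calls missing
    [2, 0, 2, None],
]
samples, snps = qc_filter(toy, mind=0.3, geno=0.3, maf=0.05)
```

With these loose toy thresholds the third sample is removed for missingness, the third SNP is removed as monomorphic, and the fourth SNP is removed for missingness among the remaining samples.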

Population database

To conduct a comprehensive population genetic survey, we employed five different reference databases to make population comparisons: two STR genotype-based datasets, one STR allele frequency-based dataset, and two high-density SNP-based datasets. The first comprised genotypes from 23 autosomal STRs in 12,960 individuals (549 genotypes first reported here and 12,411 genotypes collected from public databases) from 17 Eurasian populations and was referred to as the 23STR genotype dataset. This dataset comprised seven Sinitic populations (He et al. 2018a, c; Wang et al. 2018a; Liu et al. 2019; Chen et al. 2019a, b, c, d; Li et al. 2020) (Han populations from Gansu, Chengdu, Hainan, Shanxi, Shaanxi and Zhujiang and one Hui population from Wuzhong), four Turkic-speaking populations (Jin et al. 2017; Chen et al. 2019b, c; Liu et al. 2019) (one Kyrgyz group from Akto and three Uyghur groups from Artux, Urumqi and across Xinjiang), five Tibeto-Burman-speaking populations (Wang et al. 2018a; Liu et al. 2019) [one Yi group from Liangshan, two Tibetan groups from Tibet and two Sichuan Tibetan groups (Liangshan and Chengdu)] and one Central Asian population, the Hazara from Quetta (Chen et al. 2019a). The second dataset included 14,365 individuals from the aforementioned 17 populations and 3 Western Eurasian populations (Estonian, Polish, and Saudi Arabian) based on the STRs shared between 2 different forensic PCR amplification systems (Sadam et al. 2015; Alsafiah et al. 2017; Ossowski et al. 2017) and was referred to as the 20STR genotype dataset. The third included 57 worldwide populations with allele frequency distributions of 20 STRs (all loci except D6S1043, Penta D, and Penta E) and was referred to as the frequency-based dataset (Gaviria et al. 2013; Park et al. 2013, 2016; Fujii et al. 2014; Almeida et al. 2015; Hossain et al. 2016; Wang et al. 2016; Zhang et al. 2016a, b; Choi et al. 2017; Guerreiro et al. 2017; Moyses et al. 2017; Ossowski et al. 2017; Taylor et al. 2017; Wu et al. 2017; Yang et al. 2018; Chen et al. 2019a). The other two SNP-based datasets were formed by merging our newly genotyped data with the publicly available Human-Origin and 1240 K datasets and were then used to conduct the population genomic analyses for East Asian populations (Patterson et al. 2012; Jeong et al. 2019; Wang et al. 2020a). The basic dataset of Eurasian modern and ancient reference populations was collected from the Reich Lab (https://reich.hms.harvard.edu/downloadable-genotypes-present-day-and-ancient-dna-data-compiled-published-papers). Additional modern and ancient reference population data of East Asians from China, Japan, Mongolia and Nepal were collected from recent publications (Jeong et al. 2016; Ning et al. 2020; Wang et al. 2020a; Yang et al. 2020).

STR-based statistical analysis

We first estimated the statistical parameters related to forensic genetics (personal identification and parentage testing) in Gansu Han. We used the online tool STRAF (Gouy and Zieger 2017) to estimate the forensic parameters [matching probability (PM), discrimination power (PD), typical paternity index (TPI), power of paternity exclusion (PE), gene diversity (GD), polymorphism information content (PIC)] and pairwise Fst genetic distances (Weir and Cockerham 1984). We subsequently used Arlequin 3.5.2 to test the Hardy–Weinberg equilibrium (HWE) with 100,000 Markov chain steps and 100,000 dememorization steps via the type of locus by locus and linkage disequilibrium among all pairs of the included 23 STR loci with 10,000 permutations and 2 initial conditions of expectation maximization (Excoffier and Lischer 2010). Arlequin 3.5.2 was also used to calculate the observed heterozygosity (Ho) and expected heterozygosity (He). We used Phylip software (Cummings 2004) to calculate Cavalli-Sforza and Nei genetic distances based on the allele frequency distributions. Multivariate Statistical Package (MVSP) software 3.22 (Kovach 2007) was used to perform principal component analysis (PCA) among 57 worldwide or 28 Eurasian populations based on allele frequency distribution, and the cmdscale function in R was used to generate multidimensional scaling plots (MDS) of the worldwide or Eurasian populations based on different genetic distance matrixes. Phylogenetic frameworks were constructed via the neighbor-joining (NJ) algorithm in MEGA 7.0 (Kumar et al. 2016). The individual ancestry composition of the two genotype-based datasets was determined using STRUCTURE (Evanno et al. 2005).
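
For readers unfamiliar with the forensic parameters named above, the following minimal sketch computes them for a single locus using the conventional textbook formulas. It is an illustration under stated assumptions, not the STRAF or Arlequin code: the function name forensic_parameters and the toy genotypes are hypothetical, each genotype is assumed to be a pair of allele labels, and PE and TPI are assumed to follow the usual heterozygosity-based approximations.

```python
from collections import Counter


def forensic_parameters(genotypes):
    """genotypes: list of (allele1, allele2) tuples for one STR locus."""
    n = len(genotypes)
    allele_counts = Counter(a for g in genotypes for a in g)
    p = [c / (2 * n) for c in allele_counts.values()]       # allele frequencies
    geno_counts = Counter(tuple(sorted(g)) for g in genotypes)
    pm = sum((c / n) ** 2 for c in geno_counts.values())    # matching probability
    ho = sum(a != b for a, b in genotypes) / n              # observed heterozygosity
    hom = 1 - ho                                            # observed homozygosity
    sum_p2 = sum(x ** 2 for x in p)
    sum_p4 = sum(x ** 4 for x in p)
    return {
        "PM": pm,
        "PD": 1 - pm,                                       # discrimination power
        "Ho": ho,
        "GD": (2 * n) / (2 * n - 1) * (1 - sum_p2),         # gene diversity
        # PIC = 1 - sum(p_i^2) - sum_{i<j} 2 p_i^2 p_j^2
        "PIC": 1 - sum_p2 - (sum_p2 ** 2 - sum_p4),
        "PE": ho ** 2 * (1 - 2 * ho * hom ** 2),            # power of exclusion
        "TPI": float("inf") if hom == 0 else 1 / (2 * hom), # typical paternity index
    }


# Hypothetical toy locus with four typed individuals
params = forensic_parameters([(10, 11), (10, 10), (11, 12), (10, 12)])
```

The toy locus is far too small for real inference; it only shows how each index follows from genotype and allele counts.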

Genomic-based statistical analysis

We used two higher-density datasets to reconstruct the deep population genetic history of northwestern Han. We used the Smartpca program built into EIGENSOFT v.6.1.4 to perform PCA (Patterson et al. 2006) and ADMIXTURE v.1.3.0 (Alexander et al. 2009) to carry out the model-based ancestral composition dissection based on the Human-Origin-merged dataset following our previous default settings (He et al. 2020a). Ancient individuals were projected onto the top two principal components (numoutlieriter: 0 and lsqproject: YES). We used PLINK v.1.9 (Chang et al. 2015) to prune SNP data with strong linkage disequilibrium with the following parameters (--indep-pairwise 200 25 0.4). We used the qp3Pop program of ADMIXTOOLS to conduct admixture-f3(Source1, Source2; Han_Lanzhou) to explore the admixture source proxies and perform outgroup-f3(Source, Han_Lanzhou; Mbuti) to explore the shared genetic drift. We further used the f4-statistics in the forms f4(Eurasian1, Eurasian2; Han_Lanzhou, Outgroup) and f4(Eurasian1, Han_Lanzhou; Eurasian2, Mbuti) to study genetic affinity, continuity and admixture. We also used qpWave/qpAdm and qpGraph (Haak et al. 2015) to estimate the admixture proportion and phylogenetic splits and admixture events based on the 1240 K-based merged dataset. Eight worldwide representative populations were used as outgroups, which included five modern populations (Mbuti, Papuan, Australian, Mixe and Onge) and three Eurasian ancient people (Ust_Ishim, Kostenki14 and MA1_HG). We finally used ALDER to estimate the time of North-to-South and West-to-East admixtures with 28 years as one generation length (Loh et al. 2013).
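
The f3 and f4 statistics used above have simple definitions in terms of population allele frequencies, which the following minimal sketch illustrates. It is not the ADMIXTOOLS implementation: it assumes per-SNP allele frequencies are already known for each population, omits the finite-sample bias correction and the block-jackknife standard errors that the real programs apply, and the function names and toy frequencies are hypothetical.

```python
def f4(a, b, c, d):
    """f4(A, B; C, D): mean over SNPs of (a - b) * (c - d)."""
    return sum((w - x) * (y - z) for w, x, y, z in zip(a, b, c, d)) / len(a)


def f3(source1, source2, target):
    """f3(Source1, Source2; Target): mean over SNPs of (t - s1) * (t - s2).

    A clearly negative value indicates that the target is intermediate
    between the two sources, as expected for an admixed population.
    """
    return sum(
        (t - x) * (t - y) for x, y, t in zip(source1, source2, target)
    ) / len(target)


# Hypothetical allele frequencies at four SNPs
north = [0.0, 1.0, 0.2, 0.8]
south = [1.0, 0.0, 0.8, 0.2]
mixed = [0.5, 0.5, 0.5, 0.5]          # an equal mixture of the two sources

admixture_f3 = f3(north, south, mixed)
symmetry_f4 = f4(north, south, mixed, mixed)
```

In this toy case the admixture-f3 value is negative because the target frequencies lie between the two sources at every SNP, and the f4 value is zero because the last two populations are identical. The outgroup-f3 form uses the same function with the outgroup in the target position.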

Results

Forensic features and genetic diversity of northwestern Han Chinese individuals

Gansu Province, with a population size of over 26 million, is located between the Qinghai–Tibet Plateau and Loess Plateau in Northwest China and served as an important corridor for the prehistoric human occupation of the Qinghai–Tibet Plateau (especially for Yellow River millet farmers). Moreover, the Hexi Corridor passes through Gansu Province, suggesting that this region also played a key role in the Trans-Eurasian exchange of genetic materials, culture, crops, livestock and technology (Dong et al. 2020). We successfully obtained genotype data for 23 autosomal STRs in 549 Gansu Han individuals and merged them with publicly available reference data. All results from STR-based population genetic analyses showed that the northwestern Han Chinese population is homogeneous and has high genetic diversity. All 23 STR loci were found to be in HWE and linkage equilibrium (LE) after Bonferroni correction (Tables S1–2). We observed 277 alleles with allele frequencies ranging from 0.0009 to 0.5237 (Table S3 and Figure S2). As shown in Table S3, Penta E had the largest number of alleles (23), followed by FGA (20), and TH01 had the fewest alleles (6), followed by TPOX. GD varied from 0.6180 (TPOX) to 0.9231 (Penta E), and Ho ranged from 0.5847 to 0.9362. PD varied from 0.7887 to 0.9868, PE varied from 0.2729 to 0.8699, and PM varied from 0.0132 to 0.2113. We also found that PIC ranged from 0.5545 to 0.9169 and that TPI ranged from 1.2039 to 7.8429. To assess the combined power of the panel for forensic practice, we estimated two combined forensic indexes: the combined power of discrimination (CPD) and the combined probability of exclusion (CPE). The combined matching probability in Gansu Han was 7.827E-28, corresponding to a CPD of 1 − 7.827E-28, and the CPE was 0.9999999998. The highly polymorphic and informative forensic statistical indexes showed that this 23-STR-based PCR amplification system is suitable for forensic individual identification and parentage testing. All included statistical parameters consistently demonstrated that northwestern Han individuals possess high genetic diversity.
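
Combined indexes of this kind are conventionally obtained by multiplying across independent loci: the combined matching probability is the product of the per-locus PM values (so that CPD = 1 − ∏PM), and CPE = 1 − ∏(1 − PE). The sketch below illustrates the arithmetic under that assumption; the function name combined_indexes and the input values are hypothetical and are not taken from Table S3.

```python
import math


def combined_indexes(pm_values, pe_values):
    """Combine per-locus PM and PE values across independent loci.

    Returns (combined matching probability, combined probability of exclusion).
    The combined matching probability is returned instead of CPD because
    1 - 1e-28 is indistinguishable from 1.0 in double precision.
    """
    combined_pm = math.prod(pm_values)
    cpe = 1 - math.prod(1 - pe for pe in pe_values)
    return combined_pm, cpe


# Hypothetical two-locus example
combined_pm, cpe = combined_indexes([0.1, 0.2], [0.5, 0.5])
```

For a 23-locus panel with per-locus PM values between roughly 0.01 and 0.2, the product falls to the order of 1E-28, which is why the discrimination power is reported relative to 1.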

Population genetic analyses among Eurasian/worldwide populations via STR genetic markers

To investigate the genetic relationships between Gansu Han and other reference populations, we merged genotype data of the 23 autosomal STRs in the Gansu Han population with data from 16 other Central Asian or East Asian populations and calculated the pairwise Fst genetic distances. Genotypes from a total of 12,960 individuals were collected, and we found that Gansu Han exhibited the closest genetic relationship with Shanxi Han (0.0002), followed by Han Chinese groups from Zhujiang and Shaanxi and the Hui group from Wuzhong (Table S4). We next performed MDS among 17 Central or East Asian populations and found three genetic clusters. Turkic-speaking populations (Uyghur and Kyrgyz) were grouped with the Central Asian Hazara and localized in the right position in the MDS plot, forming one western Eurasian affinity cluster. Southern Han clustered with northern Han (including Gansu Han) and Hui in North China (Wuzhong) to form the second cluster (Sinitic-speaking cluster), localized in the top left position in the MDS plot. The remaining populations, localized in the bottom right position, clustered as the Tibeto-Burman-speaking cluster (Figure S3A). Patterns of genetic affinity were further confirmed by cluster analysis via heatmap (Figure S3B), NJ-based phylogenetic relationship reconstruction (Figure S3C) and STRUCTURE-based ancestral composition (Figure S3D). We further explored the genetic similarities and differences between Gansu Han and relatively closely related reference populations (Fig. 1) by merging genotype data from Central/East Asia with data from three populations from western Eurasia, which yielded a new dataset with genetic variation data for 20 autosomal STRs in 14,365 individuals from 20 populations. Serving as the outgroup, the three western Eurasian populations possessed the most distant genetic relationship with Gansu Han, as inferred from the Fst genetic distance (Table S5). The three aforementioned clusters and one western Eurasian cluster were observed in the new MDS plot (Fig. 1a), and similar patterns of genetic relationships were also confirmed via the results from heatmap, phylogenetic relationship and STRUCTURE analyses (Fig. 1b–d).

Fig. 1

Genetic similarities and differences among 20 ethnically/geographically diverse Eurasian populations based on the genetic variations of 20 overlapping STRs. a Multidimensional scaling plots (MDS) showed genetic similarities between Lanzhou Han and other 19 Eurasian populations. b Genetic affinity among 20 populations inferred from the heatmap. c Phylogenetic relationship among the 20 populations reconstructed by the Fst genetic distance matrix. d Ancestry composition among 20 populations inferred from model-based STRUCTURE

A denser sampling of reference population-based allele frequency distribution could be collected from publicly available population genetic data reports. Thus, we further performed comprehensive population genetic analyses based on the newly merged dataset with greater global population representation. We merged allele frequency data for 20 autosomal STRs in the Gansu Han population with data from 56 other reference populations. PCA based on allele frequency distribution was performed, and we found that a total of 48.484% of the variance in this population represented variation among worldwide populations and 52.590% of the variance occurred within the East Asian population, showing a strong population genetic affinity within geographically close or ethnically close populations distributed in different continental regions (Fig. 2a). Here, we could also observe the genetic affinity within the linguistically close populations within East Asians, especially for Tibeto-Burman and Sinitic speakers, and Gansu Han clustered most closely with the geographically close Wuzhong Hui. Genetic similarities between Gansu Han were also observed in the estimated pairwise Cavalli-Sforza genetic distances (Table S6 and Figure S4) and its heatmap (Figure S5), which showed the most shared ancestry (the closest genetic distance) between Gansu Han and the geographically close Shaanxi Han. We subsequently carried out MDS among 57 populations based on pairwise Nei genetic distances and observed similar patterns of genetic relationships (Fig. 2b). The Nei-based phylogenetic relationships showed a clear association between genetic affinity and geographical adjacency or linguistic affinity (Fig. 2c).

Fig. 2

Patterns of genetic relationships between Lanzhou Han in Gansu Province, Northwest China, and worldwide reference populations. a Result of principal component analysis (PCA) based on the allele frequency distribution. b Multidimensional scaling plots showing the genetic relationships among the newly studied and the included reference populations based on the pairwise Cavalli-Sforza genetic distances. c The phylogenetic relationship between Gansu Han and worldwide reference populations reconstructed based on the pairwise Nei’s genetic distances

Population genomic analyses revealed the genetic affinity and ancestral makeup of northwestern Han Chinese

We additionally generated genome-wide data from 50 modern Han Chinese individuals from Lanzhou in Gansu Province, northwestern China. We first conducted genome-wide data-based PCA to explore the genetic structure of Lanzhou Han under the genetic background of modern and ancient East Asians. We projected publicly available data from ancient individuals from Nepal, China, Japan, Mongolia and southern Siberia into modern PC plots. Lanzhou Han occupied the intersection of the Sino-Tibetan cline and northern Mongolic/Tungusic cline and deviated slightly toward southern Siberian populations (Fig. 3a). The previously reported middle/lower Yellow River Basin farmers during the early Neolithic-to-Iron Age were projected to be close to Lanzhou Han; however, geographically nearby upper Yellow River Basin farmers from the Ganqing region (late Neolithic Lajia and Jinchankou, and Iron Age Dacaozi) were slightly shifted to ancient Tibetan from Nepal and modern Tibetans. We further focused on the genetic backgrounds of Sino-Tibetans and other southern modern and ancient East Asians and performed one subregional East Asian PCA (Fig. 3b). We observed a clear Sino-Tibetan cline running between the highland modern Tibetan lineage represented by Lhasa Tibetan and the intersection region of the Hmong–Mien cline and Tai–Kadai cline, and Lanzhou Han partly overlapped with modern northern Han Chinese and ancient Yangshao and Longshan people from Henan Province, which indicated that northwestern Han Chinese, represented by Lanzhou Han, showed genetic affinity to ancient Yangshao/Longshan-related people and modern northern Han Chinese from the Shanxi, Henan and Shandong Provinces. Here, we could also identify the genetic separation of human populations from southern East Asia and Southeast Asia.

Fig. 3

Relationship between Lanzhou Han and modern/ancient Eurasians based on the genome-wide genetic variations. a, b Principal component analysis (PCA) focused on populations from the eastern Eurasian (a) and southern East Asia (b). Ancient individuals were projected on the modern genetic backgrounds. c Ancestry composition among the newly genotyped and reference populations showed the genetic similarities and differences

Population clustering patterns inferred from the ADMIXTURE results among non-Africans showed four main ancestry components in Lanzhou Han. The minimum cross-validation value is 0.6701 when we predefined 11 ancestral population sources in the model-based clustering analyses (k = 11). As shown in Fig. 3c, we observed 2 dominant northern East Asian ancestries, represented by the Jomon lineage (0.106) and ancient Tibetan lineage (2125-year-old Mebrak, 0.423), and 2 southern East Asian ancestries, represented by the coastal proto-Austronesian lineage maximized in Taiwan Hanben (0.100) and the inland proto-Hmong–Mien lineage enriched in Hmong (0.218), with 11 predefined ancestral populations. In addition, we identified a small genetic contribution from the Baikal ancient lineage represented by early Bronze Age Ust-Belaya and proto-Austroasiatic Htin/Mang lineage (0.045) into northwestern Han. A model-based cluster of 12 ancestral sources also confirmed that Lanzhou Han derived their primary ancestry from Mebrak (0.417), Jomon (0.105), Hmong (0.220), and Hanben (0.099). Interestingly, we identified a small proportion (0.021) of ancestry descended from Basque or western Eurasian steppe pastoralist-related populations, as well as low amounts of gene flow from Htin (0.043) and Transbaikal Evenk (0.069) populations.

Consistent with the shared mosaic genetic components observed in the ADMIXTURE results and patterns of genetic variations in PCA, the shared genetic drift among modern Eurasian populations determined via outgroup-f3(Source1, Han_Lanzhou; Mbuti) showed that Lanzhou Han had a close genetic affinity with modern northern Han Chinese from geographically different regions (Fig. 4a). The stronger genomic affinity was further confirmed via outgroup-f3(Ancient Eurasian, Han_Lanzhou; Mbuti), which showed that middle Neolithic-to-Iron Age people from the Yellow River Basin in northern East Asia shared the most genetic drift with Lanzhou Han, especially Luoheguxiang people in Henan (Fig. 4b). To further explore plausible ancestral sources for Lanzhou Han, we calculated admixture-f3 statistics in the form f3(Source1, Source2; Han_Lanzhou) using 70,949 SNPs in 192 Eurasian modern and 177 ancient populations. After excluding 7401 ancient source pairs with fewer than 10,000 overlapping SNPs, we found that 8873 out of 66,519 pairs displayed significant admixture signals (negative f3 values with Z-scores less than −3). In detail, combinations of northern and southern East Asians always produced the most negative admixture signatures, pointing to the main ancestral sources of Lanzhou Han from two lineages related to northern and southern East Asians, respectively, which may be associated with millet farmer predecessors from the Yellow River Basin and rice farmer predecessors from the Yangtze River Basin (Fig. 4c–e). Geographically distinct Han Chinese populations combined with one of the western Eurasian populations produced the most negative f3 values, for example, f3(Chuvash, Han_Nanchong; Han_Lanzhou) with a Z-score of −17.620. We also identified negative f3 values for combinations of one western Eurasian and one East Asian population, such as steppe pastoralists (Yamnaya, Afanasievo, Okunevo, and Andronovo) combined with East Asians or ancient Yellow/Yangtze River Basin populations combined with western Eurasians (Fig. 4f–h).
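
The significance threshold used above (Z < −3) rests on a standard error estimated by a block jackknife over the genome. The following minimal sketch shows the idea in its simplest form. It is not the ADMIXTOOLS procedure, which uses a weighted jackknife over blocks of unequal size: here the statistic is assumed to be the mean of per-block values from equally sized blocks, and the function name jackknife_z and the toy values are hypothetical.

```python
import math


def jackknife_z(block_values):
    """Delete-one jackknife for a statistic that is the mean of per-block values.

    Returns (estimate, standard error, Z-score).
    """
    n = len(block_values)
    total = sum(block_values)
    estimate = total / n
    # leave-one-out estimates: mean of all blocks except block i
    loo = [(total - v) / (n - 1) for v in block_values]
    loo_mean = sum(loo) / n
    variance = (n - 1) / n * sum((x - loo_mean) ** 2 for x in loo)
    se = math.sqrt(variance)
    z = estimate / se if se > 0 else float("nan")
    return estimate, se, z


# Hypothetical per-block f3 values for one source pair
est, se, z = jackknife_z([-0.012, -0.010, -0.011, -0.013, -0.009])
significant = z < -3
```

A source pair would be counted as showing a significant admixture signal when its estimate is negative and the resulting Z-score falls below −3.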

Fig. 4

Results of three-population tests. a, b The shared genetic drift between Lanzhou Han and modern Eurasian reference populations (a) and ancient populations (b). c–h The plausible ancestral source for Lanzhou Han when we assumed ancestral northern East Asians (c, d), ancestral southern East Asian Hanben (e), Western Eurasian steppe pastoralist as one of the source populations (f–h)

Ancestral origins and genetic history reconstruction of northwestern Han via modern/ancient DNA perspectives

We performed four-population affinity statistics (f4) to study the asymmetric genetic relationships between Lanzhou Han and other modern/ancient Eurasians in the form f4(Modern/ancient Eurasian reference1, Modern/ancient Eurasian reference2; Han_Lanzhou, Mbuti). The included reference populations could be categorized into six groups (Fig. 5a) based on the patterns of shared derived alleles. We observed significantly negative f4 values when we used group1 as reference population1 (populations listed on the left of the heatmap), which included the Paleolithic East Asian lineage (40,000-year-old Tianyuan), Bronze Age western Eurasian pastoralists (Sintashta, Andronovo, Afanasievo, and Srubnaya people), ancient Shirenzigou people and modern Uyghur from Xinjiang, Kusunda from Nepal, and the deeply diverged southern Eurasian lineage of Onge; this pointed to a stronger northern East Asian affinity of Lanzhou Han. Relative to all other reference populations, we observed significantly positive f4 values in f4(Reference group6, Modern/ancient Eurasian reference2; Han_Lanzhou, Mbuti), indicating that Lanzhou Han shared the most derived alleles with group6, which comprised lowland Sino-Tibetan and ancient northern East Asians.

We subsequently calculated f4(Reference population, Han_Lanzhou; Ancestral source candidate, Mbuti) to assess the genetic continuity and admixture of northwestern Han. We assumed that if an ancestral source candidate A is a direct ancestor of Lanzhou Han, more shared alleles (negative f4 values) between them should be observed. If candidate A is the only direct ancestor of Lanzhou Han, f4(Ancestral source candidate, Han_Lanzhou; Reference population, Mbuti) should yield nonsignificant values. As shown in Fig. 5b, we observed strong affinity signals in f4(Reference population, Han_Lanzhou; Ancestral source candidate, Mbuti): negative f4 values were obtained when ancient East Asians or modern northern East Asians were used as ancestral source candidates, including the geographically close late Neolithic Shimao people from Shaanxi, and the late Neolithic Qijia people (Jinchankou and Lajia) and Iron Age Dacaozi people from Gansu Province. This test of a unique ancestral population of Lanzhou Han was further validated using the 1240K-based merged dataset, and no additional gene flow events were identified based on our included reference populations, which suggested a strong genetic affinity between ancient northern East Asians and modern Lanzhou Han. Marginal Z-scores were obtained when we used Ami (Z-score: −2.692), Thai (−2.507), Dai (−2.383), Atayal (−2.224), Sintashta (−2.007) and Shirenzigou (−1.835) as references and Lajia as the ancestral source, suggesting that, compared with late Neolithic people in Gansu, modern Lanzhou Han may carry small additional genetic contributions from southern East Asians and western Eurasians. These additional admixture signals were further supported by f4-statistics based on the Human-Origin-merged dataset (Tables S7–8).
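To make the sign convention of these tests concrete, the following minimal Python sketch computes f4(A, B; C, D) as the mean over SNPs of (pA − pB)(pC − pD) and attaches a Z-score from a delete-one-block jackknife. It is an illustration under stated assumptions, not the pipeline used in this study: the function names and the toy allele frequencies are hypothetical, the blocks are runs of consecutive SNPs rather than genomic windows, and the jackknife is the simple equal-weight form rather than the weighted version implemented in ADMIXTOOLS.

```python
import math

def f4_per_snp(a, b, c, d):
    """Per-SNP terms (pA - pB) * (pC - pD) from population allele frequencies."""
    return [(pa - pb) * (pc - pd) for pa, pb, pc, pd in zip(a, b, c, d)]

def f4_with_z(a, b, c, d, block_size):
    """Return (f4 estimate, jackknife standard error, Z-score).

    Simplification (assumption): blocks are consecutive SNPs of length
    block_size and the variance uses the equal-weight delete-one formula.
    """
    terms = f4_per_snp(a, b, c, d)
    n = len(terms)
    total = sum(terms)
    blocks = [terms[i:i + block_size] for i in range(0, n, block_size)]
    g = len(blocks)
    if g < 2:
        raise ValueError("need at least two blocks for a jackknife")
    # Estimate recomputed with each block left out in turn.
    leave_one_out = [(total - sum(blk)) / (n - len(blk)) for blk in blocks]
    mean_loo = sum(leave_one_out) / g
    variance = (g - 1) / g * sum((x - mean_loo) ** 2 for x in leave_one_out)
    se = math.sqrt(variance)
    estimate = total / n
    z = estimate / se if se > 0 else float("nan")
    return estimate, se, z

# Toy frequencies for two SNPs (hypothetical values, for illustration only).
ref1 = [0.2, 0.4]
ref2 = [0.1, 0.1]
target = [0.5, 0.5]
outgroup = [0.3, 0.1]
estimate, se, z = f4_with_z(ref1, ref2, target, outgroup, block_size=1)
```

Under this convention, a negative f4(reference1, reference2; target, outgroup) means that the target shares more alleles with reference2 than with reference1, and a positive value means the opposite.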

Fig. 5

Genomic affinity between Lanzhou Han and other modern and ancient Eurasians. a Results from f4(Reference population1, Reference population2; Lanzhou Han, Mbuti) showed unequal sharing of derived alleles between Lanzhou Han and other modern/ancient populations compared with other East Asians. b Genetic admixture and continuity between Lanzhou Han and other modern and ancient source candidates inferred via f4(Reference population1, Lanzhou Han; Reference population2, Mbuti).

The qpWave results for Lanzhou Han showed that it could be fitted with a two-way admixture model: 0.862 ± 0.038 Lajia-related ancestry and 0.138 ± 0.038 Hanben-related ancestry (p_rank1: 0.1393), 0.883 ± 0.047 Shimao-related ancestry and 0.117 ± 0.047 Hanben-related ancestry (p_rank1: 0.5014), or 0.861 ± 0.049 Miaozigou-related ancestry and 0.139 ± 0.049 Hanben-related ancestry (p_rank1: 0.1068). To elucidate the genetic affinity between Lanzhou Han and western Eurasians, we also applied qpAdm with a three-way admixture model to quantify the proportion of western Eurasian ancestry in northwestern Han Chinese. When we used the French population as the western source proxy, Lanzhou Han was better fitted as an admixture of 0.815 ± 0.063 ancestry related to Iron Age Luoheguxiang people, 0.163 ± 0.056 ancestry related to Hanben, and 0.022 ± 0.014 ancestry related to French (p_rank2: 0.377). This three-way admixture model was also well fitted when the Middle to Late Bronze Age steppe pastoralists of Andronovo were used as the western Eurasian source (0.816, 0.163, and 0.021; p_rank2: 0.343). In addition, three-way admixture models of Longtoushan–Hanben–French (0.739–0.244–0.016) and Longtoushan–Hanben–Andronovo (0.734–0.248–0.018) also provided a good fit for the admixture history of Lanzhou Han. We also explored the phylogenetic relationships between Lanzhou Han and the surrounding modern and ancient Eurasian populations, including population splits and gene flow events, using graph-based qpGraph modeling (Fig. 6). The two best-fitting qpGraph models showed a close genetic affinity between modern Lanzhou Han and ancient northern East Asian lineages represented by millet farmers in the Yellow River Basin. When we used the late Neolithic Shimao people as the proxy for the northern source, we could model 8% of northwestern Han ancestry as derived from western Eurasians (Fig. 6a).
When we considered the archaic genetic material in the non-African and Australasian groups, the northwestern Han could be modeled as a mixture of southern East Asian ancestry related to Hanben and a northern lineage close to northern East Asians and the southern Siberian lineage (Fig. 6b), which may contain western Eurasian admixture-derived alleles. Here, the genetic structure indicated that the primary ancestry derived from an admixture event between the southern East Asian lineage and a northern East Asian or Siberian lineage, with minor genetic contributions from western Eurasia. To comprehensively characterize the formation of the gene pool of modern northwestern Han Chinese, we used ALDER to date the North-to-South and West-to-East admixture events based on the decay of admixture-induced linkage disequilibrium (Fig. 7). We detected an ancient admixture between southern and northern East Asians, dated with different ancestral source candidates to between the middle Neolithic and the historic period (approximately 5000 BCE–1500 CE). We also tested the French, Basque and Greek populations as the western source and obtained an admixture time of 30 generations ago (approximately 1000 years ago, during the Tang and Song Dynasties). Across the different western sources, the estimates fell within a contiguous interval of 24 to 30 generations ago.
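The principle behind dating from linkage disequilibrium decay can be sketched as follows. For a single admixture pulse n generations ago, admixture-induced weighted LD is expected to decay approximately as A·exp(−n·d), where d is the genetic distance in Morgans. The Python sketch below recovers n from a synthetic, noise-free curve by least squares on the log scale. It is a deliberately simplified illustration: the function name and the synthetic values are hypothetical, and the affine offset term, reference-population weighting, and jackknife standard errors used by ALDER itself are all omitted.

```python
import math

def fit_admixture_generations(distances_morgans, weighted_ld):
    """Estimate n and A in y = A * exp(-n * d) by least squares on log(y).

    Assumes all weighted_ld values are positive and that there is no
    constant offset term (a simplification of the curve ALDER fits).
    """
    xs = list(distances_morgans)
    ys = [math.log(v) for v in weighted_ld]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    amplitude = math.exp(mean_y - slope * mean_x)
    return -slope, amplitude

# Synthetic curve for a pulse 30 generations ago (illustrative values only).
distances = [0.005 * i for i in range(1, 21)]  # 0.5 cM to 10 cM, in Morgans
curve = [0.01 * math.exp(-30 * d) for d in distances]
generations, amplitude = fit_admixture_generations(distances, curve)
```

With real data the curve is noisy and is fitted from a minimum distance upward, so a weighted nonlinear fit with resampling-based uncertainty would be needed; the sketch only shows how the decay rate maps to the number of generations.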

Fig. 6

Admixture graph model of northwestern Han Chinese with the order of population splits and positions of admixture events. The largest deviation between empirical and theoretical f-statistics was less than |Z| = 2.748, which indicated a good fit considering the large number of f-statistics analyzed. Admixture events are shown as dotted lines labeled with the corresponding admixture proportions, and branch lengths are given as 1000 times the f2 values.

Fig. 7

Admixture time of the formation of Lanzhou Han with both western and eastern Eurasian sources. Here, we used 28 years as the length of one generation. The labeled years were calculated via the following formula: year = 1950 − 28 × (generations − 1).
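The conversion stated in this caption can be written as a short Python function. This is a minimal sketch: the function and constant names are hypothetical, while the 28-year generation length and the 1950 reference year are taken from the caption.

```python
GENERATION_YEARS = 28   # generation length stated in the caption
REFERENCE_YEAR = 1950   # reference year stated in the caption

def generations_to_year(generations,
                        generation_years=GENERATION_YEARS,
                        reference_year=REFERENCE_YEAR):
    """Calendar year = reference_year - generation_years * (generations - 1)."""
    return reference_year - generation_years * (generations - 1)

# Bounds of the 24 to 30 generation interval reported for the western sources.
year_for_30 = generations_to_year(30)
year_for_24 = generations_to_year(24)
```

Applied to the interval of 24 to 30 generations, the formula gives calendar years 1306 and 1138.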

Uniparental genetic landscape of northwestern Han Chinese

We successfully obtained uniparental haplogroups for 49 male individuals (Table S9). For mtDNA, we assigned the 49 mitochondrial genomes to 36 terminal haplogroups with frequencies ranging from 0.0204 to 0.0612 (F1a1, n = 3), and D4 (14/49) was the dominant maternal lineage in northwestern Han. For the paternally inherited Y chromosome, we obtained 38 terminal haplogroups with frequencies ranging from 0.0204 to 0.0816 (Q1a1a1a1a~). We found that southern Siberian-dominant lineages (C2b, C2c, N1, and Q1a) appeared in our northwestern Han Chinese sample. These patterns of uniparental diversity also pointed to multiple genetic sources for modern northwestern Han Chinese.
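Because all frequencies here are counts divided by the 49 typed individuals, the reported values map directly onto small integer counts. The Python sketch below shows the calculation on a toy sample; the haplogroup labels are hypothetical placeholders rather than the assignments in Table S9.

```python
from collections import Counter

def haplogroup_frequencies(assignments):
    """Frequency of each haplogroup label, rounded to four decimal places."""
    counts = Counter(assignments)
    total = len(assignments)
    return {label: round(count / total, 4) for label, count in counts.items()}

# Toy sample of 49 individuals with placeholder labels (not the study data):
# one haplogroup seen four times, one seen three times, and 42 singletons.
sample = ["hg_A"] * 4 + ["hg_B"] * 3 + [f"single_{i}" for i in range(42)]
freqs = haplogroup_frequencies(sample)
```

In a sample of 49, a haplogroup carried by one, three, or four individuals has a frequency of 0.0204, 0.0612, or 0.0816, respectively.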

Discussion

The Hexi Corridor and its surrounding regions are well known for the Majiayao culture of the middle and late Neolithic and for subsequent control by the Rongdi tribe before the Han dynasty (Dong et al. 2017). In addition, this region was the main zone of intersection between the eastward spread of barley/wheat agriculture and the westward spread of millet technology in the Neolithic (Leipe et al. 2019). The westward migration of Han Chinese and their ancestors mainly occurred within the historic period, and the descendants of these migrants resided here permanently. Northwest China is the cradle of Trans-Eurasian cultural and genetic exchange (Dong et al. 2020). However, the genetic diversity, fine-scale genetic structure, and western Eurasian admixture signal of northwestern populations remain to be fully surveyed. Here, we conducted a genetic survey based on autosomal STR and genome-wide SNP analyses among 599 individuals to reconstruct the population genomic history of northwestern Chinese populations.

First, addressing the controversy over the origin of Han Chinese and the model of their formation, we found that the main lineage of northwestern Han descended from an ancient northern East Asian lineage closely related to middle/upper Yellow River millet farmers or hunter-gatherers from the Mongolian Plateau (Ning et al. 2020; Wang et al. 2020a; Yang et al. 2020). This genomic affinity between modern northwestern Han and ancient northern East Asians supports North China as the origin of Sinitic-speaking populations. This is consistent with the hypothesis that the Sino-Tibetan language family originated in North China, as evidenced by the shared cognates and common homeland in Bayesian phylogenetic reconstructions (Sagart et al. 2019; Zhang et al. 2019). We also identified gene flow from southern East Asian populations into the northwestern Han based on f-statistics and qpAdm/qpWave admixture models, suggesting a complex admixture pattern underlying the formation of the modern Han. Previous findings based on paternal/maternal DNA have demonstrated that demic diffusion shaped the genetic diversity and variation observed in modern Han Chinese (Wen et al. 2004). The fine-scale genetic structure presented here further suggests a revised model of North China origin and range expansion with local admixture, emphasizing the incorporation of genetic material from additional ancestral populations during the Han Chinese expansion.

Second, focusing on the West-to-East genetic connection, we identified a western Eurasian admixture signature in northwestern Han via f3/f4-statistics, which was confirmed in the quantitative analyses via qpAdm/qpGraph models. Here, we found that modern northwestern Han Chinese populations derived from three ancestral populations: two major eastern Eurasian components (ancient northern East Asians related to the Yellow River millet farmers and ancient southern East Asians related to the Yangtze rice farmers) and one western Eurasian component. This complex admixture pattern is especially interesting in that it differs markedly from the two-way admixture model of southern Han (He et al. 2020a, b). However, the most proximate sources of the western Eurasian ancestry and the corresponding admixture dates remain controversial. Previously published ancient Neolithic genomes from Lake Baikal (de Barros Damgaard et al. 2018; Sikora et al. 2019) and the Yellow River Basin (Ning et al. 2020; Wang et al. 2020a; Yang et al. 2020) have demonstrated limited genetic contributions of western Eurasia to eastern Eurasia. Unlike the complex mixing pattern observed in Europe (a three-way admixture model of local hunter-gatherers, incoming Anatolian farmers and westward-spreading Yamnaya pastoralists) (Lazaridis et al. 2014), eastern Eurasia showed relatively high genetic stability during the Neolithic revolution. Here, we also propose one possible process for the introduction of western Eurasian ancestry into the northwestern Han. We suspect that this low level of western Eurasian ancestry may have been introduced into North China after the Bronze Age via extensive population interactions during the globalization process. Indeed, our estimated dates of western-eastern admixture mainly spanned from 1500 to 500 years ago, within the historic period.
ALDER estimates admixture times under the assumption of a single admixture event, whereas actual mixing events are continuous (Loh et al. 2013). Thus, these estimated dates are later than the initial admixture time. In addition, recent ancient genomes from Shirenzigou in Xinjiang identified significant western Eurasian Yamnaya ancestry 2000 years ago (Ning et al. 2019). The extent to which Yamnaya-related ancestry influenced inland populations requires further genetic investigation. Overall, the processes by which western Eurasian ancestry was introduced into North China will be clarified by population-scale analyses of additional ancient genomes from temporally distinct populations of the Hexi Corridor, although our current estimated date of admixture is consistent with the development of communication along the Silk Road.

In summary, our results falsified the hypothesis that the historic Trans-Eurasian populations around the Hexi Corridor during the Silk Road development period interacted with Han Chinese populations via cultural diffusion alone. Instead, modern northwestern Han Chinese, despite their homogeneous genetic structure, harbor a degree of western Eurasian admixture (2% in qpAdm-based or 8% in qpGraph-based models) dating to approximately 1000 CE, suggesting that local populations (ancestral Han) mixed with incoming western Eurasians, along with the adoption of technology and culture. In addition, no direct genetic continuity between the geographically close late Neolithic-to-historic ancient populations of the Ganqing region (Lajia, Jinchankou and Dacaozi) and modern northwestern Han was identified. The genomic affinity between northwestern Han and ancient northern East Asians demonstrated that the primary ancestry of northwestern Han Chinese populations was derived from ancestral populations related to northern millet farmers in the Yellow River Basin, suggesting their common origin in North China and recent northwestward expansion. Compared with northern Neolithic populations, northwestern Han showed stronger affinity with modern and ancient southern East Asians, as revealed via f3/f4, qpAdm/qpWave, and qpGraph analyses, which suggested the northward migration of ancient southern rice-farmer-related ancestral populations and their genetic contributions to modern northwestern Han Chinese populations. In conclusion, our results suggested that modern northwestern Han derived their ancestry from three different ancestral sources: two major East Asian groups, associated with millet and rice farmers, and one minor western Eurasian source.