Genetic admixture of Chinese Tajik people inferred from genome‐wide array genotyping and mitochondrial genome sequencing

Chinese Tajiks are an Indo-Iranian-speaking population in Xinjiang, northwest China. Although the complex demographic history of the region has been characterized, the ancestral sources and genetic admixture of its Indo-Iranian-speaking groups remain poorly understood. We here provide genome-wide genotyping data for over 700 000 single-nucleotide polymorphisms (SNPs) and mtDNA multiplex sequencing data for 64 Chinese male Tajik individuals from two dialect groups, Wakhi and Selekur. We applied principal component analysis (PCA), ADMIXTURE, f-statistics, TreeMix, qpWave/qpAdm, Admixture-induced Linkage Disequilibrium for Evolutionary Relationships (ALDER), and Fst analyses to infer fine-scale population genetic structure and admixture history. Our results reveal that Chinese Tajiks show the closest affinity to, and a genetic admixture pattern similar to, ancient Xinjiang populations, especially Xinjiang samples from the historical era. Chinese Tajiks also received gene flow from populations related to Europeans and Neolithic Iran farmers. We observed a genetic substructure between the two Tajik dialect groups: the Selekur-speaking group, sampled across the county, had more gene flow from East Asians than the Wakhi-speaking people, who inhabited a single village. These results document the population movements that contributed to the influx of diverse ancestries into the Xinjiang region.


Introduction
Central Asia is known as a "crossroads of civilizations," and its history of East-West communication stretches back for millennia (Barthold, 1962; Cilli et al., 2019). This vast area contains high linguistic, genetic, and ethnic diversity (Comas et al., 1998; Chaix et al., 2007), indicating a complex network of settlement and occupation. Among all the Central Asian ethnic groups, the Tajiks are considered an ideal choice to study broader patterns in Central Asian history. With a global population of 15-20 million, the Tajiks are one of the ancient indigenous peoples of Central Asia. Tajik people live mainly in Tajikistan, Afghanistan, Uzbekistan, and China's Xinjiang Uygur Autonomous Region. The population is generally divided into highland Tajiks and lowland Tajiks (Li & Bi, 2015; Palstra et al., 2015). The highland Tajiks mostly occupy the Hindu Kush Mountains and Pamir Plateau regions, whereas lowland Tajiks reside in the Central Asian plains, including the urban centers of Herat, Balkh, Bukhara, and Samarkand. Highland Tajiks primarily engage in animal husbandry, whereas lowland Tajiks practice agriculture and are noted for their advanced handicraft industry and commerce. Both the highland and lowland Tajik languages belong to the Indo-Iranian group of the Indo-European language family.
Chinese Tajiks belong primarily to the Tajik highland subdivision and are among the 56 officially recognized ethnic groups in China. The most recent census data show that China's present-day Tajik population is around 51 000. The Chinese Tajik language has absorbed much Uyghur and Chinese vocabulary due to long-term interaction between Chinese Tajiks and surrounding ethnic groups (Li & Bi, 2015). About 80% of the Chinese Tajiks have been residents of the Tashkurgan Tajik Autonomous County (average altitude >4000 m) for more than three generations. Tashkurgan sits on the highest elevation of the Pamir Plateau. The "father of ice peaks," Mustagh Peak, stands north of Tashkurgan, while K2, the second-highest mountain on earth, straddles the China-Pakistan border to the south. This unique natural setting has allowed the Tajiks of Tashkurgan to retain numerous indigenous linguistic and cultural attributes and has largely limited interethnic marriage (Gao, 1996; Zeng, 2005; Yan et al., 2006; Liu et al., 2010; Malyarchuk et al., 2013; Khitrinskaia et al., 2014; Li et al., 2014; Li & Bi, 2015; Palstra et al., 2015; Peng et al., 2018).
Many previous studies have suggested extensive genetic admixture between East and West Eurasians in northwest China (Ning et al., 2019; Wang et al., 2019, 2021; Adnan et al., 2020, 2021; Wen et al., 2020; Zhao et al., 2020; He et al., 2021; Ma et al., 2021; Yang et al., 2021; Yao et al., 2021; Zhang et al., 2021), but few focus on Tajiks. A 2018 paper by Min-Sheng Peng et al. used mitochondrial genome analysis to understand the maternal origin of multiple ethnic groups on the Pamir Plateau. Their study pointed to substantial genetic differentiation among different highland Tajik populations and to the complex history behind the peopling of the Pamirs (Peng et al., 2018). However, current studies on the Tajiks have mainly focused on the Central Asian Tajiks, and the genetic structure and population admixture of Chinese Tajiks remain largely unknown. Therefore, we carried out this research on the Tashkurgan Tajik population in China to shed more light on the genetic diversity and admixture pattern of the Tajik community and Chinese Tajiks.

Sample collection
We collected saliva samples from 64 male Tajiks from the Taxkorgan Tajik Autonomous County in Xinjiang Uygur Autonomous Region, northwest China. These samples included two dialect groups: 21 Wakhi-speaking individuals (living exclusively in Darbudar village) and 43 Selekur-speaking individuals. Detailed sample information is listed in Table S1. The geographical distribution of sampling sites is marked by the black triangle in the red area in Fig. 1. We collected the samples randomly from unrelated participants whose parents and grandparents had married in a nonconsanguineous fashion within the same ethnic group for at least three generations. Our study and sample collection were reviewed and approved by the Medical Ethics Committee of Xiamen University (Approval Number: XDYX2019009) and were in accordance with the recommendations provided by the revised Helsinki Declaration of 2000. The participants provided written informed consent before participating in this study.

Data preparation
The genomic DNA of the 64 samples was extracted using LD664 or LD607 Kits (Enlighten Biotech, Shanghai, China). Genotyping was performed on the Affymetrix Inc. 23MF_v2 high-density SNP arrays at the 23 Mofang Laboratory, Chengdu. Quality control was carried out in PLINK 1.9 (Purcell et al., 2007) with the parameters --hwe 1e-6, --geno 0.05, and --mind 0.05, which left 705 767 SNPs for the subsequent analysis. We merged our newly generated data with publicly available data (Human Origin array and 1240K data set) curated by the David Reich lab (Patterson et al., 2012; Lazaridis et al., 2014; 1000 Genomes Project Consortium, 2015; Mallick et al., 2016) based on their overlapping SNPs to perform the corresponding population genomic analyses. Detailed information on the curated data set is available at: https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadablegenotypes-present-day-and-ancient-dna-data. The modern and ancient populations co-analyzed with the new data are listed in Table S10. The mtDNA libraries were prepared using an MtDNA Library Preparation Kit 2.0 (Enlighten Biotech) and a WhoChrMT kit (Enlighten Biotech). The optimized multiplex polymerase chain reaction (PCR) amplification conditions followed the published reference (Wang et al., 2022).
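As a rough illustration of what the two missingness thresholds (0.05 per individual and 0.05 per SNP) do, the following Python sketch filters a small genotype matrix. It is not the PLINK implementation: the function name, the toy data, and the order of the two steps (individuals first, then SNPs) are assumptions made for illustration, and the Hardy-Weinberg filter is not reproduced.

```python
def missingness_filter(genotypes, max_missing_ind=0.05, max_missing_snp=0.05):
    """Return indices of individuals and SNPs that pass call-rate filters.

    genotypes: list of rows, one per individual; each entry is 0, 1, 2
    (allele count) or None for a missing call. Hypothetical helper, not PLINK.
    """
    if not genotypes or not genotypes[0]:
        return [], []
    n_snps = len(genotypes[0])
    # Step 1: keep individuals whose fraction of missing calls is small enough.
    kept_ind = [
        i for i, row in enumerate(genotypes)
        if sum(g is None for g in row) / n_snps <= max_missing_ind
    ]
    if not kept_ind:
        return [], []
    # Step 2: among the retained individuals, keep SNPs with few missing calls.
    kept_snp = [
        j for j in range(n_snps)
        if sum(genotypes[i][j] is None for i in kept_ind) / len(kept_ind)
        <= max_missing_snp
    ]
    return kept_ind, kept_snp


# Toy data (3 individuals x 4 SNPs); thresholds loosened to 0.3 so the
# effect is visible on such a small matrix.
toy = [
    [0, 1, 2, None],
    [0, None, None, None],
    [1, 1, 2, None],
]
kept_ind, kept_snp = missingness_filter(toy, 0.3, 0.3)
```

In the toy example the second individual is dropped for excessive missingness, and the last SNP is then dropped because it is missing in every remaining individual.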

Analysis of genetic relationship and population structure
We performed principal component analysis (PCA) at the individual level using smartpca, part of the EIGENSOFT package (Patterson et al., 2006). We used the default parameters plus two additional settings: numoutlieriter: 0 and lsqproject: YES. Ancient reference populations were projected onto the two-dimensional plots defined by the genetic variation of modern populations. ADMIXTURE (Alexander et al., 2009) analysis was carried out after pruning SNPs in strong linkage disequilibrium using PLINK with the parameters "--indep-pairwise 200 25 0.4." There were 77 503 SNPs left after LD pruning. We then ran unsupervised ADMIXTURE with K (the number of assumed ancestral components) ranging from 2 to 10, using 100 bootstrap iterations with different random seeds. The qp3Pop program in ADMIXTOOLS (Patterson et al., 2012) was used to compute outgroup f3-statistics (Source1, Source2; Mbuti) and admixture f3-statistics (Source1, Source2; Target population). The outgroup f3-statistics were used as an index of shared genetic drift, and the admixture f3-statistics were used to explore admixture signatures with different ancestral sources. The qpDstat program in ADMIXTOOLS (Patterson et al., 2012) in f4 mode (f4mode: YES) was used to explore derived-allele sharing relative to the reference pairs.
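To make the two uses of the f3-statistic concrete, the sketch below computes the basic quantity from per-SNP allele frequencies: the average of (c - a)(c - b), where c is the frequency in the target (or outgroup) and a and b are the frequencies in the other two populations. This is a simplified illustration only; ADMIXTOOLS additionally applies a finite-sample bias correction and estimates standard errors by block jackknife, neither of which is shown, and the function name and frequencies are hypothetical.

```python
def f3(target, source1, source2):
    """Naive f3(source1, source2; target) from per-SNP allele frequencies.

    Averages (c - a) * (c - b) over SNPs, with c the target frequency.
    No bias correction or standard error; illustration only.
    """
    if not target or not (len(target) == len(source1) == len(source2)):
        raise ValueError("frequency lists must be non-empty and of equal length")
    products = [(c - a) * (c - b) for c, a, b in zip(target, source1, source2)]
    return sum(products) / len(products)


# Admixture f3: a target whose frequencies lie between the two sources
# yields a negative value.
admixture_value = f3([0.5, 0.5], [0.1, 0.9], [0.9, 0.1])

# Outgroup f3: with the outgroup in the target position, the value grows
# with the drift shared by the other two populations.
outgroup_value = f3([0.5, 0.5], [0.9, 0.1], [0.8, 0.2])
```

With these made-up frequencies the admixture statistic is negative (about -0.16) and the outgroup statistic is positive (about 0.12), matching the interpretation given in the text.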

Streams of ancestry and inference of admixture proportions
We used qpAdm and qpWave (Haak et al., 2015) as implemented in ADMIXTOOLS to determine the number of ancestry sources and estimate the admixture proportions of Chinese Tajiks.Outgroups were listed in Table S5A and B, which had no recent genetic admixture and were differentially related to the ancestral sources of Chinese Tajik people.

Uniparental haplogroup assignment and mtDNA multiplex sequencing analysis
We assigned Y chromosomal haplogroups using the captured SNPs on the Y chromosome. We determined haplogroups by identifying the most derived allele upstream and the most ancestral allele downstream in the phylogenetic tree in the ISOGG version 11.89 (http://www.isogg.org/tree/). The mtDNA haplogroup assignment was performed using the online tools available with HaploGrep version 2.4.0 software (Kloss-Brandstätter et al., 2011).
MtDNA multiplex sequencing data were aligned and assembled in contigs, and consensus sequences were compared to the revised Cambridge Reference Sequence (Andrews et al., 1999). The quality of the sequences was examined manually, and two analysts independently annotated deviations from the reference sequence. For all analyses, cytosine (C) insertions/deletions at positions 16 193, 309, and 315, as well as adenine (A) to C transversions at positions 16 182 and 16 183, were ignored.

Treemix and Admixture-induced Linkage Disequilibrium for Evolutionary Relationships (ALDER)
We used TreeMix (Pickrell & Pritchard, 2012) to study population splits and gene flow events. ALDER (Loh et al., 2013) was used to estimate the time of potential admixture events between ancestral sources, assuming 28 years per generation.

Genetic structure of Chinese Tajik people
We merged our newly generated genome-wide data with the publicly available modern and ancient reference genomes.We obtained two different data sets, a low-density one with 94 755 overlapping SNPs, and a high-density one including 243 549 overlapping SNPs.We used these two data sets to reconstruct the Chinese Tajiks' population history.
We explored the patterns of genetic relationship among Chinese Tajiks and Central Asian, Western Eurasian, ancient Xinjiang and East Asian populations via PCA and unsupervised model-based ADMIXTURE clustering. Ancient individuals were projected onto the modern genetic landscape. In Fig. 2A, we observed three main genetic clines, which were in accordance with the geographical and linguistic divisions. These included the South China cline extending from Hmong Mien to Austronesian and ancient southern populations; the Central Asian/West Eurasian cline, associated not only with Central Asian and Western Eurasian populations but also with ancient Xinjiang groups; and the Sinitic/Tibetan Burman cline, including Sinitic and Tibetan Burman populations as well as Yellow River farmers. Chinese Tajiks were located in the Central Asian/West Eurasian cline and were separated from East Asian-related clusters. When we focused on the Central Asian/West Eurasian cline (Fig. 2B), Chinese Tajiks clustered next to ancient Xinjiang people and were close to Western Eurasian and Central Asian populations. In addition, Chinese Tajiks did not cluster together with Tajiks in Tajikistan, which suggested a genetic substructure between Tajiks from different locations.
We ran unsupervised model-based ADMIXTURE with 716 individuals from 78 modern and ancient non-African populations. Chinese Tajiks had more Western Eurasian-related ancestral components (blue and red bands), as shown in Fig. S1. Under the best-fitting model at K = 6, which had the smallest cross-validation error, Chinese Tajiks carried mostly the light purple ancestral component (Fig. 2C), which was maximized in the Western Eurasian group (Anatolia_N). Additionally, at K = 7, Chinese Tajiks had a large proportion of the red ancestral component, which was maximized in the Neolithic Iran farmers, Uzbekistan_BA_Bustan and Caucasus hunter-gatherers (CHG). The other ancestral components at K = 6, aligned with Tarim_EMBA1 (deep purple) and East Asians (orange), accounted for only a small fraction. In general, Chinese Tajiks carried predominantly Western Eurasian-related ancestral components, followed by a small amount of East Asian-related ancestry, a pattern similar to that of ancient Xinjiang populations and Tajiks in Tajikistan.

Population continuity and admixture in Chinese Tajik people
We used the ancient and modern reference populations from the Human Origin data set (Patterson et al., 2012; Lazaridis et al., 2014) as the plausible source proxies to perform the f3- and f4-statistics. There were a total of 94 755 SNPs left after data merging. We first conducted the outgroup f3 (Mbuti; Y_facet, Tajik_China) to further confirm the genetic affinity between Chinese Tajiks and Eurasian populations. We showed the populations with the top 40 f3 values in Fig. 3A and listed the more comprehensive results in Table S2. We observed that Chinese Tajiks shared more genetic drift with Early-Middle Bronze Age Tarim people (Tarim_EMBA1), other ancient Xinjiang groups, Central Asians and Steppe populations.
We then explored the admixture signals (Z-scores < −3) in Chinese Tajiks using admixture-f3 (Source1, Source2; Target). A negative f3 value means that the allele frequencies of the target are intermediate between those of the two potential source populations. We listed all the potential related sources with the most significant negative f3 values in Table S3. Most of the statistically significant source pairs combined one Western population with one East Asian population. The maximum negative f3-value was observed when French was the Western Eurasian ancestral source and […] In addition, because the PCA, ADMIXTURE, and f3-statistics results showed a close affinity between Chinese Tajiks and ancient Xinjiang populations, we performed f4-statistics in the form of f4 (Mbuti, reference populations; ancient Xinjiang populations, Tajik_China) to investigate the relatedness between them. In Fig. 3B, most ancient Xinjiang populations presented significant negative values (Z-scores < −3), marked in blue, which suggested that these ancient Xinjiang people had more gene flow from East Asians than Chinese Tajiks. The detailed results are listed in Table S4. Furthermore, there were no significant values (−3 < Z-scores < 3) between Chinese Tajiks and Xinj_HE3 or Xinj_IA11_aSte, suggesting that Chinese Tajiks had a close relationship with these two ancient Xinjiang groups.
Consistent with the f3-statistics, potential admixture was further demonstrated by the allele frequency-based TreeMix analysis. Among the 28 populations in Fig. 3C, Chinese Tajiks clustered with ancient Xinjiang, Iranian, Tajikistan Tajik, French, and English populations and were separated from East Asian-related clades, suggesting that Chinese Tajiks showed a closer affinity with ancient Xinjiang and Western populations.
To further test the homogeneity between Chinese Tajiks and ancient Xinjiang populations, we performed a qpWave analysis using Mbuti, Ami, Anatolia_N, Iran_GanjDareh_N, Israel_Natufian, Italy_North_Villabruna_HG, Mixe, Onge, Russia_Kostenki14, Tianyuan, and Ust_Ishim as outgroups. The results (Fig. 3D) showed a P value above 0.05 for rank = 0 (indicated with double+) between Chinese Tajiks and Xinj_HE3, suggesting that those two groups were genetically homogeneous, which was also demonstrated in the f4-statistics above.
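The sign convention used for the f4-statistics above can be illustrated with a short Python sketch. It computes the naive f4(A, B; C, D) as the average of (a - b)(c - d) over SNPs and applies the |Z| > 3 threshold used in the text. The function names and allele frequencies are hypothetical, and the Z-score itself (obtained in ADMIXTOOLS by block jackknife) is taken as given rather than computed.

```python
def f4(pop_a, pop_b, pop_c, pop_d):
    """Naive f4(A, B; C, D): mean of (a - b) * (c - d) across SNPs."""
    lengths = {len(pop_a), len(pop_b), len(pop_c), len(pop_d)}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("frequency lists must be non-empty and of equal length")
    terms = [(a - b) * (c - d) for a, b, c, d in zip(pop_a, pop_b, pop_c, pop_d)]
    return sum(terms) / len(terms)


def classify_z(z, threshold=3.0):
    """Label a Z-score using the |Z| > 3 convention described in the text."""
    if z < -threshold:
        return "negative"   # B shares more alleles with C than with D
    if z > threshold:
        return "positive"   # B shares more alleles with D than with C
    return "not significant"


# Toy frequencies: A plays the role of the outgroup, B the reference
# population, C an ancient group, and D the target. B resembles C more
# than D here, so the statistic comes out negative.
value = f4([0.0, 0.0], [1.0, 1.0], [0.8, 0.9], [0.2, 0.1])
```

With A as the outgroup, a significantly negative value indicates that the reference population B shares more alleles with C than with D, which is how the blue cells in Fig. 3B are read.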

Genetic substructure between Tajik dialect groups
Chinese Tajiks comprise Wakhi-speaking and Selekur-speaking groups. We performed a series of analyses to explore the possible genetic substructure between these two groups. In the PCA and ADMIXTURE results, we did not observe obvious genetic differences at the individual level (Figs. S2a, S2b). However, we found significant negative values (Z-scores < −3), marked in blue, when we performed f4-statistics in the form of f4 (Mbuti, reference populations; Tajik_Selekur, Tajik_Wakhi), which suggested that Tajik_Selekur had more gene flow from East Asians than Tajik_Wakhi (Fig. S2c). To further confirm the genetic difference between Tajik_Selekur and Tajik_Wakhi, we conducted a qpWave analysis together with relevant reference populations. The P value between Tajik_Selekur and Tajik_Wakhi did not exceed 0.05 (values of P > 0.05 are indicated with double+) in Fig. S2d, which is consistent with the above f4-statistics.

Genetic sources of Chinese Tajik people
To further investigate potential ancestral sources and genetic variations of Chinese Tajiks, we fitted two- and three-way admixture models in qpAdm. We used the published ancient Xinjiang populations (Kumar et al., 2022), millet farmers in the Yellow River basin, western and eastern Steppe populations and Neolithic Iran farmers as the potential sources to explore the genetic admixture patterns of Chinese Tajiks (Figs. 4A-4J; Table S5). In the two-way admixture models (Figs. 4A-4C), Chinese Tajiks could be modeled as an admixture of Xinjiang_HE3 (80.4%-85.9%) and Turkmenistan_Gonur_BA (19.6%) or Iran_GanjDareh_N (16.8%) or CHG (14.1%). When we used other ancient Xinjiang groups (Xinj_IA7_aEA and Xinj_HE5) as sources, their admixture proportions decreased to 56.1%-61.5% (Figs. 4D, 4E), indicating a closer affinity of Chinese Tajiks with Xinjiang_HE3 than with other ancient Xinjiang populations, which was also attested in the f4-statistics (Fig. 3B) and qpWave analysis (Fig. 3D). Moreover, in the three-way admixture models (Figs. 4F-4J), Chinese Tajiks could be modeled as ancient Xinjiang groups with additional ancestry from Turkmenistan_Gonur_BA, Russia_Andronovo, Russia_MLBA_Sintashta or Iran_GanjDareh_N. We also observed that the admixture proportion from Yellow River farmers (YR_MN) was less than 9% (Fig. 4H) in Chinese Tajiks.
When we considered the Wakhi- and Selekur-speaking groups of Chinese Tajiks separately (Fig. S3a), Tajik_Selekur could be successfully modeled as a mixture of an ancient Xinjiang population (Xinj_IA11_aSte, 93.6%-93.9%) and Yellow River farmers (6.4%) or Chinese Han (6.1%). However, the two-source admixture models failed for Tajik_Wakhi. In the three-source admixture models, Tajik_Selekur and Tajik_Wakhi differed slightly: Tajik_Selekur had a little more gene flow from Yellow River farmers than Tajik_Wakhi, with proportions of 5.3% and 3%, respectively.

Admixture time estimation via the decay of linkage disequilibrium
We used different eastern and western ancestral sources to assess plausible admixture processes in the formation of Chinese Tajik people by estimating ALDER-based admixture times. Considering the observed genetic structure, we used East Asians, Central Asians, Iranians and Western Eurasians as Eastern and Western ancestral proxies (Table S6). We obtained 22 pairs of best-fitted models after excluding failed models or models with lower Z-scores/P-values. Chinese Tajiks could be modeled as an admixture between ancient Xinjiang populations around 577.08 ± 187.32 to 1615.6 ± 511.56 years ago (ya). Chinese Tajiks could also be modeled as an admixture of ancient Xinjiang populations and Chinese Han around 695.8 ± 73.64 to 1141.28 ± 110.04 ya. In addition, Chinese Tajiks could be modeled as a mixed group from ancient Xinjiang and European groups (Anatolia_N and CHG) from 1034.6 ± 194.04 to 2954.28 ± 509.32 ya, as well as Kyrgyz_Tajikistan (720.72 ± 196.28 ya) and Iranian (~551.04 ± 93.8 ya). When we used Anatolia_N and Tarim_EMBA1 as the two sources, the estimated admixture time became older (~4720.52 ± 595 ya).
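The dates above follow from ALDER estimates in generations multiplied by the 28 years per generation stated in the methods. A minimal Python sketch of that conversion is given below; the function name is hypothetical, and the generation values in the example (20.61 ± 6.69) were back-calculated here by dividing the reported 577.08 ± 187.32 years by 28 rather than taken from Table S6.

```python
YEARS_PER_GENERATION = 28  # generation time stated in the methods


def generations_to_years(generations, std_error, years_per_generation=YEARS_PER_GENERATION):
    """Scale an admixture date and its standard error from generations to years."""
    return generations * years_per_generation, std_error * years_per_generation


# Back-calculated example (assumption, see the note above).
years, years_se = generations_to_years(20.61, 6.69)
```

Because the conversion is a simple scaling, the standard error is multiplied by the same factor, and any change in the assumed generation time shifts all reported dates proportionally.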

Y chromosomal haplogroup analysis
Chinese Tajiks exhibited high diversity in their paternal gene pool (Fig. 4K; Table S7). A high percentage of Y-chromosome haplogroup R1a1-M17 was found among Chinese Tajiks, accounting for 53.13% of the total. In previous studies, R1a1-M17 appeared at high frequencies primarily in western Xinjiang groups of the Iron Age and Historical Era and in Steppe_MLBA populations (Wells et al., 2001; Sharma et al., 2009; Kumar et al., 2022). In addition, the derived branch R1a1a1b2a2-F2935, which accounted for 14.06% of our samples (Table S7), is highly enriched among the Bashkirs (Karmin et al., 2015). The Bashkirs belong to one of the indigenous Russian groups speaking various Turkic languages. The other derived branch, R1a1a1b2a1-L657.1, accounting for 7.81%, has a large proportion among Uyghur and Kazakh populations (Bergström et al., 2020). Moreover, haplogroup R2a1-L263 is widely distributed among the Burusho living in the mountains of northern Pakistan. Ethnographic studies have previously found similarities between Burusho and Plateau Tajiks regarding appearance, dress, customs, and music (Bergström et al., 2020). Besides, the minor haplogroup J-M304, accounting for 12.5%, is thought to have evolved in western Asia and occurs at high frequencies in the Middle East, North Africa, the Horn of Africa, and the Caucasus (Semino et al., 1996, 2000; Quintana-Murci et al., 2001). Haplogroup Q1b-M346 (6.25%) is mainly distributed among the Hazara of Afghanistan (Bergström et al., 2020), a group formed by the mixing of European and Mongolian ancestry. Haplogroups C2a-L1373, D-CTS3946, O2a-M324, and their subclades are common among East Asian populations (Bergström et al., 2020; Wang & Li, 2013; Yan et al., 2014) but occur at very low proportions in Chinese Tajiks. In this study's Selekur- and Wakhi-speaking groups, we found a low frequency of the haplogroups associated with East Asian populations, such as C2a1a3a1-FGC16594, D-CTS3946 and O2a-M324.
From the perspective of the Y-chromosome haplogroups above, we also observed East-West admixture signals in Chinese Tajiks. The frequent haplogroup R1a1 was also predominant in western Xinjiang groups of the Iron Age and Historical Era as well as in Steppe_MLBA populations. However, the typical East Asian-related haplogroups were at low frequencies in Chinese Tajiks. Additionally, there was no significant difference in the paternal genetic profile between the Selekur- and Wakhi-speaking groups of Chinese Tajiks.
Motivated by the maternal lineages observed in the genome-wide array genotyping data above, we subsequently performed mitochondrial multiplex PCR amplification. We used Arlequin software (Excoffier et al., 2005) to calculate the genetic distance (Fst) based on the mtDNA multiplex sequencing data. We compared the two dialect groups of Chinese Tajiks (Selekur and Wakhi dialects) with relevant populations from East Asia, Central Asia, and Eurasia. A heatmap based on the Fst values is shown in Fig. S3b, and detailed information is listed in Table S9. In Fig. S3b, blue indicates a small genetic distance, while red indicates a large genetic distance. We found that the Selekur- and Wakhi-speaking groups showed a small genetic distance from each other and from published Tajik people in Taxkorgan and Pamiri, which was consistent with their geographical locations. These two Chinese Tajik groups also had a close affinity with Israel and modern Iranian groups (marked by the blue wireframe).

Discussion
In this study, we analyzed genomic data of Chinese Tajiks together with relevant modern and ancient Eurasian populations to trace their population genetic history. We found that Chinese Tajiks contained two major ancestries: one from ancient Xinjiang populations (Kumar et al., 2022), and the other related to Western Eurasians, probably from Europeans and Neolithic Iran farmers. This finding further supports the view that the movement and admixture of Steppe, Central Asian, and East Asian people into the Xinjiang region increased in the Bronze Age and remained prevalent in both historical-era (HE) and present-day Xinjiang populations (Kumar et al., 2022). Specifically, Chinese Tajiks showed large-scale genetic continuity with some ancient Xinjiang populations of the historical period (e.g., Xinj_HE3; Figs. 3B, 3D). The proportions of Xinj_HE3-related ancestry reached 80.4%-85.9% in Chinese Tajiks (Figs. 4A-4C), consistent with the admixture time evaluation, which placed the admixture events between Chinese Tajiks and ancient Xinjiang populations within the historical period (Table S6).
The genome-wide research on the Chinese Tajiks helps us understand the Chinese Tajik population's genetic diversity and admixture patterns. It provides clues for further exploring population dynamics in western China, a region involved in the exchange of material culture, agriculture, and technology between West and East Eurasian populations. Generally, the diffusion of culture is not always accompanied by population movements (Posth et al., 2018). However, our findings give genomic evidence that the East-Iranian language, one of the Indo-European languages, was introduced along with population movements and genetic admixture. The Selekur- and Wakhi-speaking groups of Chinese Tajiks are major Indo-Iranian-speaking populations in Xinjiang. Although these two groups showed a great affinity with the local people in Xinjiang (Figs. 2-4), there are slight differences between them. The Selekur-speaking inhabitants of Taxkorgan County of Xinjiang were characterized by slightly more gene flow and admixture from East Asian populations than the Wakhi-speaking inhabitants of Darbudar Village of Taxkorgan County (Fig. S2c). This inference was also supported by the maternal haplogroup analysis, in which East Asian-related haplogroups were found only in the Selekur-speaking group and did not appear in Wakhi-speaking people. This suggested that a closed-off rural lifestyle might reduce contact with surrounding populations, while broader demographic communication would aid population admixture.
In Xinjiang, there is a complex demographic history and a coexistence of populations with diverse cultural, linguistic, and genetic backgrounds. This study documented the dynamic interactions of Indo-Iranian-speaking groups in the Xinjiang region and uncovered genetic admixture patterns similar to those of surrounding populations. We observed admixed ancestries related to Xinjiang, European, Central and East Asian populations and Neolithic Iran farmers in present-day Chinese Tajik people, suggesting that the formation of the Tajiks was accompanied by widespread population movements. These inferences provide important information for further understanding the Indo-Iranian languages in the Xinjiang region. However, the absence of female samples in our sample collection may result in an underestimate of X-chromosome genetic diversity. Further sampling of female individuals will be necessary to characterize the formation of Chinese Tajiks' genetic makeup.

Fig. 1. Geographical distribution of sampling sites marked by the black triangle in the red area.

Fig. 2. Overview of genetic structure. A, Principal component analysis (PCA) of Chinese Tajiks with Central Asian, Western Eurasian, ancient Xinjiang and East Asian populations. Ancient individuals were projected onto the modern genetic landscape. B, PCA focused on Central Asian and Western Eurasian populations. C, ADMIXTURE analysis of newly generated Chinese Tajik data (Tajik_China) together with worldwide representative modern and ancient non-Africans.

Fig. 3. Signals of admixed sources and the shared genetic drifts revealed from the outgroup-f3 statistics, f4-statistics, TreeMix and qpWave analysis. A, Top admixture signals for ancient and modern Eurasians as possible sources via outgroup-f3 (Mbuti; Eurasians, Tajik_China). B, Investigation of the relatedness between Chinese Tajiks and ancient Xinjiang populations in f4-statistics in the form of f4 (Mbuti, reference populations; ancient Xinjiang populations, Tajik_China). C, TreeMix analysis to further explore the potential admixture. D, Test of the homogeneity between Chinese Tajiks and ancient Xinjiang populations by qpWave analysis.

Fig. 4. Potential ancestral sources and paternal and maternal lineages inferred from the pairwise qpAdm, Y-chromosome and mitochondrial DNA haplogroups. A-J, Potential ancestral sources and genetic variations of Chinese Tajiks with well-fitted two- and three-way admixture models in qpAdm. K, Y-chromosome haplogroups distributed in the Chinese Tajik lineages. L, MtDNA haplogroups distributed in the Chinese Tajik gene pool.