Published September 20, 2025 | Version v1
Report Open

Origins of the Arain (Punjabi) Tribe: A Genetic Perspective

  • 1. Government College University, Lahore


Abstract

The Arain are a historically agricultural, predominantly Muslim Punjabi community concentrated in central and southern Punjab (present-day Pakistan and India). Historical narratives propose diverse origins (indigenous agriculturalists, migrants from the Iranian plateau, or groups absorbed from diverse South Asian lineages), but these claims have often relied on ethnohistorical and colonial-era sources rather than genetic evidence. This paper synthesizes published genetic data (uniparental markers and autosomal studies) that include Punjabi subgroups and, where available, Arain-specific data, to evaluate the genetic origins of the Arain. mtDNA analyses demonstrate a mixed maternal legacy composed of South Asian haplogroups (predominantly haplogroup M lineages) together with West Eurasian haplogroups, consistent with admixture in Punjab. Autosomal STR forensic datasets from Arain individuals show high diversity and statistically significant differentiation from geographically distant groups but cluster broadly with other Punjabi populations. Y-chromosome surveys across Pakistan and of Punjabi male samples reveal a predominance of haplogroups common in northern South Asia (including R1a and other West Eurasian–affiliated lineages) alongside locally common South/Central Asian lineages. Taken together, the genetic evidence indicates that the Arain are not a genetically isolated founder population but rather part of the wider Punjabi genetic continuum formed by long-term local ancestry plus episodes of gene flow from West Eurasian and Central Asian sources. We discuss limitations (sample sizes, marker types, lack of dense genome-wide data explicitly sampled from Arain subpopulations) and propose a focused sampling program using genome-wide SNP arrays and/or whole genomes that can resolve fine-scale ancestry and demographic history.

1. Introduction

Ethnographic and historical literature describes the Arain (Persian/Urdu آرائین / Arāin) as an agriculturist community concentrated in the Punjab region. Colonial sources and later social historians variously interpret their origins: some narrative traditions link them to ancient agrarian groups of the Indus and Ghaggar plain, while others suggest migratory origins or social mobility during the early medieval and colonial periods (Ahmed 2016; Jaffrelot 2004). These narratives, however, offer limited power to resolve genetic ancestry. Modern genetic tools (uniparental markers such as mtDNA and the Y chromosome, autosomal STRs, and high-density genome-wide SNPs) provide a means to test competing models: indigenous South Asian ancestry versus significant recent input from West/Central Asian gene pools, or a mosaic of multiple inputs. This review integrates the available genetic literature that explicitly samples Punjabi subgroups, and Arain-specific forensic and uniparental studies where available, to provide an evidence-based genetic perspective on Arain origins.

2. Materials and methods — sources and analytical scope

This paper synthesizes peer-reviewed genetic studies sampling Punjabi populations and Arain-specific datasets. Key sources include: (1) a mtDNA control-region study that included caste groups within Punjab (Arain and Gujar) and sequenced complete control regions (Bhatti et al. 2018); (2) a forensic autosomal STR dataset of 121 Arain individuals reporting allele frequencies and forensic parameters (Javed et al. 2022); (3) broader Y-chromosome and population genetic surveys across Pakistan; and (4) comparative genome-scale resources that include Punjabis (PJL) from public datasets such as the 1000 Genomes Project and regional population studies of South Asia (1000 Genomes / PJL, Reich lab studies). These studies employ standard molecular genetic protocols (Sanger sequencing for mtDNA control region, PCR-based STR typing, SNP genotyping arrays or Y-SNP/STR panels) and population genetic analyses (haplogroup assignment using PhyloTree/ISOGG, pairwise differentiation, PCA, and phylogeographic interpretation). We emphasize qualitative synthesis rather than novel reanalysis, because raw genotype data were not available for integrated reprocessing in this review.

3. Results

3.1 Maternal lineages (mtDNA)

Bhatti et al. (2018) sequenced the full mitochondrial control region for 100 maternally unrelated Punjabis sampled from Arain and Gujar caste groups. They observed 58 distinct mtDNA haplotypes and a mixture of haplogroups: predominantly South Asian M-lineages alongside notable frequencies of West Eurasian haplogroups. The authors concluded that Punjabi castes (including Arain) show a composite of South Asian, East Asian and West Eurasian maternal lineages; overall mtDNA phylogeography of Punjab appeared relatively homogeneous, clustering with other South Asian populations in PCA. Bhatti et al. also reported high haplotype diversity and suggested demographic expansion and extensive historical admixture in the maternal gene pool.

Interpretation: The maternal signal indicates substantial indigenous South Asian ancestry (M lineage diversity is a hallmark of South Asian matrilines) combined with gene flow from West Eurasian sources. Maternal profiles alone do not support an exclusively external origin (for example, recent wholesale replacement); they point instead to long-term local continuity with episodic admixture.
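The "high haplotype diversity" reported for the control-region data is usually summarized with Nei's unbiased estimator, h = n/(n-1) × (1 - Σ p_i²). The sketch below is a minimal illustration of that formula only; the haplotype labels and counts are hypothetical and are not taken from Bhatti et al. (2018), whose per-haplotype counts are not given in this review.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Nei's unbiased haplotype diversity: h = n/(n-1) * (1 - sum(p_i^2))."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two sampled sequences")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


# Hypothetical toy sample of 10 individuals carrying 7 distinct haplotypes.
toy_sample = ["H1"] * 3 + ["H2"] * 2 + ["H3", "H4", "H5", "H6", "H7"]
toy_h = haplotype_diversity(toy_sample)

# Limiting case: every individual carries a different haplotype.
all_unique_h = haplotype_diversity([f"H{i}" for i in range(20)])
```

Values close to 1 mean that two randomly drawn individuals almost always carry different haplotypes, which is the sense in which the maternal gene pool is described as diverse.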

3.2 Paternal lineages (Y-chromosome)

Comprehensive Y-chromosome surveys in Pakistan show that the major Y haplogroups found across Pakistani populations include R1a, R1b, L, J2 and others, with varying frequencies across ethnic groups (Qamar et al. and follow-up surveys summarized in nationwide reviews). These studies point to a mix of lineages associated with both ancient South Asian and West/Central Eurasian male-line ancestry. No large published high-resolution Y-SNP study focuses solely on Arain males, but Y-STR/Y-SNP sampling across Punjabi males, together with localized community projects (genealogical Y-DNA projects), indicates that Arain men carry Y haplogroups common in northern South Asia, including R-lineages. Overall, the paternal evidence is consistent with the pattern seen in the rest of Punjab: pronounced diversity with components connected to Steppe-related lineages and to older South/Central Asian substrates.

3.3 Autosomal markers and forensic STRs

Autosomal STR data for 121 Arain individuals (Javed et al. 2022) provide population genetic parameters used in forensic genetics: high locus-by-locus heterozygosity, a combined power of discrimination of ~0.9999999999999999925, and evidence of genetic differentiation when compared with geographically distant groups. These statistics confirm that the Arain sample is genetically diverse and informative for identity testing. They also show that, at forensic marker resolution, the Arain differ from some other populations, likely reflecting fine-scale regional structure, but largely fall within the Punjabi genetic cluster in broader comparisons.
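To make the forensic parameters concrete, the following sketch computes expected heterozygosity, match probability and power of discrimination for a single locus from allele frequencies, then combines match probabilities across loci. It assumes Hardy-Weinberg proportions and independent loci; published studies often compute these values from observed genotype counts instead, so this is an illustration of the definitions rather than the procedure used by Javed et al. (2022). The locus names and frequencies are hypothetical.

```python
from itertools import combinations_with_replacement
from math import prod


def locus_stats(freqs):
    """Return (expected heterozygosity, match probability, power of discrimination)."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    he = 1.0 - sum(p * p for p in freqs)
    pm = 0.0
    # Enumerate every unordered genotype and sum its squared expected frequency.
    for i, j in combinations_with_replacement(range(len(freqs)), 2):
        g = freqs[i] ** 2 if i == j else 2.0 * freqs[i] * freqs[j]
        pm += g * g
    return he, pm, 1.0 - pm


# Hypothetical allele frequencies for two illustrative loci.
toy_loci = {
    "locus_A": [0.5, 0.5],
    "locus_B": [0.25, 0.25, 0.25, 0.25],
}

per_locus = {name: locus_stats(f) for name, f in toy_loci.items()}
combined_pm = prod(stats[1] for stats in per_locus.values())
combined_pd = 1.0 - combined_pm
```

Because the combined match probability is a product over loci, adding loci with many roughly equifrequent alleles drives it toward zero quickly, which is why a 15-locus panel can reach the extreme values quoted above.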

3.4 Genome-wide context from regional datasets

Large genome-wide datasets that include Punjabi samples (e.g., PJL in the 1000 Genomes Project, and several South Asian genomic surveys) consistently place Punjabis intermediate between South Asian (Ancestral South Indian / ASI) and West Eurasian/Steppe-related ancestry components, with variable proportions among individuals and subgroups. Studies of South Asian population structure show that modern South Asian genomes arise from at least two deeply diverged ancestral sources (sometimes described as Ancestral North Indian [ANI] and Ancestral South Indian [ASI], with subsequent refinements adding Steppe, BMAC/Zagros-related, and East Asian contributions). Punjabis typically show higher ANI/West Eurasian-related ancestry than many southern Indian groups, but significant local South Asian ancestry remains pervasive. In aggregate, genome-wide work indicates that Punjabi communities, including those that would encompass Arain samples, are products of complex demography: ancient local continuity plus stepwise admixture events over millennia.
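The idea of a population lying "intermediate" between two ancestral sources can be illustrated with a deliberately simplified two-source model, in which the target's allele frequency at each SNP is a weighted average of the two source frequencies. The sketch below estimates the mixing weight by least squares. It is a toy under stated assumptions, not a substitute for formal tools such as qpAdm, and all frequencies and names (`source_a`, `source_b`, `target`) are hypothetical.

```python
def two_source_admixture(target, source_a, source_b):
    """Least-squares estimate of alpha in: target ~ alpha*A + (1 - alpha)*B."""
    if not (len(target) == len(source_a) == len(source_b)):
        raise ValueError("frequency vectors must have equal length")
    num = sum((t - b) * (a - b) for t, a, b in zip(target, source_a, source_b))
    den = sum((a - b) ** 2 for a, b in zip(source_a, source_b))
    if den == 0.0:
        raise ValueError("sources are identical; alpha is not identifiable")
    # Clip to the biologically meaningful range [0, 1].
    return min(1.0, max(0.0, num / den))


# Hypothetical allele frequencies at four SNPs in two source populations.
source_a = [0.9, 0.1, 0.6, 0.3]
source_b = [0.2, 0.7, 0.4, 0.8]

# Construct a target that is exactly 40% source A and 60% source B.
true_alpha = 0.4
target = [true_alpha * a + (1 - true_alpha) * b for a, b in zip(source_a, source_b)]

alpha_hat = two_source_admixture(target, source_a, source_b)
```

Real analyses involve more than two sources, sampling noise, and drift since admixture, which is why the refined models cited above add further ancestry components.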

4. Summary table of key genetic findings

Table 1. Selected genetic studies relevant to Punjabi and Arain ancestry (summary).

Study (year) | Marker(s) | Sample (population) | Key findings
Bhatti et al. (2018) | mtDNA (control region) | N=100 Punjabi castes (Arain, Gujar) | 58 haplotypes; mixture of South Asian M haplogroups and West Eurasian lineages; Punjab clusters with South Asia in PCA; high haplotype diversity.
Javed et al. (2022) | Autosomal STRs (15 loci) | N=121 Arain | High diversity, combined PD ≈ 1.0; significant differentiation vs. distant groups; forensic baseline for Arain.
Pakistan Y-chromosome surveys (review) | Y-SNP/STR | Multiple Pakistani groups | Mixed Y haplogroups across Pakistan including R1a, J2, L etc.; regional heterogeneity.
1000 Genomes (PJL) & regional SNP studies | Genome-wide SNPs | Punjabis from Lahore and related datasets | Punjabis intermediate between South Asian and West Eurasian/Steppe components; multiple admixture episodes.

5. Discussion

5.1 What do genetics say about Arain origins?

Taken together, the genetic evidence supports the following inferences:

  1. Local South Asian substrate: The presence and diversity of mitochondrial M haplogroups and other local mtDNA lineages indicate substantial indigenous South Asian maternal ancestry in Punjabi castes sampled (including Arain). This argues against a model in which the Arain are uniformly recent immigrants replacing local populations.
  2. West Eurasian / Steppe-related paternal and autosomal signals: Y-chromosome data across Pakistan and genome-wide studies of Punjabis reveal significant representation of haplogroups and autosomal ancestry components associated (in ancient DNA studies) with West Eurasian and Steppe-related migrations. This pattern is consistent with documented demographic events (e.g., Bronze Age steppe influences, later historical gene flow). Arain male lineages are reported to include the same common Punjabi haplogroups; genealogical Y-DNA projects and regional studies do not show a unique, exclusive Arain haplogroup.
  3. Community as part of Punjabi genetic continuum: Autosomal STRs and the mtDNA results place Arain within the broader Punjabi genetic cluster rather than as an outlier. High diversity and the absence of a detectable founder effect indicate that the Arain are not an isolated founder population; rather they reflect admixture and demographic processes typical of Punjabi communities.
  4. Heterogeneous substructure likely: Anecdotal and small-scale community sampling (genealogical posts, small datasets) suggest internal heterogeneity among Arain subgroups across rivers/regions (e.g., Ghaggarwal, Sutlejwal, Jhelum/Mirpuri subgroups mentioned in community forums). This is plausible given historical endogamy, local admixture, and geographic discontinuities within Punjab. However, community forum data are not peer-reviewed and should be treated cautiously.

5.2 Limitations of current evidence

  • Sparse genome-wide sampling specific to Arain: While Punjabis are represented in large datasets (e.g., PJL), explicit genome-wide sampling focused on Arain subpopulations is limited or absent. Existing Arain data are primarily from forensic STR and mtDNA studies (useful but low resolution for deep ancestry inference). This constrains precise admixture timing estimates and fine-scale ancestry deconvolution.
  • Marker limitations: mtDNA and Y-chromosome markers reflect only maternal and paternal lines respectively and can miss the majority of ancestry present in autosomal DNA. STR panels (forensic markers) are optimized for identity, not ancestry inference.
  • Sample sizes and geographic coverage: Many published Arain datasets are modest (e.g., N≈100–121), and geographic sampling may not capture subpopulation structure across all regions where Arain reside. Larger, geographically stratified sampling is necessary.
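The sample-size limitation can be quantified with a confidence interval on an observed haplogroup frequency. The sketch below uses the Wilson score interval, chosen here as a common option for proportions and not as a method taken from the cited studies. The counts are hypothetical: a lineage observed in 10 of 100 individuals, compared with the same proportion in a sample four times larger.

```python
from math import sqrt


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a proportion k/n (z=1.96 gives ~95% coverage)."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("require n > 0 and 0 <= k <= n")
    p = k / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical: a haplogroup seen in 10 of 100 sampled individuals.
lo_100, hi_100 = wilson_interval(10, 100)

# Same observed proportion with a hypothetical fourfold larger sample.
lo_400, hi_400 = wilson_interval(40, 400)

width_100 = hi_100 - lo_100
width_400 = hi_400 - lo_400
```

The interval at N=100 is roughly twice as wide as at N=400, so modest frequency differences between subgroups cannot be distinguished from sampling noise at current sample sizes.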

5.3 Recommendations for future research

To resolve outstanding questions (e.g., proportions of Steppe/Zagros/Iranian/Indus-valley ancestry and admixture timing), a targeted genomic study should be undertaken with the following features:

  1. Dense genome-wide SNP genotyping (or low-coverage WGS) on a representative Arain cohort (N≥200, stratified by subregion). These data enable ADMIXTURE/qpAdm/ALDER/DATES analyses to quantify ancestry components and estimate admixture dates.
  2. High-resolution Y-SNP and full mtDNA sequencing to refine paternal/maternal haplogroup assignments and coalescent ages.
  3. Integration with ancient DNA references (e.g., Indus Periphery, Steppe, Zagros, BMAC) for formal admixture modeling.
  4. Ethnohistorical and geographic meta-data to correlate genetic structure with known migration, conversion, and social-mobility events.

Such a program would allow distinguishing among models (local continuity vs. recent migration vs. admixed origin) with higher confidence.
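The admixture-dating step in the recommendations above rests on a simple principle: admixture-induced linkage disequilibrium decays approximately exponentially with genetic distance, at a rate equal to the number of generations since mixing. The sketch below recovers that rate from a synthetic, noise-free decay curve by log-linear regression. It illustrates the principle only; tools such as ALDER and DATES fit real, noisy data and report standard errors, which this sketch does not attempt. The decay amplitude, distances and the 29-year generation time are all assumptions made for illustration.

```python
from math import exp, log


def generations_since_admixture(distances_morgans, weighted_ld):
    """Fit ln(LD) = ln(A) - n*d by ordinary least squares and return n."""
    if len(distances_morgans) != len(weighted_ld) or len(weighted_ld) < 2:
        raise ValueError("need two equal-length series with at least two points")
    if any(v <= 0 for v in weighted_ld):
        raise ValueError("log-linear fit requires strictly positive LD values")
    xs = list(distances_morgans)
    ys = [log(v) for v in weighted_ld]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Synthetic decay curve: amplitude 0.02, true age 100 generations (both hypothetical).
distances = [0.005 * k for k in range(1, 11)]  # 0.5 cM to 5 cM, in Morgans
curve = [0.02 * exp(-100 * d) for d in distances]

est_generations = generations_since_admixture(distances, curve)

GENERATION_YEARS = 29  # assumed generation time, for illustration only
est_years = est_generations * GENERATION_YEARS
```

With multiple admixture pulses the decay curve is a sum of exponentials, and a single-rate fit like this one would return an intermediate date, which is one reason dense genome-wide data and formal modeling are needed.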

6. Conclusion

Current genetic evidence positions the Arain community within the broader Punjabi genetic landscape: a composite ancestry formed primarily from a long-term indigenous South Asian substrate complemented by West Eurasian and Steppe-related gene flow. Uniparental and autosomal forensic datasets show high diversity and no unique, exclusive founder signature. Definitive, fine-scale resolution of Arain demographic history requires dedicated genome-wide sampling and analyses explicitly focused on Arain subgroups, together with integration of archaeological and historical records.

Figures

Figure 1 (schematic). Schematic ancestry model for Punjabi populations including Arain.

  • Box A: Indigenous South Asian (ASI/ancient local) — dominant maternal signal (mtDNA M lineages).
  • Box B: West Eurasian / Steppe-related input — contributes to Y-chromosome and autosomal variation (R1a etc.).
  • Arrows indicate multiple admixture episodes during Bronze Age and historical periods.
    (This is a conceptual figure synthesizing patterns reported across mtDNA, Y, and genome-wide studies rather than new plotted data.)

Figure 2 (schematic). Distribution of forensic STR diversity in Arain vs. other Punjabi groups.

  • Bar indicates high combined power of discrimination and low random match probability reported for Arain STR panel (Javed et al. 2022).

Tables

(See Table 1 above. Below is an additional summary with specific forensic stats.)

Table 2. Select forensic statistics from Arain autosomal STR study (Javed et al. 2022).

Parameter | Reported value (Arain, N=121) | Comment
Number of loci (Identifiler plus) | 15 | Standard forensic panel.
Combined power of discrimination (CPD) | ≈ 0.9999999999999999925 | Extremely high for identity testing.
Combined power of exclusion (CPE) | ≈ 0.99999815 | High power for paternity exclusion.
Random match probability (RMP) | 7.4897 × 10⁻¹⁸ | Very low, consistent with highly discriminatory loci.
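The CPD and RMP values in Table 2 are two views of the same quantity if one assumes the standard relation CPD = 1 - RMP. That relation is an assumption about how the figures were derived, although the reported numbers are consistent with it. The check below uses Python's `decimal` module because ordinary double-precision floats cannot represent a difference of order 10⁻¹⁸ from 1.

```python
from decimal import Decimal, getcontext

getcontext().prec = 40  # enough digits to hold 1 - 7.4897e-18 exactly

rmp = Decimal("7.4897e-18")  # random match probability from Table 2
cpd = Decimal(1) - rmp       # assumed relation: CPD = 1 - RMP

# Expected number of individuals per coincidental match.
one_in = Decimal(1) / rmp

# The same subtraction in binary floating point loses the signal entirely.
cpd_float = 1.0 - 7.4897e-18

print(cpd)        # exact decimal difference
print(cpd_float)  # rounds to 1.0
```

The leading digits of the decimal result agree with the CPD quoted in Table 2, and the float result shows why such values should be reported as match probabilities or computed with arbitrary-precision arithmetic.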

References

Note: citations below reference the primary studies discussed in this review.

  1. Bhatti S, Abbas S, Aslamkhan M, Attimonelli M, Segundo Trinidad M, Aydin HH, de Souza EM, Gonzalez GR. Genetic perspective of uniparental mitochondrial DNA landscape on the Punjabi population, Pakistan. Mitochondrial DNA A DNA Mapp Seq Anal. 2018 Jul;29(5):714–726. doi:10.1080/24701394.2017.1350951.
  2. Javed F, Shafique M, Rani N, Rubab A, Shahid AA. Allele frequency data of 15 autosomal STRs in Arain population of Pakistan. Int J Legal Med. 2022 Mar;136(2):557–558. doi:10.1007/s00414-021-02639-3.
  3. Qamar R, et al. Y-Chromosomal DNA Variation in Pakistan. [Review, PubMed Central]. (See national surveys and reviews summarizing Y-haplogroup distributions across Pakistani populations).
  4. Auton A., Abecasis G., Altshuler D., et al. A global reference for human genetic variation. 1000 Genomes Project (Phase 3), which includes Punjabi samples (PJL). Studies of South Asian genomes using PJL data elucidate broad patterns of ancestry in Punjabis.
  5. Nakatsuka N., Moorjani P., Rai N., et al. The promise of discovering population-specific disease-associated variants in South Asia (Reich lab summary & analyses). Nature Genetics Supplement / Reich lab reports (2017). (Discussions of founder events and structure across South Asia relevant for interpreting Punjabi genetic diversity).
  6. Additional regional Y-STR and autosomal STR studies (e.g., Punjabi Y-STR diversity, Y-filer studies, and other forensic datasets) provide supporting evidence for high diversity and typical Punjabi patterns; see BMC Genomics 2022, ResearchGate and related articles.

Acknowledgements & data availability

This is a literature synthesis using published peer-reviewed studies. Raw genotype datasets referenced are controlled by the original studies and public repositories (e.g., 1000 Genomes). The author encourages investigators and community leaders interested in resolving fine-scale Arain ancestry to collaborate on ethically designed sampling with community consent, genomic data sharing agreements, and integration with historical scholarship.

Author note on sensitivity and community implications

Genetic studies of ethnolinguistic and caste groups can intersect with identity, politics, and social sensitivities. The genetic patterns described here are population-level summaries and do not determine individual identity, social standing, or cultural heritage. Genetic evidence should be integrated respectfully with historical, linguistic and oral histories. Any new genetic research must adhere to ethical best practices, including informed consent, data privacy, community engagement, and culturally sensitive communication of results.

 
