Published March 12, 2025 | Version v1
Other Open

Data from: Jointly representing long-range genetic similarity and spatially heterogeneous isolation-by-distance

  • 1. University of Chicago
  • 2. University of Bologna

Description

Isolation-by-distance patterns are a widespread feature of the geographic structure of genetic variation in many species, and many methods have been developed to illuminate such patterns in genetic data. However, long-range genetic similarities also exist, often as a result of rare or episodic long-range gene flow. Jointly characterizing patterns of isolation-by-distance and long-range genetic similarity in genetic data is an open data analysis challenge that, if resolved, could help produce more complete representations of the geographic structure of genetic variation in any given species. Here, we present a computationally tractable method that identifies long-range genetic similarities against a background of spatially heterogeneous isolation-by-distance variation. The method uses a coalescent-based framework and models long-range genetic similarity in terms of directional events, each with a source fraction describing the fraction of ancestry at a location that traces back to a remote source. The method produces geographic maps annotated with inferred long-range edges, as well as maps of uncertainty in the geographic location of each source of long-range gene flow. We have implemented the method in a package called FEEMSmix (an extension to FEEMS from Marcus et al. 2021) and validated its implementation using simulations representative of typical data applications.
We also apply this method to two empirical data sets. In a data set of over 4,000 humans (Homo sapiens) across Afro-Eurasia, we recover many known signals of long-distance dispersal from recent centuries. Similarly, in a data set of over 100 gray wolves (Canis lupus) across North America, we identify several previously unknown long-range connections, some of which were attributable to recording errors in sampling locations. Therefore, beyond identifying genuine long-range dispersals, our approach also serves as a useful tool for quality control in spatial genetic studies.

Methods

  • The wolf data set (wolvesadmix_corrected) consists of 108 individuals and 17,729 SNPs. For this study, we correct the locations of two individuals based on an analysis of the sample metadata and remove three individuals with ambiguous locations from the original data set of 111 wolves compiled in Schweizer et al. 2016 (data available here: https://doi.org/10.5061/dryad.p8cz8wb18).
  • The human data set (c1global1nfd_public) consists of 4,070 individuals and 19,954 SNPs. For this study, we subset to individuals with public sharing permissions from the larger data set of 4,697 individuals in Peter et al. 2020 (data available on Zenodo as 'Supplemental information').
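
The wolf curation described above (correct two locations, drop three ambiguous samples, 111 to 108 individuals) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the sample IDs, the coordinates, and the function name curate_samples are hypothetical placeholders and are not taken from the data set or from FEEMSmix.

```python
def curate_samples(samples, corrections, ambiguous_ids):
    """Return a new dict of sample_id -> (lat, lon) after curation.

    samples:       dict of sample_id -> (lat, lon) as originally recorded
    corrections:   dict of sample_id -> corrected (lat, lon)
    ambiguous_ids: set of sample_ids to drop because their location is unclear
    """
    curated = {}
    for sample_id, coords in samples.items():
        if sample_id in ambiguous_ids:
            continue  # drop samples whose location cannot be resolved
        # use the corrected coordinates when a correction exists
        curated[sample_id] = corrections.get(sample_id, coords)
    return curated


# Hypothetical placeholder metadata: 111 samples with made-up coordinates.
samples = {f"wolf_{i:03d}": (50.0 + 0.1 * i, -110.0) for i in range(111)}
# Hypothetical corrections for two samples and three ambiguous samples.
corrections = {"wolf_005": (60.0, -120.0), "wolf_042": (45.0, -95.0)}
ambiguous_ids = {"wolf_010", "wolf_011", "wolf_012"}

curated = curate_samples(samples, corrections, ambiguous_ids)
print(len(curated))
```

The sketch leaves the input untouched and returns a new mapping, so the original and curated sample lists can be compared side by side.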

Files

papers.txt

Files (21.6 MB)

Checksum Size
md5:2597b8fb49629df848721ceabc084fc7 20.3 MB
md5:717f727032ef3d8b386d9a39e9481503 585.2 kB
md5:cc4f3848959ed8d2a535a8cae4ef26c3 204.1 kB
md5:076a3d0185bf10fe00d9c43e2985faf0 144.7 kB
md5:23aa1fe2f4086812215f0e58ba574578 290.9 kB
md5:53053ceb4b81339982ef7dae0da8d9d1 65.5 kB
md5:4af88798c95a43de092f29b3de097132 5.8 kB

Additional details

Related works

Is cited by
10.1101/2025.02.10.637386 (DOI)
Is derived from
10.5061/dryad.p8cz8wb18 (DOI)