Published August 6, 2020 | Version v1
Dataset Open

Data from: Chromosome-scale inference of hybrid speciation and admixture with convolutional neural networks

  • 1. University of Arizona

Description

Inferring the frequency and mode of hybridization among closely related organisms is an important step for understanding the process of speciation and can help to uncover reticulated patterns of phylogeny more generally. Phylogenomic methods to test for the presence of hybridization come in many varieties and typically operate by leveraging expected patterns of genealogical discordance in the absence of hybridization. An important assumption made by these tests is that the data (genes or SNPs) are independent given the species tree. However, when the data are closely linked, it is especially important to consider their non-independence. Recently, deep learning techniques such as convolutional neural networks (CNNs) have been used to perform population genetic inferences with linked SNPs coded as binary images. Here we use CNNs to select among candidate hybridization scenarios using the tree topology (((P1,P2),P3),Out) and a matrix of pairwise nucleotide divergence (dXY) calculated in windows across the genome. Training and independently testing a neural network on coalescent simulations showed that our method, HyDe-CNN, accurately performed model selection for hybridization scenarios across a wide breadth of parameter space. We then used HyDe-CNN to test models of admixture in Heliconius butterflies and compared it to a random forest classifier trained on introgression-based statistics. Given the flexibility of our approach, the dropping cost of long-read sequencing, and the continued improvement of CNN architectures, we anticipate that inferences of hybridization using deep learning methods like ours will help researchers to better understand patterns of admixture in their study organisms.
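To make the windowed dXY input described above more concrete, the following is a minimal sketch of how pairwise nucleotide divergence between two populations could be computed in windows. It is not the code used in the paper (that is in the code archives listed under Notes). The function name `windowed_dxy`, the use of equal-width windows, the 0/1 haplotype encoding, and the reading of "min" and "mean" as the minimum and mean over inter-population haplotype pairs are all assumptions made for illustration.

```python
import numpy as np


def windowed_dxy(hap_x, hap_y, positions, seq_len, n_windows):
    """Return (mean_dxy, min_dxy), each of length n_windows.

    hap_x, hap_y : 0/1 haplotype matrices, shape (n_haplotypes, n_snps)
    positions    : SNP coordinates, shape (n_snps,)
    seq_len      : total sequence length in base pairs
    n_windows    : number of equal-width windows (an assumption of this sketch)
    """
    hap_x = np.asarray(hap_x)
    hap_y = np.asarray(hap_y)
    positions = np.asarray(positions)

    edges = np.linspace(0, seq_len, n_windows + 1)
    win_len = seq_len / n_windows
    # Window index of each SNP; clip so a SNP at seq_len falls in the last window.
    win = np.clip(np.searchsorted(edges, positions, side="right") - 1,
                  0, n_windows - 1)

    # diffs[i, j, s] is True when haplotype i of X and haplotype j of Y differ at SNP s.
    diffs = hap_x[:, None, :] != hap_y[None, :, :]

    mean_dxy = np.zeros(n_windows)
    min_dxy = np.zeros(n_windows)
    for w in range(n_windows):
        # Differences per site for every X-Y haplotype pair in this window.
        per_pair = diffs[:, :, win == w].sum(axis=2) / win_len
        mean_dxy[w] = per_pair.mean()
        min_dxy[w] = per_pair.min()
    return mean_dxy, min_dxy
```

Repeating this for each pair of populations in (((P1,P2),P3),Out) and stacking the per-window values would give a populations-by-windows matrix of the general kind the description refers to; the exact layout used for HyDe-CNN should be taken from the authors' scripts.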

Notes

CSV Files for HyDe-CNN Tests

[hyde-cnn_tests.tar.gz] -- This archive contains the CSV files (12 in total, three for each model) with the results of testing the trained HyDe-CNN architecture using 10,000 additional simulated data sets for each of the four models at each of the branch scaling factors. Each CSV file has the parameters used to simulate each image with msprime, the predicted best model, the best model weight, and the summary statistics calculated for training the random forest classifier.
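As a hedged illustration of how the test CSVs could be inspected without unpacking the archive to disk, the sketch below reads every CSV member of a gzipped tar file using only the Python standard library. The helper names `read_csvs_from_archive` and `fraction_predicted` are hypothetical, and the column holding the predicted best model must be looked up in the actual CSV header; no column names are assumed here.

```python
import csv
import io
import tarfile


def read_csvs_from_archive(archive_path):
    """Map each CSV member name in a .tar.gz archive to a list of row dicts."""
    tables = {}
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if not (member.isfile() and member.name.endswith(".csv")):
                continue
            raw = tar.extractfile(member)
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            tables[member.name] = list(csv.DictReader(text))
    return tables


def fraction_predicted(rows, pred_col, label):
    """Fraction of rows whose value in column `pred_col` equals `label`.

    `pred_col` and `label` are supplied by the caller after checking the
    real header; they are not known from the dataset description.
    """
    if not rows:
        return 0.0
    hits = sum(1 for row in rows if row[pred_col] == label)
    return hits / len(rows)
```

Because each CSV corresponds to one simulated model and one branch scaling factor, applying `fraction_predicted` with that model's label would give the proportion of test data sets classified correctly for that file.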

Random Forest Classifier Results

[RF_classifier_results.txt] -- Raw output of the random forest classifier trained on introgression-specific summary statistics.

Trained Models for the HyDe-CNN Architecture

[trained_models_hyde-cnn.tar.gz] -- This archive contains the trained models for all of the neural networks for the HyDe-CNN architecture.

Trained Models for the Flagel et al. Architecture

[trained_models_flagel.tar.gz] -- This archive contains the trained models for all of the neural networks for the Flagel et al. architecture.

hyde_cnn_*_data_*.npz

Nine compressed NumPy archives with the input images split into training, validation, and testing sets. Together the files cover all combinations of input type (min, mean, min+mean) and branch scaling in coalescent units (0.5, 1.0, 2.0), one combination per file.
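The names of the arrays stored inside each .npz file are not given in this description, so a reasonable first step is to list them. The sketch below, with the hypothetical helper name `summarize_npz`, reports the shape and dtype of every array in an archive without assuming any key names.

```python
import numpy as np


def summarize_npz(path):
    """Return {array_name: (shape, dtype_string)} for every array in a .npz file."""
    summary = {}
    with np.load(path) as archive:
        for name in archive.files:
            arr = archive[name]  # decompresses this one array into memory
            summary[name] = (arr.shape, str(arr.dtype))
    return summary
```

Each array is decompressed when it is accessed, so for the larger files it may be preferable to read only the split that is needed rather than looping over all of them.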

HyDe-CNN Code Archive

[hyde-cnn_code_archive.tar.gz] -- Archived versions of all Python and R scripts used to generate, process, and analyze data in the paper. All of these scripts are also available on GitHub.

Heliconius Chromosome Five VCF and Recombination Map

[heliconius_data.tar.gz] -- VCF file containing variants on chromosome five for Heliconius samples as well as the recombination map for simulating chromosome five.

Trained Models for Heliconius

[trained_models_heliconius.tar.gz] -- This archive contains the trained models for all of the neural networks for the HyDe-CNN architecture.

Heliconius Resampling Results

[heliconius_res.tar.gz] -- CSV files with the predicted model weights for all 100 bootstrap replicates for the three different input types (min, mean, min+mean). 

heliconius_*_data.npz

Compressed NumPy archives with the input images split into training, validation, and testing sets for the Heliconius example. Each file holds the data for one of the input types (min, mean, min+mean).

Heliconius Code Archive

[heliconius_code_archive.tar.gz] -- Contains the code for simulating data to train, validate, and test a CNN, as well as a Jupyter Notebook that was used to process the observed data from chromosome five. This code is also on GitHub.

Funding provided by: National Science Foundation
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000001
Award Number: IOS-1811784

Funding provided by: National Institutes of Health
Crossref Funder Registry ID: http://dx.doi.org/10.13039/100000002
Award Number: R01GM127348

Files (12.6 GB)

Checksum  Size
md5:fb7978f89d3baaa414e24de5da61407e  50.7 kB
md5:f41eaecf4b28fa2cd8c16f10005d7dff  18.8 MB
md5:b4f34ba9e8545760bf138d2ead8aba08  1.9 GB
md5:5a2563ae73b13169f074c57798462b89  2.7 GB
md5:69a44ec3d0bda90b393d2c037211289d  662.7 MB
md5:aae986c46ef5b25accc75901741a14d2  9.9 kB
md5:2d78ee2de10c4bccd121b060a3c49449  11.5 kB
md5:0cb96fbefb5cc18868ba2a31997c1fcf  12.8 MB
md5:02e16b46556820a869f90533e5d9d4b7  791.5 MB
md5:87eb347ad889c31b0d7137f992ca5beb  815.7 MB
md5:0950c68b076493d249d92962a39d65bf  856.5 MB
md5:0e4f9337ac51c42f5e0f61050b77f571  1.1 GB
md5:182c69adc421ecf74335c9efbbc4a97e  1.2 GB
md5:fa7568dfb5bb08c8a72ae96acc009301  1.3 GB
md5:eb2b39c9d77e91b34e34b0dcdefe64c4  300.6 MB
md5:05d018a391f1d9b54847b323d57abf88  376.3 MB
md5:2738727e671e2def30e0c2a1b055f80d  452.7 MB
md5:774831f482183436ec3537d9487f1e8d  10.7 kB
md5:537dc9a75b8a232e5118513aea583fcd  17.3 MB
md5:42e94cd3c5be7e45961f42cf2b7d04e3  11.6 MB
md5:2984c3b918bf0d24f55a43121ba3aa29  36.2 MB
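
Because several of the files are larger than 1 GB, it can be worth confirming that a download is intact by comparing its MD5 digest with the value listed above. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and which checksum belongs to which file has to be taken from the repository page itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 with an expected value such as 'md5:fb79...'."""
    # Accept the value with or without the 'md5:' prefix used in the listing.
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

Reading in chunks keeps memory use constant regardless of file size, which matters for the multi-gigabyte archives in this record.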