Haploid, diploid, and pooled exome capture recapitulate features of biology and paralogy in two non-model tree species

Lind, Brandon; Lu, Mengmeng; Obreht Vidakovic, Dragana; Singh, Pooja; Booker, Tom; Aitken, Sally; Yeaman, Sam

doi:10.5061/dryad.k0p2ngf7w

Published July 18, 2021 | Version v1

Dataset Open

1. University of British Columbia
2. University of Calgary

Despite their suitability for studying evolution, many conifer species have large and repetitive giga-genomes (16-31 Gbp) that create hurdles to producing high-coverage SNP datasets that capture diversity from across the entirety of the genome. Due in part to multiple ancient whole-genome duplication events, gene family expansion, and subsequent evolution within Pinaceae, false diversity from the misalignment of paralog copies creates further challenges in accurately and reproducibly inferring evolutionary history from sequence data. Here, we leverage the cost-saving benefits of pool-seq and exome capture to discover SNPs in two conifer species, Douglas-fir (Pseudotsuga menziesii var. menziesii (Mirb.) Franco, Pinaceae) and jack pine (Pinus banksiana Lamb., Pinaceae). We show, using minimal baseline filtering, that allele frequencies estimated from pooled individuals show a strong positive correlation with those estimated by sequencing the same population as individuals (r > 0.948), on par with such comparisons made in model organisms. Further, we highlight the utility of haploid megagametophyte tissue for identifying sites that are likely due to misaligned paralogs. Together with additional minor filtering, we show that it is possible to remove many of the loci with large frequency estimate discrepancies between individual and pooled sequencing approaches, improving the correlation further (r > 0.973). Our work addresses bioinformatic challenges in non-model organisms with large and complex genomes, highlights the use of megagametophyte tissue for the identification of paralog sites, and suggests the combination of pool-seq and exome capture to be robust for further evolutionary hypothesis testing in these systems.
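
The comparison described above (correlating pooled and individual allele-frequency estimates, then dropping suspected paralog sites and correlating again) can be illustrated with a minimal sketch. This is not the authors' code, which is in the attached notebooks. The site names, frequencies, and the boolean paralog flag below are made-up toy values, and the tuple layout is an assumption chosen for illustration only.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy data: (site, pooled frequency, individual frequency,
# flagged as putative paralog). In this sketch the flag stands in for a site
# that looked heterozygous in haploid megagametophyte tissue.
sites = [
    ("s1", 0.05, 0.07, False),
    ("s2", 0.20, 0.18, False),
    ("s3", 0.35, 0.38, False),
    ("s4", 0.50, 0.49, False),
    ("s5", 0.80, 0.77, False),
    ("s6", 0.50, 0.10, True),
    ("s7", 0.45, 0.90, True),
]

# Correlation across all sites.
r_all = pearson_r([s[1] for s in sites], [s[2] for s in sites])

# Correlation after removing the flagged sites.
kept = [s for s in sites if not s[3]]
r_filtered = pearson_r([s[1] for s in kept], [s[2] for s in kept])

print(f"r (all sites):      {r_all:.3f}")
print(f"r (paralogs removed): {r_filtered:.3f}")
```

In the toy data the flagged sites are the ones with large discrepancies, so removing them raises the correlation. Whether that holds for real data depends on the filtering described in the associated publication.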

Notes

All code used to analyze these files is also attached.

Each code file is saved both as a Jupyter notebook (.ipynb) and as .html. The HTML version can be used to view a notebook without launching a Jupyter kernel.

Files

Files (2.3 GB)

Name	MD5	Size
DF_i52_filtered_concatenated_snps_max-missing_table_biallelic-only_p000_translated.txt.gz	83ae3d9e96f225cafe80156f287562d0	163.4 MB
DF_mega-varscan_all_bedfiles_SNP_paralog_snps.txt.gz	20b6b80f29521284543aa179cf4664b2	7.8 MB
DF_p52-varscan_all_bedfiles_SNP.txt.gz	cac5440f363a93988a02610d3380af89	42.2 MB
DF_snc_trans.fna.gz	ac6786da6860ee4897cbcbd3a092d30b	12.2 MB
haploid_pipeline_datatable.txt	e7e0c9e2eba8999ca9bd353d5d1a29d2	1.6 kB
JP_i101_filtered_concatenated_snps_max-missing_table_biallelic-only_translated.txt.gz	a9d5486daae287437d27afbf7c72243a	39.6 MB
JP_pooled-varscan_all_bedfiles_SNP_translated.txt.gz	2050b96383c48e4b74ce67269bd831dd	2.0 GB
JP_RFmg7-varscan_all_bedfiles_SNP_paralog_snps_translated.txt.gz	ee90cf18a7925d478a23492f8636bf49	1.0 MB
JP_transcriptome_for_probe.fasta.gz	e6f382b3fe66566421223c5cf1564449	25.7 MB
pooled_individual_pipeline_datatable.txt	029bc3cf5e9d69a06e25f0f17d2c0a92	14.0 kB
README.txt	4526296716b11a94fe80ff9d3d409974	1.8 kB
rna-seq_biosample.txt	a5900566a6638037a1724c12973dbab1	1.1 kB
rna-seq_sra_doc.txt	1068034ed4a0e13611c9ae93899ee4a5	1.2 kB
testdata_biosample.txt	b7a04929328834693e5ecbb69853c881	12.4 kB
testdata_sra_doc.txt	936d261ebcb646017971c6457abc17ca	20.6 kB
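
The MD5 checksums listed above can be used to confirm that a download completed without corruption. The following is a minimal sketch using only the Python standard library. The directory name `dryad_download` is a hypothetical placeholder, and only two of the listed files are included in the lookup table for brevity.

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksums copied from the file listing; extend with the remaining files.
EXPECTED = {
    "README.txt": "4526296716b11a94fe80ff9d3d409974",
    "DF_snc_trans.fna.gz": "ac6786da6860ee4897cbcbd3a092d30b",
}


def verify(directory, expected=EXPECTED):
    """Map each file name to True if it exists and its MD5 matches."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5sum(path) == checksum
    return results


if __name__ == "__main__":
    # "dryad_download" is an assumed local folder holding the downloaded files.
    for name, ok in verify("dryad_download").items():
        print(f"{name}: {'OK' if ok else 'MISSING OR MISMATCH'}")
```

Reading in chunks keeps memory use flat, which matters for the 2.0 GB pooled jack pine file.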

Additional details

Is cited by: 10.1101/2020.10.07.329961 (DOI)

	All versions	This version
Views	118	117
Downloads	380	380
Data volume	61.7 GB	61.7 GB
