Published July 18, 2021 | Version v1
Dataset Open

Haploid, diploid, and pooled exome capture recapitulate features of biology and paralogy in two non-model tree species

  • 1. University of British Columbia
  • 2. University of Calgary

Description

Despite their suitability for studying evolution, many conifer species have large and repetitive giga-genomes (16-31 Gbp) that hinder the production of high-coverage SNP datasets capturing diversity from across the entire genome. Due in part to multiple ancient whole-genome duplication events, gene family expansion, and subsequent evolution within Pinaceae, false diversity arising from the misalignment of paralog copies creates further challenges in accurately and reproducibly inferring evolutionary history from sequence data. Here, we leverage the cost-saving benefits of pool-seq and exome capture to discover SNPs in two conifer species, Douglas-fir (Pseudotsuga menziesii var. menziesii (Mirb.) Franco, Pinaceae) and jack pine (Pinus banksiana Lamb., Pinaceae). We show, using minimal baseline filtering, that allele frequencies estimated from pooled individuals correlate strongly and positively with those estimated by sequencing the same population as individuals (r > 0.948), on par with such comparisons made in model organisms. Further, we highlight the utility of haploid megagametophyte tissue for identifying sites that are likely due to misaligned paralogs. With additional minor filtering, we show that it is possible to remove many of the loci with large discrepancies in frequency estimates between the individual and pooled sequencing approaches, improving the correlation further (r > 0.973). Our work addresses bioinformatic challenges in non-model organisms with large and complex genomes, highlights the use of megagametophyte tissue for the identification of paralog sites, and suggests that the combination of pool-seq and exome capture is robust for further evolutionary hypothesis testing in these systems.
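
The comparison described above, pooled versus individual allele frequency estimates summarized by a correlation coefficient r, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the read and genotype counts, and the use of a plain Pearson correlation on per-site ALT frequencies are all assumptions made for illustration.

```python
import math


def allele_frequency(alt_count, total_count):
    """ALT allele frequency at one site from allele (or read) counts."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return alt_count / total_count


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy data for four SNP sites (made-up numbers, not from the dataset).
# Pooled sample: (ALT reads, total read depth) per site.
pool_counts = [(12, 100), (30, 120), (48, 95), (88, 110)]
# Individually sequenced diploids: ALT allele count out of 2 * 20 = 40 chromosomes.
individual_alt = [5, 9, 21, 33]
n_chromosomes = 40

pooled_freqs = [allele_frequency(alt, depth) for alt, depth in pool_counts]
individual_freqs = [allele_frequency(alt, n_chromosomes) for alt in individual_alt]

r = pearson_r(pooled_freqs, individual_freqs)
print(f"Pearson r between pooled and individual estimates: {r:.3f}")
```

In this sketch, filtering sites with large discrepancies would amount to dropping pairs from both lists before calling `pearson_r` again.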

Notes

All code to analyze these files is also attached.

Each code file is saved in Jupyter notebook format (.ipynb) and as .html. The HTML version can be used to view the notebook without launching a Jupyter kernel.
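
As an alternative to the HTML copies, the code in a notebook can also be read without a kernel, because an .ipynb file is a JSON document whose `cells` list holds each cell's `cell_type` and `source`. The following is a minimal sketch under that assumption (nbformat 4 layout); the helper name `code_cells` is hypothetical and not part of the attached code.

```python
import json


def code_cells(notebook_path):
    """Return the source text of every code cell in an .ipynb file."""
    with open(notebook_path, encoding="utf-8") as handle:
        notebook = json.load(handle)

    cells = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # "source" may be stored as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        cells.append(source)
    return cells
```

Calling `code_cells` with the local path of a downloaded notebook returns a list of strings, one per code cell, which can be printed or searched like any other text.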

Files

haploid_pipeline_datatable.txt

Files (2.3 GB)

Checksum                                Size
md5:83ae3d9e96f225cafe80156f287562d0    163.4 MB
md5:20b6b80f29521284543aa179cf4664b2    7.8 MB
md5:cac5440f363a93988a02610d3380af89    42.2 MB
md5:ac6786da6860ee4897cbcbd3a092d30b    12.2 MB
md5:e7e0c9e2eba8999ca9bd353d5d1a29d2    1.6 kB
md5:a9d5486daae287437d27afbf7c72243a    39.6 MB
md5:2050b96383c48e4b74ce67269bd831dd    2.0 GB
md5:ee90cf18a7925d478a23492f8636bf49    1.0 MB
md5:e6f382b3fe66566421223c5cf1564449    25.7 MB
md5:029bc3cf5e9d69a06e25f0f17d2c0a92    14.0 kB
md5:4526296716b11a94fe80ff9d3d409974    1.8 kB
md5:a5900566a6638037a1724c12973dbab1    1.1 kB
md5:1068034ed4a0e13611c9ae93899ee4a5    1.2 kB
md5:b7a04929328834693e5ecbb69853c881    12.4 kB
md5:936d261ebcb646017971c6457abc17ca    20.6 kB
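
The MD5 checksums in the file listing can be used to confirm that a download completed intact. The sketch below shows one way to do this with Python's standard `hashlib` module; the function names and any local file path passed to them are hypothetical, since the listing here does not pair checksums with file names.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for multi-gigabyte files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Compare a local file against a checksum written as 'md5:<hex>'."""
    expected = listed_checksum.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_listing(local_path, "md5:83ae3d9e96f225cafe80156f287562d0")` would return `True` only if the file at `local_path` (whatever name it was saved under) is the 163.4 MB file from the listing.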

Additional details

Related works

Is cited by
10.1101/2020.10.07.329961 (DOI)