Analysis of paralogs in target enrichment data pinpoints multiple ancient polyploidy events in Alchemilla s.l. (Rosaceae)

Morales-Briones, Diego F.; Gehrke, Berit; Huang, Chien-Hsun; Liston, Aaron; Ma, Hong; Marx, Hannah; Tank, David; Yang, Ya

doi:10.5061/dryad.cc2fqz660

Published May 17, 2021 | Version v1

Dataset Open

1. University of Minnesota
2. University of Bergen
3. Fudan University
4. Oregon State University
5. Pennsylvania State University
6. University of New Mexico
7. University of Idaho

Target enrichment is becoming increasingly popular for phylogenomic studies. Although baits for enrichment are typically designed to target single-copy genes, paralogs are often recovered with increased sequencing depth, sometimes from a significant proportion of loci, especially in groups experiencing whole-genome duplication (WGD) events. Common approaches for processing paralogs in target enrichment data sets include random selection, manual pruning, and mainly, the removal of entire genes that show any evidence of paralogy. These approaches are prone to errors in orthology inference or removing large numbers of genes. By removing entire genes, valuable information that could be used to detect and place WGD events is discarded. Here we used an automated approach for orthology inference in a target enrichment data set of 68 species of Alchemilla s.l. (Rosaceae), a widely distributed clade of plants primarily from temperate climate regions. Previous molecular phylogenetic studies and chromosome numbers both suggested ancient WGDs in the group. However, both the phylogenetic location and putative parental lineages of these WGD events remain unknown. By taking paralogs into consideration and inferring orthologs from target enrichment data, we identified four nodes in the backbone of Alchemilla s.l. with an elevated proportion of gene duplication. Furthermore, using a gene-tree reconciliation approach we established the autopolyploid origin of the entire Alchemilla s.l. and the nested allopolyploid origin of four major clades within the group. Here we showed the utility of automated tree-based orthology inference methods, previously designed for genomic or transcriptomic data sets, to study complex scenarios of polyploidy and reticulate evolution from target enrichment data sets.

Notes

This package contains the supplemental figures and tables referenced in the main text - Supplemental_material.zip

Supplemental_material.zip

Supplemental figures S1-S12
Supplemental tables S1-S6

It also contains the data and software outputs (e.g., fasta files, alignments, trees, logs) - Analyses_data.tar.gz

Analyses_data.tar.gz

1_original_fasta_files: Unaligned fasta files (*.fa)
2_raw_homologs: raw homolog trees. Alignments cleaned with Phyx (*.aln-cln) and RAxML bipartition-labeled trees (*.tre)
3_final_homologs: filtered homologs (monophyletic clades and paraphyletic grades of same species masked, spurious tips removed with TreeShrink)
4_orthologs
- 1_MO (monophyletic outgroups):
  - 1_fasta_to_trees: fasta files, alignments, ortholog trees
  - 2_analyses: ASTRAL, WGD mapping, Phyparts, QuartetSampling, RAxML
- 2_RT (rooted ingroups):
  - 1_fasta_to_trees: fasta files, alignments, ortholog trees
  - 2_analyses: ASTRAL, WGD mapping, Phyparts, QuartetSampling, RAxML
5_analyses_with_homologs
- 1_ASTRAL-pro: multilabeled homolog trees, ASTRAL-pro, Phyparts
- 2_grampa: GRAMPA analyses with complete and reduced trees
- 3_Phyparts_Informative_trees: MO and RT with all homologs and only with longest homologs
6_cpDNA: fasta files, alignment, QuartetSampling
7_Ks_plots: CDS and PEP files, Ks values (within-species and between-species)
8_other_files: HybPiper references, gene list, list of longest exons, gene annotations
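
The directory layout above can be checked against the downloaded archive without unpacking it. The following is a minimal sketch, assuming Python 3 and that Analyses_data.tar.gz is in the working directory; the function name summarize_archive is hypothetical and not part of the deposited data. It makes no assumption about whether the numbered folders sit at the archive root or under a wrapping directory, so increase depth if the first level shows a single folder.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_archive(archive_path, depth=1):
    """Count archive members under each leading path prefix of the given depth.

    Hypothetical helper for illustration; it only reads member names and
    does not extract anything to disk.
    """
    counts = Counter()
    # "r:*" lets tarfile detect the compression (gzip for a .tar.gz file).
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive:
            parts = PurePosixPath(member.name).parts
            if parts:
                counts["/".join(parts[:depth])] += 1
    return counts
```

For example, `summarize_archive("Analyses_data.tar.gz", depth=2)` would return a count of entries per second-level folder, which can be compared by eye with the listing above.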

If you have any questions, please do not hesitate to contact me at dfmoralesb@gmail.com

Diego F. Morales-Briones

Files (341.0 MB)

Name	MD5	Size
Analyses_data.tar.gz	4c82ac069378e2fe1a555401fd9c878d	339.0 MB
README.txt	60209b03d3cafaed1db22aae338418c2	1.8 kB
Supplemental_material.zip	5a855b7b0d60d82489c2d2cf1d967402	1.9 MB
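
The MD5 values listed for each file can be used to confirm that a download completed without corruption. The following is a minimal sketch, assuming Python 3 and that the three files were saved under their original names in one directory; the names md5_of_file and verify_downloads are hypothetical helpers, not part of the deposit.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing of this record.
EXPECTED_MD5 = {
    "Analyses_data.tar.gz": "4c82ac069378e2fe1a555401fd9c878d",
    "README.txt": "60209b03d3cafaed1db22aae338418c2",
    "Supplemental_material.zip": "5a855b7b0d60d82489c2d2cf1d967402",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

A file reported as missing or mismatched should be downloaded again before the archive is unpacked.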

Additional details

Is cited by: 10.1093/sysbio/syab032 (DOI)

	All versions	This version
Views	203	203
Downloads	144	144
Data volume	14.7 GB	14.7 GB
