There is a newer version of the record available.

Published May 16, 2023 | Version 1.0.0
Dataset Open

Data and code for 'Pseudogenes as a neutral reference for detecting selection in prokaryotic pangenomes'

  • 1. McGill University

Description

This repository contains the code and files for reproducing the analyses and results reported in 'Pseudogenes as a neutral reference for detecting selection in prokaryotic pangenomes' by Gavin M. Douglas, W. Ford Doolittle, and B. Jesse Shapiro.

File organization and descriptions:

  • code/ - Contains GitHub repository releases of the code used in the manuscript (the other folders contain data files only). This code is provided here as well as on GitHub to ensure long-term access.
    • handy_pop_gen-1.1.0/ - release v1.1.0 of the convenience repository (used for specific data processing and analysis steps referred to in the manuscript).
    • pangenome_pseudogene_null-1.0.0/ - Main code repository for manuscript.

  • broad_pangenome_analysis/
    • element_info/element_counts.tsv.gz - Counts of (filtered) pseudogenes and intact genes called per genome accession.
    • element_info/gene_sizes.tsv.gz - Gene sizes in base-pairs.
    • element_info/pseudogene_sizes.tsv.gz - Filtered pseudogene sizes in base-pairs.
    • element_info/element_percent_coverage/*tsv.gz - Tables containing the percent genome coverage of genes and pseudogenes, given per accession and, in separate tables, averaged over accessions per species.
    • example_Mycoplasmopsis_bovis_panaroo_output.csv.gz - Panaroo output table for Mycoplasmopsis bovis, which was used for an example. Corresponds to the gene_presence_absence.csv file in the raw Panaroo output.
    • focal_and_non.focal_full_to_short.tsv.gz - Mapfile of full to short (and unique) species ids used in the analysis. The short ids are used primarily to include species ids in cluster names without making them unnecessarily long.
    • genome_info/accessions.tsv.gz - Genome accessions used for the broad pangenome analysis. Note that some genome accessions could not be downloaded and were ignored; this is indicated in the "could_download" column.
    • genome_info/genome_sizes.tsv.gz - Sizes of all genomes used for the broad pangenome analysis.
    • model_output/pangenome_linear_models.rds - R Data Serialization file containing the fitted R linear model objects (generated by lm and provided as an R list object). There are separate elements in the list for the mean number of genes, genomic fluidity, percentage singletons (si), and si/sp.
    • model_output/linear_model_coef.tsv.gz - Coefficient summary table for all linear models.
    • pangenome_and_related_metrics.tsv.gz - Metrics used for the broad pangenome analysis across 670 prokaryotic species. Note that this table was filtered down to 668 species after excluding those with fewer than 9 genomes.
    • pangenome_and_related_metrics_filt.tsv.gz - Filtered table, as described above.
    • taxonomy.tsv.gz - Taxonomy for all species used for this analysis, taken from GTDB. Row names are species names.

  • indepth_10_species_analysis/
    • cluster_breakdown_tables/ - Folder containing tables that break down how clusters are distributed by element type, pangenome partition, and species. Provided for easy plotting.
    • cluster_member_breakdown.tsv.gz - Table with information on each element (called pseudogenes and intact genes), such as which cluster it is part of and which species and genome accession it is found in.
    • cluster_types.rds - R Data Serialization file containing an R list that breaks down all clusters into categories (intact/pseudogene/mixed, where mixed means containing both pseudogene and intact elements).
    • COG_enrichment_results/ultra.cloud-COG-gene-enrichments.tsv.gz - Output file with enrichment test summaries for COG IDs in significant COG categories, which was run for the ultra-cloud pangenome partition model only.
    • element_glmm_input.tsv.gz - Table containing all information used for fitting generalized linear mixed models.
    • focal_species.txt - Names of species used for the in-depth analysis.
    • genome_info/ - Folder containing the genome accessions (and the corresponding genome sizes) for all ten analyzed species.
    • glmm_output/ - Folder containing R Data Serialization files with the R objects output after fitting generalized linear mixed models (only the ultra-rare files are present, due to file size constraints).
    • per_genome_element.type_percent_coverages.rds - R Data Serialization file containing an R list with the percent coverage by intact genes vs pseudogenes per accession (nested by species).
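
The .tsv.gz tables listed above are gzip-compressed, tab-separated text, so they can be read without decompressing them first. The following is a minimal Python sketch using only the standard library. The helper names and the column names `species` and `num_genomes` are hypothetical placeholders chosen for illustration, not the actual headers of the files in this record; inspect the real header row before adapting it. If a table was written from R with row names, its header may have one fewer field than the data rows, which is worth checking before relying on the keys.

```python
import csv
import gzip
import os
import tempfile


def read_tsv_gz(path):
    """Read a gzip-compressed TSV file into a list of dicts keyed by header."""
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def filter_min_count(rows, column, minimum):
    """Keep rows whose integer value in `column` is at least `minimum`."""
    return [row for row in rows if int(row[column]) >= minimum]


# Demonstration on a small synthetic table (NOT data from this record).
# The column names below are hypothetical.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tsv.gz")
with gzip.open(demo_path, mode="wt", newline="", encoding="utf-8") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["species", "num_genomes"])
    writer.writerow(["species_A", "12"])
    writer.writerow(["species_B", "5"])

demo_rows = read_tsv_gz(demo_path)
demo_filtered = filter_min_count(demo_rows, "num_genomes", 9)
print(len(demo_rows), len(demo_filtered))  # prints "2 1"
```

The same pattern could be applied to a table such as pangenome_and_related_metrics.tsv.gz once the relevant column name is known. The .rds files are R-specific and are most simply read in R with readRDS.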

Files

Files (6.6 GB)

  • md5:10aaa3ba14d14ca1a6b3288877eb5f64 - 226.8 MB
  • md5:566cde1693c884b6da5a0c931e288ab0 - 87.1 kB
  • md5:002df8299c0c5dacd339de11728ea8e6 - 6.3 GB
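
Because the largest file is several gigabytes, it may be worth checking each download against the MD5 checksums listed above. The sketch below shows one way to do this in Python with the standard library. The function names are hypothetical, and since the file names are not shown in this listing, the path passed in is whatever name the downloaded file has locally.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use low for multi-gigabyte files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with a value such as 'md5:abc...'."""
    expected_hex = expected.strip().lower().split(":", 1)[-1]
    return md5_of_file(path) == expected_hex


# Demonstration on a small synthetic file (NOT a file from this record).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(demo_path, "wb") as out:
    out.write(b"example bytes")
demo_expected = "md5:" + hashlib.md5(b"example bytes").hexdigest()
print(matches_checksum(demo_path, demo_expected))  # prints "True"
```

A mismatch most likely indicates an incomplete or corrupted download, in which case the file should be fetched again.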