Whole-Gene Database and Custom Scripts for a genomics paper: Pangolin Genomes Offer Key Insights and Resources for the World's Most Trafficked Wild Mammals (Published)
Creators
- 1. Laboratoire Evolution et Diversité Biologique (EDB)—UPS-CNRS-IRD, Université Toulouse III, 118 route de Narbonne, Bât. 4R1, 31062 Toulouse, France.
- 2. ISEM, Univ Montpellier, CNRS, IRD, Montpellier, France.
- 3. The State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Peking-Tsinghua Center for Life Sciences, Peking University, Beijing 100871, China.
- 4. Mammal Research Institute, Department Zoology & Entomology, University of Pretoria, Pretoria 0002, South Africa.
- 5. Laboratoire de Parasitologie et Ecologie, Université de Yaoundé I, Faculté des Sciences, Cameroon
Description
As part of our publication on the conservation genomics of pangolins (published in Molecular Biology and Evolution: https://doi.org/10.1093/molbev/msad190), we provide a database of orthologous whole-genes ranked by diversity amongst the eight pangolin species and a set of custom scripts created and implemented during the processing of the genomic data. More information on when and how these scripts were implemented can be found in the methods section or the pipeline outlined in Figure S1. Raw genomic data and assemblies can be found at NCBI (BioProject: PRJNA795390).
Database S1 MSA diversity (QN2):
An Excel file listing 3 238 orthologous whole-genes ranked from most diverse to least diverse amongst the eight pangolin species (Pholidota) based on mean pairwise identities (sheet "Whole-genes"). PhyKIT and DnaSP v6 outputs were merged based on gene ID with a custom script (Custom script 3) to obtain a range of diversity statistics per gene. Outlying genes with low mean pairwise identity (that is, high diversity between species) were removed and stored in a separate sheet within the Database called "Removed-Genes". Please refer to the methodology section of the manuscript for more information on the methods used to obtain this database. Column headings are based on PhyKIT (https://github.com/JLSteenwyk/PhyKIT) and DnaSP v6 (http://www.ub.edu/dnasp/) outputs. More information on these statistics and how they are estimated can be found in their respective manuals.
Custom script 1:
Obtains the average depth of the genome mapping and the percentage of the genome covered at a minimum or maximum depth of your choosing. samtools depth must be run beforehand to obtain the depth at each nucleotide position of the mapped genome. The script is written as a loop so that multiple samples can be run in parallel: 1a calls the loop using the id.txt of all the samples you want to run in 1b. If the loop is not required, you can run 1b by changing 'ind=$1' to 'ind=YOUR_SAMPLE'.
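The calculation described above can be sketched in Python as follows. This is a minimal illustration and not the deposited script: the function name depth_summary and its parameters are hypothetical, and it assumes the three-column, tab-separated output of samtools depth (reference name, position, depth), ideally produced with the -a option so that zero-depth positions are included in the denominator.

```python
import io


def depth_summary(lines, min_depth=1, max_depth=None):
    """Return (mean depth, percent of positions within the depth bounds).

    `lines` is any iterable of samtools-depth-style rows:
    reference name, 1-based position, depth (tab-separated).
    """
    total_positions = 0
    depth_sum = 0
    in_range = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        depth = int(fields[2])
        total_positions += 1
        depth_sum += depth
        # Count the position if it satisfies the chosen bounds.
        if depth >= min_depth and (max_depth is None or depth <= max_depth):
            in_range += 1
    if total_positions == 0:
        return 0.0, 0.0
    return depth_sum / total_positions, 100.0 * in_range / total_positions


# Toy input standing in for a real depth file (values are made up).
example = io.StringIO("chr1\t1\t0\nchr1\t2\t10\nchr1\t3\t20\nchr1\t4\t30\n")
mean_depth, pct_in_range = depth_summary(example, min_depth=10, max_depth=25)
print(f"mean depth: {mean_depth:.2f}, percent in range: {pct_in_range:.1f}")
```

For a real sample, the same function could be given an open file handle on the samtools depth output instead of the toy input.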
Custom script 2:
This script extracts from annotation files (GFF3) only those gene IDs found to be orthologous for mammals (based on OrthoMaM gene IDs), manipulates them to fit a BED format, removes genes with duplicate annotations, and then compares gene IDs across multiple references to obtain a BED file of genes common to all the references.
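The steps above can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not the deposited script: the function names are hypothetical, the gene identifier is assumed to sit in the GFF3 "Name" attribute, the ortholog list is assumed to be a plain set of IDs, and a gene annotated more than once in a reference is assumed to be dropped entirely rather than kept once.

```python
def gff3_genes_to_bed(gff3_lines, ortholog_ids):
    """Map gene ID -> BED record (chrom, start0, end, gene ID) for one reference."""
    seen = {}
    duplicated = set()
    for line in gff3_lines:
        if line.startswith("#"):
            continue  # skip GFF3 header and comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # keep gene features only
        attrs = dict(
            item.split("=", 1) for item in cols[8].split(";") if "=" in item
        )
        gene_id = attrs.get("Name")  # assumption: ID stored under "Name"
        if gene_id not in ortholog_ids:
            continue
        if gene_id in seen:
            duplicated.add(gene_id)  # flag genes annotated more than once
            continue
        # GFF3 is 1-based inclusive; BED is 0-based half-open.
        seen[gene_id] = (cols[0], int(cols[3]) - 1, int(cols[4]), gene_id)
    return {g: rec for g, rec in seen.items() if g not in duplicated}


def common_genes(beds):
    """Gene IDs present in every per-reference BED dictionary."""
    return set.intersection(*(set(bed) for bed in beds))


# Toy annotations for two references (coordinates are made up).
ref_a = [
    "##gff-version 3",
    "scaf1\tsrc\tgene\t100\t500\t.\t+\t.\tID=g1;Name=BRCA1",
    "scaf1\tsrc\tgene\t900\t1500\t.\t-\t.\tID=g2;Name=TP53",
    "scaf2\tsrc\tgene\t10\t400\t.\t+\t.\tID=g3;Name=TP53",
    "scaf2\tsrc\tgene\t600\t800\t.\t+\t.\tID=g4;Name=MYC",
]
ref_b = [
    "ctg7\tsrc\tgene\t50\t450\t.\t+\t.\tID=x1;Name=BRCA1",
    "ctg9\tsrc\tgene\t20\t620\t.\t+\t.\tID=x2;Name=TP53",
]
orthologs = {"BRCA1", "TP53"}

bed_a = gff3_genes_to_bed(ref_a, orthologs)
bed_b = gff3_genes_to_bed(ref_b, orthologs)
shared = common_genes([bed_a, bed_b])
for gene in sorted(shared):
    print("\t".join(str(field) for field in bed_a[gene]))
```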
Custom script 3:
This script provides a list of genes, each linked to all relevant information indicating diversity and selection pressures, extracted from PhyKIT and DnaSP v6 outputs (Figure S1a). The list can then be sorted by whichever diversity or selection index you prefer (our list can be viewed in Database S1).
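The merge-and-sort idea can be sketched in Python as follows. This is a simplified illustration, not the deposited script: the column names (gene_id, pairwise_identity_mean, Pi), the tab-separated layout, and the toy values are all assumptions, and the real PhyKIT and DnaSP v6 outputs would need to be reshaped into per-gene tables first.

```python
import csv
import io


def merge_by_gene(phykit_rows, dnasp_rows, key="gene_id"):
    """Join two per-gene tables on the gene ID, keeping genes found in both."""
    dnasp_by_gene = {row[key]: row for row in dnasp_rows}
    merged = []
    for row in phykit_rows:
        other = dnasp_by_gene.get(row[key])
        if other is None:
            continue  # gene missing from the second table
        combined = dict(row)
        combined.update({k: v for k, v in other.items() if k != key})
        merged.append(combined)
    return merged


def rank_by(rows, column, descending=False):
    """Sort merged rows numerically by the chosen statistic."""
    return sorted(rows, key=lambda r: float(r[column]), reverse=descending)


# Toy tables standing in for the real outputs (values are made up).
phykit_tsv = "gene_id\tpairwise_identity_mean\nG1\t0.95\nG2\t0.80\nG3\t0.99\n"
dnasp_tsv = "gene_id\tPi\nG1\t0.010\nG2\t0.045\nG4\t0.002\n"

phykit = list(csv.DictReader(io.StringIO(phykit_tsv), delimiter="\t"))
dnasp = list(csv.DictReader(io.StringIO(dnasp_tsv), delimiter="\t"))

merged = merge_by_gene(phykit, dnasp)
# Lowest mean pairwise identity first, i.e. most diverse genes at the top.
ranked = rank_by(merged, "pairwise_identity_mean")
for row in ranked:
    print(row["gene_id"], row["pairwise_identity_mean"], row["Pi"])
```

Sorting in ascending order of mean pairwise identity places the most diverse genes first, which matches the ordering described for Database S1. Passing a different column name to rank_by would sort by another index.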
Files
(953.1 kB)
Additional details
Funding
- PANGO-GO – Pangolins going extinct (PANGO-GO): Tracing the local-to-global trade of the most trafficked mammals on Earth with evolutionary-based toolkits ANR-17-CE02-0001
- Agence Nationale de la Recherche
- CeMEB – Mediterranean Center for Environment and Biodiversity ANR-10-LABX-0004
- Agence Nationale de la Recherche