Underlying data for "Mapping glycoprotein structure reveals Flaviviridae evolutionary history"
Description
This repository houses the underlying data for "Mapping glycoprotein structure reveals Flaviviridae evolutionary history", authored by Jonathon C.O. Mifsud, Spyros Lytras, Michael R. Oliver, Kamilla Toon, Vincenzo A. Costa, Edward C. Holmes, and Joe Grove.
The dataset is organised into several directories:
- flaviviridae_foldseek_output: Contains the Foldseek output and the parsing scripts used to extract the lowest E-value hit for each taxon and reference.
- flaviviridae_structure_blocks: Contains the Flaviviridae structures generated by ColabFold and ESMFold. Structures are organised by taxon and numbered by block. Polyprotein sequences were broken into 300-residue blocks, each overlapping by 100 residues. Numbering starts at Block_0 (residues 1-300) and continues sequentially (e.g. Block_1 = residues 100-400, Block_2 = residues 200-500, ...). This dataset constitutes the Flaviviridae protein foldome referred to in the main text.
- foldseek_reference_structures: Contains all structures used as references in the Foldseek analysis, including the Bole Tick Virus 4 proteins described in Figure 3.
- glycoprotein_structural_alignments_and_trees: Contains all files needed to replicate the trees for the E, E1 and E2 glycoproteins. The underlying code can be found in structural_alignments_code.ipynb. This directory also contains the complete glycoprotein structure predictions (refolded_fullglyco).
- ns5b_alignments_and_trees: Contains all alignment files, both trimmed and untrimmed, for NS5b RdRp. These include variations of alignments using different parameters, methods and those used in the stratified MUSCLE analysis. Also includes related scripts.
- sequence_benchmarks: Contains the files and scripts underlying the sequence benchmark analysis.
- sequences: Holds sequence files, including full genome sequences of Flaviviridae in .fasta format, novel sequences identified in our study, and protein sequences extracted for alignment purposes. It also contains the script for creating the sequence blocks used in the main analyses.
- stratified_MUSCLE_analysis: Contains the files and scripts needed to replicate the stratified MUSCLE analysis. The underlying tree files are located in ns5b_alignments_and_trees.
- t2rnase_alignments_and_trees: Contains all alignment and tree files, both trimmed and untrimmed, for RNase T2 (t2rnase).
- tables: Provides metadata tables, including InterPro domain annotations, RNase T2 analysis summaries, phylogenetic model finder output for the glycoprotein structural phylogenetics, and novel viruses identified through data mining.
- workflows: PDF flowchart diagrams illustrating the workflows behind the main analyses performed in our study. To orient readers, the diagrams refer to the underlying data and scripts (as included in this repository) and to the resulting figure panels in the paper.
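The block numbering described for flaviviridae_structure_blocks can be illustrated with a minimal Python sketch. This is not the script shipped in the sequences directory, which remains the authoritative implementation. The function name `block_ranges`, the default `step=100` (taken from the listed examples, where each block starts 100 residues after the previous one), the exact boundary convention, and the handling of a short final block are all assumptions made for illustration.

```python
def block_ranges(seq_len, block_size=300, step=100):
    """Return (block_name, first_residue, last_residue) tuples.

    Residue positions are 1-based and inclusive. block_size and step are
    assumptions: step=100 follows the examples in the description
    (Block_0 starts at residue 1, Block_1 near 100, Block_2 near 200).
    """
    blocks = []
    index = 0
    start = 0  # 0-based offset of the first residue in the block
    while start < seq_len:
        end = min(start + block_size, seq_len)  # truncate the last block (assumption)
        blocks.append((f"Block_{index}", start + 1, end))
        if end == seq_len:
            break  # the sequence is fully covered
        index += 1
        start += step
    return blocks


if __name__ == "__main__":
    # A hypothetical 500-residue polyprotein
    for name, first, last in block_ranges(500):
        print(name, first, last)
```

Under these assumptions, Block_1 covers residues 101-400 rather than 100-400; if the repository's blocks differ by one residue at the boundaries, adjust the offsets to match the block-creation script.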
Note: structure and sequence names are prefixed with a four-letter code denoting the sub-clade/classification of the taxon.
Flavi-Jingmen Clade
FJTB = Flavi-Jingmen Tick-Borne
FJMB = Flavi-Jingmen Mosquito-Borne
FJNV = Flavi-Jingmen No Known Vector
FJIS = Flavi-Jingmen Insect Only
FJAF = Flavi-Jingmen Aquatic Flavivirus
FJFL = Flavi-Jingmen Flavi-Like
FJJI = Flavi-Jingmen Jingmenvirus
FJUN = Flavi-Jingmen Unclassified
Pesti-LGF Clade
PLLG = Pesti-LGF Large Genome Flavivirus
PLPV = Pesti-LGF Pestivirus
PLUN = Pesti-LGF Unclassified
Hepaci-Pegi Clade
HPPV = Hepaci-Pegi Pegivirus
HPHV = Hepaci-Pegi Hepacivirus
HPUN = Hepaci-Pegi Unclassified
TOMB = Tombusvirus outgroup
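When working with the files programmatically, the four-letter codes above can be turned into a lookup table. The sketch below is illustrative only: the function name `classify` and the example file name are hypothetical, and because the separator following the code is not specified here, the function reads just the first four characters of the name.

```python
# Four-letter prefix -> (clade, sub-clade/classification), copied from the list above.
PREFIXES = {
    "FJTB": ("Flavi-Jingmen", "Tick-Borne"),
    "FJMB": ("Flavi-Jingmen", "Mosquito-Borne"),
    "FJNV": ("Flavi-Jingmen", "No Known Vector"),
    "FJIS": ("Flavi-Jingmen", "Insect Only"),
    "FJAF": ("Flavi-Jingmen", "Aquatic Flavivirus"),
    "FJFL": ("Flavi-Jingmen", "Flavi-Like"),
    "FJJI": ("Flavi-Jingmen", "Jingmenvirus"),
    "FJUN": ("Flavi-Jingmen", "Unclassified"),
    "PLLG": ("Pesti-LGF", "Large Genome Flavivirus"),
    "PLPV": ("Pesti-LGF", "Pestivirus"),
    "PLUN": ("Pesti-LGF", "Unclassified"),
    "HPPV": ("Hepaci-Pegi", "Pegivirus"),
    "HPHV": ("Hepaci-Pegi", "Hepacivirus"),
    "HPUN": ("Hepaci-Pegi", "Unclassified"),
    "TOMB": (None, "Tombusvirus outgroup"),  # not part of the three clades
}


def classify(name):
    """Return (clade, classification) for a structure or sequence name.

    Only the first four characters are inspected; returns None when the
    prefix is not one of the documented codes.
    """
    return PREFIXES.get(name[:4])


if __name__ == "__main__":
    # "FJTB_example" is a made-up name used purely for illustration
    print(classify("FJTB_example"))
```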
For any queries, please contact the corresponding author at joe.grove@glasgow.ac.uk
Files
Mifsud_et_al_Underlying_Data.zip (1.8 GB)
md5:5e7d4f1553ad6cad344e94adf9485a22
Additional details
Funding
- Wellcome Trust: Studying hepatitis C virus to determine how viruses harness structural disorder to control entry and antibody resistance (107653/Z/15/A)
- National Health and Medical Research Council: Investigator Award (GNT2017197)