Published 2024 | Version v2
Dataset Open

Underlying data for "Mapping glycoprotein structure reveals Flaviviridae evolutionary history"

Description

This repository houses the underlying data for "Mapping glycoprotein structure reveals Flaviviridae evolutionary history", authored by Jonathon C.O. Mifsud, Spyros Lytras, Michael R. Oliver, Kamilla Toon, Vincenzo A. Costa, Edward C. Holmes, and Joe Grove.

The dataset is organised into several directories:

- flaviviridae_foldseek_output: Contains the Foldseek output and the parsing scripts used to extract the lowest e-value hit for each taxon and reference.

- flaviviridae_structure_blocks: Contains the Flaviviridae structures generated by ColabFold and ESMFold. Structures are organised by taxon and numbered by block. Polyprotein sequences were broken into 300-residue blocks, each overlapping by 100 residues. Numbering starts at Block_0 (residues 1-300) and continues sequentially (e.g. Block_1 = residues 100-400, Block_2 = residues 200-500, ...). This dataset constitutes the Flaviviridae protein foldome referred to in the main text.

- foldseek_reference_structures: Contains all structures used as references in the Foldseek analysis, including the Bole Tick Virus 4 proteins described in Figure 3.

- glycoprotein_structural_alignments_and_trees: Contains all files needed to replicate the trees for the E, E1 and E2 glycoproteins. The underlying code can be found in structural_alignments_code.ipynb. This directory also contains the complete glycoprotein structure predictions (refolded_fullglyco).

- ns5b_alignments_and_trees: Contains all alignment files, both trimmed and untrimmed, for the NS5b RdRp. These include alignment variants produced with different parameters and methods, as well as the alignments used in the stratified MUSCLE analysis. Related scripts are also included.

- sequence_benchmarks: Contains the files and scripts underlying the sequence benchmark analysis.

- sequences: Holds sequence files, including full genome sequences of Flaviviridae in .fasta format, novel sequences identified in our study, and protein sequences extracted for alignment purposes. It also contains the script for creating the sequence blocks used in the main analyses.

- stratified_MUSCLE_analysis: Contains the files and scripts needed to replicate the stratified MUSCLE analysis. The underlying tree files are located in ns5b_alignments_and_trees.

- t2rnase_alignments_and_trees: Contains all alignment and tree files, both trimmed and untrimmed, for RNase T2.

- tables: Provides metadata tables, including InterPro domain annotations, RNase T2 analysis summaries, phylogenetic model finder results for the glycoprotein structural phylogenetics, and novel viruses identified through data mining.

- workflows: Contains PDF flowchart diagrams illustrating the workflows behind the main analyses performed in our study. To orient readers, the diagrams refer to the underlying data and scripts (as included in this repository) and to the resulting figure panels in the paper.
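
The block numbering described for flaviviridae_structure_blocks can be converted into residue ranges with a short helper. The Python sketch below is not part of the repository: the function name is hypothetical, and the 100-residue step between block starts is an assumption inferred from the example ranges quoted above (Block_0 = 1-300, Block_1 = 100-400, Block_2 = 200-500). The block-creation script in the sequences directory remains the authoritative definition, and the final block of a polyprotein may be shorter than the range returned here.

```python
def block_residue_range(block_number, block_length=300, step=100):
    """Return the (start, end) residues, 1-based and inclusive, for Block_<block_number>.

    Assumption: block starts advance by `step` residues, which reproduces
    the example ranges given in the dataset description. This is an
    illustrative helper, not the authors' script.
    """
    if block_number < 0:
        raise ValueError("block_number must be non-negative")
    # Block_0 starts at residue 1; later blocks start at block_number * step.
    start = max(1, block_number * step)
    end = block_number * step + block_length
    return start, end


if __name__ == "__main__":
    for n in range(3):
        start, end = block_residue_range(n)
        print(f"Block_{n} = residues {start}-{end}")
```

Running the script prints the first three ranges so they can be compared against the examples in the description before the helper is applied to other block numbers.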

Note: structure and sequence names are prefixed with a four-letter code denoting the sub-clade or classification of the taxon.

Flavi-Jingmen Clade
FJTB = Flavi-Jingmen Tick-Borne
FJMB = Flavi-Jingmen Mosquito-Borne
FJNV = Flavi-Jingmen No Known Vector
FJIS = Flavi-Jingmen Insect Only
FJAF = Flavi-Jingmen Aquatic Flavivirus
FJFL = Flavi-Jingmen Flavi-Like
FJJI = Flavi-Jingmen Jingmenvirus
FJUN = Flavi-Jingmen Unclassified

Pesti-LGF Clade
PLLG = Pesti-LGF Large Genome Flavivirus
PLPV = Pesti-LGF Pestivirus
PLUN = Pesti-LGF Unclassified

Hepaci-Pegi Clade
HPPV = Hepaci-Pegi Pegivirus
HPHV = Hepaci-Pegi Hepacivirus
HPUN = Hepaci-Pegi Unclassified

TOMB = Tombusvirus outgroup
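
When processing many structure or sequence names, the prefix codes listed above can be held in a lookup table. The following Python sketch assumes the code occupies the first four characters of each name, as the note above suggests; the function name and the example name in the usage line are hypothetical and do not come from the repository.

```python
# Four-letter prefix -> (clade, classification), transcribed from the list above.
PREFIX_CODES = {
    "FJTB": ("Flavi-Jingmen", "Tick-Borne"),
    "FJMB": ("Flavi-Jingmen", "Mosquito-Borne"),
    "FJNV": ("Flavi-Jingmen", "No Known Vector"),
    "FJIS": ("Flavi-Jingmen", "Insect Only"),
    "FJAF": ("Flavi-Jingmen", "Aquatic Flavivirus"),
    "FJFL": ("Flavi-Jingmen", "Flavi-Like"),
    "FJJI": ("Flavi-Jingmen", "Jingmenvirus"),
    "FJUN": ("Flavi-Jingmen", "Unclassified"),
    "PLLG": ("Pesti-LGF", "Large Genome Flavivirus"),
    "PLPV": ("Pesti-LGF", "Pestivirus"),
    "PLUN": ("Pesti-LGF", "Unclassified"),
    "HPPV": ("Hepaci-Pegi", "Pegivirus"),
    "HPHV": ("Hepaci-Pegi", "Hepacivirus"),
    "HPUN": ("Hepaci-Pegi", "Unclassified"),
    # The outgroup does not belong to any of the three clades.
    "TOMB": (None, "Tombusvirus outgroup"),
}


def classify_name(name):
    """Return (clade, classification) for a prefixed name, or None if the prefix is unknown.

    Assumption: the four-letter code is the first four characters of the name.
    """
    return PREFIX_CODES.get(name[:4])


if __name__ == "__main__":
    # "HPHV_hypothetical_name" is a made-up name used only for illustration.
    print(classify_name("HPHV_hypothetical_name"))
```

Returning None for an unrecognised prefix, rather than raising an error, makes it easy to collect and inspect names that do not follow the scheme.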

For any queries, please contact the corresponding author at joe.grove@glasgow.ac.uk

Files

Mifsud_et_al_Underlying_Data.zip (1.8 GB)
md5:5e7d4f1553ad6cad344e94adf9485a22
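
Because the archive is large, it may be worth confirming that a download is intact by comparing it with the MD5 checksum listed above. The Python sketch below uses only the standard library; the function names are hypothetical, and the path in the usage note assumes the archive was saved under its original file name in the current directory.

```python
import hashlib

# Checksum as listed for Mifsud_et_al_Underlying_Data.zip.
EXPECTED_MD5 = "5e7d4f1553ad6cad344e94adf9485a22"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, verify_download("Mifsud_et_al_Underlying_Data.zip") should return True for an uncorrupted copy of the archive.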

Additional details

Funding

Wellcome Trust: Studying hepatitis C virus to determine how viruses harness structural disorder to control entry and antibody resistance (107653/Z/15/A)
National Health and Medical Research Council: Investigator Award (GNT2017197)