Data for "Learning the language of viral evolution and escape"

Published September 14, 2020 | Version v1

Dataset Open

Training data from:

Influenza A HA protein sequences from the NIAID Influenza Research Database (IRD) (http://www.fludb.org)
HIV-1 Env protein sequences from the Los Alamos National Laboratory (LANL) HIV database (https://www.hiv.lanl.gov)
Coronaviridae spike protein sequences from the Virus Pathogen Resource (ViPR) database (https://www.viprbrc.org/brc/home.spg?decorator=corona)
SARS-CoV-2 Spike protein sequences from NCBI Virus (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/)
SARS-CoV-2 Spike and other Betacoronavirus spike protein sequences from GISAID (https://www.gisaid.org/)

Datasets for fitness and escape validation:

Fitness single-residue DMS of HA H1 WSN33 from Doud and Bloom (2016)
Fitness combinatorial DMS of antigenic site B in six HA H3 strains from Wu et al. (2020)
Fitness single-residue DMS of Env BF520 and BG505 from Haddox et al. (2018)
ACE2 binding affinity combinatorial DMS of Spike from Starr et al. (2020)
Escape single-residue DMS of HA H1 WSN33 from Doud et al. (2018)
Escape single-residue DMS of HA H3 Perth09 from Lee et al. (2019)
Escape single-residue DMS of Env BG505 from Dingens et al. (2019)
Escape mutations of Spike from Baum et al. (2020)
Escape single-residue DMS of Spike from Greaney et al. (2020)

Files

Name	Size	MD5
data.tar.gz	93.3 MB	c11f2718094e36b06f1e400e2dfff946
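
Because the listing publishes an MD5 checksum for data.tar.gz, a downloaded copy can be checked before use. The following is a minimal sketch, not part of the original record: it assumes the archive has been saved locally as data.tar.gz in the working directory (a hypothetical path), and the helper names md5_of_file and list_verified_archive are illustrative. It only lists the archive members rather than extracting them, since the internal layout of the archive is not described here.

```python
import hashlib
import tarfile
from pathlib import Path

# Checksum published in the file listing for data.tar.gz.
EXPECTED_MD5 = "c11f2718094e36b06f1e400e2dfff946"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def list_verified_archive(path, expected_md5=EXPECTED_MD5):
    """Check the archive against the expected MD5, then return its member names."""
    actual = md5_of_file(path)
    if actual != expected_md5:
        raise ValueError(f"MD5 mismatch for {path}: expected {expected_md5}, got {actual}")
    with tarfile.open(path, "r:gz") as tar:
        return tar.getnames()


if __name__ == "__main__":
    archive = Path("data.tar.gz")  # assumed local download location
    if archive.exists():
        for name in list_verified_archive(archive):
            print(name)
    else:
        print(f"{archive} not found; download it from the record first.")
```

A mismatch raises ValueError, which usually indicates an incomplete or corrupted download. MD5 is used here only because it is the checksum the record provides; it guards against transfer errors, not deliberate tampering.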