January 2021
RL Warren
rwarren at bcgsc dot ca


Data supporting: Interactive SARS-CoV-2 mutation timemaps

*If you use our data, please cite:
Warren RL and Birol I. Interactive SARS-CoV-2 mutation timemaps [version 1; peer review: awaiting peer review]. F1000Research 2021, 10:68 (https://doi.org/10.12688/f1000research.50857.1)

*MAPS are available at:
https://bcgsc.github.io/SARS2/

Open access repository:
https://doi.org/10.5281/zenodo.4469840

-------------------------------
Content of directory:

1) SARS-CoV-2_gisaid-ntedit-mutation_nonredundantlist.txt
2) SARS-CoV-2_gisaid-ntedit-mutation_count-effect.tsv
3) SARS-CoV-2_gisaid-ntedit-mutation_count-effectRegions.tsv

-------------------------------

(1) text file
reporting on ntedit-derived* variants, preceded by the date of file creation
*obtained by running (more details here: https://arxiv.org/abs/2012.15697):
A) nthits -b 36 --outbloom -c 1 -p seq -k 25 -t 24 @reads.in (where reads.in is a file of filenames pointing to individual GISAID genomes)
B) ntedit -f wuhan-hu1.fa -r seq_k25.bf -b nteditOUT -t 12 -i 5 -d 5 -m 1 (where wuhan-hu1.fa is the reference SARS-CoV-2, COVID-19 human isolate MN908947.3)

-----
fields:
-----
uniqueGISAIDid;nucleotide variant list (identified by ntedit, each followed by a comma)
e.g.
hCoV-19/England/ALDP-BC7626/2020|EPI_ISL_721439|2020-11-23;C241T,A1163T,C3037T,T3256C,C5622T,G11083T,G14202T,C14408T,G19542T,C19718T,C22388T,A23403G,G24764T,C26060T,G29227T,C29466T,A29771G,
-----
The uniqueGISAIDid is composed of
host/region/centerID/year|uniqueGenomeSubmissionID|collectionDate
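A record in file (1) could be split into its parts along these lines. This is only a minimal sketch: the names GenomeRecord and parse_record are hypothetical (not part of the dataset or of ntedit), and it assumes each record follows the layout shown above, with a trailing comma after the last variant. The first line of the file (the file creation date) would need to be skipped before calling it.

```python
from typing import List, NamedTuple


class GenomeRecord(NamedTuple):
    """One line of file (1); the field names are illustrative only."""
    name: str             # host/region/centerID/year
    accession: str        # uniqueGenomeSubmissionID
    collection_date: str  # collectionDate, kept as text (may be incomplete)
    variants: List[str]   # nucleotide variants such as C241T


def parse_record(line: str) -> GenomeRecord:
    # The identifier and the variant list are separated by the first ';'.
    identifier, _, variant_field = line.strip().partition(";")
    # The identifier itself has three '|'-separated parts.
    name, accession, collection_date = identifier.split("|")
    # Drop the empty string produced by the trailing comma.
    variants = [v for v in variant_field.split(",") if v]
    return GenomeRecord(name, accession, collection_date, variants)


example = ("hCoV-19/England/ALDP-BC7626/2020|EPI_ISL_721439|2020-11-23;"
           "C241T,A1163T,C3037T,")
record = parse_record(example)
print(record.accession, record.collection_date, record.variants)
```

The collection date is left as a string on purpose, since some GISAID records carry incomplete dates.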

The variants are extracted from ntedit's vcf. Some complex variations appear as co-occurring indels in this list, because of how ntedit computes variations that are in close proximity (within word k, see note of caution below). Notable ones are reported as follows:

(1) text file              : (2) tsv file
AGGATG23400A, T23401TAGGTG : GGA23401AGG
T21570TGTTTT, TTTCTT21572T : T21570G, C21575T
AGGGGA28880A, A28881AAACGA : GGG28881AAC
A21550ACTCTA, CTAAAC21552C : AA21550CT
G25563GTAAC, GAGCG25563G   : GAG25563TAA
A23076AGGAA, AACCA23079A   : ACC23076GGA

-------------------------------

(2) and (3) tab-separated values (TSV) files
reporting on mutation count frequency, preceded by the date of file creation

-----
fields:
-----
variant	change	genome_sample_count	+%chg	gene	product	init_date	init_id	init_region
-----

variant = the nucleotide variant in the form WAScoordinateIS, where WAS=reference base(s), coordinate=base coordinate in the reference genome, and IS=changed base(s)

change = the predicted amino acid change, or silent if the nucleotide change is predicted to have no effect. The predictions take into consideration ORF1ab ribosome slippage. In case of complex variations (e.g. indels or 2+ sequential substitutions), the first change is reported

genome_sample_count = the number of distinct GISAID genomes with the recorded variant (since Jan 1, 2020, using all available GISAID genomes from human hosts submitted to date). Very low counts are not reliable and are provided as-is for transparency. They may reflect real emerging variants or spurious variant calls, and warrant further scrutiny.

+%chg = percent increase in variant genome count since the date specified by "previous=" in the date header*

gene = gene name

product = protein/gene product

init_date = sample collection date on which the variant was first observed in acquired GISAID records

init_id = GISAID id of the first genome in which the variant was observed (Note: only the first encountered ID is reported, but there could be multiple genomes [IDs] with the variant on that day)

init_region = jurisdiction associated with initial id (init_id)
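
As an illustration of the fields above, file (2) could be read with Python's csv module and the variant column split into its WAS, coordinate and IS parts. This is a sketch under assumptions: read_rows and split_variant are hypothetical helper names, any line that does not have exactly nine tab-separated columns (such as the date header) is skipped, and the sample lines are made up to show the layout, not taken from the data. File (3) carries additional per-region columns and would need a different column list.

```python
import csv
import io
import re

FIELDS = ["variant", "change", "genome_sample_count", "+%chg", "gene",
          "product", "init_date", "init_id", "init_region"]

# WAS (letters), coordinate (digits), IS (letters), e.g. GGA23401AGG
VARIANT_RE = re.compile(r"^([A-Za-z]+)(\d+)([A-Za-z]+)$")


def split_variant(variant):
    """Return (WAS, coordinate, IS) for a variant such as A23403G."""
    match = VARIANT_RE.match(variant)
    if match is None:
        raise ValueError("unexpected variant format: %r" % variant)
    return match.group(1), int(match.group(2)), match.group(3)


def read_rows(handle):
    """Yield one dict per data row, skipping header-like lines."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != len(FIELDS) or row[0] == "variant":
            continue  # date header, column header, or malformed line
        yield dict(zip(FIELDS, row))


# Made-up content for illustration only (counts and ids are placeholders).
sample = io.StringIO(
    "DATE:COUNT:PCT previous=DATE:COUNT:PCT\n"
    "A23403G\tD614G\t100\t1.0\tS\tsurface glycoprotein\t"
    "2020-01-01\tEPI_ISL_XXXXXX\tSomewhere\n"
)
rows = list(read_rows(sample))
for row in rows:
    print(split_variant(row["variant"]), row["change"], row["gene"])
```

For a real file, `open(path, newline="")` would take the place of the io.StringIO object.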

-----
*In file (2) the date header has the following info:
-----
date:GISAIDrecordsProcessed:%change previous=date:GISAIDrecordsProcessed:%change
-----
date: the most up-to-date run/file creation date
GISAIDrecordsProcessed: count of all non-redundant GISAID records available (complete, high-coverage genomes from human hosts, excluding low-coverage records) that were processed for the variant analysis
%change: percent change in the number of genome records since the last time the pipeline ran (that run's info is found in previous=XX).
-----
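
The date header could be taken apart as sketched here. The helper names parse_run and parse_date_header are hypothetical, the example line uses made-up values, and the code assumes the counts are integers, the percentages are plain numbers (an optional trailing % is tolerated), and the two halves are separated by "previous=" exactly as in the template above.

```python
def parse_run(text):
    """Split 'date:GISAIDrecordsProcessed:%change' into a dict."""
    # rsplit from the right so that the date part is kept whole.
    date, records, pct = text.strip().rsplit(":", 2)
    return {
        "date": date,
        "records_processed": int(records),
        "pct_change": float(pct.rstrip("%")),
    }


def parse_date_header(line):
    """Return (current_run, previous_run) from the file (2) date header."""
    current, sep, previous = line.strip().partition("previous=")
    if not sep:
        raise ValueError("no 'previous=' field in header: %r" % line)
    return parse_run(current), parse_run(previous)


# Made-up values, for illustration of the layout only.
header = "2021-01-25:350000:2.5 previous=2021-01-18:341000:3.0"
current, previous = parse_date_header(header)
print(current)
print(previous)
```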

In file (3), the genome_sample_count is further broken down by region, and the date/id where the variant was first observed is reported for each. Discrepancies may exist between (2) and (3); they are attributable to an additional filter imposed in (3), which requires complete collection dates.


-------------------------------
NOTES

GISAID ancillary information is taken "as-is". We have observed instances where the collection date was likely mistyped and erroneous. We will filter out any records with incomplete dates, as well as those we suspect contain errors.

We caution that variants occurring within k of each other (the word size, k=25 in the commands above) may not be detected. With indel detection mode enabled, variants in close proximity may be reported as indels, which is how ntedit rectifies a discrepancy in polishing mode. We also note that only indels of at most 5 bases are reported; longer stretches of deleted bases may exist in SARS-CoV-2 genomes, but will not be reported. We urge users to investigate multiple approaches for variant detection.
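
One way to spot calls that may be affected by this caveat is to flag variants in a genome's list whose coordinates fall close together. The sketch below is an assumption-laden heuristic, not part of the ntedit pipeline: close_pairs is a hypothetical helper, "close" is taken to mean a coordinate difference smaller than k=25 (the -k value in the commands above), and only neighbouring variants in coordinate order are compared.

```python
import re

K = 25  # word size used in the nthits/ntedit commands above


def close_pairs(variants, k=K):
    """Return neighbouring variant pairs whose coordinates differ by < k."""
    # Pull the numeric coordinate out of each WAScoordinateIS string.
    parsed = sorted((int(re.search(r"\d+", v).group()), v) for v in variants)
    return [
        (first, second)
        for (pos_a, first), (pos_b, second) in zip(parsed, parsed[1:])
        if pos_b - pos_a < k
    ]


calls = ["C241T", "AGGATG23400A", "T23401TAGGTG", "C3037T"]
flagged = close_pairs(calls)
print(flagged)
```

Flagged pairs such as AGGATG23400A and T23401TAGGTG are candidates for the complex variations listed for file (1), and are worth checking with a second variant caller.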

The data is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
