|
Data File Descriptions and Methods
- Data File 1 [betacov_matching_IPR042578.fasta]: Representative set of 2,465 betacoronavirus S protein overlapping homologous superfamily sequences, retrieved in FASTA format on 4 December 2022 from the InterPro repository at https://www.ebi.ac.uk/interpro/entry/InterPro/IPR042578/.
- Data File 2 [betacov_matching_IPR042578_motif.fasta]: The 98,122 furin cleavage site (FCS) output motifs, each 20 amino acids long and including overlapping and redundant sequences, extracted from Data File 1 with the FindFur algorithm using preset parameters as described by Gu (2020). The version of FindFur used was deposited on 15 December 2020 in the GitHub software repository at https://github.com/chwisteeng/FindFur.
- Data File 3 [table_s1s2_hits_betacov_polyf.pdf]: Compiled summary table of sequence hits (PDF) of spike S1/S2 domains across the genus Betacoronavirus. To compile the table, sequences corresponding to spike protein fragments (incomplete-length spike proteins as deposited at GenBank) and duplicates (redundant parts overlapping identically within the 20-amino-acid motif windows) were removed from Data File 2, and one representative was then selected for each group of identical sequences. Collection dates and geographical locations were retrieved from the NCBI GenBank protein database at https://www.ncbi.nlm.nih.gov/protein/. For SARS-CoV-2 spike variants, these data were also cross-validated with the SARS-CoV-2 lineage mutation tracker (Gangavarapu et al., 2023) available at https://outbreak.info, which is based on extensive sequencing data from the global GISAID initiative (https://gisaid.org/). In the table, solid lines (-) mark pat7 nuclear localization signals (NLS), asterisks (*) mark O-glycosites, and circumflex symbols (^) mark FCS.
- Data File 4 [table_s1s2_hits_betacov_polyf.xlsx]: Compiled summary table of sequence hits (MS Excel) of spike S1/S2 domains across the genus Betacoronavirus. To compile the table, sequences corresponding to spike protein fragments (incomplete-length spike proteins as deposited at GenBank) and duplicates (redundant parts overlapping identically within the 20-amino-acid motif windows) were removed from Data File 2, and one representative was then selected for each group of identical sequences. Collection dates and geographical locations were retrieved from the NCBI GenBank protein database at https://www.ncbi.nlm.nih.gov/protein/. For SARS-CoV-2 spike variants, these data were also cross-validated with the SARS-CoV-2 lineage mutation tracker (Gangavarapu et al., 2023) available at https://outbreak.info, which is based on extensive sequencing data from the global GISAID initiative (https://gisaid.org/). In the table, solid lines (-) mark pat7 nuclear localization signals (NLS), asterisks (*) mark O-glycosites, and circumflex symbols (^) mark FCS.
- Data File 5 [betacov_s1s2_nls_pat7_furin_psort.txt]: Nuclear localization signal (NLS) detection output for five representative betacoronavirus spike sequence domains, including the positive hits for pat7 in SARS-CoV-2 and in MERS-MA30 CoV. NLS predictions used the PSORT algorithm, available as a web service at https://wolfpsort.hgc.jp/ and based on the work of Nakai and Horton (1999). Numbering refers to Data File 3 and Data File 4.
- Data File 6 [betacov_s1s2_oglyc_netogly.txt]: Detection output for five representative betacoronavirus spike sequence domains tested for Thr/Ser O-glycosite residue pairs with the standard prediction software NetOGlyc 4.0 (Steentoft et al., 2013), available at https://services.healthtech.dtu.dk/services/NetOGlyc-4.0/. Positive hits have scores above 0.5. Numbering refers to Data File 3 and Data File 4.
- Data File 7 [betacov_s1s2_nls_pat7_furin_blastp.txt]: Comprehensive sequence database searches were performed with the NCBI protein BLAST (blastp) algorithm, using the web service available at https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins. The following blastp search parameters and settings were used: Word size=2; Expect value=200000; Hitlist size=500; Gapcosts=9,1; Matrix=PAM30; Filter string=F; Genetic Code=1; Window Size=40; Threshold=11; Composition-based stats=0; Database Posted date=Jan 19, 2023 2:59 AM; Number of letters=17,117,563; Number of sequences=10,766; Entrez query: Includes: Betacoronavirus (taxid:694002); Excludes: SARS-CoV-2 (taxid:2697049). The six polyfunctional input query sequences were derived from the two consensus motifs TXXPR(K/H/R)XRSX and TXXPRX(K/H/R)RSX, with one query per residue alternative in parentheses.
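
The relation between the two consensus motifs and the six query sequences in Data File 7 can be illustrated with a short Python sketch. It is not the procedure used to produce the data files. It only expands each (K/H/R) group into its three alternatives and, under the assumption that X stands for any single residue, turns each query into a regular expression for scanning a protein string. The function names and the test fragment are hypothetical and were chosen for illustration.

```python
import itertools
import re

# Consensus motifs as written in the Data File 7 description.
CONSENSUS_MOTIFS = ["TXXPR(K/H/R)XRSX", "TXXPRX(K/H/R)RSX"]


def expand_consensus(pattern):
    """Return one concrete query string per alternative in each (A/B/C) group."""
    # re.split with a capturing group keeps the "(K/H/R)" tokens in the result.
    parts = re.split(r"(\([A-Z/]+\))", pattern)
    choices = [p[1:-1].split("/") if p.startswith("(") else [p] for p in parts]
    return ["".join(combo) for combo in itertools.product(*choices)]


def query_to_regex(query):
    """Assumption: X stands for any single residue, so it maps to '.'."""
    return re.compile(query.replace("X", "."))


def find_hits(sequence, queries):
    """Return sorted (start, matched_text, query) tuples for every query match."""
    hits = []
    for query in queries:
        for match in query_to_regex(query).finditer(sequence):
            hits.append((match.start(), match.group(), query))
    return sorted(hits)


# Two motifs with three alternatives each give six query sequences.
QUERIES = [q for motif in CONSENSUS_MOTIFS for q in expand_consensus(motif)]

if __name__ == "__main__":
    for q in QUERIES:
        print(q)
    # Illustrative fragment only, not read from any of the data files.
    print(find_hits("QTQTNSPRRARSVASQ", QUERIES))
```

A regular-expression scan of this kind finds only exact pattern matches with 0-based start positions. The blastp searches described above additionally score substitutions and gaps, so their hit lists are not expected to coincide with such a scan.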
References
Gu, C., 2020. FindFur: A Tool for Predicting Furin Cleavage Sites of Viral Envelope Substrates. Master’s Thesis, San Jose State University, CA, USA. doi: 10.31979/etd.4ahv-9jya
Gangavarapu, K., Latif, A.A., Mullen, J.L., Alkuzweny, M., Hufbauer, E., Tsueng, G., Haag, E., Zeller, M., Aceves, C.M., Zaiets, K., Cano, M., Zhou, X., Qian, Z., Sattler, R., Matteson, N.L., Levy, J.I., Lee, R.T.C., Freitas, L., Maurer-Stroh, S., GISAID Core and Curation Team, Suchard, M.A., Wu, C., Su, A.I., Andersen, K.G., Hughes, L.D., 2023. Outbreak.info genomic reports: scalable and dynamic surveillance of SARS-CoV-2 variants and mutations. Nat Methods 20, 512–522. doi: 10.1038/s41592-023-01769-3
Nakai, K., Horton, P., 1999. PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. Trends Biochem Sci 24, 34–36. doi: 10.1016/s0968-0004(98)01336-x
Steentoft, C., Vakhrushev, S.Y., Joshi, H.J., Kong, Y., Vester-Christensen, M.B., Schjoldager, K.T.-B.G., Lavrsen, K., Dabelsteen, S., Pedersen, N.B., Marcos-Silva, L., Gupta, R., Bennett, E.P., Mandel, U., Brunak, S., Wandall, H.H., Levery, S.B., Clausen, H., 2013. Precision mapping of the human O-GalNAc glycoproteome through SimpleCell technology. EMBO J 32, 1478–1488. doi: 10.1038/emboj.2013.79
|