Published March 15, 2026 | Version v1
Dataset Open

SARS-CoV-2 Surface Glycoprotein Sequences, NCBI Data Hub, October 2021 (ViralEntropR archive)

Description

Archive of SARS-CoV-2 surface glycoprotein (Spike protein) amino acid sequences downloaded from the NCBI SARS-CoV-2 Data Hub on October 12, 2021. 

Downloaded with the following filters:
  - Organism: Severe acute respiratory syndrome coronavirus 2 (taxid: 2697049)
  - Nucleotide completeness: complete
  - Protein: surface glycoprotein

Result: 137,132 sequences, 173 MB uncompressed FASTA
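
The stated record count can be checked after download by counting FASTA header lines. The following is a minimal Python sketch, not part of ViralEntropR; the function name `count_fasta_records` and the file name in the trailing comment are hypothetical, and the demonstration uses two made-up records rather than real data.

```python
import io

EXPECTED_RECORDS = 137_132  # record count stated in this description


def count_fasta_records(lines):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


# Tiny in-memory demonstration with two made-up records.
demo = io.StringIO(
    ">seq1 surface glycoprotein\nMFVFLVLLPLVSS\n"
    ">seq2 surface glycoprotein\nMFVFLVLLPLVSS\n"
)
demo_count = count_fasta_records(demo)
print(demo_count)

# For the real archive, assuming it was saved under a hypothetical local name:
# with open("spike_sequences.fasta", encoding="utf-8") as fh:
#     assert count_fasta_records(fh) == EXPECTED_RECORDS
```

Counting header lines streams the file one line at a time, so the 173 MB FASTA does not need to fit in memory.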

Original data source: NCBI Virus (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/). Original data is a US Government work and is in the public domain within the United States. Data from international contributors is subject to the INSDC open-access policy (https://www.insdc.org/about-insdc/).

Archived as a static snapshot for reproducibility of analyses in the ViralEntropR R package.

Cited as:

  • Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW (2016). GenBank. Nucleic Acids Research. 44(D1):D67-D72. doi:10.1093/nar/gkv1276
  • Sayers EW, Bolton EE, Brister JR, et al. (2022). Database resources of the National Center for Biotechnology Information. Nucleic Acids Research. 50(D1):D20-D26. doi:10.1093/nar/gkab1112
  • NCBI Virus [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; [2020] - [cited 2021 Oct 12]. Available from: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/

Data source and licensing:

Sequence data downloaded from NCBI Virus (National Center for Biotechnology Information, U.S. National Library of Medicine) on October 12, 2021.

Per NCBI Website and Data Usage Policies (https://www.ncbi.nlm.nih.gov/home/about/policies/):
"NCBI itself places no restrictions on the use or distribution of the data contained therein."

Data use confirmed with NCBI Help Desk, Case #CAS-1470196-D4S2Z8, May 2025:
"You may use the sequence data for scientific and educational purposes." 

Note: Some submitted sequences may be subject to patent, copyright, or other intellectual property rights claimed by original submitters or their country of origin.

The compilation, curation, and packaging of this archive by the ViralEntropR authors are released under CC0 1.0 Universal.

Files

Files (181.5 MB)

Size: 181.5 MB
md5:4e9f5ca1b8a0f99c15a7ad55e9ccb25b
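
A downloaded copy can be compared against the MD5 checksum listed above. The following is a minimal Python sketch using only the standard library; the helper name `md5_of_file` and the archive file name in the trailing comment are hypothetical, and the demonstration hashes a small temporary file rather than the real archive.

```python
import hashlib
import os
import tempfile

# Checksum published with this record.
EXPECTED_MD5 = "4e9f5ca1b8a0f99c15a7ad55e9ccb25b"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a small temporary file with known contents.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name
demo_digest = md5_of_file(demo_path)
os.remove(demo_path)
print(demo_digest == hashlib.md5(b"hello").hexdigest())

# For the real download, assuming a hypothetical local file name:
# assert md5_of_file("viralentropr_spike_archive") == EXPECTED_MD5
```

Reading in chunks keeps memory use constant regardless of file size. MD5 is suitable here only as an integrity check against accidental corruption, not as a security guarantee.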

Additional details

Related works

Is derived from
Dataset: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/ (URL)
Is supplement to
Software: https://github.com/vadimtyuryaev/ViralEntropR (URL)

Software

Repository URL
https://github.com/vadimtyuryaev/ViralEntropR
Programming language
R
Development Status
Active
