There is a newer version of this record available.

Software Open Access

ahmedmagds/GNUVID: GNUVID v2.0: Globally circulating clonal complexes as of 2020-10-20

Ahmed M Moustafa

GNUVID 2.0

Big update

This release of GNUVID comes with a significant speed-up and improved classification. The new classification algorithm is called GNUVID_Predict.

Using GNUVID is now as simple as running: GNUVID query_fasta.fna

As of GNUVID 2.0, GNUVID_Predict.py quickly assigns Clonal Complexes to new genomes using a Random Forest classifier.

The model was trained on 53,565 SARS-CoV-2 sequences from GISAID. The alignment of these genomes to the reference MN908947.3 was one-hot encoded, and the classifier was built with the scikit-learn implementation of Random Forest.
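The encoding and training steps described above can be sketched roughly as follows. This is an illustrative sketch, not GNUVID's actual code: the function names `one_hot_encode` and `train_classifier`, the six-symbol alphabet, the handling of unexpected symbols, and the hyperparameters are all assumptions made for the example.

```python
# Illustrative sketch only; names, alphabet and hyperparameters are assumptions,
# not taken from the GNUVID source.
ALPHABET = "ACGT-N"  # assumed symbol set for an aligned nucleotide sequence


def one_hot_encode(aligned_seq, alphabet=ALPHABET):
    """Flatten an aligned sequence into a 0/1 list, len(alphabet) slots per column."""
    index = {base: i for i, base in enumerate(alphabet)}
    vector = []
    for base in aligned_seq.upper():
        slot = [0] * len(alphabet)
        # Assumption: symbols outside the alphabet (e.g. IUPAC codes) count as N.
        slot[index.get(base, index["N"])] = 1
        vector.extend(slot)
    return vector


def train_classifier(aligned_seqs, cc_labels):
    """Fit a Random Forest on one-hot encoded alignments (requires scikit-learn)."""
    from sklearn.ensemble import RandomForestClassifier

    features = [one_hot_encode(seq) for seq in aligned_seqs]
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(features, cc_labels)
    return model
```

All sequences must be aligned to the same reference coordinates so that every feature vector has the same length. Predicting a CC for a new genome would then amount to calling `model.predict([one_hot_encode(new_aligned_seq)])`.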

Globally circulating clonal complexes as of 2020-10-20:
  • 69686 GISAID sequences have been included in this analysis.

  • GNUVID compressed the 696860 ORFs in the 69686 genomes to 37921 unique alleles.

  • 35010 Sequence Types (STs) have been assigned in this dataset and were clustered in 154 clonal complexes (CCs).

  • 84 new CCs have been assigned.

  • 82 CCs are Inactive (i.e. last seen more than 1 month before 2020-10-20).

  • 27 CCs have gone Quiet (i.e. last seen 2-4 weeks before 2020-10-20).

  • 45 CCs are Active (i.e. last seen within the 2 weeks before 2020-10-20).

  • CC70, CC26, CC343, CC439, CC927, CC1434, CC11290, CC13202, CC13669 and CC17244 have been renamed CC550, CC750, CC9999, CC2649, CC1179, CC2175, CC18372, CC13208, CC12995 and CC13413, respectively.
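The Active / Quiet / Inactive labels in the list above depend only on how long ago a CC was last seen. A minimal sketch of that rule is shown here. The function name `cc_status` and the exact day cut-offs are assumptions: the notes say "2 weeks", "2-4 weeks" and "more than 1 month", so this sketch treats anything last seen more than 28 days ago as Inactive.

```python
from datetime import date


def cc_status(last_seen, as_of):
    """Label a clonal complex by the number of days since it was last sampled."""
    days = (as_of - last_seen).days
    if days < 0:
        raise ValueError("last_seen is after the reference date")
    if days <= 14:   # within the 2 weeks before the reference date
        return "Active"
    if days <= 28:   # 2-4 weeks before the reference date
        return "Quiet"
    return "Inactive"  # assumed cut-off: more than 4 weeks


as_of = date(2020, 10, 20)
print(cc_status(date(2020, 10, 10), as_of))  # Active
print(cc_status(date(2020, 9, 30), as_of))   # Quiet
print(cc_status(date(2020, 8, 1), as_of))    # Inactive
```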

Files (96.6 MB)
ahmedmagds/GNUVID-v2.0.zip (96.6 MB)
md5:5830d2c9df606d2bfd454d231c625454
