Ahmed M Moustafa
GNUVID 2.0: Big update
This release of GNUVID comes with a significant speed-up and improved classification. The new classification algorithm is called GNUVID_Predict.
Using GNUVID is now as simple as running: GNUVID query_fasta.fna
As of GNUVID 2.0, GNUVID_Predict.py assigns Clonal Complexes to new genomes quickly, using a Random Forest classifier (a machine learning method).
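As a rough illustration of this kind of classifier (not the actual GNUVID_Predict.py code), the sketch below one-hot encodes aligned sequences and fits a scikit-learn Random Forest on them. The symbol set, the toy sequences, the labels CC_A and CC_B, and the hyperparameters are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Assumed symbol set for the illustration; the real encoding may differ.
ALPHABET = "ACGT-N"


def one_hot(aligned_seq):
    """Flatten an aligned sequence into a 0/1 vector with one slot per (position, symbol)."""
    index = {ch: i for i, ch in enumerate(ALPHABET)}
    vec = np.zeros((len(aligned_seq), len(ALPHABET)), dtype=np.uint8)
    for pos, ch in enumerate(aligned_seq.upper()):
        # Symbols outside the assumed alphabet are treated as N.
        vec[pos, index.get(ch, index["N"])] = 1
    return vec.ravel()


# Toy aligned sequences (all the same length) with made-up clonal complex labels.
train_seqs = ["ACGTAC", "ACGTAA", "ACGTAC", "TCGTTC", "TCGTTA", "TCGTTC"]
train_labels = ["CC_A", "CC_A", "CC_A", "CC_B", "CC_B", "CC_B"]

X = np.array([one_hot(s) for s in train_seqs])
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, train_labels)

# Predict a label for a new aligned query sequence.
query = one_hot("TCGTTC")
predicted = clf.predict([query])[0]
print(predicted)
```

Because every sequence is aligned to the same reference, each genome yields a feature vector of the same length, which is what lets a standard tabular classifier be applied.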
The model was trained on 53,565 SARS-CoV-2 sequences from GISAID. The alignment of these genomes to the reference MN908947.3 was one-hot encoded, and the classifier was built using the scikit-learn implementation of Random Forest.
Globally circulating clonal complexes as of 2020-10-20:
69686 GISAID sequences have been included in this analysis.
GNUVID compressed the 696860 ORFs in the 69686 genomes to 37921 unique alleles.
35010 Sequence Types (STs) have been assigned in this dataset and were clustered in 154 clonal complexes (CCs).
84 new CCs have been assigned.
82 CCs are Inactive (i.e. last seen more than 1 month before 2020-10-20).
27 CCs have gone Quiet (i.e. last seen 2-4 weeks before 2020-10-20).
45 CCs are Active (i.e. last seen within the 2 weeks before 2020-10-20).
CC70, CC26, CC343, CC439, CC927, CC1434, CC11290, CC13202, CC13669 and CC17244 have been renamed CC550, CC750, CC9999, CC2649, CC1179, CC2175, CC18372, CC13208, CC12995 and CC13413, respectively.
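The Active, Quiet and Inactive categories above amount to a comparison between a CC's last-seen date and the analysis date. A minimal sketch follows, assuming "1 month" is treated as 4 weeks and that the boundaries are inclusive; the function name cc_status and the example dates are hypothetical, not taken from GNUVID.

```python
from datetime import date, timedelta


def cc_status(last_seen, as_of):
    """Classify a clonal complex by how long ago it was last observed."""
    age = as_of - last_seen
    if age <= timedelta(weeks=2):
        return "Active"
    if age <= timedelta(weeks=4):  # assumption: "1 month" is treated as 4 weeks
        return "Quiet"
    return "Inactive"


as_of = date(2020, 10, 20)

# Made-up last-seen dates, one per category.
examples = [date(2020, 10, 15), date(2020, 9, 30), date(2020, 8, 1)]
for last_seen in examples:
    print(last_seen.isoformat(), cc_status(last_seen, as_of))
```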