Dataset Open Access

Vector sequences in early WIV SRA sequencing data of SARS-CoV-2 inform on a potential large-scale security breach at the beginning of the COVID-19 pandemic

Daoyu Zhang


Sequences identified as Influenza A virus, Spodoptera frugiperda rhabdovirus and Nipah henipavirus have previously been identified within the early HiSeq 1000 and HiSeq 3000 sequencing data of SARS-CoV-2 (SRR11092059, SRR11092060, SRR11092061 and SRR11092062), and have been used to support the hypothesis that a "simultaneous outbreak of multiple zoonotic viruses" occurred in the Huanan Seafood Market.

However, a closer examination of these sequences revealed that they were not sequences of actual wild viruses, but were instead fragments left behind from PCR products and cloning vectors harboring both cDNA clones and infectious clones of such viruses, with evidence of viral sequences being joined directly to DNA sequences of vector and non-human origin within the same short reads.

Presented here are the vector sequences and PCR product-like sequences recovered from the earliest WIV SRA sequencing data of human SARS-CoV-2, from datasets SRR11092059, SRR11092060, SRR11092061 and SRR11092062.

Sequences associated with vectors and PCR products from 3 distinct viral species have been obtained: the 3'-end of a Nipah henipavirus fused to a Hepatitis D virus (HDV) ribozyme, a T7 terminator and a tetracycline resistance gene; the 5'-end of the same Nipah henipavirus fused to sequences found in diverse vectors; a complete vector genome encoding the HA gene of Influenza A virus subtype H7N9 under a CMV promoter and a bGH polyA terminator; and 221 contiguous sequences corresponding to the Spodoptera frugiperda rhabdovirus reference genome fused to sequences homologous to multiple plastid sequences and, notably, mitochondrial sequences of rodents.

Sequences corresponding to a rescued infectious clone of a BSL-4 organism (Nipah henipavirus) were found in sequencing data that supposedly represent patient samples obtained from a hospital ICU and sequenced in a pathogen diagnosis laboratory. The infectious clone, evident from the 3'-HDV ribozyme and T7 terminator fused directly to the 3'-terminus of the Nipah henipavirus reads, implies a virology research laboratory, which is separate from the diagnosis laboratory. The discovery of artifact-containing sequences from at least 3 different pathogen species that are phylogenetically and methodologically distinct from each other, in samples supposedly submitted by a laboratory separate from the virological research laboratories that could have hosted such clones, implies extensive crosstalk and cross-contamination between the various laboratories within the Wuhan Institute of Virology. These include at least one BSL-4 laboratory, with evidence of a containment breach of a BSL-4 organism and its subsequent introduction into RNA-seq samples processed by a laboratory whose purpose is distinct from the basic virological research evidenced by the infectious clone of the Nipah henipavirus.

Such a discovery therefore likely implies a major security breach within the Wuhan Institute of Virology at the time when the first sequences of SARS-CoV-2 were sampled and sequenced, which has important implications for the origins of the SARS-CoV-2 virus itself.


The metagenomic sequencing datasets SRR11092059, SRR11092060, SRR11092061 and SRR11092062 were first analyzed using the NCBI phylogenetic analysis tool, which identified viral sequences that are not related to SARS-CoV-2 itself. These include Influenza A virus (IAV, subtype H7N9), Spodoptera frugiperda rhabdovirus and Nipah henipavirus.

The datasets were then subjected to a BLAST search using MEGABLAST against the reference sequences of these viruses, to verify the existence of the viral sequences and to determine the exact subtype of each virus and the closest GenBank sequences corresponding to the reads. These sequences are MH926031.1 for the Spodoptera frugiperda rhabdovirus, KY199425.1 for the Influenza A virus and AY988601.1 for the Nipah henipavirus.
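The read-to-reference matching step above can be illustrated with a minimal sketch. It assumes the MEGABLAST hits were exported in the standard 12-column BLAST tabular format (`-outfmt 6`), which the description does not state. The helper names (`parse_outfmt6`, `best_hit_per_read`) and the identity and length cut-offs are hypothetical choices for illustration, not the thresholds used for this dataset.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One row of BLAST tabular output (-outfmt 6), selected columns."""
    qseqid: str      # read identifier
    sseqid: str      # subject (reference) accession
    pident: float    # percent identity
    length: int      # alignment length
    qstart: int      # alignment start on the read
    qend: int        # alignment end on the read
    evalue: float
    bitscore: float


def parse_outfmt6(text: str) -> list[Hit]:
    """Parse tab-separated outfmt 6 text, skipping blank and comment lines."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        # Default column order: qseqid sseqid pident length mismatch gapopen
        # qstart qend sstart send evalue bitscore
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]),
                        int(f[6]), int(f[7]), float(f[10]), float(f[11])))
    return hits


def best_hit_per_read(hits, min_pident=95.0, min_length=50):
    """Keep, for each read, the highest-bitscore hit that passes the filters.

    min_pident and min_length are illustrative assumptions.
    """
    best = {}
    for h in hits:
        if h.pident < min_pident or h.length < min_length:
            continue
        if h.qseqid not in best or h.bitscore > best[h.qseqid].bitscore:
            best[h.qseqid] = h
    return best
```

Grouping the surviving hits by `sseqid` would then give the closest GenBank accession per read, which is how a list such as MH926031.1, KY199425.1 and AY988601.1 could be tallied.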

A second round of BLAST analysis with these identified sequences was then performed, which unexpectedly revealed numerous reads in which cloning vector sequences and non-human mitochondrial and plastid sequences were fused directly to the sequences of the identified viral species. Reads were then downloaded and assembled using the CAP3 sequence assembly program and the EGassembler tool. Contig sequences were then queried against the NCBI nr/nt database, which unanimously identified the original sample sequences as viral sequences inserted into cloning vectors.
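The notion of a viral sequence "fused directly" to a vector sequence within the same read can be made concrete with a small sketch. It assumes each hit is reduced to a tuple of read identifier, subject identifier and the aligned interval on the read. A read is flagged when a viral hit and a non-viral hit cover largely separate parts of it. The function names, the `max_overlap` tolerance and the subject name `vector_X` used in the note below are hypothetical and do not come from the dataset.

```python
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Number of read positions shared by two 1-based inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def find_chimeric_reads(hits, viral_ids, max_overlap=10):
    """Return {read_id: [(viral_subject, other_subject), ...]}.

    hits: iterable of (read_id, subject_id, qstart, qend) tuples.
    viral_ids: set of subject identifiers treated as viral references.
    max_overlap: bases the two alignments may share on the read
                 (an illustrative assumption).
    """
    by_read = defaultdict(list)
    for read_id, subject_id, qstart, qend in hits:
        lo, hi = sorted((qstart, qend))  # tolerate reversed coordinates
        by_read[read_id].append((subject_id, lo, hi))

    chimeric = {}
    for read_id, segments in by_read.items():
        viral = [s for s in segments if s[0] in viral_ids]
        other = [s for s in segments if s[0] not in viral_ids]
        for v in viral:
            for o in other:
                # Distinct parts of the read align to different sources.
                if overlap(v[1], v[2], o[1], o[2]) <= max_overlap:
                    chimeric.setdefault(read_id, []).append((v[0], o[0]))
    return chimeric
```

For example, a read whose positions 1-80 align to AY988601.1 and whose positions 75-150 align to a placeholder subject `vector_X` would be reported, whereas a read aligning only to AY988601.1 would not.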

The complete sequence of the Influenza A virus hemagglutinin (HA) gene clone was obtained from SRR11092061 and SRR11092062 using multiple rounds of BLAST search and sequence assembly expansion on the existing vector-virus junction contigs, and a partial sequence corresponding to the 3'-end of Nipah henipavirus AY988601.1 fused to a 3'-HDV ribozyme, a T7 terminator and a Tet resistance gene was obtained from SRR11092059. In addition, 221 contig sequences corresponding to the rhabdovirus MH926031.1 fused to chloroplast sequence MN524635.1 and rodent mitochondrial sequence MT241668.1 were recovered from SRR11092061.

We then performed a BLAST search using the identified vector sequences on SRR11092059, SRR11092060, SRR11092061 and SRR11092062, which confirmed the existence of these two vector sequences in all 4 datasets.
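The cross-dataset check described above amounts to tallying, for each vector query, how many distinct reads match it in each run. The following is a minimal sketch, assuming the hits for each run have already been reduced to (read identifier, subject identifier) pairs. The function names and the `min_reads` threshold are hypothetical, and no actual read counts from the datasets are implied.

```python
from collections import defaultdict

# The four SRA runs named in the description.
RUNS = ["SRR11092059", "SRR11092060", "SRR11092061", "SRR11092062"]


def reads_per_subject(hits_by_run):
    """Return {subject_id: {run: number of distinct matching reads}}.

    hits_by_run: {run accession: iterable of (read_id, subject_id) pairs}.
    """
    counts = defaultdict(dict)
    for run, hits in hits_by_run.items():
        seen = defaultdict(set)
        for read_id, subject_id in hits:
            seen[subject_id].add(read_id)  # count each read only once
        for subject_id, reads in seen.items():
            counts[subject_id][run] = len(reads)
    return dict(counts)


def present_in_all(counts, runs=RUNS, min_reads=1):
    """Subjects with at least min_reads matching reads in every run."""
    return sorted(
        subject
        for subject, per_run in counts.items()
        if all(per_run.get(run, 0) >= min_reads for run in runs)
    )
```

Raising `min_reads` above 1 is one way to require more than a single supporting read per run before calling a vector sequence present.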
