Published August 29, 2021 | Version v1
Dataset Open

Simulated wastewater sequencing data for benchmarking SARS-CoV-2 variant abundance estimation

  • 1. Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  • 2. Department of Chemical and Environmental Engineering, Yale University, New Haven, CT, USA
  • 3. Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
  • 4. Biobot Analytics, Inc., Cambridge, MA, USA
  • 5. Ginkgo Bioworks, Inc., Boston, MA, USA
  • 6. Center for Communicable Disease Dynamics and Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  • 7. Department of Epidemiology of Microbial Diseases and Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA

Description

To evaluate the accuracy of variant abundance predictions from wastewater sequencing, we built a collection of benchmarking datasets that resemble real wastewater samples. For each variant (B.1.1.7, B.1.351, B.1.427, B.1.429, P.1) we created a series of 33 benchmarks by simulating sequencing reads from a variant genome and from a collection of background sequences (lineages that are not variants of concern or interest), mixed such that the variant abundance ranges from 0.05% to 100%. Analogously, we created a second series of benchmarks by simulating reads only from the Spike gene of each SARS-CoV-2 genome. We refer to the first set of benchmarks as "whole genome" (WG) and to the second set as "S-only". We repeated these simulations at different sequencing depths: 100x and 1000x coverage for the WG benchmarks, and 100x, 1000x, and 10,000x coverage for the S-only benchmarks.
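The relationship between target coverage, variant abundance, and the number of reads to simulate can be sketched as follows. This is only an illustration of the arithmetic, not the procedure used to build the dataset: the read length of 150 bp, the function name `reads_for_mixture`, and the definition of abundance as a fraction of total reads are all assumptions. The genome length is that of the SARS-CoV-2 reference sequence NC_045512.2.

```python
import math

GENOME_LENGTH = 29_903  # bp, SARS-CoV-2 reference NC_045512.2
READ_LENGTH = 150       # bp, ASSUMED; the description does not state a read length


def reads_for_mixture(total_coverage, variant_fraction,
                      genome_length=GENOME_LENGTH, read_length=READ_LENGTH):
    """Split a read budget between variant and background sequences.

    total_coverage   -- target mean depth, e.g. 100 for "100x"
    variant_fraction -- variant abundance as a fraction in [0, 1]
    Returns (variant_reads, background_reads).
    """
    if not 0.0 <= variant_fraction <= 1.0:
        raise ValueError("variant_fraction must lie in [0, 1]")
    # Reads needed so that reads * read_length covers the genome total_coverage times.
    total_reads = math.ceil(total_coverage * genome_length / read_length)
    variant_reads = round(total_reads * variant_fraction)
    return variant_reads, total_reads - variant_reads


# Hypothetical usage: a 100x whole-genome benchmark at a few abundance levels.
for percent in (0.05, 1, 50, 100):
    variant, background = reads_for_mixture(100, percent / 100)
    print(f"{percent}% variant: {variant} variant reads, {background} background reads")
```

Under these assumptions, the lowest abundance level (0.05%) at 100x whole-genome coverage corresponds to only a handful of variant reads, which illustrates why the higher-depth benchmarks are useful for assessing detection at low abundance.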

Files

Files (7.8 GB)

MD5 checksum                      Size
8ae9a37da2a6d7b6b07e04359b1a19ad  4.0 GB
d307294a92d25102a5cc70341ac89c16  391.2 MB
58bdd104f63668a56831ddc6f76632a1  38.8 MB
72e6236d04fba162c8b6f246efc9a52b  3.1 GB
56b7bb7e7a16e155cc17b1fb320a64c5  307.5 MB
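
After downloading, each file can be checked against its published MD5 checksum. The sketch below uses Python's standard `hashlib` module. The helper names `md5_of_file` and `verify` and the path in the usage comment are hypothetical, since the listing above does not show file names.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks
    so that multi-gigabyte archives do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """True if the file's MD5 matches the expected checksum (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Hypothetical usage; replace the path with the name of the downloaded file:
# verify("path/to/downloaded_file", "8ae9a37da2a6d7b6b07e04359b1a19ad")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again.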