Simulated wastewater sequencing data for benchmarking SARS-CoV-2 variant abundance estimation
Creators
- Baaijens, Jasmijn A. (1)
- Zulli, Alessandro (2)
- Ott, Isabel M. (3)
- Petrone, Mary E. (3)
- Alpert, Tara (3)
- Fauver, Joseph R. (3)
- Kalinich, Chaney C. (3)
- Vogels, Chantal B.F. (3)
- Breban, Mallery I. (3)
- Duvallet, Claire (4)
- McElroy, Kyle (4)
- Ghaeli, Newsha (4)
- Imakaev, Maxim (4)
- Mckenzie-Bennett, Malaika (5)
- Robison, Keith (5)
- Plocik, Alex (5)
- Schilling, Rebecca (5)
- Pierson, Martha (5)
- Littlefield, Rebecca (5)
- Spencer, Michelle (5)
- Simen, Birgitte B. (5)
- Yale SARS-CoV-2 Genomic Surveillance Initiative
- Hanage, William P. (6)
- Grubaugh, Nathan D. (7)
- Peccia, Jordan (2)
- Baym, Michael (1)
- 1. Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- 2. Department of Chemical and Environmental Engineering, Yale University, New Haven, CT, USA
- 3. Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
- 4. Biobot Analytics, Inc., Cambridge, MA, USA
- 5. Ginkgo Bioworks, Inc., Boston, MA, USA
- 6. Center for Communicable Disease Dynamics and Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- 7. Department of Epidemiology of Microbial Diseases and Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA
Description
To evaluate the accuracy of variant abundance predictions from wastewater sequencing, we built a collection of benchmarking datasets that resemble real wastewater samples. For each variant (B.1.1.7, B.1.351, B.1.427, B.1.429, P.1), we created a series of 33 benchmarks by simulating sequencing reads from a variant genome together with a collection of background sequences (sequences that are not variants of concern or interest), such that the variant abundance ranges from 0.05% to 100%. Analogously, we created a second series of benchmarks by simulating reads only from the Spike gene of each SARS-CoV-2 genome. We refer to the first set of benchmarks as "whole genome" (WG) and to the second set as "S-only". We repeated these simulations at different sequencing depths: 100x and 1000x coverage for the WG benchmarks, and 100x, 1000x, and 10,000x coverage for the S-only benchmarks.
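As a rough illustration of how a target variant abundance and sequencing depth translate into read counts, the following Python sketch splits a total coverage between the variant genome and the background and converts each share into a number of reads. This is not the procedure used to build the datasets: the proportional split, the 150-base read length, and the function names `coverage_split` and `reads_for_coverage` are assumptions made for this example, and the abundances in the loop are arbitrary values within the 0.05% to 100% range rather than the actual 33 benchmark levels.

```python
# Length of the SARS-CoV-2 reference genome (Wuhan-Hu-1), in bases.
GENOME_LENGTH = 29_903
# Assumed read length; the description does not state one.
READ_LENGTH = 150


def coverage_split(total_coverage, variant_abundance_pct):
    """Split a total depth into (variant, background) coverage.

    Assumes coverage is shared in proportion to the variant abundance.
    """
    if not 0 <= variant_abundance_pct <= 100:
        raise ValueError("abundance must be a percentage between 0 and 100")
    variant_cov = total_coverage * variant_abundance_pct / 100
    return variant_cov, total_coverage - variant_cov


def reads_for_coverage(coverage, region_length=GENOME_LENGTH,
                       read_length=READ_LENGTH):
    """Approximate number of reads giving the requested average coverage."""
    return round(coverage * region_length / read_length)


# Illustrative abundances only, at 100x whole-genome depth.
for abundance in (0.05, 1, 50, 100):
    variant_cov, background_cov = coverage_split(100, abundance)
    print(abundance,
          reads_for_coverage(variant_cov),
          reads_for_coverage(background_cov))
```

For an S-only calculation under the same assumptions, `region_length` would be set to the length of the Spike gene in the genome being simulated instead of the full genome length.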