Published March 25, 2024 | Version v1
Dataset Open

NanoVarBench variant truthset files

  • 1. The University of Melbourne

Description

These tarballs contain the variant truthsets used for each sample in our paper "Benchmarking reveals superiority of deep learning variant callers on bacterial nanopore sequence data".

Each directory contains the following files:

  • <sample>.bed - A BED file of all regions in the genome
  • <sample>.repetitive_regions.bed - A BED file of all repetitive regions of the genome (see the paper for details of how these were identified).
  • <sample>.unique_regions.bed - Non-repetitive regions of the genome. This is the output of running bedtools complement -i <repetitive BED> -g <faidx of mutref>
  • ani.tsv - skani output from running skani search with the sample's assembly against all of the downloaded genomes for that species. The last three columns are not from skani: they are the completeness_percentile, completeness, and contamination metrics, all obtained from NCBI for each assembly accession.
  • apply.vcf.gz - the variants that were applied to the sample's reference assembly.
  • apply.vcf.gz.csi - VCF index for the above VCF
  • dnadiff.vcf.gz - Variants between the sample and donor genome from mummer4
  • minimap2.vcf.gz - Variants between the sample and donor genome from minimap2
  • mutdonor.fna - the FASTA file of the selected variant donor
  • mutreference.fna - the sample's reference assembly with the variants in apply.vcf.gz applied to it. This is the genome that the sample's reads are aligned to for calling variants
  • mutreference.fna.fai - the faidx of the above genome
  • reference.fna - the reference assembly of the sample. These are also available on GenBank, but are included here for interoperability
  • truth.vcf.gz - the truthset of variants. This is essentially apply.vcf.gz with the REF and ALT inverted and the POS adjusted for the difference in position between the sample and donor assemblies. (See this script)
  • vcfstats.txt - VCF statistics produced by paftools.js vcfstat on the truth VCF
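
To illustrate how <sample>.unique_regions.bed relates to <sample>.repetitive_regions.bed, here is a minimal pure-Python sketch of an interval complement. It is not the bedtools implementation; the function name `complement` and the example contig names and lengths are hypothetical. It assumes 0-based, half-open BED coordinates and contig lengths taken from the first two columns of a faidx file.

```python
from collections import defaultdict


def complement(intervals, contig_lengths):
    """Return the regions of each contig NOT covered by `intervals`.

    intervals      : iterable of (contig, start, end), 0-based half-open (BED style)
    contig_lengths : dict mapping contig name -> length (e.g. from a .fai file)
    """
    by_contig = defaultdict(list)
    for contig, start, end in intervals:
        by_contig[contig].append((start, end))

    uncovered = []
    for contig, length in contig_lengths.items():
        cursor = 0  # first position not yet known to be covered
        for start, end in sorted(by_contig.get(contig, [])):
            if start > cursor:
                # gap between the previous covered block and this interval
                uncovered.append((contig, cursor, start))
            cursor = max(cursor, end)  # also handles overlapping intervals
        if cursor < length:
            # tail of the contig after the last interval
            uncovered.append((contig, cursor, length))
    return uncovered


# Hypothetical example: two overlapping repeats and one repeat at the contig end
repeats = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 900, 1000)]
lengths = {"chr1": 1000, "plasmid": 200}
unique = complement(repeats, lengths)
```

Contigs with no repetitive intervals are reported in full, which matches the default behaviour of bedtools complement as far as we are aware.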

For information about each sample, refer to the samplesheet and paper.
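
The relationship between apply.vcf.gz and truth.vcf.gz described above can be sketched as follows. This is only one plausible reading of that description, not the authors' script: it assumes the variants are on a single contig, sorted by position, and non-overlapping, and that each POS shifts by the net length change of the variants applied upstream of it. The function name `invert_variants` and the example records are hypothetical.

```python
def invert_variants(applied):
    """Swap REF/ALT and shift POS into the coordinates of the mutated reference.

    applied : list of (pos, ref, alt) tuples for ONE contig, 1-based positions,
              sorted by pos and non-overlapping (assumptions of this sketch).
    Returns a list of (pos, ref, alt) tuples describing the inverse variants.
    """
    inverted = []
    offset = 0  # net length change introduced by variants applied so far
    for pos, ref, alt in applied:
        # After applying upstream variants, this site has moved by `offset`
        # and the mutated genome now carries `alt`, so `alt` becomes the REF.
        inverted.append((pos + offset, alt, ref))
        offset += len(alt) - len(ref)
    return inverted


# Hypothetical records: a SNP, a 1 bp deletion, a 2 bp insertion, a SNP
applied = [(100, "A", "T"), (200, "AC", "A"), (300, "G", "GTT"), (400, "C", "G")]
truth = invert_variants(applied)
```

In this example the deletion shifts the downstream insertion one base to the left, and the insertion then shifts the final SNP one base to the right of its original position.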

Files

Files (54.1 MB)

Checksum  Size
md5:8c91b8d327c5629736df3f63c3f21380  5.8 MB
md5:1e60d320f798f788b13b1947ee5efafb  4.2 MB
md5:2e2d8463d3c4ac97c9284f72abf6b451  5.3 MB
md5:83b063edc463cfdba25a9ce30a442234  6.9 MB
md5:fd96288db15f3b1493b47fe0e6d56cc0  3.0 MB
md5:5b124aa615731f312b1939b554c5cd60  5.2 MB
md5:39599e31cb95741eb18edd47c619492e  1.9 MB
md5:fa7dd267965b5e10abc2c342488b350f  2.0 MB
md5:b411f7bbd997d0806af43a8eb3d44371  3.2 MB
md5:b266cdb5d7cff95b703ae991c333948d  3.2 MB
md5:371775c6e9466e3beab3b531831caacd  3.2 MB
md5:3d46e5e6aa0f049c19b23f71f568598c  6.0 MB
md5:25b38b7bcbd4c7493b2c7a9aa61839a4  2.4 MB
md5:30d2ec8845697e66538cc2aea313e719  1.9 MB
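
A downloaded tarball can be checked against its listed checksum with a short script. The sketch below uses only the Python standard library; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the tarball path must be supplied by the user since file names are not shown in the listing above.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_checksum(path_to_tarball, "md5:8c91b8d327c5629736df3f63c3f21380")` should return True for whichever tarball corresponds to the first entry in the listing.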

Additional details

Related works

Is derived from
Preprint: 10.1101/2024.03.15.585313 (DOI)