Dataset Open Access

Bacterial training dataset for Galaxy training network tutorials on Genome assembly

Gladman, Simon; Seemann, Torsten; Bulach, Dieter

This training dataset is derived from an imaginary Staphylococcus aureus bacterium with a miniature genome. It contains a reference genome in several formats, along with FASTQ reads from a closely related, also imaginary, mutant strain.

It is a useful dataset for demonstrating:

  • de novo genome assembly
  • read mapping and variant calling
  • genome annotation

The files included are:

  • wildtype.fna: the reference genome sequence of the wildtype strain in FASTA format (a header line, then the nucleotide sequence of the genome).
  • wildtype.gff: the reference genome of the wildtype strain in General Feature Format (a list of features, one per line, followed by the nucleotide sequence of the genome).
  • wildtype.gbk: the reference genome of the wildtype strain in GenBank format.
  • mutant_R1.fastq and mutant_R2.fastq: FASTQ sequence reads of a closely related mutant strain.
    • The reads are paired-end.
    • Each read is 150 bases long.
    • The total number of bases sequenced is equivalent to 19x the length of the wildtype genome sequence (that is, 19x read coverage, which is rather low).
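
The coverage figure quoted above is the total number of sequenced bases divided by the length of the reference genome. The following is a minimal sketch of that arithmetic, not part of the dataset itself. The function names are illustrative, the FASTQ parser assumes the common four-lines-per-record layout, and the toy sequences are made-up stand-ins rather than data from these files.

```python
import io


def fasta_length(handle):
    """Count sequence characters in a FASTA stream, skipping '>' header lines."""
    return sum(len(line.strip()) for line in handle if not line.startswith(">"))


def fastq_bases(handle):
    """Count bases in a FASTQ stream.

    Assumption: each record spans exactly 4 lines
    (header, sequence, '+', qualities), so the sequence is line 2 of 4.
    """
    return sum(len(line.strip()) for i, line in enumerate(handle) if i % 4 == 1)


def coverage(genome_length, *base_counts):
    """Average read depth: total sequenced bases divided by genome length."""
    return sum(base_counts) / genome_length


# Toy stand-ins (NOT the real dataset): a 10-base genome and two small read files.
TOY_FASTA = ">toy\nACGTA\nCGTAC\n"
TOY_R1 = "@r1/1\nACGTA\n+\nIIIII\n@r2/1\nCGTAC\n+\nIIIII\n"
TOY_R2 = "@r1/2\nTACGT\n+\nIIIII\n@r2/2\nGTACG\n+\nIIIII\n"

genome = fasta_length(io.StringIO(TOY_FASTA))
bases_r1 = fastq_bases(io.StringIO(TOY_R1))
bases_r2 = fastq_bases(io.StringIO(TOY_R2))
toy_coverage = coverage(genome, bases_r1, bases_r2)
print(f"{toy_coverage:.1f}x")  # 20 bases over a 10-base genome -> 2.0x
```

To apply the same calculation to the downloaded files, pass open("wildtype.fna"), open("mutant_R1.fastq") and open("mutant_R2.fastq") in place of the in-memory toy strings. If the description above holds, the result should be close to 19, although that has not been checked here.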

Files (9.1 MB)

Name              Size       MD5
mutant_R1.fastq   4.1 MB     32ad7e3698f3f78fd35047cfd8718ea9
mutant_R2.fastq   4.1 MB     6df54a24aed461dcf70c6a79a56f18f7
wildtype.fna      200.7 kB   80fe318fdf4cdd0fee4a244f520a0c54
wildtype.gbk      399.7 kB   8d8bc40a25fb7ce700abc7c34decadd0
wildtype.gff      238.3 kB   b175fe7ba1400f1f6e60465e60b8132b
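
The MD5 checksums in the file listing above can be used to confirm that a download completed intact. The sketch below shows one way to do this with Python's standard hashlib module. The helper names (md5_of_file, verify_downloads) are illustrative and not part of the dataset, and the script assumes the files were saved under their original names in a single directory.

```python
import hashlib
import os
import tempfile

# Checksums copied from the file listing above.
EXPECTED_MD5 = {
    "mutant_R1.fastq": "32ad7e3698f3f78fd35047cfd8718ea9",
    "mutant_R2.fastq": "6df54a24aed461dcf70c6a79a56f18f7",
    "wildtype.fna": "80fe318fdf4cdd0fee4a244f520a0c54",
    "wildtype.gbk": "8d8bc40a25fb7ce700abc7c34decadd0",
    "wildtype.gff": "b175fe7ba1400f1f6e60465e60b8132b",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each listed file to True (match), False (mismatch) or None (not found)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            results[name] = md5_of_file(path) == expected
        else:
            results[name] = None
    return results


# Self-contained demonstration with a placeholder file (NOT the real genome):
# its checksum will not match, and the other four files are absent.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "wildtype.fna"), "wb") as fh:
        fh.write(b"not the real genome")
    demo = verify_downloads(tmp)

for name, status in demo.items():
    print(name, status)
```

Calling verify_downloads() on the directory that holds the real downloads should report True for every file. A False result suggests a truncated or corrupted download.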
                   All versions   This version
Views              4,133          4,133
Downloads          10,464         10,464
Data volume        23.4 GB        23.4 GB
Unique views       3,714          3,714
Unique downloads   6,189          6,189
