Published May 23, 2017 | Version v1
Dataset
Open
Bacterial training dataset for Galaxy training network tutorials on Genome assembly
- 1. Melbourne Bioinformatics, University of Melbourne
Description
This training dataset is from an imaginary Staphylococcus aureus bacterium with a miniature genome. It contains a reference genome in several formats, together with FASTQ reads from a closely related, and also imaginary, mutant strain.
It is a useful dataset for demonstrating:
- de novo genome assembly
- read mapping and variant calling
- genome annotation
The files included are:
- wildtype.fna: the reference genome sequence of the wildtype strain in FASTA format (a header line, then the nucleotide sequence of the genome).
- wildtype.gff: the reference genome of the wildtype strain in General Feature Format (a list of features, one feature per line, then the nucleotide sequence of the genome).
- wildtype.gbk: the reference genome of the wildtype strain in GenBank format.
- mutant_R1.fastq and mutant_R2.fastq: FASTQ sequence reads of a closely related mutant strain.
  - The reads are paired-end.
  - Each read is 150 bases long.
  - The total number of bases sequenced is about 19 times the length of the wildtype genome, giving a read coverage of 19x, which is rather low.
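The coverage figure above is the total number of sequenced bases divided by the genome length. The following is a minimal Python sketch of that arithmetic, assuming the three files sit in the current directory, that the FASTQ files use the common four-lines-per-record layout, and that the FASTA file holds only header and sequence lines. The function names are illustrative and not part of the dataset.

```python
from pathlib import Path
from typing import Iterable


def genome_length(fasta_lines: Iterable[str]) -> int:
    """Count sequence characters in FASTA text, skipping '>' header lines."""
    return sum(len(line.strip()) for line in fasta_lines if not line.startswith(">"))


def total_read_bases(fastq_lines: Iterable[str]) -> int:
    """Sum read lengths in FASTQ text.

    Assumes four lines per record, so the sequence is the 2nd line of each record.
    """
    return sum(len(line.strip()) for i, line in enumerate(fastq_lines) if i % 4 == 1)


def coverage(read_bases: int, genome_len: int) -> float:
    """Average read coverage: sequenced bases divided by genome length."""
    return read_bases / genome_len


def main() -> None:
    fasta = Path("wildtype.fna")
    fastqs = [Path("mutant_R1.fastq"), Path("mutant_R2.fastq")]
    if not fasta.exists() or not all(p.exists() for p in fastqs):
        print("Dataset files not found in the current directory; nothing to compute.")
        return

    with fasta.open() as handle:
        genome_len = genome_length(handle)

    read_bases = 0
    for path in fastqs:
        with path.open() as handle:
            read_bases += total_read_bases(handle)

    print(f"Genome length: {genome_len} bases")
    print(f"Sequenced bases: {read_bases}")
    print(f"Approximate coverage: {coverage(read_bases, genome_len):.1f}x")


if __name__ == "__main__":
    main()
```

Run against the files listed above, the reported coverage should come out close to the stated 19x. Both read files are counted because each member of a read pair contributes its own bases.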