Refseq datasets for training frame classification

Voigt, Benjamin; Fischer, Oliver; Krumnow, Christian; Herta, Christian; Dabrowski, Piotr Wojciech

doi:10.5281/zenodo.4306248

Published December 4, 2020 | Version 1.0.0

Dataset Open

1. Center for Bio-Medical image and Information processing (CBMI), HTW University of Applied Sciences, Berlin, Germany

The data is based on randomly selected viral and bacterial genomes and the human (GRCh38.p13) reference genome, all of which were downloaded from GenBank. From each original nucleic acid sequence we created multiple patches of length 300 in all possible reading frames using a sliding window on the initial sequence and its reverse complement. For the train and val files, the resulting patches are translated to amino acid sequences of length 100, whereas the DNA_test file contains the nucleic acid sequence patches of length 300. The data is stored in the FASTA format according to the following convention:

>{ID}_subsequence{patch index}_frame{frame index}|{class marker}|{frame index}
sequence

with

ID - denotes the RefSeq accession of the original sequence in the RefSeq dataset.
sequence - either a nucleic acid sequence patch of length 300 (DNA_test) or an amino acid sequence of length 100 (train, val)
patch index - denotes the starting triplet of the given patch within the original sequence or the reverse-complemented sequence (i.e. 3 * patch index is the starting index of frame 0 in the original sequence)
class marker - indicates the taxonomic domain
    0 - virus
    1 - bacteria
    2 - human / mammal
frame index - indicates the reading frame
    0 - on-frame
    1 - shifted by one
    2 - shifted by two
    3 - reverse complemented
    4 - shifted by one and reverse complemented
    5 - shifted by two and reverse complemented
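
The naming convention and frame definitions above can be illustrated with a minimal Python sketch. It is not the authors' preprocessing code: the function names (`six_frame_records`, `translate`, `parse_header`) are hypothetical, and it assumes that a patch in frame f starts at position 3 * patch index + (f mod 3) of the forward strand (frames 0 to 2) or of the reverse complement (frames 3 to 5), and that translation uses the standard genetic code with `*` for stop codons. How the published files treat stop codons and ambiguous bases is not stated in the description.

```python
from itertools import product

PATCH_LEN = 300  # nucleotides per patch, as stated in the description

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), _AMINO_ACIDS)
}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the strand."""
    return seq.translate(_COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; codons outside the table (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_records(seq_id, seq, class_marker, patch_index):
    """Yield (header, dna_patch) for each frame index whose patch fits in the sequence.

    Assumption: frames 3 to 5 are read from the reverse complement with the
    same offset rule as frames 0 to 2.
    """
    seq = seq.upper()
    for frame in range(6):
        strand = seq if frame < 3 else reverse_complement(seq)
        start = 3 * patch_index + frame % 3
        patch = strand[start:start + PATCH_LEN]
        if len(patch) == PATCH_LEN:
            header = (
                f">{seq_id}_subsequence{patch_index}_frame{frame}"
                f"|{class_marker}|{frame}"
            )
            yield header, patch


def parse_header(header):
    """Return (accession, patch index, class marker, frame index) from a header line."""
    name, class_marker, frame = header.strip().lstrip(">").split("|")
    # Split from the right so accessions containing underscores stay intact.
    seq_id, rest = name.rsplit("_subsequence", 1)
    patch_index = rest.split("_frame")[0]
    return seq_id, int(patch_index), int(class_marker), int(frame)
```

For example, with `NC_000001.11` used purely as an illustrative accession, `parse_header(">NC_000001.11_subsequence0_frame4|2|4")` returns `("NC_000001.11", 0, 2, 4)`. Passing a 300-nucleotide patch to `translate` gives the 100-residue form used in the train and val files under the stated assumptions.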

The data is split into test, training and validation sets, which contain the following numbers of patches per frame:

- train: 1,700,944
- test: 212,618
- val: 212,618
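
These figures correspond to an 80/10/10 split (1,700,944 of 2,126,180 patches per frame are in the training set). One way to check a downloaded file against them is to tally records by the frame index in the last `|`-separated header field. The sketch below is a minimal example; the function name `count_patches_per_frame` and the file name in the usage comment are hypothetical, since the file names inside the archive are not listed here.

```python
from collections import Counter


def count_patches_per_frame(lines):
    """Tally FASTA records by the frame index stored in the last '|' field."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            frame = int(line.rstrip().rsplit("|", 1)[1])
            counts[frame] += 1
    return counts


# Hypothetical usage; the actual file name inside refseq.tar.gz may differ:
# with open("train.fasta") as handle:
#     print(count_patches_per_frame(handle))
```

If the counts above are indeed per frame, each of the six frame indices should appear that many times in the corresponding file.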

Notes

The authors acknowledge the financial support by the Federal Ministry of Education and Research of Germany (BMBF) in the project deep.Health (project number 13FH770IX6).

Files

Files (197.3 MB)

Name	Size
refseq.tar.gz (md5:0953e038309ca5d6717fb47145e8a3d8)	197.3 MB
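
The listed MD5 checksum can be used to verify the download before unpacking. A minimal sketch using Python's standard `hashlib` module follows; the helper name `md5_of_file` is hypothetical, and the usage comment assumes the archive was saved as `refseq.tar.gz` in the working directory.

```python
import hashlib

# Checksum listed for refseq.tar.gz in the file table.
EXPECTED_MD5 = "0953e038309ca5d6717fb47145e8a3d8"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Assumed usage after downloading the archive:
# if md5_of_file("refseq.tar.gz") != EXPECTED_MD5:
#     raise ValueError("checksum mismatch, download may be corrupted")
```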
