Published December 4, 2020 | Version 1.0.0
Dataset Open

Uniprot datasets for training taxonomic classification

  • 1. Center for Bio-Medical image and Information processing (CBMI), HTW University of Applied Sciences, Berlin, Germany

Description

The dataset is based on the UniProt-Swiss-Prot release 2020-04 dataset and contains data derived from amino acid sequences of human, bacterial and viral origin. From each original sequence we created multiple patches of length 100 using a sliding window. The data is stored in FASTA format, with each record laid out as

>{ID}_{patch index}|{class marker}
sequence

with

ID - denotes the ID of the original sequence in the UniProt-Swiss-Prot dataset
sequence - patch of length 100 of an amino acid sequence
patch index - denotes the starting index of the given patch within the original sequence
class marker - indicates the taxonomic class
    0 - virus
    1 - bacteria
    2 - human / mammal
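
The record layout above can be illustrated with a minimal Python sketch that cuts a sequence into patches and reads the header fields back. This is not the authors' preprocessing code: the function names are hypothetical, and the stride of 1, the 0-based start index and the skipping of sequences shorter than 100 residues are assumptions, since the description does not state them.

```python
from typing import Iterator, Tuple

PATCH_LEN = 100  # patch length stated in the description


def make_patches(seq: str, patch_len: int = PATCH_LEN,
                 stride: int = 1) -> Iterator[Tuple[int, str]]:
    """Yield (start index, patch) pairs from a sliding window.

    Assumptions: stride of 1 and 0-based start indices. Sequences
    shorter than patch_len yield nothing.
    """
    for start in range(0, len(seq) - patch_len + 1, stride):
        yield start, seq[start:start + patch_len]


def format_record(seq_id: str, start: int, class_marker: int, patch: str) -> str:
    """Build one FASTA record as >{ID}_{patch index}|{class marker}."""
    return f">{seq_id}_{start}|{class_marker}\n{patch}\n"


def parse_header(header: str) -> Tuple[str, int, int]:
    """Split a header line into (ID, patch index, class marker).

    Splitting from the right keeps IDs that themselves contain an
    underscore intact.
    """
    body = header.lstrip(">").strip()
    left, class_marker = body.rsplit("|", 1)
    seq_id, patch_index = left.rsplit("_", 1)
    return seq_id, int(patch_index), int(class_marker)
```

Splitting from the right is a defensive choice here: it still works if the ID part of a header contains an underscore.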

The data is split into training, test and validation sets, which contain the following number of patches per class:


- train: 4,891,278
- test: 611,602
- val: 611,602
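
One way to check these counts against a downloaded split is to tally the class markers in the header lines. The sketch below assumes only the header layout given in the description; the function name and the file name in the usage note are hypothetical.

```python
from collections import Counter
from typing import Iterable

# Class markers as listed in the description
CLASS_NAMES = {0: "virus", 1: "bacteria", 2: "human / mammal"}


def count_patches_per_class(lines: Iterable[str]) -> Counter:
    """Count FASTA records per class marker from an iterable of lines."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith(">"):
            # The class marker is the field after the last '|'
            marker = int(line.strip().rsplit("|", 1)[1])
            counts[marker] += 1
    return counts
```

A possible usage, with `train.fasta` standing in for whatever the extracted training file is actually called: open the file, pass the file handle to `count_patches_per_class`, and print each count next to its `CLASS_NAMES` entry.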

Notes

The authors acknowledge the financial support by the Federal Ministry of Education and Research of Germany (BMBF) in the project deep.Health (project number 13FH770IX6).

Files

Files (118.9 MB)

md5:80de985c049f03e22df4ff916238c1d9 (118.9 MB)
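
The MD5 checksum listed above can be used to verify a download. The following is a small sketch using Python's standard `hashlib`; the function name is hypothetical, and the path of the downloaded file has to be supplied by the reader, since the file name is not shown here.

```python
import hashlib

# Checksum as listed in the file table
EXPECTED_MD5 = "80de985c049f03e22df4ff916238c1d9"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `md5_of_file(path)` with `EXPECTED_MD5` indicates whether the archive arrived intact.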