UniProt datasets for training taxonomic classification
Creators
- 1. Center for Bio-Medical image and Information processing (CBMI), HTW University of Applied Sciences, Berlin, Germany
Description
The data is based on the UniProt-Swiss-Prot release 2020-04 dataset and contains data derived from amino acid sequences of human, bacterial and viral origin. From each original sequence we created multiple patches of length 100 using a sliding window. The data is stored in the FASTA format according to
>{ID}_{patch index}|{class marker}
sequence
with
ID - the ID of the original sequence in the UniProt-Swiss-Prot dataset
patch index - the starting index of the given patch within the original sequence
sequence - a patch of length 100 taken from the original amino acid sequence
class marker - the taxonomic class:
0 - virus
1 - bacteria
2 - human / mammal
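The record layout above can be illustrated with a minimal sketch that cuts a sequence into patches, formats them as FASTA records, and parses a header back into its fields. The function names, the `stride` parameter, the zero-based start index, and the ID `P12345` are assumptions made for illustration; the description does not state the step size of the sliding window or the indexing convention.

```python
from typing import Iterator, NamedTuple, Tuple

PATCH_LEN = 100  # patch length stated in the dataset description

# Class markers as listed in the description.
CLASS_NAMES = {0: "virus", 1: "bacteria", 2: "human / mammal"}


class Patch(NamedTuple):
    seq_id: str
    patch_index: int
    class_marker: int
    sequence: str


def make_patches(seq_id: str, sequence: str, class_marker: int,
                 stride: int = 1) -> Iterator[Patch]:
    """Yield fixed-length patches from a sliding window.

    The stride and the zero-based start index are assumptions.
    Sequences shorter than PATCH_LEN yield no patches.
    """
    for start in range(0, len(sequence) - PATCH_LEN + 1, stride):
        yield Patch(seq_id, start, class_marker,
                    sequence[start:start + PATCH_LEN])


def format_record(patch: Patch) -> str:
    """Format a patch as '>{ID}_{patch index}|{class marker}' plus sequence."""
    header = f">{patch.seq_id}_{patch.patch_index}|{patch.class_marker}"
    return f"{header}\n{patch.sequence}"


def parse_header(header: str) -> Tuple[str, int, int]:
    """Split a header line into (ID, patch index, class marker).

    Splitting from the right keeps IDs that contain '_' intact.
    """
    body = header.strip().lstrip(">")
    left, marker = body.rsplit("|", 1)
    seq_id, index = left.rsplit("_", 1)
    return seq_id, int(index), int(marker)


# Toy example: a made-up 120-residue sequence with a placeholder ID.
toy_sequence = "ACDEFGHIKLMNPQRSTVWY" * 6
patches = list(make_patches("P12345", toy_sequence, 0, stride=10))
first_header = format_record(patches[0]).splitlines()[0]
print(first_header, "->", CLASS_NAMES[parse_header(first_header)[2]])
```

Parsing from the right with `rsplit` is a defensive choice, so that an underscore inside the ID is not mistaken for the separator before the patch index.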
The data is split into training, test and validation sets, which contain the following number of patches per class:
- train: 4,891,278
- test: 611,602
- val: 611,602
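The patch counts of a split can be checked by tallying the class markers in the header lines. The following is a minimal sketch that assumes the header layout described above; the function name and the sample records are made up for illustration, and no file name from the archive is assumed.

```python
from collections import Counter
from typing import Iterable


def count_patches_per_class(lines: Iterable[str]) -> Counter:
    """Count FASTA records per class marker, reading only header lines."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith(">"):
            # The class marker follows the last '|' in the header.
            marker = int(line.strip().rsplit("|", 1)[1])
            counts[marker] += 1
    return counts


# Made-up sample records (sequences shortened for readability).
sample_lines = [
    ">P12345_0|0", "ACDEFGHIKL",
    ">P12345_1|0", "CDEFGHIKLM",
    ">Q67890_0|1", "MNPQRSTVWY",
    ">A11111_5|2", "STVWYACDEF",
]
sample_counts = count_patches_per_class(sample_lines)
print(dict(sample_counts))
```

Because the function accepts any iterable of lines, an open file handle for one of the splits can be passed in directly, which avoids loading the whole file into memory.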
Files
Checksum | Size |
---|---|
md5:80de985c049f03e22df4ff916238c1d9 | 118.9 MB |