Data set for "Improved data sets and evaluation methods for the automatic prediction of DNA-binding proteins"
- 1. Two Six Technologies
- 2. Florida Atlantic University
- 3. Duke University
Description
Data set and results for "Improved data sets and evaluation methods for the automatic prediction of DNA-binding proteins"
The file "dna_binding_protein_sequences.zip" contains the training and testing sets from the paper:
RLL - "random_<train/test>_full_1000.csv"
RSL - "random_<train/test>_50.csv"
RS&LL - "random_<train/test>_50_1000.csv"
RLL in which the included positive examples have verified DNA-binding activity - "random_<train/test>_hq_1000.csv"
The results files are named similarly.
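The naming scheme above can be expanded into concrete file names with a short script. This is a minimal sketch, assuming that "<train/test>" stands for either "train" or "test"; the label "RLL-HQ" and the function name expected_filenames are hypothetical and used only for illustration. It does not assume anything about the directory layout inside the zip archive.

```python
# Data set labels and file name templates, copied from the list above.
# "RLL-HQ" is a hypothetical label for the RLL variant whose positive
# examples have verified DNA-binding activity; the description gives it
# no short name.
TEMPLATES = {
    "RLL": "random_{split}_full_1000.csv",
    "RSL": "random_{split}_50.csv",
    "RS&LL": "random_{split}_50_1000.csv",
    "RLL-HQ": "random_{split}_hq_1000.csv",
}


def expected_filenames(splits=("train", "test")):
    """Return {(label, split): file name} for every data set and split.

    Assumes "<train/test>" in the templates expands to "train" or "test".
    """
    return {
        (label, split): template.format(split=split)
        for label, template in TEMPLATES.items()
        for split in splits
    }


if __name__ == "__main__":
    for (label, split), name in expected_filenames().items():
        print(f"{label:7s} {split:5s} {name}")
```

The resulting names can then be matched against the archive contents, for example by comparing them with the base names returned by zipfile.ZipFile.namelist().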
The species data sets are derived from "uniprot_data_bac.tab" and "uniprot_data_not_bac.tab"; see the code.
The ESM embeddings used by the XGBoost model are in "dna_binding_protein_esm.zip".
Files
dna_binding_protein_esm.zip
Additional details
Related works
- Is referenced by
- Preprint: 10.1101/2021.04.09.439184 (DOI)