There is a newer version of the record available.

Published August 2, 2021 | Version 1
Dataset Open

Data set for "Improved data sets and evaluation methods for the automatic prediction of DNA-binding proteins"

  • 1. Two Six Technologies
  • 2. Florida Atlantic University
  • 3. Duke University

Description

Data set and results for "Improved data sets and evaluation methods for the automatic prediction of DNA-binding proteins"

The file "dna_binding_protein_sequences.zip" contains the training and testing sets from the paper:

RLL - "random_<train/test>_full_1000.csv"
RSL - "random_<train/test>_50.csv"
RS&LL - "random_<train/test>_50_1000.csv"
RLL restricted to positive examples with verified DNA-binding activity - "random_<train/test>_hq_1000.csv"
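
The naming pattern above can be expanded programmatically. The following is a minimal sketch, not code from the paper: the label "RLL-HQ" and the helper names `dataset_filename` and `read_split` are hypothetical, the CSV files are assumed to be UTF-8 encoded, and the internal folder layout of the archive is not documented here, so the sketch searches the archive for a member with a matching base name. Rows are returned as raw lists of strings because the column layout is not described in this record.

```python
import csv
import io
import zipfile
from pathlib import PurePosixPath

# Suffixes copied from the file-name patterns listed in the description.
# "RLL-HQ" is a hypothetical label for the verified-positives variant of RLL.
DATASET_SUFFIXES = {
    "RLL": "full_1000",
    "RSL": "50",
    "RS&LL": "50_1000",
    "RLL-HQ": "hq_1000",
}


def dataset_filename(label, split):
    """Expand the "random_<train/test>_<suffix>.csv" pattern for one data set."""
    if split not in ("train", "test"):
        raise ValueError(f"split must be 'train' or 'test', got {split!r}")
    return f"random_{split}_{DATASET_SUFFIXES[label]}.csv"


def read_split(zip_path, label, split):
    """Return all CSV rows (header included, if any) for one split.

    Assumption: the archive layout is unknown, so any member whose base
    name matches the expected file name is accepted.
    """
    target = dataset_filename(label, split)
    with zipfile.ZipFile(zip_path) as archive:
        matches = [
            name for name in archive.namelist()
            if PurePosixPath(name).name == target
        ]
        if not matches:
            raise FileNotFoundError(f"{target} not found in {zip_path}")
        with archive.open(matches[0]) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.reader(text))
```

For example, `read_split("dna_binding_protein_sequences.zip", "RSL", "train")` would look for "random_train_50.csv" anywhere inside the archive.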

The results files are named similarly.
The species data sets are derived from "uniprot_data_bac.tab" and "uniprot_data_not_bac.tab". See the code for details.

The ESM embeddings used by the XGBoost model are in "dna_binding_protein_esm.zip".

Files

dna_binding_protein_esm.zip

Files (3.2 GB)

Checksum Size
md5:9ad5d9336c0f85020f779ba258240ab1 2.9 GB
md5:052bb1d36bc91a1f34f95f8f8350e550 12.2 MB
md5:69665bf4907529b31c4757e34a1fcd6d 353.4 MB
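
The md5 checksums listed above can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard `hashlib`. The helper names `md5_of_file` and `matches_checksum` are hypothetical, and the listing does not say which checksum belongs to which file, so the expected value has to be matched to the file by the user. The file is read in chunks so that the multi-gigabyte archive does not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the hex md5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.strip().lower()
    if expected_hex.startswith("md5:"):
        expected_hex = expected_hex[len("md5:"):]
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum("dna_binding_protein_esm.zip", "md5:<value from the listing>")` returns True only if the local file hashes to the given value.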

Additional details

Related works

Is referenced by
Preprint: 10.1101/2021.04.09.439184 (DOI)