TFBindFormer_dataset
Description
This dataset contains the training, validation, and test data used for transcription factor (TF)–DNA binding prediction in the TFBindFormer framework. It integrates genomic DNA sequence bins with transcription factor protein information and curated metadata to support reproducible model training and evaluation.
DNA Sequence Data (dna_data/)
The dna_data directory contains one-hot–encoded genomic DNA sequence bins, organized into three mutually exclusive splits:
- train/ – training data
  - train_oneHot.npy
  - train_oneHot.mat
  - train_labels.npy
- val/ – validation data
  - valid_oneHot.npy
  - valid_oneHot.mat
  - valid_labels.npy
- test/ – held-out test data
  - test_oneHot.npy
  - test_oneHot.mat
DNA sequences are encoded in one-hot format and stored in both NumPy (.npy) and MATLAB (.mat) formats to facilitate reuse across different computational environments. Label files indicate TF binding status for the corresponding DNA bins.
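As a minimal sketch of how one split might be read from the NumPy files, the helper below builds the paths from the directory listing above and memory-maps the arrays so the full file need not be read into RAM. The function name `load_split`, the `root` argument, and the assumption that the first axis of both arrays indexes DNA bins are illustrative and not part of the dataset description; the array shapes and dtypes are not documented here, so inspect them after loading.

```python
from pathlib import Path

import numpy as np

# Directory name -> file prefix, following the listing above
# (the val/ directory uses the "valid_" prefix).
SPLIT_PREFIX = {"train": "train", "val": "valid", "test": "test"}


def load_split(root, split, mmap=True):
    """Return (one_hot, labels) for one split.

    labels is None when no label file is present in the split directory.
    """
    prefix = SPLIT_PREFIX[split]
    split_dir = Path(root) / "dna_data" / split
    mode = "r" if mmap else None  # "r" memory-maps the file read-only

    one_hot = np.load(split_dir / f"{prefix}_oneHot.npy", mmap_mode=mode)

    label_path = split_dir / f"{prefix}_labels.npy"
    labels = np.load(label_path, mmap_mode=mode) if label_path.exists() else None

    # Assumption: the first axis indexes DNA bins in both arrays.
    if labels is not None and one_hot.shape[0] != labels.shape[0]:
        raise ValueError(
            f"{split}: {one_hot.shape[0]} sequences but {labels.shape[0]} labels"
        )
    return one_hot, labels
```

A call such as `x, y = load_split("TFBindFormer_dataset", "train")` followed by `print(x.shape, x.dtype)` is a quick way to confirm the layout before training; the root directory name here is a guess based on the dataset title.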
Transcription Factor Data (tf_data/)
- tf_sequence/: Amino-acid FASTA sequences for transcription factors.
- tf_structure/: Protein structure files (PDB format) for transcription factors.
- 3di_out/: Precomputed 3Di structural token sequences derived from TF structures.
- tf_embeddings/: Precomputed TF embeddings generated from sequence and 3Di tokens.
- metadata_tfbs.tsv: Metadata linking DNA samples with their corresponding transcription factors.
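The FASTA sequences and the metadata table can be read with the Python standard library alone. The sketch below is illustrative: `read_fasta` and `read_metadata` are hypothetical helper names, the file names inside tf_sequence/ are not listed in this description, and the column names of metadata_tfbs.tsv are not documented here, so the reader returns whatever header row the file contains (assuming it has one).

```python
import csv


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def read_metadata(lines):
    """Return (column_names, rows) from tab-separated lines.

    Assumes the first line is a header row; each row is a dict keyed by column.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    rows = list(reader)
    return reader.fieldnames, rows
```

Both helpers accept any iterable of text lines, so an open file handle works directly, for example `with open("tf_data/metadata_tfbs.tsv", newline="") as fh: columns, rows = read_metadata(fh)`. Printing `columns` first shows which fields link DNA samples to transcription factors.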
Intended Use
This dataset is intended for:
- Training and evaluating TF–DNA binding prediction models
- Studying protein-conditioned DNA binding specificity
- Reproducing the experiments reported in the associated TFBindFormer study
The dataset is provided for research and academic use.
Files
| Size | Checksum |
|---|---|
| 3.7 GB | md5:5de5debaf8225fa6f856c27081fa963d |
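After downloading, the file can be checked against the MD5 value listed above. The following is a small sketch using only the standard library; `md5_of_file` is a hypothetical helper, and the path passed to it depends on where and under what name the download was saved, since the file name is not shown here.

```python
import hashlib

# MD5 checksum as listed in the Files section.
EXPECTED_MD5 = "5de5debaf8225fa6f856c27081fa963d"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `md5_of_file(downloaded_path) == EXPECTED_MD5` indicates whether the transfer completed intact. Reading in chunks keeps memory use flat for a 3.7 GB file.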