Published January 29, 2026 | Version v1
Dataset Open

TFBindFormer_dataset

Description

This dataset contains the training, validation, and test data used for transcription factor (TF)–DNA binding prediction in the TFBindFormer framework. It integrates genomic DNA sequence bins with transcription factor protein information and curated metadata to support reproducible model training and evaluation.

DNA Sequence Data (dna_data/)

The dna_data directory contains one-hot–encoded genomic DNA sequence bins, organized into three mutually exclusive splits:

  • train/ – training data
    • train_oneHot.npy
    • train_oneHot.mat
    • train_labels.npy
  • val/ – validation data
    • valid_oneHot.npy
    • valid_oneHot.mat
    • valid_labels.npy
  • test/ – held-out test data
    • test_oneHot.npy
    • test_oneHot.mat

DNA sequences are one-hot encoded and stored in both NumPy (.npy) and MATLAB (.mat) formats so they can be reused across computational environments. Label files, listed above for the train/ and val/ splits, indicate TF binding status for the corresponding DNA bins.
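As a rough illustration of how a split could be read, the sketch below loads the one-hot array and, where one exists, the label array for a given split. The function name load_split and the extraction path "dna_data" are hypothetical; only the directory and file names come from the listing above (note that the val/ directory uses the valid_ file prefix). Array shapes and dtypes are not documented here, so the sketch reports them rather than assuming them.

```python
from pathlib import Path

import numpy as np


def load_split(dna_root, split_dir, prefix):
    """Return (one_hot, labels) for one split; labels is None if no label file exists.

    dna_root  : path to the extracted dna_data directory (assumed location)
    split_dir : "train", "val", or "test"
    prefix    : file prefix used inside that directory ("train", "valid", "test")
    """
    folder = Path(dna_root) / split_dir
    # mmap_mode="r" maps the file read-only instead of loading it all into RAM.
    one_hot = np.load(folder / f"{prefix}_oneHot.npy", mmap_mode="r")
    label_path = folder / f"{prefix}_labels.npy"
    labels = np.load(label_path) if label_path.exists() else None
    return one_hot, labels


# Example usage (assumes the archive was extracted so that ./dna_data exists):
# for split_dir, prefix in [("train", "train"), ("val", "valid"), ("test", "test")]:
#     x, y = load_split("dna_data", split_dir, prefix)
#     print(split_dir, x.shape, x.dtype, None if y is None else y.shape)
```

Printing the shapes first is a cheap way to confirm how the one-hot and label arrays line up before writing any training code against them.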

Transcription Factor Data (tf_data/)

  • tf_sequence/: Amino-acid FASTA sequences for transcription factors.
  • tf_structure/: Protein structure files (PDB format) for transcription factors.
  • 3di_out/: Precomputed 3Di structural token sequences derived from TF structures.
  • tf_embeddings/: Precomputed TF embeddings generated from sequence and 3Di tokens.
  • metadata_tfbs.tsv: Metadata linking DNA samples with their corresponding transcription factors.
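The metadata table and the FASTA files can be inspected with the Python standard library alone. The following is a minimal sketch, not the loader used by TFBindFormer: read_tsv and read_fasta are hypothetical helper names, the column names of metadata_tfbs.tsv are not documented here (the code assumes only that the first row is a header), and the individual file names inside tf_sequence/ are likewise not listed.

```python
import csv


def read_tsv(path):
    """Return (column_names, rows) from a tab-separated file with a header row."""
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        rows = list(reader)
        return reader.fieldnames, rows


def read_fasta(path):
    """Return {header: sequence} for every record in a FASTA file."""
    records = {}
    header = None
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Keep the full header line (minus ">") as the record key.
                header = line[1:].strip()
                records[header] = []
            elif header is not None:
                # Sequence lines may be wrapped; collect and join them later.
                records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Example usage (paths assume the archive was extracted into ./tf_data):
# columns, rows = read_tsv("tf_data/metadata_tfbs.tsv")
# print(columns, len(rows))
```

Listing the column names first shows which field links a DNA sample to its transcription factor before any join is attempted.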

Intended Use

This dataset is intended for:

  • Training and evaluating TF–DNA binding prediction models
  • Studying protein-conditioned DNA binding specificity
  • Reproducing the experiments reported in the associated TFBindFormer study

The dataset is provided for research and academic use and supports full reproducibility of model training and evaluation.

Files

Files (3.7 GB)

md5:5de5debaf8225fa6f856c27081fa963d
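Because the download is large, it may be worth checking it against the MD5 checksum listed above before extracting. The sketch below hashes a file in chunks so the whole 3.7 GB archive never has to sit in memory. The helper names and the archive path in the usage comment are hypothetical, since the archive file name is not shown here.

```python
import hashlib

# Checksum as listed in the Files section of this record.
EXPECTED_MD5 = "5de5debaf8225fa6f856c27081fa963d"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


# Example usage (replace the placeholder with the path to your downloaded archive):
# print(verify_download("path/to/downloaded_archive"))
```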