TFBindFormer_dataset

Liu, Ping

doi:10.5281/zenodo.18362288

Published January 29, 2026 | Version v1

Dataset | Open

Liu, Ping (Data collector)

This dataset contains the training, validation, and test data used for transcription factor (TF)–DNA binding prediction in the TFBindFormer framework. It integrates genomic DNA sequence bins with transcription factor protein information and curated metadata to support reproducible model training and evaluation.

DNA Sequence Data (dna_data/)

The dna_data directory contains one-hot–encoded genomic DNA sequence bins, organized into three mutually exclusive splits:

train/ – training data

train_oneHot.npy
train_oneHot.mat
train_labels.npy

val/ – validation data

valid_oneHot.npy
valid_oneHot.mat
valid_labels.npy

test/ – held-out test data

test_oneHot.npy
test_oneHot.mat

DNA sequences are one-hot encoded and stored in both NumPy (.npy) and MATLAB (.mat) formats so they can be reused across computational environments. The label files (listed above for the training and validation splits) indicate TF binding status for the corresponding DNA bins.
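
The layout above can be read with a few lines of NumPy. The following is a minimal sketch, not part of the dataset: the helper name load_split and the SPLIT_PREFIX mapping are hypothetical, and it assumes the .npy files hold plain numeric arrays (memory mapping does not work for pickled object arrays). Array shapes and the one-hot channel order are not documented here, so inspect them after loading.

```python
from pathlib import Path

import numpy as np

# Directory name -> file prefix, following the listing above
# (note that the val/ directory uses the "valid_" prefix).
SPLIT_PREFIX = {"train": "train", "val": "valid", "test": "test"}


def load_split(dna_root, split):
    """Return (one_hot, labels) for one split of dna_data/.

    labels is None when the split directory has no *_labels.npy file.
    Hypothetical helper, written for illustration only.
    """
    prefix = SPLIT_PREFIX[split]
    split_dir = Path(dna_root) / split

    # mmap_mode="r" maps the file read-only instead of loading it all into
    # RAM. Assumption: the arrays were saved as plain numeric dtypes.
    one_hot = np.load(split_dir / f"{prefix}_oneHot.npy", mmap_mode="r")

    labels_path = split_dir / f"{prefix}_labels.npy"
    labels = np.load(labels_path, mmap_mode="r") if labels_path.exists() else None
    return one_hot, labels
```

For example, `x, y = load_split("dna_data", "train")` followed by `print(x.shape, x.dtype)` shows how the bins are laid out before any model code depends on that layout.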

Transcription Factor Data (tf_data/)

tf_sequence/: Amino-acid FASTA sequences for transcription factors.
tf_structure/: Protein structure files (PDB format) for transcription factors.
3di_out/: Precomputed 3Di structural token sequences derived from TF structures.
tf_embeddings/: Precomputed TF embeddings generated from sequence and 3Di tokens.
metadata_tfbs.tsv: Metadata linking DNA samples with their corresponding transcription factors.
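
The text-based files in tf_data/ can be read with the Python standard library alone. This is a rough sketch under stated assumptions: read_fasta and read_tsv are hypothetical helper names, the FASTA files are assumed to follow the usual ">" header convention, and metadata_tfbs.tsv is assumed to start with a header row (its column names are not documented here, so none are hard-coded).

```python
import csv


def read_fasta(path):
    """Parse a FASTA file into a dict mapping record ID to sequence."""
    records = {}
    current_id = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # Use the first whitespace-delimited token of the header as ID.
                header = line[1:].split()
                current_id = header[0] if header else ""
                records[current_id] = []
            elif current_id is not None:
                # Sequence lines may be wrapped, so collect and join them later.
                records[current_id].append(line)
    return {record_id: "".join(parts) for record_id, parts in records.items()}


def read_tsv(path):
    """Read a tab-separated file into a list of dicts keyed by header names.

    Assumption: the first row of the file is a header row.
    """
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

Printing `read_tsv(path)[0].keys()` on the metadata file is a quick way to find out which columns link DNA samples to transcription factors before writing any join logic.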

Intended Use

This dataset is intended for:

Training and evaluating TF–DNA binding prediction models
Studying protein-conditioned DNA binding specificity
Reproducing the experiments reported in the associated TFBindFormer study

The dataset is provided for research and academic use and supports full reproducibility of model training and evaluation.

Files (3.7 GB)

Name	Size	MD5
data.tar.gz	3.7 GB	5de5debaf8225fa6f856c27081fa963d
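
Because the archive is large, it may be worth checking the download against the MD5 checksum listed above before extracting it. The sketch below uses only hashlib from the standard library; the function names md5_of_file and verify_archive are hypothetical, and the local path to data.tar.gz is whatever location the file was saved to.

```python
import hashlib

# Checksum published alongside data.tar.gz in the file listing.
EXPECTED_MD5 = "5de5debaf8225fa6f856c27081fa963d"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file at path matches the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_archive("data.tar.gz")` returns False for a truncated or corrupted download, in which case the archive should be fetched again rather than extracted.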
