Published December 28, 2025 | Version v1
Dataset Open

BinaryCorp small train processed

Authors/Creators

Description

The processed BinaryCorp small train dataset for fine-tuning on binary code similarity detection.

This dataset is used in the paper: "Nova: Generative language models for assembly code with hierarchical attention and contrastive learning"

@inproceedings{
    jiang2025nova,
    title={Nova: Generative Language Models for Assembly Code with Hierarchical Attention and Contrastive Learning},
    author={Nan Jiang and Chengxiao Wang and Kevin Liu and Xiangzhe Xu and Lin Tan and Xiangyu Zhang and Petr Babkin},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025}
}

The dataset was originally obtained from the paper: "jTrans: jump-aware transformer for binary code similarity detection"

@inproceedings{10.1145/3533767.3534367,
    author = {Wang, Hao and Qu, Wenjie and Katz, Gilad and Zhu, Wenyu and Gao, Zeyu and Qiu, Han and Zhuge, Jianwei and Zhang, Chao},
    title = {jTrans: jump-aware transformer for binary code similarity detection},
    publisher = {Association for Computing Machinery}, 
    url = {https://doi.org/10.1145/3533767.3534367},
    doi = {10.1145/3533767.3534367},
    booktitle = {Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis},
    pages = {1--13},
    numpages = {13},
    series = {ISSTA 2022},
    year = {2022}
}


Files

BinaryCorp_small_train.zip

Size: 1.0 GB
md5:c7f35543b4fae05c2003e09dae43d4cb
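
Because the archive is about 1.0 GB, it can be worth checking a download against the MD5 checksum listed above before unpacking it. The following is a minimal sketch, not part of the dataset's own tooling: the helper names md5_of_file and verify_download are hypothetical, and the local path assumes the archive was saved under its original filename in the current directory.

```python
import hashlib
from pathlib import Path

# MD5 checksum as listed on this page for BinaryCorp_small_train.zip.
EXPECTED_MD5 = "c7f35543b4fae05c2003e09dae43d4cb"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so the whole archive never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed download location; adjust to wherever the archive was saved.
    archive = Path("BinaryCorp_small_train.zip")
    if not archive.exists():
        print(f"{archive} not found; download it first")
    elif verify_download(archive):
        print("checksum OK")
    else:
        print("checksum MISMATCH; the download may be incomplete or corrupted")
```

A mismatch most often points to an interrupted transfer, so downloading again is usually the first thing to try.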