Published April 15, 2024 | Version 1.0
Dataset Open

A collection of datasets for software vulnerability detection

  • 1. Luxembourg Institute of Science and Technology

Description

This is a collection of datasets used for AI-based software vulnerability detection. All datasets are in .csv format, and each row represents one sample. Each dataset consists of functions written in C, and each function is labeled with a target of either 0 (non-vulnerable) or 1 (vulnerable).

  1. data_C_Lin2017_test.csv:
  2. data_C_LineVul_test.csv:
  3. data_C_PrimeVul_test.csv:
  4. data_C_Choi2017_test.csv:
  5. data_C_Devign_test.csv:
  6. data_C_Ours_{train,test}.csv:
    • This dataset was manually collected from GitHub projects that have CVEs registered in the NVD from 2002 to 2023. The 6,766 non-vulnerable functions were extracted from the DiverseVul dataset to increase code diversity.
    • The training set includes 5,413 vulnerable and 5,413 non-vulnerable functions.
    • The test set includes 1,353 vulnerable and 1,353 non-vulnerable functions.
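As a quick sanity check on the class balance described above, the following minimal sketch counts the labels in one of the CSV files using only the Python standard library. The column name `target` is an assumption based on the wording of the description, not a confirmed header, so inspect the first line of the file before relying on it. The helper names `count_labels` and `is_balanced` are hypothetical.

```python
import csv
from collections import Counter

# C functions can be long; raise the per-field size limit above the default.
csv.field_size_limit(2**31 - 1)


def count_labels(path, label_column="target"):
    """Count samples per label (0 = non-vulnerable, 1 = vulnerable).

    `label_column` is an assumed column name; check the CSV header first.
    """
    counts = Counter()
    # newline="" lets the csv module handle line breaks inside quoted code.
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            counts[int(row[label_column])] += 1
    return counts


def is_balanced(counts):
    """Return True when both classes have the same number of samples."""
    return counts[0] == counts[1]
```

For example, `is_balanced(count_labels("data_C_Ours_train.csv"))` should return True if the file matches the description and the assumed column name is correct.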

Files

Files (137.7 MB)

Checksum Size
md5:96db280b2c96c0be69dd2fffb0e45cbb 4.2 MB
md5:c7e5f9295525f0cd22c13b8ec86a176f 57.2 MB
md5:ba02686f97e9401a1f592b537cd981d5 933.6 kB
md5:1efe630c00ca0479d4d6be77d774b7d4 17.3 MB
md5:69d396eae8e7635c4ee4dbc32fc0bd5f 4.8 MB
md5:2aadc4829e0ffadbce8a31e013dc83bb 19.4 MB
md5:b7ef6cca7bb8b4c23024901b0a016e16 33.7 MB
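The md5 checksums listed above can be used to confirm that a download is intact. The sketch below is one way to do that with Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_listed_checksum` are hypothetical, and because this listing does not pair checksums with file names, which checksum belongs to which file has to be taken from the record page itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a listed value such as 'md5:96db28...'."""
    # Accept the value with or without the 'md5:' prefix used in the listing.
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

MD5 is adequate here for detecting corrupted or truncated downloads; it is not a guarantee against deliberate tampering.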

Additional details

Funding

European Commission
LAZARUS - pLatform for Analysis of Resilient and secUre Software 101070303

Dates

Collected
2024-04-15

References

  • Guanjun Lin, Jun Zhang, Wei Luo, Lei Pan, and Yang Xiang. 2017. POSTER: Vulnerability Discovery with Function Representation Learning from Unlabeled Projects. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS '17). Association for Computing Machinery, New York, NY, USA, 2539–2541. https://doi.org/10.1145/3133956.3138840
  • M. Fu and C. Tantithamthavorn. 2022. LineVul: A Transformer-based Line-Level Vulnerability Prediction. In Proceedings of the 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), Pittsburgh, PA, USA, 608–620. https://doi.org/10.1145/3524842.3528452
  • Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. 2024. Vulnerability Detection with Code Language Models: How Far Are We? arXiv preprint.
  • Min-Je Choi, Sehun Jeong, Hakjoo Oh, and Jaegul Choo. 2017. End-to-end prediction of buffer overruns from raw source code via neural memory networks. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI'17). AAAI Press, 1546–1553.
  • Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems. Curran Associates Inc., Red Hook, NY, USA, Article 915, 10197–10207.