A collection of datasets for software vulnerability detection

Yuejun, Guo

doi:10.5281/zenodo.10975439

Published April 15, 2024 | Version 1.0

Dataset Open

Yuejun, Guo (Contact person)¹

1. Luxembourg Institute of Science and Technology

This is a collection of datasets that are used for AI-based software vulnerability detection. All the datasets are in the .csv format and each row represents a sample. Each dataset includes a set of functions written in C and the target of each function is either 0 (non-vulnerable) or 1 (vulnerable).
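As a minimal sketch of how one of these files might be read, the snippet below counts the samples per label using only the Python standard library. The description does not state the header names, so the `label_column` default of "target" is an assumption; check the header row of the file before relying on it.

```python
import csv
from collections import Counter


def count_labels(path, label_column="target"):
    """Count samples per label value in one of the .csv files.

    label_column is an assumed header name, not confirmed by the record;
    a KeyError is raised if the file's header does not contain it.
    """
    # Function bodies can be longer than the csv module's default field limit.
    csv.field_size_limit(10**9)
    counts = Counter()
    # newline="" lets the csv module handle line breaks inside quoted fields,
    # which matters because each row holds a multi-line C function.
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if label_column not in (reader.fieldnames or []):
            raise KeyError(
                f"column {label_column!r} not in header {reader.fieldnames}"
            )
        for row in reader:
            counts[int(row[label_column])] += 1
    return counts
```

For example, `count_labels("data_C_Lin2017_test.csv")` would return a `Counter` keyed by 0 (non-vulnerable) and 1 (vulnerable), provided the file is in the working directory and the label column really is named "target".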

data_C_Lin2017_test.csv:
- Reference paper: Vulnerability Discovery with Function Representation Learning from Unlabeled Projects, 2017.
- Data source on GitHub: https://github.com/DanielLin1986/function_representation_learning
- This dataset includes 44 vulnerable and 577 non-vulnerable functions from the LibPNG project.
data_C_LineVul_test.csv:
- Reference paper: LineVul: A Transformer-based Line-Level Vulnerability Prediction, 2022.
- Data source on Hugging Face: https://huggingface.co/datasets/Partha117/LineVul_Test_Dataset
- This dataset includes 1055 vulnerable and 17809 non-vulnerable functions.
data_C_PrimeVul_test.csv:
- Reference paper: Vulnerability Detection with Code Language Models: How Far Are We? 2024.
- Data source on GitHub: https://github.com/DLVulDet/PrimeVul
- From the data source, primevul_test.jsonl was used to create this dataset.
- This dataset includes 695 vulnerable and 25213 non-vulnerable functions.
data_C_Choi2017_test.csv:
- Reference paper: End-to-End Prediction of Buffer Overruns from Raw Source Code via Neural Memory Networks, 2017.
- Data source on GitHub: https://github.com/mjc92/buffer_overrun_memory_networks
- From GitHub, all the data in trainnig_100.txt, test_1_100.txt, test_2_100.txt, test_3_100.txt, test_4_100.txt, and the corresponding _labels.txt files are combined to create this dataset.
- This dataset includes 7054 vulnerable and 6946 non-vulnerable functions.
data_C_Devign_test.csv:
- Reference paper: Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks, 2019.
- Data source on Hugging Face: https://huggingface.co/datasets/claudios/code_x_glue_devign
- From Hugging Face, all the data in train, validation, and test are combined to create this dataset.
- This dataset includes 12460 vulnerable and 14858 non-vulnerable functions.
data_C_Ours_{train,test}.csv:
- This dataset was manually collected from GitHub projects that have CVEs registered in the NVD from 2002 to 2023. The 6,766 non-vulnerable functions are extracted from the DiverseVul dataset to increase code diversity.
- The training set includes 5413 vulnerable and 5413 non-vulnerable functions.
- The test set includes 1353 vulnerable and 1353 non-vulnerable functions.
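
The class balance differs a lot between these files, which matters when comparing detection metrics across them. The sketch below recomputes totals and vulnerable shares from the counts stated in the list above; the names `COUNTS`, `vulnerable_share`, and `summarize` are illustrative and not part of the record.

```python
# (vulnerable, non-vulnerable) counts copied from the dataset descriptions.
COUNTS = {
    "data_C_Lin2017_test.csv": (44, 577),
    "data_C_LineVul_test.csv": (1055, 17809),
    "data_C_PrimeVul_test.csv": (695, 25213),
    "data_C_Choi2017_test.csv": (7054, 6946),
    "data_C_Devign_test.csv": (12460, 14858),
    "data_C_Ours_train.csv": (5413, 5413),
    "data_C_Ours_test.csv": (1353, 1353),
}


def vulnerable_share(vulnerable, non_vulnerable):
    """Fraction of samples labeled 1 (vulnerable)."""
    return vulnerable / (vulnerable + non_vulnerable)


def summarize(counts):
    """Map each file name to (total samples, vulnerable share)."""
    return {
        name: (v + n, vulnerable_share(v, n))
        for name, (v, n) in counts.items()
    }


for name, (total, share) in summarize(COUNTS).items():
    print(f"{name:28s} total={total:6d} vulnerable={share:6.1%}")
```

By these numbers, Lin2017, LineVul, and PrimeVul each contain fewer than 10% vulnerable functions, Choi2017 and Devign are close to balanced, and both "Ours" files are exactly balanced. The "Ours" training file holds 5413 of the 6766 functions in each class, about 80%.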

Files

Files (137.7 MB)

Name	MD5	Size
data_C_Choi2017_test.csv	96db280b2c96c0be69dd2fffb0e45cbb	4.2 MB
data_C_Devign_test.csv	c7e5f9295525f0cd22c13b8ec86a176f	57.2 MB
data_C_Lin2017_test.csv	ba02686f97e9401a1f592b537cd981d5	933.6 kB
data_C_LineVul_test.csv	1efe630c00ca0479d4d6be77d774b7d4	17.3 MB
data_C_Ours_test.csv	69d396eae8e7635c4ee4dbc32fc0bd5f	4.8 MB
data_C_Ours_train.csv	2aadc4829e0ffadbce8a31e013dc83bb	19.4 MB
data_C_PrimeVul_test.csv	b7ef6cca7bb8b4c23024901b0a016e16	33.7 MB
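
The MD5 checksums listed above can be used to confirm that a download is complete and unmodified. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify` are illustrative, and the files are assumed to have been saved under their original names.

```python
import hashlib

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "data_C_Choi2017_test.csv": "96db280b2c96c0be69dd2fffb0e45cbb",
    "data_C_Devign_test.csv": "c7e5f9295525f0cd22c13b8ec86a176f",
    "data_C_Lin2017_test.csv": "ba02686f97e9401a1f592b537cd981d5",
    "data_C_LineVul_test.csv": "1efe630c00ca0479d4d6be77d774b7d4",
    "data_C_Ours_test.csv": "69d396eae8e7635c4ee4dbc32fc0bd5f",
    "data_C_Ours_train.csv": "2aadc4829e0ffadbce8a31e013dc83bb",
    "data_C_PrimeVul_test.csv": "b7ef6cca7bb8b4c23024901b0a016e16",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Check a local file against the checksum listed for the given file name."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `verify("data_C_Devign_test.csv", "data_C_Devign_test.csv")` returns `True` only if the local copy matches the listed checksum. MD5 is adequate here for detecting corrupted or truncated downloads; it is not a defense against deliberate tampering.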

Additional details

Funding
European Commission: LAZARUS - pLatform for Analysis of Resilient and secUre Software (101070303)

Dates
Collected: 2024-04-15

References
Guanjun Lin, Jun Zhang, Wei Luo, Lei Pan, and Yang Xiang. 2017. POSTER: Vulnerability Discovery with Function Representation Learning from Unlabeled Projects. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS '17). Association for Computing Machinery, New York, NY, USA, 2539–2541. https://doi.org/10.1145/3133956.3138840
M. Fu and C. Tantithamthavorn. 2022. LineVul: A Transformer-based Line-Level Vulnerability Prediction. In 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), Pittsburgh, PA, USA, 608–620. https://doi.org/10.1145/3524842.3528452
Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. 2024. Vulnerability Detection with Code Language Models: How Far Are We? arXiv preprint.
Min-Je Choi, Sehun Jeong, Hakjoo Oh, and Jaegul Choo. 2017. End-to-End Prediction of Buffer Overruns from Raw Source Code via Neural Memory Networks. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI'17). AAAI Press, 1546–1553.
Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems. Curran Associates Inc., Red Hook, NY, USA, Article 915, 10197–10207.

	All versions	This version
Views	2,312	2,312
Downloads	2,592	2,592
Data volume	50.6 GB	50.6 GB
