Published April 15, 2024 | Version 1.0
Dataset
Open
A collection of datasets for software vulnerability detection
Description
This is a collection of datasets used for AI-based software vulnerability detection. All datasets are in .csv format, and each row represents one sample. Each dataset contains functions written in C, and each function is labeled with a target of either 0 (non-vulnerable) or 1 (vulnerable).
- data_C_Lin2017_test.csv:
- Reference paper: Vulnerability Discovery with Function Representation Learning from Unlabeled Projects, 2017.
- Data source on GitHub: https://github.com/DanielLin1986/function_representation_learning
- This dataset includes 44 vulnerable and 577 non-vulnerable functions from the LibPNG project.
- data_C_LineVul_test.csv:
- Reference paper: LineVul: A Transformer-based Line-Level Vulnerability Prediction, 2022.
- Data source on Hugging Face: https://huggingface.co/datasets/Partha117/LineVul_Test_Dataset
- This dataset includes 1055 vulnerable and 17809 non-vulnerable functions.
- data_C_PrimeVul_test.csv:
- Reference paper: Vulnerability Detection with Code Language Models: How Far Are We? 2024.
- Data source on GitHub: https://github.com/DLVulDet/PrimeVul
- From the data source, primevul_test.jsonl was used to create this dataset.
- This dataset includes 695 vulnerable and 25213 non-vulnerable functions.
- data_C_Choi2017_test.csv:
- Reference paper: End-to-End Prediction of Buffer Overruns from Raw Source Code via Neural Memory Networks, 2017.
- Data source on GitHub: https://github.com/mjc92/buffer_overrun_memory_networks
- From GitHub, all the data in trainnig_100.txt, test_1_100.txt, test_2_100.txt, test_3_100.txt, test_4_100.txt, and the corresponding _labels.txt files were combined to create this dataset.
- This dataset includes 7054 vulnerable and 6946 non-vulnerable functions.
- data_C_Devign_test.csv:
- Reference paper: Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks, 2019.
- Data source on Hugging Face: https://huggingface.co/datasets/claudios/code_x_glue_devign
- From Hugging Face, all the data in train, validation, and test are combined to create this dataset.
- This dataset includes 12460 vulnerable and 14858 non-vulnerable functions.
- data_C_Ours_{train,test}.csv:
- This dataset was manually collected from GitHub projects with CVEs registered in the NVD from 2002 to 2023. The 6,766 non-vulnerable functions were extracted from the DiverseVul dataset to increase code diversity.
- The training set includes 5413 vulnerable and 5413 non-vulnerable functions.
- The test set includes 1353 vulnerable and 1353 non-vulnerable functions.
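As a quick sanity check, the per-file counts listed above can be compared against what a downloaded CSV actually contains. The following is a minimal sketch using only the Python standard library. The label column name `target` is an assumption, since the column headers are not documented here, and the helper names are hypothetical; adjust them to match the actual files.

```python
import csv
from collections import Counter

# Counts stated in the description above: (vulnerable, non-vulnerable).
DOCUMENTED_COUNTS = {
    "data_C_Lin2017_test.csv": (44, 577),
    "data_C_LineVul_test.csv": (1055, 17809),
    "data_C_PrimeVul_test.csv": (695, 25213),
    "data_C_Choi2017_test.csv": (7054, 6946),
    "data_C_Devign_test.csv": (12460, 14858),
    "data_C_Ours_train.csv": (5413, 5413),
    "data_C_Ours_test.csv": (1353, 1353),
}


def count_labels(path, label_column="target"):
    """Return (vulnerable, non_vulnerable) counts for one CSV file.

    `label_column` is an assumed header name, not confirmed by the source.
    """
    # Whole C functions can exceed the csv module's default field size limit.
    csv.field_size_limit(2**31 - 1)
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            counts[int(row[label_column])] += 1
    return counts[1], counts[0]


def matches_documentation(path, file_name, label_column="target"):
    """Check whether a file's label counts equal the documented ones."""
    return count_labels(path, label_column) == DOCUMENTED_COUNTS[file_name]
```

For example, `matches_documentation("data_C_Lin2017_test.csv", "data_C_Lin2017_test.csv")` would be expected to return `True` if the local copy is complete and the assumed column name is right.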
Files
(137.7 MB)
| MD5 checksum | Size |
|---|---|
| 96db280b2c96c0be69dd2fffb0e45cbb | 4.2 MB |
| c7e5f9295525f0cd22c13b8ec86a176f | 57.2 MB |
| ba02686f97e9401a1f592b537cd981d5 | 933.6 kB |
| 1efe630c00ca0479d4d6be77d774b7d4 | 17.3 MB |
| 69d396eae8e7635c4ee4dbc32fc0bd5f | 4.8 MB |
| 2aadc4829e0ffadbce8a31e013dc83bb | 19.4 MB |
| b7ef6cca7bb8b4c23024901b0a016e16 | 33.7 MB |
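The MD5 checksums listed above can be used to confirm that a download is intact. The sketch below is one possible way to do this with Python's `hashlib`. Because the listing here does not pair each checksum with a file name, the hypothetical helper `is_listed_checksum` only checks whether a file's digest appears among the published values.

```python
import hashlib

# MD5 values copied from the file listing above.
LISTED_MD5 = {
    "96db280b2c96c0be69dd2fffb0e45cbb",
    "c7e5f9295525f0cd22c13b8ec86a176f",
    "ba02686f97e9401a1f592b537cd981d5",
    "1efe630c00ca0479d4d6be77d774b7d4",
    "69d396eae8e7635c4ee4dbc32fc0bd5f",
    "2aadc4829e0ffadbce8a31e013dc83bb",
    "b7ef6cca7bb8b4c23024901b0a016e16",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_listed_checksum(path):
    """Return True if the file's MD5 matches any published checksum."""
    return md5_of_file(path) in LISTED_MD5
```

Reading in chunks keeps memory use flat even for the largest file in the record.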
Additional details
Dates
- Collected: 2024-04-15
References
- Guanjun Lin, Jun Zhang, Wei Luo, Lei Pan, and Yang Xiang. 2017. POSTER: Vulnerability Discovery with Function Representation Learning from Unlabeled Projects. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS '17). Association for Computing Machinery, New York, NY, USA, 2539–2541. https://doi.org/10.1145/3133956.3138840
- M. Fu and C. Tantithamthavorn. 2022. LineVul: A Transformer-based Line-Level Vulnerability Prediction. In 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), Pittsburgh, PA, USA, 608–620. https://doi.org/10.1145/3524842.3528452
- Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. 2024. Vulnerability Detection with Code Language Models: How Far Are We? arXiv preprint.
- Min-Je Choi, Sehun Jeong, Hakjoo Oh, and Jaegul Choo. 2017. End-to-end prediction of buffer overruns from raw source code via neural memory networks. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI'17). AAAI Press, 1546–1553.
- Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: effective vulnerability identification by learning comprehensive program semantics via graph neural networks. Proceedings of the 33rd International Conference on Neural Information Processing Systems. Curran Associates Inc., Red Hook, NY, USA, Article 915, 10197–10207.