Published September 19, 2021
| Version V3
Dataset
Restricted
ASE2021 vulnerability fix dataset
Authors/Creators
- 1. Huawei
- 2. Zhejiang University
- 3. Singapore Management University
- 4. Queen's University
Description
This is the dataset for the paper "Finding A Needle in a Haystack: Automated Mining of Silent Vulnerability Fixes", which was accepted at the 36th IEEE/ACM Automated Software Engineering (ASE) Conference.
The columns are described below:
- commit_id: The commit ID/hash.
- repo: The GitHub owner and repository (e.g., "apache/hive").
- filename: The name of the file changed in the commit.
- partition: Which dataset the commit information belongs to (i.e., "train", "val", or "test").
- PL: Programming Language (PL) (i.e., "java" or "py").
- label: Label of the commit: 0 for a non-vulnerability-fixing commit and 1 for a vulnerability-fixing commit.
- diff: The entire code change information of the file in this commit.
- committer_date: The date of the commit (e.g., 2015-03-02 13:48:25+13:00).
- msg: The commit message (NA if empty).
- MOD_DIFF: The code change of the file in this commit after preprocessing: only added and removed lines are kept, and refactoring information and comments are removed.
- BPE_MOD_DIFF: The MOD_DIFF information after BPE processing (using the codeprep Python package).
- ADD_DIFF: The added lines from the MOD_DIFF information (lines starting with the '+' character).
- REM_DIFF: The removed lines from the MOD_DIFF information (lines starting with the '-' character).
- LOC_ADD: Total lines of code added in this file change.
- LOC_REM: Total lines of code removed in this file change.
- LOC_MOD: Total lines of code modified in this file change (LOC_ADD + LOC_REM).
- commit_repo: The commit ID and repository concatenated.
- cve_list: A list of CVEs which the commit fixes (e.g., CVE-2015-5348, CVE-2016-8902).
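As a rough illustration of how the diff-related columns relate to one another, the sketch below splits a MOD_DIFF-style string into its added and removed lines and counts them. The function name `split_mod_diff` and the example string are hypothetical and not taken from the dataset, and the sketch assumes that every line of MOD_DIFF starts with '+' or '-'. Whether the dataset's LOC_ADD and LOC_REM were counted in exactly this way is not stated above, so treat the counts as an approximation of the idea rather than the authors' procedure.

```python
def split_mod_diff(mod_diff):
    """Split a preprocessed diff into added and removed lines.

    Assumption: each line starts with '+' or '-', as the column
    descriptions state; any other line is ignored.
    """
    lines = mod_diff.splitlines()
    added = [line for line in lines if line.startswith("+")]
    removed = [line for line in lines if line.startswith("-")]
    return added, removed


# Made-up example value, not taken from the dataset.
example_mod_diff = "-return name;\n+return escape(name);\n+log(name);"

added, removed = split_mod_diff(example_mod_diff)
loc_add, loc_rem = len(added), len(removed)
loc_mod = loc_add + loc_rem  # mirrors LOC_MOD = LOC_ADD + LOC_REM
print(loc_add, loc_rem, loc_mod)
```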
The following code snippet reproduces Table 1.
import pandas as pd
all_commits = pd.read_csv('./ase_dataset_sept_19_2021.csv')
# Separate by language, since the Java commits are missing some info which we will add later on.
py = all_commits[all_commits.PL == 'python']
java = all_commits[all_commits.PL == 'java']
# Java first: partition into train/val/test and check the number of commits
print("Java VF vs NVF for train/val/test")
java_train = java[java.partition == "train"]
java_val = java[java.partition == "val"]
java_test = java[java.partition == "test"]
print(java_train.drop_duplicates(subset='commit_id').label.value_counts())
print(java_val.drop_duplicates(subset='commit_id').label.value_counts())
print(java_test.drop_duplicates(subset='commit_id').label.value_counts())
# Python: partition into train/val/test and check the number of commits
print("Py VF vs NVF for train/val/test")
py_train = py[py.partition == "train"]
py_val = py[py.partition == "val"]
py_test = py[py.partition == "test"]
print(py_train.drop_duplicates(subset='commit_id').label.value_counts())
print(py_val.drop_duplicates(subset='commit_id').label.value_counts())
print(py_test.drop_duplicates(subset='commit_id').label.value_counts())
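The same per-partition counts can also be gathered into one table with a single group-by, which may be easier to compare against Table 1. The sketch below is only an illustration: `count_commits` is a hypothetical helper, and the toy frame is made up so that the example runs without the restricted CSV. It groups on whatever values the PL column actually holds, which avoids depending on whether Python commits are marked "py" (as in the column description) or 'python' (as in the snippet above).

```python
import pandas as pd


def count_commits(df):
    """Count unique commits per (PL, partition), one column per label."""
    # A commit touches several files, so keep one row per commit
    # within each language/partition before counting.
    unique_commits = df.drop_duplicates(subset=["PL", "partition", "commit_id"])
    return (
        unique_commits.groupby(["PL", "partition", "label"])
        .size()
        .unstack("label", fill_value=0)
    )


# Made-up rows for illustration only; not taken from the dataset.
toy = pd.DataFrame(
    {
        "commit_id": ["a", "a", "b", "c", "d", "d"],
        "PL": ["java", "java", "java", "java", "py", "py"],
        "partition": ["train", "train", "train", "test", "train", "train"],
        "label": [1, 1, 0, 0, 1, 1],
    }
)

table = count_commits(toy)
print(table)
```

With the real file, the call would be `count_commits(all_commits)`, assuming the column names listed in the description.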