There is a newer version of the record available.

Published September 19, 2021 | Version V3
Dataset Restricted

ASE2021 vulnerability fix dataset

  • 1. Huawei
  • 2. Zhejiang University
  • 3. Singapore Management University
  • 4. Queen's University

Description

This is the dataset for "Finding A Needle in a Haystack: Automated Mining of Silent Vulnerability Fixes", which was accepted at the 36th IEEE/ACM International Conference on Automated Software Engineering (ASE 2021).

The columns are described below:

  1. commit_id: The commit ID/hash.
  2. repo: The GitHub owner and repository (e.g., "apache/hive").
  3. filename: The name of the file changed in the commit.
  4. partition: Which dataset the commit information belongs to (i.e., "train", "val", or "test").
  5. PL: Programming Language (PL) (i.e., "java" or "py").
  6. label: Label of the commit: 0 for a non-vulnerability-fixing commit and 1 for a vulnerability-fixing commit.
  7. diff: The entire code change information of the file in this commit.
  8. committer_date: The date of the commit (e.g., 2015-03-02 13:48:25+13:00).
  9. msg: The commit message (NA if empty).
  10. MOD_DIFF: The code change of the file in this commit after preprocessing, which keeps only added and removed lines and strips refactoring information and comments.
  11. BPE_MOD_DIFF: The MOD_DIFF information after byte-pair encoding (BPE), applied using the codeprep Python package.
  12. ADD_DIFF: The added lines from the MOD_DIFF information (lines starting with the '+' character).
  13. REM_DIFF: The removed lines from the MOD_DIFF information (lines starting with the '-' character).
  14. LOC_ADD: Total lines of code added in this file change.
  15. LOC_REM: Total lines of code removed in this file change.
  16. LOC_MOD: Total lines of code modified in this file change (LOC_ADD + LOC_REM).
  17. commit_repo: The commit ID and repository concatenated.
  18. cve_list: A list of CVEs which the commit fixes (e.g., CVE-2015-5348, CVE-2016-8902).
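The relationships between several of these columns can be illustrated with a minimal sketch. It assumes that MOD_DIFF is a newline-separated string whose lines start with '+' or '-', and that cve_list is stored as a comma-separated string; neither storage detail is confirmed by the description above. The helper names (`split_mod_diff`, `parse_cve_list`) and the sample values are hypothetical, and the authors' actual preprocessing may count lines differently.

```python
from datetime import datetime


def split_mod_diff(mod_diff):
    """Derive ADD_DIFF, REM_DIFF and LOC counts from a MOD_DIFF-style string.

    Assumption: one changed line per text line, prefixed with '+' or '-'.
    """
    lines = mod_diff.splitlines()
    added = [ln for ln in lines if ln.startswith("+")]
    removed = [ln for ln in lines if ln.startswith("-")]
    return {
        "ADD_DIFF": "\n".join(added),
        "REM_DIFF": "\n".join(removed),
        "LOC_ADD": len(added),
        "LOC_REM": len(removed),
        "LOC_MOD": len(added) + len(removed),  # LOC_ADD + LOC_REM
    }


def parse_cve_list(cve_list):
    """Split a comma-separated cve_list value into individual CVE IDs."""
    return [cve.strip() for cve in cve_list.split(",") if cve.strip()]


# Hypothetical values for a single row, for illustration only.
sample_mod_diff = (
    "+if (user == null) return;\n"
    "-process(user);\n"
    "+process(sanitize(user));"
)
derived = split_mod_diff(sample_mod_diff)

# The committer_date example from the column description parses directly.
commit_date = datetime.fromisoformat("2015-03-02 13:48:25+13:00")
cves = parse_cve_list("CVE-2015-5348, CVE-2016-8902")
```

The date parsing relies on `datetime.fromisoformat`, which accepts this format including the UTC offset in Python 3.7 and later.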

The following code snippet reproduces Table 1.

import pandas as pd

all_commits = pd.read_csv('./ase_dataset_sept_19_2021.csv')

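# Separate by language, since the Java commits are missing some information that will be added later.
# Python rows may be tagged "py" (as in the column description) or "python"; accept either.
py = all_commits[all_commits.PL.isin(['py', 'python'])]
java = all_commits[all_commits.PL == 'java']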

# Java first: partition into train/val/test and count commits per label.
# Each row describes one changed file, so deduplicate on commit_id before counting.
print("Java VF vs NVF for train/val/test")
java_train = java[java.partition == "train"]
java_val = java[java.partition == "val"]
java_test = java[java.partition == "test"]
print(java_train.drop_duplicates(subset='commit_id').label.value_counts())
print(java_val.drop_duplicates(subset='commit_id').label.value_counts())
print(java_test.drop_duplicates(subset='commit_id').label.value_counts())

# Python: partition into train/val/test and count commits per label.
print("Py VF vs NVF for train/val/test")
py_train = py[py.partition == "train"]
py_val = py[py.partition == "val"]
py_test = py[py.partition == "test"]
print(py_train.drop_duplicates(subset='commit_id').label.value_counts())
print(py_val.drop_duplicates(subset='commit_id').label.value_counts())
print(py_test.drop_duplicates(subset='commit_id').label.value_counts())
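As an alternative sketch, the same per-commit tally can be computed in a single pass with only the standard library, which may be convenient when pandas is not available. The function name `count_commits` and the sample rows are made up for illustration, and the sample omits most columns; this is not the authors' code.

```python
import csv
import io
from collections import Counter


def count_commits(csv_file):
    """Count unique commits per (PL, partition, label).

    Rows are per changed file, so each commit_id is counted once
    within its (PL, partition) group.
    """
    seen = set()
    counts = Counter()
    for row in csv.DictReader(csv_file):
        key = (row["PL"], row["partition"], row["commit_id"])
        if key in seen:
            continue
        seen.add(key)
        counts[(row["PL"], row["partition"], row["label"])] += 1
    return counts


# Made-up rows standing in for the real CSV (only the columns used here).
sample = io.StringIO(
    "commit_id,repo,filename,partition,PL,label\n"
    "a1,apache/hive,A.java,train,java,1\n"
    "a1,apache/hive,B.java,train,java,1\n"
    "b2,apache/hive,C.java,train,java,0\n"
    "c3,example/project,d.py,test,python,0\n"
)
counts = count_commits(sample)
for (pl, partition, label), n in sorted(counts.items()):
    print(pl, partition, label, n)
```

When reading the real file this way, cells in the diff column may exceed the csv module's default field size limit; `csv.field_size_limit` can be used to raise it.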

 

Files

Restricted

The record (https://zenodo.org/records/5565182) is publicly accessible, but its files are restricted.

Request access

Access to these files must be requested from the authors, who give the following condition for a request to be accepted:

Hi, please contact us for more information.