There is a newer version of the record available.

Published July 18, 2021 | Version v1.0.0
Dataset Open

CVEfixes Dataset: Automatically Collected Vulnerabilities and Their Fixes from Open-Source Software

  • 1. Simula Research Laboratory, Norway

Description

CVEfixes is a comprehensive vulnerability dataset that is automatically collected and curated from Common Vulnerabilities and Exposures (CVE) records in the public U.S. National Vulnerability Database (NVD). The goal is to support data-driven security research based on source code and source code metrics related to fixes for CVEs in the NVD by providing detailed information at different interlinked levels of abstraction, such as the commit-, file-, and method level, as well as the repository- and CVE level.

At the initial release, the dataset covers all published CVEs up to 9 June 2021. All open-source projects that were reported in CVE records in the NVD in this time frame and had publicly available git repositories were fetched and considered for the construction of this vulnerability dataset. The dataset is organized as a relational database and covers 5495 vulnerability-fixing commits in 1754 open-source projects for a total of 5365 CVEs in 180 different Common Weakness Enumeration (CWE) types. It includes the source code before and after the fix for 18249 files and 50322 functions.
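To illustrate how the interlinked levels of a relational layout like this can be traversed, the following minimal sketch builds a toy in-memory SQLite database and joins from the CVE level down to the method level. The table names, column names, and all inserted values are assumptions made up for illustration; the actual schema should be taken from the paper and the SQL dump.

```python
import sqlite3

# Hypothetical, heavily simplified schema: table and column names are
# assumptions for illustration, not the dataset's documented schema.
SCHEMA = """
CREATE TABLE cve (cve_id TEXT PRIMARY KEY, description TEXT);
CREATE TABLE fixes (cve_id TEXT, hash TEXT, repo_url TEXT);
CREATE TABLE commits (hash TEXT, repo_url TEXT, msg TEXT);
CREATE TABLE file_change (file_change_id INTEGER PRIMARY KEY, hash TEXT, filename TEXT);
CREATE TABLE method_change (method_change_id INTEGER PRIMARY KEY, file_change_id INTEGER, name TEXT);
"""


def build_toy_db():
    """Create an in-memory database holding one made-up CVE and its fix."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    repo = "https://example.org/repo.git"  # placeholder URL, not a real project
    conn.execute("INSERT INTO cve VALUES ('CVE-0000-0001', 'made-up example record')")
    conn.execute("INSERT INTO fixes VALUES ('CVE-0000-0001', 'abc123', ?)", (repo,))
    conn.execute("INSERT INTO commits VALUES ('abc123', ?, 'fix bounds check')", (repo,))
    conn.execute("INSERT INTO file_change VALUES (1, 'abc123', 'src/parser.c')")
    conn.executemany(
        "INSERT INTO method_change VALUES (?, ?, ?)",
        [(1, 1, "read_chunk"), (2, 1, "parse_header")],
    )
    conn.commit()
    return conn


def methods_changed_for_cve(conn, cve_id):
    """Walk CVE -> fix -> commit -> file -> method and return the matches."""
    query = """
        SELECT f.repo_url, fc.filename, mc.name
        FROM fixes AS f
        JOIN commits AS c ON c.hash = f.hash AND c.repo_url = f.repo_url
        JOIN file_change AS fc ON fc.hash = c.hash
        JOIN method_change AS mc ON mc.file_change_id = fc.file_change_id
        WHERE f.cve_id = ?
        ORDER BY fc.filename, mc.name
    """
    return conn.execute(query, (cve_id,)).fetchall()


if __name__ == "__main__":
    toy = build_toy_db()
    for row in methods_changed_for_cve(toy, "CVE-0000-0001"):
        print(row)
```

The same join pattern should carry over to the real database once the table and column names are replaced with those found in the dump.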

This repository includes the SQL dump of the dataset, as well as the JSON for the CVEs and the XML for the CWEs at the time of collection. The complete process has been documented in the paper "CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software", which is published in the Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE '21). You will find a copy of the paper in the Doc folder.
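One way to turn an SQL dump into a queryable database is sketched below with Python's standard `sqlite3` module. It assumes the dump is SQLite-compatible and either plain text or gzip-compressed; neither is stated on this record, so check the archive contents first. The helper names `load_sql_dump` and `list_tables` and all file paths are hypothetical.

```python
import gzip
import sqlite3
from pathlib import Path


def load_sql_dump(dump_path, db_path):
    """Execute every statement of an SQL dump against a SQLite database file.

    Assumption: the dump uses SQL that SQLite accepts.
    """
    dump_path = Path(dump_path)
    # Pick a gzip-aware opener when the file name ends in .gz.
    opener = gzip.open if dump_path.suffix == ".gz" else open
    with opener(dump_path, "rt", encoding="utf-8") as handle:
        script = handle.read()  # reads the whole dump into memory
    conn = sqlite3.connect(str(db_path))
    try:
        conn.executescript(script)
        conn.commit()
    finally:
        conn.close()


def list_tables(db_path):
    """Return the names of all tables in the database, sorted alphabetically."""
    conn = sqlite3.connect(str(db_path))
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Because this sketch reads the whole dump into memory, it is only suitable for trying the idea on a small file; for a dump of this size, piping it into the `sqlite3` command-line shell is likely more practical.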

Citation and Zenodo links

Please cite this work by referring to the published paper:

  • Guru Bhandari, Amara Naseer, and Leon Moonen. 2021. CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software. In Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE '21). ACM, 10 pages. https://doi.org/10.1145/3475960.3475985
@inproceedings{bhandari2021:cvefixes,
    title = {{CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software}},
    booktitle = {{Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE '21)}},
    author = {Bhandari, Guru and Naseer, Amara and Moonen, Leon},
    year = {2021},
    pages = {10},
    publisher = {{ACM}},
    doi = {10.1145/3475960.3475985},
    copyright = {Open Access},
    isbn = {978-1-4503-8680-7},
    language = {en}
}

The dataset has been released on Zenodo with DOI:10.5281/zenodo.4476563. The GitHub repository containing the code to automatically collect the dataset can be found at https://github.com/secureIT-project/CVEfixes, released with DOI:10.5281/zenodo.5111494.

Notes

This work has been financially supported by the Research Council of Norway through the secureIT project (RCN contract #288787).

Files

CVEfixes_v1.0.0.zip (1.1 GB)
md5:729aa041a4f8c466cf90de9c7d14fe41
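
The MD5 digest listed for the archive can be used to check that a download completed without corruption. The sketch below hashes a file in chunks, so the 1.1 GB archive does not need to fit in memory, and compares the result with the digest from this record. The function names and the local file path are assumptions.

```python
import hashlib

# Digest copied from the file listing on this record.
EXPECTED_MD5 = "729aa041a4f8c466cf90de9c7d14fe41"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True when the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Hypothetical local path; adjust to wherever the archive was saved.
    print(verify_download("CVEfixes_v1.0.0.zip"))
```

MD5 is adequate for detecting accidental corruption during transfer, but it is not a safeguard against deliberate tampering.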

Additional details

Related works

Is compiled by
Software: 10.5281/zenodo.5112935 (DOI)
Is documented by
Conference paper: 10.1145/3475960.3475985 (DOI)