A Dataset of Vulnerable Code Changes of the Chromium OS Project
Description
This dataset is associated with the paper "Why Security Defects Go Unnoticed during Code Reviews? A Case-Control Study of the Chromium OS Project".
To cite this dataset, please use the following:
@inproceedings{paul-2021-ICSE,
  author    = {Paul, Rajshakhar and Turzo, Asif K. and Bosu, Amiangshu},
  title     = {Why Security Defects Go Unnoticed during Code Reviews? A Case-Control Study of the Chromium OS Project},
  booktitle = {Proceedings of the 43rd International Conference on Software Engineering},
  series    = {ICSE '21},
  year      = {2021},
  location  = {Madrid, Spain},
  pages     = {TBD},
  note      = {Acceptance rate = 138/602 (22\%)},
}
----------------------------------------------------------------------------------------------------------------
We conducted a case-control study of the Chromium OS project to identify the factors that differentiate code reviews that successfully identified security defects from those that missed such defects. We identified the cases and the controls based on our outcome of interest, namely whether a security defect was identified or escaped during the code review of a vulnerability-contributing commit (VCC). Using a keyword-based mining approach followed by manual validation on a dataset of 404,878 Chromium OS code reviews, we identified 516 code reviews that successfully identified security defects. In addition, from the Chromium OS bug repository, we identified 239 security defects that escaped code reviews. Using a modified version of the SZZ algorithm followed by manual validation, we identified 374 VCCs and the corresponding code reviews that approved those changes. For each of the 890 identified VCCs, we computed 25 different attributes that may influence the identification of a vulnerability during code review. Our artifact includes the locations of these 890 VCCs as well as the 25 attributes for each VCC. Among those 25 attributes, we considered 18 attributes to build our model.
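As an illustration of the keyword-based mining step described above, the sketch below flags review comments that mention security-related terms. The keyword list, helper names, and sample comments are hypothetical stand-ins, not the paper's actual configuration:

```python
import re

# Hypothetical keyword list; the study's actual keywords may differ.
SECURITY_KEYWORDS = [
    "overflow", "use-after-free", "xss", "injection",
    "out-of-bounds", "race condition", "sanitize", "cve",
]

# One case-insensitive pattern matching any keyword.
PATTERN = re.compile(
    "|".join(re.escape(k) for k in SECURITY_KEYWORDS), re.IGNORECASE
)

def mentions_security(comment: str) -> bool:
    """Return True if a review comment contains any security keyword."""
    return PATTERN.search(comment) is not None

# Toy review comments; only the last two should be flagged.
comments = [
    "Nit: rename this variable.",
    "This loop can read out-of-bounds when len == 0.",
    "Possible use-after-free if the callback fires late.",
]
flagged = [c for c in comments if mentions_security(c)]
```

In the study, hits from this kind of filter were then manually validated, since keyword matching alone produces false positives (e.g., "overflow" in a CSS layout discussion).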
To analyze our data, we developed a logistic regression model using these 18 attributes, following the guidelines suggested by Harrell Jr. The model, which achieved an AUC of 0.91, found nine code review metrics that distinguish code reviews that missed a vulnerability from those that did not. We have also made the R script to reproduce our results available in the GitHub repository. Our dataset includes 890 real-world vulnerable files, and each vulnerability is classified using the CWE specification. We envision several ways researchers may benefit from this artifact, such as evaluating static analysis tools, training machine learning models, and replicating this study in a different context.
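The actual analysis is the R script in the repository; purely as an illustration of the modeling idea (fit a logistic regression on review attributes, then score it with AUC), here is a minimal pure-Python sketch. The single synthetic feature and all names here are invented stand-ins, not the paper's 18 attributes or its results:

```python
import math
import random

def _sigmoid(z: float) -> float:
    z = max(min(z, 30.0), -30.0)  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression (bias + weights) by stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = _sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = p - yi
            b -= lr * err
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
    return w, b

def predict_proba(w, b, xi):
    return _sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))

def auc(scores, labels):
    """AUC = probability a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Synthetic stand-in data: one feature loosely separating cases from controls.
random.seed(0)
X = [[random.gauss(1.0, 0.5)] for _ in range(50)] + \
    [[random.gauss(-1.0, 0.5)] for _ in range(50)]
y = [1] * 50 + [0] * 50

w, b = fit_logistic(X, y)
scores = [predict_proba(w, b, xi) for xi in X]
```

Harrell's guidelines cover much more than fitting (e.g., restricted cubic splines and multicollinearity checks via variance inflation); this sketch only shows the fit-and-score core.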
Files
(332.1 kB total; includes code-description.md)

MD5 checksum | Size
---|---
md5:408ee4e6bf3be4ae0d761c6258d63430 | 1.3 kB
md5:b844c7046c5ced06a31d08b4a2804bc3 | 3.8 kB
md5:960d90fae9b2b35e540e03f442f75b5f | 8.2 kB
md5:68b3addf796cd66ad40f4308f7e79023 | 317.6 kB
md5:20d5342b1d415f9442ce9fc15e09a4b3 | 1.1 kB
md5:501aa2f7e907e370a721dec05400aae3 | 100 Bytes