

Published February 5, 2021 | Version 1.1
Dataset Open

A dataset of Vulnerable Code Changes of the Chromium OS project

  • Wayne State University

Description

This dataset is associated with the paper "Why Security Defects Go Unnoticed during Code Reviews? A Case-Control Study of the Chromium OS Project".

 

To cite this dataset, please use the following:

@inproceedings{paul-2021-ICSE,
author = {Paul, Rajshakhar and Turzo, Asif K. and Bosu, Amiangshu},
title = {Why Security Defects Go Unnoticed during Code Reviews? A Case-Control Study of the Chromium OS Project},
booktitle = {Proceedings of the 43rd International Conference on Software Engineering},
series = {ICSE'21},
year = {2021},
location = {Madrid, Spain},
pages = {TBD},
note={Acceptance rate = 138/602 (22%)},
}

----------------------------------------------------------------------------------------------------------------

We conducted a case-control study of the Chromium OS project to identify the factors that differentiate code reviews that successfully identified security defects from those that missed such defects. We identified the cases and the controls based on our outcome of interest, namely whether a security defect was identified or escaped during the code review of a vulnerability-contributing commit (VCC). Using a keyword-based mining approach followed by manual validations on a dataset of 404,878 Chromium OS code reviews, we identified 516 code reviews that successfully identified security defects. In addition, from the Chromium OS bug repository, we identified 239 security defects that escaped code reviews. Using a modified version of the SZZ algorithm followed by manual validations, we identified 374 VCCs and the corresponding code reviews that approved those changes. For each of the 890 identified VCCs, we computed 25 different attributes that may influence the identification of a vulnerability during code reviews. Our artifact includes the locations of these 890 VCCs as well as the 25 attributes for each VCC. Among those 25 attributes, we considered 18 attributes to build our model.
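The keyword-based mining step described above can be sketched as a simple filter over review comments, with flagged reviews then passed to manual validation. This is a minimal, hypothetical illustration: the keyword list and the review records below are illustrative, not the authors' actual mining pipeline.

```python
# Hypothetical sketch of keyword-based mining: flag code review comments
# that mention security-related terms for later manual validation.
# The keyword set is illustrative only.
SECURITY_KEYWORDS = {"overflow", "use-after-free", "xss", "injection",
                     "sanitize", "vulnerab", "exploit", "security"}

def mentions_security(comment: str) -> bool:
    """Return True if the comment contains any security keyword."""
    text = comment.lower()
    return any(kw in text for kw in SECURITY_KEYWORDS)

# Toy review records standing in for the real code review dataset.
reviews = [
    {"id": 101, "comment": "Possible integer overflow when parsing length."},
    {"id": 102, "comment": "Nit: rename this variable."},
]
flagged = [r["id"] for r in reviews if mentions_security(r["comment"])]
print(flagged)  # [101]
```

In the actual study, such automatically flagged reviews were only candidates; each one still went through manual validation before being counted as a successful identification.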

To analyze our data, we developed a logistic regression model using the 18 attributes, following the guidelines suggested by Harrell Jr. The model, which achieved an AUC of 0.91, found nine code review metrics that distinguish code reviews that missed a vulnerability from the ones that did not. We have also made the R script to reproduce our results available in the GitHub repository. Our dataset includes 890 real-world vulnerable files. Each of those vulnerabilities is classified using the CWE specification. We envision several ways researchers may benefit from this artifact, such as evaluating static analysis tools, training machine learning models, and replicating this study under a different context.
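The modeling step, fitting a logistic regression on per-VCC attributes and evaluating it by AUC, can be sketched as follows. This is an assumption-laden illustration in Python on synthetic data: the paper's actual analysis uses an R script (provided in the GitHub repository) with the real 18 attributes, and the feature names here are invented.

```python
# Hedged sketch of the analysis shape: logistic regression on review
# attributes, evaluated with AUC. Data and features are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 400
# Two made-up attributes standing in for the paper's 18 (e.g., churn,
# number of reviewers); real attribute names are in the dataset.
X = rng.normal(size=(n, 2))
# Synthetic outcome: 1 = vulnerability escaped the review, 0 = identified.
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

model = LogisticRegression().fit(X, y)
auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
print(f"AUC = {auc:.2f}")
```

On the real data, this kind of model reached an AUC of 0.91; the coefficients (here `model.coef_`) are what identify which review metrics separate the two groups.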

Files

code-description.md

Files (328.5 kB)

md5:408ee4e6bf3be4ae0d761c6258d63430 (1.3 kB)
md5:b844c7046c5ced06a31d08b4a2804bc3 (3.8 kB)
md5:4bd1b3fbfd472aab002ce06344aea8b0 (8.1 kB)
md5:981bb5b7ae8b3b69ec1912f56978e2cf (314.1 kB)
md5:20d5342b1d415f9442ce9fc15e09a4b3 (1.1 kB)
md5:501aa2f7e907e370a721dec05400aae3 (100 Bytes)