A Dataset of Vulnerable Code Changes of the Chromium OS Project
Description
This dataset is associated with the paper "Why Security Defects Go Unnoticed during Code Reviews? A Case-Control Study of the Chromium OS Project".
To cite this dataset, please use the following:
@inproceedings{paul-2021-ICSE,
  author    = {Paul, Rajshakhar and Turzo, Asif K. and Bosu, Amiangshu},
  title     = {Why Security Defects Go Unnoticed during Code Reviews? A Case-Control Study of the Chromium OS Project},
  booktitle = {Proceedings of the 43rd International Conference on Software Engineering},
  series    = {ICSE '21},
  year      = {2021},
  location  = {Madrid, Spain},
  pages     = {TBD},
  note      = {Acceptance rate = 138/602 (22\%)},
}
----------------------------------------------------------------------------------------------------------------
We conducted a case-control study of the Chromium OS project to identify the factors that differentiate code reviews that successfully identified security defects from those that missed such defects. We identified the cases and the controls based on our outcome of interest, namely whether a security defect was identified or escaped during the code review of a vulnerability-contributing commit (VCC). Using a keyword-based mining approach followed by manual validation on a dataset of 404,878 Chromium OS code reviews, we identified 516 code reviews that successfully identified security defects. In addition, from the Chromium OS bug repository, we identified 239 security defects that escaped code reviews. Using a modified version of the SZZ algorithm followed by manual validation, we identified 374 VCCs and the corresponding code reviews that approved those changes. For each of the 890 identified VCCs, we computed 25 different attributes that may influence the identification of a vulnerability during code review. Our artifact includes the locations of these 890 VCCs as well as the 25 attributes for each VCC. Among those 25 attributes, we considered 18 attributes to build our model.
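As an illustration of the keyword-based mining step described above, the sketch below flags review comments that mention security-related terms. The keyword list, helper names, and sample comments are hypothetical stand-ins, not the paper's actual configuration:

```python
import re

# Hypothetical keyword list; the study's actual keywords may differ.
SECURITY_KEYWORDS = [
    "overflow", "use-after-free", "xss", "injection",
    "out-of-bounds", "race condition", "sanitize", "cve",
]

# One case-insensitive pattern matching any keyword.
PATTERN = re.compile(
    "|".join(re.escape(k) for k in SECURITY_KEYWORDS), re.IGNORECASE
)

def mentions_security(comment: str) -> bool:
    """Return True if a review comment contains any security keyword."""
    return PATTERN.search(comment) is not None

# Toy review comments; only the last two should be flagged.
comments = [
    "Nit: rename this variable.",
    "This loop can read out-of-bounds when len == 0.",
    "Possible use-after-free if the callback fires late.",
]
flagged = [c for c in comments if mentions_security(c)]
```

In the study, hits from this kind of filter were then manually validated, since keyword matching alone produces false positives (e.g., "overflow" in a CSS layout discussion).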
To analyze our data, we developed a logistic regression model using these 18 attributes, following the guidelines suggested by Harrell Jr. The model, which achieved an AUC of 0.91, found nine code review metrics that distinguish code reviews that missed a vulnerability from those that did not. We have also made the R script to reproduce our results available in the GitHub repository. Our dataset includes 890 real-world vulnerable files, and each vulnerability is classified using the CWE specification. We envision several ways researchers may benefit from this artifact, such as evaluating static analysis tools, training machine learning models, and replicating this study in a different context.
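The actual analysis is the R script in the repository; purely as an illustration of the modeling idea (fit a logistic regression on review attributes, then score it with AUC), here is a minimal pure-Python sketch. The single synthetic feature and all names here are invented stand-ins, not the paper's 18 attributes or its results:

```python
import math
import random

def _sigmoid(z: float) -> float:
    z = max(min(z, 30.0), -30.0)  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression (bias + weights) by stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = _sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = p - yi
            b -= lr * err
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
    return w, b

def predict_proba(w, b, xi):
    return _sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))

def auc(scores, labels):
    """AUC = probability a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Synthetic stand-in data: one feature loosely separating cases from controls.
random.seed(0)
X = [[random.gauss(1.0, 0.5)] for _ in range(50)] + \
    [[random.gauss(-1.0, 0.5)] for _ in range(50)]
y = [1] * 50 + [0] * 50

w, b = fit_logistic(X, y)
scores = [predict_proba(w, b, xi) for xi in X]
```

Harrell's guidelines cover much more than fitting (e.g., restricted cubic splines and multicollinearity checks via variance inflation); this sketch only shows the fit-and-score core.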
Files
(332.1 kB total; includes code-description.md)

MD5 checksum | Size
---|---
md5:408ee4e6bf3be4ae0d761c6258d63430 | 1.3 kB
md5:b844c7046c5ced06a31d08b4a2804bc3 | 3.8 kB
md5:960d90fae9b2b35e540e03f442f75b5f | 8.2 kB
md5:68b3addf796cd66ad40f4308f7e79023 | 317.6 kB
md5:20d5342b1d415f9442ce9fc15e09a4b3 | 1.1 kB
md5:501aa2f7e907e370a721dec05400aae3 | 100 Bytes