Published October 31, 2022 | Version v1
Dataset
Open
Dataset used for paper "Issues-Driven Features for Software Fault Prediction"
Creators
Description
Dataset used for paper "Issues-Driven Features for Software Fault Prediction".
The dataset contains 86 projects from the open-source organizations Apache and Spring. All projects are written in Java and use the Git version control system together with an issue tracking system (JIRA or Bugzilla).
For each project, we extracted data for the software fault prediction (SFP) task as follows:
First, we filtered out projects with no reported resolved bugs or with fewer than five released versions.
Then we iterated over the resolved bugs and mapped each one to the commits that fixed it.
Next, for each version, we labeled the faulty files in the version. A faulty file is a file that was modified in a commit in the version that resolved a bug.
The labeling methodology is a variant of the well-known approach implemented in the SZZ algorithm, adapted to account for that algorithm's known weaknesses.
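The mapping and labeling steps can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not reproduce the safeguards of their SZZ variant. It assumes, hypothetically, that each commit is a dictionary with `version`, `message`, and `files` fields, and that fixing commits cite JIRA-style issue keys (for example `LANG-12`) in their messages; the function name `label_faulty_files` is also illustrative.

```python
import re
from collections import defaultdict

# Assumption: bug-fixing commits mention a JIRA-style key such as "LANG-12".
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def label_faulty_files(commits, resolved_bugs):
    """Return {version: set of files modified by commits that fixed a resolved bug}."""
    faulty = defaultdict(set)
    for commit in commits:
        # Collect every issue key cited in the commit message.
        cited = set(ISSUE_KEY.findall(commit["message"]))
        # A commit counts as a fix if it cites at least one resolved bug.
        if cited & resolved_bugs:
            faulty[commit["version"]].update(commit["files"])
    return dict(faulty)


# Hypothetical toy data, not taken from the dataset.
commits = [
    {"version": "1.0", "message": "LANG-12: fix NPE in parser", "files": ["src/Parser.java"]},
    {"version": "1.0", "message": "Refactor build", "files": ["src/Build.java"]},
    {"version": "1.1", "message": "Fix LANG-40 and LANG-99", "files": ["src/Lexer.java"]},
]
resolved_bugs = {"LANG-12", "LANG-40"}

labels = label_faulty_files(commits, resolved_bugs)
```

In this toy input, `src/Build.java` stays unlabeled because its commit cites no resolved bug, while the third commit counts as a fix because one of the two keys it cites is in the resolved set.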
Finally, for each project, we filtered out versions whose ratio of faulty files was lower than 5% or higher than 30%. The remaining set includes a good representation of bugs and reduces the class imbalance produced by the low number of defects. In addition, filtering out versions with a low number of bugs helps to prune outliers, such as a version created to fix specific issues.
The data was extracted using the Beirut repository mining tool:
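The version filter can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the per-version counts, the function name `keep_versions`, and the choice to treat both bounds as inclusive are hypothetical.

```python
def keep_versions(versions, low=0.05, high=0.30):
    """Keep versions whose faulty-file ratio lies within [low, high].

    `versions` maps a version name to (faulty_file_count, total_file_count).
    """
    kept = []
    for name, (n_faulty, n_total) in versions.items():
        if n_total == 0:
            continue  # a version with no files has no defined ratio
        ratio = n_faulty / n_total
        if low <= ratio <= high:
            kept.append(name)
    return kept


# Hypothetical counts, not taken from the dataset.
versions = {
    "1.0": (2, 100),   # 2% faulty: too few bugs, dropped
    "1.1": (12, 100),  # 12% faulty: kept
    "2.0": (45, 100),  # 45% faulty: above the upper bound, dropped
    "2.1": (10, 50),   # 20% faulty: kept
}

kept = keep_versions(versions)
```

With these counts, only versions `1.1` and `2.1` survive the filter.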
https://github.com/beirut-repository-mining/repository_mining
Files
(1.6 GB)
Name | Size |
---|---|
md5:0950cbe66d886bffaa93f36231d986c5 | 1.6 GB |