Dataset used for paper "Issues-Driven Features for Software Fault Prediction"

Elmishali Amir; Kalech Meir

doi:10.5281/zenodo.7266448

Published October 31, 2022 | Version v1

Dataset Open

The dataset contains 86 projects from the open-source organizations Apache and Spring. All projects are written in Java and manage their source code with the Git version control system and an issue tracking system (JIRA or Bugzilla).

For each project, we extracted data for the software fault prediction (SFP) task as follows:

First, we filtered out projects with no reported resolved bugs or with fewer than five released versions.
Then, we iterated over the resolved bugs and mapped each one to the commits that fixed it.
Next, for each version, we labeled the faulty files. A file is faulty in a version if it was modified by a commit in that version that resolved a bug.
The methodology for labeling the files is a variant of the well-known approach implemented in the SZZ algorithm, adapted to account for that algorithm's known weaknesses.
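
The mapping and labeling steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Commit` record, its fields, and the heuristic of matching JIRA-style issue keys in commit messages are assumptions made for illustration only.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Hypothetical commit record; the field names are illustrative."""
    sha: str
    message: str
    version: str   # release the commit belongs to
    files: tuple   # paths modified by the commit


# Assumed convention: commit messages cite JIRA-style keys such as "PROJ-123".
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def map_bugs_to_commits(resolved_bug_keys, commits):
    """Return {bug_key: [commits whose message cites that key]}."""
    resolved = set(resolved_bug_keys)
    fixes = defaultdict(list)
    for commit in commits:
        for key in set(ISSUE_KEY.findall(commit.message)):
            if key in resolved:
                fixes[key].append(commit)
    return dict(fixes)


def label_faulty_files(fixes):
    """Return {version: set of files modified by a bug-fixing commit in it}."""
    faulty = defaultdict(set)
    for fixing_commits in fixes.values():
        for commit in fixing_commits:
            faulty[commit.version].update(commit.files)
    return dict(faulty)
```

Any file of a version that does not appear in the returned set would be treated as non-faulty for that version.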

Finally, for each project, we filtered out versions whose ratio of faulty files was lower than 5% or higher than 30%. The remaining versions include a good representation of bugs, which reduces the class imbalance caused by a low number of defects. In addition, filtering out versions with few bugs helps prune outliers, such as a version created to fix specific issues.
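
The version filter can be sketched as follows. The function names and the input shape (a mapping from version name to a pair of faulty-file count and total-file count) are hypothetical. Treating the 5% and 30% bounds as inclusive is an assumption based on the wording "lower than" and "higher than".

```python
def keep_version(n_faulty, n_files, low_pct=5, high_pct=30):
    """Keep a version whose faulty-file ratio lies within [low_pct, high_pct] percent.

    Integer arithmetic avoids floating-point rounding at the boundaries.
    """
    if n_files == 0:
        return False  # assumption: a version with no files is discarded
    return low_pct * n_files <= 100 * n_faulty <= high_pct * n_files


def select_versions(version_counts):
    """version_counts maps version name -> (n_faulty, n_files)."""
    return [
        name
        for name, (n_faulty, n_files) in version_counts.items()
        if keep_version(n_faulty, n_files)
    ]
```

For example, a version with 10 faulty files out of 100 is kept, while versions with 2 or 45 faulty files out of 100 are dropped.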

The data was extracted using the Beirut repository mining tool:

https://github.com/beirut-repository-mining/repository_mining

Files (1.6 GB)

Name	Size
isd.rar (md5:0950cbe66d886bffaa93f36231d986c5)	1.6 GB
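
After downloading, the archive can be checked against the MD5 digest listed above. The following is a minimal sketch using Python's standard `hashlib`. The helper names and the local path `isd.rar` are assumptions for illustration.

```python
import hashlib

# Digest taken from the file listing for isd.rar.
EXPECTED_MD5 = "0950cbe66d886bffaa93f36231d986c5"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected digest."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_archive("isd.rar")` returns `True` only when the downloaded file matches the published checksum.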
