Published March 15, 2015 | Version v1
Dataset Open

bugclassify

Description

About the Data

The data were downloaded from Herzig et al.'s datasets, which include the identifiers of the issue reports that Herzig et al. manually analyzed. The description of that dataset follows.

Herzig et al. conducted a study on five open-source JAVA projects described in Table I (see paper). They aimed to select projects that were under active development and were developed by teams that follow strict commit and bug fixing procedures similar to industry. They also aimed for a more or less homogeneous data set, which eased the manual inspection phase. Projects from APACHE and MOZILLA seemed to fit their requirements best. Additionally, they selected the five projects such that they cover at least two different and popular bug tracking systems: Bugzilla and Jira. Three out of five projects (Lucene-Java, Jackrabbit, and HTTPClient) use a Jira bug tracker. The remaining two projects (Rhino, Tomcat5) use a Bugzilla tracker.

For each of the five projects, they selected all issue reports that were marked as RESOLVED, CLOSED, or VERIFIED and whose resolution was set to FIXED, and performed a manual inspection of these issues. They disregarded issues whose resolution was still in progress or that had not been accepted, as the features of such issues may change in the future. The number of inspected reports per project can be found in Table I of the paper. In total, they obtained 7,401 closed and fixed issue reports. Of these, 1,810 reports originate from the Rhino and Tomcat5 projects and represent Bugzilla issue reports. The remaining 5,591 reports were filed in a Jira bug tracker.
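The selection rule described above (status RESOLVED, CLOSED, or VERIFIED, with resolution FIXED) can be sketched as a simple filter. This is only an illustration: the dictionary keys `id`, `status`, and `resolution` and the sample records are hypothetical, and they are not taken from the dataset files or from the tooling used in the study.

```python
# Minimal sketch of the issue selection rule, under assumed field names.
# The keys "id", "status" and "resolution" are hypothetical.

ACCEPTED_STATUSES = {"RESOLVED", "CLOSED", "VERIFIED"}
ACCEPTED_RESOLUTION = "FIXED"


def is_selected(issue):
    """Return True if the issue is closed (in one of the accepted states) and fixed."""
    status = issue.get("status", "").strip().upper()
    resolution = issue.get("resolution", "").strip().upper()
    return status in ACCEPTED_STATUSES and resolution == ACCEPTED_RESOLUTION


def select_issues(issues):
    """Keep only the issues that satisfy the selection rule."""
    return [issue for issue in issues if is_selected(issue)]


# Made-up records, for illustration only.
sample_issues = [
    {"id": "A-1", "status": "RESOLVED", "resolution": "FIXED"},
    {"id": "A-2", "status": "CLOSED", "resolution": "WONTFIX"},
    {"id": "A-3", "status": "NEW", "resolution": ""},
    {"id": "A-4", "status": "Verified", "resolution": "Fixed"},
]

selected_ids = [issue["id"] for issue in select_issues(sample_issues)]
print(selected_ids)
```

Normalizing case before comparing is a design choice of this sketch, since trackers may differ in how they capitalize status values.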

Abstract

Bug localization refers to the task of automatically processing bug reports to locate the source code files that are responsible for the bugs. Many bug localization techniques have been proposed in the literature. These techniques are often evaluated on issue reports that are marked as bugs by their reporters in issue tracking systems. However, Herzig et al. recently found that a substantial number of issue reports marked as bugs are not bugs but other kinds of issues, such as refactorings, requests for enhancement, documentation changes, and test case creation. Herzig et al. report that these misclassifications affect bug prediction, namely the task of predicting which files are likely to be buggy in the future. In this work, we investigate whether these misclassifications also affect bug localization. To do so, we analyze issue reports that have been manually categorized by Herzig et al. and apply a bug localization technique to recover a ranked list of candidate buggy files for each issue report. We then evaluate whether the quality of the ranked lists for reports marked as bugs is the same as that for real bug reports. Our findings indicate that additional cleaning steps need to be performed on issue reports before they are used to evaluate bug localization techniques.

Files

httpclient_classification_vs_type.csv

Files (170.5 kB)

Checksum                              Size
md5:6c84192e56e61b3bc918c51b91573de7  21.2 kB
md5:c007d74b73217385447436d8ce49fecd  54.0 kB
md5:96a6cc5b217d0c41eaf694b4b6d95e76  64.6 kB
md5:34f9ff45699b79cbdc01f983cc52623c  230 Bytes
md5:9ea69c3dee20f25393516f0f140caaff  10.1 kB
md5:1ed10d4415e74b5fae3249671461b505  20.4 kB
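
A downloaded file can be checked against the MD5 values listed above. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_listing` are hypothetical, and the local path passed to them depends on where the file was saved.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Compare a file's digest with a listed value such as 'md5:6c84...'.

    The optional 'md5:' prefix is stripped before comparing.
    """
    expected = listed_checksum.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listing(local_path, "md5:6c84192e56e61b3bc918c51b91573de7")` returns `True` only if the file at `local_path` (a placeholder for the saved copy) has that digest.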