Published February 7, 2020 | Version 2.0
Dataset Open

ManySStuBs4J Dataset

  • 1. The University of Edinburgh
  • 2. Google Research

Description

The ManySStuBs4J corpus is a collection of simple fixes to Java bugs, designed for evaluating program repair techniques.
We collect all bug-fixing changes using the SZZ heuristic, and then filter these to obtain a dataset of small bug-fix changes.
These are single-statement fixes, classified where possible into one of 16 syntactic templates which we call SStuBs.
The dataset contains single-statement bugs mined from open-source Java projects hosted on GitHub.
There are two variants of the dataset: one mined from 100 Java Maven projects and one mined from the top 1000 Java projects.
A project's popularity is determined by computing the sum of z-scores of its forks and watchers.
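
As an illustration of the popularity score described above, here is a minimal sketch that sums the z-scores of forks and watchers. The project names and counts are hypothetical, and the use of the population standard deviation is an assumption; the text does not say which variant was used.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return (x - mean) / standard deviation for each value."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # All values identical: no project stands out on this metric.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def popularity(projects):
    """Map each project name to z(forks) + z(watchers)."""
    names = list(projects)
    fork_z = z_scores([projects[n]["forks"] for n in names])
    watcher_z = z_scores([projects[n]["watchers"] for n in names])
    return {n: f + w for n, f, w in zip(names, fork_z, watcher_z)}


# Hypothetical counts, for illustration only.
projects = {
    "alpha": {"forks": 10, "watchers": 100},
    "beta": {"forks": 20, "watchers": 200},
    "gamma": {"forks": 30, "watchers": 300},
}
scores = popularity(projects)
ranked = sorted(scores, key=scores.get, reverse=True)  # most popular first
```

Standardizing each metric before summing keeps watchers, which are typically far more numerous than forks, from dominating the ranking.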
We kept only bug-fix commits that contain only single-statement changes, ignoring stylistic differences such as whitespace or empty lines as well as differences in comments.
Some single-statement changes are caused by refactorings, such as changing a variable name, rather than by bug fixes.
We attempted to detect and exclude refactorings such as variable, function, and class renamings, function argument renamings, and changes to the number of arguments in a function.
Commits are classified as bug fixes or not by checking whether the commit message contains any of a set of predetermined keywords such as bug, fix, and fault.
We evaluated the accuracy of this method on a random sample of 100 commits that contained SStuBs from the smaller version of the dataset and found it to achieve a satisfactory 94% accuracy.
This method has also been used before to extract bug datasets (Ray et al., 2015; Tufano et al., 2018), where it achieved accuracies of 96% and 97.6%, respectively.
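
The keyword check on commit messages can be sketched roughly as follows. This is not the authors' implementation: the keyword tuple holds only the three examples named in the text (the full set is not given here), and case-insensitive substring matching is an assumption.

```python
import re

# Only the three keywords named in the text; the full set is not listed here.
KEYWORDS = ("bug", "fix", "fault")

# Assumption: case-insensitive substring matching, so "Fixed" and "bugfix" match.
_PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def looks_like_bug_fix(message):
    """Return True if the commit message contains any keyword."""
    return _PATTERN.search(message) is not None


fixed = looks_like_bug_fix("Fixed NPE in parser")        # contains "fix"
feature = looks_like_bug_fix("Add new export option")    # no keyword
default = looks_like_bug_fix("Change default timeout")   # "default" contains "fault"
```

The last message shows how substring matching can produce false positives, which is one reason a heuristic of this kind is evaluated on a manually checked sample rather than assumed to be exact.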

The bugs are stored in a JSON file (each version of the dataset has its own instance of this file).
Any bugs that fit one of the 16 patterns are also annotated with the pattern(s) they fit in a separate JSON file (each version of the dataset has its own instance of this file).
We refer to bugs that fit any of the 16 patterns as simple stupid bugs (SStuBs).
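
A minimal sketch of reading one of the JSON files and counting bugs per pattern is shown below. It assumes the file holds a top-level array of objects; the field name `bugType`, the pattern labels, and the file name `sample.json` are hypothetical stand-ins, so check README.txt for the actual schema.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def load_records(path):
    """Load a JSON file assumed to contain a top-level array of records."""
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    return records


def count_by(records, field):
    """Count records by the value of one field, tolerating missing keys."""
    return Counter(r.get(field, "<missing>") for r in records)


# Hypothetical records written to a temporary file, for illustration only.
sample = [
    {"bugType": "PATTERN_A"},
    {"bugType": "PATTERN_B"},
    {"bugType": "PATTERN_A"},
]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.json"
    sample_path.write_text(json.dumps(sample), encoding="utf-8")
    counts = count_by(load_records(sample_path), "bugType")
```

Note that `json.load` reads the whole file into memory, which matters for the larger files in this record (several hundred MB).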

For more information on extracting the dataset and detailed documentation of the software, visit our GitHub repo: https://github.com/mast-group/SStuBs-mining

Files

README.txt

Files (1.0 GB)

MD5 checksum                            Size
md5:f70285b8762c03b7f940c7896a5af5cd    81.3 MB
md5:58fc0960fddd9a93dd9fb21002b6ef85    509.7 MB
md5:b93e4b3cea75cb7abe624b38dab8d8d3    17.4 kB
md5:c625fed4f8f357daf520a50dfc9ede7e    7.3 kB
md5:85dd5e46017a4c46e27b02918d46a36f    26.3 MB
md5:75d9745a2b0a54f789782e6a1566f1b3    173.2 MB
md5:227dcb2796aea81555e18d4f83c1c9a2    6.8 kB
md5:f56b4c4ac382cada5ee608ec08bd02b3    251.6 MB
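
The MD5 checksums listed above can be used to confirm that a download is intact. The sketch below hashes a file in chunks and compares the result with a listed `md5:...` value; the helper names and the temporary demo file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Demo on a small temporary file rather than a real download.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.bin"
    demo.write_bytes(b"hello")
    good = "md5:" + hashlib.md5(b"hello").hexdigest()
    ok = matches_listing(demo, good)
    bad = matches_listing(demo, "md5:" + "0" * 32)
```

Reading in chunks keeps memory use flat even for the files of several hundred MB.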

Additional details

Related works

Is new version of
Dataset: 10.7488/ds/2628 (DOI)

Funding

EPSRC Centre for Doctoral Training in Data Science EP/L016427/1
UK Research and Innovation