TSSB-3M: A massive scale dataset of single statement bugs
Description
Datasets created for the paper "TSSB-3M: Mining single statement bugs at massive scale".
Access to single statement bug fixes at massive scale is important not only for exploring how developers introduce and fix bugs in code, but also as a valuable resource for research in data-driven bug detection and automatic repair. Therefore, we are releasing multiple large-scale collections of single statement bug fixes mined from over 500K public Python repositories.
To facilitate future research, we are releasing three datasets:
- TSSB-3M: A dataset of over 3 million isolated single statement bug fixes. Each bug fix is related to a commit in a public Python project that does not change more than a single statement.
- SSB-9M: A dataset of over 9 million single statement bug fixes. Each fix modifies at least a single statement to fix a bug. However, the related code changes might incorporate changes to other files.
- SSC-28M: A dataset of over 28 million general single statement changes. We are releasing this dataset with the intention to facilitate research in software evolution. Therefore, a code change might not necessarily relate to a bug fix.
Because of concerns regarding the licensing of code, we do not release the original source code related to the single statement code changes. However, our datasets provide enough information to load the original code from the source project.
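As a minimal sketch of how the original code could be recovered, the snippet below reads a file at the parent commit and at the fixing commit using the project_url, commit_sha, parent_sha and file_path fields listed under "Commit details". It assumes that git is installed and that the project has already been cloned locally from project_url; the helper names show_file_cmd and load_before_after are hypothetical and not part of the dataset tooling.

```python
import subprocess


def show_file_cmd(repo_dir, sha, file_path):
    """Build a `git show <sha>:<path>` command for a local clone.

    `git show <sha>:<path>` prints the content of the file as of that commit.
    """
    return ["git", "-C", str(repo_dir), "show", f"{sha}:{file_path}"]


def load_before_after(entry, repo_dir):
    """Return (before, after) file contents for one dataset entry.

    Assumption: `repo_dir` is a local clone of entry["project_url"] that
    still contains both commits referenced by the entry.
    """
    contents = []
    for sha in (entry["parent_sha"], entry["commit_sha"]):
        result = subprocess.run(
            show_file_cmd(repo_dir, sha, entry["file_path"]),
            capture_output=True,
            text=True,
            check=True,  # raise if the commit or file is missing
        )
        contents.append(result.stdout)
    return contents[0], contents[1]
```

Projects may have been deleted, renamed or rewritten since mining, so a caller would likely want to catch subprocess.CalledProcessError and skip entries that can no longer be resolved.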
All dataset entries are saved in a compressed jsonlines format. Each individual entry provides access to the following information:
Commit details:
- project: Name of the git project where the commit occurred.
- project_url: URL of project containing the commit
- commit_sha: commit SHA of the code change
- parent_sha: commit SHA of the parent commit
- file_path: File path of the changed source file
- diff: Universal diff describing the change made during the commit
- before: Python statement before commit
- after: Python statement after commit (addresses the same line)
Commit analysis:
- likely_bug: true if the commit message indicates that the commit is a bug fix. This is heuristically determined.
- comodified: true if the commit modifies more than one statement in a single file (formatting and comments are ignored).
- in_function: true if the changed statement appears inside a Python function
- sstub_pattern: the name of the single statement change pattern the commit can be classified as (if any). Default: SINGLE_STMT
- edit_script: A sequence of AST operations to transform the code before the commit to the code after the commit (includes Insert, Update, Move and Delete operations).
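The fields above can be read and filtered with the Python standard library alone. The following is a minimal sketch, assuming the files are gzip-compressed JSON Lines (the exact compression format and file names are not stated here) and that likely_bug and comodified are stored as JSON booleans; iter_entries and isolated_bug_fixes are hypothetical helper names.

```python
import gzip
import json


def iter_entries(path):
    """Yield one dict per line from a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def isolated_bug_fixes(entries):
    """Keep entries flagged as likely bug fixes that are not co-modified."""
    for entry in entries:
        if entry.get("likely_bug") and not entry.get("comodified"):
            yield entry
```

Because both helpers are generators, a file can be streamed entry by entry without loading the full dataset into memory, for example `for entry in isolated_bug_fixes(iter_entries(path)): ...`, where path points to one of the downloaded files.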