There is a newer version of the record available.

Published January 27, 2022 | Version v1.0.0
Dataset Open

TSSB-3M: A massive scale dataset of single statement bugs

  • 1. Carl von Ossietzky University of Oldenburg

Description

Datasets created for the paper "TSSB-3M: Mining single statement bugs at massive scale".

Access to single statement bug fixes at massive scale is important not only for exploring how developers introduce and fix bugs in code, but also as a valuable resource for research in data-driven bug detection and automatic repair. Therefore, we are releasing multiple large-scale collections of single statement bug fixes mined from over 500K public Python repositories.

To facilitate future research, we are releasing three datasets:

  • TSSB-3M: A dataset of over 3 million isolated single statement bug fixes. Each bug fix is related to a commit in a public Python project that does not change more than a single statement.

  • SSB-9M: A dataset of over 9 million single statement bug fixes. Each fix modifies at least a single statement to fix a bug. However, the related code changes might incorporate changes to other files.

  • SSC-28M: A dataset of over 28 million general single statement changes. We are releasing this dataset with the intention to facilitate research in software evolution. Therefore, a code change might not necessarily relate to a bug fix.

Because of concerns regarding the licensing of code, we do not release the original source code related to the single statement code changes. However, our datasets provide enough information to load the original code from the source project. 

All dataset entries are saved in a compressed jsonlines format. Each individual entry provides access to the following information:

Commit details:

  • project: Name of the git project where the commit occurred.
  • project_url: URL of project containing the commit
  • commit_sha: commit SHA of the code change
  • parent_sha: commit SHA of the parent commit
  • file_path: File path of the changed source file
  • diff: Unified diff describing the change made during the commit
  • before: Python statement before commit
  • after: Python statement after commit (addresses the same line)
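Since the original source code is not part of the release, the commit details above can be used to fetch it from the source project. The following is a minimal sketch, assuming that the project has already been cloned locally from project_url and that git is installed. The helper names git_show_command and load_file_versions are hypothetical and not part of the dataset tooling.

```python
import subprocess
from typing import Dict, List


def git_show_command(sha: str, file_path: str) -> List[str]:
    """Build the `git show <sha>:<path>` command that prints a file at a commit."""
    return ["git", "show", f"{sha}:{file_path}"]


def load_file_versions(entry: Dict[str, str], repo_dir: str) -> Dict[str, str]:
    """Return the full file contents before and after the commit.

    Assumption: `repo_dir` is a local clone of entry["project_url"] that
    still contains both commits.
    """
    versions = {}
    for label, sha_key in (("file_before", "parent_sha"), ("file_after", "commit_sha")):
        cmd = git_show_command(entry[sha_key], entry["file_path"])
        # check=True raises CalledProcessError if the commit or path is missing.
        result = subprocess.run(
            cmd, cwd=repo_dir, capture_output=True, text=True, check=True
        )
        versions[label] = result.stdout
    return versions
```

The sketch returns whole files, whereas the before and after fields of an entry hold only the changed statement.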

Commit analysis:

  • likely_bug: true if the commit message indicates that the commit is a bug fix. This is heuristically determined.
  • comodified: true if the commit modifies more than one statement in a single file (formatting and comments are ignored).
  • in_function: true if the changed statement appears inside a Python function
  • sstub_pattern: the name of the single statement change pattern the commit can be classified as (if any). Default: SINGLE_STMT
  • edit_script: A sequence of AST operations to transform the code before the commit to the code after the commit (includes Insert, Update, Move and Delete operations).
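As an illustration of how entries with these fields might be consumed, the sketch below streams a jsonlines file and filters on the analysis flags. It assumes that the individual files are gzip-compressed, which is not stated above, and the file name in the usage note is a placeholder. The function names iter_entries and is_likely_uncomodified_fix are hypothetical.

```python
import gzip
import json
from typing import Dict, Iterator


def iter_entries(path: str) -> Iterator[Dict]:
    """Yield one dataset entry per non-empty line of a gzip-compressed jsonlines file.

    Assumption: the file is gzip-compressed; adjust the opener otherwise.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def is_likely_uncomodified_fix(entry: Dict) -> bool:
    """Illustrative filter: flagged as a likely bug fix and not co-modified."""
    return bool(entry.get("likely_bug")) and not entry.get("comodified")
```

A possible use is `fixes = [e for e in iter_entries("sample.jsonl.gz") if is_likely_uncomodified_fix(e)]`. Streaming line by line avoids loading a multi-gigabyte file into memory at once.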

Files

ssb_data_9M.zip

Files (8.6 GB)

Checksum (size):

  • md5:694cbaf07333ba9bc49f810c190464e0 (2.1 GB)
  • md5:665ad34c5913059f54a4e40aa788c55b (5.5 GB)
  • md5:43739520a088cae6d2313816a05280e1 (912.4 MB)
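The MD5 checksums listed above can be used to verify a download. The following is a minimal sketch using only the Python standard library. The helper names md5_of_file and matches_checksum are hypothetical, and which checksum belongs to which archive has to be taken from the record itself.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

Reading in chunks keeps memory use constant, which matters for archives of several gigabytes.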