Benchmark Set for Program Repair Based on Partial Fixes, Dataset

Beyer, Dirk; Grunske, Lars; Kettl, Matthias; Lingsch-Rosenfeld, Marian; Raselimo, Moeketsi

doi:10.5281/zenodo.10369427

Published December 13, 2023 | Version v3

1. LMU Munich, Germany
2. Humboldt University Berlin, Germany
3. Ludwig-Maximilians-Universität München
4. Humboldt-Universität zu Berlin

Partial-Fix Data Set

Partial Fixes

Identifying and fixing errors in programs remains a challenge and is one of the most time-consuming tasks in software development. Even after a bug has been identified and a fix has been proposed by a developer or a tool, the fix may be incomplete and fail to cover all inputs that trigger the bug. This happens quite often and leads to re-opened issues and inefficiencies.

We present the first curated benchmark set composed of incomplete fixes. Each entry in the benchmark set contains a series of commits fixing the same issue, where multiple of the intermediate commits are incomplete fixes. These are sourced from real-world open-source C projects from GitHub.

The selection process involves both automated and manual stages. Initially, we employ heuristics to identify potential partial fixes from repositories, subsequently validating them through meticulous manual inspection. This process ensures the accuracy and reliability of the curated dataset.

We envision that the data set will enable researchers to investigate partial fixes in more detail and to develop new techniques to detect and fix them.

Reconstruction of the Data Set

To recover the data set from the task definitions, use the script `init_yaml.py` under `scripts/`:

```
./scripts/init_yaml.py <path-to-task-definition-1> <path-to-task-definition-2> ...
```

This will download all the required files and place them in the locations specified in the task definitions.

You can also use the script to get the zip archives at the different revisions by using the flag `--init-zips`.

To initialize all task definitions in the data set, you can use the following command, if your shell supports globbing:

```
./scripts/init_yaml.py partial-fixes/**/**/*.yml
```
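If globbing is not available, the task definitions can also be collected programmatically. The following is a minimal sketch, not part of the artifact: the helper names `find_task_definitions`, `build_init_command`, and `initialize_all` are hypothetical, and the sketch assumes only what is stated above, namely that `init_yaml.py` accepts several task-definition paths as arguments. Unlike the command above, it searches all subdirectories of `partial-fixes/`.

```python
import subprocess
from pathlib import Path


def find_task_definitions(root):
    """Return all .yml files below root, sorted for a reproducible order."""
    return sorted(Path(root).rglob("*.yml"))


def build_init_command(task_definitions, script="./scripts/init_yaml.py"):
    """Build the argument list: the script followed by one path per task definition."""
    return [script, *(str(path) for path in task_definitions)]


def initialize_all(root="partial-fixes"):
    """Run the init script on every task definition found below root."""
    definitions = find_task_definitions(root)
    if not definitions:
        raise FileNotFoundError(f"no task definitions found below {root}")
    subprocess.run(build_init_command(definitions), check=True)


# Example (run from the top-level directory of the data set):
# initialize_all("partial-fixes")
```

Passing the arguments as a list avoids any shell quoting issues with unusual path names.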

Check Task Definitions

To check whether the task definitions comply with our format, run:

`./doc/check-schema.py doc/schema.yml partial-fixes/`

from this directory.

The expected output is "`All task definitions are valid.`".

Structure of the Data Set

The data set has the following structure:

```
├── repository_1
│   ├── partial_1
│   │   ├── 7e52c483.zip
│   │   ├── cab7b562.diff
│   │   ├── c595a789.diff
│   │   └── task_definition.yml
│   └── partial_2
│       ├── ...
├── repository_2
...
```

For every repository with at least one curated partial fix, we create a directory `partial_<id>` housing the snapshots of the project at the specific revisions and the task definition `task_definition.yml` described below.

The file names correspond to the SHA1 commit hash of the project in the ZIP file.

Overview

The Partial Fix Data Set Schema is a blueprint for describing tasks within a dataset of manually curated partial fixes.

Each partial fix is stored in a dedicated directory, housing zip files representing project snapshots before, during, and after addressing an issue, along with a YML file providing crucial details about the fix.

A nice visual representation of the following information can be found [here](doc/schema_doc.html).

General Information

- **Schema Version:** Version of task definitions.

- **Repository URL:** URL of the repository associated with the partial fix.

Sequence

- **Base Version:**

  - `input_file`: The name of the file in the dataset, relative to the YML file.

  - `commit_sha1`: SHA1 commit hash of the base version.

- **Fix Attempt:** List of fix attempts, each with:

  - `input_file`: The filename of the diff against the previous partial fix or the base version, relative to the YML file.

  - `commit_sha1`: SHA1 commit hash of the fix attempt.

- **Expected Fix:**

  - `input_file`: The filename of the diff against the last partial fix, relative to the YML file.

  - `commit_sha1`: SHA1 commit hash of the expected fix.

To fully restore all revisions, unzip the `input_file` of the base version and apply all diffs sequentially.
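The restoration procedure can be sketched as follows. This is an illustration under stated assumptions, not the artifact's own tooling: the function names are hypothetical, the diffs are assumed to be unified diffs that the standard `patch` utility can apply, and both the patch level (`-p1`) and the directory in which the diffs apply depend on how the diffs and ZIP archives were actually generated.

```python
import subprocess
import zipfile
from pathlib import Path


def ordered_diffs(fix_attempt_diffs, expected_fix_diff):
    """Diffs in application order: all fix attempts first, then the expected fix."""
    return [*fix_attempt_diffs, expected_fix_diff]


def restore_revisions(base_zip, diffs, workdir, patch_level=1):
    """Unpack the base version into workdir, then apply each diff in order.

    Assumption: the diffs apply relative to workdir with the given patch level.
    If the ZIP file contains a top-level project directory, the working
    directory for `patch` has to be adjusted accordingly.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(base_zip) as archive:
        archive.extractall(workdir)
    for diff in diffs:
        diff_path = Path(diff).resolve()  # resolve before patch changes directory
        subprocess.run(
            ["patch", f"-p{patch_level}", "-i", str(diff_path)],
            cwd=workdir,
            check=True,  # stop at the first diff that does not apply cleanly
        )


# Example with the file names from the directory listing (order is an assumption):
# diffs = ordered_diffs(["cab7b562.diff"], "c595a789.diff")
# restore_revisions("7e52c483.zip", diffs, "restored")
```

To obtain an intermediate revision instead of the final one, pass only the leading part of the diff list.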

Classification

Fix classification (one of):

- Partial Fix

- Unknown

- No Partial Fix

Categories (optional)

List of bug-causing categories, for example:

- `Build`: Problems with building the project in certain environments.

- `Null Pointer`: Any null pointer error.

- `Arithmetic and Control-flow`: Problematic control-flow or buggy statements.

- `Preprocessing Directives`: Issue with CI or unit tests.

- `Hardware and OS Related`: The problem occurs only on special hardware and operating systems.

- `Wrong API Usage`: Misuse of functions and API calls.

- `Race Conditions`: Race conditions.

- `Performance`: The fix attempts cause a loss in performance (speed, memory).

- `Memory Leaks`: Memory Leaks.

Metadata

Additional task metadata, including:

- `language`: Programming language of the original project and partial fix.

- `strategy`: The strategy used for mining the partial fix (e.g., 'reopen', 'status', 'linux-convention').

- `fix_size`: Size of the fix (number of additions and deletions).

- `build_system`: Keywords indicating how to build the project.

- `related_issue`: URL to an issue related to this partial fix.
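To make the shape of a task definition concrete, the fields described above can be modeled as plain Python data classes. This is a sketch for illustration only: the class names, the attribute names of the top-level sections, and the field types (for example `fix_size` as an integer and `build_system` as a list of keywords) are assumptions; the authoritative definition is `doc/schema.yml`. The repository URL and language in the example are placeholders.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Classification(Enum):
    PARTIAL_FIX = "Partial Fix"
    UNKNOWN = "Unknown"
    NO_PARTIAL_FIX = "No Partial Fix"


@dataclass
class Revision:
    input_file: str   # ZIP file (base version) or diff, relative to the YML file
    commit_sha1: str  # SHA1 commit hash of this revision


@dataclass
class Metadata:
    language: str
    strategy: str                 # e.g. 'reopen', 'status', 'linux-convention'
    fix_size: int                 # assumed: additions plus deletions
    build_system: List[str]       # assumed: list of keywords
    related_issue: Optional[str] = None


@dataclass
class TaskDefinition:
    schema_version: str
    repository_url: str
    base_version: Revision
    fix_attempts: List[Revision]
    expected_fix: Revision
    classification: Classification
    metadata: Metadata
    categories: List[str] = field(default_factory=list)  # optional

    def revisions_in_order(self):
        """Base version, then every fix attempt, then the expected fix."""
        return [self.base_version, *self.fix_attempts, self.expected_fix]


# Hypothetical example using the file names from the directory listing.
example = TaskDefinition(
    schema_version="1.0",
    repository_url="https://github.com/example/project",
    base_version=Revision("7e52c483.zip", "7e52c483"),
    fix_attempts=[Revision("cab7b562.diff", "cab7b562")],
    expected_fix=Revision("c595a789.diff", "c595a789"),
    classification=Classification.PARTIAL_FIX,
    metadata=Metadata("C", "reopen", 12, ["make"]),
    categories=["Null Pointer"],
)
```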

Files

PartialFixBenchmarkSet-artifact-MSR24-submission2.zip

Files (77.3 GB)

Name	Size
PartialFixBenchmarkSet-artifact-MSR24-submission2.zip (md5:9c32f2d2e136d40aacb0e23041ac2544)	77.3 GB
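Given the size of the archive, it can be worth verifying the download against the published MD5 checksum before unpacking. A minimal sketch using only the Python standard library follows; the function names are hypothetical, and the file is read in chunks so that the archive does not have to fit into memory.

```python
import hashlib

# Checksum as published with the archive.
EXPECTED_MD5 = "9c32f2d2e136d40aacb0e23041ac2544"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()


# Example:
# archive_is_intact("PartialFixBenchmarkSet-artifact-MSR24-submission2.zip")
```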
