Benchmark Set for Program Repair Based on Partial Fixes, Dataset
Partial-Fix Data Set
Partial Fixes
Identifying and fixing errors in programs remains a challenge and is one of the most time-consuming tasks in software development. But even after a bug is identified, and a fix has been proposed by a developer or tool, it can happen that the fix is incomplete and does not cover all possible inputs that trigger the bug. This happens quite often and leads to re-opened issues and inefficiencies.
We present the first curated benchmark set composed of incomplete fixes. Each entry in the benchmark set contains a series of commits fixing the same issue, where multiple of the intermediate commits are incomplete fixes. These are sourced from real-world open-source C projects from GitHub.
The selection process involves both automated and manual stages. Initially, we employ heuristics to identify potential partial fixes from repositories, subsequently validating them through meticulous manual inspection. This process ensures the accuracy and reliability of the curated dataset.
We envision that the data set will allow researchers to investigate partial fixes in more detail, allowing them to develop new techniques to detect and fix them.
Reconstruction of the Data Set
To recover the data set from the task definitions, use the script `init_yaml.py` under `scripts/`:
```
./scripts/init_yaml.py <path-to-task-definition-1> <path-to-task-definition-2> ...
```
This will download all the required files and place them in the locations specified in the task definitions.
You can also use the script to get the zip archives at the different revisions by using the flag `--init-zips`.
To initialize all task definitions in the data set, you can use the following command, if your shell supports globbing:
```
./scripts/init_yaml.py partial-fixes/**/**/*.yml
```
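If the shell does not expand the pattern, the same call can be assembled from Python. This is a minimal sketch, assuming it is run from the root of the data set and that `scripts/init_yaml.py` accepts the paths and the `--init-zips` flag as described above. The helper names `find_task_definitions` and `build_command` are hypothetical, and placing the flag before the paths is an assumption. Unlike the shell pattern, `rglob` matches `.yml` files at any depth below the given directory.

```python
import subprocess
from pathlib import Path


def find_task_definitions(root: Path) -> list[Path]:
    """Return all .yml files below root, sorted for reproducible runs."""
    return sorted(root.rglob("*.yml"))


def build_command(script: Path, tasks: list[Path], init_zips: bool = False) -> list[str]:
    """Assemble the argument list for init_yaml.py (flag placement is an assumption)."""
    cmd = [str(script)]
    if init_zips:
        cmd.append("--init-zips")
    cmd.extend(str(task) for task in tasks)
    return cmd


def init_all(root: Path = Path("partial-fixes"), init_zips: bool = False) -> None:
    """Run init_yaml.py once with every task definition found below root."""
    tasks = find_task_definitions(root)
    if not tasks:
        raise FileNotFoundError(f"no task definitions found under {root}")
    subprocess.run(build_command(Path("scripts/init_yaml.py"), tasks, init_zips), check=True)
```

Calling `init_all()` corresponds to the glob command above, and `init_all(init_zips=True)` additionally requests the zip archives.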
Check Task Definitions
To check whether the task definitions comply with our format, run:
`./doc/check-schema.py doc/schema.yml partial-fixes/`
from this directory.
The expected output is "`All task definitions are valid.`".
Structure of the Data Set
The data set has the following structure:
```
├── repository_1
│   ├── partial_1
│   │   ├── 7e52c483.zip
│   │   ├── cab7b562.diff
│   │   ├── c595a789.diff
│   │   └── task_definition.yml
│   └── partial_2
│       ├── ...
├── repository_2
...
```
Every repository with at least one curated partial fix has its own directory. Inside it, each partial fix gets a directory `partial_<id>` housing the snapshots of the project at the specific revisions and the task definition `task_definition.yml` described below.
The file names correspond to the SHA1 commit hash of the project in the ZIP file.
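The layout can be walked programmatically. The following is a minimal sketch, assuming the two-level `repository/partial_<id>` structure shown above; the function names are hypothetical. It only groups files by extension: the order in which diffs apply comes from the task definition, not from the file names.

```python
from pathlib import Path


def summarize_partial_fix(directory: Path) -> dict[str, list[str]]:
    """Group the files of one partial_<id> directory by their role."""
    return {
        "snapshots": sorted(p.name for p in directory.glob("*.zip")),
        "diffs": sorted(p.name for p in directory.glob("*.diff")),
        "task_definitions": sorted(p.name for p in directory.glob("*.yml")),
    }


def iter_partial_fixes(root: Path):
    """Yield (repository name, partial fix directory) pairs below root."""
    for repo in sorted(p for p in root.iterdir() if p.is_dir()):
        for partial in sorted(p for p in repo.iterdir() if p.is_dir()):
            yield repo.name, partial
```

A quick consistency check is to iterate over `iter_partial_fixes(root)` and report every directory whose summary does not contain exactly one task definition.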
Overview
The Partial Fix Data Set Schema is a blueprint for describing tasks within a dataset of manually curated partial fixes.
Each partial fix is stored in a dedicated directory, housing zip files representing project snapshots before, during, and after addressing an issue, along with a YML file providing crucial details about the fix.
A nice visual representation of the following information can be found [here](doc/schema_doc.html).
General Information
- **Schema Version:** Version of task definitions.
- **Repository URL:** URL of the repository associated with the partial fix.
Sequence
- **Base Version:**
  - `input_file`: The name of the file in the data set, relative to the YML file.
  - `commit_sha1`: SHA1 commit hash of the base version.
- **Fix Attempt:**
  - List of fix attempts, each with:
    - `input_file`: The filename of the diff compared to the previous partial fix or the base version, relative to the YML file.
    - `commit_sha1`: SHA1 commit hash of the fix attempt.
- **Expected Fix:**
  - `input_file`: The filename of the diff compared to the last partial fix, relative to the YML file.
  - `commit_sha1`: SHA1 commit hash of the expected fix.
To fully restore all revisions, unzip the `input_file` of the base version and apply all diffs sequentially, in the order in which they appear in the sequence.
Classification
Fix classification (one of):
- Partial Fix
- Unknown
- No Partial Fix
Categories (optional)
List of bug-causing categories, for example:
- `Build`: Problems with building the project in certain environments.
- `Null Pointer`: Any null pointer error.
- `Arithmetic and Control-flow`: Problematic control-flow or buggy statements.
- `Preprocessing Directives`: Issue with CI or unit tests.
- `Hardware and OS Related`: The problem occurs only on special hardware and operating systems.
- `Wrong API Usage`: Misuse of functions and API calls.
- `Race Conditions`: Race conditions.
- `Performance`: The fix attempts cause a loss in performance (speed, memory).
- `Memory Leaks`: Memory Leaks.
Metadata
Additional task metadata, including:
- `language`: Programming language of the original project and partial fix.
- `strategy`: The strategy used for mining the partial fix (e.g., 'reopen', 'status', 'linux-convention').
- `fix_size`: Size of the fix (number of additions and deletions).
- `build_system`: Keywords indicating how to build the project.
- `related_issue`: URL to an issue related to this partial fix.
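The classification, categories, and metadata are intended for selecting subsets of the benchmark. The following is a minimal sketch of such a filter. The container `TaskSummary` and the function `select` are hypothetical, and the sketch assumes that the classification is stored with the literal value `Partial Fix` and that the metadata fields are available as a flat mapping. How these fields are nested in the YML file is defined by the schema, not by this example.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSummary:
    """The parts of a task definition needed for filtering (hypothetical container)."""
    classification: str
    categories: list[str] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def select(tasks, *, language=None, strategy=None, category=None):
    """Keep partial fixes that match every criterion that is not None."""
    result = []
    for task in tasks:
        if task.classification != "Partial Fix":
            continue
        if language is not None and task.metadata.get("language") != language:
            continue
        if strategy is not None and task.metadata.get("strategy") != strategy:
            continue
        if category is not None and category not in task.categories:
            continue
        result.append(task)
    return result
```

For example, `select(tasks, strategy="reopen", category="Null Pointer")` would keep only confirmed partial fixes that were mined with the `reopen` strategy and labeled as null pointer errors.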
Files
Name | Size | Checksum |
---|---|---|
PartialFixBenchmarkSet-artifact-MSR24-submission2.zip | 77.3 GB | md5:9c32f2d2e136d40aacb0e23041ac2544 |
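Given the size of the archive, it is worth comparing the download against the MD5 checksum listed above before unpacking it. The following is a minimal sketch using only the Python standard library; the function names `md5_of` and `verify` are hypothetical, and the local path of the downloaded file is up to the reader.

```python
import hashlib
from pathlib import Path

# Checksum listed for PartialFixBenchmarkSet-artifact-MSR24-submission2.zip.
EXPECTED_MD5 = "9c32f2d2e136d40aacb0e23041ac2544"


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so the archive never has to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest equals the expected hex string."""
    return md5_of(path) == expected.lower()
```

Calling `verify(Path("PartialFixBenchmarkSet-artifact-MSR24-submission2.zip"))` returns `True` only for an intact download.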