BalancedCommitBench: A Language-Balanced Commit Messages Subset Extracted from CommitBench:
Authors/Creators
Contributors
Contact person:
Researcher (3):
Description
We base our experiments on CommitBench, a large-scale benchmark for commit message generation introduced by Schall et al. (2024). CommitBench aggregates more than one million real commits collected from thousands of open-source repositories across six major programming languages: Python, Java, JavaScript, Go, PHP, and Ruby. Each instance contains a git diff, the corresponding human-written commit message, and metadata such as author, timestamp, and repository name.
From this corpus, we derive BalancedCommitBench, a language-balanced subset obtained through systematic preprocessing. The dataset is cleaned via language normalization, de-duplication, removal of bot-generated and low-information commits, and length-based filtering, followed by uniform sampling across the six languages to ensure balanced representation.
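The preprocessing steps described above can be illustrated with a minimal Python sketch. It is not the authors' actual pipeline: the record field names (`language`, `author`, `message`, `diff`), the alias map, the bot pattern, and the length thresholds are all hypothetical choices made for illustration only.

```python
import random
import re
from collections import defaultdict

# The six languages named in the description.
LANGUAGES = ["python", "java", "javascript", "go", "php", "ruby"]

# Hypothetical alias map used for language normalization.
ALIASES = {"py": "python", "js": "javascript", "golang": "go", "rb": "ruby"}

# Hypothetical heuristic for spotting bot authors.
BOT_PATTERN = re.compile(r"\[bot\]|dependabot|renovate", re.IGNORECASE)


def normalize_language(name):
    """Map a raw language label onto one canonical lowercase name."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


def keep(commit, min_words=3, max_diff_chars=10000):
    """Drop bot commits, low-information messages, and out-of-range diffs.

    The thresholds are assumptions, not values from the dataset description.
    """
    if BOT_PATTERN.search(commit["author"]):
        return False
    if len(commit["message"].split()) < min_words:
        return False
    return 0 < len(commit["diff"]) <= max_diff_chars


def balance(commits, per_language, seed=0):
    """Clean the commits, then sample the same number from each language."""
    rng = random.Random(seed)
    seen = set()
    buckets = defaultdict(list)
    for commit in commits:
        lang = normalize_language(commit["language"])
        key = (commit["diff"], commit["message"])  # exact-match de-duplication
        if lang not in LANGUAGES or key in seen or not keep(commit):
            continue
        seen.add(key)
        buckets[lang].append({**commit, "language": lang})
    sample = []
    for lang in LANGUAGES:
        pool = buckets[lang]
        # Uniform sampling without replacement, capped by the pool size.
        sample.extend(rng.sample(pool, min(per_language, len(pool))))
    return sample
```

Calling `balance(commits, per_language=n)` returns at most `n` commits per language. To get a strictly balanced result, `n` should be no larger than the smallest cleaned language pool.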
M. Schall, T. Czinczoll and G. De Melo, "CommitBench: A Benchmark for Commit Message Generation," 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Rovaniemi, Finland, 2024, pp. 728-739, doi: 10.1109/SANER60148.2024.00080.
Files
dataset_balanced_test_norm6.zip
Total size: 169.1 MB

| File (MD5 checksum) | Size |
|---|---|
| md5:f6943695312e0b4c651f1463b663d779 | 57.7 MB |
| md5:0ca286bf89b7f412d57a96b2db269d68 | 57.7 MB |
| md5:2dda8019cf917df00994e5c82927425b | 53.7 MB |
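A downloaded archive can be checked against the MD5 checksums listed above. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, and because the listing does not pair each checksum with a file name, the check only tests membership in the set of listed checksums.

```python
import hashlib

# MD5 checksums copied from the file listing above (file names not paired).
LISTED_MD5 = {
    "f6943695312e0b4c651f1463b663d779",
    "0ca286bf89b7f412d57a96b2db269d68",
    "2dda8019cf917df00994e5c82927425b",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path):
    """Return True if the file's MD5 equals any checksum in the listing."""
    return md5_of_file(path) in LISTED_MD5
```

For example, `matches_listed_checksum("dataset_balanced_test_norm6.zip")` should return `True` for an intact download of that archive.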
Additional details
Dates
- Available: 2026-01-29