Published January 29, 2026 | Version v1
Dataset Open

BalancedCommitBench: A Language-Balanced Commit Messages Subset Extracted from CommitBench

  • 1. King Fahd University of Petroleum and Minerals
  • 2. Imam Abdulrahman Bin Faisal University

Description

We base our experiments on CommitBench, a large-scale benchmark for commit message generation introduced by Schall et al. (2024). CommitBench aggregates more than one million real commits collected from thousands of open-source repositories across six major programming languages: Python, Java, JavaScript, Go, PHP, and Ruby. Each instance contains a git diff, the corresponding human-written commit message, and metadata such as author, timestamp, and repository name.

From this corpus, we derive BalancedCommitBench, a language-balanced subset obtained through systematic preprocessing. The dataset is cleaned via language normalization, de-duplication, removal of bot-generated and low-information commits, and length-based filtering, followed by uniform sampling across the six languages to ensure balanced representation.
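
The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used to build the dataset: the record keys (`language`, `diff`, `message`, `author`), the alias table, the bot markers, and both thresholds are assumptions chosen for the example, and the real filtering criteria may differ.

```python
import random
from collections import defaultdict

# Everything below is hypothetical: aliases, markers, and thresholds are
# illustrative assumptions, not values taken from the dataset authors.
LANGUAGE_ALIASES = {
    "py": "python", "python": "python",
    "js": "javascript", "javascript": "javascript",
    "java": "java",
    "go": "go", "golang": "go",
    "php": "php",
    "rb": "ruby", "ruby": "ruby",
}
BOT_MARKERS = ("[bot]", "dependabot", "renovate")
MIN_MESSAGE_WORDS = 3      # assumed cutoff for "low-information" messages
MAX_DIFF_CHARS = 20_000    # assumed length-based filter on the diff


def normalize_language(label):
    """Map a raw language label to a canonical name, or None if unknown."""
    return LANGUAGE_ALIASES.get(label.strip().lower())


def clean(commits):
    """Apply normalization, de-duplication, bot/low-information and length filters."""
    seen = set()
    kept = []
    for commit in commits:
        language = normalize_language(commit["language"])
        if language is None:
            continue  # outside the six target languages
        key = (commit["diff"], commit["message"])
        if key in seen:
            continue  # exact duplicate of an earlier commit
        seen.add(key)
        if any(marker in commit["author"].lower() for marker in BOT_MARKERS):
            continue  # likely bot-generated
        if len(commit["message"].split()) < MIN_MESSAGE_WORDS:
            continue  # low-information message
        if len(commit["diff"]) > MAX_DIFF_CHARS:
            continue  # diff too long
        kept.append({**commit, "language": language})
    return kept


def balance(commits, seed=0):
    """Sample the same number of commits from every language present."""
    by_language = defaultdict(list)
    for commit in commits:
        by_language[commit["language"]].append(commit)
    if not by_language:
        return []
    per_language = min(len(group) for group in by_language.values())
    rng = random.Random(seed)
    sample = []
    for language in sorted(by_language):
        sample.extend(rng.sample(by_language[language], per_language))
    return sample
```

In this sketch the per-language sample size is set by the smallest language group after cleaning, which is one simple way to obtain equal representation; a fixed target size per language would work equally well.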

M. Schall, T. Czinczoll and G. De Melo, "CommitBench: A Benchmark for Commit Message Generation," 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Rovaniemi, Finland, 2024, pp. 728-739, doi: 10.1109/SANER60148.2024.00080.

Files

dataset_balanced_test_norm6.zip

Files (169.1 MB)

Checksum Size
md5:f6943695312e0b4c651f1463b663d779 57.7 MB
md5:0ca286bf89b7f412d57a96b2db269d68 57.7 MB
md5:2dda8019cf917df00994e5c82927425b 53.7 MB
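
After downloading, a file can be checked against the MD5 checksums listed above. The sketch below uses only Python's standard `hashlib` module; the helper names and the local file path are hypothetical, and which checksum belongs to which archive has to be read from the record itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, expected_md5):
    """Compare a file's digest with a listed value such as 'md5:<hex digest>'."""
    expected = expected_md5.strip().lower().removeprefix("md5:")
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum("downloaded_archive.zip", "md5:f6943695312e0b4c651f1463b663d779")` returns `True` only if the local file (a placeholder name here) has that digest. Note that `str.removeprefix` requires Python 3.9 or later.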

Additional details

Dates

Available
2026-01-29