Dataset Open Access

Code4Bench: A Multidimensional Benchmark of Codeforces Data for Different Program Analysis Techniques

Majd Amirabbas; Vahidi-Asl Mojtaba; Khalilian Alireza; Baraani-Dastjerdi Ahmad; Zamani Bahman

Thesis supervisor(s)

Vahidi-Asl Mojtaba; Haghighi Hasan

Reproducible research relies on well-designed benchmarks. However, evaluation on a single benchmark increases the risk of overfitting; that is, of tuning a technique to reach a certain performance on that benchmark alone. In recent years, several well-designed benchmarks have been constructed for different subfields of program analysis. However, they often involve real-world industrial projects in only a few languages, such as C or Java. We provide Code4Bench, a benchmark comprising 3,421,357 programs totaling 306,053,105 lines of code in 41 versions of 28 programming languages, such as C/C++, Java, Python, and Kotlin. We constructed this benchmark from Codeforces, a well-known programming competition website that is widely used by programmers internationally. Code4Bench advances the state of the art in conducting reproducible and comparative experiments. It helps mitigate bias and increase the generality and conclusiveness of results. We present our methodology for constructing Code4Bench and give various descriptive statistics. We also conducted an online survey of the Codeforces users whose code is included in the benchmark. The survey covers the users’ demographic information and programming habits, and its results are also provided in the benchmark. Finally, we used an automatic process to localize faults within the faulty versions and categorized them according to a coarse-grained classification. In addition to its use in empirical studies, Code4Bench can be used to teach programming and evolve algorithmic problems. We release Code4Bench in database format so that researchers can extract further data from the benchmark with arbitrary queries.

Code4Bench version 1.0.0 is publicly available at https://zenodo.org/record/2582968, with DOI 10.5281/zenodo.2582968, thereby providing long-term storage and versioning. It is released under the terms of the Creative Commons Attribution 4.0 International license. Code4Bench is also publicly available at https://github.com/code4bench/Code4Bench, where we have provided some additional information and script examples.

Files (635.9 MB)
Name: code4bench.rar
Size: 635.9 MB
md5: 3ae77dfabec6e7a97ca7608c1aa41c04
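
The md5 digest listed for code4bench.rar can be used to confirm that a download is intact. The following is a minimal sketch, not part of the Code4Bench release: the function names `md5_of_file` and `verify_archive` are hypothetical, and the archive is assumed to have been saved as `code4bench.rar` in the current working directory.

```python
import hashlib
from pathlib import Path

# Digest as listed on the record page for code4bench.rar.
EXPECTED_MD5 = "3ae77dfabec6e7a97ca7608c1aa41c04"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads avoid loading the whole 635.9 MB archive into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's md5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("code4bench.rar")  # assumed download location
    if archive.exists():
        print("checksum OK" if verify_archive(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the archive should be fetched again before extraction.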
Statistics (all versions / this version)
Views: 472 / 472
Downloads: 69 / 69
Data volume: 43.9 GB / 43.9 GB
Unique views: 431 / 431
Unique downloads: 42 / 42
