Published March 4, 2019 | Version 1.0.0
Dataset Open

Code4Bench: A Multidimensional Benchmark of Codeforces Data for Different Program Analysis Techniques

  • 1. Faculty of Computer Science and Engineering, Shahid Beheshti University G. C., Tehran, Iran
  • 2. Department of Software Engineering, Faculty of Computer Engineering, University of Isfahan, Isfahan, Iran

Contributors

  • 1. Faculty of Computer Science and Engineering, Shahid Beheshti University G. C., Tehran, Iran

Description

Reproducible research relies on well-designed benchmarks. However, evaluating on a single benchmark increases the risk of overfitting, that is, of tuning a technique to perform well on that benchmark alone. In recent years, several well-designed benchmarks have been constructed for different subfields of program analysis. However, they often consist of real-world industrial projects in only a few languages, such as C or Java. We provide Code4Bench, a benchmark comprising 3,421,357 programs totaling 306,053,105 lines of code in 41 versions of 28 programming languages, including C/C++, Java, Python, and Kotlin. We constructed this benchmark from Codeforces, a well-known programming competition website used by programmers worldwide. Code4Bench advances the state of the art in conducting reproducible and comparative experiments. It helps mitigate bias and increase the generality and conclusiveness of results. We present our methodology for constructing Code4Bench and give various descriptive statistics. We also conducted an online survey of the Codeforces users whose code is included in the benchmark. The survey covers the users' demographic information and programming habits, and its results are also provided in the benchmark. Finally, we used an automatic process to localize faults within the faulty versions and categorize them according to a coarse-grained classification. In addition to its use in empirical studies, Code4Bench can be used to teach programming and to evolve algorithmic problems. We release Code4Bench in database format so that researchers can extract further data from the benchmark with arbitrary queries.

Code4Bench version 1.0.0 is publicly available at https://zenodo.org/record/2582968, with DOI 10.5281/zenodo.2582968, which provides long-term storage and versioning. It is released under the terms of the Creative Commons Attribution 4.0 International license. Code4Bench is also publicly available at https://github.com/code4bench/Code4Bench, where we provide additional information and script examples.
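Because the benchmark is released in database format, descriptive statistics such as the number of programs per language can be obtained with an aggregate query. The following is only a minimal sketch of that idea: the actual schema and database engine of Code4Bench are not described on this page, so the `submissions` table, its `language` and `line_count` columns, and the four toy rows are hypothetical stand-ins held in an in-memory SQLite database.

```python
import sqlite3


def build_toy_database():
    """Create an in-memory database with a HYPOTHETICAL schema.

    The table and column names are assumptions made for illustration;
    they are not the real Code4Bench schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE submissions ("
        "id INTEGER PRIMARY KEY, "
        "language TEXT NOT NULL, "
        "line_count INTEGER NOT NULL)"
    )
    # Made-up rows, used only so the query has something to aggregate.
    conn.executemany(
        "INSERT INTO submissions (language, line_count) VALUES (?, ?)",
        [("C++", 40), ("C++", 25), ("Java", 60), ("Python", 12)],
    )
    return conn


def programs_per_language(conn):
    """Return (language, program count, total lines) tuples.

    Rows are ordered by program count (descending), then by language name.
    """
    query = """
        SELECT language, COUNT(*) AS programs, SUM(line_count) AS total_lines
        FROM submissions
        GROUP BY language
        ORDER BY programs DESC, language ASC
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for language, programs, total_lines in programs_per_language(build_toy_database()):
        print(language, programs, total_lines)
```

With the real database, the same pattern would apply after replacing the connection and the table and column names with those documented in the GitHub repository.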

Files

Files (635.9 MB)

Size: 635.9 MB
md5:3ae77dfabec6e7a97ca7608c1aa41c04
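The MD5 checksum listed above can be used to confirm that a download completed without corruption. The sketch below shows one way to do this with Python's standard `hashlib` module. The helper names and the file path in the usage note are hypothetical, since the archive's file name is not given on this page.

```python
import hashlib

# Checksum as published in the file listing of this record.
EXPECTED_MD5 = "3ae77dfabec6e7a97ca7608c1aa41c04"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that a 635.9 MB archive does not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_published_checksum("path/to/downloaded_archive")` (a placeholder path) returns `True` only if the local file's digest equals the published one. MD5 is suitable here for detecting accidental corruption, not for defending against deliberate tampering.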