PSB2: The Second Program Synthesis Benchmark Suite

doi:10.5281/zenodo.4678740

Published April 10, 2021 | Version 1.0

Dataset Open

1. Hamilton College

General Program Synthesis Benchmark Suite Datasets

Version 1.0

This repository contains datasets for the 25 problems described in the paper PSB2: The Second Program Synthesis Benchmark Suite. These problems come from a variety of sources and require a range of programming constructs and datatypes to solve. The datasets are designed to be usable with any method of general program synthesis, including but not limited to inductive program synthesis and evolutionary methods such as genetic programming.

Use

Each problem in the benchmark suite is located in a separate directory in the `datasets` directory.

For each problem, we provide a set of `edge` cases and a set of `random` cases. The `edge` cases are hand-chosen cases representing the limits of the problem. The `random` cases are all generated based on problem-specific distributions. For each problem, we included exactly 1 million `random` cases.

A typical use of these datasets for a set of runs of program synthesis would be:

- For each run, include every `edge` case in the training set.
- For each run, add a different, randomly sampled set of `random` cases to the training set.
- Use a larger set of `random` cases as an unseen test set.
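The steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' procedure: the function name `build_run_datasets`, its parameters, and the sample sizes are hypothetical, and drawing the training and test samples without overlap is one reading of "unseen". The cases are assumed to be already loaded into lists.

```python
import random

def build_run_datasets(edge_cases, random_cases, n_train_random, n_test, seed):
    """Return (train, test) for one synthesis run.

    train: every edge case plus n_train_random sampled random cases
    test:  n_test random cases that do not appear in the training sample
    """
    rng = random.Random(seed)  # a different seed per run gives a different sample
    # Draw training and test indices in one sample so they cannot overlap.
    picked = rng.sample(range(len(random_cases)), n_train_random + n_test)
    train_idx = picked[:n_train_random]
    test_idx = picked[n_train_random:]
    train = list(edge_cases) + [random_cases[i] for i in train_idx]
    test = [random_cases[i] for i in test_idx]
    return train, test

# Tiny synthetic stand-ins for the real edge and random datasets.
edge = [{"input1": i, "output1": i} for i in range(3)]
rand = [{"input1": i, "output1": i * 2} for i in range(100)]
train, test = build_run_datasets(edge, rand, n_train_random=10, n_test=50, seed=0)
```

Passing the run number as `seed` keeps each run reproducible while still giving every run its own training sample.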

Dataset format

Each edge and random dataset is provided in three formats: CSV, JSON, and EDN, with all three formats containing identical data.

The CSV files are formatted as follows:

- The first row of the file is the column names.
- Each following row corresponds to one set of program inputs and expected outputs.
- Input columns are labeled `input1`, `input2`, etc., and output columns are labeled `output1`, `output2`, etc.
- String inputs and outputs are double-quoted only when necessary. Newlines within strings are escaped.
- Columns in CSV files are comma-separated.
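A CSV file with this layout could be read with Python's standard `csv` module, as in the sketch below. The helper name `read_cases_csv` is hypothetical. Values are returned as raw strings: converting them to the right types, and undoing the newline escaping, depends on the problem and is not attempted here.

```python
import csv
import io

def read_cases_csv(f):
    """Yield (inputs, outputs) lists of raw strings from an open CSV file."""
    reader = csv.DictReader(f)  # the first row supplies the column names
    for row in reader:
        # Columns keep their file order, so input1, input2, ... stay in sequence.
        inputs = [v for k, v in row.items() if k.startswith("input")]
        outputs = [v for k, v in row.items() if k.startswith("output")]
        yield inputs, outputs

# In-memory sample in the described layout; real data would come from a file.
sample = io.StringIO('input1,input2,output1\n3,"a, b",7\n')
cases = list(read_cases_csv(sample))
# cases == [(['3', 'a, b'], ['7'])]
```

When reading from disk, open the file with `newline=""`, as the `csv` module documentation recommends, so that quoted fields are handled correctly.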

The JSON and EDN files are formatted using the JSON Lines standard (adapted for EDN).
Each case is on its own line of the data file. The files should be read line by line, with each line parsed into an object/map using a JSON/EDN parser.
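For the JSON files, line-by-line reading might look like the following sketch. The helper name `read_cases_jsonl` is hypothetical, and the sample keys `input1` and `output1` are assumed to mirror the CSV column names. EDN files would need an EDN parser in place of `json.loads`.

```python
import io
import json

def read_cases_jsonl(f):
    """Yield one parsed case (a dict) per non-empty line of an open file."""
    for line in f:
        line = line.strip()
        if line:  # tolerate a trailing blank line
            yield json.loads(line)

# In-memory sample; real data would come from one of the JSON files.
sample = io.StringIO(
    '{"input1": 5, "output1": "five"}\n'
    '{"input1": 6, "output1": "six"}\n'
)
cases = list(read_cases_jsonl(sample))
```

Because the function is a generator, a file with 1 million cases can be streamed without loading it all into memory.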

Citation

If you use these datasets in a publication, please cite the paper PSB2: The Second Program Synthesis Benchmark Suite and include a link to this repository.

BibTeX entry for paper:

@InProceedings{Helmuth:2021:GECCO,
  author =    "Thomas Helmuth and Peter Kelly",
  title =    "{PSB2}: The Second Program Synthesis Benchmark Suite",
  booktitle =    "GECCO '21: Proceedings of the 2021 Annual Conference on Genetic and Evolutionary Computation",
  year =     "2021",
  isbn13 =    "978-1-4503-8350-9",
  organisation = "SIGEVO",
  address =    "Lille, France",
  URL =      "http://doi.acm.org/10.1145/3449639.3459285",
  DOI =      "10.1145/3449639.3459285",
  publisher =    "ACM",
  publisher_address = "New York, NY, USA",
}

Files

PSB2.zip (2.1 GB)
md5:cc331e95c289d3ab6cb6960850c62366

Additional details

Is documented by: Conference paper: 10.1145/3449639.3459285 (DOI)

