Abstract

This repository contains the artifact for the PLDI'23 paper "Proving and Disproving Equivalence of Functional Programming Assignments" by Anonymous Authors.

Getting Started Guide

Docker Image

Install Docker from https://docs.docker.com

Load the docker image:

docker load --input artifact62.tar.gz

Run the docker image in interactive mode:

docker run -it artifact62

Move to the artifact directory:

cd artifact

Artifact Information

The docker image contains the following content in the directory ~/artifact/:

  1. benchmarks/

    This directory contains all the benchmarks we used to evaluate our system. The benchmarks are organized as follows:

    • grading/ contains Scala translations of benchmarks from the LearnML framework (Table 1).

      • grading/translations/ contains our Scala translations of the student submissions, enhanced with termination annotations.
      • grading/terminating/ contains a subset of programs from grading/translations/ that our system proves terminating.
      • grading/ta_solutions/ contains reference solutions.
    • equivalence/ contains benchmarks from the equivalence checking literature (Table 3).

      • equivalence/reve.scala contains benchmarks from: Dennis Felsing, Sarah Grebing, Vladimir Klebanov, Philipp Rümmer, and Mattias Ulbrich. 2014. Automating Regression Verification.
      • equivalence/rvt-2016.scala contains benchmarks from: Ofer Strichman and Maor Veitsman. 2016. Regression Verification for Unbalanced Recursive Functions.
      • equivalence/rvt-2022.scala contains benchmarks from: Chaked R. J. Sayedoff and Ofer Strichman. 2022. Regression verification of unbalanced recursive functions with multiple calls (long version).
  2. stainless/

    This directory contains two versions of our implementation:

    • pipeline/ contains the version used in our experiments. Equivalence checking is implemented as a pipeline phase on top of the original verification system Stainless. The implementation of this phase is available in source/core/src/main/scala/stainless/extraction/trace/.

    • component/ contains the version with prettier printing and some optimizations, oriented towards deployment. Equivalence checking is implemented as a separate Stainless component on top of the original verification system Stainless. The implementation of this component is available in source/core/src/main/scala/stainless/equivchk/.

  3. figures/

    This directory contains the source files for the figures from the paper.

  4. tables/

    This directory contains the material for the tables from the paper.

  5. makefile

    This file defines a set of tasks to reproduce the experiments from the paper. The Step-by-Step Instructions below explain how to run them.

Example Run

To run the equivalence checker, use the --equivchk option of Stainless. The --comparefuns option specifies the names of the candidate functions, and --models specifies the names of the reference functions.

For example, once in the directory ~/artifact/, the following command runs equivalence checking for the programs from Figure 2, stored in figures/fig2/isSorted.scala:

stainless/component/package/stainless.sh figures/fig2/isSorted.scala \
  --equivchk=true \
  --comparefuns=isSortedA,isSortedB,isSortedC \
  --models=isSortedR \
  --timeout=10 \
  --solvers=smt-z3 \
  --silent-verification \
  --no-colors

For our example run, we get the following output (followed by a Stainless summary table):

Printing equivalence checking results:
List of functions that are equivalent to model IsSorted.isSortedB: IsSorted.isSortedC
List of functions that are equivalent to model IsSorted.isSortedR: IsSorted.isSortedB
List of erroneous functions: IsSorted.isSortedA
List of timed-out functions:
List of wrong functions:
Printing the final state:
Path for the function IsSorted.isSortedB: IsSorted.isSortedR
Path for the function IsSorted.isSortedC: IsSorted.isSortedB, IsSorted.isSortedR
Counterexample for the function IsSorted.isSortedA:
  l -> Cons[Int](-1686134787, Cons[Int](1, Cons[Int](1, Nil[Int]())))

Stainless successfully proves the equivalence of isSortedB and isSortedR. For isSortedC, the equivalence check against isSortedR times out. Subsequently, however, Stainless proves the equivalence of isSortedC and isSortedB. By transitivity, our approach concludes that isSortedC is also equivalent to isSortedR, and therefore correct. On the other hand, Stainless labels isSortedA as incorrect and reports a counterexample that disproves the equivalence.
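
To try the checker on another input, a single Scala file with a reference function and one or more candidate functions suffices. The following is a minimal hypothetical sketch (the file name add.scala and the functions addR and addC are our illustration, not part of the artifact):

import stainless.lang._

object AddDemo {
  // Hypothetical reference function: counts down on the first argument
  // and increments the result after the recursive call.
  def addR(n: BigInt, m: BigInt): BigInt = {
    require(n >= 0)
    if (n == 0) m else addR(n - 1, m) + 1
  }

  // Hypothetical candidate: accumulates into the second argument instead.
  def addC(n: BigInt, m: BigInt): BigInt = {
    require(n >= 0)
    if (n == 0) m else addC(n - 1, m + 1)
  }
}

Saving this as add.scala and running

stainless/component/package/stainless.sh add.scala --equivchk=true --comparefuns=addC --models=addR --timeout=10 --solvers=smt-z3

should then report addC as equivalent to the model addR.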

Step-by-Step Instructions

This section contains step-by-step instructions for reproducing the results from the paper.

Table 1. Description of our benchmark programs.

Our benchmarks are available in benchmarks/grading/translations/. The translation from OCaml to Scala was semi-manual, so this artifact does not include it. The following command computes the number of candidate programs (Column #P), the average number of functions per candidate program (Column #F), and the average program size in lines of code (Column LOC):

make print-tab1

Reference solutions are available in the folder benchmarks/grading/terminating/ta_solutions/ (Column #S).

The following command computes the percentage of submissions with decreases annotations (Section 5.2 Termination Analysis):

make print-decreases
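
For reference, a decreases annotation specifies a termination measure that Stainless checks to be strictly decreasing at every recursive call. A minimal hypothetical sketch of such an annotation (not one of the actual submissions):

import stainless.lang._
import stainless.collection._

object LenDemo {
  // The measure l.size strictly decreases on the recursive call,
  // which lets Stainless prove that len terminates.
  def len(l: List[BigInt]): BigInt = {
    decreases(l.size)
    l match {
      case Nil() => BigInt(0)
      case Cons(_, t) => 1 + len(t)
    }
  }
}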

Table 2. Termination Analysis, Equivalence Checking Results and Ablation Study.

The following command prints the equivalence checking results for the entire grading data set:

make print-tab2

This subsection contains step-by-step instructions for reproducing the experiments for Table 2 on a random sample of 10 submissions per benchmark. On this sample, the main Equivalence Checking experiment should take around half an hour on a standard laptop, and the Ablation Study around four hours.

To increase the sample size, modify the sample_size value on line 5 of the makefile. To compute only a subset of the table rows, modify the grading_benchmarks list on line 8 of the makefile.

The sample_size and grading_benchmarks parameters allow alternative sampling of Table 2. For instance, to compute the full first row, use grading_benchmarks=filter and sample_size=210.

For each experiment, we provide the following three operations:

  • print aggregates the existing log files and prints the summary of results; in the initial state, this operation prints the summary of precomputed results.
  • setup re-initializes the experiment by generating the input samples of the given size; Warning: executing this operation will erase existing log files.
  • run runs the experiment and stores the results in corresponding log files; Warning: executing this operation will erase existing log files.

The initial state of this artifact (upon loading the docker image) contains the final state of the experiments. We start each subsection with a print command to print the summary of precomputed results, followed by the setup and run commands to reproduce each experiment.

  1. Termination Analysis

The following command prints the summary of termination checking results:

make print-termination

To reproduce the experiment, use the following commands to set up and run termination checking:

make setup-termination
make run-termination

The output is logged in tables/tab2/termination/termination/*.termination files.

Running make print-termination prints the summary of the new results.

  2. Equivalence Checking

The following command prints the summary of results:

make print-base

To reproduce the experiment, use the following commands to set up and run equivalence checking:

make setup-base
make run-base

The output is logged in tables/tab2/base/base/*.log files.

Running make print-base prints the summary of the new results.

  3. Ablation Study
  • Functional Induction (Column NI)

The following command prints the summary of results:

make print-ni

To reproduce the experiment, use the following commands to set up and run equivalence checking:

make setup-ni
make run-ni

The output is logged in tables/tab2/ni/ni/*.log files.

Running make print-ni prints the summary of the new results.

Observe that the number of programs proven incorrect (Column I) is the same as in the base results, while the number of programs proven correct (Column C) is always 0 (Section 5.5 Ablation Study, paragraph Functional Induction).

  • Function Call Matching (Column NM)

The following command prints the summary of results:

make print-nm

To reproduce the experiment, use the following commands to set up and run equivalence checking:

make setup-nm
make run-nm

The output is logged in tables/tab2/nm/nm/*.log files.

Running make print-nm prints the summary of the new results.

Observe that the number of programs proven correct (Column C) is 0 for the four benchmarks that contain auxiliary recursive functions: natmul, uniq, formula, and lambda (Section 5.5 Ablation Study, paragraph Function Call Matching).

  • Multiple Reference Solutions (Column 1RS)

The following command prints the summary of results:

make print-1rs

To reproduce the experiment, use the following commands to set up and run equivalence checking:

make setup-1rs
make run-1rs

The output is logged in tables/tab2/1rs/1rs/*.log files.

Running make print-1rs prints the summary of the new results.

Observe that the results are the same as the base results for the 7 benchmarks that only contain one reference solution: filter, max, mirror, mem, change, heap, formula (Section 5.5 Ablation Study, paragraph Multiple Reference Solutions).

Table 3. Comparison to REVE and RVT.

Our benchmarks from Table 3 are available in benchmarks/equivalence/. The following command prints the summary of equivalence checking results (for all but the ackermann and mccarthy91 benchmarks, where the termination proof attempts failed):

make print-tab3

To reproduce the experiment, the following command runs termination checking and equivalence checking:

make run-tab3

The execution takes a few minutes. The output is logged in tables/tab3/*.log files and tables/tab3/termination/*.termination files.

Running make print-tab3 prints the summary of the new results.

Figure 1. Subtle bugs in introductory programming exercises.

The source code is available in figures/fig1a/nat.scala and figures/fig1b/binary.scala. The following commands run the examples:

stainless/component/package/stainless.sh figures/fig1a/nat.scala --timeout=1 --solvers=smt-z3 --no-colors
stainless/component/package/stainless.sh figures/fig1b/binary.scala --timeout=1 --solvers=smt-z3 --no-colors

Stainless emits warnings about possible addition overflows in the program from Figure 1a and reports a counterexample for the program from Figure 1b.
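
The overflow warnings concern machine integer arithmetic: 32-bit Int additions can wrap around. A hypothetical sketch of the pattern (not the actual Figure 1a code):

object OverflowDemo {
  // With 32-bit Int semantics, x + y wraps around for large inputs,
  // e.g. add(2147483647, 1) evaluates to -2147483648, so Stainless
  // may flag the addition as a possible overflow.
  def add(x: Int, y: Int): Int = x + y
}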

Figure 2. Motivating example.

The source code is available in figures/fig2/isSorted.scala. The following command runs the example:

stainless/component/package/stainless.sh figures/fig2/isSorted.scala --equivchk=true --comparefuns=isSortedA,isSortedB,isSortedC --models=isSortedR --timeout=1 --check-measures=no --infer-measures=false --solvers=smt-z3 --silent-verification --no-colors

Stainless proves the correctness of isSortedB and isSortedC, by proving them equivalent to the reference solution. For isSortedA, it disproves the equivalence and reports a counterexample.

Figure 4. Example programs to illustrate function call matching.

The source code is available in figures/fig4/uniq.scala. The following command runs the example:

stainless/component/package/stainless.sh figures/fig4/uniq.scala --equivchk=true --comparefuns=uniqA --models=uniqR --timeout=1 --check-measures=no --infer-measures=false --solvers=smt-z3 --silent-verification --no-colors

Our pairwise equivalence checking subroutine proves uniqA equivalent to uniqR by matching the functions unique and distinct, as well as find and isin.
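
To illustrate the setting, here is a hypothetical sketch (not the contents of figures/fig4/uniq.scala): the candidate and the model deduplicate a list using differently named membership helpers, and pairwise equivalence checking must discover that the helpers correspond.

import stainless.collection._

object UniqDemo {
  // Candidate's auxiliary membership test.
  def isin(x: BigInt, l: List[BigInt]): Boolean = l match {
    case Nil() => false
    case Cons(h, t) => h == x || isin(x, t)
  }

  def uniqA(l: List[BigInt]): List[BigInt] = l match {
    case Nil() => Nil()
    case Cons(h, t) => if (isin(h, t)) uniqA(t) else Cons(h, uniqA(t))
  }

  // Model's auxiliary membership test: same role, different name.
  def find(x: BigInt, l: List[BigInt]): Boolean = l match {
    case Nil() => false
    case Cons(h, t) => h == x || find(x, t)
  }

  def uniqR(l: List[BigInt]): List[BigInt] = l match {
    case Nil() => Nil()
    case Cons(h, t) => if (find(h, t)) uniqR(t) else Cons(h, uniqR(t))
  }
}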

Figure 5. Clusters of correct submissions found by our system for the natadd benchmark.

The source graph is in figures/fig5/natadd.csv.

Figure 6. Numbers of pairwise comparisons for the max benchmark.

The following script prints the summary of results:

(cd figures/fig6/ && make print)

To reproduce the experiment, the following command computes 24 data points (averaged over 10 runs) for each of the seven variations:

(cd figures/fig6/ && make)

The entire experiment takes multiple days. To compute only a subset of the data points, reduce the number of runs by removing the corresponding figures/fig6/*.inputs files.

Figure 7. The reference solution of the max benchmark and two student submissions.

The source code is available in figures/fig7/max.scala. The following command runs the example:

stainless/component/package/stainless.sh figures/fig7/max.scala --equivchk=true --comparefuns=maxC,maxT --models=maxR --timeout=2 --solvers=smt-z3 --silent-verification --no-colors

Our system proves the equivalence of maxT and maxR by first proving the equivalence of maxC and maxR, and then proving the equivalence of maxC and maxT.