This repository contains the artifact for the PLDI'23 paper "Proving and Disproving Equivalence of Functional Programming Assignments" by Anonymous Authors.
Install Docker from https://docs.docker.com
Load the docker image:
docker load --input artifact62.tar.gz
Run the docker image in interactive mode:
docker run -it artifact62
Move to the artifact directory:
cd artifact
The docker image contains the following content in the directory ~/artifact/:
benchmarks/
This directory contains all the benchmarks we used to evaluate our system. The benchmarks are organized as follows:
grading/ contains Scala translations of benchmarks from the LearnML framework (Table 1).
grading/translations/ contains our Scala translation of student submissions, enhanced with termination annotations.
grading/terminating/ contains a subset of programs from grading/translations/ that our system proves terminating.
grading/ta_solutions/ contains reference solutions.
equivalence/ contains benchmarks from the equivalence checking literature (Table 3).
equivalence/reve.scala contains benchmarks from: Dennis Felsing, Sarah Grebing, Vladimir Klebanov, Philipp Rümmer, and Mattias Ulbrich. 2014. Automating Regression Verification.
equivalence/rvt-2016.scala contains benchmarks from: Ofer Strichman and Maor Veitsman. 2016. Regression Verification for Unbalanced Recursive Functions.
equivalence/rvt-2022.scala contains benchmarks from: Chaked R. J. Sayedoff and Ofer Strichman. 2022. Regression verification of unbalanced recursive functions with multiple calls (long version).
stainless/
This directory contains two versions of our implementation:
pipeline/ contains the version used in our experiments. Equivalence checking is implemented as a pipeline phase on top of the original verification system Stainless. The implementation of this phase is available in source/core/src/main/scala/stainless/extraction/trace/.
component/ contains the version with prettier printing and some optimizations, oriented towards deployment. Equivalence checking is implemented as a separate Stainless component on top of the original verification system Stainless. The implementation of this component is available in source/core/src/main/scala/stainless/equivchk/.
figures/
This directory contains the source files for the figures from the paper.
tables/
This directory contains the material for the tables from the paper.
makefile
This file defines a set of tasks to reproduce the experiments from the paper. Step-by-Step Instructions below explain how to run them.
To run the equivalence checker, use the --equivchk option of Stainless. The option --comparefuns specifies the names of candidate functions. The option --models specifies the names of reference functions.
For example, once in the directory ~/artifact/, the following command runs the equivalence checking for programs from Figure 2, stored in figures/fig2/isSorted.scala:
stainless/component/package/stainless.sh figures/fig2/isSorted.scala \
--equivchk=true \
--comparefuns=isSortedA,isSortedB,isSortedC \
--models=isSortedR \
--timeout=10 \
--solvers=smt-z3 \
--silent-verification \
--no-colors
For our example run, we get the following output (followed by a Stainless summary table):
Printing equivalence checking results:
List of functions that are equivalent to model IsSorted.isSortedB: IsSorted.isSortedC
List of functions that are equivalent to model IsSorted.isSortedR: IsSorted.isSortedB
List of erroneous functions: IsSorted.isSortedA
List of timed-out functions:
List of wrong functions:
Printing the final state:
Path for the function IsSorted.isSortedB: IsSorted.isSortedR
Path for the function IsSorted.isSortedC: IsSorted.isSortedB, IsSorted.isSortedR
Counterexample for the function IsSorted.isSortedA:
l -> Cons[Int](-1686134787, Cons[Int](1, Cons[Int](1, Nil[Int]())))
Stainless successfully proves the equivalence of isSortedB and isSortedR. For isSortedC, the equivalence checking against isSortedR times out. Subsequently, however, Stainless proves the equivalence of isSortedC and isSortedB. By transitivity, our approach concludes that isSortedC is also equivalent to isSortedR, and therefore correct. On the other hand, Stainless labels isSortedA as incorrect, and reports the counterexample that disproves the equivalence.
This section contains step-by-step instructions for reproducing the results from the paper.
Our benchmarks are available in benchmarks/grading/translations/. Translation from OCaml to Scala is not supported by this artifact, because it was semi-manual.
The following command computes the number of candidate programs (Column #P), the average number of functions in candidate programs (Column #F), and the average program size in lines of code (Column LOC):
make print-tab1
Reference solutions are available in the folder benchmarks/grading/terminating/ta_solutions/ (Column #S).
The following command computes the percentage of submissions with decreases annotations (Section 5.2 Termination Analysis):
make print-decreases
The following command prints the equivalence checking results for the entire grading data set:
make print-tab2
This subsection contains step-by-step instructions for reproducing the experiments for Table 2, on a random sample of 10 submissions for each benchmark. On this sample, the execution should take around half an hour on a standard laptop for the main Equivalence Checking experiment, and around four hours for the Ablation Study.
To increase the sample size, modify the sample_size value on line 5 of the makefile.
To compute only a subset of table rows, modify the grading_benchmarks list on line 8 of the makefile.
The sample_size and grading_benchmarks parameters allow alternative sampling of Table 2. For instance, to compute the full first row, use grading_benchmarks=filter and sample_size=210.
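Before and after editing these two parameters, it can help to confirm their current values. The following is a minimal sketch that prints lines 5 and 8 of the makefile, the line numbers given above. The function name show_sampling_params is hypothetical, and nothing is assumed about the exact syntax of those lines.

```shell
#!/bin/sh
# Sketch: print lines 5 and 8 of a makefile, where (per the text above)
# sample_size and grading_benchmarks are defined.
# The function name is hypothetical; the syntax of those lines is not assumed.
show_sampling_params() {
  sed -n -e '5p' -e '8p' "${1:-makefile}"
}

# Only print when run from a directory that has a makefile (e.g. ~/artifact/).
if [ -f makefile ]; then
  show_sampling_params makefile
fi
```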
For each experiment, we provide the following three operations:
print aggregates the existing log files and prints the summary of results; in the initial state, this operation prints the summary of precomputed results.
setup re-initializes the experiment by generating the input samples of the given size. Warning: executing this operation will erase existing log files.
run runs the experiment and stores the results in corresponding log files. Warning: executing this operation will erase existing log files.
The initial state of this artifact (upon loading the docker image) contains the final state of the experiments. We start each subsection with a print command to print the summary of precomputed results, followed by the setup and run commands to reproduce each experiment.
The following command prints the summary of termination checking results:
make print-termination
To reproduce the experiment, use the following commands to set up and run termination checking:
make setup-termination
make run-termination
The output is logged in tables/tab2/termination/termination/*.termination files.
Running make print-termination prints the summary of the new results.
The following command prints the summary of results:
make print-base
To reproduce the experiment, use the following commands to set up and run equivalence checking:
make setup-base
make run-base
The output is logged in tables/tab2/base/base/*.log files.
Running make print-base prints the summary of the new results.
The following command prints the summary of results:
make print-ni
To reproduce the experiment, use the following commands to set up and run equivalence checking:
make setup-ni
make run-ni
The output is logged in tables/tab2/ni/ni/*.log files.
Running make print-ni prints the summary of the new results.
Observe that the number of programs proven incorrect (Column I) is the same as in the base results, while the number of programs proven correct (Column C) is always 0 (Section 5.5 Ablation Study, paragraph Functional Induction).
The following command prints the summary of results:
make print-nm
To reproduce the experiment, use the following commands to set up and run equivalence checking:
make setup-nm
make run-nm
The output is logged in tables/tab2/nm/nm/*.log files.
Running make print-nm prints the summary of the new results.
Observe that the number of programs proven correct (Column C) is 0 for the four benchmarks that contain auxiliary recursive functions: natmul, uniq, formula, and lambda (Section 5.5 Ablation Study, paragraph Function Call Matching).
The following command prints the summary of results:
make print-1rs
To reproduce the experiment, use the following commands to set up and run equivalence checking:
make setup-1rs
make run-1rs
The output is logged in tables/tab2/1rs/1rs/*.log files.
Running make print-1rs prints the summary of the new results.
Observe that the results are the same as the base results for the 7 benchmarks that only contain one reference solution: filter, max, mirror, mem, change, heap, formula (Section 5.5 Ablation Study, paragraph Multiple Reference Solutions).
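The Table 2 experiments above all follow the same setup, run, print pattern, so they can be chained in a loop. The following is a minimal sketch, meant to be run from ~/artifact/. The make targets are the ones listed above; the DRY_RUN switch and the function names are hypothetical conveniences. By default the script only prints the commands, because setup and run erase existing log files.

```shell
#!/bin/sh
# Sketch: regenerate every Table 2 experiment in sequence.
# The targets (setup-*, run-*, print-*) are the ones documented above;
# DRY_RUN, run_step and rerun_table2 are hypothetical names.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run_step() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

rerun_table2() {
  for exp in termination base ni nm 1rs; do
    # setup and run erase the existing log files of this experiment
    run_step make "setup-$exp" || return 1
    run_step make "run-$exp" || return 1
    run_step make "print-$exp" || return 1
  done
}

rerun_table2
```

Setting DRY_RUN=0 executes the commands instead of printing them. With the default sample size, the running times stated above apply (around half an hour for the main experiment and around four hours for the Ablation Study).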
Our benchmarks from Table 3 are available in benchmarks/equivalence/.
The following command prints the summary of equivalence checking results (for all but ackermann and mccarthy91 benchmarks, where termination proof attempts failed):
make print-tab3
To reproduce the experiment, the following command runs termination checking and equivalence checking:
make run-tab3
The execution takes a few minutes. The output is logged in tables/tab3/*.log files and tables/tab3/termination/*.termination files.
Running make print-tab3 prints the summary of the new results.
The source code is available in figures/fig1a/nat.scala and figures/fig1b/binary.scala.
The following commands run the examples:
stainless/component/package/stainless.sh figures/fig1a/nat.scala --timeout=1 --solvers=smt-z3 --no-colors
stainless/component/package/stainless.sh figures/fig1b/binary.scala --timeout=1 --solvers=smt-z3 --no-colors
Stainless emits warnings for possible addition overflows in the program from Figure 1a and reports a counterexample for the program from Figure 1b.
The source code is available in figures/fig2/isSorted.scala.
The following command runs the example:
stainless/component/package/stainless.sh figures/fig2/isSorted.scala --equivchk=true --comparefuns=isSortedA,isSortedB,isSortedC --models=isSortedR --timeout=1 --check-measures=no --infer-measures=false --solvers=smt-z3 --silent-verification --no-colors
Stainless proves the correctness of isSortedB and isSortedC, by proving them equivalent to the reference solution. For isSortedA, it disproves the equivalence and reports a counterexample.
The source code is available in figures/fig4/uniq.scala.
The following command runs the example:
stainless/component/package/stainless.sh figures/fig4/uniq.scala --equivchk=true --comparefuns=uniqA --models=uniqR --timeout=1 --check-measures=no --infer-measures=false --solvers=smt-z3 --silent-verification --no-colors
Our pairwise equivalence checking subroutine proves uniqA equivalent to uniqR, by matching functions unique and distinct, as well as find and isin.
The source graph is in figures/fig5/natadd.csv.
The following script prints the summary of results:
(cd figures/fig6/ && make print)
To reproduce the experiment, the following command computes 24 data points (each averaged over 10 runs) for each of the seven variations:
(cd figures/fig6/ && make)
The entire experiment takes multiple days.
To compute a subset of data points, reduce the number of runs by removing corresponding figures/fig6/*.inputs files.
The source code is available in figures/fig7/max.scala.
The following command runs the example:
stainless/component/package/stainless.sh figures/fig7/max.scala --equivchk=true --comparefuns=maxC,maxT --models=maxR --timeout=2 --solvers=smt-z3 --silent-verification --no-colors
Our system proves the equivalence of maxT and maxR by first proving the equivalence of maxC and maxR, and then proving the equivalence of maxC and maxT.
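The equivalence checking invocations in this document share most of their options, so a small wrapper can shorten them. The following is a minimal sketch: the function equivchk and the STAINLESS variable are hypothetical names, not part of the artifact, and every option is copied from the commands above. The wrapper does not pass --check-measures or --infer-measures, which the Figure 2 and Figure 4 commands add.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of the artifact) around the equivalence
# checking invocations shown above. All options come from those commands.
# STAINLESS can be overridden, e.g. STAINLESS=echo to only print the arguments.
STAINLESS="${STAINLESS:-stainless/component/package/stainless.sh}"

# usage: equivchk FILE CANDIDATES MODELS [TIMEOUT_SECONDS]
equivchk() {
  "$STAINLESS" "$1" \
    --equivchk=true \
    --comparefuns="$2" \
    --models="$3" \
    --timeout="${4:-10}" \
    --solvers=smt-z3 \
    --silent-verification \
    --no-colors
}

# Print (rather than run) the arguments of the Figure 7 invocation.
# The subshell keeps the STAINLESS=echo override local to this line.
(STAINLESS=echo; equivchk figures/fig7/max.scala maxC,maxT maxR 2)
```

From ~/artifact/, calling equivchk figures/fig7/max.scala maxC,maxT maxR 2 without the override runs the same command as the Figure 7 example.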