Reproduction Package for FSE 2024 Article “A Transferability Study of Interpolation-Based Hardware Model Checking for Software Verification”
Abstract
This artifact is a reproduction package for the article “A Transferability Study of Interpolation-Based Hardware Model Checking for Software Verification”, accepted at FSE 2024. It is archived on Zenodo with the DOI 10.5281/zenodo.11070973.
The FSE article investigates the transferability of the claims reported in two prior publications on interpolation-based hardware model checking to software verification. The two publications are (1) Interpolation-Sequence-Based Model Checking (Vizel and Grumberg, 2009) and (2) Intertwined Forward-Backward Reachability Analysis Using Interpolants (Vizel, Grumberg, and Shoham, 2013), which propose the model-checking algorithms ISMC and DAR for hardware circuits, respectively. In the FSE article, we adopted ISMC and DAR for programs and implemented them in the software-verification framework CPAchecker. This artifact supports the reproduction of the experiments in the FSE article, which compared the implementations of ISMC and DAR against existing interpolation-based verification techniques in CPAchecker, including IMC (McMillan, 2003), Impact (McMillan, 2006), and PredAbs (Henzinger, Jhala, Majumdar, and McMillan, 2004), in order to validate the claims in the above two publications as well as to investigate their performance characteristics relative to classical approaches for software verification.
The artifact consists of source code, precompiled executables, and input data used in the evaluation of the transferability study, as well as the results produced from the experiments. Specifically, it includes the source code and binaries of CPAchecker (at revision 45787 of branch “itp-mc-with-slt”), which implements the verification algorithms compared in the article, the SV-COMP 2023 benchmark suite, the experimental data generated from the evaluation, and instructions to run the tools and experiments.
This reproduction package works best with the SoSy-Lab Virtual Machine, which runs Ubuntu 22.04 LTS and has all the required dependencies installed. If you test this artifact in this VM, you do not need to install any additional packages.
By default, we assign 2 CPU cores, 15 GB of memory, and a CPU-time limit of 1800 s to each verification task. A full reproduction of all experiments took more than 10 months of CPU time on our machines. For demonstration purposes, a subset of the benchmark tasks can be executed, which requires roughly 2 hours of CPU time in total.
Contents
This artifact contains the following items:
- `README.{html,md}`: this documentation (we recommend viewing the HTML version with a browser)
- `fse2024-paper483.pdf`: the preprint of the FSE article
- `LICENSE.txt`: license information of the artifact
- `doc/`: a directory containing additional information about the artifact, including
  - `REQUIREMENTS.{html,md}`: system requirements of the artifact
  - `INSTALL.{html,md}`: installation guide for the software dependencies
  - `STATUS.{html,md}`: the artifact badges we are applying for and the justifications
- `cpachecker/`: a directory containing the source code and precompiled binaries of CPAchecker
- `sv-benchmarks/`: a directory containing the SV-COMP 2023 benchmark tasks used in our evaluation
- `data-submission/`: a directory containing the raw and processed data produced from our full evaluation (used in the article, under `paper-results/`) and from a demo experiment (prepared for this reproduction package, under `demo-results/`)
- `bench-defs/`: a directory containing the benchmark and table definitions of the experiments (used by BenchExec, a framework for reliable benchmarking)
- `scripts/`: a directory containing utility scripts for processing data
- `Makefile`: a file that assembles commands for running experiments and processing data
This readme file will guide you through the following steps:
- Set up evaluation environment
- Execute CPAchecker on example programs
- Reproduce experiments in the paper
- Validate the claims in the two publications
  - Claims in the ISMC Publication
    - H1.A: ISMC is faster than IMC on tasks with property violation.
    - H1.B: ISMC is faster than IMC when IMC finds a proof only at high unrolling bounds.
    - H1.C: Overall, ISMC is faster than IMC (by 30 % in their experiment).
  - Claims in the DAR Publication
    - H2.A: For DAR, the ratio of iterations using global strengthening to the total number of iterations is less than 0.5 in most tasks.
    - H2.B: IMC finds a proof slower than DAR in many tasks even though it has a smaller convergence length.
    - H2.C: DAR computes more interpolants than IMC.
    - H2.D: DAR’s run-time is more sensitive to the sizes of interpolants than IMC.
    - H2.E: Overall, DAR is faster than IMC (by 36 % in their experiment).
- Compare ISMC, DAR, and IMC against classical software-verification approaches
- Interpret the experimental results
TL;DR
After copying and unzipping the artifact into the SoSy-Lab VM, execute the following commands in the root directory of the artifact.
- Set up and test BenchExec:
  - `make prepare-benchexec`: turn off swap memory and allow user namespaces (needs to be redone after every reboot)
  - `make test-cgroups`: test if cgroups are configured correctly
- Execute verification algorithms in CPAchecker on an example program (the expected verification result is `TRUE`, meaning that the program does not contain a property violation):
  - `make run-cpa-dar`: run DAR (Vizel, Grumberg, and Shoham, 2013)
  - `make run-cpa-ismc`: run ISMC (Vizel and Grumberg, 2009)
  - `make run-cpa-imc`: run IMC (McMillan, 2003)
  - `make run-cpa-impact`: run Impact (McMillan, 2006)
  - `make run-cpa-predabs`: run PredAbs (Henzinger, Jhala, Majumdar, and McMillan, 2004)
- Perform a demo experiment on 45 tasks: `make run-demo-exp`
  - To quickly check if the experiment is runnable, you can use `make timelimit=60s memlimit=3GB cpulimit=1 run-demo-exp` to limit the CPU run-time to 60 seconds, the memory to 3 GB, and the number of CPU cores to 1 per task.
  - After the runs are finished, execute `make gen-demo-table` to produce an HTML table (`results/demo.table.html`) containing the experimental results.
  - Results obtained from our machines are provided in the directory `data-submission/demo-results/` (HTML table for the demo experiment).
  - A combined command sketch of this demo workflow is shown after this list.
- Perform a full experiment: `make run-full-exp`
  - 2 CPU cores, 15 GB of RAM, and 1800 seconds of CPU time are given to each task. The full experiment takes roughly 10 months of CPU time.
  - After the runs are finished, execute the following command to generate HTML tables (`results/*.table.html`) containing the experimental results: `make gen-full-table gen-nomultiloop-table gen-nomultiloop-eca-table gen-nomultiloop-sequentialized-table gen-nomultiloop-loops-table`
  - Results obtained from our machines are stored in the directory `data-submission/paper-results/`.
  - To view HTML pages corresponding to tables and figures of the paper, please refer to the sections “Our Results - Validating Claims in Prior Publications” and “Our Results - Comparison with Impact and PredAbs”.
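As referenced above, the following is a minimal sketch of the demo workflow chained into one session; it only repeats the `make` targets listed above and assumes that the artifact’s root directory inside the SoSy-Lab VM is the current working directory.

# minimal demo session (sketch); run from the artifact's root directory in the SoSy-Lab VM
make prepare-benchexec   # disable swap and enable user namespaces (redo after every reboot)
make test-cgroups        # check that cgroups are configured for BenchExec
make run-demo-exp        # run the 45-task demo experiment (roughly 2 hours of CPU time)
make gen-demo-table      # generate the HTML results table under results/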
To improve readability, in the following we only excerpt important fragments of the logs. The complete log messages for some of the listed commands are attached in `data-submission/complete-logs.html` for your reference.
Set Up Evaluation Environment
System Requirements
Please refer to doc/REQUIREMENTS.html
.
Installation Guide
Please refer to doc/INSTALL.html
.
Execute CPAchecker
Run Verification Algorithms in CPAchecker
The following five interpolation-based algorithms in CPAchecker were evaluated: IMC (McMillan, 2003), ISMC (Vizel and Grumberg, 2009), DAR (Vizel, Grumberg, and Shoham, 2013), Impact (McMillan, 2006), and PredAbs (Henzinger, Jhala, Majumdar, and McMillan, 2004).
Below are the commands to run the evaluated verification algorithms on an example program `sv-benchmarks/c/loop-invariants/const.c`.
- DAR:
  # expected verification result: TRUE
  make run-cpa-dar
- ISMC:
  # expected verification result: TRUE
  make run-cpa-ismc
- IMC:
  # expected verification result: TRUE
  make run-cpa-imc
- Impact:
  # expected verification result: TRUE
  make run-cpa-impact
- PredAbs:
  # expected verification result: TRUE
  make run-cpa-predabs
You can run the algorithms on another program by specifying `c-prog=<path_to_program>`. For example,
# expected verification result: FALSE
make c-prog="sv-benchmarks/c/bitvector-regression/implicitfloatconversion.c" run-cpa-dar
Note that the verdict for a program under verification is affected by the data model of the platform. In our experiments, the data model for a program under verification is recorded as a parameter of the verification task and given as input to CPAchecker. The above two example programs assume the ILP32 data model (cf. `sv-benchmarks/c/loop-invariants/const.yml` and `sv-benchmarks/c/bitvector-regression/implicitfloatconversion.yml`). To specify the data model (ILP32 or LP64), please add `cpa-args="-32"` or `cpa-args="-64"` to the command. For more information on how to run CPAchecker, see `cpachecker/README.md`.
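As a further example, the following sketch combines the two options documented above: it runs IMC on the first example program and explicitly passes the ILP32 data model recorded in that program’s `.yml` file (the choice of IMC here is arbitrary; any of the `run-cpa-*` targets works the same way).

# sketch: explicitly select the ILP32 data model for the example program
make c-prog="sv-benchmarks/c/loop-invariants/const.c" cpa-args="-32" run-cpa-imc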
Understand the output of CPAchecker
The following is an example output shown on the console after an analysis is finished.
[...redacted...]
Verification result: TRUE. No property violation found by chosen configuration.
More details about the verification run can be found in the directory "./output".
Graphical representation included in the file "./output/Report.html".
There are 3 possible outcomes of the verification result:

- `TRUE`: the program is “safe”, i.e., it satisfies the given safety property
- `FALSE`: the program is “unsafe”, i.e., it contains a violation of the given safety property
- `UNKNOWN`: the analysis was inconclusive
CPAchecker also generates an HTML report (in the directory `output/`) for conveniently browsing the results. Please refer to `cpachecker/doc/Report.md` for more information.
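To open the report, point a browser at the generated file; a sketch is shown below (the browser choice is an assumption, and the relative path of `output/` depends on the working directory of the run, as printed in the console message above).

# sketch: view the HTML report mentioned in the console output
firefox output/Report.html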
Conduct Experiments
Experimental Settings
The settings for the full (resp. demo) experiment are described in the XML file `bench-defs/cpachecker.xml` (resp. `bench-defs/cpachecker-demo.xml`). This XML file lists the tasks to be benchmarked and is used by BenchExec for limiting computing resources and scheduling runs. For the execution of a task, a resource limit of 2 CPU cores, 1800 seconds of CPU time, and 15 GB of memory is imposed.
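For orientation, the following is a simplified, illustrative sketch of how a BenchExec benchmark definition of this kind is typically structured. It is not copied from the artifact; the concrete options, names, and set files in `bench-defs/cpachecker.xml` may differ.

<!-- illustrative sketch only; see bench-defs/cpachecker.xml for the actual definition -->
<benchmark tool="cpachecker" timelimit="1800 s" memlimit="15 GB" cpuCores="2">
  <!-- one rundefinition per verification algorithm, e.g., "dar" -->
  <rundefinition name="dar">
    <!-- CPAchecker options for this algorithm would be listed here -->
  </rundefinition>
  <!-- one tasks element per benchmark subcategory; the set-file name below is hypothetical -->
  <tasks name="ReachSafety-BitVectors">
    <includesfile>sets/ReachSafety-BitVectors.set</includesfile>
  </tasks>
</benchmark>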
Perform Demo Evaluation
The full evaluation requires more than 10 months of CPU time on our machines with 3.40 GHz cores. Therefore, we provide ways to conduct experiments on a subset of the benchmark tasks.
We selected 45 tasks from the directory `sv-benchmarks/c/eca-rers2012/` (the tasks are listed in `bench-defs/sets/demo.set`). Note that our observations and conclusions in the article were drawn from the full evaluation; the demo evaluation is provided for demonstration purposes only.
To perform evaluation on these selected tasks, run:
make run-demo-exp
This demo experiment is expected to take roughly 2 hours of CPU time. After the run is finished, run the following command to generate an HTML table containing the experimental results:
make gen-demo-table # output HTML at results/stats.demo.html
If the default resource requirements are too high for your machine, you can specify the options `timelimit`, `memlimit`, and `cpulimit` in the `make` command. For instance, to run the demo experiment with a time limit of 60 seconds, a memory limit of 3 GB, and a CPU limit of 1 core, run:
make timelimit=60s memlimit=3GB cpulimit=1 run-demo-exp
Some benchmark runs might run out of resources (e.g., exceed the time or memory limit), and their results will be inconclusive if the specified resource limits are too low. For your reference, the experimental results obtained from our machines are provided in the directory `data-submission/demo-results/`.
If, on the other hand, you have enough hardware resources and want to launch parallel benchmark runs, add `benchexec-args="-N <num_jobs>"` to the `make` command.
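For example, the following sketch executes two benchmark runs in parallel during the demo experiment (assuming your machine has enough CPU cores and memory for two concurrent runs with the configured limits):

# sketch: execute two benchmark runs in parallel
make benchexec-args="-N 2" run-demo-exp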
For more usage information about BenchExec, please refer to `benchexec -h`.
Perform Full Evaluation
To run the full experiment over the 8813 benchmark tasks reported in the article, execute the following command:
make run-full-exp
After the runs are finished, you can generate HTML tables containing the experimental results by:
make gen-full-table # results of all tasks (Table 1 and 2)
make gen-nomultiloop-table # results of programs with at most 1 loop (Table 3)
make gen-nomultiloop-eca-table # results of "ECA" programs with at most 1 loop (Table 3)
make gen-nomultiloop-sequentialized-table # results of "Sequentialized" programs with at most 1 loop (Table 3)
make gen-nomultiloop-loops-table # results of "Loops" programs with at most 1 loop (Table 3)
The generated HTML tables can be found in `results/stats.*.table.html`. For your reference, the experimental results obtained from our machines are stored in the directory `data-submission/paper-results/`. In the sections “Our Results - Validating Claims in Prior Publications” and “Our Results - Comparison with Impact and PredAbs” below, we provide pre-configured links to easily view the corresponding tables/figures shown in the article.
Alternatively, you can also perform the evaluation

- of a single verification algorithm (a `rundefinition` element in `bench-defs/cpachecker.xml`)
- on a subcategory of the benchmark tasks (a `tasks` element in `bench-defs/cpachecker.xml`)

by specifying `-r <algorithm_name>` and `-t <tasks_name>`, respectively, via the option `benchexec-args`. For example, the following command starts an experiment of DAR on the ReachSafety-BitVectors subcategory:
make benchexec-args="-r dar -t ReachSafety-BitVectors" run-full-exp
The HTML table can then be generated by:
make gen-sub-table
View the Experimental Results
The results (both raw and processed data) of the demo and full evaluations obtained on our machines are in the directories `data-submission/demo-results/` and `data-submission/paper-results/`, respectively. The results produced by the `make` commands in the previous sections are written to the directory `results/`.
Our Results - Validating Claims in Prior Publications
In the transferability study, there are eight hypotheses extracted from the two prior publications. We list them below to make this document more self-contained. For details, please refer to the bundled article.
- Claims in the ISMC Publication
  - H1.A: ISMC is faster than IMC on tasks with property violation.
  - H1.B: ISMC is faster than IMC when IMC finds a proof only at high unrolling bounds.
  - H1.C: Overall, ISMC is faster than IMC (by 30 % in their experiment).

  In our experiments, we confirmed that H1.A also holds for software verification, but H1.B and H1.C did not hold.
- Claims in the DAR Publication
  - H2.A: For DAR, the ratio of iterations using global strengthening to the total number of iterations is less than 0.5 in most tasks.
  - H2.B: IMC finds a proof slower than DAR in many tasks even though it has a smaller convergence length.
  - H2.C: DAR computes more interpolants than IMC.
  - H2.D: DAR’s run-time is more sensitive to the sizes of interpolants than IMC.
  - H2.E: Overall, DAR is faster than IMC (by 36 % in their experiment).

  In our experiments, we confirmed that H2.A and H2.C also hold for software verification, but H2.B, H2.D, and H2.E did not hold.
Here we provide pre-configured links to easily view the corresponding tables/figures shown in the article. Please open the links with a browser.
- Table 1 (summary of the experimental results)
- Table 2 (left)* (H1.C)
- Table 2 (right)* (H2.E)
- Figure 1a⋄ (H1.A)
- Figure 1b⋄ (H1.B)
- Figure 2a (H1.C and H2.E)
- Figure 2b (H1.C and H2.E)
- Figure 3a⋄ (H2.E)
- Figure 3b⋄ (H2.B)
- Figure 4a (H2.C)
- Figure 4b (IMC) (H2.D)
- Figure 4b (DAR) (H2.D)
⋄: Timeout cases are excluded from the plots.
*: CPU time in the HTML table is rounded to 3 significant digits.
Our Results - Comparison with Impact and PredAbs
Besides validating the claims in the prior publications, we also compared IMC, ISMC, and DAR with Impact and PredAbs.
Here we provide pre-configured links to easily view the corresponding tables/figures shown in the article. Please open the links with a browser.
- Table 3 (top)
- Table 3 (bottom: ReachSafety-ECA)
- Table 3 (bottom: ReachSafety-Sequentialized)
- Table 3 (bottom: ReachSafety-Loops)
- Figure 5a
- Figure 5b
Navigate Through the Data
When opening the generated HTML table, you will be guided to the `Summary` page of the experiment, where detailed settings of the experiment and a summary table of the compared algorithms are displayed.
To see the full table, please navigate to the tab `Table`. By filtering the status from the drop-down menus, you can see the results of `Timeouts`, `Out of memory`, and `Other inconclusive` of each compared approach as reported in Table 1. Below we briefly explain the meaning of some columns in the table:
- `#bmc-unroll`: the number of loop unrollings in the BMC stage
- `converg-len`: the convergence length to reach a fixed point
- `#interpolants`: the number of interpolants computed during the verification process
- `#avg-itp-atoms`: the average number of atoms in the computed interpolants
- `fp-reached`: whether a fixed point was reached during the verification process
- `#dar-global-phases`: the number of times that DAR enters the global strengthening stage
- `dar-global-ratio`: the ratio of iterations using global strengthening to the total number of iterations (`#dar-global-phases / converg-len`; a numeric example follows this list)
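As a purely hypothetical numeric example: a task whose DAR run enters the global strengthening stage 3 times (`#dar-global-phases` = 3) and converges after 10 iterations (`converg-len` = 10) has `dar-global-ratio` = 3 / 10 = 0.3, which is below the 0.5 threshold referred to by hypothesis H2.A.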
To inspect the log file of an individual task, click on the status of that task. If the log file cannot be displayed, configure your browser according to the printed instructions.
To filter tasks, you can make use of the task filter at the upper-right corner of the page. To view quantile plots, please navigate to the tab `Quantile Plot` and adjust the drop-down menus as you prefer. To view scatter plots, please navigate to the tab `Scatter Plot` and adjust the x- and y-axes according to your interests.
Known issues
Known issues of the artifact are documented below.
CPU-throttling warnings
When you are running experiments (especially on a laptop), BenchExec might raise the following warning:
2024-XX-XX XX:XX:XX - WARNING - CPU throttled itself during benchmarking due to overheating. Benchmark results are unreliable!
This is normal on a laptop. Please ignore it.
File-not-found warnings
The following warnings might pop up when you start an experiment:
2024-XX-XX XX:XX:XX - WARNING - No files found matching 'ntdrivers-simplified/*.yml'.
2024-XX-XX XX:XX:XX - WARNING - No files found matching 'openssl/*.yml'.
2024-XX-XX XX:XX:XX - WARNING - No files found matching 'verifythis/duplets.yml'.
[...redacted...]
These warnings are caused by the directory structure of the SV-COMP benchmark set and can be safely ignored.
Files (1.9 GB)
- `DarIsmcTransferability-artifact-FSE24-proceedings.zip` (1.9 GB, md5:6b3a48bd929526511201d5b3ea836b37)
Additional details
Dates
- Submitted (Artifact Evaluation): 2024-04-30