Reproduction Package
Augmenting Interpolation-Based Model Checking with Auxiliary Invariants
Abstract
This artifact is a reproduction package for the article “Augmenting Interpolation-Based Model Checking with Auxiliary Invariants”, published at SPIN 2024. It is accessible via the DOI 10.5281/zenodo.10548594 on Zenodo.
The artifact consists of source code, precompiled executables, and input data used in the evaluation of the paper, as well as the obtained experimental results. Specifically, it contains the source code and precompiled binaries of CPAchecker at revision 42901, the executables of 2LS and Symbiotic, the benchmark suite of SV-COMP 2022, the raw and processed data collected from our experiments, and the scripts to reproduce the evaluation results.
The reproduction artifact is based on the TACAS ’23 Artifact Evaluation VM (tested with Oracle VM VirtualBox v6.1), which runs Ubuntu 22.04 LTS. All necessary software dependencies for executing the tools and performing the evaluation are shipped with the artifact. Alternatively, the artifact can be executed on the SoSy-Lab VM, which also runs Ubuntu 22.04 LTS and has all required dependencies pre-installed. (If you test this artifact on the SoSy-Lab VM, all installation steps can be skipped.)
By default, we assign 4 CPU cores and 15 GB of memory to each verification task. A full reproduction of all the experiments requires roughly 3 months of CPU time. For demonstration purposes, a subset of the benchmark tasks can be executed using 1 CPU core and 3 GB of memory, which takes roughly 6 hours of CPU time in total.
Contents
This artifact contains the following items:
- README.html: this documentation
- License.txt: license information of the artifact
- Augmenting_IMC_with_Auxiliary_Invariants.pdf: a preprint of the submitted manuscript
- example.c: an example C program for demonstration (see Fig. 1 of the article)
- cpachecker/: a directory containing the source code and precompiled binaries of CPAchecker, which implements the proposed approaches
- 2ls/: a directory containing the executables of 2LS downloaded from the SV-COMP 2022 tool archives
- symbiotic/: a directory containing the executables of Symbiotic downloaded from the SV-COMP 2022 tool archives
- sv-benchmarks/: a directory containing the SV-COMP 2022 benchmark tasks used in our evaluation
- bench-defs/: a directory containing the benchmark definitions of the experiments (used by BenchExec, a framework for reliable benchmarking)
- data-submission/: a directory containing the raw and processed data produced from our full evaluation (used in the article, under paper-results/) and from a demo experiment (prepared for this reproduction package, under demo-results/)
- packages/: a directory containing the necessary Debian and Python packages to set up the environment for the experiments in the TACAS ’23 Artifact Evaluation VM
- Makefile: a file containing commands for running experiments and processing data
This readme file will guide you through the following steps:
- Set up evaluation environment
- Execute software verifiers
- Perform experiments
- Analyze experimental data
- Known issues of the artifact
TL;DR
On the TACAS ’23 Artifact Evaluation (AE) VM, type the following commands in the root folder of the reproduction package to:
- Install required dependencies (root permission required):
make install-packages
- Set up BenchExec (root permission required):
  - make configure-cgroups: configure the cgroups version for BenchExec
    - WARNING: the script changes the version of cgroups to version 1
    - Important: please reboot your system afterwards for the settings to take effect!
  - make prepare-benchexec: turn off swap memory and allow user namespaces (needs to be redone after every reboot)
  - make test-benchexec: test if BenchExec has been installed (see the installation guide)
  - make test-cgroups: test if cgroups are configured correctly
- Run software verifiers on the example in Fig. 1 of the submission (time limit set to 10 seconds for quick response):
  - make timelimit=10s test-cpachecker: the proposed analysis in CPAchecker
  - make timelimit=10s test-2ls: the default analysis of 2LS
  - make timelimit=10s test-symbiotic: the default analysis of Symbiotic
- Perform a demo experiment on 30 tasks:
  make run-demo-exp
  - To quickly check if the experiment is runnable, you could use make timelimit=60s memlimit=3GB cpulimit=1 run-demo-exp to limit the CPU time to 60 seconds, the memory to 3 GB, and the number of CPU cores to 1 per task.
  - After the run is finished, an HTML table containing the experimental results can be found in results/demo.table.html.
  - Note that the tasks in the demo experiment are selected to showcase the strengths of the proposed approaches. It is expected that the baseline approach (plain IMC) goes into timeout for several of them.
- Perform a full experiment:
  make run-full-exp
  - 4 CPU cores, 15 GB of RAM, and 900 seconds of CPU time are given to each task. The full experiment takes roughly 3 months of CPU time.
  - After the run is finished, HTML tables containing the experimental results can be found in results/*.table.html.
To view HTML files corresponding to tables and figures of the paper, please open the following links with a browser.
Note that the figures and tables in the article are formatted for space reasons, so some of them do not look exactly the same as the HTML files.
- Figure 2(a) (Note that the number of program unrollings reported by CPAchecker has a constant offset of +1, i.e., a number k in the HTML plot corresponds to k-1 in the article.)
- Figure 2(b)
- Table 1 (Note that in the article, we reported the run-time with two significant digits.)
- Figure 3(a) (Note that in the article, we cropped the first 400 tasks solved within 1 minute for space.)
- Figure 3(b)
- Figure 4(a)
- Figure 4(b)
- Figure 5
To improve readability, in the following we only excerpt important fragments of the logs. The complete log messages for the above commands are listed in data-submission/complete-logs.html for reference.
Set Up Evaluation Environment
Hardware Requirements
For the demo experiment, 3 GB of memory and 1 CPU core are allocated to each verification task; for the complete experiment, 15 GB of memory and 4 CPU cores are used. Please provide more hardware resources than a single benchmark task requires. An internet connection is not necessary.
Software Requirements
This artifact requires a Linux-based operating system using cgroups v1 and has been tested on a 64-bit Ubuntu 22.04 computer with Linux kernel 5.15.0.
In addition, the following software dependencies are required:
- BenchExec 3.17 (installation guide)
- Clang 14.0.0
- Java Runtime Environment (JRE) 11 or above
- libz3-dev 4.8.12
On the TACAS ’23 AE VM, the above dependencies can be fulfilled via the following command (you will be asked for root permission):
make install-packages
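For reference, below is a minimal sketch of the kind of offline installation such a target typically performs; the exact package names and steps are assumptions, and the authoritative recipe is the install-packages target in the Makefile:

# Hypothetical offline installation from the bundled packages/ directory:
sudo apt install ./packages/*.deb                         # Debian packages (e.g., Clang, JRE, libz3-dev)
pip3 install --no-index --find-links=packages benchexec   # Python packages shipped with the artifact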
Set Up BenchExec
We use BenchExec, a framework for reliable benchmarking and resource measurements, to perform our evaluation.
To configure cgroups version for BenchExec, please run:
make configure-cgroups
WARNING: The script will change the version of cgroups to version 1. If you do not want this change on your machine, we recommend testing the reproduction package in the TACAS ’23 AE VM.
Important: After running the above command, please reboot your system for the settings to take effect!
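For context, switching an Ubuntu 22.04 system to cgroups v1 is usually achieved via a kernel boot parameter; the following is a minimal sketch of what configure-cgroups presumably automates (the actual script may differ):

# Disable the unified (v2) cgroup hierarchy at boot and regenerate GRUB's
# configuration; the change takes effect only after a reboot:
sudo sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="/&systemd.unified_cgroup_hierarchy=0 /' /etc/default/grub
sudo update-grub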
Note that an additional preparation for BenchExec is required after each reboot. Please run (you will again be asked for root permission):
make prepare-benchexec
The above command turns off the swap memory and allows user namespaces to be used.
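A minimal sketch of the commands such a target typically issues (the exact recipe is defined in the Makefile); both settings are reset by a reboot, which is why this step must be repeated:

sudo swapoff -a                                     # disable swap so memory limits are enforced strictly
sudo sysctl -w kernel.unprivileged_userns_clone=1   # allow unprivileged user namespaces (Debian/Ubuntu sysctl)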
To test whether the cgroup permissions needed by BenchExec are configured correctly, please run:
make test-cgroups
No warnings or error messages should be printed if the permissions are configured correctly.
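The make target presumably wraps BenchExec's built-in self-check, which can also be invoked directly:

python3 -m benchexec.check_cgroups   # BenchExec's cgroup diagnosis; prints nothing on success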
If there are still unresolved problems, please take a look at BenchExec’s installation guide.
Execute Software Verifiers
Run Different Verification Algorithms in CPAchecker
To execute CPAchecker on a C program, e.g., example.c, please run:
make timelimit=10s c-prog=example.c cpa-config=imc_i-df test-cpachecker
You can change the time limit, the input C program, and the used configuration by passing arguments to timelimit, c-prog, and cpa-config, respectively. The following configurations are supported:
- imc: plain interpolation-based model checking (IMC, McMillan 2003)
- imc_f-df: augmented IMC with fixed-point checks strengthened (IMCf ← DF)
- imc_i-df: augmented IMC with interpolants strengthened (IMCi ← DF)
- ki-df: k-Induction boosted by auxiliary invariants (KI ← DF)
- impact: Impact (Impact)
- pred_abs: Predicate Abstraction (PredAbs)
Below is an example output shown on the console after the analysis is finished.
[…redacted…]
Verification result: TRUE. No property violation found by chosen configuration.
More details about the verification run can be found in the directory “./output”.
There are 3 possible outcomes of the verification result:
- TRUE: the program is “safe”, i.e., it does not violate the given property
- FALSE: the program is “unsafe”, i.e., it contains a violation of the given property
- UNKNOWN: the program might contain some unsupported feature, or the analysis ran into an error (timeout, out of memory, etc.)
For the program example.c, IMC, IMCf ← DF, IMCi ← DF, and PredAbs are able to deliver a proof within 10 seconds, whereas KI ← DF and Impact are not.
Also note that there will be no output/ folder, because CPAchecker is executed with the -noout option (see line 22 of the Makefile).
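If you prefer to bypass the Makefile, a direct invocation of CPAchecker might look like the sketch below; -noout, -timelimit, -config, and -spec are standard cpa.sh options, but the configuration and specification file paths are assumptions about how the artifact names its files (the authoritative command, including -noout, is defined in the Makefile):

# Hypothetical direct call to CPAchecker with the IMCi ← DF configuration:
cd cpachecker
scripts/cpa.sh -noout -timelimit 10s \
  -config config/imc_i-df.properties \
  -spec config/specification/default.spc \
  ../example.c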
Run 2LS and Symbiotic
To execute 2LS or Symbiotic on a C program, please run:
make timelimit=10s c-prog=example.c test-2ls # or test-symbiotic
You can change the time limit and the input C program by passing arguments to timelimit and c-prog, respectively.
2LS can prove the program example.c within 10 seconds, while Symbiotic cannot.
Perform Experiments
We provide two settings for the experiments: one for the demo run and the other for the full run. The two settings differ in (1) the set of executed tasks and (2) the executed tools/algorithms. All the other common settings are explained below.
Experimental Settings
The settings are described in the XML files bench-defs/*.xml. These XML files are used by BenchExec, a framework for reliable benchmarking.
For the execution of a task, a default resource limit of 4 CPU cores, 900 seconds of CPU time, and 15 GB of memory is imposed. (If the required memory amount is not available on your system, please follow the instructions explained below to adjust the limit.)
The XML files contain the following configurations of the verifiers compared in the evaluation:
- CPAchecker
  - Compared SMT-based algorithms: imc, imc_f-df, imc_i-df, ki-df, impact, and pred_abs
  - Different random seeds for plain and augmented IMC: imc-rs{7,61,89,165} and imc_i-df-rs{7,61,89,165}
- 2LS: default (configuration used in SV-COMP 2022)
- Symbiotic: svcomp (configuration used in SV-COMP 2022)
Before you start executing any experiment, please make sure that
- BenchExec is successfully installed, by running make test-benchexec, and
- cgroups are correctly configured, by running make test-cgroups.
Demo Run on the Selected Tasks
A complete experiment on the whole benchmark suite, consisting of 1623 C verification tasks (listed in bench-defs/sets/overall.set), takes a vast amount of time (the elapsed CPU time in our experiment was about 3 months). The experimental data produced from the full evaluation reported in the paper can be found in the folder data-submission/paper-results/.
To show how our experiments were conducted, we selected 30 tasks from the benchmark suite (listed in bench-defs/sets/demo.set) and 3 algorithms (plain and augmented IMC: IMC, IMCf ← DF, and IMCi ← DF) in CPAchecker for demonstration.
We emphasize that the demo run is only for demonstration purposes. The observations on the comparison between algorithms and tools in the article were drawn from the evaluation on the whole benchmark suite. The tasks are selected to showcase the strengths of the proposed approaches. It is expected that the baseline approach, IMC, goes into timeout for several of them; in comparison, the proposed approaches, IMCf ← DF and IMCi ← DF, are able to find more proofs on the selected set of tasks within the time limit.
This demonstrative experiment was designed to be feasible with reasonable hardware and time: it can be finished within several hours on a laptop.
To perform the demonstrative experiment, run the command below:
make run-demo-exp
Below is an example of how to adjust the resource limits. Suppose you would like to set the time limit to 60 seconds and the memory limit to 3 GB, and to use only 1 CPU core per task; then run:
make timelimit=60s memlimit=3GB cpulimit=1 run-demo-exp
Moreover, if you have enough hardware resources and would like to launch benchmark tasks in parallel, add benchexec-args="-N <num_jobs>" to the make command. For more usage information about BenchExec, please refer to benchexec -h.
After the run is finished, an HTML table containing the experimental results can be found in results/demo.table.html.
Full Run on the Complete Benchmark Suite
As mentioned above, the total CPU time elapsed for a complete experiment is about 3 months, and 900 seconds of CPU time, 15 GB of memory, and 4 CPU cores are given to each benchmark task.
To perform the full experiment, run the command:
make run-full-exp
The full experiment can be split into 3 make-targets:
- make run-aug-imc-exp: evaluate IMC, IMCf ← DF, and IMCi ← DF on 870 tasks (listed in bench-defs/sets/nontrivial-inv.set) for which DF, the invariant-generation component in CPAchecker, is able to generate non-trivial inductive invariants. The experimental results are summarized in Fig. 2, Table 1, Table 2, Fig. 4, and Fig. 5 of the article.
- make run-rand-seed-exp: compare IMC and IMCi ← DF using different random seeds for SMT solving on 870 tasks. The experimental results are summarized in Fig. 3 of the article.
- make run-cmp-exp: compare IMCi ← DF against other SMT-based algorithms (KI ← DF, Impact, and PredAbs) in CPAchecker and 2 state-of-the-art verifiers (2LS and Symbiotic) from SV-COMP 2022 on the whole benchmark suite. The experimental results are summarized in Table 3 and Fig. 6 of the article.
After the run is finished, HTML tables containing the experimental results can be found in results/*.table.html.
Analyze the Experimental Data
We recommend taking advantage of the interactive HTML files to visualize the results of the experiments. These files can be opened with a web browser (e.g., Firefox) and display the information presented in all tables and figures of the article.
Results from Our Experiments
The results (both raw and processed data) of the demo run and the full run obtained on our machines are in the folders data-submission/demo-results/ and data-submission/paper-results/, respectively. The demo run was performed to prepare this artifact, and the full run was performed to collect the data used in the paper.
The generated HTML files are:
- tab1-1.imc_i-df.improvement.table.html: generated by the make-target run-aug-imc-exp (Table 1 in the article)
- tab2.augmented-imc.summary.table.html: generated by the make-target run-aug-imc-exp (Fig. 2, Fig. 3, and Fig. 4 in the article)
- tab3.overall-comparison.table.html: generated by the make-target run-cmp-exp (Fig. 5 in the article)
We also provide pre-configured links to easily view the exact tables/figures as shown in the paper, as listed in the TL;DR section.
Here we additionally provide the links to view all the tables/figures in the extended technical report of this work:
- Figure 2(a) (Note that the number of program unrollings reported by CPAchecker has a constant offset of +1, i.e., a number k in the HTML plot corresponds to k-1 in the article.)
- Figure 2(b)
- Table 1-1 (Note that in the report, the run-time is rounded to two significant digits.)
- Table 1-2
- Table 2
- A summary table for Figure 3
- Figure 4(a) (Note that in the report, we cropped the first 400 tasks solved within 1 minute for space.)
- Figure 4(b)
- Figure 5(a)
- Figure 5(b)
- Table 3
- Figure 6
If you want to re-generate all the above HTML files from the raw data obtained in our experiments, run make gen-paper-tables. Note that this command will overwrite the existing files.
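Under the hood, such tables are typically produced with BenchExec's table-generator; below is a minimal sketch in which the result-file pattern and the table name are assumptions:

# Merge raw BenchExec result files into one interactive HTML table:
table-generator data-submission/paper-results/*.results.xml.bz2 \
  --outputpath data-submission/paper-results/ --name tab2.augmented-imc.summary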
Navigate Through the Data
Once an experiment is finished, the Makefile automatically collects the results and generates the HTML file, whose path is printed on the console.
A sample output printed at the end of the demo run:
[…redacted…]
INFO: Merging results…
INFO: The resulting table will have 30 rows and 21 columns (in 3 run sets).
INFO: Generating table…
INFO: Writing HTML into /path/to/artifact/results/demo.table.html …
INFO: done
When opening a generated HTML table, you will be guided to the Summary page of the experiment, where the detailed settings of the experiment and a summary table of the compared tools/algorithms are displayed. If you open tab2.augmented-imc.summary.table.html, you can see on this page the number of proofs found by each compared approach, as reported in Table 2.
To see the full table, please navigate to the tab Table. By filtering the status via the drop-down menus, you can see the Timeouts, Out of memory, and Other inconclusive results of each compared approach, as reported in Table 2.
To inspect the log file of an individual task, click on the status of that task. If the log file cannot be displayed, configure your browser according to the printed instructions.
To filter tasks, you can make use of the task filter at the upper-right corner of the page. To view quantile plots, please navigate to the tab Quantile Plot and adjust the drop-down menus as you prefer. To view scatter plots, please navigate to the tab Scatter Plot and adjust the x- and y-axes according to your interests.
Known Issues of the Artifact
Known issues of this artifact are documented below.
CPU-throttling Warnings
When you perform the demo or full runs (especially on a laptop), BenchExec might raise the following warning:
2023-XX-XX XX:XX:XX - WARNING - CPU throttled itself during benchmarking due to overheating. Benchmark results are unreliable!
This is normal on a laptop. Please ignore it.
Complete Logs
The complete logs produced by each command mentioned above can be found in data-submission/complete-logs.html for reference.
Files
IMCDF-artifact-SPIN24-submission.zip (1.6 GB, md5:2817d8b1af216180ddc192826ce66e27)