Reproduction Package
Augmenting Interpolation-Based Model Checking with Auxiliary Invariants
Abstract
This artifact is a reproduction package for the article “Augmenting Interpolation-Based Model Checking with Auxiliary Invariants”, published at SPIN 2024. It is accessible via the DOI 10.5281/zenodo.10548594 on Zenodo.
The artifact consists of source code, precompiled executables, and input data used in the evaluation of the paper, as well as the obtained experimental results. Specifically, it contains the source code and precompiled binaries of CPAchecker at revision 42901, the executables of 2LS and Symbiotic, the benchmark suite of SV-COMP 2022, the raw and processed data collected from our experiments, and the scripts to reproduce the evaluation results.
The reproduction artifact is based on the TACAS ’23 Artifact Evaluation VM (tested with Oracle VM VirtualBox v6.1), which runs Ubuntu 22.04 LTS. All necessary software dependencies for executing the tools and performing the evaluation are shipped with the artifact. Alternatively, the artifact can be executed on the SoSy-Lab VM, which also runs Ubuntu 22.04 LTS and has all required dependencies pre-installed. (If you test this artifact on the SoSy-Lab VM, all installation steps can be skipped.)
By default, we assign 4 CPU cores and 15 GB of memory to each verification task. A full reproduction of all the experiments requires roughly 3 months of CPU time. For demonstration purposes, a subset of the benchmark tasks can be executed using 1 CPU core and 3 GB of memory, which takes roughly 6 hours of CPU time in total.
Contents
This artifact contains the following items:
- README.html: this documentation
- License.txt: license information of the artifact
- Augmenting_IMC_with_Auxiliary_Invariants.pdf: a preprint of the submitted manuscript
- example.c: an example C program for demonstration (see Fig. 1 of the article)
- cpachecker/: a directory containing the source code and precompiled binaries of CPAchecker, which implements the proposed approaches
- 2ls/: a directory containing the executables of 2LS downloaded from the SV-COMP 2022 tool archives
- symbiotic/: a directory containing the executables of Symbiotic downloaded from the SV-COMP 2022 tool archives
- sv-benchmarks/: a directory containing the SV-COMP 2022 benchmark tasks used in our evaluation
- bench-defs/: a directory containing the benchmark definitions of the experiments (used by BenchExec, a framework for reliable benchmarking)
- data-submission/: a directory containing the raw and processed data produced from our full evaluation (used in the article, under paper-results/) and from a demo experiment (prepared for this reproduction package, under demo-results/)
- packages/: a directory containing the necessary Debian and Python packages to set up the environment for the experiments in the TACAS ’23 Artifact Evaluation VM
- Makefile: a file containing commands for running experiments and processing data
This readme file will guide you through the following steps:
- Set up evaluation environment
- Execute software verifiers
- Perform experiments
- Analyze experimental data
- Known issues of the artifact
TL;DR
On the TACAS ’23 Artifact Evaluation (AE) VM, type the following commands in the root folder of the reproduction package to:
- Install required dependencies (root permission required):
make install-packages
- Set up BenchExec (root permission required):
  - make configure-cgroups: configure the cgroups version for BenchExec
    - WARNING: the script changes the version of cgroups to version 1
    - Important: please reboot your system afterwards for the settings to take effect!
  - make prepare-benchexec: turn off swap memory and allow user namespaces (needs to be redone after every reboot)
  - make test-benchexec: test if BenchExec has been installed (see the installation guide)
  - make test-cgroups: test if cgroups are configured correctly
- Run software verifiers on the example in Fig. 1 of the submission (time limit set to 10 seconds for quick response):
  - make timelimit=10s test-cpachecker: the proposed analysis in CPAchecker
  - make timelimit=10s test-2ls: the default analysis of 2LS
  - make timelimit=10s test-symbiotic: the default analysis of Symbiotic
- Perform a demo experiment on 30 tasks:
  make run-demo-exp
  - To quickly check if the experiment is runnable, you could use make timelimit=60s memlimit=3GB cpulimit=1 run-demo-exp to limit the CPU time to 60 seconds, the memory to 3 GB, and the number of CPU cores to 1 per task.
  - After the run is finished, an HTML table containing the experimental results can be found in results/demo.table.html.
  - Note that the tasks in the demo experiment are selected to showcase the strengths of the proposed approaches. It is expected that the baseline approach (plain IMC) goes into timeout for several of them.
- Perform a full experiment:
  make run-full-exp
  - 4 CPU cores, 15 GB of RAM, and 900 seconds of CPU time are given to each task. The full experiment takes roughly 3 months of CPU time.
  - After the run is finished, HTML tables containing the experimental results can be found in results/*.table.html.
To view HTML files corresponding to tables and figures of the paper, please open the following links with a browser.
Note that the figures and tables in the article are formatted for space reasons, so some of them do not look exactly the same as the HTML files.
- Figure 2(a) (Note that the number of program unrollings reported by CPAchecker has a constant offset of +1, i.e., a number k in the HTML plot corresponds to k-1 in the article.)
- Figure 2(b)
- Table 1 (Note that in the article, we reported the run-time with two significant digits.)
- Figure 3(a) (Note that in the article, we cropped the first 400 tasks solved within 1 minute for space.)
- Figure 3(b)
- Figure 4(a)
- Figure 4(b)
- Figure 5
To improve readability, in the following we only excerpt important fragments of the logs. The complete log messages for the above commands are listed in data-submission/complete-logs.html for reference.
Set Up Evaluation Environment
Hardware Requirements
For the demo experiment, 3 GB of memory and 1 CPU core are allocated to each verification task; for the complete experiment, 15 GB of memory and 4 CPU cores are used. Please provide more hardware resources than a single benchmark task requires. An internet connection is not necessary.
Software Requirements
This artifact requires a Linux-based operating system using cgroups v1 and has been tested on a 64-bit Ubuntu 22.04 computer with Linux kernel 5.15.0.
In addition, the following software dependencies are required:
- BenchExec 3.17 (installation guide)
- Clang 14.0.0
- Java Runtime Environment (JRE) 11 or above
- libz3-dev 4.8.12
On the TACAS ’23 AE VM, the above dependencies can be fulfilled via the following command (you will be asked for root permission):
make install-packages
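For reference, below is a minimal sketch of the kind of offline installation such a target typically performs; the exact package names and steps are assumptions, and the authoritative recipe is the install-packages target in the Makefile:

# Hypothetical offline installation from the bundled packages/ directory:
sudo apt install ./packages/*.deb                         # Debian packages (e.g., Clang, JRE, libz3-dev)
pip3 install --no-index --find-links=packages benchexec   # Python packages shipped with the artifact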
Set Up BenchExec
We use BenchExec, a framework for reliable benchmarking and resource measurements, to perform our evaluation.
To configure cgroups version for BenchExec, please run:
make configure-cgroups
WARNING: The script will change the version of cgroups to version 1. If you do not want this change on your machine, we recommend testing the reproduction package in the TACAS ’23 AE VM.
Important: After running the above command, please reboot your system for the settings to take effect!
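For context, switching an Ubuntu 22.04 system to cgroups v1 is usually achieved via a kernel boot parameter; the following is a minimal sketch of what configure-cgroups presumably automates (the actual script may differ):

# Disable the unified (v2) cgroup hierarchy at boot and regenerate GRUB's
# configuration; the change takes effect only after a reboot:
sudo sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="/&systemd.unified_cgroup_hierarchy=0 /' /etc/default/grub
sudo update-grub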
Note that an additional preparation for BenchExec is required after each reboot. Please run (you will again be asked for root permission):
make prepare-benchexec
The above command turns off the swap memory and allows user namespaces to be used.
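A minimal sketch of the commands such a target typically issues (the exact recipe is defined in the Makefile); both settings are reset by a reboot, which is why this step must be repeated:

sudo swapoff -a                                     # disable swap so memory limits are enforced strictly
sudo sysctl -w kernel.unprivileged_userns_clone=1   # allow unprivileged user namespaces (Debian/Ubuntu sysctl)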
To test whether the cgroup permissions needed by BenchExec are configured correctly, please run:
make test-cgroups
No warnings or error messages should be printed if the permissions are configured correctly.
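The make target presumably wraps BenchExec's built-in self-check, which can also be invoked directly:

python3 -m benchexec.check_cgroups   # BenchExec's cgroup diagnosis; prints nothing on success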
If there are still unresolved problems, please take a look at BenchExec’s installation guide.
Execute Software Verifiers
Run Different Verification Algorithms in CPAchecker
To execute CPAchecker on a C program, e.g., example.c, please run:
make timelimit=10s c-prog=example.c cpa-config=imc_i-df test-cpachecker
You can change the time limit, the input C program, and the used configuration by passing arguments to timelimit, c-prog, and cpa-config, respectively. The following configurations are supported:
- imc: plain interpolation-based model checking (IMC, McMillan 2003)
- imc_f-df: augmented IMC with fixed-point checks strengthened (IMCf ← DF)
- imc_i-df: augmented IMC with interpolants strengthened (IMCi ← DF)
- ki-df: k-Induction boosted by auxiliary invariants (KI ← DF)
- impact: Impact (Impact)
- pred_abs: Predicate Abstraction (PredAbs)
Below is an example output shown on the console after the analysis is finished.
[…redacted…]
Verification result: TRUE. No property violation found by chosen configuration.
More details about the verification run can be found in the directory “./output”.
There are 3 possible outcomes of the verification result:
- TRUE: the program is “safe”, i.e., it does not violate the given property
- FALSE: the program is “unsafe”, i.e., it contains a violation of the given property
- UNKNOWN: the program might contain some unsupported feature, or the analysis ran into an error (timeout, out of memory, etc.)
For the program example.c, IMC, IMCf ← DF, IMCi ← DF, and PredAbs are able to deliver a proof within 10 seconds, whereas KI ← DF and Impact are not.
Also note that there will be no output/ folder, because CPAchecker is executed with the -noout option (see line 22 of the Makefile).
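If you prefer to bypass the Makefile, a direct invocation of CPAchecker might look like the sketch below; -noout, -timelimit, -config, and -spec are standard cpa.sh options, but the configuration and specification file paths are assumptions about how the artifact names its files (the authoritative command, including -noout, is defined in the Makefile):

# Hypothetical direct call to CPAchecker with the IMCi ← DF configuration:
cd cpachecker
scripts/cpa.sh -noout -timelimit 10s \
  -config config/imc_i-df.properties \
  -spec config/specification/default.spc \
  ../example.c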
Run 2LS and Symbiotic
To execute 2LS or Symbiotic on a C program, please run:
make timelimit=10s c-prog=example.c test-2ls # or test-symbiotic
You can change the time limit and the input C program by passing arguments to timelimit and c-prog, respectively.
2LS can prove the program example.c within 10 seconds, while Symbiotic cannot.
Perform Experiments
We provide two settings for the experiments: one for the demo run and the other for the full run. The two settings differ in (1) the set of executed tasks and (2) the executed tools/algorithms. All the other common settings are explained below.
Experimental Settings
The settings are described in the XML files bench-defs/*.xml. These XML files are used by BenchExec, a framework for reliable benchmarking.
For the execution of a task, a default resource limit of 4 CPU cores, 900 seconds of CPU time, and 15 GB of memory is imposed. (If the required memory amount is not available on your system, please follow the instructions explained below to adjust the limit.)
The XML files contain the following configurations of the verifiers compared in the evaluation:
- CPAchecker
  - Compared SMT-based algorithms: imc, imc_f-df, imc_i-df, ki-df, impact, and pred_abs
  - Different random seeds for plain and augmented IMC: imc-rs{7,61,89,165} and imc_i-df-rs{7,61,89,165}
- 2LS: default (configuration used in SV-COMP 2022)
- Symbiotic: svcomp (configuration used in SV-COMP 2022)
Before you start executing any experiment, please make sure that
- BenchExec is successfully installed, by running make test-benchexec, and
- cgroups are correctly configured, by running make test-cgroups.
Demo Run on the Selected Tasks
A complete experiment on the whole benchmark suite, consisting of 1623 C verification tasks (listed in bench-defs/sets/overall.set), takes a vast amount of time (the elapsed CPU time in our experiment was about 3 months). The experimental data produced from the full evaluation reported in the paper can be found in the folder data-submission/paper-results/.
To show how our experiments were conducted, we selected 30 tasks from the benchmark suite (listed in bench-defs/sets/demo.set) and 3 algorithms (plain and augmented IMC: IMC, IMCf ← DF, and IMCi ← DF) in CPAchecker for demonstration.
We emphasize that the demo run is only for demonstration purposes. The observations on the comparison between algorithms and tools in the article were drawn from the evaluation on the whole benchmark suite. The tasks are selected to showcase the strengths of the proposed approaches. It is expected that the baseline approach, IMC, goes into timeout for several of them; in comparison, the proposed approaches, IMCf ← DF and IMCi ← DF, are able to find more proofs on the selected set of tasks within the time limit.
This demonstrative experiment was designed to be feasible with reasonable hardware and time: it can be finished within several hours on a laptop.
To perform the demonstrative experiment, run the command below:
make run-demo-exp
Below is an example of how to adjust the resource limits. Suppose you would like to set the time limit to 60 seconds and the memory limit to 3 GB, and to use only 1 CPU core per task; then run:
make timelimit=60s memlimit=3GB cpulimit=1 run-demo-exp
Moreover, if you have enough hardware resources and would like to launch benchmark tasks in parallel, add benchexec-args="-N <num_jobs>" to the make command. For more usage information about BenchExec, please refer to benchexec -h.
After the run is finished, an HTML table containing the experimental results can be found in results/demo.table.html.
Full Run on the Complete Benchmark Suite
As mentioned above, the total CPU time elapsed for a complete experiment is about 3 months, and 900 seconds of CPU time, 15 GB of memory, and 4 CPU cores are given to each benchmark task.
To perform the full experiment, run the command:
make run-full-exp
The full experiment can be split into 3 make-targets:
- make run-aug-imc-exp: evaluate IMC, IMCf ← DF, and IMCi ← DF on 870 tasks (listed in bench-defs/sets/nontrivial-inv.set) for which DF, the invariant-generation component in CPAchecker, is able to generate non-trivial inductive invariants. The experimental results are summarized in Fig. 2, Table 1, Table 2, Fig. 4, and Fig. 5 of the article.
- make run-rand-seed-exp: compare IMC and IMCi ← DF using different random seeds for SMT solving on 870 tasks. The experimental results are summarized in Fig. 3 of the article.
- make run-cmp-exp: compare IMCi ← DF against other SMT-based algorithms (KI ← DF, Impact, and PredAbs) in CPAchecker and 2 state-of-the-art verifiers (2LS and Symbiotic) from SV-COMP 2022 on the whole benchmark suite. The experimental results are summarized in Table 3 and Fig. 6 of the article.
After the run is finished, HTML tables containing the experimental results can be found in results/*.table.html.
Analyze the Experimental Data
We recommend taking advantage of the interactive HTML files to visualize the results of the experiments. These files can be opened with a web browser (e.g., Firefox) and display the information presented in all tables and figures of the article.
Results from Our Experiments
The results (both raw and processed data) of the demo run and the full run obtained on our machines are in the folders data-submission/demo-results/ and data-submission/paper-results/, respectively. The demo run was performed to prepare this artifact, and the full run was performed to collect the data used in the paper.
The generated HTML files are:
- tab1-1.imc_i-df.improvement.table.html: generated by the make-target run-aug-imc-exp (Table 1 in the article)
- tab2.augmented-imc.summary.table.html: generated by the make-target run-aug-imc-exp (Fig. 2, Fig. 3, and Fig. 4 in the article)
- tab3.overall-comparison.table.html: generated by the make-target run-cmp-exp (Fig. 5 in the article)
We also provide pre-configured links to easily view the exact tables/figures as shown in the paper, as listed in the TL;DR section.
Here we additionally provide the links to view all the tables/figures in the extended technical report of this work:
- Figure 2(a) (Note that the number of program unrollings reported by CPAchecker has a constant offset of +1, i.e., a number k in the HTML plot corresponds to k-1 in the article.)
- Figure 2(b)
- Table 1-1 (Note that in the report, the run-time is rounded to two significant digits.)
- Table 1-2
- Table 2
- A summary table for Figure 3
- Figure 4(a) (Note that in the report, we cropped the first 400 tasks solved within 1 minute for space.)
- Figure 4(b)
- Figure 5(a)
- Figure 5(b)
- Table 3
- Figure 6
If you want to re-generate all the above HTML files from the raw data obtained in our experiments, run make gen-paper-tables. Note that this command will overwrite the existing files.
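Under the hood, such tables are typically produced with BenchExec's table-generator; below is a minimal sketch in which the result-file pattern and the table name are assumptions:

# Merge raw BenchExec result files into one interactive HTML table:
table-generator data-submission/paper-results/*.results.xml.bz2 \
  --outputpath data-submission/paper-results/ --name tab2.augmented-imc.summary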
Navigate Through the Data
Once an experiment is finished, the Makefile automatically collects the results and generates the HTML file, whose path is printed on the console.
A sample output printed at the end of the demo run:
[…redacted…]
INFO: Merging results…
INFO: The resulting table will have 30 rows and 21 columns (in 3 run sets).
INFO: Generating table…
INFO: Writing HTML into /path/to/artifact/results/demo.table.html …
INFO: done
When opening a generated HTML table, you will be guided to the Summary page of the experiment, where the detailed settings of the experiment and a summary table of the compared tools/algorithms are displayed. If you open tab2.augmented-imc.summary.table.html, you can see on this page the number of proofs found by each compared approach, as reported in Table 2.
To see the full table, please navigate to the tab Table. By filtering the status via the drop-down menus, you can see the Timeouts, Out of memory, and Other inconclusive results of each compared approach, as reported in Table 2.
To inspect the log file of an individual task, click on the status of that task. If the log file cannot be displayed, configure your browser according to the printed instructions.
To filter tasks, you can make use of the task filter at the upper-right corner of the page. To view quantile plots, please navigate to the tab Quantile Plot and adjust the drop-down menus as you prefer. To view scatter plots, please navigate to the tab Scatter Plot and adjust the x- and y-axes according to your interests.
Known Issues of the Artifact
Known issues of this artifact are documented below.
CPU-throttling Warnings
When you perform the demo or full runs (especially on a laptop), BenchExec might raise the following warning:
2023-XX-XX XX:XX:XX - WARNING - CPU throttled itself during benchmarking due to overheating. Benchmark results are unreliable!
This is normal on a laptop. Please ignore it.
Complete Logs
The complete logs produced by each command mentioned above can be found in data-submission/complete-logs.html for reference.
Files
IMCDF-artifact-SPIN24-submission.zip (1.6 GB, md5:2817d8b1af216180ddc192826ce66e27)