Published May 3, 2024 | Version FSE24-proceedings

Reproduction Package for FSE 2024 Article “A Transferability Study of Interpolation-Based Hardware Model Checking for Software Verification”

Abstract

This artifact is a reproduction package for the article “A Transferability Study of Interpolation-Based Hardware Model Checking for Software Verification”, accepted at FSE 2024. It is archived on Zenodo with the DOI 10.5281/zenodo.11070973.

The FSE article investigates whether the claims reported in two prior publications on interpolation-based hardware model checking transfer to software verification. The two publications are (1) Interpolation-Sequence-Based Model Checking (Vizel and Grumberg, 2009) and (2) Intertwined Forward-Backward Reachability Analysis Using Interpolants (Vizel, Grumberg, and Shoham, 2013), which propose the model-checking algorithms ISMC and DAR for hardware circuits, respectively. For the FSE article, we adapted ISMC and DAR to programs and implemented them in the software-verification framework CPAchecker. This artifact supports the reproduction of the experiments in the FSE article, which compared the implementations of ISMC and DAR against existing interpolation-based verification techniques in CPAchecker, including IMC (McMillan, 2003), Impact (McMillan, 2006), and PredAbs (Henzinger, Jhala, Majumdar, and McMillan, 2004), both to validate the claims of the two publications and to investigate how the algorithms perform compared to classical approaches for software verification.

The artifact consists of the source code, precompiled executables, and input data used in the evaluation of the transferability study, as well as the results produced by the experiments. Specifically, it includes the source code and binaries of CPAchecker (at revision 45787 of branch “itp-mc-with-slt”), which implements the verification algorithms compared in the article; the SV-COMP 2023 benchmark suite; the experimental data generated in the evaluation; and instructions for running the tools and experiments.

This reproduction package works best with the SoSy-Lab Virtual Machine, which runs Ubuntu 22.04 LTS and has all required dependencies installed. If you test this artifact in this VM, no additional packages need to be installed.
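
For example, the artifact archive can be copied into the VM and unpacked as follows; the user name, host name, and target directory are placeholders and depend on your VM setup:

# copy the artifact archive into the VM and unpack it (placeholder user/host)
scp DarIsmcTransferability-artifact-FSE24-proceedings.zip <user>@<vm-host>:~
ssh <user>@<vm-host>
unzip DarIsmcTransferability-artifact-FSE24-proceedings.zip
cd <artifact-root>   # the directory created by unzipping the archive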

By default, we assign 2 CPU cores, 15 GB of memory, and a CPU-time limit of 1800 s to each verification task. A full reproduction of all experiments took more than 10 months of CPU time on our machines. For demonstration purposes, a subset of the benchmark tasks can be executed instead, which requires roughly 2 hours of CPU time in total.

Contents

This artifact contains the following items:

  • README.{html,md}: this documentation (we recommend viewing the HTML version with a browser)
  • fse2024-paper483.pdf: the preprint of the FSE article
  • LICENSE.txt: license information of the artifact
  • doc/: a directory containing additional information about the artifact, including the system requirements (REQUIREMENTS.html) and the installation guide (INSTALL.html)
  • cpachecker/: a directory containing the source code and precompiled binaries of CPAchecker
  • sv-benchmarks/: a directory containing the SV-COMP 2023 benchmark tasks used in our evaluation
  • data-submission/: a directory containing the raw and processed data produced from our full evaluation (used in the article, under paper-results/) and from a demo experiment (prepared for this reproduction package, under demo-results/)
  • bench-defs/: a directory containing the benchmark and table definitions of the experiments (used by BenchExec, a framework for reliable benchmarking)
  • scripts/: a directory containing utility scripts for processing data
  • Makefile: a file that assembles commands for running experiments and processing data
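
To get a quick overview of the available make targets before running anything, you can inspect the Makefile directly (a small sketch, assuming GNU make; the grep pattern is only a heuristic over the usual "name:" target syntax):

# list the targets defined in the Makefile (heuristic over lines of the form "name: ...")
grep -E '^[A-Za-z0-9_.-]+:' Makefile
# print the commands a target would execute, without running them
make -n run-demo-exp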

This readme file will guide you through the following steps:

  • Set up evaluation environment
  • Execute CPAchecker on example programs
  • Reproduce experiments in the paper
    • Validate the claims in the two publications
      • Claims in the ISMC Publication
        • H1.A: ISMC is faster than IMC on tasks with property violation.
        • H1.B: ISMC is faster than IMC when IMC finds a proof only at high unrolling bounds.
        • H1.C: Overall, ISMC is faster than IMC (by 30 % in their experiment).
      • Claims in the DAR Publication
        • H2.A: For DAR, the ratio of iterations using global strengthening to the total number of iterations is less than 0.5 in most tasks.
        • H2.B: IMC finds a proof slower than DAR in many tasks even though it has a smaller convergence length.
        • H2.C: DAR computes more interpolants than IMC.
        • H2.D: DAR’s run-time is more sensitive to the sizes of interpolants than IMC.
        • H2.E: Overall, DAR is faster than IMC (by 36 % in their experiment).
    • Compare ISMC, DAR, and IMC against classical software-verification approaches
  • Interpret the experimental results

TL;DR

After copying and unzipping the artifact into the SoSy-Lab VM, execute the following commands in the root directory of the artifact.

  • Set up and test BenchExec:
    • make prepare-benchexec: turn off swap memory and allow user namespaces (needs to be redone after every reboot)
    • make test-cgroups: test if cgroups are configured correctly
  • Execute the verification algorithms in CPAchecker on an example program (the expected verification result is TRUE, meaning that the program contains no property violation): make run-cpa-dar, make run-cpa-ismc, make run-cpa-imc, make run-cpa-impact, or make run-cpa-predabs
  • Perform a demo experiment on 45 tasks: make run-demo-exp
    • To quickly check if the experiment is runnable, you can use make timelimit=60s memlimit=3GB cpulimit=1 run-demo-exp to limit the CPU time to 60 seconds, the memory to 3 GB, and the number of CPU cores to 1 per task.
    • After the runs are finished, execute make gen-demo-table to produce an HTML table (results/demo.table.html) containing the experimental results.
    • Results obtained from our machines are provided in the directory data-submission/demo-results/ (HTML table for the demo experiment).
  • Perform a full experiment: make run-full-exp
    • 2 CPU cores, 15 GB of RAM, and 1800 seconds of CPU time are given to each task. The full experiment takes roughly 10 months of CPU time.

    • After the runs are finished, execute the following command to generate HTML tables (results/*.table.html) containing the experimental results:

      make gen-full-table gen-nomultiloop-table gen-nomultiloop-eca-table gen-nomultiloop-sequentialized-table gen-nomultiloop-loops-table
    • Results obtained from our machines are stored in the directory data-submission/paper-results/.

To view the HTML pages corresponding to the tables and figures of the paper, please refer to the sections “Our Results - Validating Claims in Prior Publications” and “Our Results - Comparison with Impact and PredAbs”.

To improve readability, in the following we only excerpt important fragments of the logs. The complete log messages for some of the listed commands are attached in data-submission/complete-logs.html for your reference.

Set Up Evaluation Environment

System Requirements

Please refer to doc/REQUIREMENTS.html.

Installation Guide

Please refer to doc/INSTALL.html.

Execute CPAchecker

Run Verification Algorithms in CPAchecker

The following five interpolation-based algorithms in CPAchecker were evaluated: IMC (McMillan, 2003), ISMC (Vizel and Grumberg, 2009), DAR (Vizel, Grumberg, and Shoham, 2013), Impact (McMillan, 2006), and PredAbs (Henzinger, Jhala, Majumdar, and McMillan, 2004).

Below are the commands to run the evaluated verification algorithms on an example program sv-benchmarks/c/loop-invariants/const.c.

  • DAR:

    # expected verification result: TRUE
    make run-cpa-dar
  • ISMC:

    # expected verification result: TRUE
    make run-cpa-ismc
  • IMC:

    # expected verification result: TRUE
    make run-cpa-imc
  • Impact:

    # expected verification result: TRUE
    make run-cpa-impact
  • PredAbs:

    # expected verification result: TRUE
    make run-cpa-predabs

You can run the algorithms on another program by specifying c-prog=<path_to_program>. For example,

# expected verification result: FALSE
make c-prog="sv-benchmarks/c/bitvector-regression/implicitfloatconversion.c" run-cpa-dar

Note that the verdict for a program under verification may depend on the data model of the platform. In our experiments, the data model of a program under verification is recorded as a parameter of the verification task and given as input to CPAchecker. The two example programs above assume the ILP32 data model (cf. sv-benchmarks/c/loop-invariants/const.yml and sv-benchmarks/c/bitvector-regression/implicitfloatconversion.yml). To specify the data model (ILP32 or LP64), add cpa-args="-32" or cpa-args="-64" to the command, respectively. For more information on how to run CPAchecker, see cpachecker/README.md.
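
For example, to make the ILP32 data model explicit for the first example program above (the program and all options below already appear in this readme), you could run:

# same example program as above, with the ILP32 data model requested explicitly
# expected verification result: TRUE
make c-prog="sv-benchmarks/c/loop-invariants/const.c" cpa-args="-32" run-cpa-dar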

Understand the Output of CPAchecker

The following is an example output shown on the console after an analysis is finished.

[...redacted...]

Verification result: TRUE. No property violation found by chosen configuration.
More details about the verification run can be found in the directory "./output".
Graphical representation included in the file "./output/Report.html".

There are three possible verification results:

  1. TRUE: the program is “safe”, i.e., it satisfies the given safety property
  2. FALSE: the program is “unsafe”, i.e., it contains a violation of the given safety property
  3. UNKNOWN: the analysis was inconclusive

CPAchecker also generates an HTML report (in directory output/) for conveniently browsing the results. Please refer to cpachecker/doc/Report.md for more information.
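
For instance, assuming a graphical browser is available in the VM, the report can be opened from the command line:

# open the generated report in the default browser (requires a graphical session)
xdg-open output/Report.html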

Conduct Experiments

Experimental Settings

The settings for the full (resp. demo) experiment are described in the XML file bench-defs/cpachecker.xml (resp. bench-defs/cpachecker-demo.xml). This XML file lists the tasks to be benchmarked and is used by BenchExec to limit computing resources and schedule runs. For the execution of each task, a resource limit of 2 CPU cores, 1800 seconds of CPU time, and 15 GB of memory is imposed.
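
As a quick sanity check, you can inspect the resource limits directly in the benchmark definition; the sketch below assumes they are stored in the usual BenchExec attributes (timelimit, memlimit, cpuCores):

# show the resource-limit attributes of the full benchmark definition (attribute names assumed)
grep -Eo '(timelimit|memlimit|cpuCores)="[^"]*"' bench-defs/cpachecker.xml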

Perform Demo Evaluation

The full evaluation requires more than 10 months of CPU time on our machines with 3.40 GHz cores. Therefore, we provide ways to conduct experiments on a subset of the benchmark tasks.

We selected 45 tasks from the directory sv-benchmarks/c/eca-rers2012/ (the tasks are listed in bench-defs/sets/demo.set). Note that our observations and conclusions in the article were drawn from the full evaluation. The demo evaluation is provided for demonstration purposes only.
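
To see exactly which tasks the demo set contains, you can inspect the set file (assuming it lists one task pattern per line, as is usual for BenchExec set files):

# list and count the task patterns of the demo set
cat bench-defs/sets/demo.set
wc -l bench-defs/sets/demo.set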

To perform evaluation on these selected tasks, run:

make run-demo-exp

This demo experiment is expected to take roughly 2 hours of CPU time. After the run is finished, run the following command to generate an HTML table containing the experimental results:

make gen-demo-table  # output HTML at results/stats.demo.html

If the default resource requirements are too high for your machine, you can specify the options timelimit, memlimit, and cpulimit in the make command. For instance, to run the demo experiment with a time limit of 60 seconds, a memory limit of 3 GB, and a CPU limit of 1 core, run:

make timelimit=60s memlimit=3GB cpulimit=1 run-demo-exp

Some benchmark runs might run out of resources (i.e., hit the time or memory limit) and produce inconclusive results if the specified resource limits are too low. For your reference, the experimental results obtained from our machines are provided in the directory data-submission/demo-results/.

If, on the other hand, you have enough hardware resources and want to launch parallel benchmark runs, add benchexec-args="-N <num_jobs>" to the make command.
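
The resource-limit options and benchexec-args can be combined in a single invocation; the values below are illustrative, and we assume the Makefile accepts the options together because each of them is documented individually above:

# demo experiment with reduced limits and two parallel benchmark runs (illustrative values)
make timelimit=300s memlimit=7GB cpulimit=1 benchexec-args="-N 2" run-demo-exp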

For more usage information about BenchExec, please refer to benchexec -h.

Perform Full Evaluation

To run the full experiment over the 8813 benchmark tasks reported in the article, execute the following command:

make run-full-exp

After the runs are finished, you can generate HTML tables containing the experimental results by:

make gen-full-table                         # results of all tasks (Table 1 and 2)
make gen-nomultiloop-table                  # results of programs with at most 1 loop (Table 3)
make gen-nomultiloop-eca-table              # results of "ECA" programs with at most 1 loop (Table 3)
make gen-nomultiloop-sequentialized-table   # results of "Sequentialized" programs with at most 1 loop (Table 3)
make gen-nomultiloop-loops-table            # results of "Loops" programs with at most 1 loop (Table 3)

The generated HTML tables can be found at results/stats.*.table.html. For your reference, the experimental results obtained from our machines are stored in the directory data-submission/paper-results/. In the sections “Our Results - Validating Claims in Prior Publications” and “Our Results - Comparison with Impact and PredAbs” below, we provide pre-configured links for easily viewing the corresponding tables and figures from the article.

Alternatively, you can restrict the evaluation to a single algorithm and task set by specifying -r <algorithm_name> and -t <tasks_name>, respectively, via the option benchexec-args. For example, the following command starts an experiment of DAR on the ReachSafety-BitVectors subcategory.

make benchexec-args="-r dar -t ReachSafety-BitVectors" run-full-exp

The HTML table can then be generated by:

make gen-sub-table

View the Experimental Results

The results (both raw and processed data) of the demo and full evaluations obtained from our machines are located in the directories data-submission/demo-results/ and data-submission/paper-results/, respectively. The results produced by the make commands in the previous sections are written to the directory results/.
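
For a quick look at what is available, you can list both the reference results shipped with the artifact and the freshly generated tables:

# reference results shipped with the artifact
ls data-submission/demo-results/ data-submission/paper-results/
# tables generated by the make gen-*-table targets
ls results/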

Our Results - Validating Claims in Prior Publications

In the transferability study, eight hypotheses were extracted from the two prior publications. We list them below to make this document self-contained. For details, please refer to the bundled article.

  • Claims in the ISMC Publication

    • H1.A: ISMC is faster than IMC on tasks with property violation.
    • H1.B: ISMC is faster than IMC when IMC finds a proof only at high unrolling bounds.
    • H1.C: Overall, ISMC is faster than IMC (by 30 % in their experiment).

    In our experiments, we confirmed that H1.A also holds for software verification, but H1.B and H1.C did not hold.

  • Claims in the DAR Publication

    • H2.A: For DAR, the ratio of iterations using global strengthening to the total number of iterations is less than 0.5 in most tasks.
    • H2.B: IMC finds a proof slower than DAR in many tasks even though it has a smaller convergence length.
    • H2.C: DAR computes more interpolants than IMC.
    • H2.D: DAR’s run-time is more sensitive to the sizes of interpolants than IMC.
    • H2.E: Overall, DAR is faster than IMC (by 36 % in their experiment).

    In our experiments, we confirmed that H2.A and H2.C also hold for software verification, but H2.B, H2.D, and H2.E did not hold.

Here we provide pre-configured links to easily view the corresponding tables/figures shown in the article. Please open the links with a browser.

⋄: Timeout cases are excluded from the plots.
*: CPU time in the HTML table is rounded to 3 significant digits

Our Results - Comparison with Impact and PredAbs

Besides validating the claims in the prior publications, we compared IMC, ISMC, and DAR with Impact and PredAbs.

Here we provide pre-configured links to easily view the corresponding tables/figures shown in the article. Please open the links with a browser.

Navigate Through the Data

When opening the generated HTML table, you will be guided to the Summary page of the experiment, where detailed settings of the experiment and a summary table of the compared algorithms are displayed.

To see the full table, please navigate to the tab Table. By filtering the status via the drop-down menus, you can see the numbers of Timeouts, Out of memory, and Other inconclusive results for each compared approach, as reported in Table 1. Below, we briefly explain the meaning of some columns in the table:

  • #bmc-unroll: the number of loop unrollings in the BMC stage
  • converg-len: the convergence length to reach a fixed point
  • #interpolants: the number of interpolants computed during the verification process
  • #avg-itp-atoms: the average number of atoms in the computed interpolants
  • fp-reached: whether a fixed-point was reached during the verification process
  • #dar-global-phases: the number of times that DAR enters the global strengthening stage
  • dar-global-ratio: the ratio of iterations using global strengthening to the total number of iterations (#dar-global-phases / converg-len)

To inspect the log file of an individual task, click on the status of that task. If the log file cannot be displayed, configure your browser according to the printed instructions.

To filter tasks, you can use the task filter at the upper-right corner of the page. To view quantile plots, please navigate to the tab Quantile Plot and adjust the drop-down menus as you prefer. To view scatter plots, please navigate to the tab Scatter Plot and adjust the x- and y-axes according to your interests.

Known issues

Known issues of the artifact are documented below.

CPU-throttling warnings

When you are running experiments (especially on a laptop), BenchExec might raise the following warning:

2024-XX-XX XX:XX:XX - WARNING - CPU throttled itself during benchmarking due to overheating. Benchmark results are unreliable!

This is normal on a laptop, and the warning can be safely ignored.

File-not-found warnings

The following warnings might pop up when you start an experiment:

2024-XX-XX XX:XX:XX - WARNING - No files found matching 'ntdrivers-simplified/*.yml'.
2024-XX-XX XX:XX:XX - WARNING - No files found matching 'openssl/*.yml'.
2024-XX-XX XX:XX:XX - WARNING - No files found matching 'verifythis/duplets.yml'.
[...redacted...]

These warnings are caused by the directory structure of the SV-COMP benchmark set and can be safely ignored.
