# Reproduction Package for SPIN 2024 Article "Verification Witnesses Version 2"
## Abstract
This artifact is a reproduction package for the paper "Software Verification Witnesses 2.0", which has been accepted at SPIN 2024.
It consists of all executables, input data, and results required to reproduce the experiments and to inspect the raw data from which the results presented in the paper were extracted. Specifically, it contains the tools Symbiotic, CPAchecker, UAutomizer, and witness-lint, as well as the input data sv-benchmarks as used in SV-COMP24.
The artifact is based on the SoSy-Lab Virtual Machine (Ubuntu 22.04 LTS) and has been set up such that no extra configuration is necessary and the artifact runs inside it without any problems.
By default, experiments are run with 2 cores and 15 GB of memory for at most 900 seconds. A full reproduction of the experiments requires roughly 8 months of CPU time. For demonstration purposes, a subset of tasks has been selected which should take at most 1 day to complete and will likely finish sooner. To test that everything is working as intended, all experiments can be run for two files which should finish all the experiments in around 30 minutes.
## Contents
This artifact contains the following items:
- `README.md`: This file.
- `*.xml`: Mostly benchmark definition files for running the experiments using BenchExec.
- `config.sh`: Config with parameters for BenchExec.
- `Makefile`: Makefile to run the experiments.
- `setup.sh`: Script to update the modified time of the CPAchecker binary as required by BenchExec and to add the required dependencies to the witness linter.
- `results.zip`: Results of the experiments as presented in the paper. They have been zipped for convenience, since they are around 10 GB in size when uncompressed.
- `change_tasks.py`: Script to change the tasks in the XML files from the full set to a subset or to two files.
- `License.txt`: License information for the artifact.
- `cache`: Contains a helper script to cache the downloading of tools and sv-benchmarks. It is not relevant for the artifact, since everything is self-contained, but may be relevant for a reproduction.
- `csv-generation.xml`: XML file to generate the CSV files from the results. It tells BenchExec what information to put into the CSV files.
- `sv-benchmarks`: The SV-COMP24 benchmarks.
- Tools:
  - CPAchecker: CPAchecker version 0af0e41240 in folder `cpachecker`.
  - Symbiotic: Symbiotic version 9c278f9 in folder `val-symbiotic-witch`.
  - Symbiotic Witch: Symbiotic Witch version svcomp24.
  - Witch: Witch version b011ec9 in folder `symbiotic-witch`, using Witch Klee in version 6dabb94.
  - UAutomizer: UAutomizer version 0.2.4-?-8430d5a-m in folder `uautomizer-graphml` and 0.2.4-dev-0e0057c in folder `uautomizer-yaml`.
  - witness-lint: Witness-lint version 2.0.2 in folder `witness_linter`.
  - BenchExec: BenchExec version 19a85ac in folder `benchexec`.
This README contains the following sections to help the user reproduce the experiments:
- TL;DR: A quick guide to reproduce the experiments and analyze the data.
- Environment: Describes the environment in which the artifact was tested and can be run.
- Experiments: Describes how to execute the experiments and generate the results.
- Results: Describes where the results can be found and how to analyze them.
- Known Issues: Describes known issues when executing the artifact.
## TL;DR
- Run `setup.sh` to update the modified time of the CPAchecker binary as required by BenchExec.
- Change into the directory of the artifact, i.e., `software-verification-witnesses-2.0-artifact-SPIN24-final`.
- Run `source config.sh` to set the environment variables.
- Run `change_tasks.py` with one of the following options to select which task set to run. Be aware that this will create new `xml` files and save the original ones in `.xml.bkp` files. By default, all tasks are selected.
  - `--all`: Run all tasks.
  - `--subset`: Run a representative subset of tasks.
  - `--single`: Run a single task with an error and a single task which is correct, i.e., one task for violation and one for correctness.
  - `--clean`: Restore the original `xml` files.
- Run `make all-experiments` to run all the experiments.
  - If desired, only particular experiments can be run via `make experiment-*`. The following naming convention has been used:
    - Experiments related to verification need to be run before the validation or witness-analysis experiments.
    - Experiments containing `correctness` in the name are related to correctness witnesses and experiments containing `violation` are related to violation witnesses.
    - Experiments containing `verification` in the name are related to verification.
    - Experiments containing `validate` in the name are related to validation of the witnesses.
    - Experiments containing `lint` in the name are related to witness analysis.
- After the experiments are finished, run `make generate-experiments-table-all` to generate the HTML table with the results.
- Open the table and use different filters to validate the results presented in the paper. A full explanation of how to validate the results in the paper is given in the section Results.
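Put together, a minimal smoke-test run of the whole pipeline might look like this (a sketch only; it assumes the scripts are executable and follows the step order above):

```bash
./setup.sh                  # update the mtime of the CPAchecker binary
cd software-verification-witnesses-2.0-artifact-SPIN24-final
source config.sh            # set the environment variables
./change_tasks.py --single  # smoke test: one violation and one correctness task
make all-experiments
make generate-experiments-table-all
```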
## Environment

To run the artifact, `BenchExec` and `Python >= 3.8` are required. For the particular tools, please see their respective documentation. A non-exhaustive list of required packages:

- CPAchecker: Java 17
- Symbiotic: Python 3.8, Clang 14, LLVM 14
- UAutomizer: Java 11
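When running outside the VM, the requirements can be checked quickly (version expectations taken from the list above; `benchexec` being on the `PATH` is an assumption):

```bash
java -version       # Java 17 for CPAchecker, Java 11 for UAutomizer
python3 --version   # Python >= 3.8
clang --version     # Clang/LLVM 14 for Symbiotic
benchexec --version # BenchExec must be installed and on the PATH
```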
Everything needed to run the artifact has been installed in the SoSy-Lab Virtual Machine (Ubuntu 22.04 LTS). It is recommended to run the experiments inside that VM, since no installation is necessary.
To set up the environment on the VM for executing the experiments, please follow these steps:
- Execute `setup.sh` to update the modified time of the CPAchecker binary as required by BenchExec.
- Change into the directory of the artifact, i.e., `software-verification-witnesses-2.0-artifact-SPIN24-final`.
- Run `source config.sh` to set the environment variables.

Note: From this point on, it is assumed that the working directory is `software-verification-witnesses-2.0-artifact-SPIN24-final` inside the artifact.
## Experiments

### Running the Experiments
To evaluate how witnesses perform between versions 1.0 and 2.0, 21 different experiments are provided. The experiments are divided into verification, validation, and witness-analysis tasks, and subdivided into violation and correctness tasks.
The provided Makefile documents how to reproduce the different experiments (Makefile targets starting with `experiment-`). Experiment targets related to verification are marked with `verification`, validation with `validate`, and witness analysis with `lint`. Experiments related to correctness witnesses are marked with `correctness` and experiments related to violation witnesses with `violation`. The experiments for witness version 1.0 are marked with `graphml` and for witness version 2.0 with `yaml`. Verification experiments already export both witness versions, so extra runs are not necessary. The following targets need to be run, in the following order:
- Verification experiments
  - `experiment-cpachecker-violation-verification`
  - `experiment-symbiotic-verification`
  - `experiment-cpachecker-correctness-verification`
- Validation experiments
  - Correctness
    - `experiment-cpachecker-validate-cpachecker-correctness-graphml`
    - `experiment-cpachecker-validate-cpachecker-correctness-yaml`
    - `experiment-uautomizer-validate-correctness-graphml`
    - `experiment-uautomizer-validate-correctness-yaml`
  - Violation
    - `experiment-cpachecker-validate-cpachecker-violation-graphml`
    - `experiment-cpachecker-validate-cpachecker-violation-yaml`
    - `experiment-symbiotic-validate-cpachecker-graphml`
    - `experiment-symbiotic-validate-cpachecker-yaml`
    - `experiment-symbiotic-validate-symbiotic-graphml`
    - `experiment-symbiotic-validate-symbiotic-yaml`
    - `experiment-cpachecker-validate-symbiotic-graphml`
    - `experiment-cpachecker-validate-symbiotic-yaml`
- Witness analysis experiments
  - Correctness
    - `experiment-lint-cpachecker-correctness-graphml`
    - `experiment-lint-cpachecker-correctness-yaml`
  - Violation
    - `experiment-lint-cpachecker-violation-graphml`
    - `experiment-lint-cpachecker-violation-yaml`
    - `experiment-lint-symbiotic-violation-graphml`
    - `experiment-lint-symbiotic-violation-yaml`
The dependency between verification and validation of tasks has been made explicit as comments in the commands inside the Makefile.
Executing all experiments in the required order can be done by using:

```bash
make all-experiments
```
Experiments can be interrupted (with CTRL-C) at any moment to get the results for the currently completed tasks; this allows the user to continue the evaluation of the pipeline without having to wait for all the experiments to finish.
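For example, a partial run respecting this ordering could look as follows (target names are taken from the list above; the verification target must come first because the validation and lint targets consume its witnesses):

```bash
make experiment-cpachecker-violation-verification             # produces the witnesses
make experiment-cpachecker-validate-cpachecker-violation-yaml # validates them
make experiment-lint-cpachecker-violation-yaml                # analyzes them
```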
### Light Evaluation

To do a light evaluation of the artifact, a subset of tasks can be selected. This can be done by executing the script `change_tasks.py`, which modifies the XML files to only contain a subset of tasks.

Run `change_tasks.py` with one of the following options to select which task set to run (an example invocation is shown after the list). Be aware that this will create new `xml` files and save the original ones in `.xml.bkp` files. By default, all tasks are selected.
- `--all`: Run all tasks. This will take a long time to finish and is required to reproduce the full results.
- `--subset`: Run a subset of tasks. This subset provides an overview of the results and should finish in a reasonable amount of time.
- `--single`: Run a single task with an error and a single task which is correct, i.e., one task for violation and one for correctness. The purpose of this option is to test that everything is working as intended; in particular, when the artifact is not used in the recommended VM, this tests that all requirements have been installed correctly.
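A light evaluation could thus look like this (invoking the script with `./` assumes it is executable; otherwise use `python3 change_tasks.py`):

```bash
./change_tasks.py --subset  # originals are backed up as .xml.bkp files
make all-experiments
./change_tasks.py --clean   # restore the full task sets afterwards
```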
### Experiment Requirements

The experiments were run with 2 cores, 15 GB of memory, and a timeout of 900 seconds. The timeout is decreased to 90 seconds for the validation of violation witnesses. If so desired, these parameters can be adjusted in the `*.xml` files or in the `config.sh` file. Be aware that CPAchecker and UAutomizer require at least 2 cores. Reasonably reduced resources are 2 cores, 8 GB of memory, and a timeout of 90 seconds.
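To see which limits are currently set before editing them, one option is the following (the attribute names `timelimit`, `memlimit`, and `cpuCores` are BenchExec's standard benchmark-definition attributes; whether every file in this artifact sets all three is an assumption):

```bash
grep -o 'timelimit="[^"]*"\|memlimit="[^"]*"\|cpuCores="[^"]*"' *.xml | sort -u
```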
## Results

### Generating the Results
After the experiments are finished, the results can be generated by running the following command:

```bash
make generate-experiments-table-all
```

This will generate three HTML tables and CSV files containing all the results of the experiments in the folder `results-processed`.
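The generated files can then be listed directly:

```bash
ls -lh results-processed/  # three HTML tables plus the corresponding CSV files
```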
### Results used for the Paper
The raw results which were used to obtain the numbers in the paper can be found in the `results.zip` file. Once unzipped, this file contains the following folders:

- `correctness`: contains the results for all experiments related to correctness witnesses. This folder contains the raw data to answer RQ1 in the paper for correctness witnesses.
- `violation`: contains the results for all experiments related to violation witnesses. This folder contains the raw data to answer RQ1 in the paper for violation witnesses.
- `witness-analysis`: contains the results for all experiments related to analysis metrics of the witnesses, where they are divided between violation and correctness witnesses. This folder contains the raw data to answer RQ2 in the paper.
- `tables`: contains the processed data which was used to generate the results in the paper. It contains two kinds of files:
  - `{correctness,violation,witness-analysis}.html`: The tables in HTML format; they can be opened in a web browser.
  - `{correctness,violation,witness-analysis}.csv`: The HTML tables as CSV files; they can be opened in a spreadsheet program.
Each of the three tables `{correctness,violation,witness-analysis}.html` is responsible for part of the data:

- `correctness.html` contains the results for the validation of correctness witnesses, i.e., RQ1 for correctness witnesses.
- `violation.html` contains the results for the validation of violation witnesses, i.e., RQ1 for violation witnesses.
- `witness-analysis.html` contains the results for the analysis of the witnesses, i.e., RQ2.
### Validate the Results in the Paper

Once you have an HTML table and a corresponding CSV file with all the results, you can analyze the HTML table by opening it in a browser. Please be aware that, due to the size of the table, the browser may take a while to load and update the table.

For a full description of how to interact with the HTML table, please see the documentation inside BenchExec.
The plots and tables in the paper present the data contained in the HTML tables in `results/tables` in a format appropriate for a paper. The HTML table is the ideal way to analyze the results, since it is interactive and allows expressing complex filters. In particular, having the data from the paper at hand, validating it is easy using the corresponding filters.
For each of the corresponding tables and plots in the paper, the filters will be given. There are two ways to apply them: by appending the filter to the URL of the table, i.e., changing `*.table.html` in the browser URL to `*.table.html#FILTER`, or by selecting the filter manually inside the table. The `Summary` tab of the table contains a summary of the results, the tab `Quantile Plot` contains a quantile plot, and the tab `Table` contains the raw data in table form. Hovering over a header in the `Table` tab shows which file that column of the table corresponds to.
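As an illustration, the URL variant can be applied directly from the command line (the browser command is illustrative; the filter string is the one given below for `correctness.html`):

```bash
firefox 'results/tables/correctness.html#/?filter=0(0*status*(category(in(correct))))'
```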
### Reproduce the Results for RQ1

#### Correctness Witnesses
For this analysis, please use the table `correctness.html` in the `results/tables` folder.

To validate the results for RQ1 related to correctness witnesses, you need to filter the column containing the results of the experiment `experiment-cpachecker-correctness-verification`, usually called `cpachecker-correctness-verification.*`, in the tab `Table` for all tasks which were correctly solved. This can be done by selecting the status and filtering by `correct` under `category`. This will give you the number of tasks which were correctly solved by CPAchecker for tasks which are expected to not reach the location marked by `reach_error`. The filter for the table is given by `#/?filter=0(0*status*(category(in(correct))))`.
Returning to the `Summary` tab, you can see the numbers of Table 4 by looking at the results for the columns `{cpachecker,uautomizer}-validate-cpachecker-correctness-{graphml,yaml}.*` for CPAchecker and UAutomizer as validators for witness versions 1.0 and 2.0. Results marked as `true` correspond to confirmed witnesses and results marked as `false` correspond to refuted witnesses.
Changing to the `Quantile Plot` tab and selecting `cputime`, you will be able to see the quantile plots for the results of the validation of the correctness witnesses, which correspond to Figure 3 in the paper.
#### Violation Witnesses
For this analysis, please use the table `violation.html` in the `results/tables` folder.

For violation witnesses, the procedure is the same as for correctness witnesses, but since there are two verifiers, both need to be considered. The filter also needs to be applied to tasks with an expected verdict of `false` instead of `true`. The columns to look at are `cpachecker-violation-verification.*` and `symbiotic-verification.*` for CPAchecker and Symbiotic as verifiers, respectively.
When filtering for the tasks CPAchecker solved correctly, the relevant columns in the tab `Summary` are `{cpachecker,symbiotic}-validate-cpachecker-violation-{graphml,yaml}.*` for the different alternatives. These numbers correspond to the ones in Table 5 in the paper for CPAchecker as a verifier. In the tab `Quantile Plot` you will see the results of Figure 5.a. For CPAchecker as a verifier, the filter is given by `#/?filter=2(0*status*(category(in(correct))))`.

When filtering for Symbiotic as a verifier, the relevant columns are `{cpachecker,symbiotic}-validate-symbiotic-violation-{graphml,yaml}.*` for Table 5. The tab `Quantile Plot` contains the results of Figure 5.b. For Symbiotic as a verifier, the filter is given by `#/table?filter=9(0*status*(category(in(correct))))`.
### Validate the Results for RQ2
For this analysis, please use the table `witness-analysis.html` in the `results/tables` folder.

To obtain the numbers for the tables related to RQ2, and not only validate them, a small analysis needs to be run on the CSV file with the same name as the HTML table; this analysis is not part of the artifact.
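As a rough sketch of what such an analysis could look like (not the script used for the paper; the column index, the tab separator, and the number of header rows are assumptions about the CSV layout and must be adapted to the actual file):

```bash
# Hypothetical: metric in column 5, 3 header rows, tab-separated fields.
awk -F'\t' 'NR > 3 { print $5 }' results/tables/witness-analysis.csv | sort -n |
  awk '{ v[NR] = $1 } END { print "min=" v[1], "median=" v[int((NR+1)/2)], "max=" v[NR] }'
```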
#### Correctness Witnesses
To validate the results in Table 6, you can use the `Table` tab to set different ranges for the different metrics of the witnesses. There is a field below each metric where a range can be given. This allows you to check that the numbers correspond to the minimum, median, and maximum: inputting the range `:minimum` will leave only the single minimum value, the range `maximum:` will leave only the single maximum value, and the range `median:` will divide the dataset into two halves, which can be seen in the `Summary` tab.
In contrast to the results for validation, no filtering is required, since all witnesses exported by CPAchecker are used. Be aware that CPAchecker exports a correctness witness even if the task times out, in order to show the user the state of the current analysis. The numbers presented here are based on all correctness witnesses exported by CPAchecker, not only those for tasks which were solved correctly.
The relevant columns for this analysis are `lint-correctness-witnesses-{graphml,yaml}.*` for witness versions 1.0 and 2.0, respectively.
#### Violation Witnesses
To validate the results in Table 7, the same procedure as for correctness witnesses can be used. The only difference is that the numbers are based on violation witnesses instead of correctness witnesses.

The relevant columns for this analysis are `lint-violation-witnesses-{cpachecker,symbiotic}-{graphml,yaml}.*` for witnesses produced by CPAchecker/Symbiotic in versions 1.0 and 2.0, respectively.
## Known Issues

### Warnings
The benchmark files `*.xml` are the ones used in SV-COMP 2024 and therefore contain the tasks for all properties. The configuration file `config.sh` filters out all other tasks such that only the reachability tasks are used. This is done after BenchExec has started processing the files, which is the cause of most warnings.
During the execution of the experiments, some warnings may appear. For example:

```
2024-XX-XX XX:XX:XX - WARNING - No files found matching 'sv-benchmarks/c/SoftwareSystems-uthash-MemCleanup.set'.
2024-XX-XX XX:XX:XX - WARNING - No files found matching 'memsafety-broom/*.yml'.
2024-XX-XX XX:XX:XX - WARNING - No files found matching 'sv-benchmarks/c/SoftwareSystems-DeviceDriversLinux64-Termination.set'.
2024-XX-XX XX:XX:XX - WARNING - CPU throttled itself during benchmarking due to overheating. Benchmark results are unreliable!
```
Both kinds of warnings can be ignored. The missing-file warnings are harmless, since those files are not related to the property being analyzed, which is unreach-call, and therefore do not affect the reproduction of the results. The throttling warning may affect the results, but should not happen when reproducing the results outside a VM.
The following warning can also be safely ignored, since it only says that no witness was found for the task. This is expected for tasks which could not be verified correctly or tasks which were not used during verification, for example, for properties other than reachability of an error function:

```
2024-XX-XX XX:XX:XX - WARNING - Pattern ... in required tag did not match any file for task ...
```
### Errors
When BenchExec tries to create an overlay mount while there is a shared folder mounted inside the VM, the following error may occur:

```
2024-XX-XX XX:XX:XX - ERROR - Failed to create overlay mount for /home/...: invalid argument ...
```

In this case, the shared folder should be unmounted and the experiments should be run again.
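A sketch of the workaround (the mount point is illustrative; VirtualBox shared folders typically appear under `/media/sf_*`):

```bash
mount | grep -i 'sf_'         # locate the shared-folder mount
sudo umount /media/sf_shared  # illustrative path; use the one found above
make all-experiments          # re-run the experiments
```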