Artifact for CAV submission 169 - The CAISAR Platform: Extending the Reach of Machine Learning Specification and Verification
Description
Note: This document is written in Markdown format. For the best viewing experience, please open it with a Markdown reader or editor.
Foreword
This document describes how to reproduce the experiments detailed in the paper submitted at CAV'25.
The artifact is available at the following URL: https://zenodo.org/records/15314337/files/169_artifact.zip?download=1
The SHA256sum of the archive is the following: ad653e47612651bf0381eb05e907cb1f7478ad136380ce3c96cbdc0b68d25222
The md5sum of the archive is the following: b01a1fcaf47cf21b09bc88ab560c0746
The associated DOI is 10.5281/zenodo.15314337
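The archive can be checked against the SHA-256 digest listed above before unpacking. The snippet below is a minimal sketch, assuming `sha256sum` from GNU coreutils is available and that the download was saved as `169_artifact.zip` (the saved file name is an assumption). `verify_sha256` is a hypothetical helper and is not part of the artifact.

```shell
# Compare a file's SHA-256 digest ($1) with an expected hex string ($2).
# Relies on sha256sum from GNU coreutils; returns 0 only on an exact match.
verify_sha256() {
    actual=$(sha256sum "$1" | cut -d' ' -f1)
    [ "$actual" = "$2" ]
}

# Intended use, once the archive is present in the current directory
# (file name assumed, digest copied from this document):
# verify_sha256 169_artifact.zip \
#     ad653e47612651bf0381eb05e907cb1f7478ad136380ce3c96cbdc0b68d25222 \
#     && echo "checksum OK"
```

A mismatch makes the function return a non-zero status, so it can guard the unpacking step in a script.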
License
This artifact is considered part of the CAISAR platform and, as such, shares its original license. The license text is included in the attached LICENSE file.
Content
This archive contains the following files:
- LICENSE
- Dockerfile provides a recipe for building CAISAR and other companion provers: nnenum, Marabou, PyRAT and SAVer.
- The patches folder contains temporary fixes for the benchmarks to run properly inside Docker:
  - relax-shapes-maraboupy.patch
  - relax-multiple-nn-nnenum.patch

  These are applied automatically while building CAISAR inside Docker.
- Experiment files under the following folders:
  - acas/acas_experiments.sh
  - sequencing/sequencing_experiments.sh
  - svms/shsolve.sh
- The svms folder contains additional data:
  - SVM model with a linear kernel linear-1k.ovo
  - 100 data point instances under the svms/singles folder, stored as input_$i.csv
  - theory.param.why describing the WhyML specification that corresponds to the SAVer ovo setting
Claims addressed by the artifact
Unnormalized ACAS-Xu (Claim 1)
Relevant paper section: 3.1
This artifact demonstrates the ability of CAISAR to check normalized and unnormalized properties of the ACAS-Xu benchmark, using automated graph editing to integrate the normalization and denormalization directly inside neural networks.
Verification of SVMs (Claim 2)
Relevant paper section: 3.2
This artifact shows how one can write a specification of adversarial robustness for support vector machines (SVMs) and verify it without the need for a dedicated prover. To do so, it compares the results obtained with the neural network-tailored prover PyRAT and with the SVM-tailored prover SAVer.
More precisely, the script will output for each instance the following string:
Saver output: XYZ v PyRAT output: ANSW
which should be as follows:
- If XYZ=111 then ANSW must be Valid.
- If XYZ=010 then ANSW must be Invalid.
- If XYZ=101 then ANSW must be Unknown.
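The correspondence above can be written as a small check. This is only a sketch: `expected_pyrat_answer` and `check_line` are hypothetical helpers (not part of the artifact), and the line format is assumed to match the output string quoted above exactly.

```shell
# Map a SAVer verdict code to the PyRAT answer expected for it.
expected_pyrat_answer() {
    case "$1" in
        111) echo "Valid" ;;
        010) echo "Invalid" ;;
        101) echo "Unknown" ;;
        *)   echo "unexpected SAVer code: $1" >&2; return 1 ;;
    esac
}

# Check one line of the form "Saver output: XYZ v PyRAT output: ANSW".
# Returns 0 only when ANSW is the answer expected for XYZ.
check_line() {
    saver=$(printf '%s\n' "$1" | sed -n 's/^Saver output: \([0-9]*\) v PyRAT output: .*$/\1/p')
    pyrat=$(printf '%s\n' "$1" | sed -n 's/^Saver output: [0-9]* v PyRAT output: \(.*\)$/\1/p')
    [ -n "$saver" ] || return 1
    want=$(expected_pyrat_answer "$saver") || return 1
    [ "$want" = "$pyrat" ]
}
```

For example, `check_line "Saver output: 111 v PyRAT output: Valid"` succeeds, while the same line ending in `Invalid` fails.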
Composition of neural networks (Claim 3)
Relevant paper section: Fig. 2
This artifact shows how one can write specifications of properties that involve the composition of neural networks, and check those specifications.
Automated Graph Editing of Neural Networks (Claim 4)
This artifact shows how CAISAR automatically edits neural networks to integrate part of the specification directly inside them, to circumvent prover limitations (in particular, the restriction to handling a single neural network).
Usage
Requirements
- A Docker installation. The provided Dockerfile was tested against Docker version 26 and higher, on a GNU/Linux host system (in particular, Ubuntu 24 LTS). Different settings may work but may require some tweaks
- About 35GB of space
- A network connection for the initial setup of the Docker image
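The requirements above can be checked before starting the build. The snippet below is a rough sketch with hypothetical helper names. It assumes a POSIX `df`, treats 1 GB as 1024 * 1024 KB, and takes the Docker version as a plain string, for example the output of `docker version --format '{{.Server.Version}}'`.

```shell
# Succeed when the filesystem holding directory $1 has at least $2 GB free
# (1 GB is counted as 1024 * 1024 KB here).
has_free_space() {
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

# Succeed when a Docker version string such as "26.1.3" has a major
# version greater than or equal to $2.
docker_major_at_least() {
    major=${1%%.*}
    [ "$major" -ge "$2" ]
}

# Intended use on the host, before building the image:
# has_free_space . 35 || echo "less than 35GB free" >&2
# docker_major_at_least "$(docker version --format '{{.Server.Version}}')" 26 \
#     || echo "Docker is older than version 26" >&2
```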
Installation
The installation requires installing heavy machine learning libraries like torch inside multiple virtual environments, as well as compiling a moderate-size OCaml project. Expect installation to take between 15 and 20 minutes on a 2024 high-end laptop.
To build the experimental environment, run the following command at the root of the repository:
sudo docker build -t caisar/cav25artifact .
On patching provers
The artifact contains two patches that address technical limitations in the underlying provers without altering their core functionality. One patch for Marabou removes a restriction requiring inputs to have identical shapes, while the other patch for nnenum eliminates a constraint that prevented neural network inputs from appearing multiple times in the computational graph. While this somewhat contradicts our claim that CAISAR works without modifying the underlying provers, these changes are purely focused on the front-end of these tools, and in particular do not affect their theoretical foundations. We are preparing issues and pull requests for both Marabou and nnenum to discuss and incorporate these improvements into their main codebases.
Experiments
All commands in this section are intended to be run inside the Docker container. To start the container, first run the following two commands on the host:
mkdir logs
docker run -it --mount type=bind,source=$(realpath logs),target=/artifact/logs --network="host" caisar/cav25artifact
ACAS-Xu (Claim 1)
For each prover (i.e. Marabou, nnenum and PyRAT), one run is executed for each ACAS-Xu property. The timeout for each run is set to 120 seconds. Reviewers are encouraged to modify the timeout (default 120s) and runs (default 1) variables in the script to better suit their setting. For example, in the paper, runs was set to 3 in order to report the average execution time.
Run the following command to execute the ACAS-Xu benchmark (expected to last around 2 hours):
./scripts/acas_experiments.sh
Expected result: each prover should give the same answer (or Timeout) on both normalized and unnormalized inputs, except for:
- maraboupy on properties 4, 5, and 9, for which it answers with Valid and Invalid,
- nnenum on properties 2 and 6, for which it answers with Valid and Unknown. The Unknown answers are due to a crash in nnenum.
There should be no disagreement on the same property among provers.
Run the following command to generate the file table_solvers_standalone.pdf containing a comprehensive table that summarizes all benchmark results:
./scripts/generate_table.sh
To run a verification query on a single property, which takes much less time, use the following command:
caisar verify -L . caisar/examples/acasxu/acasxu.why -D model_filename:nets/onnx/<onnx> -t 120s -m 8GB --goal <goal> --prover <prover>
where:
- <onnx> is one of:
  - ACAS_1_1.onnx for properties from 1 to 6,
  - ACAS_1_9.onnx for property 7,
  - ACAS_2_9.onnx for property 8,
  - ACAS_3_3.onnx for property 9,
  - ACAS_4_5.onnx for property 10
- <goal> is either:
  - :P$i for verifying queries on normalized inputs
  - :runP$i\'vc for verifying queries on unnormalized inputs

  where $i is an integer between 1 and 10 that corresponds to a property in the ACAS-Xu benchmark.
- <prover> is one of maraboupy, nnenum, PyRAT. When using PyRAT on :P$i goals, add --prover-altern ACAS to the command, while on :runP$i\'vc ones add --prover-altern ACASd.
For instance, to check the first property on normalized inputs with the prover nnenum, use the following:
caisar verify -L . caisar/examples/acasxu/acasxu.why -D model_filename:nets/onnx/ACAS_1_1.onnx -t 120s -m 8GB --goal :P1 --prover nnenum
The -t option controls the timeout allocated to CAISAR, so reducing it will make the experiments shorter (at the risk of having provers time out before providing an answer).
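The choice of network, goal, and PyRAT option described above can be assembled mechanically. The helpers below are a sketch only: `acas_onnx_for_property` and `acas_verify_cmd` are hypothetical names, not part of the artifact. The second one prints the command instead of running it, and placing `--prover-altern` at the end of the command is an assumption.

```shell
# ONNX network used for ACAS-Xu property $1 (an integer from 1 to 10).
acas_onnx_for_property() {
    case "$1" in
        1|2|3|4|5|6) echo "ACAS_1_1.onnx" ;;
        7)  echo "ACAS_1_9.onnx" ;;
        8)  echo "ACAS_2_9.onnx" ;;
        9)  echo "ACAS_3_3.onnx" ;;
        10) echo "ACAS_4_5.onnx" ;;
        *)  echo "property must be between 1 and 10" >&2; return 1 ;;
    esac
}

# Print (without running) the caisar command for property $1, prover $2,
# and input kind $3, which is either "normalized" or "unnormalized".
acas_verify_cmd() {
    onnx=$(acas_onnx_for_property "$1") || return 1
    if [ "$3" = "normalized" ]; then
        goal=":P$1"; altern="ACAS"
    else
        goal=":runP$1\\'vc"; altern="ACASd"   # printed as :runP<i>\'vc
    fi
    cmd="caisar verify -L . caisar/examples/acasxu/acasxu.why -D model_filename:nets/onnx/$onnx -t 120s -m 8GB --goal $goal --prover $2"
    # PyRAT takes an extra option that depends on the kind of goal.
    if [ "$2" = "PyRAT" ]; then
        cmd="$cmd --prover-altern $altern"
    fi
    printf '%s\n' "$cmd"
}
```

Calling `acas_verify_cmd 1 nnenum normalized` prints the example command shown above, which can then be pasted into the container shell.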
SVMs (Claim 2)
Run the command (expected to last around 5 minutes):
./scripts/shsolve.sh
The run should terminate with the following output:
Nb errors: 0
The script accepts an optional parameter giving the number of data points to consider. By default, 10 data points are considered (100 is the maximum accepted). Provide a value lower than 10 to shorten the execution time.
For detailed results, run the command:
csvlook logs/saver_v_pyrat.csv
that should output the following table (time values in the last two columns may vary, and csvlook displays the SAVer code 010 as 10 because it strips leading zeros):
| Instance | SAVer | Pyrat | DurSAVer(ms) | DurPyr(ms) |
| -------- | ----- | ------- | ------------ | ---------- |
| 0 | 111 | Valid | 35 | 19,101 |
| 1 | 10 | Invalid | 49 | 20,740 |
| 2 | 111 | Valid | 53 | 18,552 |
| 3 | 111 | Valid | 41 | 22,001 |
| 4 | 111 | Valid | 45 | 20,464 |
| 5 | 111 | Valid | 47 | 19,300 |
| 6 | 111 | Valid | 43 | 20,546 |
| 7 | 111 | Valid | 48 | 19,802 |
| 8 | 10 | Invalid | 43 | 20,111 |
| 9 | 111 | Valid | 42 | 20,957 |
Composition (Claim 3)
Run the command (expected to last around 1 hour):
./scripts/sequencing_experiments.sh
To shorten the execution time, set the timeout variable at line 5 to a lower value (for instance, 30s for a timeout of 30 seconds), although all provers are then expected to time out.
For the results, run the command:
csvlook logs/sequencing.csv
that should output the following table:
| Prover | Output |
| --------- | ------------------------ |
| maraboupy | Goal sequencing: Invalid |
| nnenum | Goal sequencing: Timeout |
| PyRAT | Goal sequencing: Timeout |
Inspection of Automated Graph Editing of Neural Networks (Claim 4)
For all the experiments above, one can inspect the generated ONNX file using the netron program, which is installed by the Dockerfile. By visualizing the graph of the generated neural network next to the base one, reviewers can check that CAISAR indeed generates modified neural networks.
Run the command:
./scripts/compare_nets.sh
to run a verification query that requires a neural network to be modified. Then run the netron program on the base neural network and on the modified one to compare their ONNX graphs.
On the usability of CAISAR
CAISAR has a user manual available online that describes how CAISAR can be used beyond the scope of reproducing the experiments we presented in the paper. One can for instance check local robustness on MNIST with several provers.
Files
| Name | Size |
|---|---|
| 169_artifact.zip (md5:b01a1fcaf47cf21b09bc88ab560c0746) | 262.4 kB |
Additional details
Funding
- Agence Nationale de la Recherche: DEEPGREEN - Plateforme de Deep Learning open source et indépendante dédiée à l'embarqué (ANR-23-DEGR-0001)
- Agence Nationale de la Recherche: SAIF - Safe AI through Formal Methods (ANR-23-PEIA-0006)
Software
- Repository URL: https://git.frama-c.com/pub/caisar
- Programming language: OCaml, Python
- Development Status: Active