Artifact for CAV submission 169 - The CAISAR Platform: Extending the Reach of Machine Learning Specification and Verification

Alberti, Michele; Bobot, François; Girard-Satabin, Julien Henri-Aurélien; Grastien, Alban; Varasse, Aymeric; Chihani, Zakaria; CEA LIST

doi:10.5281/zenodo.15314337

Published 2025 | Version v7

Software Open

Note: This document is written in Markdown format. For the best viewing experience, please open it with a Markdown reader or editor.

Foreword

This document describes how to reproduce the experiments detailed in the paper submitted to CAV'25.

The artifact is available at the following URL: https://zenodo.org/records/15314337/files/169_artifact.zip?download=1

The SHA256sum of the archive is the following: ad653e47612651bf0381eb05e907cb1f7478ad136380ce3c96cbdc0b68d25222

The md5sum of the archive is the following: b01a1fcaf47cf21b09bc88ab560c0746

The associated DOI is 10.5281/zenodo.15314337
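The two digests above can be checked after download with a few lines of Python. The following is only a sketch: the local file name `169_artifact.zip` and the helper names `file_digest` and `archive_matches` are assumptions made for illustration, not part of the artifact.

```python
import hashlib
from pathlib import Path

# Digests copied from the Foreword of this document.
EXPECTED_SHA256 = "ad653e47612651bf0381eb05e907cb1f7478ad136380ce3c96cbdc0b68d25222"
EXPECTED_MD5 = "b01a1fcaf47cf21b09bc88ab560c0746"


def file_digest(path, algorithm):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_matches(path):
    """Return True when both digests of the file equal the published values."""
    return (file_digest(path, "sha256") == EXPECTED_SHA256
            and file_digest(path, "md5") == EXPECTED_MD5)


if __name__ == "__main__":
    # Hypothetical local file name: adjust to wherever the archive was saved.
    archive = Path("169_artifact.zip")
    if archive.exists():
        print("checksums match" if archive_matches(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

On a GNU/Linux host, the `sha256sum` and `md5sum` command-line tools give the same information.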

License

This artifact is considered part of the CAISAR platform and, as such, shares its original license. The license text is included in the attached LICENSE file.

Content

This archive contains the following files:

- LICENSE
- Dockerfile, which provides a recipe for building CAISAR and its companion provers: nnenum, Marabou, PyRAT and SAVer.
- The patches folder, which contains temporary fixes that let the benchmarks run properly inside Docker:
  - relax-shapes-maraboupy.patch
  - relax-multiple-nn-nnenum.patch

  These are applied automatically when CAISAR is built inside Docker.
- Experiment files under the following folders:
  - acas/acas_experiments.sh
  - sequencing/sequencing_experiments.sh
  - svms/shsolve.sh
- The svms folder, which contains additional data:
  - an SVM model with a linear kernel, linear-1k.ovo
  - 100 data point instances under the svms/singles folder, stored as input_$i.csv
  - theory.param.why, which describes the WhyML specification corresponding to the SAVer ovo setting

Claims addressed by the artifact

Unnormalized `ACAS-Xu` (Claim 1)

Relevant paper section: 3.1

This artifact demonstrates CAISAR's ability to verify both normalized and unnormalized properties of the ACAS-Xu benchmark, using automated graph editing to integrate normalization and denormalization directly into the neural networks.

Verification of SVMs (Claim 2)

Relevant paper section: 3.2

This artifact shows how one can write a specification of adversarial robustness for support vector machines (SVMs) and verify it without a dedicated prover. To do so, it compares the results of the neural network-tailored prover PyRAT with those of the SVM-tailored prover SAVer.

More precisely, the script will output for each instance the following string:

Saver output: XYZ v PyRAT output: ANSW

where XYZ and ANSW should be related as follows:

If XYZ=111 then ANSW must be Valid.
If XYZ=010 then ANSW must be Invalid.
If XYZ=101 then ANSW must be Unknown.
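The three rules above amount to a small lookup table. A minimal sketch of the check in Python follows. The helper names are hypothetical, and treating any SAVer code outside the three listed ones as a disagreement is an assumption, since the text does not say what other codes mean.

```python
# Mapping transcribed from the three rules above.
SAVER_TO_PYRAT = {"111": "Valid", "010": "Invalid", "101": "Unknown"}


def parse_line(line):
    """Split 'Saver output: XYZ v PyRAT output: ANSW' into (XYZ, ANSW)."""
    left, right = line.split(" v ", 1)
    return left.split(":", 1)[1].strip(), right.split(":", 1)[1].strip()


def answers_agree(saver_code, pyrat_answer):
    """Check one (XYZ, ANSW) pair against the rules."""
    # zfill restores a leading zero dropped by tools that print 010 as 10.
    expected = SAVER_TO_PYRAT.get(str(saver_code).strip().zfill(3))
    # Assumption: a code that is not listed above counts as a disagreement.
    return expected is not None and expected == pyrat_answer.strip()


if __name__ == "__main__":
    code, answer = parse_line("Saver output: 111 v PyRAT output: Valid")
    print(code, answer, answers_agree(code, answer))
```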

Composition of neural networks (Claim 3)

Relevant paper section: Fig. 2

This artifact shows how one can write specifications of properties that involve the composition of neural networks, and check those specifications.

Automated Graph Editing of Neural Networks (Claim 4)

This artifact shows how CAISAR automatically edits neural networks to integrate part of the specification directly into them, in order to circumvent prover limitations (in particular, the restriction to handling a single neural network).

Usage

Requirements

- A Docker installation. The provided Dockerfile was tested against Docker version 26 and higher, on a GNU/Linux host system (in particular, Ubuntu 24 LTS). Other settings may work but may require some tweaks.
- About 35GB of disk space.
- A network connection for the initial setup of the Docker image.

Installation

The installation requires installing heavy machine learning libraries like torch inside multiple virtual environments, as well as compiling a moderate-size OCaml project. Expect installation to take between 15 and 20 minutes on a 2024 high-end laptop.

To build the experimental environment, run the following command at the root of the repository:

sudo docker build -t caisar/cav25artifact .

On patching provers

The artifact contains two patches that address technical limitations in the underlying provers without altering their core functionality. One patch for Marabou removes a restriction requiring inputs to have identical shapes, while the other patch for nnenum eliminates a constraint that prevented neural network inputs from appearing multiple times in the computational graph. While this somewhat contradicts our claim that CAISAR works without modifying the underlying provers, these changes are purely focused on the front-end of these tools, and in particular do not affect their theoretical foundations. We are preparing issues and pull requests for both Marabou and nnenum to discuss and incorporate these improvements into their main codebases.

Experiments

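All commands in this section are intended to be run inside the Docker container. They should all be preceded by the following two commands, which create a logs directory and start the container with that directory mounted at /artifact/logs: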

mkdir logs

docker run -it --mount type=bind,source=$(realpath logs),target=/artifact/logs --network="host" caisar/cav25artifact

`ACAS-Xu` (Claim 1)

For each prover (i.e. Marabou, nnenum and PyRAT), one run is executed for each ACAS-Xu property. The timeout for each run is set to 120 seconds. Reviewers are encouraged to adjust the timeout (default 120s) and runs (default 1) variables in the script to better suit their setting. For example, in the paper, runs was set to 3 so that the average execution time could be reported.

Run the following command to execute the ACAS-Xu benchmark (expected to last around 2 hours):

./scripts/acas_experiments.sh
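The two-hour estimate is consistent with a worst case in which every run hits the timeout. The sketch below works through that arithmetic under stated assumptions: runs are executed sequentially, there are ten ACAS-Xu properties, and each property is checked in two variants (normalized and unnormalized). The function name is hypothetical.

```python
def worst_case_seconds(provers, properties, variants, runs, timeout_s):
    """Upper bound on total time if every single run reaches the timeout."""
    return provers * properties * variants * runs * timeout_s


if __name__ == "__main__":
    # Defaults described above: 3 provers, timeout of 120 s, 1 run.
    bound = worst_case_seconds(provers=3, properties=10, variants=2,
                               runs=1, timeout_s=120)
    print(f"{bound} s = {bound / 3600:.1f} h")  # 7200 s = 2.0 h
```

Under the same assumptions, setting runs to 3 as in the paper raises the bound to 21600 seconds, that is 6 hours.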

Expected result: each prover should give the same answer (or Timeout) on normalized and unnormalized inputs, except for:

- maraboupy on properties 4, 5, and 9, for which the two answers are Valid and Invalid,
- nnenum on properties 2 and 6, for which the two answers are Valid and Unknown. The Unknown answers are due to a crash in nnenum.

There should be no disagreement on the same property among provers.

Run the following command to generate the file table_solvers_standalone.pdf containing a comprehensive table that summarizes all benchmark results:

./scripts/generate_table.sh

To run a verification query on a single property, which takes much less time, use the following command:

caisar verify -L . caisar/examples/acasxu/acasxu.why -D model_filename:nets/onnx/<onnx> -t 120s -m 8GB --goal <goal> --prover <prover>

where:

<onnx> is one of:
- ACAS_1_1.onnx for properties 1 to 6,
- ACAS_1_9.onnx for property 7,
- ACAS_2_9.onnx for property 8,
- ACAS_3_3.onnx for property 9,
- ACAS_4_5.onnx for property 10.
<goal> is either:
- :P$i for verifying queries on normalized inputs
- :runP$i\'vc for verifying queries on unnormalized inputs
where $i is an integer between 1 and 10 that corresponds to a property in the ACAS-Xu benchmark.
<prover> is one of maraboupy, nnenum, PyRAT. When using PyRAT on :P$i goals, add --prover-altern ACAS to the command, while on :runP$i\'vc ones add --prover-altern ACASd.

For instance, to check the first property on normalized inputs with the prover nnenum, use the following:

caisar verify -L . caisar/examples/acasxu/acasxu.why -D model_filename:nets/onnx/ACAS_1_1.onnx -t 120s -m 8GB --goal :P1 --prover nnenum

The -t option controls the timeout allocated to CAISAR, so reducing it will make the experiments shorter (at the risk of provers timing out before providing an answer).
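The choice of network, goal name and extra PyRAT option described above can be captured in a small helper that assembles the command line. This is a sketch only: the function and table names are hypothetical, and every flag and value is copied from the command template above rather than from the CAISAR manual.

```python
import shlex

# Network file per property, as listed above.
NETWORK_FOR_PROPERTY = {
    **{i: "ACAS_1_1.onnx" for i in range(1, 7)},
    7: "ACAS_1_9.onnx",
    8: "ACAS_2_9.onnx",
    9: "ACAS_3_3.onnx",
    10: "ACAS_4_5.onnx",
}
PROVERS = ("maraboupy", "nnenum", "PyRAT")


def caisar_command(prop, prover, normalized=True, timeout="120s", memory="8GB"):
    """Build the argument list for one ACAS-Xu verification query."""
    if prop not in NETWORK_FOR_PROPERTY:
        raise ValueError(f"property must be between 1 and 10, got {prop}")
    if prover not in PROVERS:
        raise ValueError(f"unknown prover {prover!r}")
    # The shell form :runP$i\'vc denotes the literal goal name :runP<i>'vc.
    goal = f":P{prop}" if normalized else f":runP{prop}'vc"
    argv = ["caisar", "verify", "-L", ".", "caisar/examples/acasxu/acasxu.why",
            "-D", f"model_filename:nets/onnx/{NETWORK_FOR_PROPERTY[prop]}",
            "-t", timeout, "-m", memory, "--goal", goal, "--prover", prover]
    if prover == "PyRAT":
        argv += ["--prover-altern", "ACAS" if normalized else "ACASd"]
    return argv


if __name__ == "__main__":
    # shlex.join quotes the apostrophe in the goal name for safe pasting.
    print(shlex.join(caisar_command(1, "nnenum")))
    print(shlex.join(caisar_command(7, "PyRAT", normalized=False)))
```

Returning an argument list rather than a string avoids shell-quoting mistakes with the apostrophe in the unnormalized goal names if the command is later run through `subprocess.run`.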

SVMs (Claim 2)

Run the command (expected to last around 5 minutes):

./scripts/shsolve.sh

The run should terminate with the following output:

Nb errors: 0

The script accepts an optional argument giving the number of data points to consider. By default, 10 data points are considered (100 is the maximum accepted). Provide a value lower than 10 to shorten the execution time.

For detailed results, run the command:

csvlook logs/saver_v_pyrat.csv

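that should output the following table (time values in the last two columns may vary; the SAVer code 010 is displayed as 10, presumably because csvlook renders the column as a number):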

| Instance | SAVer | Pyrat   | DurSAVer(ms) | DurPyr(ms) |
| -------- | ----- | ------- | ------------ | ---------- |
|        0 |   111 | Valid   |           35 |     19,101 |
|        1 |    10 | Invalid |           49 |     20,740 |
|        2 |   111 | Valid   |           53 |     18,552 |
|        3 |   111 | Valid   |           41 |     22,001 |
|        4 |   111 | Valid   |           45 |     20,464 |
|        5 |   111 | Valid   |           47 |     19,300 |
|        6 |   111 | Valid   |           43 |     20,546 |
|        7 |   111 | Valid   |           48 |     19,802 |
|        8 |    10 | Invalid |           43 |     20,111 |
|        9 |   111 | Valid   |           42 |     20,957 |
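The same table can be checked programmatically. The sketch below counts rows where the SAVer code and the PyRAT answer disagree and averages the two duration columns. It assumes that the raw CSV uses the column names shown in the csvlook header and stores durations as plain numbers; neither assumption is confirmed by the artifact description, and the function name is hypothetical.

```python
import csv
import io

# Expected PyRAT answer for each SAVer code (from the Claim 2 description).
EXPECTED = {"111": "Valid", "010": "Invalid", "101": "Unknown"}


def summarize(csv_text):
    """Return (mismatches, mean SAVer ms, mean PyRAT ms) for a results CSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("no data rows found")
    mismatches = 0
    saver_ms = pyrat_ms = 0.0
    for row in rows:
        # zfill restores the leading zero if the code was stored as 10.
        code = row["SAVer"].strip().zfill(3)
        if EXPECTED.get(code) != row["Pyrat"].strip():
            mismatches += 1
        # Strip thousands separators in case durations were stored formatted.
        saver_ms += float(row["DurSAVer(ms)"].replace(",", ""))
        pyrat_ms += float(row["DurPyr(ms)"].replace(",", ""))
    return mismatches, saver_ms / len(rows), pyrat_ms / len(rows)


if __name__ == "__main__":
    # First two rows of the table above, in the assumed raw CSV layout.
    sample = ("Instance,SAVer,Pyrat,DurSAVer(ms),DurPyr(ms)\n"
              "0,111,Valid,35,19101\n"
              "1,010,Invalid,49,20740\n")
    print(summarize(sample))
```

To run it on real results, pass the contents of logs/saver_v_pyrat.csv to `summarize`; a mismatch count of 0 corresponds to agreement between the two provers on every instance.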

Composition (Claim 3)

Run the command (expected to last around 1 hour):

./scripts/sequencing_experiments.sh

To shorten the execution time, set the timeout variable at line 5 of the script to a lower value (for instance, 30s for a timeout of 30 seconds), although all provers are then expected to time out.

For the results, run the command:

csvlook logs/sequencing.csv

that should output the following table:

| Prover    | Output                   |
| --------- | ------------------------ |
| maraboupy | Goal sequencing: Invalid |
| nnenum    | Goal sequencing: Timeout |
| PyRAT     | Goal sequencing: Timeout |

Inspection of Automated Graph Editing of Neural Networks (Claim 4)

For all of the experiments above, the generated ONNX file can be inspected with the netron program, included in the Dockerfile. By comparing the graph of the generated neural network with that of the base one, reviewers can check that CAISAR does generate modified neural networks.

Run the command:

./scripts/compare_nets.sh

to run a verification query that requires a neural network to be modified. Then run the netron program on the base neural network and on the modified one to compare their ONNX graphs.

On the usability of `CAISAR`

CAISAR has a user manual available online that describes how CAISAR can be used beyond the scope of reproducing the experiments we presented in the paper. One can for instance check local robustness on MNIST with several provers.

Files

169_artifact.zip (262.4 kB), md5:b01a1fcaf47cf21b09bc88ab560c0746

Additional details

Funding

- Agence Nationale de la Recherche: DEEPGREEN - Plateforme de Deep Learning open source et indépendante dédiée à l'embarqué (ANR-23-DEGR-0001)
- Agence Nationale de la Recherche: SAIF - Safe AI through Formal Methods (ANR-23-PEIA-0006)

Software

Repository URL: https://git.frama-c.com/pub/caisar
Programming language: OCaml, Python
Development Status: Active
