---
title: "What are the Odds? Artifact"
author: Patrick Wienhöft
geometry: margin=2cm
date: \vspace{-1em}
header-includes:
  - \usepackage{lmodern}
---

This is short documentation for the implementation referenced in the paper "What Are the Odds? Improving the Foundations of Statistical Model Checking".

# Data from the paper

Our artifact contains a folder `qest25-paper-data`.
This has all the data we obtained when running the experiments in the way described below.

# Setup

We recommend running the experiments in this artifact with the provided Docker image, as explained below.
All you require is an installation of [Docker](https://www.docker.com/).
If you do **not** want to use the Docker container and prefer a native execution, instructions can be found at the end of this readme.

We build upon the [Stormpy Docker image](https://hub.docker.com/r/movesrwth/stormpy), which comes with prebuilt Stormpy bindings, so there should be no need for you to install Storm and/or Stormpy.

To get started, unzip the archive in this artifact and build the Docker image like this:

```sh
tar -xf artifact.zip
cd artifact
docker build -t wato .
```

This might take a couple of minutes when running for the first time.

## Troubleshooting

**<span style="color: #71add8;"> NOTE: </span>** If you are using an Apple system with an Apple silicon chip (i.e. M1/M2...), you may encounter compatibility issues when running the Docker image built for x86 (Intel/AMD) architectures. To ensure compatibility, explicitly specify the platform during the build process by running:
```sh
docker build --platform=linux/amd64 -t wato .
```
This tells Docker to emulate the amd64 (Intel/AMD) architecture instead of the native ARM architecture used by Apple Silicon. In general, all `docker build` and `docker run` commands should include the `--platform=linux/amd64` flag.

**<span style="color: #71add8;"> NOTE: </span>** If you are using a Windows system, you will want to build and run the artifact in a PowerShell terminal to access several important utilities. Git Bash (MinGW) will likely not work due to path conversion issues with Docker.

**<span style="color: #71add8;"> NOTE: </span>** In our testing, Windows users sometimes experienced issues with differences in line endings between UNIX and Windows. To fix this, you can convert the line endings using
```sh
dos2unix <problematic_file>
```
Specifically, if you encounter the error
```sh
cannot access 'models/probabilistic-models-1.0/bin/probabilistic-models'$'\r': No such file or directory
models/probabilistic-models-1.0/bin/probabilistic-models: 80: Syntax error: newline unexpected (expecting ")")
```
run
```sh
dos2unix .\src\models\probabilistic-models-1.0\bin\probabilistic-models
```
and if you encounter
```sh
: not foundy.sh: 3:
```
run
```sh
dos2unix .\src\docker_entry.sh
```
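
If `dos2unix` is not available on your system, the same conversion can be sketched in Python. This is a minimal sketch and not part of the artifact; the helper names `to_unix_line_endings` and `convert_file` are hypothetical.

```python
from pathlib import Path


def to_unix_line_endings(data: bytes) -> bytes:
    """Replace Windows (CRLF) line endings with UNIX (LF) line endings."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path: str) -> None:
    """Rewrite the file at `path` in place with UNIX line endings."""
    file = Path(path)
    file.write_bytes(to_unix_line_endings(file.read_bytes()))


# Example (hypothetical usage, run from the `artifact` directory):
# convert_file("src/docker_entry.sh")
```

The file is read and written as bytes so that no other characters are re-encoded along the way.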

# Smoke test

After building the Docker image as explained in [Setup](#setup) above, you can run the container in detached mode and then attach a terminal to it. Depending on your OS, run:

Unix (Linux/Mac without M1/M2 chip)

```sh
docker run --name wato -v $(pwd)/src:/app/src -d wato sleep infinity
```

Apple silicon chip (M1/M2)

```sh
docker run --name wato -v $(pwd)/src:/app/src --platform=linux/amd64 -d wato sleep infinity
```

Windows (PowerShell)

```sh
docker run --name wato -v ${PWD}/src:/app/src -d wato sleep infinity
```

Afterwards, run

```sh
docker ps
# You should see something like this:

# CONTAINER ID   IMAGE     COMMAND            CREATED         STATUS         PORTS     NAMES
# 7f6bf2208fd5   wato      "sleep infinity"   3 seconds ago   Up 3 seconds             wato
docker exec -it wato /bin/bash
```

Your command line should now show only `root@<CONTAINER ID>:/app/src#` on the new line. To verify that everything is set up correctly, run

```sh
python --version
# Python 3.8.2
java --version
# openjdk 17.0.14 2025-01-21
# OpenJDK Runtime Environment (build 17.0.14+7-Ubuntu-120.04)
# OpenJDK 64-Bit Server VM (build 17.0.14+7-Ubuntu-120.04, mixed mode, sharing)
storm --version
# Storm 1.6.3
# ...
```

Finally, you can execute our experiments on the `consensus` example and exit the terminal:

```sh
python learn.py MDP --mdpfile models/PRISM/consensus.2-2.prism --property disagree --logfile consensus-2-2.dat --epsilon 0.3 --full
exit
```

**<span style="color: #71add8;"> NOTE: </span>** The output is intended to update continuously to show progress. This is implemented using some special control characters which may not be handled properly by some terminal applications. If this is the case, you will unfortunately not properly see the progress and have to simply trust the process. This only affects the terminal output but not the logged results.

The output should look like this (numbers may vary):

```
Pre-processing MDP...

Invocation: process --model models/PRISM/consensus.2-2.prism --properties models/PRISM/consensus.props --property disagree --collapse MEC --output models/PRISM/transformed/consensus.2-2_mec
MDP was successfully built!

Desired precision: 0.3
Starting sampling process...

baseline achieved precision 0.29362351385742363 -- done in 38236 sample runs
full achieved precision 0.29370551126700317 -- done in 14432 sample runs
cp achieved precision 0.29428884823759577 -- done in 16016 sample runs
small_supp achieved precision 0.29294045784371014 -- done in 26400 sample runs
independence achieved precision 0.297912650401856 -- done in 14960 sample runs
chains achieved precision 0.2987757335864202 -- done in 15456 sample runs
nwr achieved precision 0.2963400374790059 -- done in 14608 sample runs
structure achieved precision 0.2996023197173313 -- done in 15272 sample runs

See logs/logs/ablation/consensus-2-2/consensus-2-2_1.dat for logs
Done!
```

Finally, the CSV file `artifact/src/logs/logs/consensus-2-2.dat` should have been created on your system. It contains the same numbers as shown in the command line output (in particular column 3 `num_runs` and column 5 `epsilon`).
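
To compare the log file with the terminal output without opening it by hand, a short script along the following lines can print the two relevant columns. This is a sketch under assumptions: it assumes that the file is comma-separated and starts with a header row containing `num_runs` and `epsilon`. The helper name `runs_and_precision` and the placeholder column names `c1`, `c2`, `c4` are hypothetical, and the example rows only mimic the sample output above.

```python
import csv
import io


def runs_and_precision(csv_text: str):
    """Return one (num_runs, epsilon) pair per data row of the log."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(int(float(row["num_runs"])), float(row["epsilon"])) for row in reader]


# Example input in the assumed layout (placeholder names for the other columns)
example = (
    "c1,c2,num_runs,c4,epsilon\n"
    "x,x,38236,x,0.2936\n"
    "x,x,14432,x,0.2937\n"
)
for num_runs, epsilon in runs_and_precision(example):
    print(num_runs, epsilon)
```

To use it on a real log, pass the contents of the log file instead of `example`.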

To clean up, remove the container by running the following command; otherwise it will continue running in the background indefinitely.

```sh
docker rm -f wato
```

## Full experiments

### Quick start

The easiest (but least customizable) way to obtain our results is:

```sh
tar -xf artifact.zip
cd artifact
docker rm -f wato
docker build -t wato .
docker run --name wato -v $(pwd)/src:/app/src -it wato
```
Again, with Apple silicon chips, instead run

```sh
docker build --platform=linux/amd64 -t wato .
docker run --name wato -v $(pwd)/src:/app/src --platform=linux/amd64 -it wato
```

And for Windows

```sh
docker build -t wato .
docker run --name wato -v ${PWD}/src:/app/src -it wato
```

Running this will take a couple of hours, especially for the `firewire_dl` and `wlan_dl` models.

This will run our ablation study (see Tables 2 and 4 in the paper) with one repetition (as opposed to 10 in the paper). For each model you will find a log file in `logs/logs/ablation/<model>/<model>_1.dat` as well as `logs/logs/ablation/<model>.dat` which is an average of all the runs per problem instance (which is the same here since we only did one run), all of which are in CSV format. The data for Table 3 is also contained in the CSV. For Table 1, there should be `logs/logs/ablation/average.dat` which contains the minimum, average (geometric mean), and maximum values of all the `improvement_factor` columns for the models.
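
For reference, the three summary statistics mentioned above can be computed as follows. This is a minimal sketch and not the artifact's own averaging script; the function name `summarize` and the input values are illustrative only.

```python
from statistics import geometric_mean


def summarize(improvement_factors):
    """Return (minimum, geometric mean, maximum) of positive factors."""
    values = list(improvement_factors)
    return min(values), geometric_mean(values), max(values)


# Illustrative values only, not results from the paper
low, mean, high = summarize([1.0, 2.0, 4.0])
print(low, mean, high)
```

The geometric mean is the natural average for ratios such as improvement factors, since a factor of 2 and a factor of 1/2 cancel out to 1.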

### Inspecting results

The values in the CSV files directly correspond to the values in our tables.
However, as this may be cumbersome to compare, we also create, for convenience, the ``\LaTeX`` file `src/logs/logs/tables.tex`. This is a full ``\LaTeX`` document which you can build into a PDF using your favourite (locally installed) ``\TeX`` compiler. The resulting tables should match the tables in our paper.

### Runtime comparison

When comparing the runtimes in the CSV files (i.e. the ones shown in Table 4), you will probably notice that **the shown runtime is significantly lower than the runtime you experienced**. **This is intended**. We briefly explain the reason for this (which is also included in the paper): the general outline of a model-based SMC algorithm is to take a model as well as a precision value ``0<\varepsilon<1`` as input, and then

- Gather samples
- Compute confidence interval for value
- If confidence interval size is ``\leq \varepsilon`` terminate, otherwise repeat

The question of how many samples are gathered before a confidence interval is computed is orthogonal to the problem we investigate in our paper. In order to not have this influence our runtime analysis, we exclude the time required for computing confidence intervals that do not lead to termination. Computing these confidence intervals is however usually the most expensive part of model-based SMC. In our experiments the models require ~100 iterations of this loop leading to these large discrepancies between the given and experienced runtimes.
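
The loop and the way its runtime is reported can be sketched as follows. This is only an illustration under stated assumptions and not the implementation in the artifact: it estimates a single Bernoulli probability instead of an MDP value, it uses a Hoeffding interval as a stand-in for the confidence intervals from the paper, it assumes that sampling time is always counted, and the names `hoeffding_width` and `smc_loop` as well as the batch size are hypothetical.

```python
import math
import random
import time


def hoeffding_width(num_samples: int, delta: float) -> float:
    """Width of a two-sided Hoeffding interval for a mean in [0, 1]."""
    return 2.0 * math.sqrt(math.log(2.0 / delta) / (2.0 * num_samples))


def smc_loop(sample, epsilon, delta=0.01, batch_size=50):
    """Sample in batches until the interval width is at most epsilon.

    Returns (estimate, width, num_samples, reported_time). The reported time
    omits interval computations that did not terminate the loop.
    """
    successes, num_samples, reported_time = 0, 0, 0.0
    while True:
        # Step 1: gather samples (assumed to always count towards the runtime)
        start = time.perf_counter()
        successes += sum(sample() for _ in range(batch_size))
        num_samples += batch_size
        reported_time += time.perf_counter() - start

        # Step 2: compute the confidence interval and time it separately
        start = time.perf_counter()
        width = hoeffding_width(num_samples, delta)
        interval_time = time.perf_counter() - start

        # Step 3: terminate if precise enough; only this last interval counts
        if width <= epsilon:
            reported_time += interval_time
            return successes / num_samples, width, num_samples, reported_time


rng = random.Random(0)
estimate, width, num_samples, reported = smc_loop(lambda: rng.random() < 0.3, epsilon=0.3)
print(num_samples, width)
```

In this toy setting the interval computation is cheap, so the difference is negligible; for model-based SMC the excluded computations dominate the experienced runtime.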

### Customizing experiments

#### More runs

If you want to average over 10 runs (like in our paper), you can modify the bash script `src/docker_entry.sh` to call `main.py 10` instead of `main.py 1` (or whatever other number of repetitions you prefer). You will have to rebuild the Docker image after making these changes, i.e. run

```sh
docker rm -f wato
docker build -t wato .
docker run --name wato -v $(pwd)/src:/app/src -it wato
```

As before, adapt the commands if you use an M1/M2 chip or Windows.

#### Restricting examples

The long runtime of the experiments is largely due to the `wlan_dl` and `firewire_dl` models, which took around 5 hours and 1 hour per run, respectively, on our native setup. If you want to recreate our results without these examples, you can comment out or delete the respective lines in `src/main.py` (lines 13 and 10). Again, you will have to rebuild the Docker image as above.

#### Custom examples

If you want to change the parameters of our experiments (e.g. other properties or other values of epsilon), you can directly edit the flags in `src/main.py` and rebuild as above.

Another possibility, which also allows you to use custom MDP files in the `.prism` format, is to again start a terminal inside the Docker container like this (skipping the first line if you still have a container running; and adapting it if you're on an M1/M2 or Windows):

```sh
docker run --name wato -v $(pwd)/src:/app/src -d wato sleep infinity
docker exec -it wato /bin/bash
```

Once inside the container, run

```sh
cd src  # only needed if you are not already in /app/src
python learn.py MDP --mdpfile models/PRISM/<model>.prism --property <prop> --logfile <logfile> --epsilon <precision> [--full] [--minimization]
```

Here, replace `<model>` with the name of your model, `<prop>` with the name of your property, and `<precision>` with your desired precision.

If your property is a minimization property (i.e. `Pmin=? [...]` in the PRISM language), you **must** add the `--minimization` flag to obtain correct results.

If the flag `--full` is set, an ablation study is performed. Otherwise, only two runs are done: the baseline and one with all our improvements.

Lastly, `<logfile>` specifies the name of the logfile. The path of the logfile will be `logs/logs/ablation/<logfile>` if `--full` is set, and `logs/logs/results/<logfile>` otherwise. We recommend following the structure of our examples and specifying `logs/logs/ablation/<model>.dat` if you want a single run, or `logs/logs/ablation/<model>/<model>_i.dat` for the i-th run if you do multiple runs. For the latter, make sure to create the folder `logs/logs/ablation/<model>` beforehand.
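
For multiple runs, the commands and the required folder can be generated with a small helper like the one below. This is a sketch, not part of the artifact: the names `run_command` and `log_directory` are hypothetical, and it assumes (as in the smoke test commands of this readme) that the value passed to `--logfile` is interpreted relative to `logs/logs/ablation` when `--full` is set.

```python
from pathlib import Path


def run_command(model, log_name, prop, epsilon, run_index, minimization=False):
    """Build the `learn.py` argument list for the i-th ablation run."""
    command = [
        "python", "learn.py", "MDP",
        "--mdpfile", f"models/PRISM/{model}.prism",
        "--property", prop,
        "--logfile", f"{log_name}/{log_name}_{run_index}.dat",
        "--epsilon", str(epsilon),
        "--full",
    ]
    if minimization:
        command.append("--minimization")
    return command


def log_directory(log_name):
    """Folder that has to exist before the runs for `log_name` start."""
    return Path("logs/logs/ablation") / log_name


# Example: print the commands for three repetitions of the smoke test setup
for i in range(1, 4):
    print(" ".join(run_command("consensus.2-2", "consensus-2-2", "disagree", 0.3, i)))
```

To actually execute the runs, one could first call `log_directory("consensus-2-2").mkdir(parents=True, exist_ok=True)` and then pass each list to `subprocess.run(command, check=True)` from within the `src` directory.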

When following this structure you can average and display the results by running

```sh
cd logs/logs/ablation
python average_per_model.py  # <-- skip this if you only have a single run per model
python average.py
cd ..
python to_tex_table.py
```

## Reproducing Figures 1 and 5

Finally, Figures 1 and 5 of our paper can be reproduced inside the Docker container by running the following (again adapting the `docker run` command to M1/M2/Windows if necessary)

```sh
docker run --name wato -v $(pwd)/src:/app/src -d wato sleep infinity
docker exec -it wato /bin/sh
python sample_complexity_plots.py
python sample_complexity_plots_3d.py
```

This creates `ratio_phat_delta0.01.csv` and `ratio_eps_delta0.01.csv` directly inside the `src` directory, which we used to create Figure 1 in ``\LaTeX``. For convenience, we also use matplotlib to produce `.png` files with the same names, plotting the same data. For the 3D plot of Figure 5, we directly use the matplotlib-generated image `ratio_3d.png` from the same directory.

# Native Python Installation

To run the artifact natively, you need a working Python installation (we used Python 3.9), as well as an installation of the [Storm model checker](https://www.stormchecker.org/) with [Stormpy bindings](https://moves-rwth.github.io/stormpy/installation.html). If you rely on specific versions of certain Python packages for other projects, we suggest you set up a [virtual environment](https://docs.python.org/3/library/venv.html) for this artifact. Using this setup, you can perform the experiments via the same steps as in the Docker container.

For the smoke test you can run
```sh
tar -xf artifact.zip
cd artifact/src
python learn.py MDP --mdpfile models/PRISM/consensus.2-2.prism --property disagree --logfile consensus-2-2/consensus-2-2_1.dat --epsilon 0.3 --full
```

And to reproduce the full data set:
```sh
tar -xf artifact.zip
cd artifact/src
python main.py 1
# alternatively, if you want x repetitions, run 'python main.py x'
cd logs/logs/ablation
python average_per_model.py
python average.py
cd ..
python to_tex_table.py
```

