
# Artifact for the OOPSLA'20 paper "Regex Matching with Counting-Set Automata"

Lukáš Holík; Ondřej Lengál; Olli Saarikivi; Lenka Turoňová; Margus Veanes; Tomáš Vojnar

### Dublin Core Export

<?xml version='1.0' encoding='utf-8'?>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:creator>Lukáš Holík</dc:creator>
<dc:creator>Ondřej Lengál</dc:creator>
<dc:creator>Olli Saarikivi</dc:creator>
<dc:creator>Lenka Turoňová</dc:creator>
<dc:creator>Margus Veanes</dc:creator>
<dc:creator>Tomáš Vojnar</dc:creator>
<dc:date>2020-08-07</dc:date>
<dc:description>Artifact for the paper "Regex Matching with Counting-Set Automata" (OOPSLA'20)

This is an artifact for the paper "Regex Matching with Counting-Set Automata" at OOPSLA'20.

This artifact is intended to be run on the virtual machine Artifact Evaluation VM - Ubuntu 18.04 LTS available at http://doi.org/10.5281/zenodo.2759473 . The recommended virtualization software is VirtualBox (we used version 6.1.12).

Please make sure that at least 30 GiB of disk space is available on your computer for the VM (the disk image will grow automatically). Be warned that running the full experiments on 1 CPU may take tens of hours and may cause your computer (a laptop in particular) to get hot, possibly to the point of overheating and shutting down.

Note: see the file ~/howto_vbox_shared_folder.txt for how to set up a shared folder between the host and the guest OS (it is simple). A shared folder makes transferring files to and from the VM easier.

Getting Started

Preparing VM

Download the VM from http://doi.org/10.5281/zenodo.2759473 and import it into VirtualBox. We recommend at least 8 GiB of memory per CPU; 4 GiB might also work, though some experiments may then terminate early because they run out of memory. If you allocate more CPUs, the benchmarks will run in parallel. It is also a good idea to avoid other demanding tasks on your host OS while the experiments are running, otherwise the two OSes will compete for RAM.
Start the VM, open Terminal (in the left bar), enable the network connection, and download the artifact zip file.

OR:

Start the VM, open Terminal (in the left bar), and mount the shared folder according to ~/howto_vbox_shared_folder.txt.
Copy the artifact zip file from the shared folder to $HOME. Then run the following:

unzip &lt;artifact&gt;.zip
cd &lt;artifact&gt;/

Installing Packages

Go to the root directory of the artifact and run

sudo ./install_requirements.sh

Take a walk (~20 minutes).

Some issues might be reported while installing some packages (some nasty stuff happens due to the need to update libc). These issues should not matter, since the installed tools can still be used.

Preparing the Benchmarks

Download the dataset from https://doi.org/10.5281/zenodo.3974360 , unzip it, and move its contents to the right location (you may need to enable the network connection).

unzip benchmark-cnt-set-automata.zip
mv benchmark-cnt-set-automata/bench/* run/

Kicking the Tires

The following sequence of commands should check that everything is working: it runs a small subset of the experiments and generates a preliminary report.

cd run/
./make_short.sh               (prepares short version of experiments)
./run_short_benchmarks.sh
...
(take a walk ~20 mins)
...
./run_short_processing.sh
cd ../results
./generate-report.R
firefox results.html

You should see a web page with incomplete results of the experiments (consider increasing the resolution of the VM).

Step by Step Instructions

Running the Full Experiments

cd run/
./run_benchmarks.sh

Take a long walk, possibly a trip to Paris or any other place that you have always wanted to visit: this may take a few tens of hours, depending on your setup, so you may even manage to leave the quarantine before the experiments finish ;-) Seriously, it might take two or three days. You can, however, save the state of the VM and restore it later to continue with the experiments. To obtain partial results faster, you can change the timeout in run/run_benchmarks.sh or remove some lines from the run/bench-*.txt files.
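
One possible way to remove lines from the run/bench-*.txt files is sketched below in Python. It is not part of the artifact. It assumes that each line of a benchmark file holds exactly one benchmark and that the script is started from the run/ directory. The sampling factor of 10 and the ".orig" backup suffix are arbitrary choices made for this illustration.

```python
import glob
import os
import shutil


def keep_every_nth(lines, n):
    """Keep lines 0, n, 2n, ... of a list of benchmark lines."""
    return lines[::n]


def reduce_file(path, n):
    """Shrink one benchmark file in place, keeping a backup of the original."""
    backup = path + ".orig"  # hypothetical backup name, not used by the artifact
    if not os.path.exists(backup):
        shutil.copy(path, backup)
    # Always read from the backup so that repeated runs give the same result.
    # Binary mode avoids guessing the encoding of the patterns.
    with open(backup, "rb") as handle:
        lines = handle.readlines()
    with open(path, "wb") as handle:
        handle.writelines(keep_every_nth(lines, n))


if __name__ == "__main__":
    for path in sorted(glob.glob("bench-*.txt")):
        reduce_file(path, 10)
```

To restore the full benchmark set, copy each ".orig" file back over the corresponding bench-*.txt file.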

Processing the Results of Experiments

Before viewing the results, we recommend changing the resolution of the VM to a higher one.

(in run/)
./run_processing.sh

cd ../results/
./generate-report.R
firefox results.html

Supported Claims

The artifact reproduces the following parts of the paper:

Fig. 5
Table 1

Since the machine running the artifact will most probably differ from the one we used to run the experiments, exact times, numbers of timeouts, etc. will likely differ as well, but the trends should stay the same.

Extra Notes

Installing Outside of the Provided VM

It should not be difficult to set up the environment on a Linux OS reasonably close to the one in the referenced VM. The required Linux packages are:

python3
R
pandoc
libre2-dev
grep
mono (version at least 5.*)

Python packages:

pyyaml
tabulate

R packages:

rmarkdown
knitr
ggplot2
ggExtra
gridExtra
pastecs

You can follow the commands in the installation script to see what needs to be done.
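
As a rough pre-flight check, the following Python sketch reports which of the listed tools and Python packages cannot be found. It is not part of the artifact. The command names Rscript and mono are assumptions about what the R and mono packages install, and the sketch checks neither libre2-dev nor the R packages.

```python
import importlib.util
import shutil

# Executables assumed to be provided by the Linux packages listed above;
# the exact command names (Rscript, mono) are assumptions.
COMMANDS = ["python3", "Rscript", "pandoc", "grep", "mono"]
# The pyyaml package is imported as "yaml"; tabulate keeps its own name.
MODULES = ["yaml", "tabulate"]


def missing_commands(commands):
    """Return the commands that are not found on the PATH."""
    return [name for name in commands if shutil.which(name) is None]


def missing_modules(modules):
    """Return the Python modules that cannot be located for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("missing commands:", missing_commands(COMMANDS))
    print("missing Python modules:", missing_modules(MODULES))
```

Two empty lists mean that every checked item was found. The installation script remains the authoritative reference.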

Running Other Experiments

The experiments to run are stored in the run/bench-*.txt files, in the CSV-like format pattern;input-file, where pattern can use escape characters as used in CSVs (compatible with Python's csv module). If you have a file FILE with your own benchmarks, you can run the following command in the run/ directory:

cat FILE | ./pycobench -t TIMEOUT -o OUTPUT pattern_match.yaml

where TIMEOUT is the timeout (in seconds) and OUTPUT is a file that logs results of experiments. See ./pycobench -h for more details. ./pycobench by default runs every benchmark (i.e. a line in FILE) with all regex matchers as defined in run/pattern_match.yaml (the default definition runs them in the mode where they count the number of matching lines).
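
A benchmark file in the pattern;input-file format could be produced with Python's csv module, as in the minimal sketch below. The patterns and the input path are made up for illustration. Whether ./pycobench expects exactly the default quoting of the csv module combined with a ";" delimiter is an assumption based on the description above.

```python
import csv
import io


def format_benchmarks(pairs):
    """Return text with one 'pattern;input-file' line per benchmark."""
    buffer = io.StringIO()
    # The semicolon delimiter follows the pattern;input-file format; the csv
    # module quotes a pattern only if it contains ';', '"' or a line break.
    writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
    for pattern, input_file in pairs:
        writer.writerow([pattern, input_file])
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical patterns and input path, for illustration only.
    benchmarks = [
        ("a{5,10}b", "inputs/sample.txt"),
        ("x;y{3}", "inputs/sample.txt"),
    ]
    print(format_benchmarks(benchmarks), end="")
```

The printed text can be saved as FILE and passed to the command above. Note that the second pattern is quoted because it contains the delimiter.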

When the command finishes, you need to process the output to collect the runtimes and numbers of matches into a format with a single line for every benchmark, using the following command:

cat OUTPUT | ./san_output.sh | ./proc_results.py &gt; results.csv

You can import the resulting CSV file in a spreadsheet editor. Note that there might be some problems with delimiters (such as ";" in the regexes), so you might first consider sanitizing the CSV with the ./sanitize-csv.py script to get rid of the regexes.</dc:description>
<dc:identifier>https://zenodo.org/record/3975566</dc:identifier>
<dc:identifier>10.5281/zenodo.3975566</dc:identifier>
<dc:identifier>oai:zenodo.org:3975566</dc:identifier>
<dc:language>eng</dc:language>
<dc:relation>doi:10.5281/zenodo.3975565</dc:relation>
<dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
<dc:subject>regular expression</dc:subject>
<dc:subject>repetition operator</dc:subject>
<dc:subject>finite automaton</dc:subject>
<dc:subject>symbolic automaton</dc:subject>
<dc:subject>counting-set automaton</dc:subject>
<dc:title>Artifact for the OOPSLA'20 paper "Regex Matching with Counting-Set Automata"</dc:title>
<dc:type>info:eu-repo/semantics/other</dc:type>
<dc:type>software</dc:type>
</oai_dc:dc>
