This is the implementation of our research paper, "Fuzz4All: Universal Fuzzing with Large Language Models", accepted at ICSE 2024.
OS: a Linux system with Docker support.
Hardware: GPU support (optional, but very highly recommended; without a GPU, fuzzing will be extremely slow).
Before you start, please make sure you have installed Docker (https://docs.docker.com/get-docker/) and the nvidia-container-toolkit (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html). To test that the installation was successful, run:
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Note: the command above and many of the following scripts invoke docker commands. If your user is not in the docker group, you may encounter permission issues running them. To resolve this, either add your user to the docker group (preferred, e.g., `sudo usermod -aG docker $USER`), or prefix the commands with sudo (`sudo docker <CMD>`).
<details> <summary>Q1: Why docker?</summary>
LLMs can generate arbitrary test programs, some of which may attempt to change the source code
or even damage the file system. Therefore, we provide a docker environment for running Fuzz4All.
We highly recommend running Fuzz4All in a sandboxed environment like docker. However, you may also create a conda environment to run it locally.
<details> <summary>Below are instructions to build our conda environment:</summary>
Please run the following commands line-by-line ONLY IF YOU ARE NOT RUNNING IN DOCKER (you may need to press y to confirm):
# Create an environment named `fuzz4all`
conda create -n fuzz4all python=3.10
# Activate
conda activate fuzz4all
# Install required packages
pip install -r requirements.txt
pip install -e .
</details> </details>
<details> <summary>Q2: Why GPU? </summary>
We highly recommend running LLMs on GPUs for more efficient fuzzing. In our experiments, we use a 64-core workstation with 256 GB RAM, running Ubuntu 20.04.5 LTS with an NVIDIA RTX A6000 GPU.
If you don't have a GPU, don't worry: you can still run Fuzz4All on CPU (although generating tests will be much slower).
</details>
We have constructed a docker image which contains:
- Fuzz4All, with the full inputs and scripts to reproduce results from the paper
- pre-built target binaries (e.g., via `./get_clang.sh` in `/home/`)

Note: we did not include the full fuzzing inputs generated by Fuzz4All and the baseline tools in the docker image due to size limitations. However, the complete fuzzing inputs can be obtained in the complete artifact; furthermore, we include the intermediate results (e.g., coverage) and scripts to reproduce the complete results in the paper (see 3. Reproduce results in paper for more detail).
docker load < Fuzz4All_Docker_Image.tar.gz # this may take a while.
docker run -it --rm --runtime=nvidia --gpus all fuzz4all/fuzz4all:v3 bash
Inside the docker image:
cd /home/Fuzz4All
conda activate fuzz4all # load the conda environment
If you don't have a GPU, you can substitute the docker run command above with:
docker run -it --rm fuzz4all/fuzz4all:v3 bash
# follow the rest of the instructions
For each section, we recommend reading through the complete section before running any commands, as they could be quite time-consuming and may download large files.
We provide 3 different sections in this artifact:
1. run Fuzz4All on selected targets
2. collect various bug, coverage, and validity metrics
3. reproduce the results in the paper

Due to the high cost of running the full experiments, we provide simple scripts based on intermediate data to quickly reproduce the results in the paper.
conda activate fuzz4all
cd /home/Fuzz4All
rm -r /home/Fuzz4All/fig/coverage-* # remove the old figures
python tools/coverage/plot_full_run_coverage.py
This will produce the 6 figures in Figure 4 (in fig/coverage_{target}.pdf) using the coverage data collected in our runs.
You can download the figures by running the following command on your host machine:
docker cp <containerId>:/home/Fuzz4All/fig .
To generate the tables in Table 2, run the following command:
cd /home/Fuzz4All
python tools/coverage/draw_table.py
This will render/output the table in Table 2 using the coverage, validity, and number of programs collected during our runs.
To generate the tables in Table 3, run the following commands:
cd /home/Fuzz4All
python tools/targeted/draw_table.py
This will render/output the tables in Table 3 using both the coverage and hit rate collected during our targeted runs.
To generate the table in Table 4, run the following commands:
cd /home/Fuzz4All
python tools/ablation/draw_table.py
This will render/output the table in Table 4 using both the coverage and valid rate collected during our ablation runs.
Now you are ready to run Fuzz4All
. First activate the pre-built conda environment:
conda activate fuzz4all
cd /home/Fuzz4All
Second, let's determine the best fuzzing parameters for your GPU hardware. Here are the default parameters:
export FUZZING_BATCH_SIZE=30
export FUZZING_MODEL="bigcode/starcoderbase"
export FUZZING_DEVICE="gpu"
Depending on your hardware, you may want to adjust these:
- smaller GPU memory (reduce the batch size): `export FUZZING_BATCH_SIZE=5`
- smaller GPU memory (use the smaller 1B model): `export FUZZING_MODEL="bigcode/starcoderbase-1b"`
- no GPU (run the smaller model on CPU): `export FUZZING_MODEL="bigcode/starcoderbase-1b"; export FUZZING_DEVICE="cpu"`

Note: currently we only support either bigcode/starcoderbase-1b or bigcode/starcoderbase as the model (in a non-distributed setup); however, Fuzz4All can easily be modified to support other models/architectures.
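The parameter guidance above can be sketched as a small helper. This is purely illustrative: the `choose_fuzzing_params` name and the 24 GB memory threshold are our own assumptions, not part of Fuzz4All; the returned values mirror the environment variables listed above.

```python
from typing import Optional

# Hypothetical helper (NOT part of Fuzz4All) mapping available GPU memory
# to the suggested environment settings above. The 24 GB cutoff is an
# assumption for "large" GPUs such as the RTX A6000 used in the paper.
def choose_fuzzing_params(gpu_mem_gb: Optional[float]) -> dict:
    if gpu_mem_gb is None:
        # No GPU: run the smaller 1B model on CPU with a small batch size.
        return {"FUZZING_BATCH_SIZE": "5",
                "FUZZING_MODEL": "bigcode/starcoderbase-1b",
                "FUZZING_DEVICE": "cpu"}
    if gpu_mem_gb < 24:
        # Smaller GPU: smaller model and batch size, still on GPU.
        return {"FUZZING_BATCH_SIZE": "5",
                "FUZZING_MODEL": "bigcode/starcoderbase-1b",
                "FUZZING_DEVICE": "gpu"}
    # Large GPU: the default parameters from the paper.
    return {"FUZZING_BATCH_SIZE": "30",
            "FUZZING_MODEL": "bigcode/starcoderbase",
            "FUZZING_DEVICE": "gpu"}

print(choose_fuzzing_params(48)["FUZZING_MODEL"])     # bigcode/starcoderbase
print(choose_fuzzing_params(None)["FUZZING_DEVICE"])  # cpu
```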
Before you start, you need to export your OpenAI API key to the environment variable OPENAI_API_KEY. Here we provide a key (which will be removed after the end of the evaluation period) for you to test:
export OPENAI_API_KEY=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
Want to start fuzzing right away? We provide a demo script to test that the model is running smoothly.
./scripts/demo_run.sh
Here we fuzz the g++ compiler with a pre-defined prompt (no autoprompting needed; we will test that later).
You may be prompted to enter your Hugging Face credentials; please see https://huggingface.co/docs/hub/security-tokens for more detail and follow the instructions in the error message.
# after accepting their license agreement with your huggingface account on huggingface.co
huggingface-cli login
# paste your token here from https://huggingface.co/settings/tokens
Add token as git credential? (Y/n)
# respond with n
If this is your first time running Fuzz4All, it will take some time (~20 minutes) to download the model, and then fuzzing will start. You should see output similar to the following:
BATCH_SIZE: 30
MODEL_NAME: bigcode/starcoderbase
DEVICE: gpu
...
=== Target Config ===
language: cpp
folder: outputs/demo/
...
====================
[INFO] Initializing ... this may take a while ...
[INFO] Loading model ...
=== Model Config ===
model_name: bigcode/starcoderbase
...
====================
[INFO] Model Loaded
[INFO] Without any input prompt ...
[INFO] Done
(resuming from 0)
[VERBOSE] /* Please create a very short program which uses new C++ features in a complex way */
#include <iostream>
...
[VERBOSE] /* Please create a very short program which uses new C++ features in a complex way */
#include <iostream>
int main()
{
int x{5};
x *= ++x + x * (13- x %(2+ x / 3));
std::cout << x << std::endl;
return 0;
}
/* Please create a mutated program that modifies the previous generation */
#include <iostream>
Fuzzing • 30% ━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/100 • 0:02:56
This means the model is running smoothly!
This script generates ~100 fuzzing inputs to test the g++ compiler; you can find the generated test programs in outputs/demo/.
Now you are ready to run Fuzz4All on all targets (with arbitrary inputs through autoprompting)!
Fuzz4All is configured easily through config files. The ones used for our experiments are stored in configs/.
The config file controls various aspects of Fuzz4All, including the fuzzing language, time budget, autoprompting strategy, etc. Please see any example config file in configs/ for more detail.
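As a rough sketch only: the authoritative schema is in the files under configs/, but based on the Target Config and Model Config values printed in the logs in this README, a config might contain fields along these lines (the nesting and key grouping here are illustrative assumptions, not the actual schema):

```yaml
# Illustrative sketch, NOT the actual Fuzz4All config schema --
# consult the real files in configs/ for the true field names.
target:
  language: cpp          # fuzzing language, as shown in the Target Config log
  folder: outputs/demo/  # output folder for generated programs
model:
  model_name: bigcode/starcoderbase  # as shown in the Model Config log
```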
In general, you can run Fuzz4All
with the following command:
python Fuzz4All/fuzz.py --config {config_file.yaml} main_with_config \
--folder outputs/fuzzing_outputs \
--batch_size {batch_size} \
--model_name {model_name} \
--target {target_name}
where {config_file.yaml} is the config file you want to use, {batch_size} is the batch size, {model_name} is the model name, and {target_name} is the target binary you want to fuzz.
You may choose to build/download your own binary ({target_name}) for fuzzing; in this docker container we have already provided the pre-built binaries.
For targeted fuzzing (i.e., fuzzing a specific API or library of a language), you can modify the config file to point to the
specific API/library documentation you want the model to generate prompts for. Please see configs/targeted
for examples of such configs.
Due to the number of different configurations, for ease of running,
we provide a script to run the full pipeline of Fuzz4All
(each run will take 24 hours, corresponding to the full run in RQ1):
./scripts/full_run.sh {target}
# where {target} can be one of gcc, g++, go, javac, cvc5, qiskit.
Note: please make sure you choose an appropriate batch size and model name for your GPU memory size (see above). If you are running on a smaller, less powerful GPU, we recommend first running with a smaller model (i.e., export FUZZING_MODEL="bigcode/starcoderbase-1b") and a smaller batch size (e.g., export FUZZING_BATCH_SIZE=5) to make sure it is not too slow.
You should see similar outputs to the following (notice the autoprompting step):
BATCH_SIZE: 30
MODEL_NAME: bigcode/starcoderbase
DEVICE: gpu
...
=== Target Config ===
language: smt2
folder: outputs/full_run/cvc5/
...
====================
[INFO] Initializing ... this may take a while ...
[INFO] Loading model ...
=== Model Config ===
model_name: bigcode/starcoderbase
...
====================
[INFO] Model Loaded
[INFO] Use auto-prompting prompt ...
Generating prompts... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:07:30
[INFO] Done
(resuming from 0)
[VERBOSE] ; SMT2 is an input language commonly used by SMT solvers, with its syntax based on S-expressions. The multi-sorted logic accommodates a simple type system to confirm that terms from contrasting sorts
aren't the equal. Uninterpreted functions can be declared, with the function symbol being an uninterpreted one. SMT2 supports various theories, including integer and real arithmetic, with basic logical
connectives, quantifiers, and attribute annotations. An SMT2 theory includes sort and function symbol declarations and assertions of facts about them. Terms can be checked against these theories to determine their
validity, with successful queries returning "unsat".
; Please create a short program which uses complex SMT2 logic for an SMT solver
(set-logic ALL)
...
(set-logic ALL)
(assert (forall ((n Int)) (=> (> n 0) (= n (* 2 n)))))
(check-sat)
(exit)
; Please create a short program which uses complex SMT2 logic for an SMT solver
(set-logic ALL)
Fuzzing • 0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/100000 • 0:02:26
You can also press ctrl-c to stop the pipeline and resume it later by running the same command again.
After running the command, you can find the generated fuzzing programs in outputs/full_run/{target}/.
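To illustrate how such resumption can work (this is a hypothetical sketch of the idea, not Fuzz4All's actual implementation), the next fuzzing index, matching the "(resuming from N)" log line, can be recovered by scanning the output folder for existing `{n}.fuzz` files:

```python
import os
import re

# Hypothetical sketch of resume logic (NOT Fuzz4All's code): find the next
# fuzzing index by scanning a folder for files named "0.fuzz", "1.fuzz", ...
def next_fuzz_index(folder: str) -> int:
    """Return 1 + the highest existing {n}.fuzz index, or 0 if none exist."""
    indices = []
    for name in os.listdir(folder):
        m = re.fullmatch(r"(\d+)\.fuzz", name)
        if m:
            indices.append(int(m.group(1)))
    return max(indices) + 1 if indices else 0
```

On an empty output folder this returns 0, matching the "(resuming from 0)" line shown in the sample output above.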
<details> <summary>Here is the structure of the output directory: </summary>
- outputs/full_run/{target}/
- prompts
- best_prompt.txt: the best prompt found by `Fuzz4All` for the target.
- greedy_prompt.txt
- prompt_0.txt
- prompt_1.txt
- prompt_2.txt
- scores.txt: keep track of the scores of each prompt (used to select the best prompt).
- 0.fuzz
- 1.fuzz
... # remaining generated fuzzing inputs
- log.txt
- log_generation.txt
- log_validation.txt
</details>
Most notably, we log the generation and validation processes in log_generation.txt and log_validation.txt, respectively.
Furthermore, log.txt provides an overview of the fuzzing process (including any potential bugs found by Fuzz4All).
Potential bugs will look like this in log.txt
:
[VERBOSE] 2345.fuzz has potential error! # this indicates that file 2345.fuzz may have a potential bug
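Such lines can simply be grepped out of log.txt. As a small convenience sketch (our own helper, not part of the artifact), the flagged files can be collected like this:

```python
import re

# Hypothetical helper (not shipped with Fuzz4All): collect the files
# flagged as potential bugs from a log.txt-style log.
def potential_bug_files(log_text: str) -> list:
    """Return file names from '[VERBOSE] <file> has potential error!' lines."""
    pattern = re.compile(r"\[VERBOSE\] (\S+) has potential error!")
    return pattern.findall(log_text)

log = "[INFO] Done\n[VERBOSE] 2345.fuzz has potential error!\n"
print(potential_bug_files(log))  # ['2345.fuzz']
```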
Note: these steps can be quite time-consuming (e.g., ~1 hour for gcc/g++) and can increase the size of the docker image by a lot; please run with caution.
To avoid long build times and extra space, we recommend testing the cvc5 target first, which is the smallest and most reasonable to build (~10 minutes).
conda activate fuzz4all # ensure you activate the environment, as the coverage build requires some python packages
./scripts/build_cvc5_coverage.sh
cd /home/Fuzz4All
python tools/coverage/SMT/collect_coverage.py --folder outputs/demo_coverage_cvc5 --interval 10
You should see the coverage report (either coverage.csv or coverage.txt) in outputs/demo_coverage_cvc5, and the output of the python script should be similar to:
Scanning dependencies of target coverage-reset
Resetting code coverage counters to zero.
[0.090s] [info]: Found 735 coverage files (.gcda)
[0.064s] [info]: Removed 735 .gcda files
Built target coverage-reset
18158 4729
18573 4786
Fuzzing • 29% ━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 29/101 • 0:00:46
Note that this script will also compute the number of valid programs in valid.txt.
E.g., for gcc:
./scripts/build_gcc_coverage.sh
cd /home/Fuzz4All
python tools/coverage/C/collect_coverage.py --folder {c_fuzzing_run_folder}
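The interleaved number pairs in the sample output above (e.g., `18158 4729`) are coverage counters emitted during collection. As an illustration only (our own sketch; the exact semantics of the two columns are not asserted here), such progress lines can be pulled out of the captured output like so:

```python
# Illustrative parser (our own sketch, not part of the artifact): extract
# the two-column numeric progress lines from collect_coverage.py's output.
def parse_coverage_pairs(output: str) -> list:
    pairs = []
    for line in output.splitlines():
        parts = line.split()
        # keep only lines that are exactly two integers
        if len(parts) == 2 and all(p.isdigit() for p in parts):
            pairs.append((int(parts[0]), int(parts[1])))
    return pairs

sample = "Built target coverage-reset\n18158 4729\n18573 4786\n"
print(parse_coverage_pairs(sample))  # [(18158, 4729), (18573, 4786)]
```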
In this section, we provide and demonstrate our scripts/settings to reproduce the results in the paper. Please ensure you have run at least the simple script in the previous section.
To produce the main results of the paper, directly use the previous script in the Full Pipeline section:
cd /home/Fuzz4All
conda activate fuzz4all
./scripts/full_run.sh {target}
# where {target} can be one of gcc, g++, go, javac, cvc5, qiskit.
This will generate the full 24-hour fuzzing run for one target. Repeat 5 times to rerun the results from the paper.
Next, we can collect the coverage using a similar script to the one in the Coverage Collection section:
./scripts/build_gcc_coverage.sh # follow other build scripts for other targets
cd /home/Fuzz4All
python tools/coverage/C/collect_coverage.py --folder outputs/full_run/gcc
This will generate both the coverage and validity report for each fuzzing run.
Next we can use the intermediate data script to generate the figures.
cd /home/Fuzz4All
python tools/coverage/plot_full_run_coverage.py
This will produce the 6 figures in Figure 4 (in fig/coverage_{target}.pdf) using the coverage data collected in our runs.
# go to top level directory
cd /home/Fuzz4All
python tools/coverage/draw_table.py
This will render/output the table in Table 2 using the coverage, validity, and number of programs collected during our runs.
To generate the targeted fuzzing inputs for all 6 languages, we provide the following script:
./scripts/targeted_run.sh {target}
# where {target} can be one of gcc, g++, go, javac, cvc5, qiskit.
This will produce the 3 targeted fuzzing runs (each generating 10,000 fuzzing inputs) with results saved in outputs/targeted.
Next, we can collect the coverage using a similar script to the one in the Coverage Collection section:
./scripts/build_gcc_coverage.sh # follow other build scripts for other targets
cd /home/Fuzz4All
python tools/coverage/C/collect_coverage.py --folder outputs/ablation/c_goto
... # repeat for other target runs
Furthermore, to generate the hit rate results, you can run the following scripts after all target runs are completed:
python tools/targeted/target_hit.py
The following commands will render/output the tables in Table 3 using both coverage and hit rate collected during our targeted runs:
# go to top level directory
cd /home/Fuzz4All
python tools/targeted/draw_table.py
To generate the ablation results for all 6 languages, we provide the following script:
./scripts/ablation_run.sh {target}
# where {target} can be one of gcc, g++, go, javac, cvc5, qiskit.
This will run the 5 different ablation settings (each generating 10,000 fuzzing inputs) with results saved in outputs/ablation.
Repeat 4 times to rerun the results from the paper.
Next, we can collect the coverage as well as the validity rate using a similar script to the one in the Coverage Collection section:
./scripts/build_gcc_coverage.sh # follow other build scripts for other targets
cd /home/Fuzz4All
python tools/coverage/C/collect_coverage.py --folder outputs/ablation/c_std_no_input
... # repeat for other ablation runs
The following commands will render/output the table in Table 4 using both the coverage and valid rate collected during our ablation runs:
# go to top level directory
cd /home/Fuzz4All
python tools/ablation/draw_table.py
In our paper, we use well-known and tested baselines with no modification:
- GrayC, Hephaestus, and MorphQ: we directly use their docker images and run them with the default settings provided in their artifacts.
- CSmith, YarpGen, TypeFuzz, and Go-Fuzz: we use their most recent stable versions on GitHub.

Due to the size limitation of the docker image, we do not include the source code of these baselines in our docker image; however, the full fuzzing inputs generated by these baselines are available to download here: https://uillinoisedu-my.sharepoint.com/:u:/g/personal/chunqiu2_illinois_edu/EYBgOE7eUk5Dm8K4TkZuAbkBTM-MRIdEkXOMFBvZPP4JZA?e=ln8orG
In RQ4, we claimed that Fuzz4All has detected 98 bugs in total, with 64 confirmed by developers as previously unknown.
Note: this number differs from the initial submission; see the new version of the paper for the most recent numbers.
We provide a list of bugs and bug reports in the bugs/ folder. We also provide the bugs at the top level of this artifact.