This is the implementation of our research paper, "Fuzz4All: Universal Fuzzing with Large Language Models", accepted at ICSE 2024.
OS: a Linux system with Docker support.
Hardware: GPU support (optional, but very highly recommended; without a GPU, fuzzing will be extremely slow).
Before you start, please make sure you have installed Docker (https://docs.docker.com/get-docker/) and the nvidia-container-toolkit (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html). To test that the installation was successful, run:
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Note: the command above and many of the following scripts invoke docker commands. If your user is not in the docker group, you may encounter permission issues running them. To resolve this, either add your user to the docker group (preferred, e.g., `sudo usermod -aG docker $USER`), or prefix the commands with sudo (`sudo docker <CMD>`).
<details> <summary>Q1: Why docker?</summary>
LLMs can generate arbitrary test programs, some of which may attempt to change the source code
or even damage the file system. Therefore, we provide a docker environment for running Fuzz4All.
We highly recommend running Fuzz4All in a sandboxed environment like docker. However, you may also create a conda environment to run it locally.
<details> <summary>Below are instructions to build our conda environment:</summary>
Please run the following commands line-by-line ONLY IF YOU ARE NOT RUNNING IN DOCKER (you may need to press y to confirm):
# Create an environment named `fuzz4all`
conda create -n fuzz4all python=3.10
# Activate
conda activate fuzz4all
# Install required packages
pip install -r requirements.txt
pip install -e .
</details> </details>
<details> <summary>Q2: Why GPU? </summary>
We highly recommend running LLMs on GPUs for more efficient fuzzing. In our experiments, we use a 64-core workstation with 256 GB RAM, running Ubuntu 20.04.5 LTS with an NVIDIA RTX A6000 GPU.
If you don't have a GPU, don't worry: you can still run Fuzz4All on CPU (although generating tests will be much slower).
</details>
We have constructed a docker image which contains:
- Fuzz4All, with the full inputs and scripts to reproduce results from the paper
- pre-built target binaries (e.g., via `./get_clang.sh` in `/home/`)

Note: we did not include the full fuzzing inputs generated by Fuzz4All and the baseline tools in the docker image due to size limitations. However, the complete fuzzing inputs can be obtained in the complete artifact; furthermore, we include the intermediate results (e.g., coverage) and scripts to reproduce the complete results in the paper (see 3. Reproduce results in paper for more detail).
docker load < Fuzz4All_Docker_Image.tar.gz # this may take a while.
docker run -it --rm --runtime=nvidia --gpus all fuzz4all/fuzz4all:v3 bash
Inside the docker image:
cd /home/Fuzz4All
conda activate fuzz4all # load the conda environment
If you don't have a GPU, you can substitute the docker run command above with:
docker run -it --rm fuzz4all/fuzz4all:v3 bash
# follow the rest of the instructions
For each section, we recommend reading through the complete section before running any commands, as they could be quite time-consuming and may download large files.
We provide 3 different sections in this artifact:
1. run Fuzz4All on selected targets
2. collect various bug, coverage, and validity metrics
3. reproduce the results in the paper

Due to the high cost of running the full experiments, we provide simple scripts based on intermediate data to quickly reproduce the results in the paper.
conda activate fuzz4all
cd /home/Fuzz4All
rm -r /home/Fuzz4All/fig/coverage-* # remove the old figures
python tools/coverage/plot_full_run_coverage.py
This will produce the 6 figures in Figure 4 (in fig/coverage_{target}.pdf) using the coverage data collected in our runs.
You can download the figures by running the following command on your host machine:
docker cp <containerId>:/home/Fuzz4All/fig .
To generate the tables in Table 2, run the following command:
cd /home/Fuzz4All
python tools/coverage/draw_table.py
This will render/output the table in Table 2 using the coverage, validity, and number of programs collected during our runs.
To generate the tables in Table 3, run the following commands:
cd /home/Fuzz4All
python tools/targeted/draw_table.py
This will render/output the tables in Table 3 using both the coverage and hit rate collected during our targeted runs.
To generate the table in Table 4, run the following commands:
cd /home/Fuzz4All
python tools/ablation/draw_table.py
This will render/output the table in Table 4 using both the coverage and valid rate collected during our ablation runs.
Now you are ready to run Fuzz4All
. First activate the pre-built conda environment:
conda activate fuzz4all
cd /home/Fuzz4All
Second, let's determine the best fuzzing parameters for your GPU hardware. Here are the default parameters:
export FUZZING_BATCH_SIZE=30
export FUZZING_MODEL="bigcode/starcoderbase"
export FUZZING_DEVICE="gpu"
Depending on your hardware, you may want to adjust these:
- smaller GPU memory (reduce the batch size): `export FUZZING_BATCH_SIZE=5`
- smaller GPU memory (use the smaller 1B model): `export FUZZING_MODEL="bigcode/starcoderbase-1b"`
- no GPU (run the smaller model on CPU): `export FUZZING_MODEL="bigcode/starcoderbase-1b"; export FUZZING_DEVICE="cpu"`

Note: currently we only support either bigcode/starcoderbase-1b or bigcode/starcoderbase as the model (in a non-distributed setup); however, Fuzz4All can easily be modified to support other models/architectures.
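The parameter guidance above can be sketched as a small helper. This is purely illustrative: the `choose_fuzzing_params` name and the 24 GB memory threshold are our own assumptions, not part of Fuzz4All; the returned values mirror the environment variables listed above.

```python
from typing import Optional

# Hypothetical helper (NOT part of Fuzz4All) mapping available GPU memory
# to the suggested environment settings above. The 24 GB cutoff is an
# assumption for "large" GPUs such as the RTX A6000 used in the paper.
def choose_fuzzing_params(gpu_mem_gb: Optional[float]) -> dict:
    if gpu_mem_gb is None:
        # No GPU: run the smaller 1B model on CPU with a small batch size.
        return {"FUZZING_BATCH_SIZE": "5",
                "FUZZING_MODEL": "bigcode/starcoderbase-1b",
                "FUZZING_DEVICE": "cpu"}
    if gpu_mem_gb < 24:
        # Smaller GPU: smaller model and batch size, still on GPU.
        return {"FUZZING_BATCH_SIZE": "5",
                "FUZZING_MODEL": "bigcode/starcoderbase-1b",
                "FUZZING_DEVICE": "gpu"}
    # Large GPU: the default parameters from the paper.
    return {"FUZZING_BATCH_SIZE": "30",
            "FUZZING_MODEL": "bigcode/starcoderbase",
            "FUZZING_DEVICE": "gpu"}

print(choose_fuzzing_params(48)["FUZZING_MODEL"])     # bigcode/starcoderbase
print(choose_fuzzing_params(None)["FUZZING_DEVICE"])  # cpu
```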
Before you start, you need to export your OpenAI API key to the environment variable OPENAI_API_KEY. Here we provide a key (which will be removed after the end of the evaluation period) for you to test:
export OPENAI_API_KEY=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
Want to start fuzzing right away? We provide a demo script to test that the model is running smoothly.
./scripts/demo_run.sh
Here we fuzz the g++ compiler with a pre-defined prompt (no autoprompting needed; we will test that later).
You may be prompted to enter your Hugging Face credentials; please see https://huggingface.co/docs/hub/security-tokens for more detail and follow the instructions in the error message.
# after accepting their license agreement with your huggingface account on huggingface.co
huggingface-cli login
# paste your token here from https://huggingface.co/settings/tokens
Add token as git credential? (Y/n)
# respond with n
If this is your first time running Fuzz4All, it will take some time (~20 minutes) to download the model, and then fuzzing will start. You should see output similar to the following:
BATCH_SIZE: 30
MODEL_NAME: bigcode/starcoderbase
DEVICE: gpu
...
=== Target Config ===
language: cpp
folder: outputs/demo/
...
====================
[INFO] Initializing ... this may take a while ...
[INFO] Loading model ...
=== Model Config ===
model_name: bigcode/starcoderbase
...
====================
[INFO] Model Loaded
[INFO] Without any input prompt ...
[INFO] Done
(resuming from 0)
[VERBOSE] /* Please create a very short program which uses new C++ features in a complex way */
#include <iostream>
...
[VERBOSE] /* Please create a very short program which uses new C++ features in a complex way */
#include <iostream>
int main()
{
int x{5};
x *= ++x + x * (13- x %(2+ x / 3));
std::cout << x << std::endl;
return 0;
}
/* Please create a mutated program that modifies the previous generation */
#include <iostream>
Fuzzing • 30% ━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/100 • 0:02:56
This means the model is running smoothly!
This script generates ~100 fuzzing inputs to test the g++ compiler; you can find the generated test programs in outputs/demo/.
Now you are ready to run Fuzz4All on all targets (with arbitrary inputs through autoprompting)!
Fuzz4All is configured easily through config files. The ones used for our experiments are stored in configs/.
The config file controls various aspects of Fuzz4All, including the fuzzing language, time budget, autoprompting strategy, etc. Please see any example config file in configs/ for more detail.
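As a rough sketch only: the authoritative schema is in the files under configs/, but based on the Target Config and Model Config values printed in the logs in this README, a config might contain fields along these lines (the nesting and key grouping here are illustrative assumptions, not the actual schema):

```yaml
# Illustrative sketch, NOT the actual Fuzz4All config schema --
# consult the real files in configs/ for the true field names.
target:
  language: cpp          # fuzzing language, as shown in the Target Config log
  folder: outputs/demo/  # output folder for generated programs
model:
  model_name: bigcode/starcoderbase  # as shown in the Model Config log
```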
In general, you can run Fuzz4All
with the following command:
python Fuzz4All/fuzz.py --config {config_file.yaml} main_with_config \
--folder outputs/fuzzing_outputs \
--batch_size {batch_size} \
--model_name {model_name} \
--target {target_name}
where {config_file.yaml} is the config file you want to use, {batch_size} is the batch size, {model_name} is the model name, and {target_name} is the target binary you want to fuzz.
You may choose to build/download your own binary ({target_name}) for fuzzing; in this docker container we have already provided the pre-built binaries.
For targeted fuzzing (i.e., fuzzing a specific API or library of a language), you can modify the config file to point to the
specific API/library documentation you want the model to generate prompts for. Please see configs/targeted
for examples of such configs.
Due to the number of different configurations, for ease of running,
we provide a script to run the full pipeline of Fuzz4All
(each run will take 24 hours, corresponding to the full run in RQ1):
./scripts/full_run.sh {target}
# where {target} can be one of gcc, g++, go, javac, cvc5, qiskit.
Note: please make sure you choose an appropriate batch size and model name for your GPU memory size (see above). If you are running on a smaller, less powerful GPU, we recommend first running with a smaller model (i.e., export FUZZING_MODEL="bigcode/starcoderbase-1b") and a smaller batch size (e.g., export FUZZING_BATCH_SIZE=5) to make sure it is not too slow.
You should see similar outputs to the following (notice the autoprompting step):
BATCH_SIZE: 30
MODEL_NAME: bigcode/starcoderbase
DEVICE: gpu
...
=== Target Config ===
language: smt2
folder: outputs/full_run/cvc5/
...
====================
[INFO] Initializing ... this may take a while ...
[INFO] Loading model ...
=== Model Config ===
model_name: bigcode/starcoderbase
...
====================
[INFO] Model Loaded
[INFO] Use auto-prompting prompt ...
Generating prompts... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:07:30
[INFO] Done
(resuming from 0)
[VERBOSE] ; SMT2 is an input language commonly used by SMT solvers, with its syntax based on S-expressions. The multi-sorted logic accommodates a simple type system to confirm that terms from contrasting sorts
aren't the equal. Uninterpreted functions can be declared, with the function symbol being an uninterpreted one. SMT2 supports various theories, including integer and real arithmetic, with basic logical
connectives, quantifiers, and attribute annotations. An SMT2 theory includes sort and function symbol declarations and assertions of facts about them. Terms can be checked against these theories to determine their
validity, with successful queries returning "unsat".
; Please create a short program which uses complex SMT2 logic for an SMT solver
(set-logic ALL)
...
(set-logic ALL)
(assert (forall ((n Int)) (=> (> n 0) (= n (* 2 n)))))
(check-sat)
(exit)
; Please create a short program which uses complex SMT2 logic for an SMT solver
(set-logic ALL)
Fuzzing • 0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 30/100000 • 0:02:26
You can also press ctrl-c to stop the pipeline and resume it later by running the same command again.
After running the command, you can find the generated fuzzing programs in outputs/full_run/{target}/.
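To illustrate how such resumption can work (this is a hypothetical sketch of the idea, not Fuzz4All's actual implementation), the next fuzzing index, matching the "(resuming from N)" log line, can be recovered by scanning the output folder for existing `{n}.fuzz` files:

```python
import os
import re

# Hypothetical sketch of resume logic (NOT Fuzz4All's code): find the next
# fuzzing index by scanning a folder for files named "0.fuzz", "1.fuzz", ...
def next_fuzz_index(folder: str) -> int:
    """Return 1 + the highest existing {n}.fuzz index, or 0 if none exist."""
    indices = []
    for name in os.listdir(folder):
        m = re.fullmatch(r"(\d+)\.fuzz", name)
        if m:
            indices.append(int(m.group(1)))
    return max(indices) + 1 if indices else 0
```

On an empty output folder this returns 0, matching the "(resuming from 0)" line shown in the sample output above.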
<details> <summary>Here is the structure of the output directory: </summary>
- outputs/full_run/{target}/
- prompts
- best_prompt.txt: the best prompt found by `Fuzz4All` for the target.
- greedy_prompt.txt
- prompt_0.txt
- prompt_1.txt
- prompt_2.txt
- scores.txt: keep track of the scores of each prompt (used to select the best prompt).
- 0.fuzz
- 1.fuzz
... # remaining generated fuzzing inputs
- log.txt
- log_generation.txt
- log_validation.txt
</details>
Most notably, we log the generation and validation processes in log_generation.txt and log_validation.txt, respectively.
Furthermore, log.txt provides an overview of the fuzzing process (including any potential bugs found by Fuzz4All).
Potential bugs will look like this in log.txt
:
[VERBOSE] 2345.fuzz has potential error! # this indicates that file 2345.fuzz may have a potential bug
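Such lines can simply be grepped out of log.txt. As a small convenience sketch (our own helper, not part of the artifact), the flagged files can be collected like this:

```python
import re

# Hypothetical helper (not shipped with Fuzz4All): collect the files
# flagged as potential bugs from a log.txt-style log.
def potential_bug_files(log_text: str) -> list:
    """Return file names from '[VERBOSE] <file> has potential error!' lines."""
    pattern = re.compile(r"\[VERBOSE\] (\S+) has potential error!")
    return pattern.findall(log_text)

log = "[INFO] Done\n[VERBOSE] 2345.fuzz has potential error!\n"
print(potential_bug_files(log))  # ['2345.fuzz']
```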
Note: these steps can be quite time-consuming (e.g., ~1 hour for gcc/g++) and can increase the size of the docker image by a lot; please run with caution.
To avoid long build times and extra space, we recommend testing the cvc5 target first, which is the smallest and most reasonable to build (~10 minutes).
conda activate fuzz4all # ensure you activate the environment, as the coverage build requires some python packages
./scripts/build_cvc5_coverage.sh
cd /home/Fuzz4All
python tools/coverage/SMT/collect_coverage.py --folder outputs/demo_coverage_cvc5 --interval 10
You should see the coverage report (either coverage.csv or coverage.txt) in outputs/demo_coverage_cvc5, and the output of the python script should be similar to:
Scanning dependencies of target coverage-reset
Resetting code coverage counters to zero.
[0.090s] [info]: Found 735 coverage files (.gcda)
[0.064s] [info]: Removed 735 .gcda files
Built target coverage-reset
18158 4729
18573 4786
Fuzzing • 29% ━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 29/101 • 0:00:46
Note that this script will also compute the number of valid programs in valid.txt.
E.g., for gcc:
./scripts/build_gcc_coverage.sh
cd /home/Fuzz4All
python tools/coverage/C/collect_coverage.py --folder {c_fuzzing_run_folder}
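The interleaved number pairs in the sample output above (e.g., `18158 4729`) are coverage counters emitted during collection. As an illustration only (our own sketch; the exact semantics of the two columns are not asserted here), such progress lines can be pulled out of the captured output like so:

```python
# Illustrative parser (our own sketch, not part of the artifact): extract
# the two-column numeric progress lines from collect_coverage.py's output.
def parse_coverage_pairs(output: str) -> list:
    pairs = []
    for line in output.splitlines():
        parts = line.split()
        # keep only lines that are exactly two integers
        if len(parts) == 2 and all(p.isdigit() for p in parts):
            pairs.append((int(parts[0]), int(parts[1])))
    return pairs

sample = "Built target coverage-reset\n18158 4729\n18573 4786\n"
print(parse_coverage_pairs(sample))  # [(18158, 4729), (18573, 4786)]
```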
In this section, we provide and demonstrate our scripts/settings to reproduce the results in the paper. Please ensure you have run at least the simple script in the previous section.
To produce the main results of the paper, directly use the previous script in the Full Pipeline section:
cd /home/Fuzz4All
conda activate fuzz4all
./scripts/full_run.sh {target}
# where {target} can be one of gcc, g++, go, javac, cvc5, qiskit.
This will generate the full 24-hour fuzzing run for one target. Repeat 5 times to rerun the results from the paper.
Next, we can collect the coverage using a similar script to the one in the Coverage Collection section:
./scripts/build_gcc_coverage.sh # follow other build scripts for other targets
cd /home/Fuzz4All
python tools/coverage/C/collect_coverage.py --folder outputs/full_run/gcc
This will generate both the coverage and validity report for each fuzzing run.
Next we can use the intermediate data script to generate the figures.
cd /home/Fuzz4All
python tools/coverage/plot_full_run_coverage.py
This will produce the 6 figures in Figure 4 (in fig/coverage_{target}.pdf) using the coverage data collected in our runs.
# go to top level directory
cd /home/Fuzz4All
python tools/coverage/draw_table.py
This will render/output the table in Table 2 using the coverage, validity, and number of programs collected during our runs.
To generate the targeted fuzzing inputs for all 6 languages, we provide the following script:
./scripts/targeted_run.sh {target}
# where {target} can be one of gcc, g++, go, javac, cvc5, qiskit.
This will produce the 3 targeted fuzzing runs (each generating 10,000 fuzzing inputs) with results saved in outputs/targeted.
Next, we can collect the coverage using a similar script to the one in the Coverage Collection section:
./scripts/build_gcc_coverage.sh # follow other build scripts for other targets
cd /home/Fuzz4All
python tools/coverage/C/collect_coverage.py --folder outputs/ablation/c_goto
... # repeat for other target runs
Furthermore, to generate the hit rate results, you can run the following scripts after all target runs are completed:
python tools/targeted/target_hit.py
The following commands will render/output the tables in Table 3 using both coverage and hit rate collected during our targeted runs:
# go to top level directory
cd /home/Fuzz4All
python tools/targeted/draw_table.py
To generate the ablation results for all 6 languages, we provide the following script:
./scripts/ablation_run.sh {target}
# where {target} can be one of gcc, g++, go, javac, cvc5, qiskit.
This will run the 5 different ablation settings (each generating 10,000 fuzzing inputs) with results saved in outputs/ablation.
Repeat 4 times to rerun the results from the paper.
Next, we can collect the coverage as well as the validity rate using a similar script to the one in the Coverage Collection section:
./scripts/build_gcc_coverage.sh # follow other build scripts for other targets
cd /home/Fuzz4All
python tools/coverage/C/collect_coverage.py --folder outputs/ablation/c_std_no_input
... # repeat for other ablation runs
The following commands will render/output the table in Table 4 using both the coverage and valid rate collected during our ablation runs:
# go to top level directory
cd /home/Fuzz4All
python tools/ablation/draw_table.py
In our paper, we use well-known and tested baselines with no modification:
- GrayC, Hephaestus, and MorphQ: we directly use their docker images and run them with the default settings provided in their artifacts.
- CSmith, YarpGen, TypeFuzz, and Go-Fuzz: we use their most recent stable versions on GitHub.

Due to the size limitation of the docker image, we do not include the source code of these baselines in our docker image; however, the full fuzzing inputs generated by these baselines are available to download here: https://uillinoisedu-my.sharepoint.com/:u:/g/personal/chunqiu2_illinois_edu/EYBgOE7eUk5Dm8K4TkZuAbkBTM-MRIdEkXOMFBvZPP4JZA?e=ln8orG
In RQ4, we claimed that Fuzz4All has detected 98 bugs in total, with 64 confirmed by developers as previously unknown.
Note: this number differs from the initial submission; see the new version of the paper for the most recent numbers.
We provide a list of bugs and bug reports in the bugs/ folder. We also provide the bugs at the top level of this artifact.