OOPSLA Artifact — Precise Static Modeling of Ethereum ``Memory''

These instructions describe how to use the artifact for our accepted OOPSLA'20 paper entitled "Precise Static Modeling of Ethereum ``Memory''".

Part 1: Getting Started

Prerequisites

  1. Our artifact is bundled for AMD64 Linux using an Ubuntu 18.04 Docker image. If you do not already have Docker installed, follow the official installation instructions.

  2. Decompress our Docker image:

  bunzip2 eth-memory-modeling.tar.bz2

  3. Load the image into Docker as follows:

  docker load -i eth-memory-modeling.tar

  4. Run the Docker image. This will give you a bash shell in the working directory /home/reviewer/gigahorse-toolchain. Except where specified, the remainder of the commands in this README should be run in this Docker shell with this path as the working directory.

  docker run -it eth-memory-modeling

Running our tool in the Docker Environment

This README is also available in markdown form at ~/README.md.

To facilitate testing, a random subset of 2,000 unique smart contract bytecodes is bundled with the Docker image in the directory ~/contracts_2k_random.
Each file in ~/contracts_2k_random contains the ASCII hexadecimal representation of one EVM bytecode program (i.e., one smart contract).
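Since each input file is just an ASCII-hex string, a quick sanity check can be scripted. The following sketch is illustrative only (the helper name is ours, and whether files carry a `0x` prefix may vary):

```python
import string

def looks_like_bytecode(text: str) -> bool:
    """Loosely check that a file's contents form an ASCII-hex EVM program.
    Accepts an optional '0x' prefix, since conventions vary."""
    s = text.strip()
    if s.lower().startswith("0x"):
        s = s[2:]
    # Every byte is two hex digits, so the length must be even and non-zero.
    return len(s) > 0 and len(s) % 2 == 0 and all(c in string.hexdigits for c in s)

# Example: a common EVM prologue (free-memory-pointer initialization).
print(looks_like_bytecode("6080604052"))  # True
```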

The examples in this README operate on this bundled subset of contracts. They may be used to approximate the results of our full experiments.

Alternatively, all 70,063 contracts of our dataset are contained in ~/contracts-all.tar.bz2.

Running the following commands will place our full evaluation set in ~/contracts_all:

  cd ~
  bunzip2 ~/contracts-all.tar.bz2
  tar -xvf contracts-all.tar

Running our analysis and clients on a single contract

Running our analysis consists of three stages:

  1. Generating decompilation input facts
  2. Decompilation with Gigahorse, producing input facts for the clients
  3. Running the memory-modeling-based Gigahorse clients

To run our memory modeling analysis and its clients on a single contract, a helper script is provided at ~/gigahorse-toolchain/runSingle.sh.

Note that the commands in ~/gigahorse-toolchain/runSingle.sh invoke souffle's interpreted mode, which runs significantly slower than the compiled mode used by the bulk analysis (and hence by our evaluation).

Running runSingle.sh on a sample contract (e.g., ~/contracts_2k_random/0440797b18a56d76a93f4d4059765f38.hex) results in the execution of the following commands:

1  cd ~/gigahorse-toolchain/
2  ./generatefacts ~/contracts_2k_random/0440797b18a56d76a93f4d4059765f38.hex decomp-facts
3  \rm -rf decomp-out
4  mkdir decomp-out && cd decomp-out
5  souffle -F ../decomp-facts ../logic/decompiler.dl
6  souffle ../../clients/function_inliner.dl
7  souffle ../../clients/ethainter-new.dl
8  souffle ../../clients/repeatedcalls.dl
9  souffle ../../clients/eip1884impact.dl

The command at line 2 translates the input contract bytecode into a set of Souffle fact files in the directory ./decomp-facts.

The command at line 5 runs the Gigahorse decompiler, loading input fact files from ./decomp-facts. It produces a set of output relations as CSV files in ./decomp-out.

The next command runs the function inliner, transforming the analysis facts in a non-reversible manner by performing 2x function inlining.

The last three commands run the three memory-modeling clients described in Section 5 of our paper.

Each client produces output relations to disk, as CSV files:

  • ethainter-new.dl (the first relation is the new ethainter client described in Section 5.2, while the rest are preexisting ethainter client analyses):
    • Vulnerability_TaintedERC20Transfer.csv
    • Vulnerability_AccessibleSelfdestruct.csv
    • Vulnerability_TaintedSelfdestruct.csv
    • Vulnerability_UncheckedTaintedStaticcall.csv
    • TaintedOwnerVariable.csv
  • repeatedcalls.dl:
    • RepeatedCalls.csv
  • eip1884impact.dl:
    • FallbackWillFail.csv

An entry in one of these relations corresponds to one flagged occurrence of that vulnerability/bad smell in the input contract.
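Since Souffle writes one tuple per line to these output CSV files, counting flagged occurrences amounts to counting lines. A minimal sketch (the function name and example path are ours; a missing or empty file simply means the client flagged nothing):

```python
from pathlib import Path

def count_flags(relation_csv: str) -> int:
    """Count flagged occurrences in a Souffle output relation.
    Souffle emits one tuple per line (tab-separated by default),
    so each non-blank line is one flagged occurrence."""
    path = Path(relation_csv)
    if not path.exists():
        return 0
    with path.open() as f:
        return sum(1 for line in f if line.strip())

# Illustrative usage, assuming runSingle.sh has populated decomp-out/:
# print(count_flags("decomp-out/RepeatedCalls.csv"))
```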

Batch Analysis

To perform a batch decompilation and analysis of all 2K example contracts, Gigahorse provides a convenience script called bulk_analyze.py, which can be run as follows:

$ python3.8 bulk_analyze.py -j 8 -d ~/contracts_2k_random -C ../clients/function_inliner.dl,../clients/memory_modeling.dl

Here, we have chosen to run 8 parallel analysis processes over all bytecode files in ~/contracts_2k_random. After decompilation, function inlining will be performed for each contract, followed by the core memory modeling analysis.

The analysis script will print progress to stdout, and print an aggregate summary of results upon completion, with more detailed per-contract results written to the file results.json.

An example summary, indicating successful execution of the previous command, looks like the following:

...
1994: b2509f5c1ed4f9bac31ae80cf01e6e07.hex completed in 0.07 + 0.95 + 0.12 secs
1998: 55138f5c77517e4ab9a8385e1a0ad1a9.hex completed in 0.03 + 0.43 + 0.06 secs
1999: da5836e2c0dda0dcb3162529646b95be.hex completed in 0.02 + 0.45 + 0.06 secs
ad2e0e312a3c9b11d3ac7ffdf8cba162.hex timed out.
6eb5c177c74262b721685f6893eff290.hex timed out.
7faccca439d5e5f3d8cf6bbfa14c4fb3.hex timed out.
aa155a2dc6961f896023bc27be867a96.hex timed out.

Finishing...                

2000 of 2000 contracts flagged.

  ActualReturnArgs: 71.30%       
  AllCALLsClassified: 98.20%

  ...

  TAC_Var: 99.10%
  TAC_Variable_BlockValue: 99.10%
  TAC_Variable_Value: 99.10%
  TIMEOUT: 0.90%
  Verbatim_AllVsModeledMLOADs: 99.10%
  Verbatim_AllVsModeledMSTOREs: 99.10%
  Verbatim_CDLAllVSStaticVSArr: 99.10%
  Verbatim_MemConsStmtsLengths: 99.10%
  assertVarIsArray: 2.20%
  bytecode: 99.10%
  inliner: 832.55%
  inliner2: 485.75%
  preTrans: 0.25%

The percentage next to each relation name indicates the percentage of contracts for which that rule produced some results (i.e., the output CSV file was non-empty).
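The same non-empty-CSV percentages can be approximated by hand from per-contract output directories. A sketch of the computation, under the assumption of one output directory per contract with one CSV per relation (the function name and layout are ours, not bulk_analyze.py's actual implementation):

```python
from collections import Counter
from pathlib import Path

def summarize(out_dirs):
    """For each relation, compute the percentage of contracts whose output
    CSV is non-empty, mirroring the shape of the aggregate summary that
    bulk_analyze.py prints."""
    hits = Counter()
    for d in out_dirs:
        for csv_file in Path(d).glob("*.csv"):
            if csv_file.stat().st_size > 0:   # non-empty => rule fired
                hits[csv_file.stem] += 1
    total = len(out_dirs)
    return {rel: 100.0 * n / total for rel, n in sorted(hits.items())}
```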

Part 2: Step-by-Step Reproducing the Paper Results

We ran the artifact on two different setups:

  1. A server with two Intel Xeon Gold 6136 3.00GHz CPUs and 640GB of RAM. We used this setup to run the analyses on our whole evaluation set under ~/contracts_all using 24 concurrent processes.
  2. A laptop with an Intel Core i7-3612QM 2.10GHz CPU and 16GB of RAM. We consider 16GB of RAM a requirement to run this artifact. We used this setup to run the analyses on the 2,000 randomly selected contracts under ~/contracts_2k_random using 8 concurrent processes.

We perform bulk analysis in two stages in order to separate the fact generation performed by the Gigahorse decompiler from the execution of our memory-modeling-based client analyses, and to inspect their running times and timeouts individually.

Performing decompilation and all client analyses:

Setup 1:
  python3.8 bulk_analyze.py -j 24 -d ~/contracts_all -C ../clients/function_inliner.dl --restart
  python3.8 bulk_analyze.py -j 24 -d ~/contracts_all -C ../clients/ethainter-new.dl,../clients/eip1884impact.dl,../clients/repeatedcalls.dl --rerun_clients
Setup 2:
  python3.8 bulk_analyze.py -j 8 -d ~/contracts_2k_random -C ../clients/function_inliner.dl --restart
  python3.8 bulk_analyze.py -j 8 -d ~/contracts_2k_random -C ../clients/ethainter-new.dl,../clients/eip1884impact.dl,../clients/repeatedcalls.dl --rerun_clients

After running the two stages of the bulk analysis, the file results.json will contain detailed per-contract results.

The following table lists the running times of our two setups for the two stages of the bulk analysis.

|                 | Setup 1  | Setup 2   |
|-----------------|----------|-----------|
| decompilation   | 109 mins | 13.5 mins |
| client analyses | 20 mins  | 4.5 mins  |

Executing the commands above and inspecting the running times supports the claims made in the Analysis Scalability paragraph in Section 6.1.

Inspecting the client analyses results

After performing the bulk analysis of the memory-modeling-based clients, their results can be inspected in the summary printed by bulk_analyze.py at the end of its execution. The claims regarding the number or percentage of contracts that report at least one result for the respective analyses can be found at:

  • Taint Analysis (relation Vulnerability_TaintedERC20Transfer): Section 6.2.1 (lines 922-923)
  • Gas of Fallback Functions (relation FallbackWillFail): Section 6.3 (lines 953-954). Note that in the originally submitted version of the paper, the reported number of smart contracts refers to smart contract instances, not unique bytecodes like the statistics provided in the rest of the paper. We will fix this inconsistency in the revised version of the paper. In terms of unique bytecodes, the expected number is 195 (0.27%) when analyzing our complete dataset, and 3 (0.15%) when analyzing the 2K subset.
  • Repeated Calls (relation RepeatedCalls): Section 6.4 (line 995)

Keep in mind that, when running on the provided subset of contracts, these numbers will be approximations of what is reported in the paper.

Quantitative Evaluation of Memory Modeling

A helper script, ~/print-metrics.py, is provided; it parses the JSON output file of bulk_analyze.py and prints the metrics presented in Section 6.1, which populate table 1 and figures 4 and 5. It can be invoked as:

python3.8 ~/print-metrics.py results.json

Client: Taint Analysis

Tainted ERC20 Token transfer

Apart from the statistics about the results of the Vulnerability_TaintedERC20Transfer relation we mentioned earlier, the contract sources and bytecodes of the manually inspected contracts (figure 6) are available under ~/manual-inspection/TaintedERC20Transfer.

Effects on pre-existing Ethainter Clients

To compare different implementations of ethainter on the same set of contracts, the script ~/infoflow-comparison.py accepts the JSON outputs of the two implementations and prints their differences with respect to the supported ethainter vulnerabilities (i.e., contracts flagged by one implementation but not the other).

First, the old ethainter client (without our memory modeling) must be used to analyse our evaluation set.

Performing bulk analysis using the old ethainter client (found at ~/clients/ethainter-old-inlined.dl):

Setup 1:
  python3.8 bulk_analyze.py -j 24 -d ~/contracts_all -C ../clients/ethainter-old-inlined.dl --rerun_clients -r ethainter-old.json
Setup 2:
  python3.8 bulk_analyze.py -j 8 -d ~/contracts_2k_random -C ../clients/ethainter-old-inlined.dl --rerun_clients -r ethainter-old.json

On startup of the bulk-analysis script, souffle will report many warnings. These do not affect the results of the analysis and can be ignored.

Execution of ~/infoflow-comparison.py:

  python3.8 ~/infoflow-comparison.py results.json ethainter-old.json
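At its core, the comparison boils down to set differences over the contracts flagged for each vulnerability. How the flagged sets are extracted depends on bulk_analyze.py's JSON schema, but the essence of what infoflow-comparison.py reports can be sketched as follows (the function name and example filenames are ours):

```python
def diff_flagged(old_flagged, new_flagged):
    """Contracts flagged by exactly one of the two ethainter variants
    for a given vulnerability relation."""
    old_flagged, new_flagged = set(old_flagged), set(new_flagged)
    return {
        "only_old": sorted(old_flagged - new_flagged),
        "only_new": sorted(new_flagged - old_flagged),
    }

# Example with illustrative contract filenames:
print(diff_flagged({"a.hex", "b.hex"}, {"b.hex", "c.hex"}))
# {'only_old': ['a.hex'], 'only_new': ['c.hex']}
```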

Gas of Fallback Functions: FallbackRetracer

The results of the FallbackRetracer tool cannot be reproduced, because they require access to an archival Ethereum node, which takes up over 4.5 TB of SSD storage and takes around a month to sync.

Repeated Calls

The contract sources and bytecodes of the manually inspected contracts (figure 8) are available under ~/manual-inspection/RepeatedCallsOurs and ~/manual-inspection/RepeatedCallsSecurify.

Comparison with Securify

As this experiment relies on manual inspection, we run Securify on sources to get source mappings. However, this complicates the workflow significantly:

  • Source files may contain multiple contracts; one needs to find which contract matches the bytecode deployed at the given address.
  • Using the exact compiler version the contract was originally compiled with is critical. Different compiler versions (even different commits under the same version) can produce significantly different bytecode, thus affecting the results of the analysis. We encountered this disparity while conducting the experiment.

This critical metadata was acquired manually by the people carrying out the experiment, who installed the correct Solidity compiler version and configured Securify to use it for its analysis, on a per-contract basis. The results were also filtered to include only warnings for code in the "main" contract or any super-contract.

Due to these conditions, we are unable to provide a mechanism to automatically replicate the results of this experiment.