This is the artifact for the paper "Identifying Java Calls in Native Code via Binary Scanning" (ISSTA 2020). The aim of this artifact is to reproduce the evaluation results (Figures 16 and 17 in the paper).

The artifact is packaged as a VirtualBox image running a lightweight version of Ubuntu 18.04. The virtual machine needs four CPU cores, 20 GB RAM, and up to 31 GB disk space (the maximum size of the dynamically allocated hard disk image). Some benchmarks need less memory; see section "Troubleshooting" for instructions on running a subset of the benchmarks.

The artifact has been tested on an AMD Ryzen 7 3700U laptop with 32 GB RAM, running Ubuntu 18.04 and VirtualBox 5.2.34.

== Installation ==

Start VirtualBox and import file native-artifact.ova via menu File -> Import Appliance. Start the virtual machine; the desktop will appear, logged in as user "user" (password: "user").

== Running the Benchmarks ==

To run the benchmarks, open a terminal (LXTerminal) and run the following commands:

  cd ~/artifact/doop
  bin/bench-native-scanner.sh analyze

When the "analyze" step above has finished, run the following command to generate the results table:

  bin/bench-native-scanner.sh report

This command should print a table containing the data of Figures 16 and 17 in the paper.

The four XCorpus benchmarks are available in ~/artifact/xcorpus (Java code) and ~/artifact/xcorpus-native-extension (native-code companion repo). The native methods that form the ground truth of these benchmarks are found in the feature-analysis output of XCorpus:

  ~/artifact/xcorpus/misc/featureanalysis/output/analysis-results-details.csv

The two Android benchmarks are available in ~/artifact/doop-benchmarks/android-benchmarks and are replicated from the HeapDL artifact benchmarks (https://bitbucket.org/yanniss/doop-benchmarks).

== Examining/Reusing Analysis Results ==

For each analyzed benchmark, analysis data can be found in directory ~/artifact/doop/out/context-insensitive/<id>, where <id> indicates the analysis run. Each such directory contains (a) the output "database" relations (as tab-separated .csv files) and (b) the Datalog analysis logic that was compiled (.dl files).

To determine which analysis to examine:

(a) For the Android benchmarks, three analyses run: a baseline analysis, a HeapDL-based analysis (to compute base recall), and the "scanner" analysis of the paper.
(b) For the XCorpus benchmarks, two analyses run: a baseline analysis and the "scanner" analysis of the paper.

All Doop analysis logic can be found under ~/artifact/doop/souffle-logic. Changing logic rules will force recompilation of the analysis logic the next time the analyses run.

For each benchmark, file ~/artifact/doop/extra-entry-points-<benchmark>.log contains the extra entry points computed.

== Troubleshooting ==

Q. Are some benchmarks easier to run?

A. The two Android benchmarks (Chrome/Instagram) are quite heavy, while log4j from the XCorpus suite is the smallest benchmark. Example RAM usage of each benchmark follows:

--------------------------
| Benchmark     | Memory |
--------------------------
| aspectj-1.6.9 |  5.6GB |
| log4j-1.2.16  |  3.0GB |
| lucene-4.3.0  |  4.7GB |
| tomcat-7.0.2  |  3.7GB |
| chrome        | 11.0GB |
| instagram     | 13.0GB |
--------------------------

Q. Can I run a subset of the benchmarks?

A. Benchmarks can be disabled by editing the commands at the bottom of script bin/bench-native-scanner.sh. For example, to disable the Chrome benchmark, comment out (with "#") its "runDoop" line; to disable AspectJ, comment out "analyzeAspectJ", etc. (See the sketch below.)
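As a purely illustrative sketch (the argument lists of these calls are the ones already present in the script and are elided here as "..."), the bottom of bin/bench-native-scanner.sh could look roughly as follows after disabling Chrome and AspectJ:

  # runDoop ...        # Chrome benchmark disabled; original arguments left unchanged
  # analyzeAspectJ     # AspectJ benchmark disabled

Re-running "bin/bench-native-scanner.sh analyze" will then skip the commented-out benchmarks.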
Q. What are some indicative analysis times on laptop hardware?

A. Below are the times measured on the test machine, running the benchmark script once, in VirtualBox:

-------------------------------------------------------------
| Benchmark      | Analysis time (sec) | Factgen time (sec) |
|                | (base, scanner)     | (base, scanner)    |
-------------------------------------------------------------
| chrome         | 533, 587            | 172, 406           |
-------------------------------------------------------------
| instagram      | 622, 769            | 182, 190           |
-------------------------------------------------------------
| aspectj-1.6.9  | 269, 267            | 151, 145           |
-------------------------------------------------------------
| log4j-1.2.16   | 72, 73              | 95, 82             |
-------------------------------------------------------------
| lucene-4.3.0   | 134, 358            | 99, 112            |
-------------------------------------------------------------
| tomcat-7.0.2   | 90, 330             | 115, 123           |
-------------------------------------------------------------

Observed analysis times can vary between analysis runs without changing metrics such as recall, app-reachable, or the number of added entry points. Note that VirtualBox adds roughly 2x overhead compared to native execution.

Q. Entry points for tomcat differ slightly from those published in the submitted paper (e.g. 319 vs. 308).

A. This is due to recent tweaks in the analysis logic. The difference does not significantly alter the main metric ("app-reachable") compared to the number in the submitted paper. We will update the paper accordingly.

Q. The Android benchmarks use the simpler scanning method, not Radare2. Why?

A. Compared to the simpler scanning method (Section 3, Figures 6 and 7 in the paper), using Radare2 on the two Android benchmarks offers no improvement in the "app-reachable"/recall metrics that are the focus of the evaluation section. However, Radare2 uses more memory and is thus disabled to make it easier to run these two analyses inside VirtualBox. To enable Radare2 for these two benchmarks, add flag "--use-radare" in step "3." of function "runDoop()" in script bench-native-scanner.sh, as sketched below.
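As a hedged sketch of that edit (the real Doop invocation in step "3." of runDoop() contains more options; "<doop invocation> <existing options>" below is only a placeholder for whatever the script already runs), the change amounts to appending the flag to the existing command:

  # Before (step 3. of runDoop(), placeholder for the existing command):
  #   <doop invocation> <existing options>
  # After:
  #   <doop invocation> <existing options> --use-radare

Keep in mind the memory note above: enabling Radare2 increases the memory needed by the two Android analyses.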