This is the artifact for the paper "Identifying Java Calls in Native Code via Binary Scanning" (ISSTA 2020). The aim of this artifact is to reproduce the evaluation results (Figures 16 and 17 in the paper).

The artifact is packaged as a VirtualBox image running a lightweight version of Ubuntu 18.04. The virtual machine needs four CPU cores, 20 GB RAM, and up to 31 GB disk space (the maximum size of the dynamically allocated hard disk image). Some benchmarks need less memory; see section "Troubleshooting" for instructions on running a subset of the benchmarks.

The artifact has been tested on an AMD Ryzen 7 3700U laptop with 32 GB RAM, running Ubuntu 18.04 and VirtualBox 5.2.34.

== Installation ==

Start VirtualBox and import file native-artifact.ova via menu File -> Import Appliance. Start the virtual machine; the desktop will appear, logged in as user "user" (password: "user").

== Running the Benchmarks ==

To run the benchmarks, open a terminal (LXTerminal) and run the following commands:

  cd ~/artifact/doop
  bin/bench-native-scanner.sh analyze

When the "analyze" step above has finished, run the following command to generate the results table:

  bin/bench-native-scanner.sh report

This command should print a table containing the data of Figures 16 and 17 in the paper.

The four XCorpus benchmarks are available in ~/artifact/xcorpus (Java code) and ~/artifact/xcorpus-native-extension (native-code companion repo). The native methods that form the ground truth of these benchmarks are found in the feature-analysis output of XCorpus:

  ~/artifact/xcorpus/misc/featureanalysis/output/analysis-results-details.csv

The two Android benchmarks are available in ~/artifact/doop-benchmarks/android-benchmarks and are replicated from the HeapDL artifact benchmarks (https://bitbucket.org/yanniss/doop-benchmarks).

== Examining/Reusing Analysis Results ==

For each analyzed benchmark, analysis data can be found in directory ~/artifact/doop/out/context-insensitive/<id>, where <id> indicates the analysis run. Each such directory contains (a) the output "database" relations (as tab-separated .csv files) and (b) the Datalog analysis logic that was compiled (.dl files).

To determine which analysis to examine:

(a) For the Android benchmarks, three analyses run: a baseline analysis, a HeapDL-based analysis (to compute base recall), and the "scanner" analysis of the paper.
(b) For the XCorpus benchmarks, two analyses run: a baseline analysis and the "scanner" analysis of the paper.

All Doop analysis logic can be found under ~/artifact/doop/souffle-logic. Changing logic rules will force recompilation of the analysis logic the next time the analyses run.

For each benchmark, file ~/artifact/doop/extra-entry-points-<benchmark>.log contains the extra entry points computed.

== Troubleshooting ==

Q. Are some benchmarks easier to run?

A. The two Android benchmarks (Chrome/Instagram) are quite heavy, while log4j from the XCorpus suite is the smallest benchmark. Example RAM usage of each benchmark follows:

--------------------------
| Benchmark     | Memory |
--------------------------
| aspectj-1.6.9 |  5.6GB |
| log4j-1.2.16  |  3.0GB |
| lucene-4.3.0  |  4.7GB |
| tomcat-7.0.2  |  3.7GB |
| chrome        | 11.0GB |
| instagram     | 13.0GB |
--------------------------

Q. Can I run a subset of the benchmarks?

A. Benchmarks can be disabled by editing the commands at the bottom of script bin/bench-native-scanner.sh. For example, to disable the Chrome benchmark, comment out (with "#") its "runDoop" line; to disable AspectJ, comment out "analyzeAspectJ", etc. (See the sketch below.)
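As a purely illustrative sketch (the argument lists of these calls are the ones already present in the script and are elided here as "..."), the bottom of bin/bench-native-scanner.sh could look roughly as follows after disabling Chrome and AspectJ:

  # runDoop ...        # Chrome benchmark disabled; original arguments left unchanged
  # analyzeAspectJ     # AspectJ benchmark disabled

Re-running "bin/bench-native-scanner.sh analyze" will then skip the commented-out benchmarks.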
Q. What are some indicative analysis times on laptop hardware?

A. Below are the times measured on the test machine, running the benchmark script once, in VirtualBox:

-------------------------------------------------------------
| Benchmark      | Analysis time (sec) | Factgen time (sec) |
|                | (base, scanner)     | (base, scanner)    |
-------------------------------------------------------------
| chrome         | 533, 587            | 172, 406           |
-------------------------------------------------------------
| instagram      | 622, 769            | 182, 190           |
-------------------------------------------------------------
| aspectj-1.6.9  | 269, 267            | 151, 145           |
-------------------------------------------------------------
| log4j-1.2.16   | 72, 73              | 95, 82             |
-------------------------------------------------------------
| lucene-4.3.0   | 134, 358            | 99, 112            |
-------------------------------------------------------------
| tomcat-7.0.2   | 90, 330             | 115, 123           |
-------------------------------------------------------------

Observed analysis times can vary between analysis runs without changing metrics such as recall, app-reachable, or the number of added entry points. Note that VirtualBox adds roughly 2x overhead compared to native execution.

Q. Entry points for tomcat differ slightly from those published in the submitted paper (e.g. 319 vs. 308).

A. This is due to recent tweaks in the analysis logic. The difference does not significantly alter the main metric ("app-reachable") compared to the number in the submitted paper. We will update the paper accordingly.

Q. The Android benchmarks use the simpler scanning method, not Radare2. Why?

A. Compared to the simpler scanning method (Section 3, Figures 6 and 7 in the paper), using Radare2 on the two Android benchmarks offers no improvement in the "app-reachable"/recall metrics that are the focus of the evaluation section. However, Radare2 uses more memory and is thus disabled to make it easier to run these two analyses inside VirtualBox. To enable Radare2 for these two benchmarks, add flag "--use-radare" in step "3." of function "runDoop()" in script bench-native-scanner.sh, as sketched below.
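As a hedged sketch of that edit (the real Doop invocation in step "3." of runDoop() contains more options; "<doop invocation> <existing options>" below is only a placeholder for whatever the script already runs), the change amounts to appending the flag to the existing command:

  # Before (step 3. of runDoop(), placeholder for the existing command):
  #   <doop invocation> <existing options>
  # After:
  #   <doop invocation> <existing options> --use-radare

Keep in mind the memory note above: enabling Radare2 increases the memory needed by the two Android analyses.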