Published June 16, 2020 | Version 1.0.1
Dataset Open

NJR-1 Dataset

Description

NJR is a Normalized Java Resource. The NJR-1 dataset consists of 293 Java programs that can be used with several analysis tools.

 

TOOLS THAT RUN ON NJR-1

Each program runs successfully with the following 14 Java static analysis tools:

  1. SpotBugs (https://spotbugs.github.io)
  2. Wala (https://wala.github.io)
  3. Doop (https://bitbucket.org/yanniss/doop)
  4. Soot (https://github.com/soot-oss/soot)
  5. Petablox (https://github.com/petablox/petablox)
  6. Infer (https://fbinfer.com)
  7. Error-Prone (http://errorprone.info)
  8. Checker-Framework (https://checkerframework.org)
  9. Opium (Opal-framework) (https://www.opal-project.de)
  10. Spoon (https://spoon.gforge.inria.fr)
  11. PMD (https://pmd.github.io)
  12. CheckStyle (https://checkstyle.org)
  13. JavaParser (https://javaparser.org/)
  14. Codeguru* (https://aws.amazon.com/codeguru)

In addition to these static analysis tools, the NJR dataset has also been tested with 9 other tools that operate on Java bytecode.

  1. Jacoco (https://www.jacoco.org): Dynamic analysis tool
  2. Wiretap (https://github.com/ucla-pls/wiretap): Dynamic analysis tool
  3. JReduce (https://github.com/ucla-pls/jreduce): Bytecode reduction tool
  4. QueryMax (https://doi.org/10.5281/zenodo.5551128): Preprocessor for application code analysis
  5. Call-Graph Pruner (https://doi.org/10.5281/zenodo.5177161): Static call-graph pruning tool
  6. FootPatch (https://github.com/squaresLab/footpatch): Automated Repair Tool
  7. Procyon (https://github.com/ststeiger/procyon): Decompiler
  8. CFR (https://www.benf.org/other/cfr/): Decompiler
  9. Fernflower (https://github.com/fesh0r/fernflower): Decompiler

 

BENCHMARK PROGRAMS

The NJR programs are repositories picked from a set of Java 8 projects on GitHub that compile and run successfully. Each program comes with a jar file, the compiled bytecode files, compiled library files, and the Java source code. The availability of the files both as a jar file and as source code (with the compiled library classes) is a major reason the dataset works with so many tools without extra effort. These features make NJR-1 suitable as a benchmark for Java static analysis, dynamic analysis, or tool building.

Internally, each benchmark program has the following structure:

  • src: directory with source files.
  • classes: directory with class files.
  • lib: compiled third-party library classes (source files are not available, since the libraries are distributed as class files).
  • jarfile: jar file containing the compiled application classes and third-party library classes.
  • info: directory with information about the program. It includes the following files.
    • classes: list of application classes (excludes third-party library classes).
    • mainclasses: list of main classes that can be run.
    • sources: list of source file names.
    • declarations: list of method declarations categorized by source file name.
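
The layout above can be read programmatically. The following is a minimal sketch, not part of the dataset's scripts: it assumes that the files under info are plain text with one entry per line, which the description does not state explicitly. The function names read_list and load_benchmark are hypothetical.

```python
from pathlib import Path


def read_list(path):
    """Return the non-empty, stripped lines of a text file."""
    text = Path(path).read_text()
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_benchmark(benchmark_dir):
    """Collect basic facts about one benchmark directory.

    Assumption: info/classes, info/mainclasses and info/sources hold
    one entry per line.
    """
    root = Path(benchmark_dir)
    info = root / "info"
    return {
        "name": root.name,
        "classes": read_list(info / "classes"),
        "mainclasses": read_list(info / "mainclasses"),
        "sources": read_list(info / "sources"),
        # Every regular file inside the jarfile directory.
        "jars": sorted(p.name for p in (root / "jarfile").iterdir() if p.is_file()),
    }
```

Calling load_benchmark on a benchmark directory returns a dictionary whose "mainclasses" entry lists the classes that can be passed to java.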

The benchmarks already come with a compiled JAR file, but some tools need to compile and run the benchmarks themselves. The following commands can be used for compilation and running. Replace <jarfilename> with the file in the jarfile directory, and replace <mainclassname> with any of the classes from info/mainclasses:

javac -d compiled_classes -cp lib @info/sources

java -cp jarfile/<jarfilename> <mainclassname>
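
When many benchmarks have to be compiled or run, the two commands above can be assembled per benchmark. The sketch below only builds the argument lists. It assumes that info/mainclasses holds one class name per line and that the first file found in the jarfile directory is the one to use. The helper names compile_command and run_command are hypothetical and are not taken from scripts.zip.

```python
from pathlib import Path


def compile_command(out_dir="compiled_classes"):
    """Argument list for the javac command shown above."""
    return ["javac", "-d", out_dir, "-cp", "lib", "@info/sources"]


def run_command(benchmark_dir, main_class=None):
    """Argument list for the java command shown above.

    If main_class is None, the first entry of info/mainclasses is used
    (assumption: one class name per line).
    """
    root = Path(benchmark_dir)
    jars = sorted(p for p in (root / "jarfile").iterdir() if p.is_file())
    if not jars:
        raise FileNotFoundError(f"no jar file found in {root / 'jarfile'}")
    if main_class is None:
        lines = (root / "info" / "mainclasses").read_text().splitlines()
        candidates = [line.strip() for line in lines if line.strip()]
        if not candidates:
            raise ValueError(f"no main classes listed for {root}")
        main_class = candidates[0]
    # The path stays relative, so the command must run inside benchmark_dir.
    return ["java", "-cp", f"jarfile/{jars[0].name}", main_class]
```

Both argument lists use paths relative to the benchmark directory, so they are meant to be passed to subprocess.run with cwd set to that directory. Some javac releases expect the -d directory to exist beforehand, so creating compiled_classes first is a cautious step.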

 

FILES AVAILABLE FOR DOWNLOAD

There are 4 files available for download: njr-1_dataset.zip, scripts.zip, benchmark_stats.zip, and a Readme.

njr-1_dataset.zip contains the dataset programs. scripts.zip contains Python 3 scripts, one per tool, that run the tool on the entire dataset. The Readme gives the version number, download link, and setup instructions for each tool. benchmark_stats.zip lists some statistics for the benchmark programs.

 

STATISTICS

Here are some summary statistics about the benchmark programs:

  • The mean number of application classes: 97
  • Each program executes at least 100 unique application methods at runtime.
  • The mean lines of application source code: 9911
  • The mean number of third-party library classes: 2608
  • The mean (estimated) lines of third-party library source code: 250,000
  • The mean number of static edges in the application call graph: 1404
  • The mean number of dynamic edges in the application call graph: 469 
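
A figure such as the mean number of application classes can be recomputed from the info directories. The following sketch makes two assumptions that the description does not confirm: the unpacked dataset has one directory per benchmark directly under a root directory, and info/classes lists one class per line. The function names are hypothetical, and this is not the procedure used to produce the numbers above.

```python
from pathlib import Path
from statistics import mean


def count_entries(list_file):
    """Count the non-empty lines of a list file."""
    lines = Path(list_file).read_text().splitlines()
    return sum(1 for line in lines if line.strip())


def mean_application_classes(dataset_root):
    """Mean number of entries in info/classes over all benchmarks.

    Assumption: each benchmark is a direct subdirectory of dataset_root.
    """
    counts = [
        count_entries(entry / "info" / "classes")
        for entry in sorted(Path(dataset_root).iterdir())
        if (entry / "info" / "classes").is_file()
    ]
    if not counts:
        raise ValueError(f"no benchmarks found under {dataset_root}")
    return mean(counts)
```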

 

NOTES

Note: Zenodo shows some changes for this repository. All of them update the scripts folder as more tools are tested on the dataset; the programs in the dataset themselves remain unchanged.

*Note 2: Codeguru Reviewer is a paid, proprietary tool by Amazon. Our experiments show that it runs successfully on all the benchmarks in this dataset. However, we don't include any scripts to replicate this run because of its paid nature.

To cite this dataset, please cite the following paper:
Jens Palsberg and Cristina V. Lopes, NJR: a Normalized Java Resource. 
In Proceedings of ACM SIGPLAN International Workshop on State Of the Art in Program Analysis (SOAP), 2018.

Notes

Funded by the following NSF grant (https://www.nsf.gov/awardsearch/showAward?AWD_ID=1823360&HistoricalAwards=false)

Files

Files (2.6 GB)

  • md5:5bb32cc08289615e7727359b52ab1bf7 (64.7 kB)
  • md5:683b6ec448574e32bab268b7f026b7c4 (2.6 GB)
  • md5:7e54768c13d88aa6712f68bf0c3a5efe (14.7 kB)
  • md5:97f435f5651a982e3eaf858065268c13 (641.3 kB)
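
A downloaded file can be checked against the md5 value listed for it. The sketch below uses Python's standard hashlib module. The function names md5_of_file and matches_listed_md5 are hypothetical, and the caller supplies both the local path and the checksum string.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use small even for the 2.6 GB download.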

Additional details

Related works

Is supplement to
Conference paper: 10.1145/3236454.3236501 (DOI)

Funding

U.S. National Science Foundation
CRI: CI-New: Collaborative Research: NJR: A Normalized Java Resource 1823360