There is a newer version of this record available.

Dataset Open Access

DeepDataFlow

Chris Cummins

This dataset contains 493k LLVM-IR files taken from a wide range of projects and source programming languages, together with labels for several compiler data flow analyses. We also include the logs of the machine learning jobs that produced our published experimental results.

The uncompressed dataset uses the following layout:

  • labels/
    • Directory containing machine learning features and labels for programs for compiler data flow analyses.
    • labels/<analysis>/<source>.<id>.<lang>.ProgramFeaturesList.pb
      • ProgramFeaturesList protocol buffer containing a list of features resulting from running a data flow analysis on a program.
  • graphs/
    • Directory containing ProGraML representations of LLVM IRs.
    • graphs/<source>.<id>.<lang>.ProgramGraph.pb
      • ProgramGraph protocol buffer of an LLVM IR in the ProGraML representation.
  • ir/
    • Directory containing LLVM-IR files.
    • ir/<source>.<id>.<lang>.ll
      • An LLVM IR in text format, as produced by clang -emit-llvm -S or equivalent.
  • test/
    • A directory containing symlinks to graphs in the graphs/ directory, indicating which graphs should be used as part of the test set.
  • train/
    • A directory containing symlinks to graphs in the graphs/ directory, indicating which graphs should be used as part of the training set.
  • val/
    • A directory containing symlinks to graphs in the graphs/ directory, indicating which graphs should be used as part of the validation set.
  • vocab/
    • Directory containing vocabulary files.
    • vocab/<type>.csv
      • A vocabulary file, which lists unique node texts, their frequency in the dataset, and the cumulative proportion of total unique node texts that is covered.
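
The naming scheme above means a graph file name carries enough information to locate the matching label file. The following Python sketch illustrates this under stated assumptions: the example file name `github.1234.c.ProgramGraph.pb` and the analysis directory name `reachability` are hypothetical, and the helper names `parse_graph_name` and `labels_path` are ours, not part of the dataset or of ProGraML.

```python
from pathlib import Path
from typing import NamedTuple

GRAPH_SUFFIX = ".ProgramGraph.pb"
LABELS_SUFFIX = ".ProgramFeaturesList.pb"


class GraphName(NamedTuple):
    """The <source>, <id> and <lang> components of a dataset file name."""
    source: str
    id: str
    lang: str


def parse_graph_name(filename: str) -> GraphName:
    """Split '<source>.<id>.<lang>.ProgramGraph.pb' into its components."""
    if not filename.endswith(GRAPH_SUFFIX):
        raise ValueError(f"not a ProgramGraph file name: {filename!r}")
    stem = filename[: -len(GRAPH_SUFFIX)]
    # Assumption: <id> and <lang> contain no dots. Splitting from the right
    # tolerates dots inside <source>, should any occur.
    source, id_, lang = stem.rsplit(".", 2)
    return GraphName(source, id_, lang)


def labels_path(root: Path, analysis: str, name: GraphName) -> Path:
    """Build labels/<analysis>/<source>.<id>.<lang>.ProgramFeaturesList.pb."""
    filename = f"{name.source}.{name.id}.{name.lang}{LABELS_SUFFIX}"
    return root / "labels" / analysis / filename


if __name__ == "__main__":
    # Hypothetical file name, used only to illustrate the layout.
    name = parse_graph_name("github.1234.c.ProgramGraph.pb")
    print(name)
    print(labels_path(Path("dataset"), "reachability", name))
```

The sketch only manipulates paths. Decoding the protocol buffers themselves requires the message definitions from the ProGraML repository.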

For further information please see our ProGraML repository.

Files (8.8 GB)

Name                                      Size       Checksum
classifyapp_2020.05.06.tar.bz2            1.9 GB     md5:8398df80abc564cf74143ef4740ec833
dataflow_logs_20.06.01.tar.bz2            265.6 MB   md5:10ad56f31bafa85a96d896f4ea0b387f
devmap_2020.06.27.tar.bz2                 10.6 MB    md5:4491105b61eb534ce42f7d342e88af27
graphs_20.06.01.tar.bz2                   3.8 GB     md5:5812e41db6f11720454003762e7a8b0b
labels_datadep_20.06.01.tar.bz2           69.8 MB    md5:ec91e691882eb658be138fd0fbed1b26
labels_domtree_20.06.01.tar.bz2           69.0 MB    md5:d515819b6041b27eb9f592d46761639f
labels_liveness_20.06.01.tar.bz2          124.8 MB   md5:c3879f4c3fa1d339a3aad7cd9d4c2188
labels_reachability_20.06.01.tar.bz2      84.1 MB    md5:96bbc6a8d44fe6b17a8f7f76ea40148e
labels_subexpressions_20.06.01.tar.bz2    71.4 MB    md5:128f0e67fb9bd2b72ede055ab236c49e
llvm_bc_20.06.01.tar.bz2                  1.3 GB     md5:76815e3344101a504b224f10175b7dfa
llvm_ir_20.06.01.tar.bz2                  1.1 GB     md5:a9303e635f60b521119c2801972b6781
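
Each archive above is listed with an MD5 checksum, which can be used to confirm that a download completed without corruption. A minimal Python sketch of such a check follows. The function names `md5_of_file` and `verify_archive` are illustrative, and the sketch assumes the archive has already been downloaded to a local path.

```python
import hashlib
from pathlib import Path
from typing import Union


def md5_of_file(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path: Union[str, Path], expected: str) -> bool:
    """Compare a file's MD5 digest with a listed checksum.

    `expected` may carry the 'md5:' prefix used in the file listing.
    """
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


if __name__ == "__main__":
    # Assumes the archive sits in the current working directory.
    archive = Path("devmap_2020.06.27.tar.bz2")
    if archive.exists():
        ok = verify_archive(archive, "md5:4491105b61eb534ce42f7d342e88af27")
        print("checksum OK" if ok else "checksum MISMATCH")
```

Reading in fixed-size chunks matters here because the largest archives are several gigabytes. MD5 is adequate for detecting accidental corruption but should not be relied on as a defence against deliberate tampering.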
  • Cummins, C., Fisches, Z. V., Ben-Nun, T., Hoefler, T., & Leather, H. (2020). ProGraML: Graph-based Deep Learning for Program Optimization and Analysis. arXiv preprint arXiv:2003.10536.
