There is a newer version of the record available.

Published June 1, 2020 | Version 2020.06.01
Dataset Open

DeepDataFlow

Creators

  • 1. University of Edinburgh

Description

This dataset contains 493k LLVM-IR files drawn from a wide range of projects and source programming languages, together with labels for several compiler data flow analyses. It also includes the logs of the machine learning jobs that produced our published experimental results.

The uncompressed dataset uses the following layout:

  • labels/
    • Directory containing machine learning features and labels produced by running compiler data flow analyses on programs.
    • labels/<analysis>/<source>.<id>.<lang>.ProgramFeaturesList.pb
      • ProgramFeaturesList protocol buffer containing a list of features resulting from running a data flow analysis on a program.
  • graphs/
    • Directory containing ProGraML representations of LLVM IRs.
    • graphs/<source>.<id>.<lang>.ProgramGraph.pb
      • ProgramGraph protocol buffer of an LLVM IR in the ProGraML representation.
  • ir/
    • Directory containing LLVM-IR files.
    • ir/<source>.<id>.<lang>.ll
      • An LLVM IR in text format, as produced by clang -emit-llvm -S or equivalent.
  • test/
    • A directory containing symlinks to graphs in the graphs/ directory, indicating which graphs should be used as part of the test set.
  • train/
    • A directory containing symlinks to graphs in the graphs/ directory, indicating which graphs should be used as part of the training set.
  • val/
    • A directory containing symlinks to graphs in the graphs/ directory, indicating which graphs should be used as part of the validation set.
  • vocab/
    • Directory containing vocabulary files.
    • vocab/<type>.csv
      • A vocabulary file, which lists unique node texts, their frequency in the dataset, and the cumulative proportion of total unique node texts that is covered.
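The naming scheme in the layout above means that the graph, IR, and label files for one program can be located from a single file name. The following is a minimal sketch of that mapping, not part of the dataset tooling. It assumes that the <id> and <lang> fields contain no dots, that symlinks in train/, val/ and test/ carry the same file name as the graph they point to, and that LLVM-IR files follow the file pattern given above. The names ProgramName, parse_graph_name, related_paths and split_members are hypothetical, as are the sample file name and the analysis name used in the usage note.

```python
from pathlib import Path
from typing import Dict, List, NamedTuple


class ProgramName(NamedTuple):
    """The three fields encoded in a dataset file name (hypothetical helper)."""
    source: str
    id: str
    lang: str


GRAPH_SUFFIX = ".ProgramGraph.pb"


def parse_graph_name(filename: str) -> ProgramName:
    """Split '<source>.<id>.<lang>.ProgramGraph.pb' into its fields."""
    if not filename.endswith(GRAPH_SUFFIX):
        raise ValueError(f"not a ProgramGraph file: {filename}")
    stem = filename[: -len(GRAPH_SUFFIX)]
    # Assumption: <id> and <lang> contain no dots, so split from the right.
    source, prog_id, lang = stem.rsplit(".", 2)
    return ProgramName(source, prog_id, lang)


def related_paths(root: Path, name: ProgramName, analysis: str) -> Dict[str, Path]:
    """Build the graph, IR, and label paths for one program under `root`."""
    base = f"{name.source}.{name.id}.{name.lang}"
    return {
        "graph": root / "graphs" / f"{base}{GRAPH_SUFFIX}",
        # Follows the documented file pattern for LLVM-IR files.
        "ir": root / "ir" / f"{base}.ll",
        "labels": root / "labels" / analysis / f"{base}.ProgramFeaturesList.pb",
    }


def split_members(root: Path, split: str) -> List[ProgramName]:
    """List programs whose graphs are symlinked from train/, val/ or test/."""
    # Assumption: each symlink keeps the file name of its target graph.
    return sorted(parse_graph_name(p.name) for p in (root / split).iterdir())
```

For example, with a made-up file name, parse_graph_name("github.123.c.ProgramGraph.pb") yields the fields github, 123 and c, and related_paths(Path("dataflow"), name, "reachability") returns the three matching paths under a dataflow/ root. Reading the .pb files themselves requires the protocol buffer definitions from the ProGraML repository.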

For further information please see our ProGraML repository.

Files (8.8 GB)

  • md5:8398df80abc564cf74143ef4740ec833 (1.9 GB)
  • md5:10ad56f31bafa85a96d896f4ea0b387f (265.6 MB)
  • md5:4491105b61eb534ce42f7d342e88af27 (10.6 MB)
  • md5:5812e41db6f11720454003762e7a8b0b (3.8 GB)
  • md5:ec91e691882eb658be138fd0fbed1b26 (69.8 MB)
  • md5:d515819b6041b27eb9f592d46761639f (69.0 MB)
  • md5:c3879f4c3fa1d339a3aad7cd9d4c2188 (124.8 MB)
  • md5:96bbc6a8d44fe6b17a8f7f76ea40148e (84.1 MB)
  • md5:128f0e67fb9bd2b72ede055ab236c49e (71.4 MB)
  • md5:76815e3344101a504b224f10175b7dfa (1.3 GB)
  • md5:a9303e635f60b521119c2801972b6781 (1.1 GB)
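
Each file above is listed with an MD5 checksum, which can be used to confirm that a multi-gigabyte download completed intact. The following is a minimal sketch using only the Python standard library. The function names md5_of_file and matches_listing are hypothetical, and the path of the downloaded file is supplied by the caller.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Stream the file so large archives are never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: Path, listed: str) -> bool:
    """Compare a file against a checksum written as 'md5:<hex digest>'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A call such as matches_listing(Path(downloaded_file), "md5:8398df80abc564cf74143ef4740ec833") returns True only if the file's contents hash to the listed value. A False result usually indicates a truncated or corrupted download.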

Additional details

Related works

Is cited by
Preprint: arXiv:2003.10536 (arXiv)

References

  • Cummins, C., Fisches, Z. V., Ben-Nun, T., Hoefler, T., & Leather, H. (2020). ProGraML: Graph-based Deep Learning for Program Optimization and Analysis. arXiv preprint arXiv:2003.10536.