Published October 4, 2019 | Version v2
Dataset Open

Code and Build Artifacts Dataset

  • 1. Two Six Labs

Description

This dataset is a compilation of source code projects and their related build outputs. The build process, which consisted of running the make command, completed successfully on 3,049 GitHub projects and produced over 30,000 build outputs from C and C++ projects. These outputs, referred to as derivatives, can include executables, .o files, .so files, .a files, or other project-specific build artifacts. A build was accepted as long as the make command completed without error, so there is no guarantee that every project contains every type of artifact. These data provide an association between source code and the build artifacts of that code.

The data directory contains one directory for each project downloaded from GitHub, named with the project's ID from GitHub's GraphQL API. Each of these directories contains a license.txt, a url.txt, a source directory, and a derivatives directory. The license.txt contains the license for the original project, the url.txt contains a link to the original project on GitHub, the source directory contains the original code, and the derivatives directory contains the outputs from building the project, which include the file types listed above.
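The directory layout described above can be explored with a short script. The following is a minimal Python sketch, not part of the dataset itself. It assumes the archive has been extracted so that each project directory sits directly under a root folder, and that the source and derivatives directories are literally named "source" and "derivatives"; the function names summarize_project and iter_projects are hypothetical.

```python
from collections import Counter
from pathlib import Path


def summarize_project(project_dir: Path) -> dict:
    """Summarize one project directory (named by its GitHub GraphQL ID).

    Assumes the layout from the description: license.txt, url.txt,
    and directories assumed to be named "source" and "derivatives".
    """
    url_file = project_dir / "url.txt"
    license_file = project_dir / "license.txt"
    derivatives = project_dir / "derivatives"

    # Count build artifacts by file extension (.o, .so, .a, ...).
    # Executables usually have no extension, so they get their own bucket.
    suffix_counts = Counter()
    if derivatives.is_dir():
        for path in derivatives.rglob("*"):
            if path.is_file():
                suffix_counts[path.suffix or "(no extension)"] += 1

    url = None
    if url_file.is_file():
        url = url_file.read_text(errors="replace").strip()

    return {
        "graphql_id": project_dir.name,
        "url": url,
        "has_license": license_file.is_file(),
        "has_source": (project_dir / "source").is_dir(),
        "artifact_counts": dict(suffix_counts),
    }


def iter_projects(data_root):
    """Yield one summary per project directory under data_root."""
    for entry in sorted(Path(data_root).iterdir()):
        if entry.is_dir():
            yield summarize_project(entry)
```

For example, `for row in iter_projects("data"): print(row["graphql_id"], row["artifact_counts"])` would list each project with its artifact counts, where "data" is an assumed extraction path. Because a build was accepted whenever make completed without error, some projects may show few or no artifacts of a given type.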

Files (7.5 GB)

Name Size
md5:3839a3efa75a4d36ffe4ef0885e235e7
7.5 GB