Published March 5, 2019 | Version 1.0.0
Dataset Open

The Software Heritage Graph Dataset

  • 1. Inria, France
  • 2. Athens University of Economics and Business, Greece
  • 3. University Paris Diderot and Inria, France

Description

Software Heritage is the largest existing public archive of software source
code and accompanying development history: it currently spans more than five
billion unique source code files and one billion unique commits, coming from
more than 80 million software projects.

This is the Software Heritage graph dataset: a fully-deduplicated
Merkle DAG representation of the Software Heritage archive. The dataset links
together file content identifiers, source code directories, Version Control
System (VCS) commits tracking evolution over time, up to the full states of VCS
repositories as observed by Software Heritage during periodic crawls. The
dataset’s contents come from major development forges (including GitHub and
GitLab), FOSS distributions (e.g., Debian), and language-specific package
managers (e.g., PyPI). Crawling information is also included, recording
when and where all archived source code artifacts have been observed in the
wild.
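
To illustrate what "fully-deduplicated Merkle DAG" means in practice, here is a minimal sketch in Python. It is not the Software Heritage implementation or its identifier scheme: the `MerkleStore` class, the `node_id` helper, and the way payloads are serialized are all assumptions made for this example. The point it shows is that a node's identifier is derived from its contents and from the identifiers of its children, so identical files and directories collapse into a single node no matter how many projects contain them.

```python
import hashlib


def node_id(kind: str, payload: bytes) -> str:
    """Hypothetical identifier: SHA-1 over the node kind and its payload."""
    return hashlib.sha1(kind.encode() + b"\0" + payload).hexdigest()


class MerkleStore:
    """Toy content-addressed store (illustrative only)."""

    def __init__(self):
        self.nodes = {}  # identifier -> (kind, value)

    def add_content(self, data: bytes) -> str:
        nid = node_id("content", data)
        # setdefault keeps the first copy, so duplicates are stored once
        self.nodes.setdefault(nid, ("content", data))
        return nid

    def add_directory(self, entries: dict) -> str:
        # entries maps a file name to a child identifier; sorting makes the
        # serialization (and therefore the identifier) deterministic
        payload = "\n".join(
            f"{name} {child}" for name, child in sorted(entries.items())
        ).encode()
        nid = node_id("directory", payload)
        self.nodes.setdefault(nid, ("directory", dict(entries)))
        return nid


store = MerkleStore()
readme = store.add_content(b"hello\n")
dir_a = store.add_directory({"README": readme})
dir_b = store.add_directory({"README": readme})  # same tree seen in another project
```

In this sketch `dir_a` and `dir_b` are the same identifier, and the store holds only two nodes (one file, one directory) even though the tree was added twice.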

The Software Heritage graph dataset is available in multiple formats, including
downloadable CSV dumps and Apache Parquet files for local use, as well as a
public instance on the Amazon Athena interactive query service for ready-to-use
analytical processing.

By accessing the dataset, you agree with the Software Heritage Ethical Charter
for using the archive data, and the terms of use for bulk access.

If you use this dataset for research purposes, please cite the following paper:

  • Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli. 
    The Software Heritage Graph Dataset: Public software development under one roof.
    In Proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Co-located with ICSE 2019.

You can also refer to the above paper for more information about the dataset and for sample queries.

Files

athena.zip

Files (2.5 TB)

Checksum                                Size
md5:d1f77570664ab7cce3baba7e4fe1f706    2.1 kB
md5:b12c4f438ddfd219ab5958150442a4b8    230.7 GB
md5:359c0e800f17bc76b7b031cff9c96b7a    516.4 GB
md5:b1d77d90920c3b33c3a7508eba985b47    108.7 GB
md5:4ebdd81c88f65c5114508749f6d4b261    186.0 GB
md5:5d60cb3a1107a7a7f40c1483101349c3    417.5 MB
md5:dc1f47d0dd34aceb630a25efcb555876    2.3 GB
md5:46a517c78774291aac94d824b0a42cef    3.1 GB
md5:1ae2608cb289849d9513b53e31dcafeb    96.8 MB
md5:44e59add29889f7c56e3f89d820d4f59    1.5 GB
md5:ea5eabc59ad881d0419657b8a817fbdf    107.4 GB
md5:76e8c0721b14cfa1ff552e501a92cff9    50.9 GB
md5:d857028e82c258085153c1e79f485236    4.3 MB
md5:b3119f604d4c9126cddf5552427f0bc4    1.6 GB
md5:739488ef204729e99750ad65b716f635    5.2 GB
md5:458ebf45f39dca7692b4972833a78d90    4.2 GB
md5:6fa71cd1515bc7be0d2a23a7563633ea    3.5 kB
md5:685c6f77d4e7ec296a64fecc52fb4168    290.8 GB
md5:03187f06b0d1c57eb6a90bfaea77ac9b    490.9 GB
md5:941bc475e88a1009245fb96a87ad212e    120.8 GB
md5:a6acd0be1536ce6d727ab74bd12f02ee    202.6 GB
md5:12fdf82e1c3451e81dd9137c6a728b00    405.4 MB
md5:fc1968d3ec5c14d541749cdc22cfa898    1.5 GB
md5:0683cba1bf83c57b8d45ff09b564442a    3.3 GB
md5:f7c92d038c0990428fc8d6ff2e08a518    53.3 MB
md5:074a622e9bff1fe0c384c3fc70677b07    1.5 GB
md5:c541834922e8efb23801fdc87ecf6338    113.4 GB
md5:ec5fce2327d9318d02d35716d9c6f097    40.2 GB
md5:c499fcad3804dad1713211bd55fd9d03    4.1 MB
md5:e395b3475c6e6830962399ef2d9c1fac    1.7 GB
md5:6aaa27a7bf1d42f03f72d21c3f516e7a    5.9 GB
md5:e93b3587f65417da717dc1d90d41db42    3.3 GB
md5:e641430767869bef5bfa6c606cf4c3f3    189 Bytes
md5:0e22533457eeff25035fa6f286d89fdc    3.2 kB
md5:4dcab9a3a848fbf6fe8b5f51e45e3a63    13.1 MB
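
Since several of the files are hundreds of gigabytes, it can be worth checking a download against its listed `md5:` value before use. The following is a minimal sketch, assuming Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_checksum` and the local file path in the usage note are hypothetical and not part of the dataset.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that very large files never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_checksum("downloaded_file", "md5:e641430767869bef5bfa6c606cf4c3f3")` would return `True` only if the local file (a placeholder name here) is byte-for-byte identical to the 189-byte file listed above.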