Dataset Open Access

The Software Heritage Graph Dataset

Antoine Pietri; Diomidis Spinellis; Stefano Zacchiroli

Software Heritage is the largest existing public archive of software source
code and accompanying development history: it currently spans more than five
billion unique source code files and one billion unique commits, coming from
more than 80 million software projects.

This is the Software Heritage graph dataset: a fully-deduplicated
Merkle DAG representation of the Software Heritage archive. The dataset links
together file content identifiers, source code directories, Version Control
System (VCS) commits tracking evolution over time, up to the full states of VCS
repositories as observed by Software Heritage during periodic crawls. The
dataset’s contents come from major development forges (including GitHub and
GitLab), FOSS distributions (e.g., Debian), and language-specific package
managers (e.g., PyPI). Crawling information is also included, recording when
and where all archived source code artifacts have been observed in the wild.
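
To illustrate what "fully-deduplicated Merkle DAG" means in practice, here is a minimal sketch in Python. It is not the identifier scheme used by Software Heritage: the hash function, the byte layout, and the names `content_id`, `directory_id`, and `store` are assumptions made for illustration only. The point it shows is that a node's identifier is derived from its contents and its children's identifiers, so identical files and directories collapse into a single node no matter how many projects contain them.

```python
import hashlib

# Hypothetical in-memory node store: identifier -> payload.
# Identical nodes map to the same key, which is what deduplication means here.
store = {}

def content_id(data: bytes) -> str:
    """Identifier of a file content, derived only from its bytes (illustrative layout)."""
    return hashlib.sha1(b"content\0" + data).hexdigest()

def directory_id(entries: dict) -> str:
    """Identifier of a directory, derived from its sorted (name, child id) pairs."""
    h = hashlib.sha1(b"directory\0")
    for name in sorted(entries):
        h.update(name.encode() + b"\0" + entries[name].encode() + b"\n")
    return h.hexdigest()

def add_content(data: bytes) -> str:
    cid = content_id(data)
    store[cid] = data
    return cid

def add_directory(entries: dict) -> str:
    did = directory_id(entries)
    store[did] = dict(entries)
    return did

# Two projects share the same LICENSE file but differ in main.py.
license_id = add_content(b"MIT License\n")
proj_a = add_directory({"LICENSE": license_id, "main.py": add_content(b"print('a')\n")})
proj_b = add_directory({"LICENSE": license_id, "main.py": add_content(b"print('b')\n")})

# A byte-identical copy of project A, added independently, yields the same node.
proj_a_copy = add_directory({
    "main.py": add_content(b"print('a')\n"),
    "LICENSE": add_content(b"MIT License\n"),
})
```

In this sketch the shared license file is stored once, and the copied project adds no new nodes at all, which is the property that lets an archive of many forks and mirrors stay compact.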

The Software Heritage graph dataset is available in multiple formats, including
downloadable CSV dumps and Apache Parquet files for local use, as well as a
public instance on the Amazon Athena interactive query service for
ready-to-use analytical processing.

By accessing the dataset, you agree with the Software Heritage Ethical Charter
for using the archive data, and the terms of use for bulk access.

If you use this dataset for research purposes, please cite the following paper:

  • Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli.
    The Software Heritage Graph Dataset: Public software development under one roof.
    In Proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Co-located with ICSE 2019.

You can also refer to the above paper for more information on the dataset and for sample queries.

Files (2.5 TB)
Name                              Size        MD5
athena.zip                        2.1 kB      d1f77570664ab7cce3baba7e4fe1f706
parquet_content.tar               230.7 GB    b12c4f438ddfd219ab5958150442a4b8
parquet_directory.tar             516.4 GB    359c0e800f17bc76b7b031cff9c96b7a
parquet_directory_entry_dir.tar   108.7 GB    b1d77d90920c3b33c3a7508eba985b47
parquet_directory_entry_file.tar  186.0 GB    4ebdd81c88f65c5114508749f6d4b261
parquet_directory_entry_rev.tar   417.5 MB    5d60cb3a1107a7a7f40c1483101349c3
parquet_origin.tar                2.3 GB      dc1f47d0dd34aceb630a25efcb555876
parquet_origin_visit.tar          3.1 GB      46a517c78774291aac94d824b0a42cef
parquet_person.tar                96.8 MB     1ae2608cb289849d9513b53e31dcafeb
parquet_release.tar               1.5 GB      44e59add29889f7c56e3f89d820d4f59
parquet_revision.tar              107.4 GB    ea5eabc59ad881d0419657b8a817fbdf
parquet_revision_history.tar      50.9 GB     76e8c0721b14cfa1ff552e501a92cff9
parquet_skipped_content.tar       4.3 MB      d857028e82c258085153c1e79f485236
parquet_snapshot.tar              1.6 GB      b3119f604d4c9126cddf5552427f0bc4
parquet_snapshot_branch.tar       5.2 GB      739488ef204729e99750ad65b716f635
parquet_snapshot_branches.tar     4.2 GB      458ebf45f39dca7692b4972833a78d90
README.md                         3.5 kB      6fa71cd1515bc7be0d2a23a7563633ea
sql_content.csv.gz                290.8 GB    685c6f77d4e7ec296a64fecc52fb4168
sql_directory.csv.gz              490.9 GB    03187f06b0d1c57eb6a90bfaea77ac9b
sql_directory_entry_dir.csv.gz    120.8 GB    941bc475e88a1009245fb96a87ad212e
sql_directory_entry_file.csv.gz   202.6 GB    a6acd0be1536ce6d727ab74bd12f02ee
sql_directory_entry_rev.csv.gz    405.4 MB    12fdf82e1c3451e81dd9137c6a728b00
sql_origin.csv.gz                 1.5 GB      fc1968d3ec5c14d541749cdc22cfa898
sql_origin_visit.csv.gz           3.3 GB      0683cba1bf83c57b8d45ff09b564442a
sql_person.csv.gz                 53.3 MB     f7c92d038c0990428fc8d6ff2e08a518
sql_release.csv.gz                1.5 GB      074a622e9bff1fe0c384c3fc70677b07
sql_revision.csv.gz               113.4 GB    c541834922e8efb23801fdc87ecf6338
sql_revision_history.csv.gz       40.2 GB     ec5fce2327d9318d02d35716d9c6f097
sql_skipped_content.csv.gz        4.1 MB      c499fcad3804dad1713211bd55fd9d03
sql_snapshot.csv.gz               1.7 GB      e395b3475c6e6830962399ef2d9c1fac
sql_snapshot_branch.csv.gz        5.9 GB      6aaa27a7bf1d42f03f72d21c3f516e7a
sql_snapshot_branches.csv.gz      3.3 GB      e93b3587f65417da717dc1d90d41db42
sql_swh_import.sql                189 Bytes   e641430767869bef5bfa6c606cf4c3f3
sql_swh_import_scripts.zip        3.2 kB      0e22533457eeff25035fa6f286d89fdc
swh-environment.tar.gz            13.1 MB     4dcab9a3a848fbf6fe8b5f51e45e3a63
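
The file list above gives an md5 digest for each file, which can be used to check that a download completed intact. The following is a minimal sketch in Python using only the standard library; the helper names `md5_of_file` and `verify_download` are hypothetical and not part of the dataset's tooling. It reads the file in chunks, since several of the archives are hundreds of gigabytes and will not fit in memory.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path, listed_md5):
    """Compare a local file against a digest from the file list.

    Accepts the digest with or without the "md5:" prefix.
    """
    expected = listed_md5.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected

# Example (assumes athena.zip has been downloaded to the current directory):
#   verify_download("athena.zip", "md5:d1f77570664ab7cce3baba7e4fe1f706")
```

md5 is adequate here for detecting transfer corruption; it should not be relied on as protection against deliberate tampering.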