Dataset Open Access
Antoine Pietri;
Diomidis Spinellis;
Stefano Zacchiroli
Software Heritage is the largest existing public archive of software source
code and accompanying development history: it currently spans more than five
billion unique source code files and one billion unique commits, coming from
more than 80 million software projects.
This is the Software Heritage graph dataset: a fully deduplicated
Merkle DAG representation of the Software Heritage archive. The dataset links
together file content identifiers, source code directories, Version Control
System (VCS) commits tracking evolution over time, up to the full states of VCS
repositories as observed by Software Heritage during periodic crawls. The
dataset’s contents come from major development forges (including GitHub and
GitLab), FOSS distributions (e.g., Debian), and language-specific package
managers (e.g., PyPI). Crawling information is also included, providing
timestamps about when and where all archived source code artifacts have been
observed in the wild.
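
To illustrate what "fully deduplicated Merkle DAG" means in practice, here is a minimal toy sketch. It is not Software Heritage's actual identifier scheme or data model: the `Store` class, the `node_id` helper, and the serialization format are all assumptions made up for illustration. The only point is that a node's identifier is derived from its own content plus the identifiers of its children, so identical files, directories, or commits collapse into a single node.

```python
import hashlib


def node_id(kind: str, payload: bytes) -> str:
    """Hypothetical identifier: SHA-1 over the node kind and its payload."""
    return hashlib.sha1(kind.encode() + b"\0" + payload).hexdigest()


class Store:
    """Toy content-addressed store; identical nodes are kept only once."""

    def __init__(self):
        self.nodes = {}

    def add_content(self, data: bytes) -> str:
        cid = node_id("content", data)
        self.nodes.setdefault(cid, data)
        return cid

    def add_directory(self, entries: dict) -> str:
        # entries maps a file name to a child identifier; sorting gives a
        # canonical serialization, so equal directories get equal ids.
        payload = "\n".join(
            f"{name} {child}" for name, child in sorted(entries.items())
        ).encode()
        did = node_id("directory", payload)
        self.nodes.setdefault(did, dict(entries))
        return did

    def add_revision(self, directory: str, parents: list, message: str) -> str:
        # A revision (commit) points to one root directory and its parents.
        payload = "\n".join([directory, *parents, message]).encode()
        rid = node_id("revision", payload)
        self.nodes.setdefault(
            rid, {"directory": directory, "parents": list(parents)}
        )
        return rid


store = Store()
readme = store.add_content(b"hello\n")
same_readme = store.add_content(b"hello\n")   # same bytes, same identifier
root_a = store.add_directory({"README": readme})
root_b = store.add_directory({"README": same_readme})  # deduplicated too
rev1 = store.add_revision(root_a, [], "initial commit")

main_py = store.add_content(b"print('hi')\n")
root2 = store.add_directory({"README": readme, "main.py": main_py})
rev2 = store.add_revision(root2, [rev1], "add main.py")
```

In this sketch the second commit reuses the unchanged `README` node instead of storing it again, which is the property that lets an archive of this kind share nodes across files, directories, and repositories.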
The Software Heritage graph dataset is available in multiple formats: downloadable
CSV dumps and Apache Parquet files for local use, and a public instance on the
Amazon Athena interactive query service for ready-to-use analytical processing.
By accessing the dataset, you agree to the Software Heritage Ethical Charter
for using the archive data and to the terms of use for bulk access.
If you use this dataset for research purposes, please cite the following paper:
You can also refer to the above paper for more information about the dataset and for sample queries.
Name | MD5 | Size |
---|---|---|
athena.zip | d1f77570664ab7cce3baba7e4fe1f706 | 2.1 kB |
parquet_content.tar | b12c4f438ddfd219ab5958150442a4b8 | 230.7 GB |
parquet_directory.tar | 359c0e800f17bc76b7b031cff9c96b7a | 516.4 GB |
parquet_directory_entry_dir.tar | b1d77d90920c3b33c3a7508eba985b47 | 108.7 GB |
parquet_directory_entry_file.tar | 4ebdd81c88f65c5114508749f6d4b261 | 186.0 GB |
parquet_directory_entry_rev.tar | 5d60cb3a1107a7a7f40c1483101349c3 | 417.5 MB |
parquet_origin.tar | dc1f47d0dd34aceb630a25efcb555876 | 2.3 GB |
parquet_origin_visit.tar | 46a517c78774291aac94d824b0a42cef | 3.1 GB |
parquet_person.tar | 1ae2608cb289849d9513b53e31dcafeb | 96.8 MB |
parquet_release.tar | 44e59add29889f7c56e3f89d820d4f59 | 1.5 GB |
parquet_revision.tar | ea5eabc59ad881d0419657b8a817fbdf | 107.4 GB |
parquet_revision_history.tar | 76e8c0721b14cfa1ff552e501a92cff9 | 50.9 GB |
parquet_skipped_content.tar | d857028e82c258085153c1e79f485236 | 4.3 MB |
parquet_snapshot.tar | b3119f604d4c9126cddf5552427f0bc4 | 1.6 GB |
parquet_snapshot_branch.tar | 739488ef204729e99750ad65b716f635 | 5.2 GB |
parquet_snapshot_branches.tar | 458ebf45f39dca7692b4972833a78d90 | 4.2 GB |
README.md | 6fa71cd1515bc7be0d2a23a7563633ea | 3.5 kB |
sql_content.csv.gz | 685c6f77d4e7ec296a64fecc52fb4168 | 290.8 GB |
sql_directory.csv.gz | 03187f06b0d1c57eb6a90bfaea77ac9b | 490.9 GB |
sql_directory_entry_dir.csv.gz | 941bc475e88a1009245fb96a87ad212e | 120.8 GB |
sql_directory_entry_file.csv.gz | a6acd0be1536ce6d727ab74bd12f02ee | 202.6 GB |
sql_directory_entry_rev.csv.gz | 12fdf82e1c3451e81dd9137c6a728b00 | 405.4 MB |
sql_origin.csv.gz | fc1968d3ec5c14d541749cdc22cfa898 | 1.5 GB |
sql_origin_visit.csv.gz | 0683cba1bf83c57b8d45ff09b564442a | 3.3 GB |
sql_person.csv.gz | f7c92d038c0990428fc8d6ff2e08a518 | 53.3 MB |
sql_release.csv.gz | 074a622e9bff1fe0c384c3fc70677b07 | 1.5 GB |
sql_revision.csv.gz | c541834922e8efb23801fdc87ecf6338 | 113.4 GB |
sql_revision_history.csv.gz | ec5fce2327d9318d02d35716d9c6f097 | 40.2 GB |
sql_skipped_content.csv.gz | c499fcad3804dad1713211bd55fd9d03 | 4.1 MB |
sql_snapshot.csv.gz | e395b3475c6e6830962399ef2d9c1fac | 1.7 GB |
sql_snapshot_branch.csv.gz | 6aaa27a7bf1d42f03f72d21c3f516e7a | 5.9 GB |
sql_snapshot_branches.csv.gz | e93b3587f65417da717dc1d90d41db42 | 3.3 GB |
sql_swh_import.sql | e641430767869bef5bfa6c606cf4c3f3 | 189 Bytes |
sql_swh_import_scripts.zip | 0e22533457eeff25035fa6f286d89fdc | 3.2 kB |
swh-environment.tar.gz | 4dcab9a3a848fbf6fe8b5f51e45e3a63 | 13.1 MB |
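
Because several of these archives are hundreds of gigabytes, it is worth checking a download against the MD5 value listed with each file before unpacking it. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `verify_download` are hypothetical helpers, not part of any Software Heritage tooling, and the sketch assumes the file has already been downloaded to a local path.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that very large archives never need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 matches the listed checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Example call (assumes README.md was saved in the current directory):
# verify_download("README.md", "6fa71cd1515bc7be0d2a23a7563633ea")
```

A mismatch usually indicates a truncated or corrupted transfer, in which case downloading the file again is the simplest remedy.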