There is a newer version of this record available.

Dataset Open Access

GitTables 1.7M

Madelon Hulsebos; Çağatay Demiralp; Paul Groth

Summary

GitTables (https://gittables.github.io) is a corpus of currently 1.7M relational tables extracted from CSV files on GitHub; we aim to grow it to at least 10M tables. Each file in this corpus represents a table with the original content (e.g. values and header) as extracted from the corresponding CSV file. Table columns are also annotated with more than 2K semantic types from Schema.org and DBpedia (provided separately). These column annotations consist of, for example, semantic types, hierarchical relations to other types, and descriptions.

We believe GitTables can facilitate many use-cases, among which:

  • Data integration, search and validation.
  • Data visualization and analysis recommendation.
  • Schema analysis and completion for e.g. database or knowledge base design.

If you have questions, the paper, documentation, and contact details are provided on the website: https://gittables.github.io.

Dataset contents

The data is provided in subsets of tables stored in Parquet files. Each subset corresponds to a term that was used to query GitHub. The column annotations and other metadata (e.g. URL) are attached to the metadata of the Parquet file.
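
As an illustration of how the attached metadata might be read, the sketch below loads one table with the third-party pyarrow library and decodes the key/value metadata stored in the Parquet schema. The helper names are hypothetical, and the assumption that metadata values are JSON-encoded may not hold for every key, so non-JSON values are returned as plain strings. The exact metadata keys are documented on the website and are not assumed here.

```python
import json
from typing import Any, Dict, Optional


def decode_parquet_metadata(raw: Optional[Dict[bytes, bytes]]) -> Dict[str, Any]:
    """Decode Parquet key/value metadata (bytes) into Python objects.

    Values that parse as JSON are returned as parsed objects;
    everything else is returned as a plain string.
    """
    decoded: Dict[str, Any] = {}
    for key, value in (raw or {}).items():
        text = value.decode("utf-8", errors="replace")
        try:
            decoded[key.decode("utf-8")] = json.loads(text)
        except json.JSONDecodeError:
            decoded[key.decode("utf-8")] = text
    return decoded


def read_table_with_metadata(path: str):
    """Read one Parquet file and return (table, decoded metadata).

    Requires pyarrow; imported here so the decoding helper above
    stays usable without it.
    """
    import pyarrow.parquet as pq

    table = pq.read_table(path)
    # schema.metadata is None when the file carries no key/value metadata.
    return table, decode_parquet_metadata(table.schema.metadata)
```

Calling `read_table_with_metadata` on a file extracted from one of the subset archives returns the table together with a dictionary whose keys can be inspected to locate the annotations and source URL.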

In summary, this dataset can be characterized as follows:

Statistic                                           Value
# tables                                            1.7M
average # columns                                   25
average # rows                                      209
# annotated tables (at least 1 column annotation)   1.0M (DBpedia), 1.5M (Schema.org)
# unique semantic types                             1218 (DBpedia), 924 (Schema.org)

Future releases

Future releases will include the following:

  • Licensed tables along with their licenses
  • Improved curation (e.g. better parsing, removal of social media content, anonymization)
  • Raw CSV files

Responsible use

This dataset corresponds to the 1.7M tables used for the analysis in the GitTables paper: https://arxiv.org/abs/2106.07258.

Version 0.0.4 of GitTables (the current version) contains tables extracted from CSV files in public GitHub repositories. Some tables might therefore not be associated with a permissive license. While the new (fully licensed) set is being constructed, GitHub's API can be used to retrieve the license associated with each table based on the URL in the metadata.
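
A minimal sketch of such a license lookup follows. It assumes the URL stored in the metadata points at github.com or raw.githubusercontent.com and begins with the owner and repository name; the exact URL format is not specified here, so this is an assumption. The function names are hypothetical. The lookup uses GitHub's REST endpoint for repository licenses, and unauthenticated requests are subject to GitHub's rate limits.

```python
import json
import urllib.error
import urllib.request
from typing import Optional, Tuple
from urllib.parse import urlparse

# Hosts whose paths start with /<owner>/<repo>/... (assumed URL shapes).
_GITHUB_HOSTS = ("github.com", "www.github.com", "raw.githubusercontent.com")


def repo_from_url(url: str) -> Tuple[str, str]:
    """Extract (owner, repo) from a GitHub file URL."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    if parsed.netloc not in _GITHUB_HOSTS or len(parts) < 2:
        raise ValueError(f"not a recognized GitHub file URL: {url}")
    return parts[0], parts[1]


def license_api_url(owner: str, repo: str) -> str:
    """Build the REST API URL that reports a repository's license."""
    return f"https://api.github.com/repos/{owner}/{repo}/license"


def fetch_license_spdx(url: str) -> Optional[str]:
    """Return the SPDX identifier of the repository license, or None
    if GitHub reports no license for the repository (HTTP 404)."""
    owner, repo = repo_from_url(url)
    request = urllib.request.Request(
        license_api_url(owner, repo),
        headers={"Accept": "application/vnd.github+json"},
    )
    try:
        with urllib.request.urlopen(request) as response:
            payload = json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise
    return (payload.get("license") or {}).get("spdx_id")
```

A result of `None` means that GitHub did not detect a license file, which should not be read as permission to reuse the table.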

Please be aware that tables might still contain sensitive, harmful or otherwise undesired data. The next release will be curated to mitigate this. If you observe any issue with the content, please report it to us using the contact form on https://gittables.github.io so that we can address it.

Files (25.6 GB)
Name                         Size       MD5
abstraction_tables.zip       313.3 MB   9a5ba3a9ebae9e599b0c92ee9b35980d
dwarf_tables.zip             281.3 MB   064ce57a894e3e80cbb273241c5ea276
id_tables.zip                755.5 MB   d50a92247e9886637ed59d3fa3360935
living_thing_tables.zip      1.9 GB     c39af2a53da82b25c9960bce53f1e8d2
object_tables.zip            5.3 GB     d9feba14cecb4bcda882359b9b8447a0
organism_tables.zip          291.3 MB   fde0d1439bbbeeabb8f61d608efa128b
parent_tables.zip            4.8 GB     47f8e80cd81a09c913c87b670658cc6b
physical_entity_tables.zip   183.7 MB   87f34827e0a5fe6a560d4c5f2995e6a8
thing_tables.zip             6.2 GB     f8ff62ea4868859c6e66427cbfbc468c
whole_tables.zip             5.5 GB     f43b3851ba5044c0c4bd61d572ab650f
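
The MD5 checksums listed with the files can be used to confirm that a download completed intact. The sketch below is one way to do this with Python's standard library; the function names are hypothetical, and the file is read in chunks because some archives are several gigabytes in size.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that large archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_md5: str) -> bool:
    """Return True if the file's MD5 matches the listed checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `verify_download("abstraction_tables.zip", "9a5ba3a9ebae9e599b0c92ee9b35980d")` checks the first archive against its listed checksum, assuming the file was saved under its original name in the working directory.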
                   All versions   This version
Views              2,850          1,614
Downloads          15,152         5,542
Data volume        32.5 TB        30.6 TB
Unique views       2,125          1,287
Unique downloads   1,285          554