Dataset Open Access

GitTables 1.7M

Madelon Hulsebos; Çağatay Demiralp; Paul Groth

GitTables is a corpus of currently 1.7M relational tables extracted from GitHub. Our continuing curation aims to grow the corpus to at least 10M tables. We annotated table columns in GitTables with more than 2K different semantic types from Schema.org and DBpedia. Our column annotations consist of semantic types, hierarchical relations, range types and descriptions. If you have questions, documentation and contact details are provided on our website: https://gittables.github.io.

This dataset corresponds to the 1.7M tables used for the analysis in the GitTables paper: https://arxiv.org/abs/2106.07258. Characteristics of the table corpus (e.g. table sizes and topical distribution) are reported in that paper.

Responsible use

The current versions of GitTables, up to 0.0.4, contain tables extracted from CSV files in public GitHub repositories, so some tables might not be covered by a license that allows, for example, commercial use. A new version of GitTables containing only licensed tables will be released soon, with the licenses attached to the file metadata. In the meantime, we suggest using GitHub's License API to retrieve the license associated with each table (the URL in the metadata can be used for this) and to understand which restrictions apply to it.
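
As a rough illustration of the license lookup suggested above, the following Python sketch derives the GitHub REST endpoint for a repository's license from a table URL and reads the SPDX identifier from the response. It assumes that the URL stored in the table metadata points at github.com or raw.githubusercontent.com and starts with the owner and repository name; the exact metadata format is not specified here, so this may need adjusting. The function names license_api_url and fetch_license are hypothetical helpers, not part of GitTables.

```python
import json
import urllib.error
import urllib.request
from urllib.parse import urlparse

# Assumption: table URLs in the metadata use one of these hosts, with
# "/<owner>/<repo>/..." as the first two path segments.
GITHUB_HOSTS = {"github.com", "www.github.com", "raw.githubusercontent.com"}


def license_api_url(table_url: str) -> str:
    """Build the GitHub REST API license endpoint for the repository in table_url."""
    parsed = urlparse(table_url)
    if parsed.netloc.lower() not in GITHUB_HOSTS:
        raise ValueError(f"not a recognised GitHub URL: {table_url}")
    parts = [segment for segment in parsed.path.split("/") if segment]
    if len(parts) < 2:
        raise ValueError(f"cannot find owner and repository in: {table_url}")
    owner, repo = parts[0], parts[1]
    return f"https://api.github.com/repos/{owner}/{repo}/license"


def fetch_license(table_url: str):
    """Return the SPDX identifier of the repository license, or None if none is detected."""
    request = urllib.request.Request(
        license_api_url(table_url),
        headers={"Accept": "application/vnd.github+json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            payload = json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 404:  # GitHub found no license file for this repository
            return None
        raise
    return (payload.get("license") or {}).get("spdx_id")
```

Unauthenticated requests to the GitHub API are rate-limited, so checking many tables will likely require an access token and caching of results per repository rather than per table.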

Please be aware that this dataset is uncurated, so the underlying data files might contain sensitive, harmful or otherwise undesired data. The spread and exact replication of such content should be avoided. Please report any such observations so that we can remove the affected files.

It is also important to assess derived artefacts for the presence of any negative bias before deploying or publishing them. If harmful biases are observed, we would like to be notified so that we can mitigate these problems and improve our guidelines for using GitTables. You can report this through the contact form on https://gittables.github.io.

Files (25.6 GB)

Name                          Size       MD5
abstraction_tables.zip        313.3 MB   9a5ba3a9ebae9e599b0c92ee9b35980d
dwarf_tables.zip              281.3 MB   064ce57a894e3e80cbb273241c5ea276
id_tables.zip                 755.5 MB   d50a92247e9886637ed59d3fa3360935
living_thing_tables.zip       1.9 GB     c39af2a53da82b25c9960bce53f1e8d2
object_tables.zip             5.3 GB     d9feba14cecb4bcda882359b9b8447a0
organism_tables.zip           291.3 MB   fde0d1439bbbeeabb8f61d608efa128b
parent_tables.zip             4.8 GB     47f8e80cd81a09c913c87b670658cc6b
physical_entity_tables.zip    183.7 MB   87f34827e0a5fe6a560d4c5f2995e6a8
thing_tables.zip              6.2 GB     f8ff62ea4868859c6e66427cbfbc468c
whole_tables.zip              5.5 GB     f43b3851ba5044c0c4bd61d572ab650f
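
The md5 values listed with each archive can be used to check that a download completed without corruption. The following is a minimal Python sketch of such a check; it assumes the archives were saved under their listed file names. The names EXPECTED_MD5, md5_of_file and matches_listing are hypothetical, and only two of the listed checksums are included for brevity.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing; extend with the remaining archives as needed.
EXPECTED_MD5 = {
    "abstraction_tables.zip": "9a5ba3a9ebae9e599b0c92ee9b35980d",
    "dwarf_tables.zip": "064ce57a894e3e80cbb273241c5ea276",
}


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, expected=EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest equals the checksum listed for its name."""
    name = Path(path).name
    if name not in expected:
        raise KeyError(f"no checksum listed for {name}")
    return md5_of_file(path) == expected[name]
```

For example, matches_listing("abstraction_tables.zip") should return True for an intact download; a False result suggests downloading the archive again.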
