Dataset Open Access
Madelon Hulsebos; Çağatay Demiralp; Paul Groth
Summary
GitTables (https://gittables.github.io) is a corpus of currently 1.7M relational tables extracted from CSV files on GitHub; we aim to grow it to at least 10M tables. Each file in this corpus represents a table with its original content (e.g. values and header) as extracted from the corresponding CSV file. Table columns are also annotated with more than 2K semantic types from Schema.org and DBpedia (provided separately). These column annotations consist of, for example, semantic types, hierarchical relations to other types, and descriptions.
We believe GitTables can facilitate many use-cases, among which:
If you have questions, the paper, documentation, and contact details are provided on the website: https://gittables.github.io.
Dataset contents
The data is provided in subsets of tables stored in parquet files. Each subset corresponds to a term that was used to query GitHub. The column annotations and other metadata (e.g. URL) are attached to the metadata of the parquet file.
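As a rough illustration of how the attached annotations might be read back, the sketch below uses `pyarrow` to load the key-value metadata stored with a parquet file's schema and decodes any values that happen to be JSON. The function names, the example path, and the assumption that values are JSON-encoded are ours, not something the dataset description states; inspect the keys of an actual file to see the real layout.

```python
import json


def decode_metadata(raw):
    """Decode a bytes-to-bytes metadata mapping into Python objects.

    Values that parse as JSON are decoded; anything else is kept as text.
    That values are JSON-encoded is an assumption, not a documented fact.
    """
    decoded = {}
    for key, value in (raw or {}).items():
        key_text = key.decode("utf-8") if isinstance(key, bytes) else str(key)
        if isinstance(value, bytes):
            value_text = value.decode("utf-8", errors="replace")
        else:
            value_text = str(value)
        try:
            decoded[key_text] = json.loads(value_text)
        except json.JSONDecodeError:
            decoded[key_text] = value_text
    return decoded


def read_table_metadata(path):
    """Return the decoded schema-level metadata of one parquet file."""
    import pyarrow.parquet as pq  # third-party dependency, imported lazily

    schema = pq.read_schema(path)  # reads the schema only, not the row data
    return decode_metadata(schema.metadata)
```

Calling `read_table_metadata("some_table.parquet")` on a file extracted from one of the subsets (the path here is hypothetical) should return a dictionary whose keys show where the annotations and the source URL are stored.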
In summary, this dataset can be characterized as follows:
Statistic | Value |
---|---|
# tables | 1.7M |
average # columns | 25 |
average # rows | 209 |
# annotated tables (at least 1 column annotation) | 1.0M (DBpedia), 1.5M (Schema.org) |
# unique semantic types | 1218 (DBpedia), 924 (Schema.org) |
Future releases
Future releases will include the following:
Responsible use
This dataset corresponds to the 1.7M tables used for the analysis in the GitTables paper: https://arxiv.org/abs/2106.07258.
Version 0.0.4 of GitTables (the current version) contains tables extracted from CSV files in public GitHub repositories. Some tables might therefore not be associated with a permissive license. While the new (fully licensed) set is being constructed, GitHub's API can be used to retrieve the license associated with each table based on the URL in the metadata.
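One possible way to look up a license from a table's URL is sketched below, using only the Python standard library and the GitHub REST API endpoint `GET /repos/{owner}/{repo}/license`. The sketch assumes that the URL stored in the metadata is a `github.com` or `raw.githubusercontent.com` link whose path starts with `/<owner>/<repo>/`; the function names are illustrative and the exact URL format in the metadata should be checked against a real file.

```python
import json
import urllib.error
import urllib.request
from urllib.parse import urlparse

API_ROOT = "https://api.github.com"


def license_endpoint(file_url):
    """Build the GitHub REST API license URL for the repo hosting file_url.

    Assumes the URL path starts with /<owner>/<repo>/, which holds for
    github.com and raw.githubusercontent.com links.
    """
    parts = [p for p in urlparse(file_url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"cannot find owner/repo in {file_url!r}")
    owner, repo = parts[0], parts[1]
    return f"{API_ROOT}/repos/{owner}/{repo}/license"


def fetch_license(file_url, token=None):
    """Return the SPDX id of the repository license, or None if unavailable."""
    request = urllib.request.Request(license_endpoint(file_url))
    request.add_header("Accept", "application/vnd.github+json")
    if token:
        # Optional personal access token; raises the API rate limit.
        request.add_header("Authorization", f"Bearer {token}")
    try:
        with urllib.request.urlopen(request) as response:
            payload = json.load(response)
    except urllib.error.HTTPError as error:
        if error.code == 404:  # repository missing or no license detected
            return None
        raise
    return (payload.get("license") or {}).get("spdx_id")
```

Unauthenticated requests to the GitHub API are rate-limited, so passing a token is advisable when checking many tables.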
Please be aware that tables might still contain sensitive, harmful, or otherwise undesired data. The next release will be curated to mitigate this. If you observe any issue with the content, please report it to us using the contact form on https://gittables.github.io so that we can address it.
Name | MD5 | Size |
---|---|---|
abstraction_tables.zip | 9a5ba3a9ebae9e599b0c92ee9b35980d | 313.3 MB |
dwarf_tables.zip | 064ce57a894e3e80cbb273241c5ea276 | 281.3 MB |
id_tables.zip | d50a92247e9886637ed59d3fa3360935 | 755.5 MB |
living_thing_tables.zip | c39af2a53da82b25c9960bce53f1e8d2 | 1.9 GB |
object_tables.zip | d9feba14cecb4bcda882359b9b8447a0 | 5.3 GB |
organism_tables.zip | fde0d1439bbbeeabb8f61d608efa128b | 291.3 MB |
parent_tables.zip | 47f8e80cd81a09c913c87b670658cc6b | 4.8 GB |
physical_entity_tables.zip | 87f34827e0a5fe6a560d4c5f2995e6a8 | 183.7 MB |
thing_tables.zip | f8ff62ea4868859c6e66427cbfbc468c | 6.2 GB |
whole_tables.zip | f43b3851ba5044c0c4bd61d572ab650f | 5.5 GB |
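Since each archive is listed with an MD5 checksum, a download can be checked before it is unpacked. The following is a minimal sketch using Python's `hashlib`; the function names and the chunk size are our own choices, and the file is read in chunks so that the multi-gigabyte archives do not have to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file at path matches the listed MD5 checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `verify_download("dwarf_tables.zip", "064ce57a894e3e80cbb273241c5ea276")` should return `True` for an intact copy of that archive, assuming it was saved under that name in the working directory.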