GitTables 1M
- 1. University of Amsterdam
- 2. Sigma Computing
Description
Summary
GitTables 1M (https://gittables.github.io) is a corpus of currently 1M relational tables extracted from CSV files in GitHub repositories that are associated with a license allowing distribution. We aim to grow this to at least 10M tables.
Each parquet file in this corpus represents a table with the original content (e.g. values and header) as extracted from the corresponding CSV file. Table columns are enriched with annotations corresponding to >2K semantic types from Schema.org and DBpedia (provided as metadata of the parquet file). These column annotations consist of, for example, semantic types, hierarchical relations to other types, and descriptions.
We believe GitTables can facilitate many use-cases, among which:
- Data integration, search and validation.
- Data visualization and analysis recommendation.
- Schema analysis and completion for e.g. database or knowledge base design.
If you have questions, the paper, documentation, and contact details are provided on the website: https://gittables.github.io. We recommend using Zenodo's API to easily download the full dataset (i.e. all zipped topic subsets).
Dataset contents
The data is provided in subsets of tables stored in parquet files; each subset corresponds to a term that was used to query GitHub. The column annotations and other metadata (e.g. URL and repository license) are attached to the metadata of the parquet file. This release corresponds to version 4 of the paper: https://arxiv.org/abs/2106.07258v4.
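As a minimal sketch of how the attached metadata might be inspected, the snippet below reads one parquet file and decodes its key-value metadata. The metadata key names are not stated here, so the helper simply decodes whatever keys are present; `decode_metadata` and `read_table_with_metadata` are hypothetical helper names, and the third-party packages `pyarrow` and `pandas` are assumed to be installed.

```python
import json
from typing import Any, Dict, Optional


def decode_metadata(raw: Optional[Dict[bytes, bytes]]) -> Dict[str, Any]:
    """Decode Parquet key-value metadata (bytes -> bytes) into Python objects.

    Values that parse as JSON are returned parsed; all others stay strings.
    """
    decoded: Dict[str, Any] = {}
    for key, value in (raw or {}).items():
        text = value.decode("utf-8", errors="replace")
        try:
            decoded[key.decode("utf-8")] = json.loads(text)
        except json.JSONDecodeError:
            decoded[key.decode("utf-8")] = text
    return decoded


def read_table_with_metadata(path: str):
    """Return (DataFrame, decoded metadata) for a single parquet file."""
    import pyarrow.parquet as pq  # third-party, imported lazily

    table = pq.read_table(path)
    # schema.metadata is None when the file carries no key-value metadata.
    return table.to_pandas(), decode_metadata(table.schema.metadata)
```

Calling `read_table_with_metadata("some_table.parquet")` (a placeholder path) and printing the keys of the returned dictionary is a quick way to see which annotation fields a given file actually contains.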
In summary, this dataset can be characterized as follows:
Statistic | Value |
---|---|
# tables | 1M |
average # columns | 12 |
average # rows | 142 |
# annotated tables (at least 1 column annotation) | 723K+ (DBpedia), 738K+ (Schema.org) |
# unique semantic types | 835 (DBpedia), 677 (Schema.org) |
How to download
The dataset can be downloaded through Zenodo's interface directly, or using Zenodo's API (recommended for full download).
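A full download through the API could look roughly like the sketch below. It assumes that Zenodo's records endpoint (`https://zenodo.org/api/records/<id>`) returns JSON with a `files` list whose entries carry `key`, `size`, `checksum`, and `links.self`; these field names should be checked against Zenodo's API documentation. The record ID is deliberately left as a parameter, and `fetch_record`, `list_files`, and `download_all` are hypothetical helper names.

```python
import json
import urllib.request
from pathlib import Path
from typing import Any, Dict, List

# Assumed endpoint pattern for Zenodo's REST API.
ZENODO_API = "https://zenodo.org/api/records/{record_id}"


def fetch_record(record_id: str) -> Dict[str, Any]:
    """Fetch the JSON description of a Zenodo record."""
    url = ZENODO_API.format(record_id=record_id)
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def list_files(record: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Extract name, size, checksum, and download URL for each file."""
    return [
        {
            "name": entry["key"],
            "size": entry["size"],
            "checksum": entry["checksum"],
            "url": entry["links"]["self"],
        }
        for entry in record.get("files", [])
    ]


def download_all(record_id: str, target_dir: str) -> None:
    """Download every file of the record into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for item in list_files(fetch_record(record_id)):
        urllib.request.urlretrieve(item["url"], str(target / item["name"]))
```

Because the full dataset is 16.3 GB, it may be worth calling `list_files` first and downloading only the topic subsets that are needed.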
Future releases
Future releases will include the following:
- Increased number of tables (expected at least 10M)
Associated datasets
- GitTables benchmark - column type detection: https://zenodo.org/record/5706316
- GitTables 1M - CSV files: https://zenodo.org/record/6515973
Files
(16.3 GB)
MD5 checksum | Size |
---|---|
md5:01d0e1297c3f3c136da8bbe9f7ff9f6c | 183.5 MB |
md5:c899a73b9f31c201ff5b028efcd0ed17 | 3.3 MB |
md5:07ce67ed497012079238c857b3729af7 | 4.0 MB |
md5:c17a9aa80624fd7a2b7888775eeb11e9 | 11.5 MB |
md5:9d6b103927ea4ee6e7982d99a97b0946 | 27.8 MB |
md5:a3dea6b311fa698a9e5769710a6cef79 | 83.9 MB |
md5:9cf4b6d0638e0d1926fa0272a5da6cfd | 5.1 MB |
md5:8ec50d265135da0acc22c2371d9d5af6 | 25.2 MB |
md5:e1ee6a4ffed2a03ab0183cf3032b436e | 6.7 MB |
md5:0ea88e32aab30d13f9d00f3ee257764e | 28.2 MB |
md5:8777e7dffe24e58dae7d731a3c7580ad | 966.4 kB |
md5:3f058d98d8fa03d6827c88208283a8f7 | 5.8 MB |
md5:ee91dd6f4d79b55f4abfdf318d625713 | 33.7 MB |
md5:66ec18db2fef4a3e8710fa4b0681a265 | 74.2 MB |
md5:107f4196c8db3ea0be40ae47730265b4 | 745.9 MB |
md5:2c874017ce9a8e326438effd0bce7a26 | 135.0 MB |
md5:cd2a6fd4849b49ea336f6a0736bfdfb3 | 3.3 MB |
md5:7ae632e8b75c7be3beb8d2d73dc7ed25 | 21.1 MB |
md5:df934dc1c9c7b19ce5f2bbc1cc995d77 | 29.0 MB |
md5:79a7cb6efab03c55b5c5aa247d09b7b8 | 2.4 MB |
md5:788d769236ee55d85fdaa0f9001baaa9 | 3.0 MB |
md5:f0062b74013b708027caa3edeb1958a8 | 9.8 MB |
md5:c684109a3be54770a21c5cb72f52103d | 33.3 MB |
md5:43a30b5686325d153f8a8af39a79bd44 | 2.5 MB |
md5:983c71daffe9d098534ca69d0a96803f | 2.8 MB |
md5:142724919f37d91062f55677329ac78b | 240.1 MB |
md5:275f5a7f6bf032b2da7311c04dbfb749 | 431.6 MB |
md5:c89c5a472dfb54c578e2523199d84b3f | 2.6 MB |
md5:9ce1258256a3821dd2f807aef88e26b7 | 1.5 MB |
md5:e0464e8da5550ac11ace6a5b7d44bf2d | 4.3 MB |
md5:2314c090e6d6fb0c61540da50f91c30c | 2.0 GB |
md5:297621a79135209427a37b33697d368e | 252.4 MB |
md5:ccaffefbb7f97fc2fd9fee74efa3321a | 10.7 MB |
md5:43d792bd00c305df64e7a93ac60b65ac | 842.6 kB |
md5:89cf031ae8dacf850fe25dd476830a53 | 50.4 MB |
md5:f139baf7a3107910c86c9318f3d504d8 | 20.5 MB |
md5:b5f4468254bbd363bd83787f8d313bc4 | 6.0 MB |
md5:52011a535602ddf51a172ccf1b963aa6 | 1.9 MB |
md5:b3a5cc0dba30747e18de8add8201fc98 | 9.8 MB |
md5:01b8cd21bea40d57452cfc23e42f2ade | 7.5 MB |
md5:819ef38984cffb68b01397cf2c1c7660 | 3.7 MB |
md5:0eef8405992811378812aa887c631ab4 | 405.5 MB |
md5:1d95b7627a0951fc69deb86d95be81fd | 442.8 MB |
md5:76cdb2bad9582d23c1f6f4d868218d6c | 22 Bytes |
md5:28336e76ace33e33747518b76b63d752 | 2.4 MB |
md5:7bd12d78dccf6cb88832aa99f9bbc74a | 912.3 kB |
md5:f03756251038867e137d68cd572766cf | 5.6 MB |
md5:27a3606167228c022bf8bd40421b0c17 | 36.9 MB |
md5:2ef20e46c37e52e067182c8e81cc07ba | 1.1 MB |
md5:ba87146e5b27cbd998340a1df825baf1 | 1.8 MB |
md5:4e0e7301ec320aeda7dedabfb7b541c7 | 2.1 MB |
md5:862ebb8e31d4bbe1923c2cd06db23fe3 | 765.7 MB |
md5:0f115eb9d8f0ccad5221617ea57d58de | 13.0 MB |
md5:59d47e8a7da4dc2bf33b66038685b375 | 26.9 MB |
md5:f81e9ffdd90f6311b8f472dbc01d266a | 1.2 GB |
md5:3adf5063467f45324b2e26009c98440c | 597.3 kB |
md5:b8d3dd1b5df774b8422367474bc1efcc | 1.0 MB |
md5:f3669555ed1525d01bba54679bf225a8 | 36.8 MB |
md5:9ff4921799084c9cf214cd8961cf5cdf | 316.4 MB |
md5:0596f1405d3f5b748134e18908f32021 | 773.7 MB |
md5:16bc80400e85a51f58ae61330ed9c9e6 | 426.3 MB |
md5:e98656caeca7a072df8d277c2061ddaa | 549.2 kB |
md5:8547e9a82cf9580d73256b17c003a805 | 1.1 MB |
md5:b45b69f715f4228e9ab6d3b0dee4ff35 | 16.6 MB |
md5:27424a70a3de975f34fef1c3b1359d12 | 193.6 MB |
md5:1d5f41fa761b14cd2361bb6862206e88 | 161.1 MB |
md5:8a93ac30870ce80c102af12d44c5dbbb | 1.1 GB |
md5:e9725052f3f4edcc595311e6aa61e88a | 14.9 MB |
md5:5ae5cc0acfdda755dfc2ee2df265494c | 23.2 MB |
md5:c6f2a1a94b6d3f1961307f16b9ab68d5 | 12.7 MB |
md5:f9b6f5a7870c5782871f2f4ff03517e1 | 1.9 MB |
md5:dd43cf9379d94c6b320d3aa13b5fe3bf | 18.4 MB |
md5:80d4a2a01c2021063652eff08e6a44e1 | 602.1 MB |
md5:ebc453d6c92a937eb14ca00db96b7d8b | 118.8 MB |
md5:86a21db356d5bd0dead8ea0458f1f29c | 7.0 MB |
md5:66334c5e34def53c3e12bb388858010c | 40.9 MB |
md5:ab6c7a18dcc5de95b8dcbffd8e97ceb1 | 5.7 MB |
md5:8b7177ddc7aaf1940c18a20ab53fdafd | 225.1 MB |
md5:fea2cd95cdead7bd0e0209e5e78490a5 | 15.7 MB |
md5:09598e83390661087b2e2c2d183a8e28 | 894.2 MB |
md5:07fdfce4de15cde5fcff33a5a9cebcc2 | 18.6 MB |
md5:ceec7fa340da9ed5d49909c135612e57 | 165.2 MB |
md5:4f7f0027b15452f2bcf1787558c6f595 | 35.5 MB |
md5:eb3252bcbfb79267484332dca19e2d50 | 95.2 kB |
md5:fc6cb5f9f58586c4c48695677e4a291e | 71.7 MB |
md5:a16aa2c3e3548d35b4b93b8038846b5f | 12.8 MB |
md5:b1caae5ef01a54f48f1aef45ba1ec3b4 | 3.4 MB |
md5:2ded04348a7dc4e7cd28bcd077be396c | 1.3 GB |
md5:23459f9417d803f0ebf297dc11dd0b1b | 1.2 GB |
md5:cba1163c4402ce1eb04cdf0b5fa6a096 | 1.3 MB |
md5:423eaf1a8bae5d22450b94ab31dd109e | 88.2 MB |
md5:5d95a0ce74d2f7597f749d4f52dca30c | 30.0 MB |
md5:76cdb2bad9582d23c1f6f4d868218d6c | 22 Bytes |
md5:43a947d22a5d9224d35f2dcedf0cc86c | 2.9 MB |
md5:c6c2448b87d8d05f18d4af6dfbca5265 | 5.4 MB |
md5:d397311d8cb81902ab9ca025e015791b | 961.3 MB |
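The `md5:` values listed above can be used to check that a downloaded archive arrived intact. The following is a minimal sketch using only the Python standard library; `md5_of_file` and `matches_checksum` are hypothetical helper names, and the file path passed in is whatever local path the archive was saved to.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, listed: str) -> bool:
    """Compare a file against a listed checksum such as 'md5:01d0e1...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use flat even for the multi-gigabyte archives in the list.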