Clean-Clean ER Datasets with fastText embeddings
Description
This is a collection of 11 real-world datasets for Clean-Clean ER. Each dataset contains the aggregate attribute value of every entity (i.e., the concatenation of all its individual attribute values) as well as the value of its most frequent and distinctive attribute. Pre-trained fastText embeddings are provided for both representations.
The first 10 datasets are the following:
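As a minimal sketch of what an aggregate attribute value looks like, the snippet below concatenates all attribute values of a single entity profile. The function name `aggregate_value`, the space separator, the attribute order, and the sample record are assumptions made for illustration only; the source does not specify how the published aggregates were actually produced.

```python
from typing import Mapping, Optional


def aggregate_value(entity: Mapping[str, Optional[str]]) -> str:
    """Concatenate all non-empty attribute values of one entity profile.

    Assumption: values are joined with a single space, in the order the
    attributes appear in the mapping.
    """
    parts = (str(value).strip() for value in entity.values() if value is not None)
    return " ".join(part for part in parts if part)


# Hypothetical entity profile, not taken from any of the datasets.
profile = {"name": "Golden Dragon", "city": "Athens", "phone": None}
print(aggregate_value(profile))
```

Missing values are skipped, so they do not leave stray separators in the aggregate string.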
- D1: Restaurants
- D2: Abt-Buy
- D3: Amazon-Google Products
- D4: DBLP-ACM
- D5: IMDB-TMDB
- D6: IMDB-TVDB
- D7: TMDB-TVDB
- D8: Amazon-Walmart
- D9: DBLP-Scholar
- D10: Movies
For more details, please refer to ContinuousFilteringBenchmark and to the corresponding technical report.
The 11th dataset comprises two versions of the DBpedia English Infoboxes. Please refer to this publication for more details.
We have also added 7 synthetic datasets for Dirty ER of increasing size, which are ideal for scalability analyses. These datasets, which contain demographic information, are the following (their names indicate their size):
- D10K
- D50K
- D100K
- D200K
- D300K
- D1M
- D2M
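Since the names of the synthetic datasets encode their size, a small helper can recover the size when scripting a scalability experiment. This is only a sketch: it assumes that `K` stands for thousands and `M` for millions of entity profiles, and the function name `dataset_size` is hypothetical.

```python
import re


def dataset_size(name: str) -> int:
    """Return the size encoded in a dataset name such as 'D10K' or 'D2M'.

    Assumption: 'K' means thousands and 'M' means millions.
    """
    match = re.fullmatch(r"D(\d+)([KM])", name)
    if match is None:
        raise ValueError(f"unrecognised dataset name: {name!r}")
    number, unit = match.groups()
    return int(number) * {"K": 1_000, "M": 1_000_000}[unit]


names = ["D10K", "D50K", "D100K", "D200K", "D300K", "D1M", "D2M"]
for name in names:
    print(name, dataset_size(name))
```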
At the moment, the repository contains the original datasets in JSO format. The datasets in CSV format along with their fastText embeddings are currently available here.
Files (8.5 GB)
Checksum | Size |
---|---|
md5:0f86a62fff246e3b3cfc3323454f9a29 | 382.5 MB |
md5:2c1a2e2df34aa815bd8f0c90c57f135c | 424.7 MB |
md5:13bf35030e191f3422a741653cd63f24 | 254.3 kB |
md5:6974ce6b2d946ff2f7a9855754d6b815 | 5.0 MB |
md5:4ffa7d2ed9bb46fa634e41609e030b1b | 33.1 MB |
md5:8e7b656cb28a273e7411655e1b40c057 | 754 Bytes |
md5:b9789d0b8c8e8f9484314e6125e43bfe | 19.3 MB |
md5:f4a046f1e1cd92acc3fd6fadb8ed6e06 | 18.8 MB |
md5:9fb815125b1efe1485b2d48b5705ccce | 8.6 kB |
md5:0f51c4dd647bf1b95299eeace34d709b | 26.6 MB |
md5:be8b52b6686a0902edfa3974eea6f7f4 | 54.3 MB |
md5:0d44b82091469f01ca43338a8767ed05 | 9.7 kB |
md5:63550f22135689ce0100cfdeb681eeb1 | 46.2 MB |
md5:19246e0b400f6e418b0d9627980b15ae | 40.5 MB |
md5:d488b85ebe15972190663b3aec87c038 | 20.3 kB |
md5:f067ecded9cba015992ee4ec69f0416b | 77.8 MB |
md5:2c7b26415a642bee818f069a7e8fc0db | 83.1 MB |
md5:e41efd2ecad8c79176611e6253b19e95 | 18.9 kB |
md5:14fc41d053a38e6f79df39688315e6d7 | 77.8 MB |
md5:73d961396b8ba0a195e88534ad40833b | 118.1 MB |
md5:5d22cf21c1b1ec4b2eb042cd06d7efb1 | 10.4 kB |
md5:d28bc714a42fc0c231109352759e0b39 | 76.2 MB |
md5:cd19d8a41d66ff556140fece6798c2d3 | 85.3 MB |
md5:a9cb4416d6dd69769c85d8a867f2eb63 | 10.6 kB |
md5:6f355535afa6f62f650a28983153472c | 44.6 MB |
md5:a94dbca68998d443bed7021e0e21f5f6 | 387.9 MB |
md5:1b0e5c5f1a2b8d84a510691aac35932c | 8.6 kB |
md5:1cab57c861eec63d69424640e6c3a9ff | 44.4 MB |
md5:f1155d11890debfb20cc91c3eee1a93a | 1.1 GB |
md5:2237528ecc559359cdafcbce6e35a6e5 | 23.7 kB |
md5:1d1c90108d078170e6240d726f4d0924 | 1.8 GB |
md5:108006f15b96d5d7617eeba1a746c4d9 | 3.3 GB |
md5:fcae3573c1ffd988c39dab11313256c6 | 268.5 MB |
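The `md5:` values listed above can be used to check that a download completed without corruption, which matters most for the multi-gigabyte files. The following is a minimal sketch using only the Python standard library (Python 3.9 or later); the function names and the file path in the usage comment are hypothetical, and the file is read in chunks so that large files do not need to fit in memory.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file's MD5 digest with a listed value such as 'md5:0f86...'."""
    expected_hex = expected.strip().removeprefix("md5:").lower()
    return md5_of_file(path) == expected_hex


# Usage (hypothetical local path; pair each file with its own listed checksum):
# ok = matches_checksum("downloads/some_file.csv", "md5:0f86a62fff246e3b3cfc3323454f9a29")
```

MD5 is adequate here for detecting accidental corruption during transfer; it is not meant as a guarantee against deliberate tampering.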