There is a newer version of the record available.

Published March 30, 2022 | Version v3
Dataset Open

Clean-Clean ER Datasets with FastText embeddings

  • 1. University of Athens

Description

This is a collection of 11 real-world datasets for Clean-Clean ER. Every dataset contains the aggregate attribute values per entity (i.e., a concatenation of all individual attribute values) as well as the value for the most frequent and distinctive attribute. In both cases, the corresponding pre-trained fastText embeddings are available, too.
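As a rough illustration of what an "aggregate attribute value" looks like, the sketch below concatenates all attribute values of one entity into a single string. The space separator, the attribute order, the skipping of empty values, and the sample record are all assumptions made for illustration; the record does not state how the published values were actually built.

```python
def aggregate_value(entity: dict[str, str]) -> str:
    """Concatenate all non-empty attribute values of one entity.

    Assumption: values are joined with a single space, in the order
    the attributes appear in the record.
    """
    return " ".join(v.strip() for v in entity.values() if v and v.strip())


# Hypothetical restaurant record, not taken from the datasets.
record = {"name": "Golden Dragon", "city": "Athens", "phone": ""}
print(aggregate_value(record))
```

A string built this way is what an embedding model would then receive as input, instead of one vector per attribute.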

The datasets are the following:

  • D1: Restaurants
  • D2: Abt-Buy
  • D3: Amazon-Google Products
  • D4: DBLP-ACM
  • D5: IMDB-TMDB
  • D6: IMDB-TVDB
  • D7: TMDB-TVDB
  • D8: Amazon-Walmart
  • D9: DBLP-Scholar
  • D10: Movies

For more details, please refer to ContinuousFilteringBenchmark and to the corresponding technical report.

The 11th dataset comprises two versions of the DBPedia English Infoboxes. Please refer to this publication for more details.

We have also added 7 synthetic datasets for Dirty ER of increasing size, which are ideal for scalability analyses. These datasets, which contain demographic information, are the following (their names indicate their size):

  • D10K
  • D50K
  • D100K
  • D200K
  • D300K
  • D1M
  • D2M

At the moment, the repository contains the original datasets in JSO format. The datasets in CSV format along with their fastText embeddings are currently available here.

Files

D10Aemb.csv

Files (8.5 GB)

Checksum  Size
md5:0f86a62fff246e3b3cfc3323454f9a29  382.5 MB
md5:2c1a2e2df34aa815bd8f0c90c57f135c  424.7 MB
md5:13bf35030e191f3422a741653cd63f24  254.3 kB
md5:6974ce6b2d946ff2f7a9855754d6b815  5.0 MB
md5:4ffa7d2ed9bb46fa634e41609e030b1b  33.1 MB
md5:8e7b656cb28a273e7411655e1b40c057  754 Bytes
md5:b9789d0b8c8e8f9484314e6125e43bfe  19.3 MB
md5:f4a046f1e1cd92acc3fd6fadb8ed6e06  18.8 MB
md5:9fb815125b1efe1485b2d48b5705ccce  8.6 kB
md5:0f51c4dd647bf1b95299eeace34d709b  26.6 MB
md5:be8b52b6686a0902edfa3974eea6f7f4  54.3 MB
md5:0d44b82091469f01ca43338a8767ed05  9.7 kB
md5:63550f22135689ce0100cfdeb681eeb1  46.2 MB
md5:19246e0b400f6e418b0d9627980b15ae  40.5 MB
md5:d488b85ebe15972190663b3aec87c038  20.3 kB
md5:f067ecded9cba015992ee4ec69f0416b  77.8 MB
md5:2c7b26415a642bee818f069a7e8fc0db  83.1 MB
md5:e41efd2ecad8c79176611e6253b19e95  18.9 kB
md5:14fc41d053a38e6f79df39688315e6d7  77.8 MB
md5:73d961396b8ba0a195e88534ad40833b  118.1 MB
md5:5d22cf21c1b1ec4b2eb042cd06d7efb1  10.4 kB
md5:d28bc714a42fc0c231109352759e0b39  76.2 MB
md5:cd19d8a41d66ff556140fece6798c2d3  85.3 MB
md5:a9cb4416d6dd69769c85d8a867f2eb63  10.6 kB
md5:6f355535afa6f62f650a28983153472c  44.6 MB
md5:a94dbca68998d443bed7021e0e21f5f6  387.9 MB
md5:1b0e5c5f1a2b8d84a510691aac35932c  8.6 kB
md5:1cab57c861eec63d69424640e6c3a9ff  44.4 MB
md5:f1155d11890debfb20cc91c3eee1a93a  1.1 GB
md5:2237528ecc559359cdafcbce6e35a6e5  23.7 kB
md5:1d1c90108d078170e6240d726f4d0924  1.8 GB
md5:108006f15b96d5d7617eeba1a746c4d9  3.3 GB
md5:fcae3573c1ffd988c39dab11313256c6  268.5 MB