There is a newer version of the record available.

Published March 30, 2022 | Version v4
Dataset Open

Clean-Clean ER Datasets with FastText embeddings

Authors/Creators

  • 1. University of Athens

Description

This is a collection of 11 real-world datasets for Clean-Clean ER. Every dataset contains the aggregate attribute values per entity (i.e., a concatenation of all individual attribute values) as well as the value for the most frequent and distinctive attribute. In both cases, the corresponding pre-trained fastText embeddings are available, too.
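
As an illustration of what "aggregate attribute values" means, the following minimal Python sketch concatenates all attribute values of one entity into a single string. The function name `aggregate_value`, the space separator, the skipping of empty values, and the sample entity are assumptions made for illustration; the record does not specify how the concatenation was actually performed.

```python
def aggregate_value(entity: dict, sep: str = " ") -> str:
    """Concatenate all non-empty attribute values of one entity profile.

    Assumption: values are joined in attribute order with a single
    separator; the datasets themselves may use a different convention.
    """
    values = (str(v).strip() for v in entity.values() if v is not None)
    return sep.join(v for v in values if v)


# Hypothetical entity profile, not taken from any of the datasets.
example_entity = {"name": "Example Diner", "city": "Athens", "phone": None}
print(aggregate_value(example_entity))
```

A string produced this way is the kind of text for which an embedding would then be computed.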

The datasets are the following:

  • D1: Restaurants
  • D2: Abt-Buy
  • D3: Amazon-Google Products
  • D4: DBLP-ACM
  • D5: IMDB-TMDB
  • D6: IMDB-TVDB
  • D7: TMDB-TVDB
  • D8: Amazon-Walmart
  • D9: DBLP-Scholar
  • D10: Movies

For more details, please refer to ContinuousFilteringBenchmark and to the corresponding technical report.

The 11th dataset comprises two versions of the DBPedia English Infoboxes. Please refer to this publication for more details.

We have also added 7 synthetic datasets for Dirty ER of increasing size, which are ideal for scalability analyses. These datasets, which contain demographic information, are the following (their names indicate their size):

  • D10K
  • D50K
  • D100K
  • D200K
  • D300K
  • D1M
  • D2M
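
To see why datasets of increasing size are useful for scalability analyses, the sketch below counts the comparisons a brute-force Dirty ER approach would need, n(n-1)/2 for n entities. It assumes that each name encodes the number of entity profiles (for example, D10K as 10,000); the dictionary `ASSUMED_SIZES` and the function `brute_force_comparisons` are illustrative names, not part of the datasets.

```python
def brute_force_comparisons(n: int) -> int:
    """Number of distinct entity pairs in a single collection of n entities."""
    return n * (n - 1) // 2


# Assumption: the dataset names denote the number of entity profiles.
ASSUMED_SIZES = {
    "D10K": 10_000,
    "D50K": 50_000,
    "D100K": 100_000,
    "D200K": 200_000,
    "D300K": 300_000,
    "D1M": 1_000_000,
    "D2M": 2_000_000,
}

for name, n in ASSUMED_SIZES.items():
    print(f"{name}: {brute_force_comparisons(n):,} candidate pairs")
```

Under this assumption, the number of pairs grows quadratically with the dataset size, which is what makes exhaustive comparison impractical for the largest datasets.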

At the moment, the repository contains the original datasets in JSO format. The datasets in CSV format along with their fastText embeddings are currently available here.

Files

Files (8.6 GB)

Checksum  Size
md5:0f86a62fff246e3b3cfc3323454f9a29  382.5 MB
md5:2c1a2e2df34aa815bd8f0c90c57f135c  424.7 MB
md5:4e96abd79d534461b44f204a9bac9167  3.3 MB
md5:13bf35030e191f3422a741653cd63f24  254.3 kB
md5:46cb17915c7146292f3b82e9247ba77b  277.1 kB
md5:2a165507f4fd5658e90bb599119002a5  2.5 MB
md5:6974ce6b2d946ff2f7a9855754d6b815  5.0 MB
md5:4ffa7d2ed9bb46fa634e41609e030b1b  33.1 MB
md5:8e7b656cb28a273e7411655e1b40c057  754 Bytes
md5:fe18dca78547d9735a7ea0841d5b6766  816 Bytes
md5:db200837133a9f68d12538c2bc6e40eb  15.6 kB
md5:298971537252514d33de5bf470176727  105.3 kB
md5:23a0aac0f6c7d6afe34bb31db854d4d5  668.2 kB
md5:b9789d0b8c8e8f9484314e6125e43bfe  19.3 MB
md5:f4a046f1e1cd92acc3fd6fadb8ed6e06  18.8 MB
md5:0f598212851b7be1e2a9d8fe0944e904  203.2 kB
md5:9fb815125b1efe1485b2d48b5705ccce  8.6 kB
md5:ea92bb5cf255c8ab50ff4e8b55c15441  9.6 kB
md5:0f51c4dd647bf1b95299eeace34d709b  26.6 MB
md5:ce0eabed9a9180a25c4c37fcba879b03  3.6 MB
md5:be8b52b6686a0902edfa3974eea6f7f4  54.3 MB
md5:0d8c20788d13ae7926d2d763450550fc  1.6 MB
md5:0d44b82091469f01ca43338a8767ed05  9.7 kB
md5:7679107cb55a887cb3e385de15ccf2ac  10.7 kB
md5:b4128930f42d4cb482fe77091d964e99  684.1 kB
md5:63550f22135689ce0100cfdeb681eeb1  46.2 MB
md5:19246e0b400f6e418b0d9627980b15ae  40.5 MB
md5:971cf21e8f25f930e9a6867538803ff8  656.7 kB
md5:d488b85ebe15972190663b3aec87c038  20.3 kB
md5:adbcb04d590ce991f28d348f73b04575  22.5 kB
md5:0789e0e362bc25a44bd546c316480fb1  553.9 kB
md5:c50c4e7228ad7e1933570efac46e4790  1.5 MB
md5:f067ecded9cba015992ee4ec69f0416b  77.8 MB
md5:2c7b26415a642bee818f069a7e8fc0db  83.1 MB
md5:e41efd2ecad8c79176611e6253b19e95  18.9 kB
md5:dead5b825ac1024f09c7034f86fd3ee2  20.8 kB
md5:dd381b46434abcd93ac6dceb8c1db209  1.6 MB
md5:14fc41d053a38e6f79df39688315e6d7  77.8 MB
md5:73d961396b8ba0a195e88534ad40833b  118.1 MB
md5:5d22cf21c1b1ec4b2eb042cd06d7efb1  10.4 kB
md5:2280864fb7445f9193f4718b3999c7e6  11.4 kB
md5:d28bc714a42fc0c231109352759e0b39  76.2 MB
md5:cd19d8a41d66ff556140fece6798c2d3  85.3 MB
md5:a9cb4416d6dd69769c85d8a867f2eb63  10.6 kB
md5:db24d33044efed8afe38e4fdf1376a86  11.7 kB
md5:6f355535afa6f62f650a28983153472c  44.6 MB
md5:96e521686c3926364e23ddf6d6d6de6d  5.0 MB
md5:a94dbca68998d443bed7021e0e21f5f6  387.9 MB
md5:1b0e5c5f1a2b8d84a510691aac35932c  8.6 kB
md5:9ace9703bbf791167ebd60fa780ff0b6  9.5 kB
md5:3c9a4d0950488a27e34f738772f7f947  14.1 MB
md5:1b173ec1764cd09a4d1e1ed4bd8591b4  524.9 kB
md5:1cab57c861eec63d69424640e6c3a9ff  44.4 MB
md5:f1155d11890debfb20cc91c3eee1a93a  1.1 GB
md5:2fd97eafcc0b62fe29267986c2615070  628.9 kB
md5:2237528ecc559359cdafcbce6e35a6e5  23.7 kB
md5:af96a87304cc1833b3b9a34a8abc736b  26.0 kB
md5:1d1c90108d078170e6240d726f4d0924  1.8 GB
md5:108006f15b96d5d7617eeba1a746c4d9  3.3 GB
md5:fcae3573c1ffd988c39dab11313256c6  268.5 MB