Published February 5, 2020 | Version v1
Dataset Open

Data-Driven Domain Discovery (D4) - Evaluation Datasets

  • 1. New York University
  • 2. AT&T Labs-Research

Description

Data used for the evaluation of our data-driven domain discovery algorithm (D4). In our evaluation, we used four different datasets from two repositories: NYC Open Data and State of Utah Open Data. The repositories were downloaded using the Socrata Open Data API on Nov. 22, 2016 and Sep. 27, 2019, respectively. The downloads contained 1114 and 1953 datasets, respectively.

NYC Open Data contains datasets from NYC agencies such as the Department of Education and the Department of Finance. The datasets are labeled using a set of 13 different labels. Based on these labels, we obtained three datasets: (a) Education, (b) Finance (using labels Economy and Finance), and (c) Services, which includes all tables that are not in (a) or (b). The State of Utah Open Data download serves as the fourth dataset.
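The label-based split described above can be sketched as follows. This is only an illustration under stated assumptions: the input mapping `labels_by_table` and the function name `partition_tables` are hypothetical, and the sketch assumes a table may carry more than one label (so a table could appear in both Education and Finance). The actual procedure used to build the datasets may differ.

```python
from typing import Dict, Set

# Label names taken from the description; how labels are stored is an assumption.
EDUCATION_LABELS = {"Education"}
FINANCE_LABELS = {"Economy", "Finance"}


def partition_tables(labels_by_table: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Split table identifiers into Education, Finance, and Services groups.

    labels_by_table is a hypothetical mapping from a table identifier
    to the set of labels attached to that table.
    """
    education = {t for t, labels in labels_by_table.items() if labels & EDUCATION_LABELS}
    finance = {t for t, labels in labels_by_table.items() if labels & FINANCE_LABELS}
    # Services holds every table that is in neither (a) nor (b).
    services = set(labels_by_table) - education - finance
    return {"Education": education, "Finance": finance, "Services": services}
```

Computing Services as a set difference mirrors the wording of the description: it is defined by exclusion rather than by its own list of labels.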

The published data has been pre-processed using D4 (as described in the README file). For our evaluation, we considered only columns in which the majority of distinct terms are text. The published datasets contain the column metadata file (columns.tsv), the index of unique terms across all datasets in the two repositories (term-index.txt.gz), and the set of equivalence classes derived from the term index (compressed-term-index.txt.gz). We also include files containing the terms for the 25 ground-truth domains that we used to evaluate our algorithm.
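Before writing a parser for the index files, it can help to look at their first few lines. The sketch below streams a gzip-compressed text file without decompressing it to disk. It makes no assumption about the column layout, which is documented in the README; the function name `peek_lines` and the UTF-8 encoding are assumptions made for this example.

```python
import gzip
from itertools import islice
from typing import List


def peek_lines(path: str, n: int = 5) -> List[str]:
    """Return the first n lines of a gzip-compressed text file.

    The file is read as a stream, so large indexes are not loaded into memory.
    UTF-8 is an assumption; undecodable bytes are replaced rather than raising.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
        return [line.rstrip("\n") for line in islice(f, n)]
```

For example, `peek_lines("term-index.txt.gz")` would return the first five lines of the term index, assuming the file sits in the current directory.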

 

Files

Files (1.5 GB)

Checksum Size
md5:f3a8bdd9b6e5a580e70e7f42fe9f07ed 412.5 MB
md5:62b332f2eeb17da3511b254852ff2a54 1.1 GB
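Since the record lists an MD5 checksum for each file, a download can be checked for corruption before use. The following is a minimal sketch: the two checksums are copied from the listing above, while the function names and the local file path are hypothetical.

```python
import hashlib

# Checksums as listed on this record.
EXPECTED_MD5 = {
    "f3a8bdd9b6e5a580e70e7f42fe9f07ed",
    "62b332f2eeb17da3511b254852ff2a54",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path: str) -> bool:
    """Return True if the file's MD5 equals one of the listed checksums."""
    return md5_of_file(path) in EXPECTED_MD5
```

Reading in chunks keeps memory use constant, which matters for files of several hundred megabytes or more.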