Published May 4, 2023 | Version v1
Dataset Open

Schema.org mark-up data for named entities

Authors/Creators

Description

This dataset contains two files: original_data.zip and website_5folds.zip.

original_data.zip unpacks into three .csv files: Place.csv, CreativeWork.csv, and LocalBusiness.csv. Each file contains one entity per row, and each entity belongs to a subclass of the class indicated by the file name. There are 8 columns:

  • the first 2 columns are simply the index of the row
  • description_t: the long textual description of the entity
  • schemaorg_class: the schema.org class assigned to the entity
  • name_tpage_domain: always empty
  • name_t: the name of the entity
  • page_domain: the website where the entity mark-up data is found
  • label: an index for the schemaorg_class
  • description: this is the name of the entity (name_t) plus the first sentence of its description (from description_t)
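
As a rough illustration of how the description column relates to name_t and description_t, the sketch below reads rows with Python's csv module and rebuilds the combined text. The sentence-splitting rule (cut at the first '.', '!' or '?' followed by whitespace) is an assumption, since the record does not say how the first sentence was extracted, and the sample row is made up for the example rather than taken from the dataset.

```python
import csv
import io
import re


def first_sentence(text):
    """Return the text up to the first '.', '!' or '?' that ends a sentence.

    Assumption: a sentence ends at the first such mark followed by
    whitespace or the end of the string. The dataset's own rule is unknown.
    """
    match = re.search(r"[.!?](?:\s|$)", text)
    return text[: match.end()].strip() if match else text.strip()


def build_description(name, long_description):
    """Combine the entity name with the first sentence of its description."""
    return f"{name} {first_sentence(long_description)}".strip()


def read_entities(handle):
    """Read every row of an open CSV file handle into a list of dicts."""
    return list(csv.DictReader(handle))


# In-memory stand-in for one of the .csv files; the row is invented.
sample = io.StringIO(
    "name_t,description_t,schemaorg_class\n"
    '"Example Cafe","Coffee and cake. Open daily.","CafeOrCoffeeShop"\n'
)
rows = read_entities(sample)
rebuilt = build_description(rows[0]["name_t"], rows[0]["description_t"])
print(rebuilt)
```

To try this on the real data, replace the in-memory sample with an open handle on Place.csv, CreativeWork.csv, or LocalBusiness.csv after unzipping.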

website_5folds.zip is a transformation of original_data.zip. It unzips into three folders: Place, LocalBusiness, and CreativeWork. Each folder contains five sub-folders, 0, 1, 2, 3, and 4, one per fold. Each numbered sub-folder contains a train.csv and a test.csv file. Each csv file contains one entity per row, with the following columns:

  • the first column is simply the index of the row
  • schemaorg_class: the schema.org class assigned to the entity
  • name_t: the name of the entity
  • description: this is the name of the entity (name_t) plus the first sentence of its description (from description_t)
  • page_domain: the name of the entity plus the processed domain name. The processing consists of parsing the domain URL, extracting the host name, applying word segmentation (tescobank -> tesco bank), and removing stopwords and TLDs (co, uk, com, fr)
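
The domain processing described for page_domain could look roughly like the sketch below. It is not the authors' code: the vocabulary, the list of dropped tokens, and the greedy longest-match segmenter are all assumptions made for illustration, and a real pipeline would likely use a proper word-segmentation library and a full stopword list.

```python
from urllib.parse import urlparse

# Hypothetical word list for segmentation; a real one would be far larger.
VOCABULARY = {"tesco", "bank", "the", "guardian"}
# Hypothetical stopwords and TLD-like labels to drop.
DROPPED = {"www", "co", "uk", "com", "fr", "the"}


def host_name(url):
    """Extract the lower-cased host name from a URL or bare domain."""
    # urlparse only fills in the host when a scheme or '//' is present.
    parsed = urlparse(url if "//" in url else "//" + url)
    return (parsed.hostname or "").lower()


def segment(token, vocabulary):
    """Greedy longest-match segmentation.

    Returns the token unsplit if the vocabulary cannot cover it. Greedy
    matching can miss splits that a backtracking search would find.
    """
    words, start = [], 0
    while start < len(token):
        for end in range(len(token), start, -1):
            if token[start:end] in vocabulary:
                words.append(token[start:end])
                start = end
                break
        else:
            return [token]
    return words


def process_domain(url):
    """Host name -> segmented words with stopwords and TLDs removed."""
    words = []
    for label in host_name(url).split("."):
        if label in DROPPED:
            continue
        words.extend(segment(label, VOCABULARY))
    return " ".join(word for word in words if word not in DROPPED)


print(process_domain("https://www.tescobank.com/loans"))
```

In the dataset, the resulting words are combined with the entity name to form the page_domain value; how the two parts are joined is not specified in this record.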

As mentioned, website_5folds.zip is a transformation of original_data.zip and in fact contains multiple replications of it. It was created for 5-fold cross-validation experiments, ensuring that there is no overlap in the page_domain of entities between the training and test sets.
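
One way to obtain folds with no shared page_domain between training and test sets is to assign whole domains, rather than individual rows, to folds. The sketch below is a minimal illustration under that assumption; the balancing heuristic (largest domains first, each placed in the currently smallest fold) and the function names are hypothetical, and the authors' actual splitting procedure is not documented here. It operates on the raw page_domain values of original_data.zip, and the demo rows are synthetic.

```python
from collections import defaultdict


def domain_folds(rows, n_folds=5):
    """Assign rows to folds so that each page_domain lands in exactly one fold."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["page_domain"]].append(row)

    folds = [[] for _ in range(n_folds)]
    # Largest domains first, each into the currently smallest fold,
    # to keep fold sizes roughly balanced.
    for domain in sorted(groups, key=lambda d: (-len(groups[d]), d)):
        min(folds, key=len).extend(groups[domain])
    return folds


def train_test_pairs(folds):
    """Yield (train, test) pairs, using each fold once as the test set."""
    for i, test in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test


# Synthetic rows: 30 entities spread over 7 made-up domains.
rows = [
    {"name_t": f"entity {i}", "page_domain": f"site{i % 7}.example"}
    for i in range(30)
]
folds = domain_folds(rows)
for train, test in train_test_pairs(folds):
    train_domains = {row["page_domain"] for row in train}
    test_domains = {row["page_domain"] for row in test}
    assert not train_domains & test_domains  # no domain on both sides
```

Because every entity appears in one test set and in four training sets, writing all five train/test pairs to disk repeats the original rows several times, which matches the replication noted above.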

Files

Files (497.1 MB)

Name Size
md5:21686e3f29da8ac42c757e437cdda2e2 169.0 MB
md5:8b0ed64477b1cb6a5c95dd3448eee8d0 328.1 MB