Schema.org mark-up data for named entities
Description
This dataset contains two files: original_data.zip and website_5folds.zip.
original_data.zip unpacks into three .csv files: Place.csv, CreativeWork.csv, and LocalBusiness.csv. Each file contains one entity per row, and each entity belongs to a subclass of the class indicated by the file name. There are 8 columns:
- the first 2 columns are simply the index of the row
- description_t: the long textual description of the entity
- schemaorg_class: the schema.org class assigned to the entity
- name_tpage_domain: always empty
- name_t: the name of the entity
- page_domain: the website where the entity mark-up data is found
- label: an index for the schemaorg_class
- description: this is the name of the entity (name_t) plus the first sentence of its description (from description_t)
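As a minimal sketch of how these files could be read, the snippet below streams rows from a CSV stored inside the archive using only the Python standard library. It assumes the files are UTF-8 encoded, have a header row, and sit at the archive root under the names given above; none of this is confirmed by the description, and read_entities is a hypothetical helper name.

```python
import csv
import io
import zipfile


def read_entities(zip_path, member):
    """Yield one dict per entity row from a CSV stored inside a zip archive.

    Assumes UTF-8 text with a header row (an assumption, not stated
    in the dataset description).
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # Wrap the binary stream so the csv module receives text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.DictReader(text):
                yield row


# Hypothetical usage, assuming Place.csv is at the archive root:
# for row in read_entities("original_data.zip", "Place.csv"):
#     print(row["name_t"], row["schemaorg_class"])
```

Streaming row by row avoids unpacking the archive to disk, which may help given the size of the files.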
website_5folds.zip is a transformation of original_data.zip. It unzips into three folders: Place, LocalBusiness, and CreativeWork. Each folder contains five sub-folders, 0, 1, 2, 3, and 4, one per fold. Each numbered sub-folder contains a train.csv and a test.csv file. Each csv file contains one entity per row, with the following columns:
- the first column is simply the index of the row
- schemaorg_class: the schema.org class assigned to the entity
- name_t: the name of the entity
- description: this is the name of the entity (name_t) plus the first sentence of its description (from description_t)
- page_domain: the name of the entity plus the processed domain name. The processing consists of parsing the domain URL, extracting the host name, applying word segmentation (tescobank -> tesco bank), and removing stopwords and TLDs (co, uk, com, fr)
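The page_domain processing described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure actually used to build the dataset: the vocabulary, stopword list, and TLD list are tiny hypothetical stand-ins, and the dictionary-based segmenter is one of several possible word-segmentation techniques.

```python
from urllib.parse import urlparse

# Hypothetical, tiny resources for illustration only; the vocabulary,
# stopwords, and TLD list actually used for the dataset are not given.
VOCAB = {"tesco", "bank", "the", "book", "shop"}
STOPWORDS = {"the", "www"}
TLDS = {"co", "uk", "com", "fr"}


def segment(token, vocab=VOCAB):
    """Split token into vocabulary words, preferring the fewest words.

    Returns [token] unchanged when no full split into known words exists.
    """
    best = [None] * (len(token) + 1)  # best[i]: best split of token[:i]
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            if best[start] is not None and token[start:end] in vocab:
                cand = best[start] + [token[start:end]]
                if best[end] is None or len(cand) < len(best[end]):
                    best[end] = cand
    return best[-1] if best[-1] is not None else [token]


def process_domain(url):
    """Parse the URL, take the host name, segment it, drop stopwords and TLDs."""
    # urlparse only fills in hostname when a scheme or "//" prefix is present.
    host = urlparse(url if "//" in url else "//" + url).hostname or ""
    words = []
    for label in host.lower().split("."):
        if label in TLDS or label in STOPWORDS:
            continue
        words.extend(w for w in segment(label) if w not in STOPWORDS)
    return " ".join(words)


def page_domain_feature(name, url):
    """Entity name followed by the processed domain, as in the folds files."""
    return f"{name} {process_domain(url)}".strip()
```

With these toy word lists, process_domain("https://www.tescobank.com/loans") yields "tesco bank", matching the example in the column description.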
As mentioned, website_5folds.zip is a transformation of original_data.zip and in fact contains multiple replications of it. It was created for 5-fold validation experiments while ensuring that there is no overlap in the page_domain of entities between the training and test sets.
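One way to obtain folds with no page_domain overlap is to assign whole domains, rather than individual entities, to folds. The sketch below does this with a simple greedy balancing heuristic. It is not necessarily how website_5folds.zip was produced; domain_folds and train_test are hypothetical helper names, and rows are assumed to be dicts carrying the raw page_domain value from original_data.zip.

```python
from collections import defaultdict


def domain_folds(rows, n_folds=5, key="page_domain"):
    """Assign whole domains to folds so that no domain spans two folds."""
    by_domain = defaultdict(list)
    for row in rows:
        by_domain[row[key]].append(row)
    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: largest domains first, each into the smallest fold.
    for domain in sorted(by_domain, key=lambda d: (-len(by_domain[d]), d)):
        min(folds, key=len).extend(by_domain[domain])
    return folds


def train_test(folds, i):
    """Use fold i as the test set and the remaining folds as training data."""
    test = folds[i]
    train = [row for j, fold in enumerate(folds) if j != i for row in fold]
    return train, test
```

Under this scheme each entity appears in one test set and in the training sets of the other four folds, which is consistent with the folds archive containing multiple replications of the original data.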
Files
(497.1 MB)
| Checksum | Size |
|---|---|
| md5:21686e3f29da8ac42c757e437cdda2e2 | 169.0 MB |
| md5:8b0ed64477b1cb6a5c95dd3448eee8d0 | 328.1 MB |