Published April 4, 2018 | Version 1.0
Dataset Open

The BigGrams: the semi-supervised information extraction system from HTML: an improvement in the wrapper induction - dataset

Description

Brief description

The zip file contains two folders. The "websites" folder includes crawled web pages from real websites, such as agatameble.pl (an e-shop website), filmweb.pl (a website about films), and ptaki.info (a website about birds). The "reference-seeds" folder contains three subfolders, one per website: agatameble.pl, filmweb.pl, and ptaki.info. Each subfolder contains a reference-seeds.csv file. This file holds the reference instances, i.e. the carefully labelled ground-truth values for each web page of the corresponding website.
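The folder layout described above can be walked with a short script. This is a minimal sketch, assuming the archive has already been extracted and that the "reference-seeds" folder sits directly under the extraction directory; the function name find_reference_seeds and the root argument are hypothetical and not part of the dataset. The CSV files are only located, not parsed, because their column layout is not described here.

```python
from pathlib import Path

# Site subfolder names as listed in the dataset description.
SITES = ("agatameble.pl", "filmweb.pl", "ptaki.info")


def find_reference_seeds(root):
    """Map each site name to its reference-seeds.csv path, if the file exists.

    `root` is assumed to be the directory the zip was extracted into
    (hypothetical; adjust if the archive unpacks into an extra top-level folder).
    """
    root = Path(root)
    found = {}
    for site in SITES:
        candidate = root / "reference-seeds" / site / "reference-seeds.csv"
        if candidate.is_file():
            found[site] = candidate
    return found
```

Calling find_reference_seeds on the extraction directory returns a dictionary with one entry per website whose reference file was found, so a missing subfolder shows up as a missing key rather than an error.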

Reference

I would appreciate it if you cited the following paper when using the dataset:

Marcin Mirończuk. The BigGrams: the semi-supervised information extraction system from HTML: an improvement in the wrapper induction, Knowledge and Information Systems, Volume 54, Issue 3, pp. 711–776, 2018 (PDF, Open Access: http://rdcu.be/u88F or DOI: http://dx.doi.org/10.1007/s10115-017-1097-2)

Files

bigrams-reference-data-sets.zip

Files (6.9 MB)

md5:343265823f08d21ac49d52c9fe2b07d2
6.9 MB
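The published md5 checksum can be used to confirm that a downloaded copy of the archive is intact. The following is a minimal sketch using Python's standard hashlib module; the helper names md5_of_file and matches_published_md5 are hypothetical, and the path passed in is assumed to point at the locally downloaded zip file.

```python
import hashlib

# Checksum as published on this record.
EXPECTED_MD5 = "343265823f08d21ac49d52c9fe2b07d2"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path):
    """True if the file at `path` has the checksum listed on this record."""
    return md5_of_file(path) == EXPECTED_MD5
```

If matches_published_md5 returns False for the downloaded archive, the file is likely incomplete or corrupted and should be downloaded again.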

Additional details

Related works

Is supplement to
10.1007/s10115-017-1097-2 (DOI)