The BigGrams: the semi-supervised information extraction system from HTML: an improvement in the wrapper induction - dataset
Creators
Description
Brief description
The zip file contains two folders. The "websites" folder contains crawled web pages from real websites, such as agatameble.pl (an e-shop), filmweb.pl (a website about films), and ptaki.info (a website about birds). The "reference-seeds" folder contains three subfolders, one per website: agatameble.pl, filmweb.pl, and ptaki.info. Each subfolder contains a reference-seeds.csv file. This file holds the reference instances, i.e. the carefully labelled ground-truth values for each web page of the corresponding website.
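The layout described above can be read programmatically. The following is a minimal sketch, not part of the dataset itself: the function name `load_reference_seeds` is hypothetical, and the code assumes that the CSV files are UTF-8 encoded and comma-delimited, which the description does not state. It locates every reference-seeds.csv in the archive and groups its rows by the name of the enclosing subfolder.

```python
import csv
import io
import zipfile
from pathlib import PurePosixPath


def load_reference_seeds(zip_source):
    """Return {website_name: list_of_csv_rows} for each reference-seeds.csv.

    `zip_source` is a path to the archive or a binary file-like object.
    Hypothetical helper: encoding and delimiter are assumptions.
    """
    seeds = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            path = PurePosixPath(name)
            if path.name != "reference-seeds.csv":
                continue  # skip crawled pages and directory entries
            website = path.parent.name  # e.g. "filmweb.pl"
            with archive.open(name) as raw:
                # Assumption: UTF-8 text with the default comma delimiter.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                seeds[website] = list(csv.reader(text))
    return seeds
```

Calling `load_reference_seeds("bigrams-reference-data-sets.zip")` should then yield one entry per website subfolder. If the files use a different delimiter or encoding, the `csv.reader` and `TextIOWrapper` arguments need to be adjusted accordingly.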
Reference
I would appreciate it if you cite the following paper when using the dataset:
Marcin Mirończuk, The BigGrams: the semi-supervised information extraction system from HTML: an improvement in the wrapper induction, Knowledge and Information Systems, Volume 54, Issue 3, pp. 711–776, 2018 (open-access PDF: http://rdcu.be/u88F or DOI: http://dx.doi.org/10.1007/s10115-017-1097-2)
Files
Name | Size |
---|---|
bigrams-reference-data-sets.zip (md5:343265823f08d21ac49d52c9fe2b07d2) | 6.9 MB |
Additional details
Related works
- Is supplement to
- 10.1007/s10115-017-1097-2 (DOI)