Published March 4, 2015 | Version v3.0.0
Dataset Open

Nerwip Corpus

  • 1. Galatasaray University

Description

Description. This corpus contains 408 Wikipedia articles. They are biographies, manually annotated to mark entities of the following types: Dates, Locations, Organizations and Persons. The corpus was designed for use with our tool Nerwip, to evaluate and compare existing named entity recognition (NER) tools on biographical data.

The other files contain data related to the NER tools (models, dictionaries, etc.), which Nerwip needs in order to detect entities. To use the tool, unzip these files as explained in the README file associated with Nerwip on GitHub.
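As an illustration of the unzipping step, here is a minimal Python sketch. It is not taken from Nerwip: the function name `extract_archive` and the target folder used in the usage note are hypothetical, and the actual destination of each archive is the one given in the Nerwip README.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, target_dir):
    """Extract a zip archive into target_dir and return the member names.

    target_dir is a placeholder: the Nerwip README states where each
    archive is actually expected to be unpacked.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        # testzip() returns the name of the first corrupted member, or None.
        corrupted = archive.testzip()
        if corrupted is not None:
            raise zipfile.BadZipFile(f"corrupted member: {corrupted}")
        archive.extractall(target)
        return archive.namelist()
```

A call such as `extract_archive("nerwip-3-data.zip", "some/target/folder")` would unpack the corpus archive, where `some/target/folder` stands in for the location required by the README.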

The corpus was first assembled by Burcu Küpelioğlu during her end-of-studies project, then cleaned and corrected by Samet Atdağ during his MSc, for a total of 250 articles (v3). Vincent Labatut then extended it to 408 articles (v4).

Source code. The source code of our tool Nerwip is available online: https://github.com/CompNet/nerwip

License. The dataset is shared under a Creative Commons Zero (CC0) license.

Citation. If you use this corpus, please cite the following article:


@InProceedings{Atdag2013,
  author    = {Atdağ, Samet and Labatut, Vincent},
  title     = {A Comparison of Named Entity Recognition Tools Applied to Biographical Texts},
  booktitle = {2\textsuperscript{nd} International Conference on Systems and Computer Science},
  year      = {2013},
  pages     = {228--233},
  address   = {Lille, FR},
  publisher = {IEEE Publishing},
  doi       = {10.1109/IcConSCS.2013.6632052},
}

Files

nerwip-3-data.zip

Files (726.1 MB)

Checksum Size
md5:308b47c2270d44c2d11b301ba7a5a64f 2.2 MB
md5:6fcd034fe7dc173fe8b3280f3e0c8b2b 8.5 MB
md5:6f065096a9de8951745aa0c99bb51381 715.4 MB
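The MD5 checksums listed above can be used to check that a download completed without corruption. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `matches_listed_checksum` are hypothetical, and this listing does not say which checksum belongs to which file, so the pairing has to be taken from the record itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reading avoids loading a 700 MB archive into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum(local_path, "md5:308b47c2270d44c2d11b301ba7a5a64f")` returns `True` only if the file at `local_path` (a placeholder for wherever the download was saved) has that digest.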

Additional details

Related works

Is documented by
Conference paper: 10.1109/IcConSCS.2013.6632052 (DOI)
Is required by
Software: https://github.com/CompNet/nerwip (URL)
Obsoletes
Dataset: 10.6084/m9.figshare.1289791 (DOI)