Nerwip Corpus
Description
Description. This corpus contains 408 Wikipedia articles, all of them biographies, manually annotated to mark entities of the following types: Dates, Locations, Organizations and Persons. It was designed for use with our tool Nerwip, in order to evaluate and compare existing named entity recognition (NER) tools on biographical data.
The other files contain data required by the NER tools (models, dictionaries, etc.), which Nerwip needs in order to detect entities. If you want to use the tool, you need to unzip these files as explained in the README file associated with Nerwip on GitHub.
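As an illustration of the unzipping step, here is a minimal Python sketch based on the standard `zipfile` module. The helper name `extract_archive` and the destination folder are assumptions made for this example only; the actual target location is the one given in the Nerwip README, which this sketch does not reproduce.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract every member of a zip archive into dest_dir.

    Hypothetical helper: the destination expected by Nerwip is
    described in its README, not here.
    Returns the list of member names found in the archive.
    """
    dest = Path(dest_dir)
    # Create the destination folder (and any parents) if it is missing.
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

For instance, `extract_archive("nerwip-3-data.zip", "nerwip-data")` would unpack the archive listed under Files into a placeholder folder named `nerwip-data`, which should be replaced by the folder indicated in the README.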
The corpus was first assembled by Burcu Küpelioğlu during her end-of-studies project, then cleaned and corrected by Samet Atdağ during his MSc, for a total of 250 articles (v3). Vincent Labatut then extended it further, to reach 408 articles (v4).
Source code. The source code of our tool Nerwip is available online: https://github.com/CompNet/nerwip
License. The dataset is shared under a Creative Commons Zero (CC0) license.
Citation. If you use this corpus, please cite the following article:
- A Comparison of Named Entity Recognition Tools Applied to Biographical Texts, S. Atdağ & V. Labatut, 2013. ⟨hal-00849797⟩ - DOI: 10.1109/IcConSCS.2013.6632052
@InProceedings{Atdag2013,
author = {Atdağ, Samet and Labatut, Vincent},
title = {A Comparison of Named Entity Recognition Tools Applied to Biographical Texts},
booktitle = {2\textsuperscript{nd} International Conference on Systems and Computer Science},
year = {2013},
pages = {228--233},
address = {Lille, FR},
publisher = {IEEE Publishing},
doi = {10.1109/IcConSCS.2013.6632052},
}
Files
nerwip-3-data.zip
Additional details
Related works
- Is documented by
  - Conference paper: 10.1109/IcConSCS.2013.6632052 (DOI)
- Is required by
  - Software: https://github.com/CompNet/nerwip (URL)
- Obsoletes
  - Dataset: 10.6084/m9.figshare.1289791 (DOI)