NER4conllu (Named Entities from Ancora Corpus)
Description
This is a Named Entity Recognition (NER) dataset derived from the Ancora corpus and adapted to the CoNLL-U (.conllu) format for machine learning purposes.
Multiword expressions (including Named Entities) in the original Ancora corpus are aggregated into a single lexical item using underscores (e.g. "Ajuntament_de_Barcelona"). We split them to align with the word-per-line .conllu format and added conventional Inside-Outside-Beginning (IOB) tags to mark and classify Named Entities. We did not filter out any of the NE categories from Ancora (weak and strong). We also made six minor edits by hand.
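The splitting and tagging step described above can be illustrated with a minimal sketch. It is not the script used to build the dataset. The two-column input layout (form, then entity type or "O"), the function name split_multiword and the ORG label are assumptions made for the example.

```shell
# Hypothetical helper: split underscore-joined items into one token per line
# and assign IOB tags. Input lines are "form<TAB>type", where type is an
# entity label (here ORG, an assumed label) or "O" for non-entities.
split_multiword() {
    awk -F '\t' '{
        n = split($1, parts, "_")          # break the item on underscores
        for (i = 1; i <= n; i++) {
            if ($2 == "O")      tag = "O"          # outside any entity
            else if (i == 1)    tag = "B-" $2      # first token of the entity
            else                tag = "I-" $2      # following tokens
            print parts[i] "\t" tag
        }
    }'
}

printf 'Ajuntament_de_Barcelona\tORG\naprova\tO\n' | split_multiword
# Ajuntament	B-ORG
# de	I-ORG
# Barcelona	I-ORG
# aprova	O
```

Each underscore-joined item becomes as many lines as it has tokens, which is what lets the tags line up with the word-per-line treebank files.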
For licensing reasons, we distribute the tags in a separate file (with exactly the same number of lines as the original files), to be added to each of the Universal Dependencies treebank split files as an eleventh column, so that systems have all relevant information to learn from, such as POS tags, lemmas, or dependency labels and relations.
To realign the tags, use the UD_Catalan_Ancora Treebank (version 2.7) and run the following command for each split (shown here for the test split):
paste ca_ancora-ud-test.conllu ancora_test.NER4conllu > ancora_test.NER4conllu.remapped.conllu
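The same step can be applied to all splits in one pass. The sketch below wraps the paste command in a helper that first checks that both files have the same number of lines, since paste would otherwise misalign the tags without reporting an error. The function name realign_split is hypothetical, and the train and dev tag file names are assumed by analogy with ancora_test.NER4conllu; adjust them to the names found in the archive.

```shell
# Hypothetical helper: append the tag file to a treebank split as an extra column.
# Usage: realign_split <treebank.conllu> <tags.NER4conllu> <output.conllu>
realign_split() {
    treebank=$1; tags=$2; out=$3
    # Both files must have the same number of lines for paste to align them.
    n_tb=$(wc -l < "$treebank" | tr -d ' ')
    n_tags=$(wc -l < "$tags" | tr -d ' ')
    if [ "$n_tb" -ne "$n_tags" ]; then
        echo "line count mismatch: $treebank ($n_tb) vs $tags ($n_tags)" >&2
        return 1
    fi
    paste "$treebank" "$tags" > "$out"
}

# Assumed file names for train and dev, following the pattern of the test split.
for split in train dev test; do
    tb="ca_ancora-ud-$split.conllu"
    tags="ancora_$split.NER4conllu"
    if [ -f "$tb" ] && [ -f "$tags" ]; then
        realign_split "$tb" "$tags" "ancora_$split.NER4conllu.remapped.conllu"
    else
        echo "skipping $split: $tb or $tags not found" >&2
    fi
done
```

A mismatch in line counts usually means the treebank files are from a different UD release than version 2.7.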
If you use this resource in your work, please cite our latest paper:
@inproceedings{armengol-estape-etal-2021-multilingual,
title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
author = "Armengol-Estap{\'e}, Jordi and
Carrino, Casimiro Pio and
Rodriguez-Penagos, Carlos and
de Gibert Bonet, Ona and
Armentano-Oller, Carme and
Gonzalez-Agirre, Aitor and
Melero, Maite and
Villegas, Marta",
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-acl.437",
doi = "10.18653/v1/2021.findings-acl.437",
pages = "4933--4946",
}
Files
ner4conllu.zip (90.0 kB)
md5:c09a46af0cf8e73d217108319791322a