Published February 10, 2021 | Version 1.0.1
Dataset Open

NER4conllu (Named Entities from Ancora Corpus)

Description

This is a dataset for Named Entity Recognition (NER) from the Ancora corpus, adapted to the conllu format for machine learning purposes.

Multiwords (including Named Entities) in the original Ancora corpus are aggregated as a single lexical item using underscores (e.g. "Ajuntament_de_Barcelona"). We split them to align with the word-per-line .conllu format and added conventional Inside-Outside-Beginning (IOB) tags to mark and classify Named Entities. We did not filter out the different categories of NEs from Ancora (weak and strong). We made 6 minor edits by hand.
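The splitting and tagging step described above can be illustrated with a minimal sketch. It is not the script used to build the dataset: the function name `split_multiword`, the `B-`/`I-` prefix spelling, and the `ORG` label are assumptions made for illustration, and the actual tag inventory should be checked in the distributed files.

```python
from typing import List, Optional, Tuple


def split_multiword(token: str, label: Optional[str]) -> List[Tuple[str, str]]:
    """Split an underscore-joined token into words and give each an IOB tag.

    `label` is a hypothetical entity class (e.g. "ORG"); None means the
    token is not a Named Entity, so every word receives "O".
    """
    words = [w for w in token.split("_") if w]
    if label is None:
        return [(w, "O") for w in words]
    # First word opens the entity (B-), the remaining words continue it (I-).
    return [
        (w, ("B-" if i == 0 else "I-") + label)
        for i, w in enumerate(words)
    ]


rows = split_multiword("Ajuntament_de_Barcelona", "ORG")
for word, tag in rows:
    print(f"{word}\t{tag}")
```

Each word ends up on its own line with its own tag, which is what allows the tags to line up with the word-per-line treebank rows.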

For licensing reasons, we distribute the tags in separate files (each with exactly the same length as the corresponding original file), to be added to each of the Universal Dependencies treebank split files as an eleventh column, so that systems have all relevant information to learn from, such as POS, lemma, or dependency labels and relations.

To realign the tags, use the UD_Catalan_Ancora Treebank (version 2.7), and apply, for each of the splits, the following command:

paste ca_ancora-ud-test.conllu ancora_test.NER4conllu > ancora_test.NER4conllu.remapped.conllu
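For environments without `paste`, the same realignment can be sketched in Python. This is an illustrative equivalent, not part of the distributed resource: the helper names `paste_columns`, `read_lines` and `realign_split` are hypothetical, and the demo row and its `B-LOC` tag are invented rather than taken from the dataset. The sketch adds one check that `paste` does not perform: it refuses to merge files whose line counts differ, which would indicate a treebank version other than the one the tags were built against.

```python
def paste_columns(treebank_lines, tag_lines):
    """Join line i of the treebank with line i of the tag file using a tab."""
    if len(treebank_lines) != len(tag_lines):
        raise ValueError(
            f"line counts differ: {len(treebank_lines)} vs {len(tag_lines)}"
        )
    return [f"{left}\t{right}" for left, right in zip(treebank_lines, tag_lines)]


def read_lines(path):
    """Read a UTF-8 text file as a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def realign_split(treebank_path, tags_path, out_path):
    """File-based counterpart of the paste command for a single split."""
    merged = paste_columns(read_lines(treebank_path), read_lines(tags_path))
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.writelines(line + "\n" for line in merged)


# In-memory demo with one invented row (not taken from the dataset).
demo_treebank = ["1\tBarcelona\tBarcelona\tPROPN\t_\t_\t0\troot\t_\t_"]
demo_tags = ["B-LOC"]  # hypothetical tag value
merged_demo = paste_columns(demo_treebank, demo_tags)
print(len(merged_demo[0].split("\t")))  # 10 treebank columns + 1 tag column
```

For the test split, the call would be `realign_split("ca_ancora-ud-test.conllu", "ancora_test.NER4conllu", "ancora_test.NER4conllu.remapped.conllu")`, using the file names from the command above.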

If you use this resource in your work, please cite our latest paper:

@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi  and
      Carrino, Casimiro Pio  and
      Rodriguez-Penagos, Carlos  and
      de Gibert Bonet, Ona  and
      Armentano-Oller, Carme  and
      Gonzalez-Agirre, Aitor  and
      Melero, Maite  and
      Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}

Files

ner4conllu.zip (90.0 kB)
md5:c09a46af0cf8e73d217108319791322a