There is a newer version of the record available.

Published May 19, 2021 | Version v2
Dataset Open

Romanian Named Entity Recognition in the Legal domain (LegalNERo)

  • 1. Research Institute for Artificial Intelligence "Mihai Drăgănescu", Romanian Academy

Description

LegalNERo is a manually annotated corpus for named entity recognition in the Romanian legal domain. 
It provides gold annotations for organizations, locations, persons, time expressions and legal resources mentioned in legal documents.
Additionally, it offers GEONAMES codes for the named entities annotated as locations (where a link could be established). 

The LegalNERo corpus is available in different formats: span-based, token-based and RDF. 
The Linguistic Linked Open Data (LLOD) version is provided in RDF-Turtle format.

CONLLUP files conform to the CoNLL-U Plus format (https://universaldependencies.org/ext-format.html).
Part-of-speech tagging was performed using UDPipe. 
Named entity annotations are placed in the column "RELATE:NE" (the 11th column), as defined in the "global.columns" metadata field.
Similarly, GEONAMES references are in the column "RELATE:GEONAMES" (the 12th and last column).
Automatic processing was performed through the RELATE platform (https://relate.racai.ro).
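
As an illustration of the column layout described above, the following minimal Python sketch reads CoNLL-U Plus content and looks up the "RELATE:NE" and "RELATE:GEONAMES" columns by name through the "global.columns" header. The function name read_conllup, the sample tokens, the BIO-style tags, the GEONAMES value and the names of the first ten columns are assumptions made for this example; they are not taken from the corpus files.

```python
from typing import Dict, Iterable, Iterator, List


def read_conllup(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of {column name: value} dictionaries."""
    columns: List[str] = []
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns"):
            # Column names are listed after "=" and separated by spaces.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment / metadata lines
        elif not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:
        yield sentence


# Invented sample: column order, tags and the GEONAMES value are assumptions.
HEADER = ("# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL "
          "DEPS MISC RELATE:NE RELATE:GEONAMES")
ROWS = [
    ["1", "Parlamentul", "parlament", "NOUN", "_", "_", "0", "root", "_", "_", "B-ORG", "_"],
    ["2", "European", "european", "ADJ", "_", "_", "1", "amod", "_", "_", "I-ORG", "_"],
    ["3", "din", "din", "ADP", "_", "_", "4", "case", "_", "_", "O", "_"],
    ["4", "Bruxelles", "Bruxelles", "PROPN", "_", "_", "1", "nmod", "_", "_", "B-LOC", "12345"],
]
SAMPLE = [HEADER] + ["\t".join(row) for row in ROWS] + [""]

sentences = list(read_conllup(SAMPLE))
for token in sentences[0]:
    print(token["FORM"], token["RELATE:NE"], token["RELATE:GEONAMES"])
```

Looking columns up by name rather than by fixed position keeps the sketch valid even if a file declares its columns in a different order.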

ANN files conform to BRAT format (https://brat.nlplab.org/).
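
A minimal sketch of reading text-bound annotations from a BRAT standoff file is shown below. It assumes the usual BRAT layout of an identifier, a label with character offsets, and the covered text, separated by tabs. The names Entity and parse_ann, the sample sentence and the labels ORG and LEGAL are hypothetical choices for this example, not content copied from the archive.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    ann_id: str
    label: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    text: str


def parse_ann(content: str) -> List[Entity]:
    """Parse text-bound ("T") annotations from BRAT standoff content."""
    entities: List[Entity] = []
    for line in content.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes and blank lines
        ann_id, label_and_span, text = line.split("\t", 2)
        label, span = label_and_span.split(" ", 1)
        # Discontinuous spans look like "0 5;10 15"; keep the outer bounds.
        fragments = [fragment.split() for fragment in span.split(";")]
        start, end = int(fragments[0][0]), int(fragments[-1][1])
        entities.append(Entity(ann_id, label, start, end, text))
    return entities


# Invented example text and annotations (not from the corpus).
TEXT = "Parlamentul European a adoptat Regulamentul (UE) 2016/679."
ANN = (
    "T1\tORG 0 20\tParlamentul European\n"
    "T2\tLEGAL 31 57\tRegulamentul (UE) 2016/679\n"
)

entities = parse_ann(ANN)
for entity in entities:
    # The offsets index into the matching raw text file.
    print(entity.label, repr(TEXT[entity.start:entity.end]))
```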
 
The archive contains: 

- ann_LEGAL_PER_LOC_ORG_TIME_overlap 
    Folder in which all the files are in .ann format and contain annotations of: legal resources mentioned, persons, locations, organizations and time. 
    Overlapping annotations of organizations and time entities inside legal references were allowed. 

- ann_LEGAL_PER_LOC_ORG_TIME 
    Folder in which all the files are in .ann format and contain annotations of: legal resources mentioned, persons, locations, organizations and time. 
    Overlapping annotations were not allowed and only the longest named entities were annotated. 

- ann_PER_LOC_ORG_TIME 
    Folder in which all the files are in .ann format and contain annotations of: persons, locations, organizations and time. 
    There are no overlapping annotations. 

- conllup_LEGAL_PER_LOC_ORG_TIME 
    Folder in which all the files are in .conllup format and contain annotations of: legal resources mentioned, persons, locations, organizations and time. 
    Overlapping annotations were not allowed and only the longest named entities were annotated. 
    The annotation of these files was enhanced with GEONAMES codes (where linking was possible).  

- conllup_PER_LOC_ORG_TIME 
    Folder in which all the files are in .conllup format and contain annotations of: persons, locations, organizations and time. 
    Overlapping annotations were not allowed and only the longest named entities were annotated. 
    The annotation of these files was enhanced with GEONAMES codes (where linking was possible).

- rdf 
    Folder containing the corpus in RDF-Turtle format.
    All the annotations are available here in both span and token format.

- text 
    Folder containing the raw texts.
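
The difference between the folder that allows overlapping annotations and the folders that keep only the longest named entities can be illustrated with a small Python sketch. The greedy filter below is one possible way to reduce a set of nested spans to the longest non-overlapping ones; it is an assumption made for illustration and not necessarily the procedure used to build the corpus. The name keep_longest and the sample spans are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label); end is exclusive


def keep_longest(spans: List[Span]) -> List[Span]:
    """Greedily keep the longest spans and drop any span that overlaps one already kept."""
    kept: List[Span] = []
    # Visit longer spans first; ties are broken by start offset.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        start, end, _ = span
        if all(end <= k[0] or start >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)


# Invented spans: an ORG nested inside a LEGAL reference, plus a separate ORG.
nested = [(0, 20, "ORG"), (31, 57, "LEGAL"), (45, 47, "ORG")]
flat = keep_longest(nested)
print(flat)  # [(0, 20, 'ORG'), (31, 57, 'LEGAL')]
```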


LICENSING

This work is provided under the license CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivatives 4.0 International).
The license can be viewed online here: https://creativecommons.org/licenses/by-nc-nd/4.0/ 
and the full text here: https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode . 


CONTACT

Research Institute for Artificial Intelligence "Mihai Drăgănescu", Romanian Academy
Web: http://www.racai.ro 
Contact emails: vasile@racai.ro , maria@racai.ro

Files

legalnero.zip (21.5 MB)
md5:887c3cb461ff60a28b7dff45471c7c74