Published May 20, 2019 | Version v1
Conference paper Open

Automatic Detection of Language and Annotation Model Information in CoNLL Corpora

  • 1. Goethe University Frankfurt

Description

We introduce AnnoHub, an ongoing effort to automatically complement existing language resources with metadata about the languages they cover and the annotation schemes (tagsets) they apply, to provide a web interface for their curation and evaluation by domain experts, and to publish them as an RDF dataset and as part of the (Linguistic) Linked Open Data (LLOD) cloud. In this paper, we focus on tabular formats with tab-separated values (TSV), a de facto standard for annotated corpora popularized by the CoNLL Shared Tasks. By extension, other formats for which a converter to CoNLL and/or TSV formats exists can be processed analogously. We describe our implementation and its evaluation against a sample of 93 corpora from the Universal Dependencies, v.2.3.

Notes

The research described in this paper was conducted in the context of the Specialized Information Service Linguistics, funded by the German Research Foundation (DFG/LIS, 2017-2019). The contributions of the second author were made with additional support from the Horizon 2020 Research and Innovation Action "Pret-a-LLOD. Ready-to-use Multilingual Linked Language Data for Knowledge Services across Sectors" (H2020-ICT-2018-2, 2019-2021).

Files

OASIcs-LDK-2019-23.pdf (489.8 kB)
md5:3e58201998d64c734c27e817bd782f18

Additional details

Funding

European Commission
Pret-a-LLOD - Ready-to-use Multilingual Linked Language Data for Knowledge Services across Sectors (grant 825182)