Published October 9, 2025 | Version Full text

Enhancing complex XML documents with linguistic annotation

Authors/Creators

  • 1. Charles University, Faculty of Arts

Description

When building corpora from annotated XML documents, compilers usually run up against the inability of most tools for linguistic analysis and parsing (tokenizers, lemmatizers, PoS taggers, etc.) to process anything other than plain-text input. Various single-purpose solutions have been created to work around this limitation. We tried to develop a general set of scripts to assist with the task of enriching documents that contain complex XML annotation with linguistic annotation generated by automatic analyzers. We present the challenges we met and the solutions we chose, and discuss their advantages, disadvantages and limits.
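
The core difficulty described above can be illustrated with a minimal sketch. It is not the authors' scripts; it only assumes that an analyzer sees plain text and that its output has to be written back into the XML without disturbing the existing markup. The regular-expression tokenizer stands in for a real analyzer, and the `<w>` tag name and the `tokenize` and `annotate` helpers are hypothetical choices made for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# Stand-in for a real analyzer: runs of word characters or single punctuation marks.
TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split plain text into <w> elements (hypothetical tag name).

    Returns (leading, tokens): `leading` is the text before the first token,
    and each token's tail holds the whitespace that followed it, so the
    original character sequence is preserved.
    """
    text = text or ""
    leading, tokens, pos = "", [], 0
    for match in TOKEN.finditer(text):
        gap = text[pos:match.start()]
        if tokens:
            tokens[-1].tail = gap
        else:
            leading = gap
        w = ET.Element("w")
        w.text = match.group()
        tokens.append(w)
        pos = match.end()
    rest = text[pos:]
    if tokens:
        tokens[-1].tail = rest
    else:
        leading = rest
    return leading, tokens


def annotate(elem):
    """Wrap every token under `elem` in <w>, keeping the existing elements in place."""
    children = list(elem)
    leading, new_children = tokenize(elem.text)
    elem.text = leading
    for child in children:
        annotate(child)
        tail_leading, tail_tokens = tokenize(child.tail)
        child.tail = tail_leading
        new_children.append(child)
        new_children.extend(tail_tokens)
    for child in children:
        elem.remove(child)
    elem.extend(new_children)
    return elem


sample = '<p>Hello <hi rend="bold">brave new</hi> world!</p>'
root = annotate(ET.fromstring(sample))
print(ET.tostring(root, encoding="unicode"))
```

Even this toy version has to track where each text fragment sits relative to the surrounding elements. A real pipeline would additionally attach lemmas and PoS tags to the tokens, and would have to cope with tokens or sentences that span element boundaries, which this sketch does not attempt.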

Files

Xml_annotation_LREC2026.pdf

Files (132.0 kB)

md5:270a28ede0a54d64c444995af8139926

Additional details

Dates

Submitted
2025-10-03
LREC 2026