There is a newer version of the record available.

Published October 8, 2025 | Version v1

Enhancing complex XML documents with linguistic annotation

Authors/Creators

  • 1. Charles University, Faculty of Arts

Description

When building corpora from annotated XML documents, corpus compilers usually face a basic limitation: most tools for linguistic analysis and parsing (tokenizers, lemmatizers, PoS taggers, etc.) can process only plain-text input. Various single-purpose solutions have been created to work around this. We set out to develop a general set of scripts that help enrich documents containing complex XML annotation with linguistic annotation generated by automatic analyzers. We present the challenges we encountered and the solutions we chose, and discuss their advantages, disadvantages and limits.
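To make the task concrete, the following is a minimal sketch of one way token-level annotation can be merged back into existing markup. It is not the set of scripts described in this record. The `<w>` wrapper element, the regex tokenizer standing in for an external plain-text analyzer, and the function names `wrap_run` and `annotate` are all assumptions chosen for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for an external plain-text tokenizer:
# a token is a run of word characters or a single punctuation mark,
# captured together with the whitespace that follows it.
TOKEN_RE = re.compile(r"(\w+|[^\w\s])(\s*)")


def wrap_run(text):
    """Split one text run into leading whitespace and a list of <w> elements."""
    text = text or ""
    lead = re.match(r"\s*", text).group()
    words = []
    for token, space in TOKEN_RE.findall(text):
        w = ET.Element("w")  # assumed wrapper element name
        w.text = token
        w.tail = space  # keep the original whitespace after the token
        words.append(w)
    return lead, words


def annotate(elem):
    """Wrap every token under elem in <w>, leaving existing elements in place."""
    children = list(elem)
    for child in children:
        elem.remove(child)
    # Tokens from the text that precedes the first child element.
    lead, words = wrap_run(elem.text)
    elem.text = lead
    elem.extend(words)
    for child in children:
        annotate(child)  # recurse into the original markup only
        # Tokens from the text that follows this child element.
        lead, words = wrap_run(child.tail)
        child.tail = lead
        elem.append(child)
        elem.extend(words)


# Usage on a small assumed input document.
root = ET.fromstring('<p>Hello <hi rend="bold">brave</hi> world!</p>')
annotate(root)
out = ET.tostring(root, encoding="unicode")
print(out)
```

The sketch tokenizes each text run separately, so a token split by markup (for example `wor<hi>ld</hi>`) ends up as two tokens, and the analyzer never sees sentence-level context. Those are the kinds of cases a general solution has to handle.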

Files

Xml_annotation_LREC2026.pdf (130.6 kB)
md5:04f49976ace9df961941eb9f1eb61a2b

Additional details

Dates

Submitted
2025-10-03
LREC 2026