Published May 28, 2024 | Version v1
Presentation Open

Experience and Challenges with Named Entities - Workshop at DHBenelux 2024

  • 1. Furman University
  • 2. KU Leuven
  • 3. Leipzig University
  • 4. University of Lausanne
  • 5. University of Southern Denmark
  • 6. Università Cattolica del Sacro Cuore

Description

This release contains all the materials for the workshop “Do all roads lead to {Rome: LOC}? Experiences and challenges with named entities in ancient and medieval corpora”, presented at DHBenelux 2024 (KU Leuven, June 4).

Abstract (English)

Motivation

Named Entity Recognition (NER) is a core operation in NLP and a fundamental component of automatic information retrieval. It is tremendously useful in many areas of literary and historical exegesis: it provides essential contextual information on people, places and other relevant entities, and it enables large-scale analyses of social and family networks, the spatial footprint of a source, geographical simulations and mapping, and patterns of movement and of the transmission of ideas.

Premodern sources, including Classical sources like Ancient Greek literary works, but also early modern itineraries, Medieval sources, and even historical commentaries, all lack an adequate infrastructure for training automatic NER models. Recent innovations in transformer-based language models, such as BERT, offer an important opportunity to improve the performance of NER methods in low-resource languages (Ehrmann et al., 2022), but these models are very data-hungry and require training and evaluation datasets in order to perform optimally.

Named Entities are a particularly complex domain because they are often difficult to define: their boundaries are not always clear, and there are additional issues with historical uncertainty, noise from OCR and spelling variation (Burns 2023), nested entities (see for instance Chastang et al. 2021 and Torres Aguilar 2022), and metonymic uses (e.g. group names used as a proxy for locations). Moreover, Digital Humanities substantially lacks best practices for the design of Named Entity datasets (for a recent attempt on secondary literature, cf. Romanello and Najem-Meyer 2022): current DH projects tend to adopt internal strategies that create issues with data exchange and introduce noise when it comes to training models. The lack of shared tagsets and guidelines makes the automation of NER tasks even more complicated.
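
To make the nesting problem concrete, here is a minimal Python sketch of one way to represent entities as character-offset spans and to detect when one annotation sits inside another. The `Span` class, the `nested_pairs` helper, and the labels `PERS` and `LOC` are illustrative assumptions; they are not the workshop tagset or the INCEpTION data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # hypothetical tag, e.g. "PERS" or "LOC"


def nested_pairs(spans):
    """Return (outer, inner) pairs where inner lies strictly inside outer."""
    pairs = []
    for outer in spans:
        for inner in spans:
            same_extent = (outer.start, outer.end) == (inner.start, inner.end)
            contained = outer.start <= inner.start and inner.end <= outer.end
            if contained and not same_extent:
                pairs.append((outer, inner))
    return pairs


text = "Gregory of Tours"
spans = [
    Span(0, 16, "PERS"),  # the whole personal name
    Span(11, 16, "LOC"),  # the place name inside it
]

for outer, inner in nested_pairs(spans):
    print(
        f"{text[inner.start:inner.end]!r} ({inner.label}) is nested in "
        f"{text[outer.start:outer.end]!r} ({outer.label})"
    )
```

In this example the place name "Tours" is nested inside the personal name "Gregory of Tours"; whether the inner span should be annotated at all is the kind of decision that shared guidelines have to settle.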

Scholars who want to start annotating a corpus as part of their research may face a lack of methodological guidance and a multiplicity of available tools and formats. With this workshop, we aim to provide researchers with a platform where they can get started with the annotation process or, if they are already familiar with the task, exchange views with other experts on the best practices for making their data as sharable and “reusable” as possible.

Description

This workshop aims to address these challenges by bringing together scholars with interest and expertise in premodern Named Entities. We will organize the work around a shared annotation task using INCEpTION (https://inception-project.github.io/), providing a predefined tagset designed for premodern sources. Participants will be able to use their own corpus if they wish, or to choose among a series of texts that will be provided.

In the second part of the workshop, we will organize a discussion on the application of the tagset and its generalization to different linguistic and textual domains, on issues in the recognition and annotation of Named Entities, their classification, and other common problems like entity boundary, nested entities, and so on. The task will serve as a starting point to discuss current challenges in premodern documents, and to plan for shared best practices. 

The workshop is addressed to scholars who wish to learn how to annotate Named Entities in premodern texts and requires only minimal familiarity with existing platforms. Participants will gain an essential overview of the topic of Named Entities and will learn a generalizable annotation workflow with a customizable tagset. Moreover, the workshop will foster collaboration among experts in premodern traditions, who will be able to assess common challenges and ways to address them.

Structure of the Workshop

Preparation: Before the workshop, the organizers will prepare an INCEpTION annotation environment and upload a predefined corpus of texts to use during the task (this instance of INCEpTION will be hosted and managed by the EPFL - École Polytechnique Fédérale de Lausanne). Given the expertise of the organizers, the prepared corpus will consist of Ancient Greek and Latin texts and their English translations. However, participants will also be able to work on their own texts. The guidelines and corpus will be shared in advance with the participants, alongside optional preparatory readings. Participants working on their own corpus will be invited to share a .txt version beforehand, so that it can be uploaded in advance with minimal disruption.

Duration and preliminary program: ca. 3 hours in total (half-day).

  • Introduction and break (1 hour): introductions, brief demo of the INCEpTION environment, illustration of the task and of the corpus. 

  • Annotation task: participants work on their texts, individually or in groups, with the support of the organizers. (1 hour and 15 minutes, with breaks). 

  • Discussion (45 minutes). 

Preferred linguistic coverage of the workshop: we encourage participation from scholars in all areas of ancient and Medieval scholarship, including but not limited to Latin, Ancient Greek, Classical Arabic and Persian, Medieval Dutch, Medieval Spanish, Medieval French, and Old English. Participants may also bring translations of original texts if they prefer.

Participants are encouraged to prepare their corpus ahead of time in .txt format and send it to us so that it can be set up on the platform in advance.
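
As a rough illustration of preparing a .txt file for upload, the Python sketch below decodes a file's bytes, drops a leading byte-order mark, unifies line endings, and applies Unicode NFC normalization. These steps are assumptions, not requirements stated by the organizers, and the function name `prepare_text` is hypothetical. The rationale is that polytonic Greek and other accented scripts can be encoded with either precomposed or combining characters, so normalizing consistently before annotation helps keep character offsets stable.

```python
import unicodedata


def prepare_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw bytes and return cleaned, NFC-normalized text.

    Hypothetical helper: the cleaning steps are assumptions,
    not part of any stated workshop requirement.
    """
    text = raw.decode(encoding)
    text = text.lstrip("\ufeff")  # drop a leading byte-order mark, if any
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings
    return unicodedata.normalize("NFC", text)  # compose combining sequences


# "a" followed by a combining acute accent becomes the single character "á".
sample = "\ufeffa\u0301\r\nb".encode("utf-8")
cleaned = prepare_text(sample)
print(len(cleaned))
```

To process a real file, one could pass `pathlib.Path("my_corpus.txt").read_bytes()` to the function and write the result back with `encoding="utf-8"`; the file name here is only a placeholder.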

Audience: Scholars at all levels of expertise in Named Entity annotation are welcome: those new to the field will benefit from an introduction to an annotation tool and to general guidelines, and experts will gain from discussing annotation choices with colleagues dealing with similar challenges.

References

Ehrmann, M., Romanello, M., Najem-Meyer, S., Doucet, A., & Clematide, S. (2022). Overview of HIPE-2022: Named Entity Recognition and Linking in Multilingual Historical Documents. In A. Barrón-Cedeño et al. (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction (Vol. 13390, pp. 423–446). Springer International Publishing. https://doi.org/10.1007/978-3-031-13643-6_26

Burns, P. J. (2023). LatinCy: Synthetic Trained Pipelines for Latin NLP (arXiv:2305.04365). arXiv.

Chastang, P., Torres Aguilar, S., & Tannier, X. (2021). A Named Entity Recognition Model for Medieval Latin Charters. Digital Humanities Quarterly, 15(4).

Torres Aguilar, S. (2022). Multilingual Named Entity Recognition for Medieval Charters Using Stacked Embeddings and BERT-based Models. In Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages (pp. 119–128). European Language Resources Association.

Romanello, M., & Najem-Meyer, S. (2022). Guidelines for the Annotation of Named Entities in the Domain of Classics. https://doi.org/10.5281/zenodo.6368101

Files

DHBenelux 2024 - Named Entity Annotation Guidelines.pdf

Files (1.4 MB)

md5:fe2f8548dec13b7f5849e99ec04c2cac (108.7 kB)
md5:ffe1176d7b10ca0a6ca1a8fd761fac81 (903.8 kB)
md5:b91f94b566128fd27e4b3f6937cf6e05 (394.6 kB)

Additional details

Additional titles

Subtitle (English)
Named Entity Annotation Guidelines and Tutorials

Dates

Accepted
2024-06
Workshop