Published June 3, 2024 | Version v1
Presentation Open

Multilingual Automated Subject Indexing: a comparative study of LLMs vs alternative approaches in the context of the EHRI project

  • 1. King's College London
  • 2. NIOD Institute for War, Holocaust and Genocide Studies
  • 3. Vienna Wiesenthal Institute for Holocaust Studies
  • 4. Kazerne Dossin Memoriaal Museum en Documentatiecentrum over Holocaust en Mensenrechten

Description

The European Holocaust Research Infrastructure (EHRI) facilitates transnational Holocaust research by making information about dispersed archival material accessible through the EHRI Portal. An important aspect of effectively integrating this information is the indexing of collection metadata from institutions worldwide based on a common domain-specific controlled vocabulary (EHRI Terms). However, challenges persist in harmonising the use of subject headings across institutions. This paper therefore explores approaches to Automated Subject Indexing (ASI), including statistical, lexical, and fusion approaches, as well as the use of Large Language Models (LLMs) for zero-shot classification. We describe how we use the metadata of archival descriptions ingested into the EHRI Portal as training data for Machine Learning (ML) algorithms, including for fine-tuning an LLM. We evaluate LLM-based and non-LLM-based tools quantitatively and qualitatively, focusing on whether and to what extent LLMs can play a role in developing reliable ASI tools. We conclude that, although some of the evaluated tools achieve high scores, they can at present only be considered as part of semi-automated subject indexing workflows with human oversight. Nevertheless, this study charts a promising course towards tools that are reliable enough to facilitate subject indexing processes and enhance metadata in line with the FAIR Data Principles.
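As a rough illustration of what a lexical approach to subject indexing and its quantitative evaluation can look like, the following minimal Python sketch matches multilingual labels of a controlled vocabulary against a description and scores the suggestions against human-assigned subjects. It is not the method used in the study: the term identifiers, labels, and the function names `suggest_terms` and `precision_recall_f1` are hypothetical, and the vocabulary shown is invented for the example rather than taken from EHRI Terms.

```python
import re

# Hypothetical mini-vocabulary: term ID -> labels in several languages.
# IDs and labels are invented for illustration; they are NOT real EHRI Terms.
VOCABULARY = {
    "term-1": ["deportation", "deportatie", "deportationen"],
    "term-2": ["ghetto", "getto"],
    "term-3": ["forced labour", "dwangarbeid", "zwangsarbeit"],
}


def normalise(text):
    """Lower-case the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def suggest_terms(description, vocabulary=VOCABULARY):
    """Return the IDs of all terms with a label occurring as a whole word or phrase."""
    text = normalise(description)
    suggested = set()
    for term_id, labels in vocabulary.items():
        for label in labels:
            pattern = r"\b" + re.escape(normalise(label)) + r"\b"
            if re.search(pattern, text):
                suggested.add(term_id)
                break  # one matching label is enough for this term
    return suggested


def precision_recall_f1(suggested, gold):
    """Compare suggested term IDs with human-assigned (gold) term IDs."""
    true_positives = len(suggested & gold)
    precision = true_positives / len(suggested) if suggested else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    description = "Brieven over dwangarbeid en deportatie uit het getto"
    suggestions = suggest_terms(description)
    print(sorted(suggestions))
    print(precision_recall_f1(suggestions, {"term-1", "term-3"}))
```

A matcher of this kind only finds terms whose labels appear literally in the text, which is one reason to compare it with statistical and LLM-based approaches and to keep a human reviewer in the loop.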

Files

DHB24_abstract_Dermentzi-et-al_Multilingual-Automated-Subject-Indexing.pdf

Additional details

Funding

European Commission
EHRI-3 – European Holocaust Research Infrastructure 871111