Multilingual Automated Subject Indexing: a comparative study of LLMs vs alternative approaches in the context of the EHRI project
Description
The European Holocaust Research Infrastructure (EHRI) facilitates transnational Holocaust research by making information about dispersed archival material accessible through the EHRI Portal. An important aspect of effectively integrating this information is the indexing of collection metadata from institutions worldwide based on a common domain-specific controlled vocabulary (EHRI Terms). However, challenges persist in harmonising the use of subject headings across institutions. Hence, this paper explores approaches for Automated Subject Indexing (ASI), including statistical, lexical, and fusion approaches, as well as the use of Large Language Models (LLMs) for zero-shot classification. We describe how we use the metadata of archival descriptions ingested into the EHRI Portal as training data for Machine Learning (ML) algorithms, including for fine-tuning an LLM. We evaluate LLM-based and non-LLM-based tools quantitatively and qualitatively, focusing on whether, and to what extent, LLMs can play a role in developing reliable ASI tools. We conclude that, although some of the evaluated tools achieve high scores, they can at present only be considered as part of semi-automated subject indexing workflows with human oversight. Nevertheless, this study charts a promising course towards developing tools that are reliable enough to facilitate subject indexing processes and enhance metadata according to the FAIR Data Principles.
Files
DHB24_abstract_Dermentzi-et-al_Multilingual-Automated-Subject-Indexing.pdf
Total size: 3.6 MB
Checksum | Size |
---|---|
md5:f9334d33f39d9bf591d3ea31e0c3f797 | 245.9 kB |
md5:8ca762e0a22403e639d7c871e84c0fa1 | 3.3 MB |