Published June 2, 2025 | Version v1
Presentation Open

Development and Evaluation of Named Entity Linking Models for Serbian Language with Wikidata Integration

  • 1. University of Belgrade
  • 2. University of Belgrade, Faculty of Mining and Geology

Description

TESLA-NER-NEL-gold+ is a dataset designed for training models for Named Entity Recognition (NER) and Named Entity Linking (NEL). It includes:
  • The srpELTeC-gold+ corpus (an extension of the srpELTeC-gold corpus), novels from the SrpKor corpus (Jules Verne, Around the World in 80 Days; George Orwell, 1984)
  • Sentences from the It-Sr-NER corpus (The Name of the Rose by Umberto Eco; Those Who Leave and Those Who Stay by Elena Ferrante; One, No One, and One Hundred Thousand by Luigi Pirandello; and works by Ivo Andrić such as Anika’s Times and The Bridge on the Drina, as well as Impure Blood by Borisav Stanković)
  • Newspaper articles (Politika, Blic, etc.), srELEXSIS (Krstev et al., 2024) (2024 sentences from a parallel Serbian-English corpus), the Intera corpus (Stanković et al., 2017) of legal documents, ...
Named entities that were not recognized did not have entries on Wikidata. Entries were created and linked through an automated process using the Leximirka lexicographic knowledge base and Wikidata. The SrpCNNeL model was trained primarily to link locations and organizations to Wikidata entries. Future research will focus on developing new transformer-based models: a BERT model for locations has been developed, and a model for recognizing and linking all classes of named entities is already under development. Annotated parallel corpora will be made available on noske.jerteh.rs; the Serbian-language corpora will be prepared using the procedure described above, while the English corpora will use models available at https://spacy.io/models.
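
As a rough illustration of what linking a recognized entity to a Wikidata entry involves, the sketch below looks up each mention's lemma in a small in-memory table of Wikidata identifiers. It is not the SrpCNNeL model or the Leximirka-based procedure: the `KB` table, the `Mention` class, the `link` function, and the `LOC`/`PERS` labels are hypothetical names chosen for this example, and the lemmas are assumed to come from an earlier NER and lemmatization step.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical mini knowledge base: (lowercased lemma, entity class) -> Wikidata QID.
# Q3711 is Belgrade and Q403 is Serbia on Wikidata; a real system would query
# a lexical resource and Wikidata instead of a hard-coded table.
KB = {
    ("beograd", "LOC"): "Q3711",
    ("srbija", "LOC"): "Q403",
}


@dataclass
class Mention:
    text: str                   # surface form as it appears in the sentence
    lemma: str                  # base form, assumed to come from a lemmatizer
    label: str                  # entity class assigned by the NER step
    qid: Optional[str] = None   # Wikidata identifier, filled in by link()


def link(mentions: List[Mention]) -> Tuple[List[Mention], List[Mention]]:
    """Attach a QID to every mention found in KB; return (linked, unlinked)."""
    linked, unlinked = [], []
    for m in mentions:
        m.qid = KB.get((m.lemma.lower(), m.label))
        (linked if m.qid else unlinked).append(m)
    return linked, unlinked


# Inflected Serbian forms are matched through their lemmas.
mentions = [
    Mention("Beogradu", "Beograd", "LOC"),
    Mention("Srbije", "Srbija", "LOC"),
    Mention("Anika", "Anika", "PERS"),
]
linked, unlinked = link(mentions)
```

Mentions that end up in `unlinked` correspond to the case described above, where an entity has no matching Wikidata entry and would need one to be created before it can be linked.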

Files

AI_conference_Development and Evaluation of NEL Models for Serbian with Wikidata.pdf