Published March 4, 2026 | Version v1

CzechTopic: A Benchmark for Zero-Shot Topic Localization in Historical Czech Documents

  • 1. Brno University of Technology

Description

CzechTopic is a benchmark dataset of historical Czech documents designed for topic localization and document classification in a zero-shot setting. Each document contains 768–1024 characters and is written in Czech.

The dataset consists of two parts: a development set and a test set. The development set contains 15,245 documents and 19,107 topics, with each topic annotated in 10 documents; its annotations were generated using the GPT-5-2 model. The test set contains 525 documents and 364 human-created topics, with each topic annotated in 5 documents. In both sets, annotations are provided as character spans that mark the exact locations in the text where a topic appears.
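To illustrate what a character-span annotation might look like in practice, here is a minimal Python sketch. The record layout (`TopicAnnotation`, its field names) and the half-open `(start, end)` offset convention are assumptions made for illustration only; the actual file format in CzechTopic.zip may differ, and the example sentence and topic are invented.

```python
from dataclasses import dataclass


@dataclass
class TopicAnnotation:
    """Hypothetical record: one topic annotated in one document."""
    topic: str
    # Assumed convention: half-open (start, end) character offsets.
    spans: list[tuple[int, int]]


def span_texts(text: str, annotation: TopicAnnotation) -> list[str]:
    """Return the substrings of `text` covered by the annotation's spans."""
    return [text[start:end] for start, end in annotation.spans]


# Invented example sentence, not taken from the dataset.
document = "Dne 5. května se konala schůze obecní rady."
phrase = "schůze obecní rady"
start = document.index(phrase)

annotation = TopicAnnotation(
    topic="zasedání obecní rady",
    spans=[(start, start + len(phrase))],
)
print(span_texts(document, annotation))
```

Storing offsets rather than copied text keeps the annotation tied to an exact location, which matters when the same phrase occurs more than once in a document.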

We evaluate models at two levels: text level and word level. At the text level, the task is to determine whether a given topic is present in a document. At the word level, the task is to identify which words correspond to a given topic.
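The two evaluation levels can be sketched as labels derived from character spans. This is only an illustrative sketch under stated assumptions: whitespace tokenization, half-open `(start, end)` offsets, and the rule that a word counts as on-topic if it overlaps any span are all choices made here, not the benchmark's official evaluation procedure. The function names are hypothetical.

```python
import re

# Assumed convention: half-open (start, end) character offsets.
Span = tuple[int, int]


def text_level_label(spans: list[Span]) -> bool:
    """Text level: the topic is present if at least one span is annotated."""
    return len(spans) > 0


def word_level_labels(text: str, spans: list[Span]) -> list[tuple[str, bool]]:
    """Word level: mark each whitespace-delimited word that overlaps a span."""
    labels = []
    for match in re.finditer(r"\S+", text):
        overlaps = any(
            match.start() < end and start < match.end()
            for start, end in spans
        )
        labels.append((match.group(), overlaps))
    return labels


# Invented example sentence, not taken from the dataset.
text = "Dne 5. května se konala schůze obecní rady."
phrase = "schůze obecní rady"
begin = text.index(phrase)
spans = [(begin, begin + len(phrase))]

print(text_level_label(spans))
for word, on_topic in word_level_labels(text, spans):
    print(word, on_topic)
```

Note that with whitespace tokenization the final word keeps its trailing period and is still marked, because it overlaps the span; a different tokenizer or overlap rule would shift such boundary cases.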

Files

CzechTopic.zip (67.5 MB)
md5:05730873183bff7bdeba812543481d48

Additional details

Related works

Is supplement to
Publication: arXiv:2603.03884 (arXiv)

Funding

Ministry of Culture
NAKI III project semANT - Semantic Document Exploration DH23P03OVV060

Software

Repository URL: https://github.com/dcgm/czechtopic
Programming language: Python
Development status: Active