CzechTopic: A Benchmark for Zero-Shot Topic Localization in Historical Czech Documents

Kostelník, Martin; Hradiš, Michal; Dočekal, Martin

doi:10.5281/zenodo.18877204

Published March 4, 2026 | Version v1

Dataset Open

1. Brno University of Technology

CzechTopic is a benchmark dataset of historical Czech documents designed for topic localization and document classification in a zero-shot setting. Each document contains 768–1024 characters and is written in Czech.

The dataset consists of two parts: a development set and a test set. The development set contains 15,245 documents and 19,107 topics, with each topic annotated in 10 documents; its annotations were generated using the GPT-5-2 model. The test set contains 525 documents and 364 human-created topics, with each topic annotated in 5 documents. All annotations are provided as character spans, indicating the exact locations in the text where a topic appears.
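As an illustration of how character-span annotations relate to words, the following is a minimal sketch, not the dataset's own tooling. It assumes spans are half-open `(start, end)` character offsets and that words are whitespace-delimited; the sample sentence and the function names `words_with_offsets` and `spans_to_word_mask` are hypothetical, and the actual file layout inside the archive is not described here.

```python
import re


def words_with_offsets(text):
    """Return (word, start, end) for every whitespace-delimited word."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def spans_to_word_mask(text, spans):
    """Mark a word True if it overlaps any half-open (start, end) span."""
    mask = []
    for _, w_start, w_end in words_with_offsets(text):
        hit = any(w_start < s_end and s_start < w_end for s_start, s_end in spans)
        mask.append(hit)
    return mask


# Hypothetical example sentence; the spans are derived from substrings so the
# offsets are guaranteed to match this text.
text = "Dne 5. května vypukl v Praze požár."
spans = []
for fragment in ("vypukl", "požár"):
    start = text.index(fragment)
    spans.append((start, start + len(fragment)))

mask = spans_to_word_mask(text, spans)
marked = [w for (w, _, _), m in zip(words_with_offsets(text), mask) if m]
print(marked)
```

A word counts as annotated as soon as any of its characters fall inside a span, so trailing punctuation attached to a word does not prevent a match. Whether the benchmark uses the same overlap rule is an assumption.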

We evaluate models at two levels: text level and word level. At the text level, the task is to determine whether a given topic is present in a document. At the word level, the task is to identify which words correspond to a given topic.
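The two evaluation levels can be sketched with a single scoring helper. This is a hedged illustration only: the record does not state which metrics are reported, so precision, recall, and F1 are an assumption here, as are the function names `topic_present` and `precision_recall_f1`. At the text level there is one boolean per (document, topic) pair; at the word level there is one boolean per word.

```python
def topic_present(spans):
    """Text-level label: assume a topic is present if it has at least one span."""
    return len(spans) > 0


def precision_recall_f1(gold, pred):
    """Score two equally long sequences of booleans."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Word level: one boolean per word of a single (document, topic) pair.
gold_words = [False, False, False, True, False, False, True]
pred_words = [False, False, True, True, False, False, False]
word_scores = precision_recall_f1(gold_words, pred_words)
print(word_scores)  # (0.5, 0.5, 0.5)

# Text level: one boolean per (document, topic) pair.
gold_texts = [topic_present(s) for s in ([(14, 20)], [], [(0, 3), (7, 13)])]
pred_texts = [True, True, True]
text_scores = precision_recall_f1(gold_texts, pred_texts)
print(text_scores)
```

Reusing one function for both levels keeps the two scores directly comparable; only the unit being labeled changes.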

Files

Name	Size	Checksum
CzechTopic.zip	67.5 MB	md5:05730873183bff7bdeba812543481d48
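
After downloading, the archive can be checked against the MD5 checksum listed above. The following is a minimal sketch using Python's standard `hashlib`; the local path `CzechTopic.zip` and the helper names `md5_of_file` and `verify_archive` are assumptions for illustration.

```python
import hashlib

# Checksum as listed for CzechTopic.zip in this record.
EXPECTED_MD5 = "05730873183bff7bdeba812543481d48"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected


# Usage (assumes the archive was saved in the current directory):
# print(verify_archive("CzechTopic.zip"))
```

Reading in chunks avoids loading the whole 67.5 MB archive into memory at once.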

Additional details

Is supplement to: Publication: arXiv:2603.03884 (arXiv)

Ministry of Culture
NAKI III project semANT - Semantic Document Exploration DH23P03OVV060

Repository URL: https://github.com/dcgm/czechtopic
Programming language: Python
Development Status: Active
