CzechTopic: A Benchmark for Zero-Shot Topic Localization in Historical Czech Documents
Description
CzechTopic is a benchmark dataset of historical Czech documents designed for topic localization and document classification in a zero-shot setting. Each document contains 768–1024 characters and is written in Czech.
The dataset consists of two parts: a development set and a test set. The development set contains 15,245 documents and 19,107 topics. Each topic is annotated in 10 documents. The annotations for the development set were generated using the GPT-5-2 model. The test set contains 525 documents and 364 human-created topics, with each topic annotated in five documents. All annotations are provided as character spans, indicating the exact locations in the text where a topic appears.
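To make the character-span format concrete, the sketch below pulls the annotated substrings out of a document. The layout used here (a plain `text` string plus a list of `(start, end)` offsets with an exclusive end) is an assumption made for illustration, not the documented schema of the files in CzechTopic.zip, and the example sentence is invented.

```python
from typing import List, Tuple

# Assumed span convention: (start, end) character offsets, end exclusive.
Span = Tuple[int, int]


def extract_spans(text: str, spans: List[Span]) -> List[str]:
    """Return the substring covered by each character span."""
    fragments = []
    for start, end in spans:
        # Reject spans that are empty, reversed, or fall outside the text.
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"invalid span {(start, end)} for text of length {len(text)}")
        fragments.append(text[start:end])
    return fragments


# Hypothetical document and annotation, not taken from the dataset.
doc = "Včera zasedala městská rada a jednala o stavbě mostu."
phrase = "stavbě mostu"
start = doc.index(phrase)
topic_spans = [(start, start + len(phrase))]

print(extract_spans(doc, topic_spans))
```

Offsets here count Python string characters (Unicode code points), so any offsets read from the real files should be checked against the same convention before slicing.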
We evaluate models at two levels: text level and word level. At the text level, the task is to determine whether a given topic is present in a document. At the word level, the task is to identify which words correspond to a given topic.
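The two levels can be sketched as follows, deriving both labels from character spans. Everything specific here is an assumption for illustration: words are split on whitespace, a word counts as belonging to the topic if it overlaps any span, the topic counts as present when at least one span exists, and F1 stands in as an example word-level score. The benchmark's own tokenization and metrics may differ.

```python
import re
from typing import List, Tuple

# Assumed span convention: (start, end) character offsets, end exclusive.
Span = Tuple[int, int]


def text_label(spans: List[Span]) -> bool:
    """Text level: the topic is present if at least one span is annotated."""
    return len(spans) > 0


def word_labels(text: str, spans: List[Span]) -> List[Tuple[str, bool]]:
    """Word level: mark each whitespace-delimited word that overlaps a span."""
    labels = []
    for match in re.finditer(r"\S+", text):
        overlaps = any(match.start() < end and start < match.end() for start, end in spans)
        labels.append((match.group(), overlaps))
    return labels


def word_f1(gold: List[bool], pred: List[bool]) -> float:
    """Illustrative word-level F1 over aligned boolean labels."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example, not taken from the dataset.
sample = "rada jednala o stavbě mostu"
sample_spans = [(sample.index("stavbě"), len(sample))]
print(text_label(sample_spans))
print(word_labels(sample, sample_spans))
```

Deriving word labels from spans keeps a single source of truth for the annotation, so the text-level and word-level views cannot disagree.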
Files
| Name | Size | MD5 |
|---|---|---|
| CzechTopic.zip | 67.5 MB | 05730873183bff7bdeba812543481d48 |
Additional details
Related works
- Is supplement to: Publication arXiv:2603.03884 (arXiv)
Funding
- Ministry of Culture: NAKI III project semANT - Semantic Document Exploration DH23P03OVV060
Software
- Repository URL: https://github.com/dcgm/czechtopic
- Programming language: Python
- Development Status: Active