Published May 21, 2026 | Version v1
Dataset Open

BLM-CausT (Blackbird Language Matrices Causative and Passive Alternation in Turkish)

  • 1. Idiap Research Institute
  • 2. University of Geneva

Description

BLM-CausT is a Turkish dataset for learning the causative alternation, developed within the Blackbird Language Matrices (BLM) framework. In this task, an instance consists of a sequence of sentences with specific attributes. To predict the correct answer as the next element of the sequence, a model must detect the underlying generative rules used to produce the dataset.

The Turkish data are collected from news and non-fiction sources (Penn v. 2.16; 183,555 tokens, 16,396 trees) and from grammar and dictionary examples (Kenet v. 2.16; 178,658 tokens, 18,687 trees). The query collects sentences where the main verb is annotated with the VOICE parameter.

The data are grouped by target voice and come in two groups, SENT (full sentences) and VERB (verb only); each subset is split into train and test sets. The statistics of the current iteration of the dataset are as follows (train:test split):

Akt-SENT 1800:200
Akt-VERB 1800:200
Pass-SENT 1800:200
Pass-VERB 1800:200
CausAkt-SENT 1800:200
CausAkt-VERB 1800:200
CausPass-SENT 1800:200
CausPass-VERB 1800:200
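
The totals implied by the listing above can be checked with a short sketch. The subset names and counts are copied from the listing; the helper name `parse_splits` and the "name train:test" line layout are assumptions made for illustration only and do not describe the format of the distributed files.

```python
# Split statistics copied from the listing above, one "name train:test" per line.
SPLITS = """\
Akt-SENT 1800:200
Akt-VERB 1800:200
Pass-SENT 1800:200
Pass-VERB 1800:200
CausAkt-SENT 1800:200
CausAkt-VERB 1800:200
CausPass-SENT 1800:200
CausPass-VERB 1800:200
"""


def parse_splits(text):
    """Return {subset_name: (train, test)} parsed from 'name train:test' lines.

    Hypothetical helper: it only parses the listing shown on this page.
    """
    result = {}
    for line in text.strip().splitlines():
        name, counts = line.split()
        train, test = (int(n) for n in counts.split(":"))
        result[name] = (train, test)
    return result


splits = parse_splits(SPLITS)
total_train = sum(train for train, _ in splits.values())
total_test = sum(test for _, test in splits.values())
test_share = total_test / (total_train + total_test)

print(f"subsets: {len(splits)}")
print(f"train: {total_train}, test: {total_test}, test share: {test_share:.0%}")
```

Under these figures, the eight subsets add up to 14,400 train and 1,600 test items, which is a 90:10 split in every subset.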



Reference

If you use this dataset, please cite the following publication:

Giuseppe Samo, Paola Merlo. Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew. Paper accepted at the SIGTURK 2026 Workshop.

Files (2.4 MB)

md5:a7e7319ae850dbe0bf2ad7decac4307e (2.4 MB)

Additional details

Funding

Swiss National Science Foundation
Disentangling linguistic intelligence: automatic generalisation of structure and meaning across languages (grant no. 209426)