Published October 30, 2024 | Version v1
Dataset Open

BLM-AgrI (Blackbird Language Matrices Subject-Verb agreement in Italian)

  • 1. Idiap Research Institute
  • 2. University of Geneva

Description

BLM-AgrI is a dataset in Italian for learning the underlying rules of subject-verb agreement in sentences, developed in the Blackbird Language Matrices (BLM) framework. In this task, an instance consists of sequences of sentences with specific attributes. To predict the correct answer as the next element of the sequence, a model must correctly detect the generative rules used to produce the dataset. BLM-AgrI is the Italian version of BLM-AgrF (but not an exact translation).

Blackbird Language Matrices (BLMs) are multiple-choice problems. The input is a sequence of sentences built using specific generating rules. The answer set consists of one correct answer that continues the input sequence and several incorrect contrastive options, built by violating the underlying generating rules of the sentences. In a BLM matrix, all sentences share the targeted linguistic phenomenon (in this case subject-verb agreement), but differ in other aspects relevant to that phenomenon.
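The multiple-choice structure described above can be pictured with a minimal sketch. The class name, field names, and placeholder strings below are illustrative assumptions and do not reflect the actual file format or column names of the released data.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BLMInstance:
    """Hypothetical container for one BLM problem (field names are assumptions)."""
    context: List[str]     # input sequence of sentences
    candidates: List[str]  # answer set: one correct answer plus contrastive options
    correct_index: int     # position of the correct answer in `candidates`


def accuracy(instances: List[BLMInstance],
             choose: Callable[[BLMInstance], int]) -> float:
    """Fraction of instances for which `choose` returns the correct candidate index."""
    if not instances:
        return 0.0
    hits = sum(1 for inst in instances if choose(inst) == inst.correct_index)
    return hits / len(instances)


# Placeholder strings only; these are not sentences from the dataset.
toy = BLMInstance(
    context=["sentence 1", "sentence 2", "sentence 3"],
    candidates=["correct continuation", "contrastive option A", "contrastive option B"],
    correct_index=0,
)


def always_first(inst: BLMInstance) -> int:
    """Trivial baseline that always picks the first candidate."""
    return 0


print(accuracy([toy], always_first))
```

Any model under evaluation would take the place of `always_first`, returning the index of the candidate it judges to continue the sequence.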

BLM datasets also have a lexical variation dimension, to explore the impact of lexical variation on detecting relevant structures: type I – minimal lexical variation for sentences within an instance, type II – one word difference across the sentences within an instance, type III – maximal lexical variation within an instance.

The data is grouped by lexical variation type (I, II, III), and each subset is split into train and test sets. The sizes for the current iteration of the dataset (v2.0), given as train:test, are:

type I     2052:230
type II    5000:4121
type III   5000:4121

 

Disclaimer

2026-01-08: This data has been regenerated to fix an error in the type III data, where for some examples the templates did not match their corresponding sentences.

 

Reference

If you use this dataset, please cite the following publication:

Nastase, Vivi & Jiang, Chunyang & Samo, Giuseppe & Merlo, Paola. (2024). Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement. DOI: 10.48550/arXiv.2409.06567.

Files

Files (18.3 MB)

md5:b4cf374fe6b260eef75cb7e27cbbd5af

Additional details

Related works

Is described by
Conference paper: 10.48550/arXiv.2409.06567 (DOI)

Funding

Swiss National Science Foundation
Disentangling linguistic intelligence: automatic generalisation of structure and meaning across languages TMAG-1_209426