BLM-AgrF (Blackbird Language Matrices Subject-Verb agreement in French)

Nastase, Vivi; Merlo, Paola

doi:10.34777/8r5s-m125

Published October 30, 2024 | Version v1

Dataset Open

1. Idiap Research Institute
2. University of Geneva

Description

BLM-AgrF is a dataset in French for learning the underlying rules of subject-verb agreement in sentences, developed in the Blackbird Language Matrices (BLM) framework. In this task, an instance consists of sequences of sentences with specific attributes. To predict the correct answer as the next element of the sequence, a model must correctly detect the generative rules used to produce the dataset.

Blackbird Language Matrices (BLMs) are multiple-choice problems, where the input is a sequence of sentences built using specific generating rules, and the answer set consists of a correct answer that continues the input sequence, and several incorrect contrastive options, built by violating the underlying generating rules of the sentences. In a BLM matrix, all sentences share the targeted linguistic phenomenon (in this case subject-verb agreement), but differ in other aspects relevant for the phenomenon in question.
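The multiple-choice structure described above can be sketched as a small data type with an accuracy function. This is a minimal illustration only: the class name `BLMInstance`, its field names, and the `accuracy` helper are hypothetical and do not reflect the file format of the released archive, and the sentences are placeholders rather than dataset content.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BLMInstance:
    """Hypothetical in-memory view of one BLM problem (not the released format)."""
    context: List[str]    # input sequence of sentences built by the generating rules
    answers: List[str]    # answer set: one correct continuation plus contrastive options
    correct_index: int    # position of the correct continuation within `answers`


def accuracy(instances: List[BLMInstance],
             choose: Callable[[List[str], List[str]], int]) -> float:
    """Fraction of instances for which `choose` returns the correct answer index."""
    if not instances:
        return 0.0
    hits = sum(1 for inst in instances
               if choose(inst.context, inst.answers) == inst.correct_index)
    return hits / len(instances)


if __name__ == "__main__":
    # Placeholder strings stand in for real sentences.
    toy = [
        BLMInstance(["sentence 1", "sentence 2"], ["option A", "option B"], 1),
        BLMInstance(["sentence 1", "sentence 2"], ["option A", "option B"], 0),
    ]
    # Trivial baseline that always picks the first option.
    print(accuracy(toy, lambda context, answers: 0))
```

A real model would replace the baseline `choose` function with one that scores each candidate against the context sequence.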

BLM datasets also have a lexical variation dimension, to explore the impact of lexical variation on detecting relevant structures: type I – minimal lexical variation for sentences within an instance, type II – one word difference across the sentences within an instance, type III – maximal lexical variation within an instance.

The data is grouped by lexical variation (type I, II, or III), and each subset is split into train and test sets. The statistics of the current iteration of the dataset (v2.0), given as train:test sizes, are:

Subset	Train:Test
type I	2052:252
type II	5000:4927
type III	5000:4810

This dataset builds on a previous version (with a different answer set and different type II and type III subsets), described in:

Aixiu An, Chunyang Jiang, Maria A Rodriguez, Vivi Nastase, Paola Merlo
BLM-AgrF: A new French benchmark to investigate generalization of agreement in neural networks, Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, (EACL 2023), pages 1363-1374, 2023
https://aclanthology.org/2023.eacl-main.99.pdf

Reference

If you use this dataset, please cite the following publication:

Nastase, Vivi & Merlo, Paola. (2024). Are there identifiable structural parts in the sentence embedding whole? DOI: 10.48550/arXiv.2409.16563.

Files

Files (27.0 MB)

Name	Size	MD5
BLM-AgrF.tar.gz	27.0 MB	178e1208498dfe73a880332a7921d8ff
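Since the record lists an MD5 checksum for the archive, a downloaded copy can be checked before unpacking. The sketch below uses only the Python standard library. The local path `BLM-AgrF.tar.gz` is an assumption about where the file was saved, and the function names are illustrative.

```python
import hashlib
from pathlib import Path

# Checksum as listed on this record for BLM-AgrF.tar.gz.
EXPECTED_MD5 = "178e1208498dfe73a880332a7921d8ff"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """True if the file's MD5 matches the expected digest (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("BLM-AgrF.tar.gz")  # assumed local download location
    if archive.exists():
        print("checksum OK" if verify_archive(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the archive again is the simplest remedy.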

Additional details

Is described by: Conference paper: 10.48550/arXiv.2409.16563 (DOI)

Funding: Swiss National Science Foundation, grant TMAG-1_209426, "Disentangling linguistic intelligence: automatic generalisation of structure and meaning across languages"
