Complete code and datasets for "ESNLIR: Expanding Spanish NLI Benchmarks with Multi-Genre and Causal Annotation"
Description
This is the complete code, models, and datasets for the preprint ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships. The final paper will appear in an ICAI proceedings volume published by Springer in the CCIS series (2025).
Installation
This repository is a Poetry project, which means it can be installed easily by executing the following command from a shell in the repository folder:
poetry install
As this repository is script based, the README.md file contains all the commands that were executed to generate the dataset and train the models.
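For example, a command from README.md would typically be run inside the Poetry environment as sketched below (the script and parameter placeholders are hypothetical, not actual names; the real commands are listed in README.md):
poetry run python <script_from_README>.py <parameters_from_README>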
----------------------------------------------------------------------------------------------
Core code
The core code used for all the experiments is in the folder auto-nli, and all the calls to the core code, with the parameters used, are found in README.md.
----------------------------------------------------------------------------------------------
Parameters
All the parameters to create datasets and train models with the core code are found in the folder parameters.
----------------------------------------------------------------------------------------------
Models
Model types
For the BERT-based models, all implemented in PyTorch, two model types from Hugging Face were used for training; they are also required to load a dataset because of their tokenizers:
- RoBERTa (BERTIN): https://huggingface.co/bertin-project/bertin-roberta-base-spanish
- XLMRoBERTa: https://huggingface.co/FacebookAI/xlm-roberta-base
Model folder
The model folder contains all the trained models for the paper. There are three types of models:
- baseline: An XGBoost model that can be loaded with pickle (see the loading sketch below).
- roberta: BERTIN-based models in PyTorch. You can load them with the model_path <project_path>/model/roberta/model
- xlmroberta: XLMRoBERTa-based models in PyTorch. You can load them with the model_path <project_path>/model/xlmroberta/model
Models with the suffix _annot were trained with the premise (first sentence) only. Apart from the PyTorch model folder, each model result folder (e.g. <project_path>/model/xlmroberta/) contains the test results for the test set and the stress test sets (e.g. <project_path>/model/xlmroberta/test).
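A minimal sketch for loading the XGBoost baseline with pickle (the pickle file name below is a hypothetical placeholder; check the baseline folder for the actual file):
import pickle

# Hypothetical path: replace with the actual pickle file inside <project_path>/model/baseline/
with open('<project_path>/model/baseline/<model_file>.pkl', 'rb') as f:
    baseline_model = pickle.load(f)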
Load model
Models are found in the folder model, and all of them are PyTorch models that can be loaded with the Hugging Face interface:
from transformers import AutoModel

model = AutoModel.from_pretrained('<model_path>', local_files_only=True)
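If the tokenizer is also needed (the BERT-based models require the tokenizer of their base checkpoint, as noted above), a minimal sketch follows; whether the fine-tuned model folders ship their own tokenizer files is an assumption, so loading the tokenizer from the base checkpoint is shown instead:
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('<model_path>', local_files_only=True)
# Assumption: the tokenizer is taken from the matching base checkpoint on the Hub
tokenizer = AutoTokenizer.from_pretrained('bertin-project/bertin-roberta-base-spanish')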
----------------------------------------------------------------------------------------------
Dataset
labeled_final_dataset.jsonl
This file is included outside the ZIP that contains all other files. It holds the final test dataset of 974 examples, selected because the human majority label matches the original linking-phrase label.
Other datasets:
The datasets can be found in the folder data, which is divided into the following folders:
base_dataset
The splits to train, validate and test the models.
splits_data
Train-val-test splits extracted from each corpus. They are used to generate base_dataset.
sentence_data
Pairs of sentences found in each corpus. They are used to generate splits_data.
Dataset dictionary
This repository contains the splits that resulted from the research project "ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships". All the splits are in JSONL format and have the same fields per example:
- sentence_1: First sentence of the pair.
- sentence_2: Second sentence of the pair.
- connector: Linking phrase used to extract the pair.
- connector_type: NLI label; one of "contrasting", "entailment", "reasoning" or "neutral".
- extraction_strategy: "linking_phrase" for "contrasting", "entailment" and "reasoning"; "none" for "neutral".
- distance: How many sentences before the connector sentence_1 appears.
- sentence_1_position: Sentence index of sentence_1 in the source document.
- sentence_1_paragraph: Paragraph index of sentence_1 in the source document.
- sentence_2_position: Sentence index of sentence_2 in the source document.
- sentence_2_paragraph: Paragraph index of sentence_2 in the source document.
- id: Unique identifier of the example.
- dataset: Source corpus of the pair. Corpus metadata, including the source, can be found in dataset_metadata.xlsx.
- genre: Writing genre of the dataset.
- domain: Domain of the dataset.
Example:
{"sentence_1":"sefior Bcajavides no es moderado, tampoco lo convertirse e\u00f1 declarada divergencia de miras polileido en griego","sentence_2":"era mayor claricomentarios, as\u00ed de los peri\u00f3dicos como de los homes dado \u00e1 la voluntad de los hombres, sin que sobreticas","connector":"por consiguiente,","connector_type":"reasoning","extraction_strategy":"linking_phrase","distance":1.0,"sentence_1_paragraph":4,"sentence_1_position":86,"sentence_2_paragraph":4,"sentence_2_position":87,"id":"esnews__spanish_pd_news__531537","dataset":"esnews__spanish_pd_news","genre":"news","domain":"spanish_public_domain_news"}
Dataset load
To load a dataset split as a PyTorch object for training, validation, and testing, you must use the custom dataset class:
import os
from auto_nli.model.bert_based.dataset import BERTDataset

dataset = BERTDataset(
    os.path.join(dataset_folder, <path to jsonl>),
    max_len=<max length of sentences>,
    model_type=<type of model to use for the tokenizer>,
    only_premise=<True to load the dataset with only the first sentences>,
    max_samples=<maximum number of examples in the dataset>)
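As a follow-up sketch, assuming BERTDataset behaves like a standard PyTorch Dataset (an assumption about the custom class, not something documented here), it can be batched with a regular DataLoader:
from torch.utils.data import DataLoader

# Assumption: BERTDataset implements __len__ and __getitem__ like a torch Dataset
loader = DataLoader(dataset, batch_size=32, shuffle=True)
for batch in loader:
    ...  # forward pass / training step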
----------------------------------------------------------------------------------------------