Published March 10, 2025 | Version v1
Dataset Open

Datasets for "ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships"

Description

These are the datasets for the paper ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships.

Dataset dictionary

This repository contains the splits that resulted from the research project "ESNLIR: A Spanish Multi-Genre Dataset with Causal Relationships". All the splits are in JSONL format and have the same fields per example:

  • sentence_1: First sentence of the pair.
  • sentence_2: Second sentence of the pair.
  • connector: Linking phrase used to extract the pair.
  • connector_type: NLI label, one of "contrasting", "entailment", "reasoning" or "neutral".
  • extraction_strategy: "linking_phrase" for "contrasting", "entailment" and "reasoning" pairs; "none" for "neutral" pairs.
  • distance: Number of sentences before the connector at which sentence_1 occurs.
  • sentence_1_position: Sentence number of sentence_1 in the source document.
  • sentence_1_paragraph: Paragraph number of sentence_1 in the source document.
  • sentence_2_position: Sentence number of sentence_2 in the source document.
  • sentence_2_paragraph: Paragraph number of sentence_2 in the source document.
  • id: Unique identifier for the example
  • dataset: Source corpus of the pair. Corpus metadata, including the source, can be found in dataset_metadata.xlsx.
  • genre: Writing genre of the dataset.
  • domain: Domain of the dataset.

Example: 

{"sentence_1":"sefior Bcajavides no es moderado, tampoco lo convertirse e\u00f1 declarada divergencia de miras polileido en griego","sentence_2":"era mayor claricomentarios, as\u00ed de los peri\u00f3dicos como de los homes dado \u00e1 la voluntad de los hombres, sin que sobreticas","connector":"por consiguiente,","connector_type":"reasoning","extraction_strategy":"linking_phrase","distance":1.0,"sentence_1_paragraph":4,"sentence_1_position":86,"sentence_2_paragraph":4,"sentence_2_position":87,"id":"esnews__spanish_pd_news__531537","dataset":"esnews__spanish_pd_news","genre":"news","domain":"spanish_public_domain_news"}
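A minimal sketch of reading one split, assuming the archive has been extracted locally. The helper name `iter_examples` and the two constants are illustrative and not part of the dataset release; the field names and label values are copied from the dictionary above.

```python
import json
from typing import Iterator

# Field names as listed in the dataset dictionary.
EXPECTED_FIELDS = {
    "sentence_1", "sentence_2", "connector", "connector_type",
    "extraction_strategy", "distance",
    "sentence_1_position", "sentence_1_paragraph",
    "sentence_2_position", "sentence_2_paragraph",
    "id", "dataset", "genre", "domain",
}
# Label values as listed for connector_type.
LABELS = {"contrasting", "entailment", "reasoning", "neutral"}


def iter_examples(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL split (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = EXPECTED_FIELDS - record.keys()
            if missing:
                raise ValueError(f"line {line_number}: missing fields {sorted(missing)}")
            if record["connector_type"] not in LABELS:
                raise ValueError(
                    f"line {line_number}: unexpected label {record['connector_type']!r}"
                )
            yield record
```

The path passed to `iter_examples` is whatever location a split was extracted to. Because the function is a generator, large splits can be streamed without loading them fully into memory.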

Dataset files

  • ESNLIR_datasets.zip: Contains the splits used for BERT-based model training, validation and testing, including stress test splits.
  • labeled_final_dataset.jsonl: The final test dataset, with 974 examples selected because their human majority label matches the original linking phrase label.
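As a quick sanity check after downloading, the number of examples per label can be counted. This is a sketch only: the function name `label_distribution` is hypothetical, and it assumes the file carries the same `connector_type` field as the splits.

```python
import json
from collections import Counter


def label_distribution(path: str) -> Counter:
    """Count examples per connector_type in a JSONL file (hypothetical helper)."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                counts[json.loads(line)["connector_type"]] += 1
    return counts
```

For labeled_final_dataset.jsonl, `sum(label_distribution(path).values())` would be expected to equal 974 if the file matches the description above.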

Files (651.2 MB)

  • md5:ef9498315a118cbc1ddf3b856b45e198 (23.4 kB)
  • md5:133f30b8abef3974fdeafbff48c0e6e0 (650.4 MB)
  • md5:d179e75d0dda23ebd68b905df5072c8c (781.3 kB)
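
The md5 checksums listed above can be used to confirm that a download is intact. A minimal sketch using Python's standard `hashlib` module follows; the function name `md5_of_file` and the chunk size are illustrative choices.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string is compared with the checksum listed for the corresponding file, without the `md5:` prefix. Reading in chunks matters here because the largest file is about 650 MB.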