Published March 10, 2021 | Version 1.0.2
Dataset Open

TECA: Textual Entailment Catalan dataset

Description

If you use this resource in your work, please cite our latest paper:

@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi  and
      Carrino, Casimiro Pio  and
      Rodriguez-Penagos, Carlos  and
      de Gibert Bonet, Ona  and
      Armentano-Oller, Carme  and
      Gonzalez-Agirre, Aitor  and
      Melero, Maite  and
      Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}

TECA consists of two Catalan textual entailment (TE) sub-datasets, catalan_TE1 and vilaweb_TE, containing 14997 and 6166 premise-hypothesis pairs respectively. Each pair is annotated with the inference relation that holds between premise and hypothesis (entailment, contradiction or neutral).

"Textual entailment (TE) in natural language processing is a directional relation between text fragments. The relation holds whenever the truth of one text fragment follows from another text. In the TE framework, the entailing and entailed texts are termed text (t) and hypothesis (h), respectively." From Wikipedia.

In the TECA datasets, each source sentence (the text) has three hypotheses, and each text-hypothesis pair is annotated with one of the following labels:

* "0": positive TE (Inference, text entails hypothesis)

* "1": non-TE (Neutral, text neither entails nor contradicts hypothesis)

* "2": negative TE (Contradiction, text contradicts hypothesis).
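The label scheme above can be sketched as a small lookup table. This is a minimal illustration only: the names `LABEL_NAMES` and `label_name` are hypothetical and not part of the dataset, and the sketch assumes nothing about the file format or field names inside the archive, only that labels appear as "0", "1" or "2" (or the corresponding integers).

```python
# Mapping from TECA label values to readable names, following the list above.
# LABEL_NAMES and label_name are illustrative helpers, not shipped with TECA.
LABEL_NAMES = {
    "0": "entailment",     # positive TE: text entails hypothesis
    "1": "neutral",        # non-TE: text neither entails nor contradicts
    "2": "contradiction",  # negative TE: text contradicts hypothesis
}


def label_name(label):
    """Return the readable name for a label given as "0"/"1"/"2" or 0/1/2."""
    key = str(label).strip()
    if key not in LABEL_NAMES:
        raise ValueError(f"unknown TECA label: {label!r}")
    return LABEL_NAMES[key]
```

For example, `label_name("2")` and `label_name(2)` both return `"contradiction"`, while any value outside the three documented labels raises a `ValueError`.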

Source sentences are extracted from the Catalan Textual Corpus (https://doi.org/10.5281/zenodo.4519349), and from Vilaweb newswire.

Both sub-datasets are released under the CC BY 4.0 licence.

Files

TECA_v.1.0.2.zip (1.0 MB)
md5:b6fa4a1e5443868f4e58918460a76883
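
The MD5 checksum listed for the archive can be used to check a download for corruption. The following is a minimal sketch using Python's standard `hashlib` module. It assumes the archive has already been saved locally; the local path and the helper names `md5_of_file` and `verify_archive` are hypothetical.

```python
import hashlib

# Checksum as published on this record for TECA_v.1.0.2.zip.
EXPECTED_MD5 = "b6fa4a1e5443868f4e58918460a76883"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_archive("TECA_v.1.0.2.zip")` (the path here is an assumed download location) returns `True` only when the local file matches the published checksum. MD5 is adequate for detecting accidental corruption but is not a defence against deliberate tampering.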