Published February 25, 2021 | Version v2
Dataset Open

VilaQuAD: an extractive QA dataset from Catalan newswire

Description

If you use this resource in your work, please cite our latest paper:

@inproceedings{armengol-estape-etal-2021-multilingual,
    title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
    author = "Armengol-Estap{\'e}, Jordi  and
      Carrino, Casimiro Pio  and
      Rodriguez-Penagos, Carlos  and
      de Gibert Bonet, Ona  and
      Armentano-Oller, Carme  and
      Gonzalez-Agirre, Aitor  and
      Melero, Maite  and
      Villegas, Marta",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.437",
    doi = "10.18653/v1/2021.findings-acl.437",
    pages = "4933--4946",
}

Extractive QA dataset with 6282 question-answer pairs built from paragraphs of the online newspaper Vilaweb (https://www.vilaweb.cat), used under the CC-BY-NC-ND 4.0 licence.

This dataset contains 2095 Catalan-language news articles, each accompanied by 1 to 5 questions that refer to the article fragment (or context).
VilaQuAD articles are extracted from the daily Vilaweb (www.vilaweb.cat) and used under a CC BY-NC-ND licence (https://creativecommons.org/licenses/by-nc-nd/3.0/deed.ca).
This dataset can be used to build extractive QA models and language models.
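As a rough illustration of what "extractive" means here, the sketch below models one question-answer pair as a context, a question, and an answer that is a literal span of the context. The field names (`context`, `question`, `answer_text`, `answer_start`) are assumptions borrowed from common extractive QA conventions, not a description of the actual file layout inside VilaQuAD.zip, and the sample text is invented for illustration rather than taken from the dataset.

```python
from dataclasses import dataclass


@dataclass
class QAExample:
    """One question-answer pair (hypothetical field names, see lead-in)."""
    context: str       # news fragment the question refers to
    question: str      # question written about that fragment
    answer_text: str   # answer copied verbatim from the context
    answer_start: int  # character offset of the answer inside the context


def answer_is_extractive(ex: QAExample) -> bool:
    """Return True if the answer is exactly the context span at answer_start."""
    end = ex.answer_start + len(ex.answer_text)
    return ex.context[ex.answer_start:end] == ex.answer_text


# Invented sample, not a record from the dataset.
sample_context = "VilaWeb is an online newspaper published in Catalan."
example = QAExample(
    context=sample_context,
    question="In which language is VilaWeb published?",
    answer_text="Catalan",
    answer_start=sample_context.index("Catalan"),
)
```

A check like `answer_is_extractive(example)` can serve as a quick sanity test when converting the data to another format, since an offset that no longer matches the answer text usually signals a whitespace or encoding change.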

Funded by the Generalitat de Catalunya, Departament de Polítiques Digitals i Administració Pública (AINA), MT4ALL and Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Files

VilaQuAD.zip (1.2 MB)
md5:8b873ffe07211bca5a4bf75f32f731a8
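The MD5 checksum listed for VilaQuAD.zip can be used to confirm that a download is complete. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `archive_is_intact` are hypothetical, and the local file path is whatever location the archive was saved to.

```python
import hashlib

# Checksum published on this record for VilaQuAD.zip.
EXPECTED_MD5 = "8b873ffe07211bca5a4bf75f32f731a8"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, `archive_is_intact("VilaQuAD.zip")` would be called after downloading, assuming the archive sits in the current directory. MD5 is adequate for detecting a corrupted transfer, which is its purpose here; it is not a guarantee against deliberate tampering.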