There is a newer version of the record available.

Published February 25, 2021 | Version v1
Dataset Open

ViquiQuAD: an extractive QA dataset from Catalan Wikipedia

Description

Extractive QA dataset with 6282 question-answer pairs developed from articles of the Viquipèdia (https://ca.wikipedia.org), used under the Creative Commons Attribution-ShareAlike license.

“ViquiQuAD: an extractive QA dataset from Catalan Wikipedia” is distributed under the CC-BY-SA license by Carlos Rodríguez and Carme Armentano of the Text Mining Unit at BSC - CNS.

From a set of high-quality original articles in the Catalan Wikipedia (the Viquipèdia), 597 were randomly chosen, and from them 3111 contexts of 5 to 8 sentences each were extracted. The creation of between 1 and 5 questions for each context was commissioned, following an adaptation of the guidelines from SQuAD 1.0 (Rajpurkar, Pranav et al. “SQuAD: 100,000+ Questions for Machine Comprehension of Text.” EMNLP (2016), http://arxiv.org/abs/1606.05250). In total, 15153 pairs of a question and a context fragment that contains the answer were annotated.
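Since the annotation follows SQuAD 1.0 guidelines, the pairs can plausibly be read the way SQuAD-style files usually are. The sketch below assumes the distributed files use the common SQuAD v1 JSON layout (data, paragraphs, qas, answers with text and answer_start); that layout is an assumption, not something this record states. The function name iter_qa_pairs and the toy record are hypothetical and are not taken from the dataset.

```python
from typing import Iterator, Tuple


def iter_qa_pairs(squad: dict) -> Iterator[Tuple[str, str, str]]:
    """Yield (context, question, answer_text) from a SQuAD-v1-style dict.

    Assumes the layout data -> paragraphs -> qas -> answers, which is the
    usual SQuAD v1 structure and may differ from the actual files.
    """
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    text = answer["text"]
                    # Extractive QA: the answer must be a literal span of
                    # the context, located at the recorded character offset.
                    if context[start:start + len(text)] != text:
                        raise ValueError(
                            f"answer {text!r} not found at offset {start}"
                        )
                    yield context, qa["question"], text


# Toy record for illustration only (not from ViquiQuAD).
sample = {
    "data": [
        {
            "title": "Exemple",
            "paragraphs": [
                {
                    "context": "Barcelona és la capital de Catalunya.",
                    "qas": [
                        {
                            "id": "toy-0",
                            "question": "Quina és la capital de Catalunya?",
                            "answers": [
                                {"text": "Barcelona", "answer_start": 0}
                            ],
                        }
                    ],
                }
            ],
        }
    ]
}

pairs = list(iter_qa_pairs(sample))
print(len(pairs))
```

On a real file, the dict would come from json.load on the extracted JSON, and len(list(iter_qa_pairs(...))) could be compared with the pair counts stated in the description.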

Funded by the Generalitat de Catalunya, Departament de Polítiques Digitals i Administració Pública (AINA), MT4ALL and Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Files

ViquiQuAD.zip (1.5 MB)
md5:99888bd50255a2a5da9788f648d5caed
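The md5 value listed for ViquiQuAD.zip can be used to check that a download is intact. The following is a minimal sketch using Python's standard hashlib module; the helper names md5_of_file and verify_download are hypothetical, and the local path passed to them is an assumption about where the archive was saved.

```python
import hashlib

# Checksum as listed in the Files section of this record.
EXPECTED_MD5 = "99888bd50255a2a5da9788f648d5caed"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, verify_download("ViquiQuAD.zip") should return True for an uncorrupted copy saved under that name in the current directory.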