ViquiQuAD: an extractive QA dataset from Catalan Wikipedia

Rodriguez-Penagos, Carlos Gerardo; Armentano-Oller, Carme

doi:10.5281/zenodo.4562345

Published February 25, 2021 | Version v1

Dataset | Open access

1. Barcelona Supercomputing Center (BSC)

Extractive QA dataset with 6282 question-answer pairs developed from articles of the Catalan Wikipedia (Viquipèdia, https://ca.wikipedia.org), used under the Creative Commons Attribution-ShareAlike license.

“ViquiQuAD: an extractive QA dataset from Catalan Wikipedia” is distributed under the CC-BY-SA license by Carlos Rodríguez and Carme Armentano of the Text Mining Unit at BSC - CNS.

From a set of high-quality original articles in the Catalan Wikipedia (the Viquipèdia), 597 were randomly chosen, and from them 3111 contexts of 5 to 8 sentences each were extracted. The creation of between 1 and 5 questions for each context was commissioned, following an adaptation of the guidelines from SQuAD 1.0 (Rajpurkar, Pranav et al. “SQuAD: 100,000+ Questions for Machine Comprehension of Text.” EMNLP (2016), http://arxiv.org/abs/1606.05250). In total, 15153 pairs of a question and a context fragment that contains the answer were annotated.
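
To make the annotation unit concrete, here is a minimal Python sketch of what a question paired with an answer-bearing context fragment could look like. It assumes the files follow the SQuAD v1.x JSON nesting (`data` > `paragraphs` > `qas` > `answers`), which this record does not state explicitly. The toy record and the helper names `iter_pairs` and `span_is_consistent` are hypothetical and are not taken from the dataset.

```python
from typing import Dict, Iterator, Tuple

# Toy record invented for illustration; it is NOT taken from ViquiQuAD.
# ASSUMPTION: the SQuAD v1.x-style nesting below mirrors the real file layout.
toy = {
    "data": [{
        "title": "Exemple",
        "paragraphs": [{
            "context": "Barcelona és la capital de Catalunya.",
            "qas": [{
                "id": "toy-0",
                "question": "Quina és la capital de Catalunya?",
                "answers": [{"text": "Barcelona", "answer_start": 0}],
            }],
        }],
    }]
}


def iter_pairs(squad_like: Dict) -> Iterator[Tuple[str, str, str, int]]:
    """Yield (question, context, answer_text, answer_start) tuples."""
    for article in squad_like["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield (qa["question"], context,
                           answer["text"], answer["answer_start"])


def span_is_consistent(context: str, text: str, start: int) -> bool:
    """Check that the answer text really occurs at the stated character offset."""
    return context[start:start + len(text)] == text


pairs = list(iter_pairs(toy))
all_consistent = all(span_is_consistent(c, t, s) for _, c, t, s in pairs)
```

In an extractive setting the answer is always a substring of its context, so a span check of this kind is a cheap sanity test to run after loading the data.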

Funded by the Generalitat de Catalunya, Departament de Polítiques Digitals i Administració Pública (AINA), MT4ALL and Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Files

Name	Size	MD5
ViquiQuAD.zip	1.5 MB	99888bd50255a2a5da9788f648d5caed
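
The MD5 checksum listed for ViquiQuAD.zip can be used to confirm that a downloaded copy is intact. The following is a minimal sketch using only the Python standard library. The local path passed to `archive_is_intact` and the helper names themselves are assumptions made for illustration.

```python
import hashlib

# Checksum copied from the file listing of this record.
EXPECTED_MD5 = "99888bd50255a2a5da9788f648d5caed"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Compare the file's MD5 digest with the expected hex string."""
    return md5_of_file(path) == expected.lower()
```

A call such as `archive_is_intact("ViquiQuAD.zip")` assumes the archive was saved under that name in the working directory. MD5 is adequate for detecting corrupted downloads but is not a defense against deliberate tampering.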

