ViquiQuAD: an extractive QA dataset from Catalan Wikipedia
Description
Extractive QA dataset with 6282 question-answer pairs developed from articles in the Catalan Wikipedia, the Viquipèdia (https://ca.wikipedia.org), used under the Creative Commons Attribution-ShareAlike license.
“ViquiQuAD: an extractive QA dataset from Catalan Wikipedia” is distributed under the CC-BY-SA license by Carlos Rodríguez and Carme Armentano of the Text Mining Unit at BSC-CNS.
From a set of high-quality original articles in the Catalan Wikipedia (the Viquipèdia), 597 were randomly chosen, and 3111 contexts of 5 to 8 sentences each were extracted from them. Annotators were commissioned to write between 1 and 5 questions for each context, following an adaptation of the SQuAD 1.0 guidelines (Rajpurkar, Pranav et al. “SQuAD: 100,000+ Questions for Machine Comprehension of Text.” EMNLP (2016), http://arxiv.org/abs/1606.05250). In total, 15153 pairs were annotated, each consisting of a question and a context fragment that contains the answer.
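Since each annotated pair links a question to a fragment of its context, a reader may want to walk the pairs and check that every answer really is a span of the context. The following is a minimal sketch that assumes the distributed files use the SQuAD 1.0 JSON layout (`data` → `paragraphs` → `qas` → `answers` with `text` and `answer_start`); this record does not state the file format, so the field names are an assumption, and the `sample` record is a made-up illustration rather than an entry from the dataset.

```python
from typing import Iterator, Tuple


def iter_qa_pairs(dataset: dict) -> Iterator[Tuple[str, str, str, int]]:
    """Yield (context, question, answer_text, answer_start) tuples.

    Assumes a SQuAD 1.0-style nesting; adjust the keys if the
    actual files differ.
    """
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield (context, qa["question"],
                           answer["text"], answer["answer_start"])


def span_matches(context: str, text: str, start: int) -> bool:
    """True if the answer text sits in the context at the given offset."""
    return context[start:start + len(text)] == text


# Hypothetical record for illustration only; not taken from ViquiQuAD.
sample = {
    "data": [{
        "title": "Exemple",
        "paragraphs": [{
            "context": "Barcelona és la capital de Catalunya.",
            "qas": [{
                "id": "ex-1",
                "question": "Quina és la capital de Catalunya?",
                "answers": [{"text": "Barcelona", "answer_start": 0}],
            }],
        }],
    }]
}

pairs = list(iter_qa_pairs(sample))
bad_spans = [p for p in pairs if not span_matches(p[0], p[2], p[3])]
```

To run this on a real file, `sample` would be replaced with the result of `json.load` on one of the extracted JSON files; `len(pairs)` then gives the number of question-answer pairs, and `bad_spans` lists any whose offset does not line up with the context.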
Funded by the Generalitat de Catalunya, Departament de Polítiques Digitals i Administració Pública (AINA), MT4ALL and Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).
Files
ViquiQuAD.zip (1.5 MB)
md5:99888bd50255a2a5da9788f648d5caed
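The MD5 checksum listed for ViquiQuAD.zip can be used to check that a downloaded copy is intact. A minimal sketch using Python's standard `hashlib` module follows; the local path `ViquiQuAD.zip` and the helper names `md5_of_file` and `verify_archive` are assumptions chosen for illustration.

```python
import hashlib

# Checksum as listed on this record for ViquiQuAD.zip.
EXPECTED_MD5 = "99888bd50255a2a5da9788f648d5caed"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path: str = "ViquiQuAD.zip") -> bool:
    """True if the file at `path` matches the listed checksum."""
    return md5_of_file(path) == EXPECTED_MD5
```

Calling `verify_archive()` from the download directory returns `True` only when the file's digest equals the listed value; a `False` result suggests a truncated or altered download.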