# ViquiQuAD: an extractive QA dataset for Catalan, built from Wikipedia

## Methodology
From a set of high-quality, non-translated articles in the Catalan Wikipedia (Viquipèdia), 597 were randomly chosen, and 3111 contexts of 5 to 8 sentences were extracted from them. Between 1 and 5 questions were commissioned for each context, following an adaptation of the SQuAD 1.0 guidelines (Rajpurkar, Pranav et al. “SQuAD: 100,000+ Questions for Machine Comprehension of Text.” EMNLP (2016), http://arxiv.org/abs/1606.05250). In total, 15153 pairs were created, each consisting of a question and an extracted fragment that contains the answer.


<pre>
          "context": "L'historiador Frederick W. Mote va escriure que l'ús del terme \"classes socials\" per a aquest sistema era enganyós i que la posició de les persones dins del sistema de quatre classes no era una indicació del seu poder social i riquesa reals, sinó que només implicava \"graus de privilegi\" als quals tenien dret institucionalment i legalment, de manera que la posició d'una persona dins de les classes no era una garantia de la seva posició, ja que hi havia xinesos rics i amb bona reputació social, però alhora hi havia menys mongols i semu rics que mongols i semu que vivien en la pobresa i eren maltractats.",
          "qas": [
            {
              "answers": [
                {
                  "text": "Frederick W. Mote",
                  "answer_start": 14
                }
              ],
              "id": "5728848cff5b5019007da298",
              "question": "Qui creia que el sistema de classes socials de Yuan no s’hauria d’anomenar classes socials?"
            },
	
</pre>
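
The fragment above follows the SQuAD layout, in which `answer_start` is a character offset into `context`. As a minimal sketch, the check below verifies that the offset and the answer text agree. The context is shortened to its first clause for brevity, and the helper name `answer_matches` is hypothetical, not part of the dataset or of any library.

```python
# Shortened copy of the record shown above; only the first clause of the
# context is kept, which is enough to contain the answer span.
record = {
    "context": (
        "L'historiador Frederick W. Mote va escriure que l'ús del terme "
        "\"classes socials\" per a aquest sistema era enganyós"
    ),
    "qas": [
        {
            "answers": [{"text": "Frederick W. Mote", "answer_start": 14}],
            "id": "5728848cff5b5019007da298",
            "question": (
                "Qui creia que el sistema de classes socials de Yuan "
                "no s’hauria d’anomenar classes socials?"
            ),
        }
    ],
}


def answer_matches(context, answer):
    """Return True if the answer text sits at answer_start in the context."""
    start = answer["answer_start"]
    end = start + len(answer["text"])
    return context[start:end] == answer["text"]


for qa in record["qas"]:
    for answer in qa["answers"]:
        print(qa["id"], answer_matches(record["context"], answer))
```

Running the same check over every record is a cheap way to detect offset drift after re-encoding or normalizing the file.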

## Contents

* ViquiQuADv2.0.json - JSON-formatted file with the dataset
* QA Guidelines in Catalan
* README.md

## Content analysis

### Number of articles, paragraphs and questions

* Number of articles: 597
* Number of contexts: 3111
* Number of questions: 15153
* Questions/context: 4.87
* Number of sentences in contexts: 15100
* Sentences/context: 4.85
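
The two per-context averages follow directly from the raw counts listed above. A short sketch of the arithmetic (the variable names are illustrative only):

```python
# Raw counts as reported in this section.
contexts = 3111
questions = 15153
sentences = 15100

# Per-context averages, rounded to two decimals as in the list above.
questions_per_context = f"{questions / contexts:.2f}"
sentences_per_context = f"{sentences / contexts:.2f}"

print("Questions/context:", questions_per_context)
print("Sentences/context:", sentences_per_context)
```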

### Number of tokens

* Tokens in contexts: 469335
* Tokens/context: 150.86
* Tokens in questions: 145249
* Tokens/question: 9.58
* Tokens in answers: 63246
* Tokens/answer: 4.17

### Lexical variation

After filtering (tokenization, stopword and punctuation removal, case normalization), 83.88% of the words in the questions can be found in the corresponding contexts.
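
One way such an overlap could be measured is sketched below. This is not the script behind the reported figure: the tokenizer, the stopword list, and the function names `content_words` and `question_overlap` are all assumptions made for illustration, and the stopword list is deliberately tiny.

```python
import re

# Hypothetical, deliberately small stopword list; the list actually used
# for the reported figure is not given in this README.
STOPWORDS = {
    "el", "la", "els", "les", "de", "del", "que", "a", "i",
    "no", "es", "un", "una", "per", "va",
}


def content_words(text):
    """Lowercase, split on non-word characters, and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def question_overlap(question, context):
    """Fraction of the question's content words that occur in the context."""
    q_words = content_words(question)
    if not q_words:
        return 0.0
    c_words = set(content_words(context))
    return sum(w in c_words for w in q_words) / len(q_words)


# Toy question written for this sketch, not taken from the dataset.
context = (
    "L'historiador Frederick W. Mote va escriure que l'ús del terme "
    "classes socials era enganyós"
)
question = "Qui va escriure sobre les classes socials?"
print(question_overlap(question, context))
```

Whether the reported percentage is averaged per question or computed over the whole corpus is not stated here, so the per-question ratio above is only one possible reading.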

### Question type

| Question word | Count | % |
|--------|-----|------|
| què |  4220 | 27.85 % |
| qui |  2239 | 14.78 % |
| com |  1964 | 12.96 % |
| quan |  1133 | 7.48 % |
| on |  1580 | 10.43 % |
| quant |  925 | 6.10 % |
| quin |  3399 | 22.43 % |
| no question mark | 21 | 0.14 % |
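
The counts in the table add up to more than the 15153 questions, which suggests that a question can be counted under more than one type. A hedged sketch of such a non-exclusive tally follows. The regular expressions, the grouping of inflected forms (for example `quina` under `quin`), and the function name `question_types` are assumptions for illustration, not the procedure used to build the table.

```python
import re
from collections import Counter

# Assumption: inflected forms are grouped under the base form of the table.
PATTERNS = {
    "què": r"\bquè\b",
    "qui": r"\bqui\b",
    "com": r"\bcom\b",
    "quan": r"\bquan\b",
    "on": r"\bon\b",
    "quant": r"\bquant(?:s|a|es)?\b",
    "quin": r"\bquin(?:s|a|es)?\b",
}


def question_types(question):
    """Return every type whose question word occurs in the question."""
    text = question.lower()
    types = [name for name, pattern in PATTERNS.items() if re.search(pattern, text)]
    if "?" not in question:
        types.append("no question mark")
    return types


questions = [
    "Qui creia que el sistema de classes socials de Yuan no s’hauria d’anomenar classes socials?",
    "Quantes classes hi havia",  # toy example, not from the dataset
]
counts = Counter(t for q in questions for t in question_types(q))
print(counts)
```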

### Question-answer relationships

From 100 randomly selected samples:
* Lexical variation: 33.0%
* World knowledge: 16.0%
* Syntactic variation: 35.0%
* Multiple sentence: 17.0%

## License

Copyright (c) 2021 Text Mining Unit at BSC

Funded by the <a href="https://politiquesdigitals.gencat.cat/ca/inici">Generalitat de Catalunya, Departament de Polítiques Digitals i Administració Pública (AINA)</a>, <a href="https://www.bsc.es/ca/research-and-development/projects/mt4all-unsupervised-mt-low-resourced-language-pairs">MT4ALL</a> and <a href="https://plantl.mineco.gob.es">Plan de Impulso de las Tecnologías del Lenguaje (Plan TL)</a>. <br/><br/>
<a rel="license" href="https://creativecommons.org/licenses/by-sa/4.0/"><img alt="Attribution-ShareAlike 4.0 International License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="https://creativecommons.org/licenses/by-sa/4.0/">Attribution-ShareAlike 4.0 International License</a>.
