# VilaQuAD, an extractive QA dataset for Catalan, from Vilaweb newswire text

## Digital Object Identifier (DOI) and access to dataset files

https://doi.org/10.5281/zenodo.4562337


## Introduction

This dataset contains 2095 Catalan-language news articles, each accompanied by 1 to 5 questions that refer to the article fragment (or context).
VilaQuAD articles are extracted from the daily Vilaweb (www.vilaweb.cat) and used under a CC BY-NC-ND licence (https://creativecommons.org/licenses/by-nc-nd/3.0/deed.ca).
This dataset can be used to build extractive QA systems and language models.

### Supported Tasks and Leaderboards

Extractive-QA, Language Model

### Languages

Catalan (`ca`)

### Directory structure

* VilaQuAD1.0.json - JSON-formatted file with the dataset
* Guidelines for Catalan QA
* README.md

## Dataset Structure

### Data Instances

One JSON file

### Data Fields

Follows (Rajpurkar, Pranav et al., 2016) for SQuAD v1 datasets (see below for full reference).

### Example:
<pre>
{
  "data": [
    {
      "title": "Com celebrar el Cap d'Any 2020? Deu propostes per a acomiadar-se del 2019",
      "paragraphs": [
        {
          "context": "Hi ha moltes propostes per a acomiadar-se d'aquest 2019. Els uns es queden a casa, els altres volen anar lluny o sortir al teatre. També s'organitzen festes o festivals a l'engròs, fins i tot hi ha propostes diürnes. Tot és possible per Cap d'Any. Encara no sabeu com celebrar l'entrada el 2020? Us oferim una llista amb deu propostes variades arreu dels Països Catalans: Festivern El Festivern enguany celebra quinze anys.",
          "qas": [
            
            {
              "answers": [
                {
                  "text": "festes o festivals",
                  "answer_start": 150
                }
              ],
              "id": "P_23_C_23_Q2",
              "question": "Què s'organitza a l'engròs per acomiadar el 2019?"
            },
            ...
          ]
        }
      ]
    },
    ...
  ]
}

</pre>
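
The nested layout above can be flattened into one record per question-answer pair. The following is a minimal sketch, not an official loader: the helper names `load_squad`, `iter_examples` and `offset_is_consistent` are hypothetical, and the inline sample reuses the record shown above with a shortened context. It also checks that `answer_start` is a 0-based character offset into `context`, which holds for this example.

```python
import json


def load_squad(path):
    """Read a SQuAD v1-style JSON file (for example VilaQuAD1.0.json) into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def iter_examples(squad):
    """Yield one flat record per (question, answer) pair."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield {
                        "id": qa["id"],
                        "title": article["title"],
                        "context": context,
                        "question": qa["question"],
                        "answer_text": answer["text"],
                        "answer_start": answer["answer_start"],
                    }


def offset_is_consistent(example):
    """True if slicing the context at answer_start reproduces the answer text."""
    start = example["answer_start"]
    end = start + len(example["answer_text"])
    return example["context"][start:end] == example["answer_text"]


# Inline sample built from the record shown above (context shortened for brevity).
sample = {
    "data": [
        {
            "title": "Com celebrar el Cap d'Any 2020? Deu propostes per a acomiadar-se del 2019",
            "paragraphs": [
                {
                    "context": (
                        "Hi ha moltes propostes per a acomiadar-se d'aquest 2019. "
                        "Els uns es queden a casa, els altres volen anar lluny o "
                        "sortir al teatre. També s'organitzen festes o festivals "
                        "a l'engròs, fins i tot hi ha propostes diürnes."
                    ),
                    "qas": [
                        {
                            "answers": [
                                {"text": "festes o festivals", "answer_start": 150}
                            ],
                            "id": "P_23_C_23_Q2",
                            "question": "Què s'organitza a l'engròs per acomiadar el 2019?",
                        }
                    ],
                }
            ],
        }
    ]
}

for example in iter_examples(sample):
    print(example["id"], offset_is_consistent(example))
```

To run the same check on the full file, pass the result of `load_squad("VilaQuAD1.0.json")` to `iter_examples`, assuming the file sits in the working directory.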

### Data Splits

One

## Content analysis

### Number of articles, paragraphs and questions

* Number of contexts: 2095
* Number of questions: 6282
* Questions/context: 2.99
* Number of sentences in contexts: 11901
* Sentences/context: 5.6
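
The context and question counts above can be recomputed from the JSON structure. The sketch below is an illustration on a toy dictionary, not the script that produced these figures; `summarize` and `toy` are hypothetical names. Sentence counts are left out because the sentence splitter used for the reported numbers is not stated.

```python
def summarize(squad):
    """Count contexts and questions in a SQuAD v1-style dict."""
    contexts = 0
    questions = 0
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            contexts += 1
            questions += len(paragraph["qas"])
    # Guard against an empty dataset to avoid dividing by zero.
    ratio = questions / contexts if contexts else 0.0
    return {
        "contexts": contexts,
        "questions": questions,
        "questions_per_context": ratio,
    }


# Toy data with two contexts and three questions (placeholder values).
toy = {
    "data": [
        {
            "title": "t",
            "paragraphs": [
                {"context": "c1", "qas": [{"id": "q1"}, {"id": "q2"}]},
                {"context": "c2", "qas": [{"id": "q3"}]},
            ],
        }
    ]
}

print(summarize(toy))
```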

### Number of tokens

* Tokens in contexts: 422477
* Tokens/context: 201.66
* Tokens in questions: 65849
* Tokens/question: 10.48
* Tokens in answers: 27716
* Tokens/answer: 4.41
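
Per-item token averages like the ones above can be approximated as follows. This is a sketch under a loud assumption: the tokenizer behind the reported figures is not documented, so plain whitespace splitting is used here and will not necessarily reproduce them. The helper name `mean_tokens` is hypothetical.

```python
def mean_tokens(texts):
    """Average number of whitespace-separated tokens per text (assumed tokenization)."""
    counts = [len(text.split()) for text in texts]
    return sum(counts) / len(counts) if counts else 0.0


# Question and answer taken from the example record in this README.
questions = ["Què s'organitza a l'engròs per acomiadar el 2019?"]
answers = ["festes o festivals"]

print(mean_tokens(questions))
print(mean_tokens(answers))
```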

### Question type 

| Question type | Count | % |
|---------------|-------|---|
| què | 1698 | 27.03 % |
| qui | 1161 | 18.48 % |
| com | 574 | 9.14 % |
| quan | 468 | 7.45 % |
| on | 559 | 8.90 % |
| quant | 601 | 9.57 % |
| quin | 1301 | 20.87 % |
| no question mark | 0 | 0.0 % |
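
A breakdown of this kind can be sketched by matching question words in each question. The matching rule used for the table is not documented, so everything below is an assumption: the `QUESTION_WORDS` mapping, the inclusion of inflected forms under `quant` and `quin`, and the choice to let a question count under more than one row. The last choice is motivated by the row counts adding up to more than 6282, which suggests some overlap. The second and third example questions are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical mapping from table row to the word forms counted under it.
QUESTION_WORDS = {
    "què": {"què"},
    "qui": {"qui"},
    "com": {"com"},
    "quan": {"quan"},
    "on": {"on"},
    "quant": {"quant", "quanta", "quants", "quantes"},
    "quin": {"quin", "quina", "quins", "quines"},
}


def question_types(question):
    """Return every row label whose word forms occur in the question."""
    words = set(re.findall(r"\w+", question.lower()))
    return [label for label, forms in QUESTION_WORDS.items() if words & forms]


def count_types(questions):
    """Count how many questions fall under each row label."""
    counts = Counter()
    for question in questions:
        counts.update(question_types(question))
    return counts


examples = [
    "Què s'organitza a l'engròs per acomiadar el 2019?",  # from this README
    "Quina edat té el festival?",  # made-up question
    "On i quan es fa?",  # made-up question
]
print(count_types(examples))
```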


### Question-answer relationships

From 100 randomly selected samples:

* Lexical variation: 32.0%
* World knowledge: 16.0%
* Syntactic variation: 22.0%
* Multiple sentence: 16.0%

## Dataset Creation

### Methodology
From the online edition of the Catalan newspaper Vilaweb (https://www.vilaweb.cat), 2095 articles were randomly selected. The headlines of these articles were also used to create a Textual Entailment dataset. For the extractive QA dataset, the creation of between 1 and 5 questions for each news context was commissioned, following an adaptation of the guidelines from SQuAD 1.0 (Rajpurkar, Pranav et al. “SQuAD: 100,000+ Questions for Machine Comprehension of Text.” EMNLP (2016), http://arxiv.org/abs/1606.05250). In total, 6282 pairs of a question and an extracted fragment that contains the answer were created.

### Curation Rationale

For compatibility with similar datasets in other languages, we followed existing curation guidelines as closely as possible. We also created another QA dataset from Wikipedia to ensure thematic and stylistic variety.

### Source Data

- https://www.vilaweb.cat/

#### Initial Data Collection and Normalization

The source data are articles scraped from the archives of the Catalan newspaper website Vilaweb (https://www.vilaweb.cat).

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

We commissioned the creation of 1 to 5 questions for each context, following an adaptation of the guidelines from SQuAD 1.0 (Rajpurkar, Pranav et al. “SQuAD: 100,000+ Questions for Machine Comprehension of Text.” EMNLP (2016), http://arxiv.org/abs/1606.05250).

#### Who are the annotators?

Annotation was commissioned from a specialized company that hired a team of native language speakers.

### Dataset Curators

Carlos Rodríguez and Carme Armentano, from BSC-CNS

### Personal and Sensitive Information

No personal or sensitive information is included.

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]


## Contact

Carlos Rodríguez-Penagos (carlos.rodriguez1@bsc.es) and Carme Armentano-Oller (carme.armentano@bsc.es)


## License

<a rel="license" href="https://creativecommons.org/licenses/by-sa/4.0/"><img alt="Attribution-ShareAlike 4.0 International License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="https://creativecommons.org/licenses/by-sa/4.0/">Attribution-ShareAlike 4.0 International License</a>.

