# Enriched CONLLU Ancora for ML training

## Digital Object Identifier (DOI) and access to dataset files

DOI 10.5281/zenodo.4762030


## Introduction

This is an enriched version, for Machine Learning purposes, of the CoNLL-U adaptation of the <a href="http://clic.ub.edu/corpus/">AnCora corpus</a>.

This version of the corpus was developed by BSC TeMU as part of the AINA project, and has been used for multi-task learning in the Catalan spaCy 3.0 models.

### Supported Tasks and Leaderboards

Lemmatization, POS tagging, Dependency Parsing, Named Entity Recognition, Language Modeling

### Languages

CA - Catalan

### Directory structure
 
* dev_docs.conllu
* test_docs.conllu
* train_docs.conllu
* README.txt 

## Dataset Structure

### Data Instances

Three ten-column files, one for each split.

### Data Fields

The files follow CoNLL-U (https://universaldependencies.org/format.html), a revised version of the CoNLL-X format, with added NERC annotations in IOB format.

(The next section is copied for convenience from the Universal Dependencies site.)

Annotations are encoded in plain text files (UTF-8, normalized to NFC, using only the LF character as line break, including an LF character at the end of file) with three types of lines:

* Word lines containing the annotation of a word/token in 10 fields separated by single tab characters; see below.
* Blank lines marking sentence boundaries.
* Comment lines starting with hash (#).
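
The three line types listed above can be told apart with a few string checks. The following is a minimal sketch, not part of the dataset tooling: `iter_sentences` is a hypothetical helper name and the two-sentence `sample` is made up for illustration.

```python
def iter_sentences(lines):
    """Yield (comments, word_lines) for each sentence in an iterable of lines."""
    comments, words = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # Blank line: sentence boundary.
            if words:
                yield comments, words
            comments, words = [], []
        elif line.startswith("#"):
            # Comment line (sent_id, text, ...).
            comments.append(line)
        else:
            # Word line: 10 tab-separated fields.
            words.append(line.split("\t"))
    if words:
        # Tolerate a file that lacks the final blank line.
        yield comments, words


# Made-up sample, for illustration only.
sample = (
    "# sent_id = demo-1\n"
    "1\tReus\tReus\tPROPN\tPROPN\t_\t0\troot\t_\tB-LOC\n"
    "\n"
    "# sent_id = demo-2\n"
    "1\tja\tja\tADV\tADV\t_\t0\troot\t_\tO\n"
)
sentences = list(iter_sentences(sample.splitlines()))
```

To read one of the split files, the same function can be given an open file object instead of `sample.splitlines()`.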

   

Sentences consist of one or more word lines, and word lines contain the following fields:

1. ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
2. FORM: Word form or punctuation symbol.
3. LEMMA: Lemma or stem of word form.
4. UPOS: Universal part-of-speech tag.
5. XPOS: Language-specific part-of-speech tag; underscore if not available.
6. FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
7. HEAD: Head of the current word, which is either a value of ID or zero (0).
8. DEPREL: Universal dependency relation to the HEAD (root if HEAD = 0) or a defined language-specific subtype of one.
9. DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
10. MISC: Any other annotation. (We have used this column to add NERC annotations.)
 
    
NERC annotations and "SpaceAfter=No" tags share the 10th column (MISC). When both are present they are separated by a vertical bar, as in "SpaceAfter=No|B-LOC".
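
As an illustration of how the 10th column can be unpacked, the sketch below maps a word line to a dictionary and separates the "SpaceAfter=No" flag from the NERC tag. The function name `parse_word_line`, the lowercase field names, and the fallback to `O` when no tag is found are assumptions made for this example, not part of the dataset.

```python
# Field names follow the CoNLL-U column order; lowercase keys are our choice.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_word_line(line):
    """Turn one tab-separated word line into a dict with NER and spacing info."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    token = dict(zip(FIELDS, values))
    parts = token["misc"].split("|")
    # True unless the token is explicitly glued to the next one.
    token["space_after"] = "SpaceAfter=No" not in parts
    # The remaining item is the IOB tag: "O", "B-XXX" or "I-XXX".
    ner = [p for p in parts if p == "O" or p.startswith(("B-", "I-"))]
    token["ner"] = ner[0] if ner else "O"  # assumption: default to "O"
    return token


# Word line 14 from the example sentence in this README.
line = "14\tTortosa\tTortosa\tPROPN\tPROPN\t_\t11\tconj\t_\tSpaceAfter=No|B-LOC"
token = parse_word_line(line)
```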


### Example:
<pre>
# sent_id = test-s8
# text = Els alumnes que vulguin acabar els seus estudis musicals a Reus o a Tortosa, ja no caldrà que es desplacin a Tarragona o a Vilaseca per cursar l'últim cicle de Grau Mitjà de Música.
# orig_file_sentence 001#8
1	Els	el	DET	DET	Definite=Def|Gender=Masc|Number=Plur|PronType=Art	2	det	_	O
2	alumnes	alumne	NOUN	NOUN	Gender=Masc|Number=Plur	18	nsubj	_	O
3	que	que	PRON	PRON	PronType=Rel	4	nsubj	_	O
4	vulguin	voler	VERB	VERB	Mood=Sub|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin	2	acl	_	O
5	acabar	acabar	VERB	VERB	VerbForm=Inf	4	xcomp	_	O
6	els	el	DET	DET	Gender=Masc|Number=Plur|PronType=Art	8	det	_	O
7	seus	seu	DET	DET	Gender=Masc|Number=Plur|Person=3|Poss=Yes|PronType=Prs	8	det	_	O
8	estudis	estudi	NOUN	NOUN	Gender=Masc|Number=Plur	5	obj	_	O
9	musicals	musical	ADJ	ADJ	Number=Plur	8	amod	_	O
10	a	a	ADP	ADP	AdpType=Prep	11	case	_	O
11	Reus	Reus	PROPN	PROPN	_	5	obl	_	B-LOC
12	o	o	CCONJ	CCONJ	_	14	cc	_	O
13	a	a	ADP	ADP	AdpType=Prep	14	case	_	O
14	Tortosa	Tortosa	PROPN	PROPN	_	11	conj	_	SpaceAfter=No|B-LOC
15	,	,	PUNCT	PUNCT	PunctType=Comm	2	punct	_	O
16	ja	ja	ADV	ADV	_	18	advmod	_	O
17	no	no	ADV	ADV	Polarity=Neg	18	advmod	_	O
18	caldrà	caldre	VERB	VERB	Mood=Ind|Number=Sing|Person=3|Tense=Fut|VerbForm=Fin	0	root	_	O
19	que	que	SCONJ	SCONJ	_	21	mark	_	O
20	es	se	PRON	PRON	Case=Acc,Dat|Person=3|PrepCase=Npr|PronType=Prs|Reflex=Yes	21	obj	_	O
21	desplacin	desplaçar	VERB	VERB	Mood=Sub|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin	18	csubj	_	O
22	a	a	ADP	ADP	AdpType=Prep	23	case	_	O
23	Tarragona	Tarragona	PROPN	PROPN	_	21	obl	_	B-LOC
24	o	o	CCONJ	CCONJ	_	26	cc	_	O
25	a	a	ADP	ADP	AdpType=Prep	26	case	_	O
26	Vilaseca	Vilaseca	PROPN	PROPN	_	23	conj	_	B-LOC
27	per	per	ADP	ADP	AdpType=Prep	28	mark	_	O
28	cursar	cursar	VERB	VERB	VerbForm=Inf	21	advcl	_	O
29	l'	el	DET	DET	Definite=Def|Number=Sing|PronType=Art	31	det	_	SpaceAfter=No|O
30	últim	últim	ADJ	ADJ	Gender=Masc|Number=Sing|NumType=Ord	31	amod	_	O
31	cicle	cicle	NOUN	NOUN	Gender=Masc|Number=Sing	28	obj	_	O
32	de	de	ADP	ADP	AdpType=Prep	33	case	_	O
33	Grau	Grau	PROPN	PROPN	_	31	nmod	_	B-MISC
34	Mitjà	Mitjà	PROPN	PROPN	_	33	flat	_	I-MISC
35	de	de	ADP	ADP	AdpType=Prep	36	case	_	O
36	Música	Música	PROPN	PROPN	_	31	nmod	_	SpaceAfter=No|B-MISC
37	.	.	PUNCT	PUNCT	PunctType=Peri	18	punct	_	O
</pre>
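
To show how the IOB tags in the last column group into entities, here is a minimal sketch that collects entity spans from (form, tag) pairs. `iob_to_spans` is a hypothetical helper, and treating a stray `I-` tag as the start of a new entity is an assumption of this example.

```python
def iob_to_spans(tagged):
    """Group (form, IOB tag) pairs into (entity text, label) spans."""
    spans = []
    forms, label = [], None
    for form, tag in tagged:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            # Close any open entity, then open a new one.
            if forms:
                spans.append((" ".join(forms), label))
            forms, label = [form], tag[2:]
        elif tag.startswith("I-"):
            # Continuation of the current entity.
            forms.append(form)
        else:
            # "O": outside any entity.
            if forms:
                spans.append((" ".join(forms), label))
            forms, label = [], None
    if forms:
        spans.append((" ".join(forms), label))
    return spans


# Tokens 32-37 of the example sentence, with their NERC tags.
tagged = [("de", "O"), ("Grau", "B-MISC"), ("Mitjà", "I-MISC"),
          ("de", "O"), ("Música", "B-MISC"), (".", "O")]
spans = iob_to_spans(tagged)
# spans == [("Grau Mitjà", "MISC"), ("Música", "MISC")]
```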


### Data Splits

There are three splits (train, development and test), as in the Universal Dependencies project (https://universaldependencies.org/treebanks/ca_ancora/index.html).

## Dataset Creation

### Methodology

We adapted the NER labels from the AnCora corpus to the CoNLL-U format, splitting them to align with the word-per-line layout, and added conventional IOB tags to mark and classify Named Entities.
We added these NER tags to the 10th column, together with the already existing "SpaceAfter=No" tags, and removed the rest of the information in this column.
We changed the tokenization of enclitic pronouns, adding a "-" to the word form and the "SpaceAfter=No" tag to the preceding verb.
We also normalized the form of some pronoun and preposition lemmas.
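
Because "SpaceAfter=No" is kept in the 10th column, the surface text can be rebuilt from the word forms. The sketch below is one possible way to do so; `detokenize` is a hypothetical helper, and the input is assumed to be a list of (FORM, MISC) pairs.

```python
def detokenize(tokens):
    """Join word forms, omitting the space after tokens tagged SpaceAfter=No."""
    pieces = []
    for form, misc in tokens:
        pieces.append(form)
        if "SpaceAfter=No" not in misc.split("|"):
            pieces.append(" ")
    # Drop the space added after the final token.
    return "".join(pieces).rstrip()


# Tokens 28-31 of the example sentence in this README.
tokens = [("cursar", "O"), ("l'", "SpaceAfter=No|O"),
          ("últim", "O"), ("cicle", "O")]
text = detokenize(tokens)
# text == "cursar l'últim cicle"
```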

### Curation Rationale

This version closely follows CoNLL-U conventions.

### Source Data

#### Initial Data Collection and Normalization

AnCora consists of a Catalan corpus (AnCora-CA) and a Spanish corpus (AnCora-ES), each of 500,000 tokens (some multi-word). The corpora are annotated for linguistic phenomena at different levels.
The AnCora corpus is mainly based on newswire texts. For more information, refer to Taulé, M., M.A. Martí, M. Recasens (2008). “AnCora: Multilevel Annotated Corpora for Catalan and Spanish”, Proceedings of the 6th International Conference on Language Resources and Evaluation. http://www.lrec-conf.org/proceedings/lrec2008/pdf/35_paper.pdf

#### Who are the source language producers?

The Catalan AnCora corpus is compiled from articles from the following news outlets: <a href="https://www.efe.com">EFE</a>, <a href="https://www.acn.cat">ACN</a>, <a href="https://www.elperiodico.cat/ca/">El Periódico</a>.

### Annotations

#### Annotation process

This dataset is an enriched version of the CoNLL-U adaptation of the AnCora corpus.

#### Who are the annotators?

The annotators of the original AnCora Catalan 2.0.0 are: Oriol Borrega, Isabel Briz, Núria Bufí, Montserrat Civit, María Jesús Díaz, Silvia Garcia, Raquel Hernández, Marina Lloberes, Raquel Marcos, Difda Monterde, Montserrat Nofre, Aina Peris, Lourdes Puiggròs, Marta Recasens, Bàrbara Soriano, Rita Zaragoza.

Carlos Rodríguez and Carme Armentano, from BSC-CNS, did the conversion of the labels.

### Dataset Curators

The curators of the original AnCora Catalan 2.0.0 are:  M. Antònia Martí, Mariona Taulé and Marta Recasens, from UB.

Carlos Rodríguez and Carme Armentano, from BSC-CNS, adapted the corpus to this new version.

### Personal and Sensitive Information

No personal or sensitive information included.

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]


## Contact

Carlos Rodríguez-Penagos (carlos.rodriguez1@bsc.es) and Carme Armentano-Oller (carme.armentano@bsc.es)


## License

<a rel="license" href="https://creativecommons.org/licenses/by/4.0/"><img alt="Attribution 4.0 International License" style="border-width:0" src="https://chriszabriskie.com/img/cc-by.png" /></a><br />This work is licensed under a <a rel="license" href="https://creativecommons.org/licenses/by/4.0/">Attribution 4.0 International License</a>.
