# Semantic Textual Similarity in Catalan

The STS corpus is a benchmark for evaluating Semantic Textual Similarity in Catalan.
It consists of more than 3000 sentence pairs, each annotated with the semantic similarity between its two sentences on a scale from 0 (no similarity at all) to 5 (semantic equivalence). The annotation was done manually by 4 different annotators following our guidelines, which are based on previous work from the SemEval challenges (https://www.aclweb.org/anthology/S13-1004.pdf).

## Methodology
Random sentences were extracted from three Catalan corpora (ACN, Oscar and Wikipedia), and we generated candidate pairs using a combination of metrics from Doc2Vec, Jaccard and a BERT-like model (“distiluse-base-multilingual-cased-v2”, [link](https://huggingface.co/distilbert-base-multilingual-cased)). Finally, we manually reviewed the generated pairs to reject non-relevant ones (identical or ungrammatical sentences, etc.) before providing them to the annotation team.
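
As an illustration of one of the metrics named above, here is a minimal sketch of a Jaccard similarity between two sentences. The tokenization (lowercasing and whitespace splitting) and the function name `jaccard_similarity` are assumptions made for this example. The tokenization actually used to build the dataset is not described here.

```python
def jaccard_similarity(sentence_a: str, sentence_b: str) -> float:
    """Jaccard similarity between the token sets of two sentences.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    if not tokens_a and not tokens_b:
        # Two empty sentences are treated as identical (a convention chosen here).
        return 1.0
    # Size of the intersection divided by size of the union.
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Two of the four distinct tokens are shared.
print(jaccard_similarity("el gat dorm", "el gos dorm"))
```
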
The average of the four annotations was selected as the “ground truth” for each sentence pair, except when an annotator diverged by more than one unit from the average. In these cases, we discarded the divergent annotation and recalculated the average without it. We also discarded 45 sentence pairs because the annotators disagreed too much.
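
The aggregation rule described above can be sketched as follows. This is not the `analitzaSTS` script shipped with the dataset. The function name `ground_truth` and the `min_kept` parameter are hypothetical: the exact criterion used to discard the 45 pairs is not stated here, so the sketch simply rejects a pair when fewer than `min_kept` annotations remain.

```python
from typing import List, Optional, Sequence, Tuple


def ground_truth(
    scores: Sequence[float],
    max_divergence: float = 1.0,
    min_kept: int = 3,  # hypothetical threshold, not taken from the dataset scripts
) -> Tuple[Optional[float], List[int]]:
    """Return (ground-truth score or None, indices of excluded annotators)."""
    average = sum(scores) / len(scores)
    # An annotation is divergent if it is more than `max_divergence` from the average.
    excluded = [i for i, s in enumerate(scores) if abs(s - average) > max_divergence]
    kept = [s for i, s in enumerate(scores) if i not in excluded]
    if len(kept) < min_kept:
        # Too much disagreement: the pair is rejected (assumed rule).
        return None, excluded
    # Recalculate the average without the divergent annotations.
    return sum(kept) / len(kept), excluded


print(ground_truth([2, 3, 3, 4]))  # no annotator diverges by more than one unit
print(ground_truth([1, 1, 1, 5]))  # the fourth annotation is dropped
print(ground_truth([0, 0, 5, 5]))  # every annotation diverges, pair rejected
```

Note that the divergence is measured once, against the average of all four annotations. Whether the original scripts repeat the check after recalculating the average is not specified here.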

This dataset was developed by BSC TeMU as part of the AINA project. 

This is version 1.0.1 of the dataset, with the complete human and automatic annotations and the analysis scripts. It also has a more accurate license.

## Contents
* COMPLET4anotadors.tsv: dataset with the sentence pairs, the 4 individual human annotations, their average, the difference between each individual annotation and the average, and the BERT, Doc2Vec and Jaccard measures
* sts_dataset.tsv: dataset with the sentence pairs, the 4 individual human annotations, their average, the average used as ground truth, the list of excluded annotators and the list of excluded sentence pairs
* sts_ground_truth.tsv: dataset with the sentence pairs and the average used as ground truth
* splits: directory with sts_ground_truth.tsv split into test, train and development sets
* analitzaSTS.*: script used to create sts_dataset.tsv from COMPLET4anotadors.tsv
* create_dataset_gt.py: script used to create sts_ground_truth.tsv from sts_dataset.tsv
* report.html: pandas-profiling report
* CC-BY4.0.txt
* Guidelines STS (Catalan)
* README
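
For readers who want to load the ground-truth file, here is a minimal sketch that uses only the Python standard library. The column layout is an assumption, not something documented above: the first two columns are taken to be the two sentences and the last column the score. Check the actual file before relying on it. The function name `read_sts_pairs` is hypothetical.

```python
import csv
from typing import Iterable, List, Tuple


def read_sts_pairs(lines: Iterable[str]) -> List[Tuple[str, str, float]]:
    """Parse tab-separated lines into (sentence1, sentence2, score) tuples.

    Assumed layout: sentence1 <TAB> sentence2 <TAB> ... <TAB> score.
    Rows whose last field is not a number (e.g. a header) are skipped.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        if len(row) < 3:
            continue  # not enough columns for a sentence pair plus a score
        try:
            score = float(row[-1])
        except ValueError:
            continue  # header row or malformed score
        pairs.append((row[0], row[1], score))
    return pairs


# In-memory sample standing in for the real file.
sample = ["sentence1\tsentence2\tscore\n", "El gat dorm.\tEl gos dorm.\t2.5\n"]
print(read_sts_pairs(sample))
```

To read the real file, an open file handle can be passed in place of `sample`, for example `open("sts_ground_truth.tsv", encoding="utf-8", newline="")`.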



## License

Copyright (c) 2021 Text Mining Unit at BSC

Funded by the <a href="https://politiquesdigitals.gencat.cat/ca/inici">Generalitat de Catalunya, Departament de Polítiques Digitals i Administració Pública (AINA)</a>, <a href="https://www.bsc.es/ca/research-and-development/projects/mt4all-unsupervised-mt-low-resourced-language-pairs">MT4ALL</a> and <a href="https://plantl.mineco.gob.es">Plan de Impulso de las Tecnologías del Lenguaje (Plan TL)</a>. <br/><br/>
<a rel="license" href="https://creativecommons.org/licenses/by/4.0/legalcode"><img alt="Attribution 4.0 International License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="https://creativecommons.org/licenses/by/4.0/legalcode" target="_blank">Attribution 4.0 International License</a>.
