README

The STS corpus is a benchmark for evaluating Semantic Textual Similarity in Catalan.
It consists of 3079 sentence pairs, each annotated with a semantic similarity score on a scale from 0 (no similarity at all) to 5 (semantic equivalence). The annotation was done manually by 4 different annotators following our guidelines, which are based on previous work from the SemEval challenges (https://www.aclweb.org/anthology/S13-1004.pdf).
Random sentences were extracted from 3 Catalan corpora (ACN, Oscar and Wikipedia), and candidate pairs were generated using a combination of metrics from Doc2Vec, Jaccard and a BERT-like model (“distiluse-base-multilingual-cased-v2” https://huggingface.co/distilbert-base-multilingual-cased). Finally, we manually reviewed the generated pairs and rejected non-relevant ones (identical or ungrammatical sentences, etc.) before providing them to the annotation team.
The average of the four annotations was selected as the “ground truth” for each sentence pair, except when an annotator diverged by more than one unit from the average. In those cases, we discarded the divergent annotation and recalculated the average without it. We also discarded two sentence pairs because the annotators strongly disagreed with each other.
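
The aggregation rule described above can be sketched as follows. This is a minimal illustration, not the project's analysis script: the function name aggregate_scores, the max_divergence parameter, and the choice to return None when every annotation diverges are assumptions made for this example.

```python
from statistics import mean


def aggregate_scores(scores, max_divergence=1.0):
    """Return a ground-truth score for one sentence pair.

    scores: the individual annotator scores (0 to 5).
    max_divergence: hypothetical parameter; annotations further than this
    from the initial average are treated as divergent and dropped.
    """
    initial_average = mean(scores)

    # Keep only the annotations within one unit of the initial average.
    kept = [s for s in scores if abs(s - initial_average) <= max_divergence]

    # Assumption: if every annotation diverges, the pair has no usable
    # ground truth and the caller decides what to do with it.
    if not kept:
        return None

    # Recalculate the average without the divergent annotations.
    return mean(kept)


if __name__ == "__main__":
    print(aggregate_scores([3.0, 3.0, 4.0, 3.0]))  # no annotation diverges
    print(aggregate_scores([4.0, 4.0, 4.0, 1.0]))  # the 1.0 is dropped
    print(aggregate_scores([0.0, 0.0, 5.0, 5.0]))  # all annotations diverge
```

Divergence is measured once, against the average of all four annotations, which is the simplest reading of the description; an iterative variant that re-checks after each removal would be a different design choice.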

This dataset was developed by BSC TeMU as part of the AINA project. 

This is version 0.9 of the dataset. Version 1, with the complete human and automatic annotations and the analysis scripts, will be released soon.

Contact:
carlos.rodriguez1@bsc.es
carme.armentano@bsc.es
marta.villegas@bsc.es
