Semantic Textual Similarity in Catalan

Rodriguez-Penagos, Carlos Gerardo; Armentano-Oller, Carme; Gonzalez-Agirre, Aitor; Gibert Bonet, Ona

doi:10.5281/zenodo.4761434

Published February 10, 2021 | Version 1.0.1

Dataset Open

1. BSC

If you use this resource in your work, please cite our latest paper:

@inproceedings{armengol-estape-etal-2021-multilingual,
title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
author = "Armengol-Estap{\'e}, Jordi and
Carrino, Casimiro Pio and
Rodriguez-Penagos, Carlos and
de Gibert Bonet, Ona and
Armentano-Oller, Carme and
Gonzalez-Agirre, Aitor and
Melero, Maite and
Villegas, Marta",
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-acl.437",
doi = "10.18653/v1/2021.findings-acl.437",
pages = "4933--4946",
}

The STS corpus is a benchmark for evaluating Semantic Textual Similarity in Catalan.
It consists of more than 3,000 sentence pairs, each annotated with a semantic similarity score on a scale from 0 (no similarity at all) to 5 (semantic equivalence). The annotation was done manually by 4 annotators following our guidelines, which are based on previous work from the SemEval challenges (https://www.aclweb.org/anthology/S13-1004.pdf).

The source data are sentences scraped from the Catalan Textual Corpus (https://doi.org/10.5281/zenodo.4519349), used under the CC BY-SA 4.0 licence (https://creativecommons.org/licenses/by-sa/4.0/). The dataset is released under the same licence.

This dataset was developed by BSC TeMU as part of the AINA project, and to enrich the Catalan Language Understanding Benchmark (CLUB).

This is version 1.0.2 of the dataset, with the complete human and automatic annotations and the analysis scripts. It also has a more accurate licence.

This dataset can be used to build and score semantic similarity models.
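As a rough illustration of scoring a similarity model against gold annotations on the 0 to 5 scale, the sketch below computes the Pearson correlation between gold and system scores. Pearson correlation is the measure commonly reported in the SemEval STS tasks; that it is the intended metric for this dataset is an assumption, and the function name and the example scores are hypothetical, not taken from the corpus.

```python
import math


def pearson(gold, predicted):
    """Pearson correlation between two equal-length lists of scores."""
    if len(gold) != len(predicted) or len(gold) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(gold)
    mean_gold = sum(gold) / n
    mean_pred = sum(predicted) / n
    # Covariance and variances, left unnormalised because the factors cancel.
    cov = sum((g - mean_gold) * (p - mean_pred) for g, p in zip(gold, predicted))
    var_gold = sum((g - mean_gold) ** 2 for g in gold)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    if var_gold == 0 or var_pred == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_gold * var_pred)


# Hypothetical scores for five sentence pairs (illustration only).
gold_scores = [0.0, 1.5, 3.0, 4.5, 5.0]
system_scores = [0.2, 1.0, 3.5, 4.0, 4.8]
print(f"Pearson r = {pearson(gold_scores, system_scores):.3f}")
```

Because the correlation is invariant to linear rescaling, a system that outputs cosine similarities in [0, 1] can be scored this way without first mapping its outputs to the 0 to 5 range.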

Corpus for evaluating STS in Catalan.

It consists of 3079 sentence pairs, annotated according to their degree of semantic similarity on a scale from 0 (not similar at all) to 5 (equivalent). The annotation was done manually by 4 people following our guidelines, which are based on the SemEval Challenges (https://www.aclweb.org/anthology/S13-1004.pdf).

This dataset was developed by the Text Mining unit of the BSC within the framework of the Aina project.

Files

Name	Size
STS-ca_v.1.0.2.zip md5:3d83441f9ba96f3ee61661a97543b39a	1.3 MB
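The MD5 checksum listed for the archive can be used to check that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names are hypothetical, and it assumes the archive has been saved locally under its original file name.

```python
import hashlib

# Checksum as listed for STS-ca_v.1.0.2.zip in the file table.
EXPECTED_MD5 = "3d83441f9ba96f3ee61661a97543b39a"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, `is_intact("STS-ca_v.1.0.2.zip")` should return `True` for an uncorrupted copy, assuming the file sits in the current working directory.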
