Expert annotations for the Catalan Common Voice (v13)

Language Technologies Unit

doi:10.5281/zenodo.11104388

Published May 2, 2024 | Version v1

Dataset Open

Language Technologies Unit (Research group)¹

1. Barcelona Supercomputing Center

Dataset Description

- Homepage: https://projecteaina.cat/tech/
- Point of Contact: langtech@bsc.es

Dataset Summary

These are annotations made by a team of experts for the speakers with more than 1200 seconds of recordings in the Catalan set of the Common Voice dataset (v13).

The annotators were first asked to listen to all recordings associated with a speaker ID and judge whether they belong to the same individual. They were then asked to annotate the speaker's accent and gender, and the overall quality of the recordings.

The accents and genders taken into account are the ones used until version 8 of the Common Voice corpus.

See annotations for more details.

Supported Tasks and Leaderboards

Gender classification, Accent classification.

Languages

The dataset is in Catalan (ca).

Dataset Structure

Instances

Two xlsx documents are published, one for each round of annotations.

The following information is available in each of the documents:

{
'speaker ID': '1b7fc0c4e437188bdf1b03ed21d45b780b525fd0dc3900b9759d0755e34bc25e31d64e69c5bd547ed0eda67d104fc0d658b8ec78277810830167c53ef8ced24b',
'idx': '31',
'same speaker': {'AN1': 'SI',
'AN2': 'SI',
'AN3': 'SI',
'agreed': 'SI',
'percentage': '100'},
'gender': {'AN1': 'H',
'AN2': 'H',
'AN3': 'H',
'agreed': 'H',
'percentage': '100'},
'accent': {'AN1': 'Central',
'AN2': 'Central',
'AN3': 'Central',
'agreed': 'Central',
'percentage': '100'},
'audio quality': {'AN1': '4.0',
'AN2': '3.0',
'AN3': '3.0',
'agreed': '3.0',
'percentage': '66',
'mean quality': '3.33',
'stdev quality': '0.58'},
'comments': {'AN1': '',
'AN2': 'pujades i baixades de volum',
'AN3': "Deu ser d'alguna zona de transició amb el central, perquè no fa una reducció total vocàlica, però hi té molta tendència"},
}

We also publish the document Guia anotació parlants.pdf, with the guidelines the annotators received.
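
Because each record nests the per-annotator values under the annotated property, a flat, column-like view can be convenient for analysis. The following is a minimal sketch, assuming a record has already been loaded into a Python dict shaped like the example above. The `flatten` helper and the `/` separator are illustrative choices, not part of the dataset, and the record is abbreviated.

```python
def flatten(record, sep="/"):
    """Turn {'gender': {'AN1': 'H', ...}} into {'gender/AN1': 'H', ...}."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Nested block: one entry per annotator or summary field.
            for sub_key, sub_value in value.items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            # Top-level scalars such as 'idx' are copied unchanged.
            flat[key] = value
    return flat


# Abbreviated copy of the example record (speaker ID and some blocks omitted).
record = {
    "idx": "31",
    "gender": {"AN1": "H", "AN2": "H", "AN3": "H",
               "agreed": "H", "percentage": "100"},
    "audio quality": {"AN1": "4.0", "AN2": "3.0", "AN3": "3.0",
                      "agreed": "3.0", "percentage": "66",
                      "mean quality": "3.33", "stdev quality": "0.58"},
}

flat = flatten(record)
print(flat["gender/agreed"])
print(flat["audio quality/mean quality"])
```

Note that the values in the example are strings, so numeric fields such as `percentage` would need an explicit conversion before any arithmetic.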

Data Fields

speaker ID (string): An id for which client (voice) made the recording in the Common Voice corpus
idx (int): Id in this corpus
AN1 (string): Annotations from Annotator 1
AN2 (string): Annotations from Annotator 2
AN3 (string): Annotations from Annotator 3
agreed (string): Annotation from the majority of the annotators
percentage (int): Percentage of annotators that agree with the agreed annotation
mean quality (float): Mean of the quality annotation
stdev quality (float): Standard deviation of the quality annotations
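
The summary fields can be reproduced from the three annotator values. The sketch below is an assumption about how they were derived, not the curators' documented procedure: it takes the majority value, truncates the agreement percentage to an integer, and uses the sample standard deviation. These choices are consistent with the `audio quality` values in the example record (4.0, 3.0, 3.0 giving 66, 3.33 and 0.58). The function name `summarize` is hypothetical, and the source does not say how a three-way disagreement is resolved.

```python
from collections import Counter
from statistics import mean, stdev


def summarize(values):
    """Return (agreed value, integer percentage of annotators who chose it)."""
    counts = Counter(values)
    # most_common(1) picks the majority value; on a full tie it returns the
    # first value seen, which is an arbitrary choice made for this sketch.
    agreed, votes = counts.most_common(1)[0]
    percentage = 100 * votes // len(values)  # truncated, e.g. 2 of 3 gives 66
    return agreed, percentage


# Quality scores from the example record (AN1, AN2, AN3).
quality = [4.0, 3.0, 3.0]

agreed, percentage = summarize(quality)
mean_quality = round(mean(quality), 2)
stdev_quality = round(stdev(quality), 2)  # sample standard deviation (n - 1)

print(agreed, percentage, mean_quality, stdev_quality)
```

Using the population standard deviation (`statistics.pstdev`) instead would give about 0.47 for the same scores, which does not match the published example.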

Data Splits

The corpus is not divided into splits, as it is not intended for training models.

Dataset Creation

Curation Rationale

During 2022, a campaign was launched to promote the Common Voice corpus within the Catalan-speaking community, achieving remarkable success. However, not all participants provided their demographic details such as age, gender, and accent. Additionally, some individuals faced difficulty in self-defining their accent using the standard classifications established by specialists.

To obtain a balanced corpus with reliable information, we saw the need to enlist a group of experts from the University of Barcelona to provide accurate annotations.

We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.

Source Data

The original data comes from the [Catalan sentences of the Common Voice corpus](https://commonvoice.mozilla.org/en/datasets).

Initial Data Collection and Normalization

We selected the speakers who have recorded more than 1200 seconds of speech in the Catalan set of version 13 of the Common Voice corpus.
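
The selection criterion amounts to summing clip durations per speaker and keeping those above the threshold. The following is a minimal sketch, assuming the clip durations are already available as (speaker ID, seconds) pairs. The `clips` data, the speaker names and the `select_speakers` helper are invented for illustration and do not come from Common Voice.

```python
from collections import defaultdict

THRESHOLD_SECONDS = 1200  # threshold stated in the dataset description


def select_speakers(clips, threshold=THRESHOLD_SECONDS):
    """Return the IDs of speakers whose total recorded time exceeds threshold."""
    totals = defaultdict(float)
    for speaker_id, seconds in clips:
        totals[speaker_id] += seconds
    # "More than 1200 seconds" is read here as a strict comparison.
    return sorted(s for s, total in totals.items() if total > threshold)


# Toy input: (speaker ID, clip duration in seconds); not real data.
clips = [
    ("spk_a", 700.0),
    ("spk_a", 600.0),
    ("spk_b", 1200.0),
    ("spk_c", 30.0),
]

selected = select_speakers(clips)
print(selected)
```

In this toy input only `spk_a` is selected: `spk_b` totals exactly 1200 seconds and is excluded under the strict reading.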

Who are the source language producers?

The original data comes from the Catalan sentences of the Common Voice corpus.

Annotations

Annotation process

Starting from version 13 of the Common Voice corpus, we identified the 273 speakers who had recorded more than 1200 seconds of speech.

A team of three annotators was tasked with annotating:

- if all the recordings correspond to the same person
- the gender of the speaker
- the accent of the speaker
- the quality of the recording

They conducted an initial round of annotation, discussed their varying opinions, and subsequently conducted a second round.

Who are the annotators?

The annotation was entrusted to the [CLiC (Centre de Llenguatge i Computació)](https://clic.ub.edu/en/que-es-clic) team from the University of Barcelona.
They selected a group of three annotators (two men and one woman), who received a scholarship to do this work.

The annotation team was composed of:

- Annotator 1: 1 female annotator, aged 18-25, L1 Catalan, student in the Modern Languages and Literatures degree, with a focus on Catalan.
- Annotators 2 & 3: 2 male annotators, aged 18-25, L1 Catalan, students in the Catalan Philology degree.
- 1 female supervisor, aged 40-50, L1 Catalan, graduate in Physics and in Linguistics, Ph.D. in Signal Theory and Communications.

To do the annotation, they used a Google Drive spreadsheet.

Personal and Sensitive Information

The Common Voice dataset consists of recordings from people who have donated their voices online. We do not share their voices here, only their annotated gender and accent.
You agree to not attempt to determine the identity of speakers in the Common Voice dataset.

Considerations for Using the Data

Social Impact of Dataset

The speaker IDs come from the Common Voice dataset, which consists of recordings from people who have donated their voices online.

You agree to not attempt to determine the identity of speakers in the Common Voice dataset.

The information from this corpus will allow us to train and evaluate well-balanced Catalan ASR models. Furthermore, we believe the annotations hold philological value for studying dialectal and gender variants.

Discussion of Biases

Most of the voices in the Catalan set of Common Voice belong to men between 40 and 60 years old with a central accent. The aim of this dataset is to provide information that helps minimize the biases this could cause.

For the gender annotation, we have only considered "H" (male) and "D" (female).
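
To check how balanced a set of speakers is, the agreed gender and accent labels can be tallied. The following is a rough sketch with invented records: only the field names and the `H`, `D` and `Central` values come from this description, while the second accent label and all counts are made up for illustration.

```python
from collections import Counter

# Toy records holding only the agreed labels; not taken from the dataset.
# "Balear" is a hypothetical second accent label used for illustration.
speakers = [
    {"gender": "H", "accent": "Central"},
    {"gender": "H", "accent": "Central"},
    {"gender": "D", "accent": "Central"},
    {"gender": "D", "accent": "Balear"},
]

# Count speakers per (gender, accent) combination.
counts = Counter((s["gender"], s["accent"]) for s in speakers)

# Share of each combination, to spot over-represented groups.
shares = {group: n / len(speakers) for group, n in counts.items()}

for group, share in sorted(shares.items()):
    print(group, f"{share:.0%}")
```

The same tally could be weighted by recorded seconds rather than by speaker count if durations are available, since a few speakers may contribute most of the audio.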

Other Known Limitations

[N/A]

Additional Information

Dataset Curators

Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)

This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

Licensing Information

This dataset is licensed under a CC BY 4.0 license.

It can be used for any purpose, whether academic or commercial, under the terms of the license.
Give appropriate credit, provide a link to the license, and indicate if changes were made.

Citation Information

DOI

Contributions

The annotation was entrusted to the STeL team from the University of Barcelona.

Files

Name	Size	Checksum
expert_annotations_catalan_common_voice_v13.zip	185.6 kB	md5:489224749e26cc6d6093dc665f214bb8
