Expert annotations for the Catalan Common Voice (v13)
Description
Dataset Description
- Homepage: https://projecteaina.cat/tech/]
- Point of Contact: langech@bsc.es
Dataset Summary
These are the annotations made by a team of experts on the speakers with more than 1200 seconds recorded in the Catalan set of the Common Voice dataset (v13).
The annotators were initially tasked with evaluating all recordings associated with the same individual. Following that, they were instructed to annotate the speaker's accent, gender, and the overall quality of the recordings.
The accents and genders taken into account are the ones used until version 8 of the Common Voice corpus.
See annotations for more details.
Supported Tasks and Leaderboards
Gender classification, Accent classification.
Languages
The dataset is in Catalan (ca).
Dataset Structure
Instances
Two xlsx documents are published, one for each round of annotations.
The following information is available in each of the documents:
{ 'speaker ID': '1b7fc0c4e437188bdf1b03ed21d45b780b525fd0dc3900b9759d0755e34bc25e31d64e69c5bd547ed0eda67d104fc0d658b8ec78277810830167c53ef8ced24b', 'idx': '31', 'same speaker': {'AN1': 'SI', 'AN2': 'SI', 'AN3': 'SI', 'agreed': 'SI', 'percentage': '100'}, 'gender': {'AN1': 'H', 'AN2': 'H', 'AN3': 'H', 'agreed': 'H', 'percentage': '100'}, 'accent': {'AN1': 'Central', 'AN2': 'Central', 'AN3': 'Central', 'agreed': 'Central', 'percentage': '100'}, 'audio quality': {'AN1': '4.0', 'AN2': '3.0', 'AN3': '3.0', 'agreed': '3.0', 'percentage': '66', 'mean quality': '3.33', 'stdev quality': '0.58'}, 'comments': {'AN1': '', 'AN2': 'pujades i baixades de volum', 'AN3': 'Deu ser d'alguna zona de transició amb el central, perquè no fa una reducció total vocàlica, però hi té molta tendència'}, }
We also publish the document Guia anotació parlants.pdf, with the guidelines the annotators recieved.
Data Fields
- speaker ID (string): An id for which client (voice) made the recording in the Common Voice corpus
- idx (int): Id in this corpus
- AN1 (string): Annotations from Annotator 1
- AN2 (string): Annotations from Annotator 2
- AN3 (string): Annotations from Annotator 3
- agreed (string): Annotation from the majority of the annotators
- percentage (int): Percentage of annotators that agree with the agreed annotation
- mean quality (float): Mean of the quality annotation
- stdev quality (float): Standard deviation of the mean quality
Data Splits
The corpus remains undivided into splits, as its purpose does not involve training models.
Dataset Creation
Curation Rationale
During 2022, a campaign was launched to promote the Common Voice corpus within the Catalan-speaking community, achieving remarkable success. However, not all participants provided their demographic details such as age, gender, and accent. Additionally, some individuals faced difficulty in self-defining their accent using the standard classifications established by specialists.
In order to obtain a balanced corpus with reliable information, we have seen the the necessity of enlisting a group of experts from the University of Barcelona to provide accurate annotations.
We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.
Source Data
The original data comes from the [Catalan sentences of the Common Voice corpus](https://commonvoice.mozilla.org/en/datasets).
Initial Data Collection and Normalization
We have selected speakers who have recorded more than 1200 seconds of speech in the Catalan set of the version 13 of the Common Voice corpus.
Who are the source language producers?
The original data comes from the Catalan sentences of the Common Voice corpus.
Annotations
Annotation process
Starting with version 13 of the Common Voice corpus we identified the speakers (273) who have recorded more than 1200 seconds of speech.
A team of three annotators was tasked with annotating:
- if all the recordings correspond to the same person
- the gender of the speaker
- the accent of the speaker
- the quality of the recording
They conducted an initial round of annotation, discussed their varying opinions, and subsequently conducted a second round.
We release the complete annotations because transparency is fundamental to our project. Furthermore, we believe they hold philological value for studying dialectal and gender variants.
Who are the annotators?
The annotation was entrusted to the [CLiC (Centre de Llenguatge i Computació)](https://clic.ub.edu/en/que-es-clic) team from the University of Barcelona.
They selected a group of three annotators (two men and one woman), who received a scholarship to do this work.
The annotation team was composed of:
- Annotator 1: 1 female annotator, aged 18-25, L1 Catalan, student in the Modern Languages and Literatures degree, with a focus on Catalan.
- Annotators 2 & 3: 2 male annotators, aged 18-25, L1 Catalan, students in the Catalan Philology degree.
- 1 female supervisor, aged 40-50, L1 Catalan, graduate in Physics and in Linguistics, Ph.D. in Signal Theory and Communications.
To do the annotation they used a Google Drive spreadsheet
Personal and Sensitive Information
The Common Voice dataset consists of people who have donated their voice online. We don't share here their voices, but their gender and accent.
You agree to not attempt to determine the identity of speakers in the Common Voice dataset.
Considerations for Using the Data
Social Impact of Dataset
The ID come from the Common Voice dataset, that consists of people who have donated their voice online.
You agree to not attempt to determine the identity of speakers in the Common Voice dataset.
The information from this corpus will allow us to train and evaluate well balanced Catalan ASR models. Furthermore, we believe they hold philological value for studying dialectal and gender variants.
Discussion of Biases
Most of the voices of the common voice in Catalan correspond to men with a central accent between 40 and 60 years old. The aim of this dataset is to provide information that allows to minimize the biases that this could cause.
For the gender annotation, we have only considered "H" (male) and "D" (female).
Other Known Limitations
[N/A]
Additional Information
Dataset Curators
Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)
This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.
Licensing Information
This dataset is licensed under a CC BY 4.0 license.
It can be used for any purpose, whether academic or commercial, under the terms of the license.
Give appropriate credit, provide a link to the license, and indicate if changes were made.
Citation Information
Contributions
The annotation was entrusted to the STeL team from the University of Barcelona.
Files
expert_annotations_catalan_common_voice_v13.zip
Files
(185.6 kB)
| Name | Size | Download all |
|---|---|---|
|
md5:489224749e26cc6d6093dc665f214bb8
|
185.6 kB | Preview Download |