Nos_TranscriSpeech-GL: Galician ASR corpus

Vladu, Adina Ioana; Vázquez Abuín, Marta; Fernández Rei, Elisa; García Díaz, Noelia; Vidal Miguéns, Adrián; Magariños, Carmen

doi:10.5281/zenodo.7717140

Published March 10, 2023 | Version 1.0.0.

Dataset Restricted

1. Universidade de Santiago de Compostela

The record for this corpus is publicly visible; the files are available after requesting access and accepting the Terms and Conditions.

(Galician description below)

Nos_TranscriSpeech-GL is a manually transcribed and speech-to-text aligned Galician ASR corpus containing 53 hours of multi-domain speech. The corpus is intended both for automatic speech recognition tasks and for linguistic research.

The corpus is divided into four subcorpora according to the type of audio: conferences, debates, speeches, and interviews.

The file naming scheme of both the audio and the transcription files consists of an ID indicating the speaker, followed, if necessary, by a number indicating successive audios by the same speaker. Parts of the same audio are marked by a number separated by an underscore from the speaker ID (e.g., Alberti1_1.wav).
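As an illustration of the naming scheme described above, the following minimal sketch splits a file name into its parts. The regular expression is an assumption inferred from the single example `Alberti1_1.wav` (alphabetic speaker ID, optional recording number, optional `_<part>` suffix); the names `parse_name` and `CorpusName` are hypothetical, and real file names in the corpus may need a looser pattern.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: letters-only speaker ID, optional recording number,
# optional "_<part>" suffix, then a file extension.
NAME_RE = re.compile(
    r"^(?P<speaker>[^\W\d_]+)(?P<recording>\d+)?(?:_(?P<part>\d+))?\.(?P<ext>\w+)$"
)


class CorpusName(NamedTuple):
    speaker: str               # speaker ID, e.g. "Alberti"
    recording: Optional[int]   # successive audio by the same speaker, if any
    part: Optional[int]        # part of the same audio, if any
    extension: str             # "wav", "stm", "trs", ...


def parse_name(filename: str) -> CorpusName:
    """Split a corpus file name into speaker, recording number and part."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    recording = match.group("recording")
    part = match.group("part")
    return CorpusName(
        speaker=match.group("speaker"),
        recording=int(recording) if recording else None,
        part=int(part) if part else None,
        extension=match.group("ext"),
    )
```

Grouping files by `speaker` and `recording` is one way to pair each audio part with its transcription.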

The audio files are released in 44.1 kHz 16-bit WAV format and the transcriptions are available in .stm and .trs. Moreover, the corpus is accompanied by the corresponding speaker metadata and the guide detailing the conventions used for the manual transcription.
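A quick way to confirm that a downloaded file matches the stated audio format is to read its WAV header. The sketch below uses only the Python standard library; the function name `check_wav` and the returned keys are illustrative choices, not part of the corpus tooling.

```python
import wave


def check_wav(source, expected_rate=44100, expected_bits=16):
    """Read the WAV header and compare it with the documented format.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8          # sample width is given in bytes
        channels = wav.getnchannels()
        seconds = wav.getnframes() / rate
    return {
        "rate_ok": rate == expected_rate,
        "bits_ok": bits == expected_bits,
        "channels": channels,
        "seconds": seconds,
    }
```

The channel count is reported rather than checked, since the description does not state whether the recordings are mono or stereo.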

The dataset includes detailed linguistic annotation. Transcriptions were carried out following the included guideline, with conventions such as: orthographic transcription in standard Galician, marking of dialectal, non-standard and non-Galician forms, annotation of paralinguistic and oral features (e.g. laughter, pauses, overlaps, interruptions, singing), explicit marking of speaker turns and audience interventions, and standardized representation of numerals and acronyms.

Funding and acknowledgements

“The Nós project: Galician in the society and economy of Artificial Intelligence” is possible thanks to the funding resulting from the agreement 2021-CP080 between the Xunta de Galicia and the University of Santiago de Compostela, and thanks to the Investigo program, within the National Recovery, Transformation and Resilience Plan, within the framework of the European Recovery Fund (NextGenerationEU).

We would like to thank the Corpus Oral Informatizado da Lingua Galega (CORILGA) project for their kind collaboration in providing the original audio data.

For more information, please go to https://nos.gal/ or contact the Nós project at proxecto.nos@usc.gal.

Terms and Conditions

By accessing and using this dataset, you agree to comply with all applicable laws and ethical standards regarding the protection of individual rights. The dataset contains voice files, transcripts, and metadata, including participant identity information, provided solely for research and development purposes. Users are strictly prohibited from using the dataset in any way that infringes upon the rights, privacy, or dignity of any individual represented within it. Any misuse, including but not limited to attempts to engage in discriminatory, harmful, or unlawful activities, is expressly forbidden.

Hugging Face version

A version of this dataset, without linguistic annotations and optimized for ASR model training, is also available on Hugging Face: https://huggingface.co/datasets/proxectonos/Nos_Transcrispeech-GL.

----------------------------------------------------------

TL;DR data sheet:

Access level: Restricted access.

The dataset is available upon request through Zenodo. Users must request access, accept the Terms and Conditions, and use the dataset only for research and development purposes compatible with the protection of individual rights, privacy and dignity.

Authentication: Zenodo account required.

Access protocol: HTTPS via Zenodo.

Data content

The dataset content is available through the files attached to this Zenodo record:

- Dataset landing page: https://doi.org/10.5281/zenodo.7717140
- Zenodo record: https://zenodo.org/records/7717140
- Audio files: https://zenodo.org/records/7717140/files/AUDIOS.zip
- Segment-aligned transcript files: https://zenodo.org/records/7717140/files/TRANSCRICIONS.zip

The restricted-access archive contains:

- Audio files in WAV (WAVE Audio File Format), 44.1 kHz, 16-bit.
- Manual transcriptions in TRS format.
- Segment-level transcriptions in STM format.
- Speaker metadata files.
- Documentation and transcription guidelines.
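For readers unfamiliar with the TRS format, the following sketch extracts time-aligned segments from a Transcriber XML document. It assumes the usual Transcriber layout, in which `Turn` elements carry `speaker`, `startTime` and `endTime` attributes and `Sync` elements mark segment boundaries inside a turn. The sample document, speaker ID and `Event` description are invented for illustration and are not taken from the corpus; the corpus-specific annotation tags are defined in the accompanying transcription guide.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, not taken from the corpus.
SAMPLE_TRS = """<Trans><Episode><Section type="report" startTime="0" endTime="5.0">
<Turn speaker="spk1" startTime="0" endTime="5.0">
<Sync time="0"/>
bo día a todos
<Sync time="2.5"/>
moitas <Event desc="laughter"/> grazas
</Turn></Section></Episode></Trans>"""


def trs_segments(xml_text):
    """Return one dict per Sync-delimited segment of every Turn."""
    root = ET.fromstring(xml_text)
    segments = []
    for turn in root.iter("Turn"):
        speaker = turn.get("speaker", "")
        turn_end = float(turn.get("endTime"))
        current = None
        for child in turn:
            if child.tag == "Sync":
                time = float(child.get("time"))
                if current is not None:
                    current["end"] = time      # previous segment ends here
                current = {"speaker": speaker, "start": time,
                           "end": turn_end, "text": ""}
                segments.append(current)
            # Transcribed text follows each child element as its "tail".
            if current is not None and child.tail:
                joined = current["text"] + " " + child.tail
                current["text"] = " ".join(joined.split())
    return segments
```

This sketch drops inline annotation elements such as `Event` and keeps only the surrounding text; for files on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`.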

Main data formats:
- `.wav`: audio recordings.
- `.trs`: Transcriber XML transcription files.
- `.stm`: Segment Time Mark files for ASR evaluation.
- `.tsv` / `.csv`: tabular metadata, when provided.
- `.pdf` / `.md`: documentation files.
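The STM files can be read line by line. The sketch below assumes the conventional Segment Time Mark layout (recording ID, channel, speaker, start time, end time, an optional `<...>` label, then the transcript) and treats lines starting with `;;` as comments. The sample line used in the usage note is invented, and the exact fields in this corpus should be checked against the distributed files.

```python
from typing import NamedTuple, Optional


class StmSegment(NamedTuple):
    recording: str
    channel: str
    speaker: str
    start: float           # seconds
    end: float             # seconds
    label: Optional[str]   # optional "<...>" field, kept verbatim
    text: str


def parse_stm_line(line: str) -> Optional[StmSegment]:
    """Parse one STM line; return None for blank lines and ';;' comments."""
    line = line.strip()
    if not line or line.startswith(";;"):
        return None
    parts = line.split(maxsplit=5)
    if len(parts) < 5:
        raise ValueError(f"malformed STM line: {line!r}")
    recording, channel, speaker, start, end = parts[:5]
    rest = parts[5] if len(parts) > 5 else ""
    label = None
    if rest.startswith("<") and ">" in rest:
        label, _, rest = rest.partition(">")
        label += ">"
        rest = rest.strip()
    return StmSegment(recording, channel, speaker,
                      float(start), float(end), label, rest)
```

For example, `parse_stm_line("Alberti1_1 1 Alberti 0.00 2.50 <o,f0,male> bo día a todos")` yields a segment from 0.0 to 2.5 seconds with the label kept as a raw string.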

Data object identifiers

The dataset contains the following logical data objects:

- original and curated audio files in WAV format.
- manually aligned transcriptions in Transcriber TRS format.
- segment-level STM transcriptions.
- speaker-level and file-level metadata.
- documentation of transcription and annotation conventions.

These objects are contained in the restricted-access Zenodo record identified by DOI: 10.5281/zenodo.7717140.

Metadata standards and vocabularies

The dataset metadata are described using concepts compatible with:

- DataCite Metadata Schema, for citation metadata, creators, publication date, DOI and related identifiers.
- Dublin Core, for title, description, language, rights and publisher.
- Schema.org/Dataset, for web-indexable dataset metadata.
- ISO 639-3, for language identification: `glg` = Galician.
- Creative Commons Rights Expression Language, for license information.
- PROV-O-compatible provenance concepts, for describing the creation, curation and publication process.
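To show how these vocabularies fit together, the sketch below serializes the citation details stated in this record as a Schema.org `Dataset` description in JSON-LD. Only values that appear in this record are used; the choice of properties is illustrative and is not the metadata export produced by Zenodo. Note that Schema.org's `inLanguage` is normally given as an IETF BCP 47 tag (`gl` for Galician); the ISO 639-3 code `glg` is used here to match the list above.

```python
import json

AUTHORS = [
    "Vladu, Adina Ioana",
    "Vázquez Abuín, Marta",
    "Fernández Rei, Elisa",
    "García Díaz, Noelia",
    "Vidal Miguéns, Adrián",
    "Magariños, Carmen",
]

record = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Nos_TranscriSpeech-GL: Galician ASR corpus",
    "identifier": "https://doi.org/10.5281/zenodo.7717140",
    "url": "https://zenodo.org/records/7717140",
    "version": "1.0.0",
    "datePublished": "2023-03-10",
    "inLanguage": "glg",  # ISO 639-3, as listed above; BCP 47 would be "gl"
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": [{"@type": "Person", "name": name} for name in AUTHORS],
}

# ensure_ascii=False keeps accented names readable in the output.
jsonld = json.dumps(record, ensure_ascii=False, indent=2)
```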

License

License: Creative Commons Attribution 4.0 International (CC BY 4.0).

Access conditions

The dataset is under restricted access because it contains voice recordings, transcripts and speaker metadata. Access is granted upon request through Zenodo after acceptance of the Terms and Conditions.

Ethical conditions of reuse

Users must not use the dataset in ways that infringe the rights, privacy or dignity of the individuals represented in the data. Misuse for discriminatory, harmful, unlawful, surveillance or re-identification purposes is prohibited.

Related resources

- Hugging Face version: version optimized for ASR training, without linguistic annotations.
  Relation: `HasVersion` / `IsVariantFormOf`.

- Proxecto Nós Zenodo community.
  Relation: `IsPartOf`.

-------------------------------------------------------------

Nos_TranscriSpeech-GL é un corpus ASR galego transcrito e aliñado manualmente, que contén 53 horas de fala multidominio. O corpus está pensado tanto para tarefas de recoñecemento automático da fala como para investigación lingüística.

Este corpus divídese en catro subcorpus segundo o tipo de audio: conferencias, debates, discursos e entrevistas.

O esquema de nomenclatura dos ficheiros de audio e de transcrición consiste nun ID que indica o falante, seguido, se é necesario, dun número que indica os audios sucesivos do mesmo falante. As partes do mesmo audio márcanse cun número separado por un guión baixo do ID do falante (por exemplo: Alberti1_1.wav).

Os ficheiros de audio publícanse en formato WAV de 16 bits e 44,1 kHz e as transcricións están dispoñibles en .stm e .trs. Ademais, o corpus vai acompañado dos metadatos correspondentes dos falantes e da guía que detalla as convencións empregadas para a transcrición manual.

O conxunto de datos inclúe anotación lingüística detallada. As transcricións leváronse a cabo seguindo a guía incluída, con convencións como: transcrición ortográfica en galego estándar, marcaxe de formas dialectais, non estándar e non galegas, anotación de características paralingüísticas e típicas da oralidade (por exemplo: risas, pausas, solapamentos, interrupcións, canto), marcaxe explícita das quendas de fala dos falantes e das intervencións do público e representación normalizada de números e acrónimos.

Financiamento e agradecementos

“O Proxecto Nós: o galego na sociedade e na economía da Intelixencia Artificial” desenvolveuse grazas ao financiamento resultante do acordo 2021-CP080 entre a Xunta de Galicia e a Universidade de Santiago de Compostela, e grazas ao programa Investigo, dentro do Plan Nacional de Recuperación, Transformación e Resiliencia, no marco do Fondo Europeo de Recuperación (NextGenerationEU).

Gustaríanos agradecer ao proxecto Corpus Oral Informatizado da Lingua Galega (CORILGA) a súa amable colaboración ao proporcionar os datos sonoros orixinais.

Para máis información, visite https://nos.gal/ ou póñase en contacto co Proxecto Nós en proxecto.nos@usc.gal.

Termos e Condicións

Ao acceder e usar este conxunto de datos, acepta cumprir todas as leis e normas éticas aplicables relativas á protección dos dereitos individuais. O conxunto de datos contén arquivos de voz, transcricións e metadatos, incluída información sobre a identidade dos participantes, proporcionados exclusivamente con fins de investigación e desenvolvemento. Está estritamente prohibido empregar o conxunto de datos de xeito que vulnere os dereitos, a privacidade ou a dignidade de calquera individuo representado nel. Calquera uso indebido, incluíndo, entre outros, as posibles actividades discriminatorias, prexudiciais ou ilícitas, está expresamente prohibido.

Versión en Hugging Face

Unha versión deste conxunto de datos, sen anotación lingüística e optimizada para o adestramento de modelos de recoñecemento automático da fala, pódese atopar en Hugging Face: https://huggingface.co/datasets/proxectonos/Nos_Transcrispeech-GL.


Terms and Conditions

This document is a legal contract between you and the dataset provider. By accessing and using the dataset, you acknowledge and agree to abide by the Terms and Conditions.

Our team may take a few days to process your access request. Thank you in advance for your patience.
