Acoustic models of Brazilian Portuguese Speech based on Neural Transformers - Pretraining Datasets raw audios from CORAA

Matheus Gauy, Marcelo; Finger, Marcelo; Aluisio, Sandra Maria; Svartman, Flaviane Romani Fernandes; Candido Junior, Arnaldo; Casanova, Edresson; Leite, Marli Quadros; Soares, Anderson; Oliveira, Frederico Santos de; Oliveira, Lucas; Fernandes Jr, Ricardo; Silva, Daniel da; Fayet, Fernando Gorgulho; Carlotto, Bruno Baldissera; Gris, Lucas R; Santos, Vinícius Gonçalves dos

doi:10.5281/zenodo.6794924

Published July 4, 2022 | Version v1

Dataset Open

1. Universidade de São Paulo - USP
2. Universidade Estadual Paulista - UNESP
3. Universidade Federal de Goiás - UFG
4. Universidade Tecnológica Federal do Paraná - UTFPR

This repository contains all the pretraining datasets used in the paper "Acoustic models of Brazilian Portuguese Speech based on Neural Transformers" by Marcelo Gauy and Marcelo Finger. They belong to a collection of datasets from the TaRSila project (see https://sites.google.com/view/tarsila-c4ai). Part of the audio published here was also released, with annotations and transcriptions, as the CORAA dataset (see https://github.com/nilc-nlp/CORAA). Here we publish the original raw audio, without transcriptions, from the following datasets: ALIP, C-Oral, SP2010, NURC-Recife, NURC-São Paulo and Programa Certas Palavras. In total, the datasets contain about 800 hours of Brazilian Portuguese speech.

The audio files have been converted to mp3 to facilitate the upload. ALIP, C-Oral and SP2010 are each contained in a single file. Programa Certas Palavras and NURC-Recife are split into 3 parts each, while NURC-São Paulo (NURC-SP) is split into 7 parts of roughly equal size. More information on the datasets can be found in the paper "Acoustic models of Brazilian Portuguese Speech based on Neural Transformers" as well as in the original references for these datasets.
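Since each part is a gzip-compressed tar archive of mp3 files, unpacking can be sketched with Python's standard library. This is a minimal sketch, not the authors' tooling: the function names `list_mp3_members` and `extract_mp3s` are hypothetical, and the internal directory layout of the archives is not described on this page, so the code simply keeps whatever relative paths the archive contains.

```python
import shutil
import tarfile
from pathlib import Path


def list_mp3_members(archive_path):
    """Return the names of all regular .mp3 files inside a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return [
            member.name
            for member in tar
            if member.isfile() and member.name.lower().endswith(".mp3")
        ]


def extract_mp3s(archive_path, dest_dir):
    """Copy every .mp3 member into dest_dir, keeping its relative path.

    Members whose path would land outside dest_dir are skipped.
    """
    dest = Path(dest_dir).resolve()
    extracted = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if not (member.isfile() and member.name.lower().endswith(".mp3")):
                continue
            target = (dest / member.name).resolve()
            if dest not in target.parents:
                continue  # refuse paths that escape the destination
            target.parent.mkdir(parents=True, exist_ok=True)
            source = tar.extractfile(member)
            with source, open(target, "wb") as out:
                shutil.copyfileobj(source, out)
            extracted.append(target)
    return extracted
```

For example, `extract_mp3s("coral_mp3.tar.gz", "coral")` would unpack the C-Oral archive into a local `coral` directory (the destination name is an arbitrary choice).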

Files (31.7 GB)

Name	MD5	Size
ALIP_Corpus_mp3.tar.gz	53a53dacaaa9058c65c857390d1d1baa	1.1 GB
coral_mp3.tar.gz	279eaf14d13a7e98794c99ff7e8c429d	228.2 MB
NURC_RE_D2_mp3.tar.gz	67062451c2dc4b98512d75b829adcb06	886.4 MB
NURC_RE_DID_mp3.tar.gz	bbfb669a953f2037d3ad572489308dfd	2.0 GB
NURC_RE_EF_mp3.tar.gz	a8aec28a9d43828c9e452246f2dd9031	266.7 MB
nurcsp_mp3_1_FASE.tar.gz	e2cc26c966308439b3e932cbd928c95f	2.6 GB
nurcsp_mp3_2_FASE.tar.gz	3eedba20947f2cc05d95aee9acec9a49	2.7 GB
nurcsp_mp3_3_FASE_1.tar.gz	7891b019a599d9ecac6263734b0d8c19	2.9 GB
nurcsp_mp3_3_FASE_2.tar.gz	c7792fdad239af7505a54c4e8cfb6abe	2.8 GB
nurcsp_mp3_3_FASE_3.tar.gz	f807235c0996acc05783221d415e55bb	2.8 GB
nurcsp_mp3_3_FASE_4.tar.gz	9e204447cb78eb8b1d5c69cbfa1b3d5b	2.9 GB
nurcsp_mp3_3_FASE_5.tar.gz	cd3985f1ad278f57eee43f48c712e78f	2.8 GB
programa_certas_palavras_fitas_de_rolo.tar.gz	a9ba4a060b4f471fda0d1f081fe04046	401.2 MB
programa_certas_palavras_fitas_K7_1.tar.gz	81a6381cfc801d0704d2f913890dae26	3.3 GB
programa_certas_palavras_fitas_K7_2.tar.gz	e8ac8419393d8f4616bc5bfd876c99af	3.2 GB
sp2010_mp3.tar.gz	5dd53bc3f92dc225595b024fe594ea72	682.2 MB
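The MD5 checksums in the listing above can be used to confirm that a download completed intact. The following is a minimal sketch using Python's `hashlib`; the helper names `md5_of_file` and `verify_archive`, the `EXPECTED_MD5` dictionary, and the assumption that the archives sit in the current directory are all illustrative choices. Only two of the sixteen checksums are copied in; the rest can be added the same way.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected_md5):
    """Return True when the file's MD5 matches the expected checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Checksums copied from the file listing (two shown; extend as needed).
EXPECTED_MD5 = {
    "coral_mp3.tar.gz": "279eaf14d13a7e98794c99ff7e8c429d",
    "sp2010_mp3.tar.gz": "5dd53bc3f92dc225595b024fe594ea72",
}

if __name__ == "__main__":
    download_dir = Path(".")  # assumption: archives were saved here
    for name, expected in EXPECTED_MD5.items():
        archive = download_dir / name
        if not archive.exists():
            print(f"{name}: not found, skipping")
            continue
        status = "OK" if verify_archive(archive, expected) else "MISMATCH"
        print(f"{name}: {status}")
```

Reading in chunks keeps memory use flat, which matters here because several archives are close to 3 GB.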

Additional details

Related works

Is supplemented by: Dataset, doi:10.5281/zenodo.6672451

References

Gauy, Marcelo; Finger, Marcelo (2022). Acoustic models of Brazilian Portuguese Speech based on Neural Transformers.
Mendes, R. B. (2013). Projeto SP2010: Amostra da fala paulistana. Available at <http://projetosp2010.fflch.usp.br>.
Gonçalves, S. C. L. (2019). Projeto ALIP (Amostra Linguística do Interior Paulista) e banco de dados Iboruna: 10 anos de contribuição com a descrição do português brasileiro. Estudos Linguísticos (São Paulo. 1978), v. 48, p. 276-297.
Raso, T.; Mello, H. (2012). The C-ORAL-BRASIL I: Reference Corpus for Informal Spoken Brazilian Portuguese. Lecture Notes in Artificial Intelligence, v. 7243, p. 362-368.
Oliveira Jr., M. (2016). NURC Digital: Um protocolo para a digitalização, anotação, arquivamento e disseminação do material do Projeto da Norma Urbana Linguística Culta (NURC). CHIMERA: Revista de Corpus de Lenguas Romances y Estudios Lingüísticos, 3(2), 149–174. Retrieved from https://revistas.uam.es/chimera/article/view/6519.
Castilho, Ataliba Teixeira de; Preti, Dino (1986). A linguagem falada culta na cidade de São Paulo: materiais para seu estudo.
Teixeira, Carmem Silva P. (1997). Acervo Certas Palavras - Catálogo 1981-1996.

	All versions	This version
Views	941	937
Downloads	1,542	1,542
Data volume	3.3 TB	3.3 TB
