Acoustic models of Brazilian Portuguese Speech based on Neural Transformers - Pretraining Datasets raw audios from CORAA
Creators
- Matheus Gauy, Marcelo (1)
- Finger, Marcelo (1)
- Aluisio, Sandra Maria (1)
- Svartman, Flaviane Romani Fernandes (1)
- Candido Junior, Arnaldo (2)
- Casanova, Edresson (1)
- Leite, Marli Quadros (1)
- Soares, Anderson (3)
- Oliveira, Frederico Santos de (3)
- Oliveira, Lucas (4)
- Fernandes Jr, Ricardo (4)
- Silva, Daniel da (4)
- Fayet, Fernando Gorgulho (1)
- Carlotto, Bruno Baldissera (1)
- Gris, Lucas R (4)
- Santos, Vinícius Gonçalves dos (1)
- 1. Universidade de São Paulo - USP
- 2. Universidade Estadual Paulista - UNESP
- 3. Universidade Federal de Goiás - UFG
- 4. Universidade Tecnológica Federal do Paraná - UTFPR
Description
This repository contains all the pretraining datasets used in the paper "Acoustic models of Brazilian Portuguese Speech based on Neural Transformers" by Marcelo Gauy and Marcelo Finger. The datasets belong to a collection produced by the TaRSila project (see https://sites.google.com/view/tarsila-c4ai). Part of the audio published here was also released, with annotations and transcriptions, as the CORAA dataset (see https://github.com/nilc-nlp/CORAA). This record provides the original raw audio, without transcriptions, from the following datasets: ALIP, C-Oral, SP2010, NURC-Recife, NURC-São Paulo and Programa Certas Palavras. In total, the datasets contain about 800 hours of Brazilian Portuguese speech.
The audio has been converted to MP3 to make uploading easier. ALIP, C-Oral and SP2010 are each contained in a single file. Programa Certas Palavras and NURC-Recife are split into 3 parts each, while NURC-São Paulo (NURC-SP) is split into 7 parts of roughly equal size. More information on the datasets can be found in the paper "Acoustic models of Brazilian Portuguese Speech based on Neural Transformers" and in the original references for each dataset.
Files
(31.7 GB)
MD5 checksum | Size |
---|---|
53a53dacaaa9058c65c857390d1d1baa | 1.1 GB |
279eaf14d13a7e98794c99ff7e8c429d | 228.2 MB |
67062451c2dc4b98512d75b829adcb06 | 886.4 MB |
bbfb669a953f2037d3ad572489308dfd | 2.0 GB |
a8aec28a9d43828c9e452246f2dd9031 | 266.7 MB |
e2cc26c966308439b3e932cbd928c95f | 2.6 GB |
3eedba20947f2cc05d95aee9acec9a49 | 2.7 GB |
7891b019a599d9ecac6263734b0d8c19 | 2.9 GB |
c7792fdad239af7505a54c4e8cfb6abe | 2.8 GB |
f807235c0996acc05783221d415e55bb | 2.8 GB |
9e204447cb78eb8b1d5c69cbfa1b3d5b | 2.9 GB |
cd3985f1ad278f57eee43f48c712e78f | 2.8 GB |
a9ba4a060b4f471fda0d1f081fe04046 | 401.2 MB |
81a6381cfc801d0704d2f913890dae26 | 3.3 GB |
e8ac8419393d8f4616bc5bfd876c99af | 3.2 GB |
5dd53bc3f92dc225595b024fe594ea72 | 682.2 MB |
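Since several of the archives are a few gigabytes each, it may be worth checking a download against the MD5 checksum listed above before unpacking it. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify_download` and the file name in the usage note are hypothetical, not part of this record.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Compare a file's MD5 digest with a checksum copied from this page.

    Accepts the checksum with or without a leading "md5:" prefix.
    """
    expected = expected_md5.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[4:]
    return md5_of_file(path) == expected
```

For example, `verify_download("downloaded_part.zip", "md5:53a53dacaaa9058c65c857390d1d1baa")` would return `True` only if the local file (whose name here is a placeholder) matches the first checksum in the list.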
Additional details
Related works
- Is supplemented by
- Dataset: 10.5281/zenodo.6672451 (DOI)
References
- Gauy, Marcelo and Finger, Marcelo (2022). Acoustic models of Brazilian Portuguese Speech based on Neural Transformers.
- Mendes, R. B. (2013). Projeto SP2010: Amostra da fala paulistana. Available at <http://projetosp2010.fflch.usp.br>.
- Gonçalves, S. C. L. (2019). Projeto ALIP (Amostra Linguística do Interior Paulista) e banco de dados Iboruna: 10 anos de contribuição com a descrição do português brasileiro. Estudos Linguísticos (São Paulo. 1978), v. 48, p. 276-297.
- Raso, T.; Mello, H. (2012). The C-ORAL-BRASIL I: Reference Corpus for Informal Spoken Brazilian Portuguese. Lecture Notes on Artificial Intelligence, v. 7243, p. 362-368.
- Oliveira Jr., M. (2016). NURC Digital: Um protocolo para a digitalização, anotação, arquivamento e disseminação do material do Projeto da Norma Urbana Linguística Culta (NURC). CHIMERA: Revista de Corpus de Lenguas Romances y Estudios Lingüísticos, 3(2), 149–174. Retrieved from https://revistas.uam.es/chimera/article/view/6519.
- Castilho, Ataliba Teixeira de and Pretti, Dino (1986). A linguagem falada culta na cidade de São Paulo: materiais para seu estudo.
- Teixeira, Carmem Silva P. (1997). Acervo Certas Palavras: Catálogo 1981-1996.