Published July 4, 2022 | Version v1
Dataset Open

Acoustic models of Brazilian Portuguese Speech based on Neural Transformers - Pretraining Datasets raw audios from CORAA

Description

This repository contains all the pretraining datasets used in the paper "Acoustic models of Brazilian Portuguese Speech based on Neural Transformers" by Marcelo Gauy and Marcelo Finger. The datasets are part of a collection from the TaRSila project (see https://sites.google.com/view/tarsila-c4ai). Some of the audio published here was also released, with annotations and transcriptions, as the CORAA dataset (see https://github.com/nilc-nlp/CORAA). Here we publish the original raw audio, without transcriptions, from the following datasets: ALIP, C-Oral, SP2010, NURC-Recife, NURC-São Paulo and Programa Certas Palavras. In total, the datasets contain about 800 hours of Brazilian Portuguese speech.

The audio has been converted to MP3 to facilitate the upload. ALIP, C-Oral and SP2010 are each contained in full in a single file. Programa Certas Palavras and NURC-Recife are split into 3 parts each, while NURC-São Paulo (NURC-SP) is split into 7 parts of roughly equal size. More information on the datasets can be found in the paper "Acoustic models of Brazilian Portuguese Speech based on Neural Transformers" and in the original references for each dataset, listed under References.

Files

Files (31.7 GB)

MD5 checksum                            Size
md5:53a53dacaaa9058c65c857390d1d1baa    1.1 GB
md5:279eaf14d13a7e98794c99ff7e8c429d    228.2 MB
md5:67062451c2dc4b98512d75b829adcb06    886.4 MB
md5:bbfb669a953f2037d3ad572489308dfd    2.0 GB
md5:a8aec28a9d43828c9e452246f2dd9031    266.7 MB
md5:e2cc26c966308439b3e932cbd928c95f    2.6 GB
md5:3eedba20947f2cc05d95aee9acec9a49    2.7 GB
md5:7891b019a599d9ecac6263734b0d8c19    2.9 GB
md5:c7792fdad239af7505a54c4e8cfb6abe    2.8 GB
md5:f807235c0996acc05783221d415e55bb    2.8 GB
md5:9e204447cb78eb8b1d5c69cbfa1b3d5b    2.9 GB
md5:cd3985f1ad278f57eee43f48c712e78f    2.8 GB
md5:a9ba4a060b4f471fda0d1f081fe04046    401.2 MB
md5:81a6381cfc801d0704d2f913890dae26    3.3 GB
md5:e8ac8419393d8f4616bc5bfd876c99af    3.2 GB
md5:5dd53bc3f92dc225595b024fe594ea72    682.2 MB
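
Since each file is listed with an MD5 checksum, a downloaded archive can be checked for corruption before use. The following is a minimal Python sketch of such a check, not a tool provided with this dataset. The function names `md5_of_file` and `verify_download` are hypothetical, and the file name in the usage comment is a placeholder, because the file names are not listed here.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected):
    """Compare a file's MD5 digest with a checksum copied from the file list.

    The expected value may carry the 'md5:' prefix used in the listing.
    """
    value = expected.strip().lower()
    if value.startswith("md5:"):
        value = value[len("md5:"):]
    return md5_of_file(path) == value


# Usage (placeholder file name, to be replaced with the downloaded file):
# ok = verify_download("downloaded_part.zip",
#                      "md5:53a53dacaaa9058c65c857390d1d1baa")
```

Reading in chunks keeps memory use constant regardless of archive size, which matters here because several parts are close to 3 GB.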

Additional details

Related works

Is supplemented by
Dataset: 10.5281/zenodo.6672451 (DOI)

References

  • Gauy, Marcelo; Finger, Marcelo (2022). Acoustic models of Brazilian Portuguese Speech based on Neural Transformers.
  • Mendes, R. B. (2013). Projeto SP2010: Amostra da fala paulistana. Available at <http://projetosp2010.fflch.usp.br>.
  • Gonçalves, S. C. L. (2019). Projeto ALIP (Amostra Linguística do Interior Paulista) e banco de dados Iboruna: 10 anos de contribuição com a descrição do português brasileiro. Estudos Linguísticos (São Paulo. 1978), v. 48, p. 276-297.
  • Raso, T.; Mello, H. (2012). The C-ORAL-BRASIL I: Reference Corpus for Informal Spoken Brazilian Portuguese. Lecture Notes in Artificial Intelligence, v. 7243, p. 362-368.
  • Oliveira Jr., M. (2016). NURC Digital: Um protocolo para a digitalização, anotação, arquivamento e disseminação do material do Projeto da Norma Urbana Linguística Culta (NURC). CHIMERA: Revista de Corpus de Lenguas Romances y Estudios Lingüísticos, 3(2), 149–174. Retrieved from https://revistas.uam.es/chimera/article/view/6519.
  • Castilho, Ataliba Teixeira de; Preti, Dino (1986). A linguagem falada culta na cidade de São Paulo: materiais para seu estudo.
  • Teixeira, Carmem Silva P. (1997). Acervo Certas Palavras - Catálogo 1981-1996.