USPDATRO: Underrepresented Speech Dataset from Romanian language Open Data

doi:10.5281/zenodo.7898233

Published May 5, 2023 | Version v1

Dataset Open

1. Research Institute for Artificial Intelligence "Mihai Drăgănescu", Romanian Academy

USPDATRO
==========

Underrepresented Speech Dataset from Open Data: Case Study on the Romanian Language (USPDATRO) is a manually created Romanian language speech corpus.
It was built specifically from speech types that are underrepresented in other speech datasets.
The audio comes from open data published on multimedia platforms under Creative Commons licenses.
The data was manually transcribed and aligned at segment level.
In addition to the text and audio files, we offer text annotations (lemmatization, part of speech tags, dependency parsing) in CoNLL-U Plus format.
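
As an illustration of how the CoNLL-U Plus annotations might be read, the following is a minimal sketch in Python. It relies only on the general CoNLL-U Plus conventions (an optional `# global.columns = ...` header, tab-separated fields, blank lines between sentences). The function name `parse_conllup` and the sample content are hypothetical; the actual columns used in the USPDATRO files are not stated here and may differ.

```python
from typing import Dict, List

# Standard CoNLL-U columns, used only if a file lacks a global.columns header.
DEFAULT_COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                   "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_conllup(text: str) -> List[List[Dict[str, str]]]:
    """Parse CoNLL-U Plus text into sentences, each a list of token dicts."""
    columns = list(DEFAULT_COLUMNS)
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if line.startswith("# global.columns"):
            # The header names the columns, separated by whitespace.
            columns = line.split("=", 1)[1].split()
        elif not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            # Other comment lines (sentence-level metadata) are skipped here.
            continue
        else:
            current.append(dict(zip(columns, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


# Invented sample for illustration only; not taken from the corpus.
SAMPLE = "\n".join([
    "# global.columns = ID FORM LEMMA UPOS HEAD DEPREL",
    "# text = Bună ziua",
    "\t".join(["1", "Bună", "bun", "ADJ", "2", "amod"]),
    "\t".join(["2", "ziua", "zi", "NOUN", "0", "root"]),
    "",
])

sentences = parse_conllup(SAMPLE)
lemmas = [token["LEMMA"] for token in sentences[0]]
```

Each token is returned as a dictionary keyed by column name, so any extra columns declared in the header are preserved without changes to the code.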

Each data source is listed by URL in the metadata.csv file, together with its license (a Creative Commons variant).
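
Because each source carries its own license, it can be useful to summarize metadata.csv by license before reusing the audio. The sketch below is one possible way to do that in Python. The column names in the sample (`segment`, `url`, `license`) and the comma delimiter are assumptions made for illustration; check the header of the real file and pass the actual column name.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_text: str, column: str) -> Counter:
    """Count how many rows share each value of the given column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in header")
    return Counter(row[column] for row in reader)


# Invented rows with hypothetical column names; not taken from the corpus.
SAMPLE_METADATA = "\n".join([
    "segment,url,license",
    "seg1,https://example.org/a,CC BY 4.0",
    "seg2,https://example.org/a,CC BY 4.0",
    "seg3,https://example.org/b,CC BY-SA 4.0",
])

license_counts = count_by_column(SAMPLE_METADATA, "license")
```

For the real file, the text could be obtained with `Path("metadata.csv").read_text(encoding="utf-8")`, assuming UTF-8 encoding.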

Dataset structure:
- audio: Folder with audio segments in WAV format
- text: Folder with corresponding transcriptions
- conllup: Folder with corresponding token-based annotations
- metadata.csv: Contains information about each segment
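
The folder layout above suggests that each segment has one file per folder. A minimal Python sketch for pairing them follows. It assumes that corresponding files share the same base name across the audio, text and conllup folders, which is not stated explicitly here; the function names are hypothetical.

```python
from pathlib import Path
from typing import Dict

FOLDERS = ("audio", "text", "conllup")


def index_segments(root: Path) -> Dict[str, Dict[str, Path]]:
    """Map each file stem to the files found for it in the three folders."""
    segments: Dict[str, Dict[str, Path]] = {}
    for kind in FOLDERS:
        folder = root / kind
        if not folder.is_dir():
            continue  # tolerate a missing folder instead of failing
        for path in sorted(folder.iterdir()):
            if path.is_file():
                segments.setdefault(path.stem, {})[kind] = path
    return segments


def complete_segments(segments: Dict[str, Dict[str, Path]]) -> Dict[str, Dict[str, Path]]:
    """Keep only segments that have audio, text and annotation files."""
    return {stem: files for stem, files in segments.items()
            if len(files) == len(FOLDERS)}
```

Calling `complete_segments(index_segments(Path("uspdatro")))` on the extracted archive (the directory name is an assumption) would return only the segments for which all three files exist, which makes gaps easy to spot.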

LICENSING

This work (transcriptions, alignment, metadata, annotations) is provided under the license CC BY-NC-SA 4.0 (Attribution-NonCommercial-ShareAlike 4.0 International).
The license can be viewed online here: https://creativecommons.org/licenses/by-nc-sa/4.0/
and the full text here: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode .
The original works used as audio sources remain available under their respective licenses (Creative Commons variants), as listed in the metadata.csv file.

CONTACT

Research Institute for Artificial Intelligence "Mihai Drăgănescu", Romanian Academy
Web: http://www.racai.ro
Contact email: vasile@racai.ro

Files

Name	Size	MD5
uspdatro.zip	424.7 MB	ac56ff52e57f896db2d28d8a01f500f8
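
The MD5 checksum listed for uspdatro.zip can be used to confirm that a download is intact. The following Python sketch uses the standard `hashlib` module; the function names and the local file path are assumptions for illustration.

```python
import hashlib
from pathlib import Path

# Checksum published alongside uspdatro.zip.
EXPECTED_MD5 = "ac56ff52e57f896db2d28d8a01f500f8"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path: Path, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_archive(Path("uspdatro.zip"))` should return True for an uncorrupted download saved under that name. Reading in chunks keeps memory use low for the 424.7 MB archive.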
