Dataset Open Access

ParlamentParla - Speech corpus of Catalan Parliamentary sessions

Külebi, Baybars

DataCite XML Export

<?xml version='1.0' encoding='utf-8'?>
<resource xmlns:xsi="" xmlns="" xsi:schemaLocation="">
  <identifier identifierType="DOI">10.5281/zenodo.5541827</identifier>
      <creatorName>Külebi, Baybars</creatorName>
      <affiliation>Col·lectivaT SCCL</affiliation>
    <title>ParlamentParla - Speech corpus of Catalan Parliamentary sessions</title>
    <subject>speech recognition</subject>
    <subject>parliamentary sessions</subject>
    <date dateType="Issued">2021-10-05</date>
  <resourceType resourceTypeGeneral="Dataset"/>
    <alternateIdentifier alternateIdentifierType="url"></alternateIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsVersionOf">10.5281/zenodo.5541826</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="URL" relationType="IsPartOf"></relatedIdentifier>
    <rights rightsURI="">Creative Commons Attribution 4.0 International</rights>
    <rights rightsURI="info:eu-repo/semantics/openAccess">Open Access</rights>
    <description descriptionType="Abstract">&lt;p&gt;This is the &lt;a href=""&gt;ParlamentParla&lt;/a&gt; speech corpus for Catalan prepared by &lt;a href=""&gt;Col&amp;middot;lectivaT&lt;/a&gt;. The audio segments were extracted from recordings of the Catalan Parliament (&lt;a href=""&gt;Parlament de Catalunya&lt;/a&gt;) plenary sessions, which took place between 2007/07/11 and 2018/07/17. We aligned the transcriptions with the recordings and extracted the corpus. The content belongs to the Catalan Parliament and the data is released in accordance with their &lt;a href=""&gt;terms of use&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Preparation of this corpus was partly supported by the &lt;a href=""&gt;Department of Culture&lt;/a&gt; of the Catalan autonomous government, and v2.0 was supported by the Barcelona Supercomputing Center, within the framework of the &lt;a href=""&gt;AINA&lt;/a&gt; project of the &lt;a href=""&gt;Departament de Pol&amp;iacute;tiques Digitals&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As of v2.0 the corpus is separated into 211 hours of clean and 400 hours of other quality segments. Furthermore, each speech segment is tagged with its speaker and each speaker with their gender. The statistics are detailed in the readme file.&lt;/p&gt;

&lt;p&gt;For more information, go to &lt;a href=""&gt;&lt;/a&gt; or mail&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Revision log:&lt;/strong&gt;&lt;/p&gt;

	&lt;p&gt;&lt;em&gt;2.0:&lt;/em&gt; Major changes in the file structure; speaker ids with respective&lt;br&gt;
	genders added. The speakers of train, test and dev corpora do not overlap.&lt;br&gt;
	A major increase in size with a total time of 611 hours 43 minutes.&lt;/p&gt;
	&lt;p&gt;&lt;em&gt;1.0:&lt;/em&gt; Much better quality due to improved segmentation, corpus separated&lt;br&gt;
	into clean and other.&lt;/p&gt;
	&lt;p&gt;&lt;em&gt;0.2:&lt;/em&gt; First public release of approx. 320 hours.&lt;/p&gt;