The preparation of this corpus was partly supported by the Department of Culture of the Catalan autonomous government. Version 2.0 was supported by the Barcelona Supercomputing Center, within the framework of the AINA project of the Departament de Polítiques Digitals.
As of v2.0, the corpus is separated into 211 hours of clean and 400 hours of other-quality segments. In addition, each speech segment is tagged with its speaker, and each speaker with their gender. Detailed statistics are given in the readme file.
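As a minimal sketch of how the speaker and gender tags might be used, the Python snippet below sums segment durations per quality class and per gender, and checks that no speaker appears in more than one split. The records and field names (speaker, gender, quality, split, seconds) are hypothetical and do not reflect the corpus's actual file layout; consult the readme for the real structure.

```python
from collections import defaultdict

# Hypothetical segment records. The field names and values are assumptions
# made for illustration, not the corpus's actual metadata schema.
segments = [
    {"speaker": "spk_a", "gender": "F", "quality": "clean", "split": "train", "seconds": 5400.0},
    {"speaker": "spk_b", "gender": "M", "quality": "other", "split": "train", "seconds": 3600.0},
    {"speaker": "spk_c", "gender": "F", "quality": "clean", "split": "dev", "seconds": 1800.0},
    {"speaker": "spk_d", "gender": "M", "quality": "other", "split": "test", "seconds": 900.0},
]


def hours_by(records, key):
    """Sum segment durations, in hours, grouped by the given field."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["seconds"] / 3600.0
    return dict(totals)


def overlapping_speakers(records):
    """Return the set of speakers that appear in more than one split."""
    splits_per_speaker = defaultdict(set)
    for record in records:
        splits_per_speaker[record["speaker"]].add(record["split"])
    return {spk for spk, splits in splits_per_speaker.items() if len(splits) > 1}


quality_hours = hours_by(segments, "quality")
gender_hours = hours_by(segments, "gender")
shared = overlapping_speakers(segments)

print(quality_hours)
print(gender_hours)
print("speakers in more than one split:", shared)
```

An empty set from `overlapping_speakers` indicates that the splits are speaker-disjoint for the records supplied.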
For more information, go to https://github.com/CollectivaT-dev/ParlamentParla or mail email@example.com.
2.0: Major changes to the file structure; speaker IDs with their respective
genders added. The speakers in the train, test and dev sets do not overlap.
Major increase in size, to a total duration of 611 hours 43 minutes.
1.0: Much better quality due to improved segmentation; corpus separated
into clean and other.
0.2: First public release of approx. 320 hours.