Translation Augmented LibriSpeech Corpus

Kocabiyikoglu, Ali Can; Bérard, Alexandre; Besacier, Laurent; Kraif, Olivier

doi:10.5281/zenodo.6482585

Published July 9, 2022 | Version v1

Dataset Open

1. Berger-Levrault
2. Naver Labs
3. UGA

This is a large-scale (>200h), publicly available corpus of read audiobooks. It augments the LibriSpeech ASR Corpus (1000h)[1] and contains English utterances (from audiobooks) automatically aligned with French text. The dataset offers ~236h of speech aligned to translated text.

Overview of the corpus:
+----------+-------+--------------+----------------+
| Chapters | Books | Duration (h) | Total Segments |
+----------+-------+--------------+----------------+
|   1408   |  247  |     ~236     |     131395     |
+----------+-------+--------------+----------------+

The speech recordings and source texts originally come from the Gutenberg Project[2], a digital library of public domain books read by volunteers. Our augmentation of LibriSpeech is straightforward: we automatically aligned French e-books with the English utterances of LibriSpeech.

We gathered open-domain e-books in French and extracted the individual chapters that are available in the LibriSpeech Corpus. We then aligned the French chapters with the English utterances to obtain a corpus of speech recordings aligned with their translations. The corpus is licensed under a Creative Commons Attribution 4.0 License.

Further information on how the corpus was obtained can be found in [3].

Details on the 100h subset:
===========================

This 100h subset was specifically designed for direct speech translation training and evaluation.
It was first used in [4] (end-to-end automatic speech translation of audiobooks).
For this subset, we extracted the best 100h according to cross-lingual alignment scores. The dev and test sets are composed of clean speech segments only.
Since English (source) transcriptions are available in LibriSpeech, we also translated them using Google Translate. To summarize, for each utterance of the corpus, the following quadruplet is available: English speech signal, English transcription (which should not be used for direct speech translation experiments), French text translation 1 (from the alignment of e-books) and French text translation 2 (from MT of the English transcripts).
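
The selection of the "best 100h according to alignment scores" can be pictured with the minimal sketch below. It is not the authors' actual procedure: it only assumes that each segment has a duration and a single score where higher is better, and it greedily keeps the best-scored segments until a duration budget is reached. The `Segment` class, its field names and the toy values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One aligned utterance (class and field names are illustrative only)."""
    audio_file: str
    duration_s: float       # duration of the speech segment, in seconds
    alignment_score: float  # assumption: higher means a more reliable alignment


def select_best(segments, budget_hours):
    """Keep the highest-scoring segments until the duration budget is used up."""
    budget_s = budget_hours * 3600.0
    selected, total_s = [], 0.0
    for seg in sorted(segments, key=lambda s: s.alignment_score, reverse=True):
        if total_s + seg.duration_s > budget_s:
            break  # the next best segment would exceed the budget
        selected.append(seg)
        total_s += seg.duration_s
    return selected


# Toy data: three one-hour segments and a budget of two hours.
toy = [
    Segment("a.wav", 3600.0, 0.2),
    Segment("b.wav", 3600.0, 0.9),
    Segment("c.wav", 3600.0, 0.5),
]
best = select_best(toy, budget_hours=2)
print([s.audio_file for s in best])
```

With the real corpus, the durations and scores would presumably come from the `audio` and `alignments_scores` tables of the database.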

+---------+-------------------+-----------------------------+-----------------+
| Corpus  |       Total       |      Source (per seg)       | Target (per seg)|
+---------+----------+--------+--------+-------+------------+-----------------+
|         | segments | hours  | frames | chars | (sub)words |      chars      |
+---------+----------+--------+--------+-------+------------+-----------------+
| train 1 |  47271   | 100:00 |  762   |  111  |    20.7    |       143       |
| train 2 |          |        |        |       |            |       126       |
+---------+----------+--------+--------+-------+------------+-----------------+
|   dev   |   1071   |  2:00  |  673   |  93   |    17.9    |       110       |
+---------+----------+--------+--------+-------+------------+-----------------+
|  test   |   2048   |  3:44  |  657   |  95   |    18.3    |       112       |
+---------+----------+--------+--------+-------+------------+-----------------+
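
As a quick sanity check on the table above, the mean segment duration can be derived in two ways: from the total hours divided by the number of segments, and from the mean number of frames per segment. The sketch below rests on two assumptions that the description does not state: that the "hours" column is written as hours:minutes, and that one frame corresponds to 10 ms of audio.

```python
# Figures copied from the table: (segments, "hours:minutes", mean frames per segment).
SUBSETS = {
    "train": (47271, "100:00", 762),
    "dev": (1071, "2:00", 673),
    "test": (2048, "3:44", 657),
}

FRAME_SHIFT_S = 0.010  # assumption: one acoustic frame every 10 ms


def to_seconds(hhmm):
    """Convert an 'hours:minutes' string to seconds (assumed format)."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 3600 + int(minutes) * 60


def mean_duration_s(n_segments, hhmm):
    """Mean segment duration implied by the total duration."""
    return to_seconds(hhmm) / n_segments


for name, (n_segments, hhmm, frames) in SUBSETS.items():
    from_hours = mean_duration_s(n_segments, hhmm)
    from_frames = frames * FRAME_SHIFT_S
    print(f"{name}: {from_hours:.2f} s from hours, {from_frames:.2f} s from frames")
```

Under these assumptions the two estimates agree to within a few hundredths of a second for each subset, at roughly 6.5 to 7.6 seconds per segment.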

The following archives correspond to the 100h subset used in [4]:

For audio files:

- train_100h.zip (~8.7GB)
- dev.zip (~180MB)
- test.zip (~330MB)
- train_130h_additional.zip (~10.6GB)

For aligned text files:

- train_100h_txt.zip
- dev_txt.zip
- test_txt.zip
- train130h_additional_txt.zip

Other archives provided:
========================

The following archives are available to download for other potential uses of the corpus:

- database.zip (~50MB): Database describing the corpus (sqlite3)
- alignments.zip (~1.86GB): All of the intermediate processing files created in the cross-lingual alignment process, along with the English and French raw e-books
- audio_files.zip (~23GB): All of the speech segments, organized by book and chapter
- interface.zip (~72MB): Static HTML files for alignment visualisation. With the interface, speech utterances can be listened to while visualising each sentence alignment

Note: To listen to speech segments with the HTML interface, the 'audio_files' folder must be placed inside the 'Interface' folder:
./Interface
    ./audio_files (audio_files.zip)
    ./css (interface.zip)
    ./js (interface.zip)
    (..)

Github Page
===========
We provide a Python script to interact with the database and to extract the corpus with different queries. This script, along with all of the code used for the alignment process, can be found at:
https://github.com/alicank/Translation-Augmented-LibriSpeech-Corpus

Detailed Corpus Structure
=========================

The folder naming convention follows the book ids from the LibriSpeech and Gutenberg projects. For instance, the folder name "11" corresponds to the id of "Alice's Adventures in Wonderland" by Lewis Carroll in both the Gutenberg Project and the LibriSpeech project.

This corpus is composed of three sections:
- Audio files: resegmented audio files for each book id in the project
- HTML alignment visualisation interface: HTML visualisation of the textual alignments, with the audio files available to listen to
- Alignments folder: all of the processing steps: pre-processing, alignment, forced transcriptions, forced alignments, etc.

- Interface/
    - audio_files/ : Folder contains ~130,000 audio segments aligned with their translations
        - book id/
            - Chapter id/
                - book_id-chapter_id-sentence_number.wav
                - reader_id-chapter_id-sentence_number.wav **if the corpus comes from the dev/test pool of LibriSpeech**

- Alignments/ : Folder contains processing steps used in different alignment stages (reading [3] is mandatory to understand where these files come from)
    - en/ : Folder contains preprocessing steps for English chapters used before alignment
    - fr/ : Folder contains preprocessing steps for French chapters used before alignment
    - ls_book_id.txt (Gutenberg original text)
    - lc_book_id.format (pdf, epub, txt, ...)

- db/ : Folder contains the database containing alignments, metadata and other information
    - TA-LibriSpeechCorpus.sqlite3

index.html (Main HTML page of the Interface)
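
The file naming convention listed above (`book_id-chapter_id-sentence_number.wav`, or `reader_id-chapter_id-sentence_number.wav` for dev/test-pool segments) can be parsed with a few lines of Python. This is a minimal sketch that assumes the three fields are separated by hyphens and that the sentence number is an integer. The `SegmentName` type, the `parse_segment_name` helper and the example path are hypothetical and not part of the released tooling.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class SegmentName(NamedTuple):
    first_id: str         # book id, or reader id for dev/test-pool segments
    chapter_id: str
    sentence_number: int  # assumption: the last field is an integer


def parse_segment_name(path):
    """Split '<id>-<chapter_id>-<sentence_number>.wav' into its three fields."""
    stem = PurePosixPath(path).stem  # file name without directory or extension
    parts = stem.split("-")
    if len(parts) != 3:
        raise ValueError(f"unexpected segment name: {path!r}")
    first_id, chapter_id, sentence = parts
    return SegmentName(first_id, chapter_id, int(sentence))


# Hypothetical path, made up for illustration only.
example = parse_segment_name("audio_files/11/1234/11-1234-0007.wav")
print(example)
```

The file name alone does not say whether the first field is a book id or a reader id. That has to be decided from the context, for example from whether the segment belongs to the dev/test pool.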

Database Structure
==================

The database contains several tables describing the corpus. Its structure is organized as follows:

Alignment Tables
- alignments: Table containing the transcriptions, the textual alignments and the name of the audio file associated with a given alignment. Each row corresponds to an aligned sentence.
- audio: Table containing the duration of each speech segment (in seconds)
- alignments_evaluations: 200 manually annotated sentences (for alignment evaluation, see [3])
- alignments_excluded: Table used to mark sentences to be excluded from the corpus (bad alignments)
- alignments_gTranslate: Automatic translation output from Google Translate for each segment (transcriptions)
- alignments_scores: Different cross-lingual alignment scores provided with the corpus, which can be used to sort the corpus from the highest score to the lowest

Metadata Tables
- Table librispeech: Contains all the books from the LibriSpeech project for which a downloadable link could be found (a link might be dead or wrong if it disappeared after our work)
- Tables csv, clean100, other: Metadata completion for the books provided with the LibriSpeech project
- Table nosLivres: Some French e-book links gathered from http://www.nosLivres.net
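
Since the column names of these tables are not listed here, a reasonable first step is to ask SQLite itself for the schema. The sketch below uses only the standard `sqlite3` module. The `describe_tables` helper is hypothetical, and the toy in-memory table with its column names is made up so that the example runs without the real file.

```python
import sqlite3
from contextlib import closing


def describe_tables(conn):
    """Return {table_name: [column_name, ...]} for every table in the database."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [col[1] for col in columns]
    return schema


# Toy in-memory database standing in for the real file; column names are invented.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE audio (audio_filename TEXT, duration REAL)")
    toy_schema = describe_tables(conn)
print(toy_schema)
```

To inspect the real corpus, the same function can be given a connection opened on the extracted `TA-LibriSpeechCorpus.sqlite3` file. Its exact location depends on where `database.zip` was unpacked.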

References
==========

[1] Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015, April). Librispeech: an ASR corpus based on public domain audio books. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on (pp. 5206-5210). IEEE.
[2] https://www.gutenberg.org/
[3] Ali Can Kocabiyikoglu, Laurent Besacier and Olivier Kraif, "Augmenting LibriSpeech with French Translations: A Multimodal Corpus for Direct Speech Translation Evaluation", submitted to LREC, 2018.
[4] Alexandre Bérard, Laurent Besacier, Ali Can Kocabiyikoglu and Olivier Pietquin, "End-to-End Automatic Speech Translation of Audiobooks", submitted to ICASSP, 2018.

Files (47.0 GB)
===============

Name                            Size       MD5
alignments.zip                  2.0 GB     db92738504c75739de04d414258949e7
audio_files.zip                 23.6 GB    6dc4cecf71aa6ccc67a065cd07791de2
database.zip                    47.9 MB    e2900b18a6fdfd4816f1248449c71ccd
dev.zip                         188.5 MB   a30d9c0503148f1121f76129a86b91e4
dev_txt.zip                     125.5 kB   97f23a5df39855ebbf18f46ac7de5350
Interface.zip                   75.5 MB    4846a45ae71289cfcd28a8b928a95213
test.zip                        345.3 MB   244dc308b0dce6d76f4270f798285f89
test_txt.zip                    241.2 kB   fd11354e0bd7b054a8405dbfc4d0cb96
train130h_additional_txt.zip    8.6 MB     173b895ddf1114dedb7de38e8da0d985
train_100h.zip                  9.4 GB     c5773858b37dd380bb97f0ea942dad1b
train_100h_txt.zip              6.8 MB     c63de3fbb0e449c3f43969a60e373cd4
train_130h_additional.zip       11.4 GB    db6a62f1f484d00295b7e9c2e2fc8658
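
The md5 values in the file listing can be used to check that a download is complete and uncorrupted. The sketch below uses only the standard `hashlib` module and reads the file in chunks, since some archives are larger than 20 GB. The helper names `md5_of_file` and `matches_md5` are hypothetical, and the self-check at the end runs on a throwaway file and not on a real archive.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_md5):
    """Return True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Self-check on a throwaway file, so the sketch runs without the real archives.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
try:
    demo_ok = matches_md5(tmp.name, hashlib.md5(b"hello").hexdigest())
finally:
    os.remove(tmp.name)
print(demo_ok)
```

For a real archive, the call would look like `matches_md5("dev.zip", "a30d9c0503148f1121f76129a86b91e4")`, with the path adjusted to wherever the file was saved.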
