Word Alignments for the Kathbath Multilingual Indic Corpus
Description
This repository provides word-level alignments for the Kathbath dataset [1], a multilingual speech corpus containing approximately 1500 hours of audio across 11 Indian languages.
The alignments were generated using the Montreal Forced Aligner (MFA) with pre-trained acoustic models specific to each language. To simplify reproducibility and save you the effort of running MFA yourself, we are releasing these alignments as part of our experimental setup.
If you find these alignments or any other aspect of our work useful, please consider citing the following paper:
- Anup Singh, Kris Demuynck, and Vipul Arora, "Language-Agnostic Speech Tokenizer for Spoken Term Detection with Efficient Retrieval", Interspeech 2025.
[1] IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages
Data Structure:
<split>_word_alignments.pkl: {<lang>: {<word>: [<filename>: (start_time, end_time), ..., <filename>: (start_time, end_time)]}}
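The schema above can be read in more than one way: the entries under a word might be single-key dicts (`{filename: (start, end)}`), `(filename, (start, end))` pairs, or one dict keyed by filename. The sketch below is a minimal, hedged example that accepts all three shapes. The function names `load_alignments` and `iter_occurrences`, and the language, word, and filenames in the toy data, are hypothetical and used only for illustration. Because `pickle.load` can execute arbitrary code, load only files you trust.

```python
import pickle
from typing import Any, Iterator, Tuple


def load_alignments(path: str) -> dict:
    """Load one <split>_word_alignments.pkl file (only use trusted files)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def iter_occurrences(entries: Any) -> Iterator[Tuple[str, float, float]]:
    """Yield (filename, start_time, end_time) for one word's entries.

    Assumption: entries are either a dict {filename: (start, end)},
    a list of such dicts, or a list of (filename, (start, end)) pairs.
    """
    if isinstance(entries, dict):
        entries = [entries]
    for entry in entries:
        if isinstance(entry, dict):
            for filename, (start, end) in entry.items():
                yield filename, start, end
        else:
            filename, (start, end) = entry
            yield filename, start, end


# Toy data mirroring the documented nesting: {lang: {word: [...]}}.
# The keys and values here are made up for illustration.
toy = {"hindi": {"namaste": [{"a.wav": (0.5, 0.9)}, ("b.wav", (1.0, 1.4))]}}

for filename, start, end in iter_occurrences(toy["hindi"]["namaste"]):
    print(filename, start, end)
```

With a real file, the same loop would run over `load_alignments(path)[lang][word]`, where `path` follows the `<split>_word_alignments.pkl` pattern.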
Files
Total size: 349.8 MB
| MD5 checksum | Size |
|---|---|
| 1c66e2aae701c00d89eb86e895c9cc26 | 13.8 MB |
| 34b9bb6848a43722a1e6f4d5f97c889c | 8.7 MB |
| 4e388ee7b9c6deeb8e11b1b0d0a93657 | 313.5 MB |
| ae73f042d7d9098b64f914125270bc8b | 13.8 MB |