Published May 24, 2025 | Version v1
Dataset Open

Word Alignments for the Kathbath Multilingual Indic Corpus

  • 1. Ghent University

Description

This repository provides word-level alignments for the Kathbath dataset [1], a multilingual speech corpus containing approximately 1500 hours of audio across 11 Indian languages.

The alignments were generated using the Montreal Forced Aligner (MFA) with pre-trained, language-specific acoustic models. We release them as part of our experimental setup so that our results are easier to reproduce and you do not have to run MFA yourself.

If you find these alignments or any other aspect of our work useful, please consider citing the following paper:

  • Anup Singh, Kris Demuynck, and Vipul Arora, "Language-Agnostic Speech Tokenizer for Spoken Term Detection with
    Efficient Retrieval", Interspeech 2025.

[1] IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages

 

Data Structure:

<split>_word_alignments.pkl: {<lang>: {<word>: [<filename>: (start_time, end_time), ..., <filename>: (start_time, end_time)]}}
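A minimal sketch of how such a file could be read in Python, assuming it is a standard pickle of nested built-in containers. The notation above leaves the exact container holding each word's occurrences open, so the hypothetical helper `iter_occurrences` accepts a dict, a list of dicts, or a list of (filename, (start_time, end_time)) pairs. The function names are illustrative and are not part of the release.

```python
import pickle
from pathlib import Path


def load_alignments(path):
    """Load one <split>_word_alignments.pkl file.

    Only unpickle files you trust: pickle can execute code on load.
    """
    with Path(path).open("rb") as f:
        return pickle.load(f)


def iter_occurrences(entries):
    """Yield (filename, start_time, end_time) for one word's entries.

    The container type is an assumption, so three shapes are accepted:
    a dict {filename: (start, end)}, a list of such dicts, or a list of
    (filename, (start, end)) pairs.
    """
    if isinstance(entries, dict):
        entries = [entries]
    for item in entries:
        pairs = item.items() if isinstance(item, dict) else [item]
        for filename, (start, end) in pairs:
            yield filename, start, end


def word_durations(alignments, lang, word):
    """Return end_time - start_time for every occurrence of `word` in `lang`."""
    return [end - start for _, start, end in iter_occurrences(alignments[lang][word])]
```

For example, `word_durations(load_alignments("valid_word_alignments.pkl"), lang, word)` would list the duration of every occurrence of one word. The file name and the keys used for `lang` and `word` are placeholders here; inspect the top-level keys of the loaded dictionary to see how languages are actually labelled.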

 

Files (349.8 MB)

Checksum                               Size
md5:1c66e2aae701c00d89eb86e895c9cc26   13.8 MB
md5:34b9bb6848a43722a1e6f4d5f97c889c   8.7 MB
md5:4e388ee7b9c6deeb8e11b1b0d0a93657   313.5 MB
md5:ae73f042d7d9098b64f914125270bc8b   13.8 MB
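
The MD5 values listed above can be used to check that a download is complete. The following is a minimal sketch using Python's standard `hashlib`. The helper names are illustrative, and because this listing does not show which checksum belongs to which file name, the pairing of a local file with a listed value is left to the reader.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed value such as 'md5:1c66e2aa...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call such as `matches_listing("train_word_alignments.pkl", "md5:4e388ee7b9c6deeb8e11b1b0d0a93657")` returns `True` only if the local file hashes to the listed value. The file name in this call is a placeholder.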