Word Alignments for the Kathbath Multilingual Indic Corpus

Anup, Singh

doi:10.5281/zenodo.15502222

Published May 24, 2025 | Version v1

Dataset Open

Anup, Singh (Contact person)¹

1. Ghent University

This repository provides word-level alignments for the Kathbath dataset [1], a multilingual speech corpus containing approximately 1500 hours of audio across 11 Indian languages.

The alignments were generated with the Montreal Forced Aligner (MFA), using a pre-trained acoustic model for each language. We release them as part of our experimental setup so that the results can be reproduced without rerunning MFA.

If you find these alignments or any other aspect of our work useful, please consider citing the following paper:

Anup Singh, Kris Demuynck, and Vipul Arora, "Language-Agnostic Speech Tokenizer for Spoken Term Detection with Efficient Retrieval", Interspeech 2025.

[1] IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages

Data Structure:

<split>_word_alignments.pkl: {<lang>: {<word>: [<filename>: (start_time, end_time), ..., <filename>: (start_time, end_time)]}}
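The nested layout above can be walked with a few lines of Python. The following is a minimal sketch, not the authors' loading code: the function names are hypothetical, and because the schema line mixes list brackets with `filename: (start, end)` pairs, the helper accepts either a dict of filenames or a list of single-entry dicts or tuples. The language code and filenames in the toy data are made up for illustration, and the time unit is not stated on this page (seconds is an assumption, since that is what MFA normally writes).

```python
import pickle
from pathlib import Path
from typing import Any, Iterator, Tuple

# (lang, word, filename, start_time, end_time)
Occurrence = Tuple[str, str, str, float, float]


def load_alignments(path: str) -> dict:
    """Unpickle one <split>_word_alignments.pkl file.

    pickle can execute arbitrary code while loading, so only open files
    whose checksum matches the one published with the dataset.
    """
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def _pairs(entries: Any) -> Iterator[Tuple[str, Tuple[float, float]]]:
    """Yield (filename, (start, end)) from any of the plausible layouts."""
    if isinstance(entries, dict):
        # Assumed layout A: {filename: (start, end), ...}
        yield from entries.items()
        return
    for item in entries:
        if isinstance(item, dict):
            # Assumed layout B: [{filename: (start, end)}, ...]
            yield from item.items()
        elif len(item) == 3:
            # Assumed layout C: [(filename, start, end), ...]
            filename, start, end = item
            yield filename, (start, end)
        else:
            # Assumed layout D: [(filename, (start, end)), ...]
            filename, span = item
            yield filename, span


def iter_occurrences(alignments: dict) -> Iterator[Occurrence]:
    """Flatten {lang: {word: entries}} into one row per word occurrence."""
    for lang, words in alignments.items():
        for word, entries in words.items():
            for filename, (start, end) in _pairs(entries):
                yield lang, word, filename, float(start), float(end)


# Toy data in the documented shape (hypothetical language key and filenames).
toy = {"hindi": {"नमस्ते": [{"utt_001.wav": (0.42, 0.91)},
                           {"utt_007.wav": (1.10, 1.58)}]}}
for row in iter_occurrences(toy):
    print(row)
```

With a real file, `iter_occurrences(load_alignments("valid_word_alignments.pkl"))` would yield one row per aligned word occurrence, which is a convenient form for building a lookup table or a data frame.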

Files

Files (349.8 MB)

Name	MD5	Size
test_known_word_alignments.pkl	1c66e2aae701c00d89eb86e895c9cc26	13.8 MB
test_word_alignments.pkl	34b9bb6848a43722a1e6f4d5f97c889c	8.7 MB
train_word_alignments.pkl	4e388ee7b9c6deeb8e11b1b0d0a93657	313.5 MB
valid_word_alignments.pkl	ae73f042d7d9098b64f914125270bc8b	13.8 MB
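Since the files are Python pickles, it is worth checking each download against its published MD5 before loading it. The sketch below uses only the standard library; the checksums are copied from the file listing above, while the function names and the assumption that the files sit in a single local directory are illustrative.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing on this page.
EXPECTED_MD5 = {
    "test_known_word_alignments.pkl": "1c66e2aae701c00d89eb86e895c9cc26",
    "test_word_alignments.pkl": "34b9bb6848a43722a1e6f4d5f97c889c",
    "train_word_alignments.pkl": "4e388ee7b9c6deeb8e11b1b0d0a93657",
    "valid_word_alignments.pkl": "ae73f042d7d9098b64f914125270bc8b",
}


def md5_of(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so the 313.5 MB train file is never fully in memory."""
    digest = hashlib.md5()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory=".") -> dict:
    """Return {filename: True | False | None}; None means the file is absent."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == expected) if path.is_file() else None
    return results
```

Calling `verify("path/to/downloads")` returns one entry per file; any `False` indicates a corrupted or altered download that should not be unpickled.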
