The Unsolved Problem of Language Identification: A GMM-based Approach

doi:10.5281/zenodo.5796854

Published December 21, 2021 | Version v1

Conference paper | Open access

Mi, Maggie¹

1. University of Lancaster

Language identification (LID) systems attempt to identify a language from a series of randomly spoken utterances (Das & Roy, 2019), providing the foundation for many natural language processing (NLP) applications, such as multimedia mining, spoken-document retrieval, and multilingual spoken dialogue systems (Navratil, 2006). At present, however, LID remains an unsolved problem: the equal error rate (EER) often increases as the duration and quality of the dataset decrease (Ambikairajah et al., 2011). The HMM-GMM (Hidden Markov Model-Gaussian Mixture Model) approach taken in this paper involves building an acoustic model that uses probabilistic representations of speech datasets across 10 languages (Dutch, Russian, Italian, Portuguese, German, English, French, Turkish, and Greek). By exploring the crosslinguistic features present in language families and the effect of the experimental parameters, i.e., the length of the data recording, on the performance of the system, the paper reveals areas of weakness and corresponding means of improvement.
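The abstract gives no implementation details, so the following is only a minimal sketch of the general idea behind GMM-based language scoring, not the paper's system. It assumes one already-trained mixture per language over one-dimensional features, and it leaves out the HMM component, feature extraction, and training entirely. The function names, language labels, and model parameters are all hypothetical and chosen purely for illustration.

```python
import math


def gmm_avg_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood of 1-D frames under a GMM."""
    total = 0.0
    for x in frames:
        # Log-density of x under each weighted Gaussian component.
        logs = [
            math.log(w) - 0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
            for w, m, v in zip(weights, means, variances)
        ]
        # Log-sum-exp keeps the mixture sum numerically stable.
        peak = max(logs)
        total += peak + math.log(sum(math.exp(l - peak) for l in logs))
    return total / len(frames)


def identify_language(frames, models):
    """Return the label of the best-scoring model and all per-language scores."""
    scores = {
        lang: gmm_avg_log_likelihood(frames, *params)
        for lang, params in models.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy models: (weights, means, variances) per language.
# Real systems would estimate these from acoustic features of training speech.
toy_models = {
    "lang_a": ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]),
    "lang_b": ([0.3, 0.7], [4.0, 6.0], [1.0, 2.0]),
}

# Made-up feature values standing in for one test utterance.
toy_frames = [4.8, 5.5, 6.1, 3.9]
best, scores = identify_language(toy_frames, toy_models)
print(best)  # lang_b
```

Averaging the log-likelihood over frames, rather than summing it, keeps scores comparable between utterances of different lengths, which matters when recording duration is one of the parameters under study.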

Files

Mi, The Unsolved Problem of Language Identification - A GMM-based Approach.pdf

Files (1.0 MB)

Name	Size
Mi, The Unsolved Problem of Language Identification - A GMM-based Approach.pdf (md5:f647a0485a892b070a4bfadaefcb7b15)	1.0 MB

	All versions	This version
Views	90	90
Downloads	41	41
Data volume	43.1 MB	43.1 MB
