Unsupervised Deep Language and Dialect Identification for Short Texts

Koustava Goswami; Rajdeep Sarkar; Bharathi Raja Chakravarthi; Theodorus Fransen; John P. McCrae

doi:10.5281/zenodo.4320719

Published December 8, 2020 | Version v1

Conference paper (open access)

1. National University of Ireland Galway

Automatic Language Identification (LI) or Dialect Identification (DI) of short texts in closely related languages or dialects is one of the primary steps in many natural language processing pipelines. Language identification is considered a solved task in many cases; however, for very closely related languages, or in an unsupervised scenario (where the languages are not known in advance), performance is still poor. In this paper, we propose the Unsupervised Deep Language and Dialect Identification (UDLDI) method, which can simultaneously learn sentence embeddings and cluster assignments from short texts. The UDLDI model captures the sentence constructions of languages by applying attention to character relations, which helps to optimize the clustering of languages. We performed our experiments on three short-text datasets from different language families, each consisting of closely related languages or dialects, with very small training sets. Our experimental evaluations on these datasets show significant improvement over state-of-the-art unsupervised methods, and our model outperforms state-of-the-art LI and DI systems in supervised settings.

Files

Name	Size
goswami2020unsupervised.pdf (md5:d1224354851408c11f6540a6b33a4dff)	539.3 kB

	All versions	This version
Views	68	68
Downloads	90	89
Data volume	51.2 MB	50.7 MB
