Language lexicons for Hindi-English multilingual text processing

Mohd Zeeshan Ansari; Tanvir Ahmad; Mirza Mohd Sufyan Beg; Noaima Bari

doi:10.11591/ijai.v11.i2.pp641-648

Published June 1, 2022 | Version v1

Journal article Open

1. Jamia Millia Islamia
2. Aligarh Muslim University

Language identification (LI) in textual documents is the process of automatically detecting the language of a document from its content. Present language identification techniques presume that a document contains text in one of a fixed set of languages. However, this presumption does not hold for multilingual documents, which include content in more than one language. Because standard corpora for Hindi-English mixed-language processing tasks are unavailable, we propose language lexicons, a novel kind of lexical database that augments several bilingual language processing tasks. These lexicons are built by learning classifiers over English and transliterated Hindi vocabulary. The designed lexicons possess condensed quantitative characteristics that reflect their linguistic strength with respect to the Hindi and English languages. On evaluating the lexicons, we observe that words of the same language tend to cluster together and are separable across language classes. Compared with existing works, the proposed lexicon models exhibit better classifier performance.
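
The abstract does not state which classifier, features, or training data were used, so the following is only a minimal sketch of the general idea of a lexicon built from a word-level classifier. Everything in it is an assumption made for illustration: the character-bigram Naive Bayes model, the toy word lists, and the names `char_ngrams`, `NgramNB`, and `build_lexicon` are hypothetical and do not come from the paper. Each lexicon entry maps a word to a score in [0, 1] that loosely plays the role of its "linguistic strength" toward transliterated Hindi.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Return character n-grams of a word, with boundary markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNB:
    """Tiny multinomial Naive Bayes over character n-grams (illustrative only)."""

    def __init__(self, n=2, alpha=1.0):
        self.n = n          # n-gram length
        self.alpha = alpha  # additive smoothing constant

    def fit(self, words, labels):
        self.priors = Counter(labels)
        self.counts = {label: Counter() for label in self.priors}
        for word, label in zip(words, labels):
            self.counts[label].update(char_ngrams(word, self.n))
        self.vocab = set().union(*self.counts.values())
        return self

    def log_scores(self, word):
        """Unnormalized log posterior of each language for one word."""
        n_words = sum(self.priors.values())
        scores = {}
        for label, counter in self.counts.items():
            # +1 reserves probability mass for unseen n-grams.
            denom = sum(counter.values()) + self.alpha * (len(self.vocab) + 1)
            score = math.log(self.priors[label] / n_words)
            for gram in char_ngrams(word, self.n):
                score += math.log((counter[gram] + self.alpha) / denom)
            scores[label] = score
        return scores

    def prob(self, word, label):
        """Posterior probability that the word belongs to the given language."""
        scores = self.log_scores(word)
        top = max(scores.values())
        exp = {k: math.exp(v - top) for k, v in scores.items()}
        return exp[label] / sum(exp.values())


def build_lexicon(model, vocabulary, label="hi"):
    """Map each word to its score for the chosen language."""
    return {word: model.prob(word, label) for word in vocabulary}


# Toy vocabulary, invented for this sketch (not the paper's data).
english = ["the", "and", "this", "that", "with", "school", "friend", "today"]
hindi = ["hai", "nahi", "kya", "aaj", "mera", "bahut", "accha", "dost"]

model = NgramNB(n=2).fit(english + hindi, ["en"] * len(english) + ["hi"] * len(hindi))
lexicon = build_lexicon(model, english + hindi)

for word, score in sorted(lexicon.items(), key=lambda item: item[1]):
    print(f"{word:8s} {score:.3f}")
```

In this sketch, scores near 1 indicate words the toy model treats as Hindi and scores near 0 indicate English. A realistic lexicon would need a far larger vocabulary and a held-out evaluation, neither of which is attempted here.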

Files

Name	Size
25 21499 1570753764.pdf (md5:246f8fb7c2339bc33d9939771729d148)	360.0 kB
