Published May 14, 2026 | Version v1
Dataset Open

Sindhi Open Lexicon Dataset - SindhiLanguage.org

Description

Sindhi Open Lexicon Dataset (223K+ Entries) for AI, NLP & Computational Linguistics.

This project is a large-scale structured lexical dataset for the Sindhi language containing over 223,000 entries, including definitions, linguistic metadata, and normalized forms. Sindhi is a historically rich but low-resource language in AI. This dataset aims to support NLP, AI systems, and computational linguistics.

Objectives

- Provide an AI-ready Sindhi dataset
- Support NLP research
- Enable search engines, chatbots, OCR, and language tools
- Preserve linguistic heritage digitally

Dataset Features

- 223,000+ entries
- Definitions in Sindhi
- Variants with/without diacritics
- Normalized text
- Domain classification
- Formats: CSV, JSONL, SQLite
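
As a rough illustration of how the "variants with/without diacritics" feature might be used for lookup, the sketch below strips Arabic-script short-vowel marks and indexes JSONL records by the diacritic-free form. The field names `word` and `definition`, the sample record, and the choice of the U+064B to U+0652 range are all assumptions made for this example; the dataset's actual schema and normalization rules are not documented on this page and may differ.

```python
import io
import json

# Arabic-script short vowels and related marks (U+064B..U+0652).
# Assumption: this range approximates what the dataset treats as diacritics.
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(text: str) -> str:
    """Return text with the marks listed in HARAKAT removed."""
    return "".join(ch for ch in text if ch not in HARAKAT)


def build_index(lines, key="word"):
    """Map the diacritic-free form of each record's key field to its records.

    `key` is a hypothetical field name, not confirmed by the dataset.
    """
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        index.setdefault(strip_diacritics(record[key]), []).append(record)
    return index


# Made-up sample record standing in for one line of a JSONL file.
# The word is "Sindhi" written with kasra marks (U+0650).
sample = io.StringIO(
    json.dumps(
        {"word": "\u0633\u0650\u0646\u068c\u0650\u064a", "definition": "(placeholder)"}
    )
    + "\n"
)
index = build_index(sample)
```

With a real download, the `io.StringIO` sample would be replaced by an open file handle on the JSONL file, and `key` adjusted to whatever field the file actually uses for the headword.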

Files

LICENSE.txt

Files (341.2 MB)

Checksum                                Size
md5:5805d2a4d7905ed324028aa4797cf987    733 Bytes
md5:ff6727ed521acaeaafa42409250b8f67    2.6 kB
md5:fee6fcb21880a6a28ebf38d11bd0f13d    283.7 kB
md5:5904e41a047139258423d0a779f533a2    84.5 MB
md5:8282ca3a656275d282bd98470de2a8fc    149.4 MB
md5:e17533fd16c705dff77824ea285532de    107.0 MB
md5:958201824dcf21321a63f9446c92987f    765 Bytes
md5:28281547a0809e48a1155ae6e55723ac    876 Bytes
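
The MD5 values listed above can be used to check that a download completed intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_listing` are introduced here for illustration, and since this listing does not show which file name goes with which checksum, the pairing has to be taken from the record page itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listed value such as 'md5:5805d2a4...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Example call (the local path is a placeholder, not a file name from the record):
# matches_listing("path/to/downloaded_file", "md5:5805d2a4d7905ed324028aa4797cf987")
```

Reading in chunks keeps memory use flat, which matters for the larger files in this record (up to 149.4 MB).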