A Dataset for Sanskrit Word Segmentation

Krishna, Amrith; Satuluri, Pavankumar; Goyal, Pawan

doi:10.5281/zenodo.803508

Published June 7, 2017 | Version v1

Dataset | Open access

1. IIT Kharagpur
2. Chinmaya Vishwavidyapeeth, CEG Campus

The work was accepted at the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, co-located with ACL 2017.

The last decade saw a surge in digitisation efforts for ancient manuscripts in Sanskrit. Due to various linguistic peculiarities inherent to the language, even the preliminary tasks such as word segmentation are non-trivial in Sanskrit. Elegant models for Word Segmentation in Sanskrit are indispensable for further syntactic and semantic processing of the manuscripts. Current works in word segmentation for Sanskrit, though commendable in their novelty, often have variations in their objective and evaluation criteria. In this work, we set the record straight. We formally define the objectives and the requirements for the word segmentation task. In order to encourage research in the field and to alleviate the time and effort required in pre-processing, we release a dataset of 115,000 sentences for word segmentation. For each sentence in the dataset we include the input character sequence, ground truth segmentation, and additionally lexical and morphological information about all the phonetically possible segments for the given sentence. In this work, we also discuss the linguistic considerations made while generating the candidate space of the possible segments.
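As a rough illustration of the three components the description lists for each sentence (input character sequence, ground truth segmentation, and candidate segments with lexical and morphological information), a record could be modelled along the following lines. This is only a sketch: the class and field names (`SegmentationRecord`, `Candidate`, `chars`, `gold`, `lemma`, `morph`) are hypothetical and do not reflect the actual layout of the released pickle files, and the example values are illustrative rather than taken from the dataset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Candidate:
    """One phonetically possible segment (hypothetical structure)."""
    surface: str  # the segment as it would appear after splitting
    lemma: str    # lexical information (assumed field name)
    morph: str    # morphological information (assumed field name)


@dataclass
class SegmentationRecord:
    """One sentence entry (hypothetical structure, not the dataset's schema)."""
    chars: str                      # input character sequence
    gold: List[str]                 # ground truth segmentation
    candidates: List[Candidate] = field(default_factory=list)

    def gold_in_candidates(self) -> bool:
        """Check that every gold segment occurs in the candidate space."""
        surfaces = {c.surface for c in self.candidates}
        return all(segment in surfaces for segment in self.gold)


# Illustrative values only: "nāsti" splits into "na" + "asti".
record = SegmentationRecord(
    chars="nāsti",
    gold=["na", "asti"],
    candidates=[
        Candidate(surface="na", lemma="na", morph="indeclinable"),
        Candidate(surface="asti", lemma="as", morph="3rd person singular present"),
    ],
)
```

A check such as `record.gold_in_candidates()` is one way to confirm that the candidate space of an entry actually contains its ground truth segments before using it for training or evaluation.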

Files

Files (852.7 MB)

Name	MD5	Size
DCS_999.p	b2a95744ed82b35e9be55e6a06444a59	397 Bytes
DCS_pick.zip	556395ddc0f087fbf0a199a0581775d0	199.8 MB
graphFiles	230d651dc203eeaedd5973ee519fcc29	1.8 MB
paper.pdf	422c8ee62d17c2daea9d080e93c27c81	318.9 kB
pickleReader.py	0f8fd7b758a1d179c2aaadbdf1233eba	737 Bytes
sample_999.graphml	0a5130686d65b820f80a520dfbf55ae0	32.6 kB
skt.zip	28dd25046e118747ade96525e96d617d	650.8 MB
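Since the record lists an MD5 checksum for every file, a downloaded copy can be checked before use. The sketch below is one possible way to do this with the Python standard library; the helper names `md5_of` and `verify` are illustrative, and it assumes the files were saved under their listed names in a single directory.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "DCS_999.p": "b2a95744ed82b35e9be55e6a06444a59",
    "DCS_pick.zip": "556395ddc0f087fbf0a199a0581775d0",
    "graphFiles": "230d651dc203eeaedd5973ee519fcc29",
    "paper.pdf": "422c8ee62d17c2daea9d080e93c27c81",
    "pickleReader.py": "0f8fd7b758a1d179c2aaadbdf1233eba",
    "sample_999.graphml": "0a5130686d65b820f80a520dfbf55ae0",
    "skt.zip": "28dd25046e118747ade96525e96d617d",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each listed file found in `directory` to True (match) or False (mismatch).

    Files that are not present in the directory are skipped.
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.is_file():
            results[name] = md5_of(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify(".").items():
        print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```

Reading in chunks matters here because the two archives are several hundred megabytes each; a mismatch usually indicates an incomplete or corrupted download.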
