There is a newer version of this record available.

Software Open Access

Buddhist Sanskrit Segmenter

Ligeia Lugli

Dublin Core Export

<?xml version='1.0' encoding='utf-8'?>
<oai_dc:dc xmlns:dc="" xmlns:oai_dc="" xmlns:xsi="" xsi:schemaLocation="">
  <dc:creator>Ligeia Lugli</dc:creator>
  <dc:description>This folder contains R code for a rule-based Buddhist Sanskrit Segmenter and Lemmatiser, together with the data needed to use and evaluate the Segmenter, and explanatory materials.

The segmenter has been tested on 639 sentences from 13 Buddhist texts (9 sūtras, 4 śāstras) and achieved 97% accuracy in evaluation.

The code and materials contained in this folder were developed as part of a Newton International Fellowship at King's College London, funded by the British Academy (NF161436).



R code for segmentation, lemmatisation and evaluation (includes instructions to run code)

PowerPoint presentation with background and explanation of the project

Wordlists and Wordlists documentation

n-gram and stem frequency tables necessary for segmentation

gold standard set of manually segmented and stemmed sentences for evaluation

set of raw sentences for evaluation

evaluation of Krisha et al. seq2seq segmenter on Buddhist sentences for reference purposes


This segmenter has been used to prepare the Sanskrit Corpus at DOI 10.5281/zenodo.3457822</dc:description>
  <dc:subject>Buddhist Sanskrit</dc:subject>
  <dc:subject>Natural Language Processing</dc:subject>
  <dc:title>Buddhist Sanskrit Segmenter</dc:title>
</oai_dc:dc>