Software Open Access

Buddhist Sanskrit Segmenter

Ligeia Lugli


Dublin Core Export

<?xml version='1.0' encoding='utf-8'?>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:creator>Ligeia Lugli</dc:creator>
  <dc:date>2019-09-24</dc:date>
  <dc:description>This folder contains R code for a rule-based Buddhist Sanskrit Segmenter and Lemmatiser, together with the data needed to use and evaluate the Segmenter, and explanatory materials.

The segmenter has been tested on 639 sentences from 13 Buddhist texts (9 sūtras, 4 śāstras) and achieved 97% accuracy in that evaluation.

The code and materials contained in this folder were developed as part of a Newton International Fellowship at King's College London, funded by the British Academy (NF161436).

Contents

R code for segmentation, lemmatisation, normalization and evaluation (includes instructions to run code)

PowerPoint presentation with background and explanation of the project

Wordlists and wordlist documentation

N-gram and stem frequency tables necessary for segmentation

Gold-standard set of manually segmented and stemmed sentences for evaluation

Set of raw sentences for evaluation

Evaluation of the Krisha et al. seq2seq segmenter on Buddhist sentences, for reference purposes

This segmenter has been used to prepare the Sanskrit Corpus at DOI 10.5281/zenodo.3457822 and its later version at DOI 10.5281/zenodo.3526035.</dc:description>
  <dc:identifier>https://zenodo.org/record/3526469</dc:identifier>
  <dc:identifier>10.5281/zenodo.3526469</dc:identifier>
  <dc:identifier>oai:zenodo.org:3526469</dc:identifier>
  <dc:language>eng</dc:language>
  <dc:relation>doi:10.5281/zenodo.3459218</dc:relation>
  <dc:rights>info:eu-repo/semantics/openAccess</dc:rights>
  <dc:rights>https://creativecommons.org/licenses/by/4.0/legalcode</dc:rights>
  <dc:subject>Buddhist Sanskrit</dc:subject>
  <dc:subject>Natural Language Processing</dc:subject>
  <dc:title>Buddhist Sanskrit Segmenter</dc:title>
  <dc:type>info:eu-repo/semantics/other</dc:type>
  <dc:type>software</dc:type>
</oai_dc:dc>