Software Open Access

Buddhist Sanskrit Segmenter

Ligeia Lugli

This folder contains R code for a rule-based Buddhist Sanskrit Segmenter and Lemmatiser, together with the data needed to run and evaluate the Segmenter, and explanatory materials.

The segmenter was tested on 639 sentences from 13 Buddhist texts (9 sūtras, 4 śāstras) and achieved 97% accuracy.

The code and materials in this folder were developed as part of a Newton International Fellowship at King's College London, funded by the British Academy (NF161436).

Contents

R code for segmentation, lemmatisation, normalisation and evaluation (includes instructions for running the code)

PowerPoint presentation with background and explanation of the project

Wordlists and wordlist documentation

N-gram and stem frequency tables needed for segmentation

Gold-standard set of manually segmented and stemmed sentences for evaluation

Set of raw sentences for evaluation

Evaluation of the Krishna et al. seq2seq segmenter on Buddhist sentences, for reference purposes

This segmenter was used to prepare the Sanskrit Corpus at DOI 10.5281/zenodo.3457822 and its later version at DOI 10.5281/zenodo.3526035.

Files (20.5 MB)

Lugli2019_HorizontalNormalizer.R (4.9 kB, md5:9e2098321870bda7a82c0fc314449795)
Lugli_BuddhFoundCorpusNgramsRedux.csv (192.8 kB, md5:ebeceb54230207b55968c113486c979f)
Lugli_BuddhSktSegmenterLemmatiser2019.R (244.6 kB, md5:432313287a5a2d084ac64b70b14b9a2a)
Lugli_CL2019_BuddhistSanskritSegmenterPresentation.pptx (9.2 MB, md5:394107767ce92f6be6b64e9c8cec9923)
Lugli_FiveTextsSegmentedTokensDFWithCleanFreq.csv (212.5 kB, md5:37ce1893b1f8a9f0aa55b3f6a850e3f0)
Lugli_GretilBuddhRelLit_NgramsRedux.csv (473.7 kB, md5:5e169775fc20db5ea684bb35015fe11a)
Lugli_GretilBuddhSastraSastra_NgramsRedux.csv (311.9 kB, md5:63a22b8b1d18c08c6b12506d90c3fc16)
Lugli_NonStemmedWordlist.csv (3.0 MB, md5:baee76cc1ec672d92cdb8deb6ba52a51)
Lugli_Segmenter_Eva_AllGoldSent.csv (75.3 kB, md5:56a7ab6ba81ceac3954c38c5ad6a7525)
Lugli_SegmenterEva_RawOneSentencePerLine.zip (34.6 kB, md5:4aa0a4e3672b4ce21e927a7420b07e5f)
Lugli_StemmedWordlist.csv (1.0 MB, md5:dc9996dd5b97530e194f64add6f913e1)
Lugli_Wordlist_ReadMe.html (62.9 kB, md5:a48598508f02a794ee6fd021c937962c)
Lugli_WordlistNoA_June2019.csv (1.1 MB, md5:d64b2a4b9e10dd9e45e95d4f2f701648)
Lugli_WordlistWithStemmedAndNotStemmedLemmata.csv (4.5 MB, md5:fb8504f60355452420e87d2f9953fa1c)
Seq2Seq_segmentertest-full-vocabulary_GeoffroyNoel.txt (25.3 kB, md5:5d507c0ac8219998e5150944db8461e5)