Leader: 00000nmm##2200000uu#4500
Record ID: 3526469
DOI: 10.5281/zenodo.3526469
OAI identifier: oai:zenodo.org:3526469

Title: Buddhist Sanskrit Segmenter
Creator: Ligeia Lugli (ORCID 0000-0003-0473-4290), King's College London
Access rights: info:eu-repo/semantics/openAccess (open)
License: Creative Commons Attribution 4.0 International (cc-by-4.0, spdx)
License URL: https://creativecommons.org/licenses/by/4.0/legalcode
Keywords: Buddhist Sanskrit; Natural Language Processing
Language: eng
Publisher: Zenodo
Date: 2019-09-24
Resource type: info:eu-repo/semantics/other
Record timestamp: 20200125072651.0
Related identifier: isVersionOf doi:10.5281/zenodo.3459218

Description

This folder contains R code for a rule-based Buddhist Sanskrit Segmenter and Lemmatiser, together with the data needed to use and evaluate the Segmenter and explanatory materials. The segmenter was tested on 639 sentences from 13 Buddhist texts (9 sūtras, 4 śāstras) and achieved 97% accuracy in that evaluation. The code and materials in this folder were developed as part of a Newton International Fellowship at King's College London, funded by the British Academy (NF161436).

Contents

- R code for segmentation, lemmatisation, normalization and evaluation (includes instructions to run the code)
- PowerPoint presentation with background and explanation of the project
- wordlists and wordlist documentation
- n-gram and stem frequency tables necessary for segmentation
- gold-standard set of manually segmented and stemmed sentences for evaluation
- set of raw sentences for evaluation
- evaluation of the Krishna et al. seq2seq segmenter on Buddhist sentences, for reference purposes

This segmenter has been used to prepare the Sanskrit Corpus at DOI 10.5281/zenodo.3457822 and its later version at DOI 10.5281/zenodo.3526035.

Files (size in bytes, checksum, URL)

9160483  md5:394107767ce92f6be6b64e9c8cec9923  https://zenodo.org/records/3526469/files/Lugli_CL2019_BuddhistSanskritSegmenterPresentation.pptx
25325  md5:5d507c0ac8219998e5150944db8461e5  https://zenodo.org/records/3526469/files/Seq2Seq_segmentertest-full-vocabulary_GeoffroyNoel.txt
212507  md5:37ce1893b1f8a9f0aa55b3f6a850e3f0  https://zenodo.org/records/3526469/files/Lugli_FiveTextsSegmentedTokensDFWithCleanFreq.csv
473717  md5:5e169775fc20db5ea684bb35015fe11a  https://zenodo.org/records/3526469/files/Lugli_GretilBuddhRelLit_NgramsRedux.csv
311888  md5:63a22b8b1d18c08c6b12506d90c3fc16  https://zenodo.org/records/3526469/files/Lugli_GretilBuddhSastraSastra_NgramsRedux.csv
3037610  md5:baee76cc1ec672d92cdb8deb6ba52a51  https://zenodo.org/records/3526469/files/Lugli_NonStemmedWordlist.csv
34650  md5:4aa0a4e3672b4ce21e927a7420b07e5f  https://zenodo.org/records/3526469/files/Lugli_SegmenterEva_RawOneSentencePerLine.zip
244573  md5:432313287a5a2d084ac64b70b14b9a2a  https://zenodo.org/records/3526469/files/Lugli_BuddhSktSegmenterLemmatiser2019.R
62929  md5:a48598508f02a794ee6fd021c937962c  https://zenodo.org/records/3526469/files/Lugli_Wordlist_ReadMe.html
192850  md5:ebeceb54230207b55968c113486c979f  https://zenodo.org/records/3526469/files/Lugli_BuddhFoundCorpusNgramsRedux.csv
4541833  md5:fb8504f60355452420e87d2f9953fa1c  https://zenodo.org/records/3526469/files/Lugli_WordlistWithStemmedAndNotStemmedLemmata.csv
1105346  md5:d64b2a4b9e10dd9e45e95d4f2f701648  https://zenodo.org/records/3526469/files/Lugli_WordlistNoA_June2019.csv
4895  md5:9e2098321870bda7a82c0fc314449795  https://zenodo.org/records/3526469/files/Lugli2019_HorizontalNormalizer.R
1001881  md5:dc9996dd5b97530e194f64add6f913e1  https://zenodo.org/records/3526469/files/Lugli_StemmedWordlist.csv
75297  md5:56a7ab6ba81ceac3954c38c5ad6a7525  https://zenodo.org/records/3526469/files/Lugli_Segmenter_Eva_AllGoldSent.csv
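
The md5 values listed with each file above can be used to check that a download is intact. The following is a minimal sketch in Python, not part of the deposit itself: the function names `md5_of_file` and `matches_record` are hypothetical, and the example assumes the file has already been downloaded into the working directory under its original name.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, recorded):
    """Compare a local file with a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = recorded.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


if __name__ == "__main__":
    # Assumption: the file was downloaded beforehand into the working directory.
    local = Path("Lugli_StemmedWordlist.csv")
    # Checksum as listed for this file in the record.
    recorded = "md5:dc9996dd5b97530e194f64add6f913e1"
    if local.exists():
        print(local.name, "OK" if matches_record(local, recorded) else "MISMATCH")
    else:
        print(local.name, "not found; download it first")
```

The same check can be repeated for any other file in the list by substituting its name and recorded checksum.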