Ligeia Lugli
2019-09-23
<p>This is a proof-of-concept Sanskrit corpus developed for the study of Buddhist Sanskrit lexicology.</p>
<p>It comprises:</p>
<ul>
<li> 73 metadata-enriched Buddhist Sanskrit texts, for a total of ~3 million tokens</li>
<li> a 4-million-token reference corpus comprising 30 metadata-enriched non-Buddhist Sanskrit texts</li>
</ul>
<p>The corpus is in romanised Sanskrit (UTF-8 encoding) and is available in three configurations:</p>
<ol>
<li> segmented (with dash-separated words)</li>
<li> segmented and stemmed (with capitalised word stems and compounds separated by an @ sign)</li>
<li> segmented, stemmed and normalised (normalisation handles some spelling variation and, in most cases, resolves sandhi on stem initials); recommended for Word Sketches</li>
</ol>
<p>The last of these configurations can be used to generate word sketches in Sketch Engine when combined with the included sketch grammar, which infers likely syntactic dependencies from morphological cues.</p>
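<p>As a rough illustration of the segmented and stemmed configurations listed above, the following Python sketch splits a line into words and a token into compound members. It assumes that dashes and whitespace both mark word boundaries and that the @ sign separates the members of a compound. The function names and the sample strings are hypothetical and are not taken from the corpus files, and the exact capitalisation scheme of stems is not modelled here.</p>

```python
import re
from typing import List


def split_segmented(line: str) -> List[str]:
    """Split a line into words.

    Assumption: in the segmented configuration, both whitespace and
    dashes mark word boundaries.
    """
    return [tok for tok in re.split(r"[\s-]+", line.strip()) if tok]


def split_compound(token: str) -> List[str]:
    """Split a token into compound members.

    Assumption: in the stemmed configuration, an @ sign separates
    the members of a compound.
    """
    return [part for part in token.split("@") if part]


# Hypothetical sample strings, invented for illustration only.
words = split_segmented("tatra-api bodhisattvena")
members = split_compound("bodhi@sattva")
print(words)
print(members)
```

<p>Keeping the two splitting steps separate means the same word splitter can be reused for all three configurations, with compound splitting applied only where @ signs occur.</p>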
<p><strong>Limitations</strong><br>
As a proof of concept, this corpus has several limitations: it is very small by contemporary standards, it has not been proofread, and it is currently only segmented and stemmed (not normalised, lemmatised or PoS tagged). <br>
A funding bid has been submitted to expand and lemmatise the corpus.</p>
<p><strong>Data Quality</strong><br>
The corpus has been segmented with Lugli's Sanskrit segmenter (DOI: 10.5281/zenodo.3459215). The accuracy of this segmenter has been evaluated at 97% on a sample of Buddhist Sanskrit literature. No evaluation has been performed on non-Buddhist materials, so the segmentation quality may be lower in the non-Buddhist section of the corpus.</p>
<p><strong>Acknowledgments</strong><br>
The corpus was created as part of the project 'Lexis and Tradition: variation in the vocabulary of Sanskrit Mahāyāna literature'. This project was funded by the British Academy through a Newton International Fellowship (NF161436) and hosted at the Department of Theology and Religious Studies at King's College London under the supervision of Prof. Henrietta Kate Crosby.</p>
<p>Thanks to GRETIL, Dr. Vinita Tseng and Prof. Steinkellner for kindly giving their permission to include automatically processed versions of some of their editions in this corpus.</p>
Also included: a combined bibliography and metadata summary, and a sketch grammar and corpus configuration file for use in Sketch Engine.
https://doi.org/10.5281/zenodo.3526665
oai:zenodo.org:3526665
san
Zenodo
https://doi.org/10.5281/zenodo.3457821
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
corpus
Sanskrit
Buddhist Sanskrit
segmented Sanskrit corpus (proof of concept)
info:eu-repo/semantics/other