Other Open Access

segmented Sanskrit corpus (proof of concept)

Ligeia Lugli

Citation Style Language JSON Export

  "publisher": "Zenodo", 
  "DOI": "10.5281/zenodo.3903262", 
  "language": "san", 
  "title": "segmented Sanskrit corpus (proof of concept)", 
  "issued": {
    "date-parts": [
  "abstract": "<p>This is a proof-of-concept Sanskrit corpus developed for the study of Buddhist Sanskrit lexicology.</p>\n\n<p>It comprises:</p>\n\n<ul>\n\t<li>&nbsp;172&nbsp;metadata-enriched Buddhist&nbsp;Sanskrit texts for a total of ~5&nbsp;million words. The corpus contains all Mah\u0101y\u0101na and &#39;mainstream&#39; Buddhist texts available on GRETIL that are based on Sanskrit editions (reconstructed editions based on Tibetan translations have been filtered out).</li>\n</ul>\n\n<p>The corpus is in romanised Sanskrit (UTF-8 encoding) and is available in three&nbsp;configurations:</p>\n\n<ol>\n\t<li>&nbsp;segmented (with dash-separated words)</li>\n\t<li>&nbsp;segmented and stemmed (with capitalised word stems and compounds separated by an @ sign).</li>\n\t<li>segmented, stemmed and normalised (normalisation treats some spelling variation and&nbsp;resolves sandhi of stems&#39; initials in most cases), recommended for Word Sketches.</li>\n</ol>\n\n<p>The last version can be used to generate word sketches&nbsp;in Sketch Engine if used in&nbsp;conjunction with the included sketch grammar, which&nbsp;infers likely syntactic dependencies from morphological cues.</p>\n\n<p>**<em>avagraha</em> has been replaced with <em>a</em>** in the stemmed versions</p>\n\n<p><strong>Limitations</strong><br>\nAs a proof of concept, this corpus suffers from several limitations. It is very small by contemporary standards, it has not been proof-read and it is currently only segmented and stemmed (not lemmatised or PoS tagged).&nbsp;<br>\nA funding bid has been submitted to expand and lemmatise the corpus.</p>\n\n<p><strong>Data Quality</strong><br>\nThe corpus has been segmented with Lugli&#39;s Sanskrit segmenter (10.5281/zenodo.3459215).&nbsp;The accuracy of this segmenter has been evaluated at 97% on a sample of Buddhist Sanskrit literature.</p>\n\n<p>Please refer to the segmenter documentation stored at&nbsp;10.5281/zenodo.3459215 for details on evaluation and stemming conventions.</p>\n\n<p><strong>Acknowledgments</strong><br>\nThe corpus has been realised as part of the project &#39;Lexis and Tradition: variation in the vocabulary of Sanskrit Mah\u0101y\u0101na literature&#39;. This project was&nbsp;funded by the British Academy through a Newton International Fellowship (NF161436) and hosted at the Department of Theology and Religious Studies at King&#39;s College London&nbsp;under the supervision of Prof. Henrietta Kate Crosby.&nbsp;</p>\n\n<p>Dr. Bruno Galasek-Hul has contributed to versions 1.4 &amp; 1.5 thanks to funding from the Mangalam Research Center for Buddhist Languages.</p>\n\n<p>Thanks to GRETIL, Dr. Vinita Tseng and Prof. Steinkellner for kindly giving their&nbsp;permission to include automatically processed versions of some of their editions&nbsp;in this corpus.</p>\n\n<p>&nbsp;</p>\n\n<p><strong>Changelog</strong></p>\n\n<p>version 1.5&nbsp;adds more&nbsp;Buddhist texts, removes the reference corpus&nbsp;and improves segmentation</p>\n\n<p>version 1.4 adds 59 Buddhist texts and fixes some recurrent segmentation errors</p>\n\n<p>version 1.4.1 corrects some spacing and sentence parsing errors</p>", 
  "author": [
    {
      "family": "Ligeia Lugli"
    }
  ], 
  "note": "Also included: bibliography cum metadata summary and a sketch grammar + corpus configuration file for use in Sketch Engine", 
  "version": "1.5", 
  "type": "article", 
  "id": "3903262"
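
The abstract above describes the segmented configuration as using dash-separated words and the stemmed configuration as marking compounds with an @ sign. A minimal sketch of how such lines might be tokenised follows. It assumes that words are separated by whitespace or dashes and that compound members within a token are joined by "@"; the function names and the sample strings are hypothetical and are not taken from the corpus or from the segmenter documentation, which should be consulted for the actual conventions.

```python
import re


def split_segmented(line: str) -> list[str]:
    """Split a line into words on whitespace and dashes.

    Assumption: dashes and spaces are the only word separators.
    """
    return [tok for tok in re.split(r"[\s\-]+", line.strip()) if tok]


def split_compound(token: str) -> list[str]:
    """Split one token into compound members on '@'.

    Assumption: '@' appears only as a compound-member separator.
    """
    return [part for part in token.split("@") if part]


# Hypothetical sample strings, invented for illustration only.
words = split_segmented("evam-maya srutam")
members = split_compound("bodhi@sattva")
```

Keeping the two splits in separate functions lets the same word tokeniser be reused for all three configurations, with compound splitting applied only to the stemmed ones.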
All versions This version
Views 24086
Downloads 12526
Data volume 683.7 MB 147.6 MB
Unique views 21681
Unique downloads 5314