Other Open Access

segmented Sanskrit corpus (proof of concept)

Ligeia Lugli

JSON-LD (schema.org) Export

{
  "inLanguage": {
    "alternateName": "san",
    "@type": "Language",
    "name": "Sanskrit"
  },
  "description": "<p>This is a proof-of-concept Sanskrit corpus developed for the study of Buddhist Sanskrit lexicology.</p>\n\n<p>It comprises:</p>\n\n<ul>\n\t<li>&nbsp;172&nbsp;metadata-enriched Buddhist&nbsp;Sanskrit texts for a total of ~5&nbsp;million words. The corpus contains all Mah\u0101y\u0101na and &#39;mainstream&#39; Buddhist texts based on Sanskrit editions available on GRETIL (reconstructed editions based on Tibetan translations have been filtered out).</li>\n</ul>\n\n<p>The corpus is in romanised Sanskrit (UTF-8 encoding) and is available in three&nbsp;configurations:</p>\n\n<ol>\n\t<li>&nbsp;segmented (with dash-separated words)</li>\n\t<li>&nbsp;segmented and stemmed (with capitalised word stems and compounds separated by an @ sign)</li>\n\t<li>segmented, stemmed and normalised (normalisation treats some spelling variation and&nbsp;resolves sandhi of stems&#39; initials in most cases), recommended for Word Sketches</li>\n</ol>\n\n<p>The last version can be used to generate word sketches&nbsp;in Sketch Engine if used in&nbsp;conjunction with the included sketch grammar, which&nbsp;infers likely syntactic dependencies from morphological cues.</p>\n\n<p><strong><em>avagraha</em> has been replaced with <em>a</em></strong> in the stemmed versions.</p>\n\n<p><strong>Limitations</strong><br>\nAs a proof of concept, this corpus suffers from several limitations. It is very small by contemporary standards, it has not been proof-read and it is currently only segmented and stemmed (not lemmatised or PoS tagged).&nbsp;<br>\nA funding bid has been submitted to expand and lemmatise the corpus.</p>\n\n<p><strong>Data Quality</strong><br>\nThe corpus has been segmented with Lugli&#39;s Sanskrit segmenter (10.5281/zenodo.3459215).&nbsp;The accuracy of this segmenter has been evaluated at 97% on a sample of Buddhist Sanskrit literature.</p>\n\n<p>Please refer to the segmenter documentation stored at&nbsp;10.5281/zenodo.3459215 for details on evaluation and stemming conventions.</p>\n\n<p><strong>Acknowledgments</strong><br>\nThe corpus has been realised as part of the project &#39;Lexis and Tradition: variation in the vocabulary of Sanskrit Mah\u0101y\u0101na literature&#39;. This project was&nbsp;funded by the British Academy through a Newton International Fellowship (NF161436) and hosted at the Department of Theology and Religious Studies at King&#39;s College London&nbsp;under the supervision of Prof. Henrietta Kate Crosby.&nbsp;</p>\n\n<p>Dr. Bruno Galasek-Hul has contributed to versions 1.4 &amp; 1.5 thanks to funding from the Mangalam Research Center for Buddhist Languages.</p>\n\n<p>Thanks to GRETIL, Dr. Vinita Tseng and Prof. Steinkellner for kindly giving their&nbsp;permission to include automatically processed versions of some of their editions&nbsp;in this corpus.</p>\n\n<p><strong>Changelog</strong></p>\n\n<p>version 1.5&nbsp;adds more&nbsp;Buddhist texts, removes the reference corpus&nbsp;and improves segmentation</p>\n\n<p>version 1.4.1 corrects some spacing and sentence parsing errors</p>\n\n<p>version 1.4 adds 59 Buddhist texts and fixes some recurrent segmentation errors</p>",
  "license": "https://creativecommons.org/licenses/by/4.0/legalcode",
  "creator": [
    {
      "affiliation": "Mangalam Research Center",
      "@id": "https://orcid.org/0000-0003-0473-4290",
      "@type": "Person",
      "name": "Ligeia Lugli"
    }
  ],
  "url": "https://zenodo.org/record/3903262",
  "datePublished": "2019-09-23",
  "keywords": [
    "Buddhist Sanskrit"
  ],
  "version": "1.5",
  "contributor": [
    {
      "@type": "Person",
      "name": "Bruno Galasek-Hul"
    }
  ],
  "@context": "https://schema.org/",
  "identifier": "https://doi.org/10.5281/zenodo.3903262",
  "@id": "https://doi.org/10.5281/zenodo.3903262",
  "@type": "CreativeWork",
  "name": "segmented Sanskrit corpus (proof of concept)"
}
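
For readers who want to use this export programmatically, the following is a minimal sketch of pulling citation-relevant fields out of a schema.org record with Python's standard `json` module. The `record_text` literal is an abbreviated copy of the record above, and the `summarise` helper is a hypothetical name introduced here for illustration; neither is part of Zenodo's tooling.

```python
import json

# Abbreviated copy of the record above (assumption: only a few fields are
# reproduced; the full export also carries description, license, etc.).
record_text = """
{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "name": "segmented Sanskrit corpus (proof of concept)",
  "version": "1.5",
  "identifier": "https://doi.org/10.5281/zenodo.3903262",
  "inLanguage": {"@type": "Language", "alternateName": "san", "name": "Sanskrit"},
  "creator": [{"@type": "Person", "name": "Ligeia Lugli"}],
  "keywords": ["Buddhist Sanskrit"]
}
"""

DOI_PREFIX = "https://doi.org/"


def summarise(record: dict) -> dict:
    """Collect a few citation-relevant fields (hypothetical helper)."""
    identifier = record.get("identifier", "")
    # Strip the resolver prefix so only the bare DOI remains.
    if identifier.startswith(DOI_PREFIX):
        identifier = identifier[len(DOI_PREFIX):]
    return {
        "title": record.get("name"),
        "version": record.get("version"),
        "doi": identifier,
        "language": record.get("inLanguage", {}).get("alternateName"),
        "creators": [person["name"] for person in record.get("creator", [])],
    }


summary = summarise(json.loads(record_text))
print(summary)
```

The sketch treats the record as plain JSON rather than expanding it with a JSON-LD processor, which is sufficient for reading flat fields such as `name` and `version`.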
