Other Open Access

segmented Sanskrit corpus (proof of concept)

Ligeia Lugli


JSON Export

{
  "files": [
    {
      "links": {
        "self": "https://zenodo.org/api/files/4d199c3b-f1c6-45ca-91bc-77319618034d/Lugli2019_BuddhSktSketchGrammar.txt"
      }, 
      "checksum": "md5:c58791f712ac0174c2363523179d12f7", 
      "bucket": "4d199c3b-f1c6-45ca-91bc-77319618034d", 
      "key": "Lugli2019_BuddhSktSketchGrammar.txt", 
      "type": "txt", 
      "size": 15085
    }, 
    {
      "links": {
        "self": "https://zenodo.org/api/files/4d199c3b-f1c6-45ca-91bc-77319618034d/Lugli_BuddhistSanskritCorpusMetadata2020-06-22.csv"
      }, 
      "checksum": "md5:aa2c9f2071623329468796ac90f39ce0", 
      "bucket": "4d199c3b-f1c6-45ca-91bc-77319618034d", 
      "key": "Lugli_BuddhistSanskritCorpusMetadata2020-06-22.csv", 
      "type": "csv", 
      "size": 50788
    }, 
    {
      "links": {
        "self": "https://zenodo.org/api/files/4d199c3b-f1c6-45ca-91bc-77319618034d/Lugli_BuddhistSanskritCorpusSegmented_v1_5.zip"
      }, 
      "checksum": "md5:43e8793746f43c4d86be12936c0d0c9c", 
      "bucket": "4d199c3b-f1c6-45ca-91bc-77319618034d", 
      "key": "Lugli_BuddhistSanskritCorpusSegmented_v1_5.zip", 
      "type": "zip", 
      "size": 12660562
    }, 
    {
      "links": {
        "self": "https://zenodo.org/api/files/4d199c3b-f1c6-45ca-91bc-77319618034d/Lugli_BuddhistSanskritCorpusStemmedNormalisedForGramrels_v1_5.zip"
      }, 
      "checksum": "md5:e63c00b914b6d5f62db1829ea39d4be5", 
      "bucket": "4d199c3b-f1c6-45ca-91bc-77319618034d", 
      "key": "Lugli_BuddhistSanskritCorpusStemmedNormalisedForGramrels_v1_5.zip", 
      "type": "zip", 
      "size": 13768128
    }, 
    {
      "links": {
        "self": "https://zenodo.org/api/files/4d199c3b-f1c6-45ca-91bc-77319618034d/Lugli_BuddhistSanskritCorpusStemmed_v1_5.zip"
      }, 
      "checksum": "md5:5aa6c8d2acabc99cc387a2a6c544514a", 
      "bucket": "4d199c3b-f1c6-45ca-91bc-77319618034d", 
      "key": "Lugli_BuddhistSanskritCorpusStemmed_v1_5.zip", 
      "type": "zip", 
      "size": 13772551
    }
  ], 
  "owners": [
    76604
  ], 
  "doi": "10.5281/zenodo.3903262", 
  "stats": {
    "version_unique_downloads": 53.0, 
    "unique_views": 81.0, 
    "views": 86.0, 
    "version_views": 240.0, 
    "unique_downloads": 14.0, 
    "version_unique_views": 216.0, 
    "volume": 147620141.0, 
    "version_downloads": 125.0, 
    "downloads": 26.0, 
    "version_volume": 683694980.0
  }, 
  "links": {
    "doi": "https://doi.org/10.5281/zenodo.3903262", 
    "conceptdoi": "https://doi.org/10.5281/zenodo.3457821", 
    "bucket": "https://zenodo.org/api/files/4d199c3b-f1c6-45ca-91bc-77319618034d", 
    "conceptbadge": "https://zenodo.org/badge/doi/10.5281/zenodo.3457821.svg", 
    "html": "https://zenodo.org/record/3903262", 
    "latest_html": "https://zenodo.org/record/3903262", 
    "badge": "https://zenodo.org/badge/doi/10.5281/zenodo.3903262.svg", 
    "latest": "https://zenodo.org/api/records/3903262"
  }, 
  "conceptdoi": "10.5281/zenodo.3457821", 
  "created": "2020-06-22T11:47:44.398113+00:00", 
  "updated": "2020-09-19T09:40:34.439990+00:00", 
  "conceptrecid": "3457821", 
  "revision": 7, 
  "id": 3903262, 
  "metadata": {
    "access_right_category": "success", 
    "doi": "10.5281/zenodo.3903262", 
    "description": "<p>This is a proof-of-concept Sanskrit corpus developed for the study of Buddhist Sanskrit lexicology.</p>\n\n<p>It comprises:</p>\n\n<ul>\n\t<li>&nbsp;172&nbsp;metadata-enriched Buddhist&nbsp;Sanskrit texts for a total of ~ 5&nbsp;million words. The corpus contains all Mah\u0101y\u0101na and &#39;mainstream&#39; Buddhist texts available on GRETIL that are based on Sanskrit editions (reconstructed editions based on Tibetan translations have been filtered out).</li>\n</ul>\n\n<p>The corpus is in romanised Sanskrit (UTF-8 encoding) and is available in three&nbsp;configurations:</p>\n\n<ol>\n\t<li>&nbsp;segmented (with dash-separated words)</li>\n\t<li>&nbsp;segmented and stemmed (with capitalised word stems and compounds separated by an @ sign).</li>\n\t<li>segmented, stemmed and normalised (normalisation treats some spelling variation and&nbsp;resolves sandhi of stems&#39; initials in most cases), recommended for Word Sketches.</li>\n</ol>\n\n<p>The last version can be used to generate word sketches&nbsp;in Sketch Engine if used in&nbsp;conjunction with the included sketch grammar, which&nbsp;infers likely syntactic dependencies from morphological cues.</p>\n\n<p><strong><em>avagraha</em> has been replaced with <em>a</em></strong> in the stemmed versions</p>\n\n<p><strong>Limitations</strong><br>\nAs a proof of concept, this corpus suffers from several limitations. It is very small by contemporary standards, it has not been proof-read and it is currently only segmented and stemmed (not lemmatised or PoS tagged).&nbsp;<br>\nA funding bid has been submitted to expand and lemmatise the corpus.</p>\n\n<p><strong>Data Quality</strong><br>\nThe corpus has been segmented with Lugli&#39;s Sanskrit segmenter (10.5281/zenodo.3459215).&nbsp;The accuracy of this segmenter has been evaluated at 97% on a sample of Buddhist Sanskrit literature.</p>\n\n<p>Please refer to the segmenter documentation stored at&nbsp;10.5281/zenodo.3459215 for details on evaluation and stemming conventions.</p>\n\n<p><strong>Acknowledgments</strong><br>\nThe corpus has been realised as part of the project &#39;Lexis and Tradition: variation in the vocabulary of Sanskrit Mah\u0101y\u0101na literature&#39;. This project was&nbsp;funded by the British Academy through a Newton International Fellowship (NF161436) and hosted at the Department of Theology and Religious Studies at King&#39;s College London&nbsp;under the supervision of Prof. Henrietta Kate Crosby.&nbsp;</p>\n\n<p>Dr. Bruno Galasek-Hul has contributed to versions 1.4 &amp; 1.5 thanks to funding from the Mangalam Research Center for Buddhist Languages.</p>\n\n<p>Thanks to GRETIL, Dr. Vinita Tseng and Prof. Steinkellner for kindly giving their&nbsp;permission to include automatically processed versions of some of their editions&nbsp;in this corpus.</p>\n\n<p>&nbsp;</p>\n\n<p><strong>Changelog</strong></p>\n\n<p>version 1.5&nbsp;adds more&nbsp;Buddhist texts, removes the reference corpus&nbsp;and improves segmentation</p>\n\n<p>version 1.4 adds 59 Buddhist texts and fixes some recurrent segmentation errors</p>\n\n<p>version 1.4.1 corrects some spacing and sentence parsing errors</p>", 
    "contributors": [
      {
        "type": "ProjectMember", 
        "name": "Bruno Galasek-Hul"
      }
    ], 
    "title": "segmented Sanskrit corpus (proof of concept)", 
    "license": {
      "id": "CC-BY-4.0"
    }, 
    "notes": "Also included: bibliography cum metadata summary and a sketch grammar + corpus configuration file for use in Sketch Engine", 
    "relations": {
      "version": [
        {
          "count": 7, 
          "index": 6, 
          "parent": {
            "pid_type": "recid", 
            "pid_value": "3457821"
          }, 
          "is_last": true, 
          "last_child": {
            "pid_type": "recid", 
            "pid_value": "3903262"
          }
        }
      ]
    }, 
    "language": "san", 
    "version": "1.5", 
    "keywords": [
      "corpus", 
      "Sanskrit", 
      "Buddhist Sanskrit"
    ], 
    "publication_date": "2019-09-23", 
    "creators": [
      {
        "orcid": "0000-0003-0473-4290", 
        "affiliation": "Mangalam Research Center", 
        "name": "Ligeia Lugli"
      }
    ], 
    "access_right": "open", 
    "resource_type": {
      "type": "other", 
      "title": "Other"
    }, 
    "related_identifiers": [
      {
        "scheme": "doi", 
        "identifier": "10.5281/zenodo.3457821", 
        "relation": "isVersionOf"
      }
    ]
  }
}
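
Each entry under "files" in the export above carries a "key" (the file name) and a "checksum" of the form "md5:<hex digest>". A minimal sketch of how downloaded copies could be checked against those values follows. The function names `md5_of` and `verify_downloads` and the idea of a local download directory are assumptions made for illustration; they are not part of the record or of any Zenodo tooling.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(record, download_dir):
    """Compare local files with the checksums listed in a parsed record.

    `record` is the parsed JSON export (a dict with a "files" list).
    Returns a dict mapping each file key to:
      True  -> the local file matches the recorded MD5
      False -> the local file exists but its MD5 differs
      None  -> the file is missing locally, or the checksum is not MD5
    """
    results = {}
    for entry in record["files"]:
        # Checksums are stored as "<algorithm>:<hex digest>".
        algorithm, _, expected = entry["checksum"].partition(":")
        local = Path(download_dir) / entry["key"]
        if algorithm != "md5" or not local.is_file():
            results[entry["key"]] = None
        else:
            results[entry["key"]] = md5_of(local) == expected
    return results
```

To use it, one would parse a saved copy of the export with `json.load` and pass the resulting dict, together with the directory holding the downloaded files, to `verify_downloads`. Reading in chunks keeps memory use flat for the larger zip archives.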