Other Open Access

segmented Sanskrit corpus (proof of concept)

Ligeia Lugli


DataCite XML Export

<?xml version='1.0' encoding='utf-8'?>
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4.1/metadata.xsd">
  <identifier identifierType="DOI">10.5281/zenodo.3903262</identifier>
  <creators>
    <creator>
      <creatorName>Ligeia Lugli</creatorName>
      <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0003-0473-4290</nameIdentifier>
      <affiliation>Mangalam Research Center</affiliation>
    </creator>
  </creators>
  <titles>
    <title>segmented Sanskrit corpus (proof of concept)</title>
  </titles>
  <publisher>Zenodo</publisher>
  <publicationYear>2019</publicationYear>
  <subjects>
    <subject>corpus</subject>
    <subject>Sanskrit</subject>
    <subject>Buddhist Sanskrit</subject>
  </subjects>
  <dates>
    <date dateType="Issued">2019-09-23</date>
  </dates>
  <language>sa</language>
  <resourceType resourceTypeGeneral="Other"/>
  <alternateIdentifiers>
    <alternateIdentifier alternateIdentifierType="url">https://zenodo.org/record/3903262</alternateIdentifier>
  </alternateIdentifiers>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsVersionOf">10.5281/zenodo.3457821</relatedIdentifier>
  </relatedIdentifiers>
  <version>1.5</version>
  <rightsList>
    <rights rightsURI="https://creativecommons.org/licenses/by/4.0/legalcode">Creative Commons Attribution 4.0 International</rights>
    <rights rightsURI="info:eu-repo/semantics/openAccess">Open Access</rights>
  </rightsList>
  <descriptions>
    <description descriptionType="Abstract">&lt;p&gt;This is a proof-of-concept Sanskrit corpus developed for the study of Buddhist Sanskrit lexicology.&lt;/p&gt;

&lt;p&gt;It comprises:&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;&amp;nbsp;172&amp;nbsp;metadata-enriched Buddhist&amp;nbsp;Sanskrit texts, for a total of ~5&amp;nbsp;million words. The corpus contains all Mahāyāna and &amp;#39;mainstream&amp;#39; Buddhist texts available on GRETIL that are based on Sanskrit editions (reconstructed editions based on Tibetan translations have been filtered out).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The corpus is in romanised Sanskrit (UTF-8 encoding) and is available in three&amp;nbsp;configurations:&lt;/p&gt;

&lt;ol&gt;
	&lt;li&gt;&amp;nbsp;segmented (with dash-separated words)&lt;/li&gt;
	&lt;li&gt;&amp;nbsp;segmented and stemmed (with capitalised word stems and compounds separated by an @ sign).&lt;/li&gt;
	&lt;li&gt;segmented, stemmed and normalised (normalisation handles some spelling variation and&amp;nbsp;resolves sandhi of stems&amp;#39; initials in most cases), recommended for Word Sketches.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The last configuration can be used to generate word sketches&amp;nbsp;in Sketch Engine when used in&amp;nbsp;conjunction with the included sketch grammar, which&amp;nbsp;infers likely syntactic dependencies from morphological cues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;avagraha&lt;/em&gt; has been replaced with &lt;em&gt;a&lt;/em&gt;&lt;/strong&gt; in the stemmed versions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;&lt;br&gt;
As a proof of concept, this corpus suffers from several limitations. It is very small by contemporary standards, it has not been proof-read and it is currently only segmented and stemmed (not lemmatised or PoS tagged).&amp;nbsp;&lt;br&gt;
A funding bid has been submitted to expand and lemmatise the corpus.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Quality&lt;/strong&gt;&lt;br&gt;
The corpus has been segmented with Lugli&amp;#39;s Sanskrit segmenter (10.5281/zenodo.3459215).&amp;nbsp;The accuracy of this segmenter has been evaluated at 97% on a sample of Buddhist Sanskrit literature.&lt;/p&gt;

&lt;p&gt;Please refer to the segmenter documentation stored at&amp;nbsp;10.5281/zenodo.3459215 for details on evaluation and stemming conventions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Acknowledgments&lt;/strong&gt;&lt;br&gt;
The corpus has been realised as part of the project &amp;#39;Lexis and Tradition: variation in the vocabulary of Sanskrit Mahāyāna literature&amp;#39;. This project was&amp;nbsp;funded by the British Academy through a Newton International Fellowship (NF161436) and hosted at the Department of Theology and Religious Studies at King&amp;#39;s College London&amp;nbsp;under the supervision of Prof. Henrietta Kate Crosby.&amp;nbsp;&lt;/p&gt;

&lt;p&gt;Dr. Bruno Galasek-Hul has contributed to versions 1.4 &amp;amp; 1.5 thanks to funding from the Mangalam Research Center for Buddhist Languages.&lt;/p&gt;

&lt;p&gt;Thanks to GRETIL, Dr. Vinita Tseng and Prof. Steinkellner for kindly giving their&amp;nbsp;permission to include automatically processed versions of some of their editions&amp;nbsp;in this corpus.&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Changelog&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;version 1.5&amp;nbsp;adds more&amp;nbsp;Buddhist texts, removes the reference corpus&amp;nbsp;and improves segmentation&lt;/p&gt;

&lt;p&gt;version 1.4.1 corrects some spacing and sentence parsing errors&lt;/p&gt;

&lt;p&gt;version 1.4 adds 59 Buddhist texts and fixes some recurrent segmentation errors&lt;/p&gt;</description>
    <description descriptionType="Other">Also included: bibliography cum metadata summary and a sketch grammar + corpus configuration file for use in Sketch Engine</description>
  </descriptions>
</resource>
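
A record like the one above can be read programmatically. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` module and a trimmed copy of the export; the names `NS`, `SAMPLE` and `summarise` are hypothetical and are not part of the Zenodo or DataCite tooling. The namespace URI is the one declared on the `<resource>` element.

```python
import xml.etree.ElementTree as ET

# Namespace copied from the xmlns attribute of <resource>; the "dc" prefix
# is an arbitrary local alias chosen for this sketch.
NS = {"dc": "http://datacite.org/schema/kernel-4"}

# Trimmed excerpt of the export, kept only for illustration.
SAMPLE = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.5281/zenodo.3903262</identifier>
  <titles>
    <title>segmented Sanskrit corpus (proof of concept)</title>
  </titles>
  <subjects>
    <subject>corpus</subject>
    <subject>Sanskrit</subject>
    <subject>Buddhist Sanskrit</subject>
  </subjects>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsVersionOf">10.5281/zenodo.3457821</relatedIdentifier>
  </relatedIdentifiers>
  <version>1.5</version>
</resource>"""


def summarise(xml_text):
    """Return a few fields of a DataCite record as a plain dictionary."""
    root = ET.fromstring(xml_text)
    # The related identifier is optional, so guard against its absence.
    related = root.find("dc:relatedIdentifiers/dc:relatedIdentifier", NS)
    return {
        "doi": root.findtext("dc:identifier", namespaces=NS),
        "title": root.findtext("dc:titles/dc:title", namespaces=NS),
        "version": root.findtext("dc:version", namespaces=NS),
        "subjects": [s.text for s in root.findall("dc:subjects/dc:subject", NS)],
        "is_version_of": related.text if related is not None else None,
    }


if __name__ == "__main__":
    for key, value in summarise(SAMPLE).items():
        print(f"{key}: {value}")
```

Every element lookup has to carry the namespace prefix, because the record declares a default namespace; an unprefixed path such as `titles/title` would match nothing.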