Other Open Access

segmented Sanskrit corpus (proof of concept)

Ligeia Lugli


MARC21 XML Export

<?xml version='1.0' encoding='UTF-8'?>
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam##2200000uu#4500</leader>
  <datafield tag="041" ind1=" " ind2=" ">
    <subfield code="a">san</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">corpus</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Sanskrit</subfield>
  </datafield>
  <datafield tag="653" ind1=" " ind2=" ">
    <subfield code="a">Buddhist Sanskrit</subfield>
  </datafield>
  <controlfield tag="005">20200919094034.0</controlfield>
  <datafield tag="500" ind1=" " ind2=" ">
    <subfield code="a">Also included: bibliography cum metadata summary and a sketch grammar + corpus configuration file for use in Sketch Engine</subfield>
  </datafield>
  <controlfield tag="001">3903262</controlfield>
  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="4">rtm</subfield>
    <subfield code="a">Bruno Galasek-Hul</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">15085</subfield>
    <subfield code="z">md5:c58791f712ac0174c2363523179d12f7</subfield>
    <subfield code="u">https://zenodo.org/record/3903262/files/Lugli2019_BuddhSktSketchGrammar.txt</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">50788</subfield>
    <subfield code="z">md5:aa2c9f2071623329468796ac90f39ce0</subfield>
    <subfield code="u">https://zenodo.org/record/3903262/files/Lugli_BuddhistSanskritCorpusMetadata2020-06-22.csv</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">12660562</subfield>
    <subfield code="z">md5:43e8793746f43c4d86be12936c0d0c9c</subfield>
    <subfield code="u">https://zenodo.org/record/3903262/files/Lugli_BuddhistSanskritCorpusSegmented_v1_5.zip</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">13768128</subfield>
    <subfield code="z">md5:e63c00b914b6d5f62db1829ea39d4be5</subfield>
    <subfield code="u">https://zenodo.org/record/3903262/files/Lugli_BuddhistSanskritCorpusStemmedNormalisedForGramrels_v1_5.zip</subfield>
  </datafield>
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">13772551</subfield>
    <subfield code="z">md5:5aa6c8d2acabc99cc387a2a6c544514a</subfield>
    <subfield code="u">https://zenodo.org/record/3903262/files/Lugli_BuddhistSanskritCorpusStemmed_v1_5.zip</subfield>
  </datafield>
  <datafield tag="542" ind1=" " ind2=" ">
    <subfield code="l">open</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2019-09-23</subfield>
  </datafield>
  <datafield tag="909" ind1="C" ind2="O">
    <subfield code="o">oai:zenodo.org:3903262</subfield>
  </datafield>
  <datafield tag="100" ind1=" " ind2=" ">
    <subfield code="u">Mangalam Research Center</subfield>
    <subfield code="0">(orcid)0000-0003-0473-4290</subfield>
    <subfield code="a">Ligeia Lugli</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">segmented Sanskrit corpus (proof of concept)</subfield>
  </datafield>
  <datafield tag="540" ind1=" " ind2=" ">
    <subfield code="u">https://creativecommons.org/licenses/by/4.0/legalcode</subfield>
    <subfield code="a">Creative Commons Attribution 4.0 International</subfield>
  </datafield>
  <datafield tag="650" ind1="1" ind2="7">
    <subfield code="a">cc-by</subfield>
    <subfield code="2">opendefinition.org</subfield>
  </datafield>
  <datafield tag="520" ind1=" " ind2=" ">
    <subfield code="a">&lt;p&gt;This is a proof-of-concept Sanskrit corpus developed for the study of Buddhist Sanskrit lexicology.&lt;/p&gt;

&lt;p&gt;It comprises:&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;&amp;nbsp;172&amp;nbsp;metadata-enriched Buddhist&amp;nbsp;Sanskrit texts for a total of ~ 5&amp;nbsp;million words. The corpus contains all Mahāyāna and &amp;#39;mainstream&amp;#39; Buddhist texts available on GRETIL that are based on Sanskrit editions (reconstructed editions based on Tibetan translations have been filtered out).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The corpus is in romanised Sanskrit (UTF-8 encoding) and is available in three&amp;nbsp;configurations:&lt;/p&gt;

&lt;ol&gt;
	&lt;li&gt;&amp;nbsp;segmented (with dash-separated words)&lt;/li&gt;
	&lt;li&gt;&amp;nbsp;segmented and stemmed (with capitalised word stems and compounds separated by an @ sign).&lt;/li&gt;
	&lt;li&gt;segmented, stemmed and normalised (normalisation handles some spelling variation and&amp;nbsp;resolves sandhi of stems&amp;#39; initials in most cases), recommended for Word Sketches.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The last of these versions can be used to generate word sketches&amp;nbsp;in Sketch Engine if used in&amp;nbsp;conjunction with the included sketch grammar, which&amp;nbsp;infers likely syntactic dependencies from morphological cues.&lt;/p&gt;

&lt;p&gt;Note: &lt;em&gt;avagraha&lt;/em&gt; has been replaced with &lt;em&gt;a&lt;/em&gt; in the stemmed versions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;&lt;br&gt;
As a proof of concept, this corpus suffers from several limitations. It is very small by contemporary standards, has not been proofread, and is currently only segmented and stemmed (not lemmatised or PoS tagged).&amp;nbsp;&lt;br&gt;
A funding bid has been submitted to expand and lemmatise the corpus.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Quality&lt;/strong&gt;&lt;br&gt;
The corpus has been segmented with Lugli&amp;#39;s Sanskrit segmenter (10.5281/zenodo.3459215).&amp;nbsp;The accuracy of this segmenter has been evaluated at 97% on a sample of Buddhist Sanskrit literature.&lt;/p&gt;

&lt;p&gt;Please refer to the segmenter documentation stored at&amp;nbsp;10.5281/zenodo.3459215 for details on evaluation and stemming conventions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Acknowledgments&lt;/strong&gt;&lt;br&gt;
The corpus has been realised as part of the project &amp;#39;Lexis and Tradition: variation in the vocabulary of Sanskrit Mahāyāna literature&amp;#39;. This project was&amp;nbsp;funded by the British Academy through a Newton International Fellowship (NF161436) and hosted at the Department of Theology and Religious Studies at King&amp;#39;s College London&amp;nbsp;under the supervision of Prof. Henrietta Kate Crosby.&amp;nbsp;&lt;/p&gt;

&lt;p&gt;Dr. Bruno Galasek-Hul has contributed to versions 1.4 &amp;amp; 1.5 thanks to funding from the Mangalam Research Center for Buddhist Languages.&lt;/p&gt;

&lt;p&gt;Thanks to GRETIL, Dr. Vinita Tseng and Prof. Steinkellner for kindly giving their&amp;nbsp;permission to include automatically processed versions of some of their editions&amp;nbsp;in this corpus.&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Changelog&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;version 1.5&amp;nbsp;adds more&amp;nbsp;Buddhist texts, removes the reference corpus&amp;nbsp;and improves segmentation&lt;/p&gt;

&lt;p&gt;version 1.4 adds 59 Buddhist texts and fixes some recurrent segmentation errors&lt;/p&gt;

&lt;p&gt;version 1.4.1 corrects some spacing and sentence parsing errors&lt;/p&gt;</subfield>
  </datafield>
  <datafield tag="773" ind1=" " ind2=" ">
    <subfield code="n">doi</subfield>
    <subfield code="i">isVersionOf</subfield>
    <subfield code="a">10.5281/zenodo.3457821</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.3903262</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
  <datafield tag="980" ind1=" " ind2=" ">
    <subfield code="a">other</subfield>
  </datafield>
</record>
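
The file entries in the export above (the `856` fields, whose `s`, `z` and `u` subfields carry the size in bytes, the MD5 checksum and the download URL in this record) can be read programmatically. The following is a minimal sketch using only the Python standard library, not an official Zenodo client. The helper names `subfields`, `first_value`, `list_files` and `md5_matches` are hypothetical, and `SAMPLE_RECORD` is an abbreviated copy of the record rather than the full export.

```python
import hashlib
import xml.etree.ElementTree as ET

# Namespace declared on the <record> element of the MARC21 "slim" export.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Abbreviated copy of the record above: one file entry, the title and the DOI.
SAMPLE_RECORD = """
<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="856" ind1="4" ind2=" ">
    <subfield code="s">15085</subfield>
    <subfield code="z">md5:c58791f712ac0174c2363523179d12f7</subfield>
    <subfield code="u">https://zenodo.org/record/3903262/files/Lugli2019_BuddhSktSketchGrammar.txt</subfield>
  </datafield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">segmented Sanskrit corpus (proof of concept)</subfield>
  </datafield>
  <datafield tag="024" ind1=" " ind2=" ">
    <subfield code="a">10.5281/zenodo.3903262</subfield>
    <subfield code="2">doi</subfield>
  </datafield>
</record>
"""


def subfields(field):
    """Map each subfield code of a datafield to its text content."""
    return {
        sf.get("code"): (sf.text or "")
        for sf in field.findall("marc:subfield", MARC_NS)
    }


def first_value(record, tag, code):
    """Return one subfield of the first datafield with the given tag, or None."""
    field = record.find(f"marc:datafield[@tag='{tag}']", MARC_NS)
    return subfields(field).get(code) if field is not None else None


def list_files(record):
    """Collect URL, size in bytes and MD5 checksum from every 856 field."""
    files = []
    for field in record.findall("marc:datafield[@tag='856']", MARC_NS):
        subs = subfields(field)
        files.append({
            "url": subs.get("u"),
            "size": int(subs["s"]),
            # The export prefixes the digest with "md5:"; keep only the hex part.
            "md5": subs["z"].split(":", 1)[-1],
        })
    return files


def md5_matches(data, expected_hex):
    """Check downloaded bytes against the checksum listed in the record."""
    return hashlib.md5(data).hexdigest() == expected_hex.lower()


record = ET.fromstring(SAMPLE_RECORD)
print(first_value(record, "245", "a"), first_value(record, "024", "a"))
for entry in list_files(record):
    print(entry["url"], entry["size"], entry["md5"])
```

To verify a download, the bytes of the fetched file would be passed to `md5_matches` together with the `md5` value from the matching entry; for the multi-megabyte zip archives, hashing in chunks would be gentler on memory than reading the whole file at once.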