There is a newer version of this record available.

Other Open Access

segmented Sanskrit corpus (proof of concept)

Ligeia Lugli

This is a proof-of-concept Sanskrit corpus developed for the study of Buddhist Sanskrit lexicology.

It comprises:

  •  73 metadata-enriched Buddhist Sanskrit texts, totalling ~3 million tokens
  •  a 4-million-token reference corpus comprising 30 metadata-enriched non-Buddhist Sanskrit texts

The corpus is in romanised Sanskrit (UTF-8 encoding) and is available in three configurations:

  1.  segmented (with dash-separated words)
  2.  segmented and stemmed (with capitalised word stems and compound members separated by an @ sign)
  3.  segmented, stemmed and normalised (normalisation handles some spelling variation and resolves sandhi on stem-initial sounds in most cases); recommended for Word Sketches

The third configuration can be used to generate word sketches in Sketch Engine when combined with the included sketch grammar, which infers likely syntactic dependencies from morphological cues.
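
As a rough illustration of how the delimiters described above might be handled programmatically, the sketch below splits a line into words on dashes and each word into compound members on the @ sign. The sample string and the function names are made up for this example and are not taken from the corpus files; whitespace is also treated as a word boundary, which is an assumption, and capitalisation of stems is left untouched.

```python
import re


def split_words(line):
    """Split a segmented line into words on dashes (and, by assumption, whitespace)."""
    return [w for w in re.split(r"[-\s]+", line.strip()) if w]


def split_compound(word):
    """Split a word into its compound members on the @ sign."""
    return [m for m in word.split("@") if m]


def parse_line(line):
    """Return one list of compound members per word in the line."""
    return [split_compound(w) for w in split_words(line)]


if __name__ == "__main__":
    # Hypothetical sample, invented for illustration only.
    sample = "bodhi@sattva-dharma-deśanā"
    for members in parse_line(sample):
        print(members)
```

Words without an @ sign come back as single-member lists, so simple words and compounds can be processed uniformly downstream.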

Limitations
As a proof of concept, this corpus has several limitations. It is very small by contemporary standards, it has not been proofread, and it has been segmented, stemmed and partially normalised, but not lemmatised or PoS-tagged.
A funding bid has been submitted to expand and lemmatise the corpus.

Data Quality
The corpus has been segmented with Lugli's Sanskrit segmenter (10.5281/zenodo.3459215). The accuracy of this segmenter has been evaluated at 97% on a sample of Buddhist Sanskrit literature. No evaluation has been performed on non-Buddhist materials and the quality of the segmentation may be worse in the non-Buddhist section of the corpus.

Acknowledgments
The corpus was created as part of the project 'Lexis and Tradition: variation in the vocabulary of Sanskrit Mahāyāna literature'. This project was funded by the British Academy through a Newton International Fellowship (NF161436) and hosted at the Department of Theology and Religious Studies at King's College London under the supervision of Prof. Henrietta Kate Crosby.

Thanks to GRETIL, Dr. Vinita Tseng and Prof. Steinkellner for kindly giving their permission to include automatically processed versions of some of their editions in this corpus.

Also included: a combined bibliography and metadata summary, plus a sketch grammar and a corpus configuration file for use in Sketch Engine.

Files (39.5 MB)

  •  CorpusConfigurationFile_SketchEngine_Skt_Lugli.txt (2.6 kB), md5:aef5a40e3f0c8e3f68b944cb312f3d77
  •  Lugli_Metadata_SegmentedSanskritCorpusUpdatedV1_1.txt (30.9 kB), md5:6116763f85c4c13256c5101add2e4b2e
  •  Lugli_Metadata_SegmentedSanskritCorpusUpdatedV1_1_2.txt (30.9 kB), md5:6116763f85c4c13256c5101add2e4b2e
  •  Lugli_SegmentedSanskritCorpusV1_1_2.zip (12.4 MB), md5:7d900195dfb80579b94df85826760a8b
  •  Lugli_SegmentedSanskritSketchGrammarV1_1_2.txt (14.8 kB), md5:841d044b145dc43e952828ce686e5892
  •  Lugli_SegmentedStemmedNormalizedSanskritCorpus_V1_1_2.zip (13.5 MB), md5:3246a317cfa651fc676549fbf1f91224
  •  LugliSegmentedStemmedSanskritCorpus_V1_1_2.zip (13.5 MB), md5:0072b162e7334aa38775c735ad8d773b
  •  SegmentedSanskritCorpusMetadataCumBibliography.csv (39.8 kB), md5:6dcc0fcf94acc3497d48afb4e5fa6696
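
The md5 values listed with each file can be used to check that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `verify` are introduced here for illustration, and the example assumes the file has been saved in the current directory under its listed name.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Compare a file's MD5 with an expected value such as 'md5:841d...'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


if __name__ == "__main__":
    # File name and checksum as listed on this record; the local path is an assumption.
    name = "Lugli_SegmentedSanskritSketchGrammarV1_1_2.txt"
    listed = "md5:841d044b145dc43e952828ce686e5892"
    if os.path.exists(name):
        print("checksum matches" if verify(name, listed) else "checksum mismatch")
    else:
        print("file not found:", name)
```

A mismatch usually points to an incomplete or corrupted download, in which case the file should be fetched again.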